How to choose an LLM in 2026: the definitive guide

By LLMTest Team · Apr 22, 2026 · 36 min read · guide · model-selection · fundamentals
On this page

  1. The wrong mental model vs the right one
  2. The 7-step framework
  3. Step 1. Define the job honestly
  4. The task taxonomy
  5. Why the category changes the answer
  6. The sharpening test
  7. Step 2. List your real constraints
  8. Budget per request
  9. Latency p95
  10. Context you actually need
  11. Privacy floor
  12. Regulatory
  13. Modalities and ecosystem
  14. The output of Step 2
  15. Step 3. Score on the five dimensions that actually matter
  16. 1. Quality
  17. 2. Cost
  18. 3. Speed
  19. 4. Reliability
  20. 5. Ecosystem
  21. Step 4. Map the models and shortlist
  22. The eight families
  23. Build the shortlist
  24. Worked example: the support-email classifier
  25. The "more than 4" problem
  26. The "zero survived" problem
  27. Step 5. Run the head-to-head
  28. Build the golden set
  29. Run the shortlist
  30. Grade blind
  31. The scorecard
  32. LLM-as-judge: when and when not
  33. Time budget
  34. Step 6. Route, don't pick
  35. The three routing patterns worth knowing
  36. What a quality gate looks like
  37. The cost math
  38. Step 7. Re-evaluate on a schedule
  39. Try it on your job
  40. Ten traps that waste months
  41. TL;DR cheat sheet
  42. FAQ
  43. Where LLMTest fits

Last updated: April 2026. Specific models named below (Claude Opus 4.7, GPT-5.4, Gemini 3.1 Ultra, DeepSeek V3.2, Llama 4.1) reflect the frontier at the time of writing. The framework doesn't age. The names do. When a new model drops, swap it into the same slot and re-run the same seven steps.

There is no best LLM. There's only the best LLM for this job, at this budget, under this latency budget, with this failure mode. Every time someone tweets "just use GPT-5" or "Claude is obviously better", they're answering a question you didn't ask.

This is the process we use. It's the process we've watched solopreneurs use to cut their bills 60% without losing quality. And it's the process that survives every model release. Seven steps. Two of them most people skip. All seven matter.

By the end you'll know how to pick a model for anything. A support-reply bot, a SQL generator, a code reviewer, an OCR pipeline. And you'll stop re-picking every time Twitter gets excited about a release.

The wrong mental model vs the right one

Most people picking an LLM picture it like this: a leaderboard, a winner, a loser, a decision.

That model breaks the moment you actually ship. A leaderboard winner on MMLU can be twice the price, three times the latency, and worse at your specific task than a mid-tier model you dismissed. "Best" is not a property of the model. It's a property of the match between a model and your job.

The right mental model is a funnel:

flowchart LR
    A[Your job] --> B[Your constraints]
    B --> C[5 dimensions<br/>that matter]
    C --> D[Shortlist<br/>2-4 models]
    D --> E[Head-to-head<br/>on your data]
    E --> F[Route in production<br/>with fallback]
    F --> G[Re-evaluate<br/>quarterly]
    G -. new release or regression .-> A

Seven steps, one loop. The loop matters. Picking an LLM is not a one-time decision. Models update in place, prices change, your traffic mix shifts, new frontier releases drop every six weeks. The teams that stay on the right model are the ones that re-run the loop. Not the ones who pick cleverly once.

The 7-step framework

Step 1. Define the job honestly. One sentence: what goes in, what comes out, what "good" looks like. "Chatbot" is not a job. "Rewrite a customer's support question into three FAQ candidates, ≤60 tokens each, in the user's language" is a job. The narrower the job, the better every downstream step works.

Step 2. List your real constraints. Budget per request (not per month). p95 latency ceiling (the speed 95% of your requests beat, the bar that shapes what a user actually feels). Context size you actually need. Privacy floor (can the data leave your region?). Regulatory requirements. These are hard filters. They eliminate candidates before quality enters the picture.

Step 3. Score on five dimensions. Quality, cost, speed, reliability, ecosystem. Not one dimension. Not three. Five. Each has a concrete measurement method, which we'll cover. Benchmarks cover at most one and a half of them.

Step 4. Map the models and shortlist. Know the eight major model families, their tiers, and their current frontier. Apply your constraints. Output: 2–4 candidates. If you have more than 4, your constraints aren't tight enough. If you have 0, something in your constraints is wrong. Two shortcuts for this step: the LLM API pricing comparison and the capabilities matrix put every major model in two scannable tables.

Step 5. Run your own head-to-head. Build a golden set of 20+ real prompts from your actual use case. Run each candidate. Grade blind with a rubric. Do not skip this step. Do not let a leaderboard skip this step for you. The 90 minutes this takes will save you 90 days of picking wrong.

Step 6. Route, don't pick. Ship two models, not one. Cheap model by default, escalate to the expensive one on soft failures (low confidence, schema violation, policy flag). Add a fallback for provider outages. One model in production is a single point of failure and a single price ceiling.

Step 7. Re-evaluate on a schedule. Quarterly at minimum. Re-run the golden set against current frontier. Watch for silent regressions on in-place model updates (yes, they happen). Adjust routing. This is the step that separates teams who bleed $40k/year on the wrong model from teams who don't.

That's the whole framework. The rest of this guide is how to do each step well, with diagrams, worked examples, and the traps we've personally stepped into so you don't have to.


Step 1. Define the job honestly

If you can't write your job in one sentence, you're not picking an LLM yet. You're still picking a product.

"Chatbot" is not a job. "Summarizer" is not a job. "AI assistant" is really not a job. These are categories of thousands of jobs, each with different optimal models. Until you've collapsed your problem into a sentence that names the input, the output, and the success criterion, every model is a plausible answer and none of them is right.

A real job statement looks like this:

  • Classify incoming support emails into one of 8 categories plus confidence score, ≤80 tokens out, <800ms p95.
  • Given a Postgres schema and a natural-language question, emit a valid read-only SQL query that runs without error on the first try 95% of the time.
  • Rewrite a product description in the user's chosen tone (playful / technical / minimal), preserving all facts, ≤200 tokens.

Notice what's in each one: a precise input, a precise output shape, a measurable success threshold. Notice what's missing: the word "intelligent", the word "smart", any reference to "understanding". LLMs don't understand. They pattern-match at a level high enough that, for a narrow enough job, it looks like understanding. Your job is to make the job narrow enough that the pattern-match is reliable.

The task taxonomy

Most jobs collapse into one of eight task types. Knowing which one you're in tells you which dimensions to weight heavily in step 3.

flowchart TD
    Root(["LLM jobs"])
    Root --> Extract["Extract / Classify<br/>─────────<br/>sentiment, intent, PII,<br/>invoice fields, entities"]
    Root --> Transform["Rewrite / Transform<br/>─────────<br/>summarize, translate,<br/>tone-shift, format"]
    Root --> Generate["Generate from scratch<br/>─────────<br/>marketing copy, emails,<br/>creative, scaffolding"]
    Root --> Reason["Multi-step reason<br/>─────────<br/>math, proofs, planning,<br/>root-cause analysis"]
    Extract ~~~ Tool["Tool use / agent<br/>─────────<br/>function calls,<br/>browser, shell, APIs"]
    Transform ~~~ Multi["Multimodal<br/>─────────<br/>image, audio, video,<br/>OCR, diagrams"]
    Generate ~~~ Retrieve["RAG over docs<br/>─────────<br/>answer + cite from<br/>private corpus"]
    Reason ~~~ Code["Code write / review<br/>─────────<br/>generate, review,<br/>diff, debug"]

Why the category changes the answer

Two examples, same user, wildly different picks.

Job A: classify 2M support emails/month into 8 buckets. This is pure Extract/Classify. It does not need reasoning. It does not need a 200k context window. It does not need personality. What it needs: cheap, fast, reliable, structured output. You want Haiku 4.5 or Gemini 2.5 Flash, not Opus 4.7. Picking Opus here burns 20× the budget for zero measurable gain.

Job B: write a root-cause analysis from three logs, a customer complaint, and the last 48h of deploys. This is Reason plus Retrieve. It needs a big context window, careful reasoning, and ideally explicit chain-of-thought. Opus 4.7 with thinking, GPT-5.4, or Gemini 3.1 Ultra earn their price tag here. Picking Haiku saves $0.04 per call and produces a confidently wrong answer 30% of the time. One bad RCA costs more than 1,000 good ones.

Same company, same engineer, same week. Two different models, because two different jobs. If you try to pick "a model for the company" you'll pick wrong for both.

The sharpening test

Take your job statement and ask: if I handed this to a junior engineer with no context, could they tell from the sentence alone whether the output is right or wrong?

  • "Summarize this document" fails. Right how? Length? Audience? Style?
  • "Summarize this document into 5 bullet points, each ≤15 words, covering the key actions a reader must take" passes. Either there are 5 bullets, each ≤15 words, each an action, or there aren't.

If your job fails the sharpening test, no model will reliably solve it. Not because the models are bad. Because the target is ambiguous and different calls will aim at different targets. Sharpen first, pick second.

Step 2. List your real constraints

Constraints eliminate models. That's their job. If you don't write them down, they silently eliminate models anyway, based on whichever one you saw on Twitter last.

There are six constraints worth writing down. Most of them have a number.

flowchart LR
    subgraph HARD[Hard constraints — eliminate models]
        direction TB
        C1[Budget per request<br/>$ per call]
        C2[Latency p95<br/>ms to first token + total]
        C3[Context needed<br/>tokens in + tokens out]
        C4[Privacy floor<br/>region, retention, training]
    end
    subgraph SOFT[Soft constraints — weight models]
        direction TB
        C5[Modalities<br/>text, image, audio, video]
        C6[Ecosystem<br/>JSON, tools, streaming, SDK]
    end
    HARD --> Filter[Candidate set]
    SOFT --> Filter

Budget per request

Not per month. Per request. The reason is arithmetic: a model that costs $0.002 per call and one that costs $0.03 per call are a 15× difference at scale, but you don't feel the blow-up until volume climbs. Decide the ceiling when your head is cool, not when the bill lands.

Rule of thumb for solopreneur AI features: if the user isn't paying you at least 10× the per-call cost, you don't have a business, you have a subsidy. A $0.03 call inside a free-tier product with 50 uses per signup is $1.50 of free compute per signup. Good luck.

Latency p95

"p95" is the 95th-percentile response time: the ceiling that 95% of your requests stay under. Put differently, only the slowest 5% of calls are allowed to be slower than your p95 number. It's the number that shapes what a user actually feels, because average latency hides the bad days. If your average is 1s but your p95 is 9s, one user in twenty is waiting nine seconds — and they're the ones who remember.

Two numbers to pin down: time to first token (TTFT) and total time. Chat-style UX cares about TTFT because the user sees streaming. Background pipelines care about total. For most solopreneur apps:

  • Interactive chat / autocomplete: TTFT <400ms, total <3s.
  • Form assistants / summarize-on-click: total <5s.
  • Async email/report generation: total <30s is fine.

If your UX budget is 2s total and your model takes 8s p95, that's an eliminated model, full stop, regardless of quality.
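
If you don't know your p95 today, it falls out of whatever request logs you already have. A minimal stdlib-only sketch; the sample latencies are invented:

from statistics import quantiles

latencies_ms = [420, 610, 380, 2900, 540, 710, 8600, 490, 650, 530]  # from your logs

# quantiles(n=100) returns the 99 percentile cut points; index 94 is p95
p95 = quantiles(latencies_ms, n=100)[94]
print(f"p95 total latency: {p95:.0f}ms")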

Context you actually need

Not context the model advertises. Context your prompt actually occupies at p95 input size. Measure it. Then add 30% for growth.

A model that claims 1M tokens but degrades badly past 200k is still a 200k model for your purposes. See the context window post for the full breakdown. Pick the smallest context window that fits your real p95 plus headroom. Bigger is not better. It's usually slower and often more expensive per token.
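
Measuring is a few lines. A sketch using tiktoken as a rough tokenizer proxy — the usage counts your provider returns on each response are the real ground truth, and load_recent_prompts is a hypothetical stand-in for your own log query:

from statistics import quantiles
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer, not provider-exact

prompts = load_recent_prompts()             # hypothetical: pull real prompts from logs
sizes = [len(enc.encode(p)) for p in prompts]

p95_tokens = quantiles(sizes, n=100)[94]
budget = int(p95_tokens * 1.3)              # the 30% growth headroom
print(f"p95 prompt size: {p95_tokens:.0f} tokens → budget for {budget}")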

Privacy floor

Four tiers, roughly:

  1. Public OK. No constraint. Any provider.
  2. User data, no training. Must disable training on your data (Anthropic, OpenAI, Google all offer this contractually).
  3. Region-locked. Must stay in EU/US/customer region. Narrows to providers with regional endpoints.
  4. Self-hosted only. Open-weights only: Llama, Mistral, DeepSeek, Qwen.

Skipping this step is the #1 cause of "we have to rip out the AI" six months in. Decide upfront.

Regulatory

HIPAA and its BAAs, GDPR, SOC 2, FedRAMP. Each knocks out some providers. Check the compliance page before you check the leaderboard.

Modalities and ecosystem

Does your job truly need image input, or is that a "nice to have"? Truly need audio? Native JSON mode? Tool use with parallel calls? Streaming? A Python SDK that doesn't require you to write five retry wrappers?

These are usually soft. They weight candidates rather than eliminate. But in specific jobs (anything with audio, for instance) a missing modality is a hard filter.

The output of Step 2

A six-line filled-in worksheet. Example for our support-email classifier:

Budget/request:  ≤ $0.002
Latency p95:     ≤ 800ms total
Context:         ≤ 4k in, ≤ 200 out
Privacy:         no-training, EU-region OK
Modalities:      text only
Ecosystem:       JSON mode required

With just those six lines, half the frontier market is eliminated before you look at quality. That's the point.

Step 3. Score on the five dimensions that actually matter

Quality. Cost. Speed. Reliability. Ecosystem. Five dimensions. Every serious LLM choice weighs all five. Most people weigh one (whichever one the last blog post they read was about) and wonder why their pick disappoints.

Here's the shape you're optimizing against. Two models, same job, very different profiles:

[Radar chart: Claude Opus 4.7 (frontier) vs Claude Haiku 4.5 (budget) across five axes — Quality, Cost (inverse), Speed, Reliability, Ecosystem.]

Opus wins quality and ecosystem. Haiku wins cost and speed. Both are roughly tied on reliability. Neither one wins overall. The winner depends on which axes your job weights.

Let's walk each dimension.

1. Quality

What it is: the probability your model returns an output that passes your job's success criterion. Not IQ. Not MMLU. Not "feels smart".

How to measure it: build a golden set (Step 5 covers this in detail), run each candidate, grade blind against a rubric, report a pass rate and a 1–5 quality score. That's it. Anything else is someone else's golden set on someone else's job.

The benchmark trap: public benchmarks (MMLU, GPQA, HumanEval, SWE-Bench) measure quality on their jobs, which correlate with yours by somewhere between 0.2 and 0.7. That's enough to narrow a shortlist, not enough to pick. If the benchmark winner is 3% better but 10× more expensive, that 3% had better be on your golden set, not theirs.

Rough quality tiers (April 2026):

  • Frontier: Opus 4.7, GPT-5.4, Gemini 3.1 Ultra. Differ by <5% on most jobs; pick by price/latency/ecosystem.
  • Mid-tier: Sonnet 4.6, GPT-5.4 mini, Gemini 3.1 Pro, DeepSeek V3.2. 10–25% quality gap below frontier on hard reasoning; close-to-indistinguishable on simple tasks.
  • Budget: Haiku 4.5, GPT-5 nano, Gemini 2.5 Flash, Llama 4.1 70B. Wide quality spread. Measure, don't assume.

2. Cost

What it is: the actual dollars per completed request in your real workload. Not $/1M tokens on the pricing page.

The four hidden multipliers (covered in full in the three hidden LLM costs):

  1. Thinking tokens on reasoning models. Often 5–10× the visible output.
  2. Retries on malformed JSON, tool calls, or refusals. Multiplies per-call cost by 1.05–1.5×.
  3. Prompt bloat. Six months of "never do X" lines that add 30% to every call forever.
  4. Cache misses. Provider caches only pay off on specific access patterns.

Quick real-bill formula:

cost_per_request ≈
  (prompt_tokens × input_price)
  + (prompt_tokens × input_price × retry_rate)
  + (output_tokens × output_price)
  + (thinking_tokens × output_price)
  − (cached_prompt_tokens × input_price × 0.9)

The last two terms are where sticker price diverges from real bill. A "$3 per million input" model can bill you $8 per million in practice, and a "$15 per million" model with 80% cache hit can bill you $4.
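
That arithmetic is worth scripting once. A direct translation of the formula above into Python — prices here are per token, so divide the pricing page's $/1M figures by 1e6; all token counts are per-request averages from your own logs:

def cost_per_request(prompt_tokens, output_tokens, thinking_tokens,
                     cached_tokens, input_price, output_price, retry_rate):
    base_in = prompt_tokens * input_price
    retries = base_in * retry_rate              # re-sent prompts on soft failures
    out = output_tokens * output_price
    thinking = thinking_tokens * output_price   # billed as output, never shown
    cache_credit = cached_tokens * input_price * 0.9
    return base_in + retries + out + thinking - cache_credit

# "$3/M input, $15/M output" model, 6k-token prompt, 300 tokens out,
# no thinking, 10% retry rate, no cache:
print(cost_per_request(6000, 300, 0, 0, 3 / 1e6, 15 / 1e6, 0.10))  # ≈ $0.0243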

Quality per dollar is the honest metric. Divide your measured quality score by measured $/request.

3. Speed

Two numbers, always:

  • Time to first token (TTFT): the user-visible latency for streaming UX.
  • Tokens per second (TPS): how fast the response streams.
  • Total = TTFT + (output_tokens / TPS).

What drives each:

  • TTFT is dominated by provider queue depth and model size.
  • TPS is dominated by model architecture (MoE models often stream faster; reasoning models stream slower overall because thinking blocks come first).

April 2026 ballparks: Haiku 4.5 streams around 180 TPS with 350ms TTFT. GPT-5.4 frontier around 85 TPS with 500ms TTFT. DeepSeek V3.2 around 60 TPS with 1100ms TTFT (but 8× cheaper). Groq-hosted Llama 4.1 70B hits 550 TPS; nothing in the closed-weights world gets close. If speed is your primary axis, start there.
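
To turn those ballparks into a UX decision, plug them into Total = TTFT + output_tokens / TPS. A sketch using this section's estimates for a 300-token response (the Groq TTFT is an assumption; only its TPS is quoted above):

ballparks = {                          # (TTFT seconds, tokens/sec), April 2026 estimates
    "Haiku 4.5":            (0.35, 180),
    "GPT-5.4":              (0.50, 85),
    "DeepSeek V3.2":        (1.10, 60),
    "Llama 4.1 70B (Groq)": (0.30, 550),   # TTFT assumed
}
output_tokens = 300
for name, (ttft, tps) in ballparks.items():
    print(f"{name:24s} ~{ttft + output_tokens / tps:.1f}s total")

At 300 tokens out against a 3s interactive budget, only Haiku (~2.0s) and Groq-hosted Llama (~0.8s) clear the bar — exactly the kind of elimination Step 2 is for.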

Speed is almost always a tradeoff, not a free lunch. A model that's 3× faster is usually 10–30% worse at hard reasoning. For classification it doesn't matter. For RCA it matters a lot.

4. Reliability

What it is: the probability your call actually returns a usable response on the first try. Includes: provider uptime, rate-limit behavior, regional failover, consistent structured output, refusal rate on legitimate queries.

How to measure it: once you're in production, log every non-200 and every "200 but unusable" (bad JSON, wrong schema, policy refusal, truncation). Compute weekly. Anything above 0.5% failure is worth a router with fallback (Step 6).

Provider reality, April 2026: all four frontier providers (Anthropic, OpenAI, Google, DeepSeek) have had >1h regional outages in the last six months. Single-provider is a reliability choice, one we recommend against for anything user-facing.

5. Ecosystem

The dimension most people forget until week three.

  • Structured output: native JSON mode, JSON Schema enforcement, grammar-constrained decoding.
  • Tool / function calling: single-call vs parallel, strictness, argument validation.
  • Streaming: SSE, WebSocket, chunked JSON. Not all providers do all three cleanly.
  • Caching primitives: prompt caching surface, cache TTL, cache keying.
  • SDK quality: retries, timeouts, typed responses, async support, observability hooks.
  • Multimodal plumbing: image/audio/video input handling, file APIs, embedding models.

Two models of equal quality and cost can be a week of engineering apart. Weight this.

Step 4. Map the models and shortlist

Once you know the job, the constraints, and the five dimensions, you can read the model map.

The map in April 2026 fits on one chart. Price (log scale) on the x-axis, quality tier on the y-axis, family as color.

[Scatter chart: blended price per 1M tokens ($0.30–$15, log scale) on the x-axis vs quality tier (Budget → Frontier) on the y-axis. The "frontier wall" clusters top-right: Opus 4.7, GPT-5.4, Gemini 3.1 Ultra. The "value pocket" sits bottom-left: Haiku 4.5, GPT-5 nano, Gemini 2.5 Flash, DeepSeek V3.2. In between: Sonnet 4.6, GPT-5.4 mini, Gemini 3.1 Pro, Llama 4.1 70B, Mistral Large 3, Qwen 3 72B.]

Two things to notice.

The frontier wall is the cluster in the top-right: Opus 4.7, GPT-5.4, Gemini 3.1 Ultra. They're within 5% of each other on most jobs. Picking among them is a latency, ecosystem, and provider-preference decision. Not a quality decision.

The value pocket is the bottom-left: Haiku 4.5, GPT-5 nano, Gemini 2.5 Flash, and on the open-weights side DeepSeek V3.2. For 80% of production workloads (classification, extraction, simple rewrites, structured output) a model in this pocket does the job at 5–15% of frontier cost.

The middle is where most people land when they don't have a framework. "Sonnet is fine" is a reasonable answer, but it's rarely the right answer. Either your job is simple enough to drop to the pocket, or it's hard enough to climb to the wall. Middle-tier is usually a sign the job wasn't sharpened.

The eight families

  • Anthropic (Claude): Opus 4.7 / Sonnet 4.6 / Haiku 4.5. Strongest at coding, instruction following, tool use. Best caching economics.
  • OpenAI (GPT): GPT-5.4 / 5.4 mini / 5 nano. Widest ecosystem, best function-calling reliability, strongest multimodal.
  • Google (Gemini): 3.1 Ultra / Pro / 2.5 Flash. Best at long-context retrieval (real usable 1M), strong multimodal, cheapest frontier option.
  • DeepSeek: V3.2. MoE architecture; matches mid-tier closed models on many tasks at a fraction of their price. Privacy caveats for some buyers.
  • Meta (Llama): 4.1 405B / 70B / 8B. Open weights. Runs anywhere. Best self-host option if you have the GPUs.
  • Mistral: Large 3 / Medium 3 / small. EU-hosted, strong at European languages, open-weights small tier.
  • xAI (Grok): Grok 4. Uneven. Great at certain reasoning tasks, lower ecosystem maturity.
  • Qwen: Qwen 3 72B / 32B / 7B. Open weights, strong multilingual (especially CJK), cheap to self-host.

Build the shortlist

Apply your Step 2 constraints to the model map. What drops out is your shortlist.

flowchart TD
    Start[All models in map] --> B{Budget/request<br/>≤ threshold?}
    B -- no --> X1[Eliminate]
    B -- yes --> L{Latency p95<br/>≤ threshold?}
    L -- no --> X2[Eliminate]
    L -- yes --> P{Privacy floor<br/>satisfied?}
    P -- no --> X3[Eliminate]
    P -- yes --> C{Context fits<br/>p95 + 30%?}
    C -- no --> X4[Eliminate]
    C -- yes --> M{Modalities<br/>supported?}
    M -- no --> X5[Eliminate]
    M -- yes --> E{Ecosystem<br/>features present?}
    E -- no --> X6[Eliminate]
    E -- yes --> Short[Shortlist: 2-4 candidates]

If more than 4 models survive, tighten the constraints. Usually budget or latency. If 0 survive, relax the least-important constraint by 20% and try again. Usually that's modalities (you probably don't need audio input), sometimes latency (is 800ms really a user-visible win over 1100ms?).
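
The flowchart above is a dozen lines of code once your constraints have numbers. A sketch with illustrative entries — fill real figures from the pricing and capabilities tables; the Opus latency here is an assumption:

CANDIDATES = [
    {"name": "Opus 4.7",         "cost_per_req": 0.0150, "p95_ms": 2400, "json_mode": True},
    {"name": "Haiku 4.5",        "cost_per_req": 0.0004, "p95_ms": 680,  "json_mode": True},
    {"name": "Gemini 2.5 Flash", "cost_per_req": 0.0002, "p95_ms": 540,  "json_mode": True},
]

def shortlist(models, max_cost, max_p95_ms, need_json=True):
    return [m["name"] for m in models
            if m["cost_per_req"] <= max_cost
            and m["p95_ms"] <= max_p95_ms
            and (m["json_mode"] or not need_json)]

print(shortlist(CANDIDATES, max_cost=0.002, max_p95_ms=800))
# → ['Haiku 4.5', 'Gemini 2.5 Flash']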

Worked example: the support-email classifier

Constraints recap:

Budget/request:  ≤ $0.002
Latency p95:     ≤ 800ms total
Context:         ≤ 4k in, ≤ 200 out
Privacy:         no-training, EU-region OK
Modalities:      text only
Ecosystem:       JSON mode required

Filtering the April 2026 model map:

  • Opus 4.7, GPT-5.4, Gemini 3.1 Ultra: eliminated on budget.
  • Sonnet 4.6, GPT-5.4 mini, Gemini 3.1 Pro: eliminated on budget (at 2M calls/month, they're 5–10× over).
  • Haiku 4.5, GPT-5 nano, Gemini 2.5 Flash: survive budget and latency, all have JSON mode, all offer no-training clauses.
  • DeepSeek V3.2: survives budget, fails latency (p95 ~1.2s on hosted API).
  • Llama 4.1 70B on Groq: survives latency (550 TPS is fast), survives budget, strong JSON mode.

Shortlist: Haiku 4.5, GPT-5 nano, Gemini 2.5 Flash, Llama 4.1 70B on Groq. Four models. Now you run the head-to-head.

The "more than 4" problem

If you emerge with 8 candidates, your constraints are mush. Go back to Step 2, put a harder number on budget or latency, and re-filter. Testing 8 models on a real golden set is a week of work; testing 3 is an afternoon. Each model past the fourth adds sharply diminishing information.

The "zero survived" problem

Usually one of three things:

  1. Privacy floor is too strict for the budget. Self-hosted-only plus <$0.001/call means you're renting your own GPUs, and that's a business decision, not a model-selection one.
  2. Latency budget is too tight. 200ms p95 total is impossible at frontier quality today. Adjust the UX, not the model.
  3. The context requirement is exotic. 800k tokens of input at <$0.005/call does not exist in April 2026. Re-architect with RAG.

Constraints can be wrong. The point of writing them down is that when they conflict, you see the conflict and negotiate it deliberately.

Step 5. Run the head-to-head

This is the step that separates teams who pick the right model from teams who pick the model they already had an account with.

The work is not hard. It's just work most people skip.

flowchart LR
    G[Golden set<br/>20+ real prompts] --> R[Run each<br/>shortlisted model]
    R --> O1[Outputs from<br/>model A]
    R --> O2[Outputs from<br/>model B]
    R --> O3[Outputs from<br/>model C]
    O1 --> Blind[Blind grading<br/>with rubric]
    O2 --> Blind
    O3 --> Blind
    Blind --> Score[Per-model<br/>scorecard]
    Score --> Decision[Pick primary<br/>+ secondary]
    Decision -.feedback.-> G

Build the golden set

Twenty prompts minimum. Forty is better. Each one drawn from real traffic. Real customer emails, real SQL questions, real product descriptions, real code diffs. Not made-up examples. Not cherry-picked hard cases. A representative slice: 60% typical, 25% edge cases, 15% genuinely hard.

For each prompt, write the expected outcome. Not the expected output (the model will phrase it differently), but the properties the output must have. For the support classifier: "should be labeled BILLING with confidence ≥0.8." For the SQL generator: "should join orders and customers on customer_id, should filter by status = 'paid', should return no more than 100 rows."

If you can't write the expected outcome, the prompt isn't in your golden set yet. It's a research question.
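
One way to make "expected outcome, not expected output" concrete: pair each prompt with property checks instead of a reference answer. A sketch in the shape of the support classifier; the field names are this guide's running example, not a required format:

import json

golden_set = [
    {
        "prompt": "Subject: charged twice this month ...",   # a real email, truncated here
        "checks": [
            lambda out: out["label"] == "BILLING",
            lambda out: out["confidence"] >= 0.8,
        ],
    },
    # ... 19+ more entries drawn from real traffic
]

def grade(entry, raw_output):
    try:
        out = json.loads(raw_output)
        return all(check(out) for check in entry["checks"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return False                  # unusable output is a hard fail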

Run the shortlist

Same prompts, same system prompt, same temperature setting (0 for deterministic tasks, 0.3–0.7 for generative). Log every output. Don't read them yet.
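
The run itself is a loop. A sketch assuming your candidates sit behind OpenAI-compatible endpoints (each provider's native SDK differs in spelling, not spirit), reusing the golden_set structure from the sketch above:

import json
from openai import OpenAI

client = OpenAI()                                        # or OpenAI(base_url=...) per provider
MODELS = ["candidate-a", "candidate-b", "candidate-c"]   # your shortlist's model IDs
SYSTEM = "Classify the support email. Respond as JSON: {label, confidence}."

with open("outputs.jsonl", "w") as f:
    for model in MODELS:
        for entry in golden_set:
            resp = client.chat.completions.create(
                model=model,
                temperature=0,                           # deterministic task
                messages=[{"role": "system", "content": SYSTEM},
                          {"role": "user", "content": entry["prompt"]}],
            )
            f.write(json.dumps({"model": model,
                                "prompt": entry["prompt"],
                                "output": resp.choices[0].message.content}) + "\n")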

Grade blind

Strip model IDs from the outputs. Grade against the rubric. Pass/fail plus a 1–5 quality rating. Two-person grading if you can; single-grader is fine if the rubric is tight.

Why blind: you have a preference. You've heard Opus is good. You've heard GPT is bad (or vice versa). That preference will leak into subjective grades at a rate of about 15–20%. Blind grading is the only fix.
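
Blinding is mechanical: shuffle, strip the model IDs, and keep the key in a separate file you don't open until the scores are locked in. Stdlib only:

import json, random

rows = [json.loads(line) for line in open("outputs.jsonl")]
random.shuffle(rows)

with open("to_grade.jsonl", "w") as blind, open("key.jsonl", "w") as key:
    for i, row in enumerate(rows):
        blind.write(json.dumps({"id": i, "prompt": row["prompt"],
                                "output": row["output"]}) + "\n")
        key.write(json.dumps({"id": i, "model": row["model"]}) + "\n")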

The scorecard

Build a table. Rows: models. Columns: pass rate, avg quality, p95 latency, $/call, $/pass (dividing cost by pass rate, the honest cost metric), failure mode breakdown.

Model                  Pass rate   Avg quality   p95 latency   $/call    $/pass
Haiku 4.5              94%         4.2           680ms         $0.0004   $0.00043
GPT-5 nano             89%         3.9           720ms         $0.0003   $0.00034
Gemini 2.5 Flash       91%         4.0           540ms         $0.0002   $0.00022
Llama 4.1 70B (Groq)   87%         3.8           260ms         $0.0006   $0.00069

This is a real-looking scorecard for the support classifier. Gemini 2.5 Flash wins on $/pass by 2×. Haiku 4.5 wins on quality. Llama on Groq wins on latency. None of them is "best". But the shape of the table tells you what to do next:

  • If quality matters most → Haiku 4.5 primary.
  • If cost matters most → Gemini 2.5 Flash primary.
  • If latency matters most → Llama 4.1 70B primary.
  • If you want the best Pareto-frontier bet → Gemini 2.5 Flash primary, Haiku 4.5 as escalation fallback for low-confidence cases (Step 6).

LLM-as-judge: when and when not

Using a strong model (Opus 4.7, GPT-5.4) to grade candidate outputs is fine for scale. It is not a substitute for human grading on your first pass. Do the first 20–40 prompts by hand. Once you trust the rubric, let a judge-model scale it to 200.

LLM-as-judge fails when the rubric is vague, when the judge has a systematic preference for its own family's outputs (a real effect), and when the task is one the judge itself is bad at. Double-blind it: don't tell the judge which model produced which output.

Time budget

The full head-to-head for a typical solopreneur job is 2–4 hours. Build golden set: 60–90 minutes. Run shortlist: 20 minutes (scripted). Grade: 45–90 minutes. Scorecard: 15 minutes.

Four hours now saves you four months of paying for the wrong model. It is the single highest-ROI activity in this guide.

Doing it by hand is the honest way, and we recommend it the first time so you feel the tradeoffs. If you'd rather skip the four hours: LLMTest replays your real production traffic against every frontier and value-pocket model, blind-grades outputs with a dual-judge rubric (two strong models, outputs swapped to cancel position bias), and publishes the scorecard above — $/pass column and all — without you writing any grading code.

Step 6. Route, don't pick

Here is the move that most guides don't make. After all the work of Steps 4 and 5, don't pick one model. Pick two, sometimes three, and route between them.

Single-model production is two kinds of fragile at once. It's a single price ceiling: every request pays the same price, even the 80% that a cheaper model would nail. And it's a single point of failure. When that provider has a 90-minute regional outage (and they all do, 2–3× per year), your feature is down.

Routing fixes both.

flowchart TD
    Req[Incoming request] --> Cache{Semantic<br/>cache hit?}
    Cache -- yes --> Return[Return cached]
    Cache -- no --> Primary[Primary model<br/>cheap + fast]
    Primary -. provider error<br/>or timeout .-> Fallback[Fallback model<br/>different provider]
    Primary --> Check{Output passes<br/>quality gate?}
    Fallback --> Check
    Check -- yes --> Success[Return to user]
    Check -- no --> Escalate[Escalate to<br/>frontier model]
    Escalate --> Success
    Success -.log.-> Metrics[(Metrics +<br/>golden set feed)]

The three routing patterns worth knowing

1. Downshift the whole flow to the cheapest model that still holds your quality bar. Most "routing" guides pitch per-request quality-gate escalation (send to cheap, escalate to frontier on gate failure). It sounds great and it's brittle in practice: writing a gate that's strict enough to catch bad outputs without false-positiving on good ones is harder than picking the right model in the first place. The move that actually works for most solopreneur workloads is simpler — replay your real traffic against the whole shortlist, measure $/pass on each, and replace the primary with the cheapest model whose pass rate stays within ~1 point of frontier. That's a one-time offline swap, not a runtime decision. Typical result on a classification workload: 60–90% bill cut, quality indistinguishable. (This is the core of what LLMTest's autopilot does — re-runs the comparison weekly, swaps only when the win is statistically real, and auto-reverts within 24h if the golden set regresses.)

2. Provider fallback for outages. Primary on provider A, secondary on provider B (different underlying infra). On hard errors (5xx, timeout, rate limit exhausted), retry once on the secondary. This single pattern moves you from 99.5% availability to 99.95% availability with no quality change. (A minimal sketch follows this list.)

3. Prompt caching for repeated context. If your prompts share a long system prompt or a long context block (RAG results, codebase, document), enable prompt caching at the provider (Anthropic, OpenAI, Google) and structure your prompts so the cacheable part is at the front. Anthropic and Google offer 90% discount on cached tokens; OpenAI is in the same neighborhood. On workloads with 5k+ tokens of shared prefix, this alone cuts the bill 50–70%. (This one's on you — it's a provider feature, not something a router adds.)
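
Pattern 2 in code, as promised. A sketch against the openai v1 Python SDK; which exceptions count as "hard" and the 10-second timeout are judgment calls, not recommendations:

import openai

def call_with_fallback(primary, fallback, messages, **kwargs):
    client, model = primary                  # e.g. (OpenAI(...), "model-on-provider-a")
    try:
        return client.chat.completions.create(model=model, messages=messages,
                                              timeout=10, **kwargs)
    except (openai.APITimeoutError, openai.RateLimitError,
            openai.APIConnectionError, openai.InternalServerError):
        client, model = fallback             # different provider, different infra
        return client.chat.completions.create(model=model, messages=messages,
                                              timeout=10, **kwargs)

One retry, on genuinely different infrastructure, and only on hard errors — soft failures belong to the quality gate below, not to this wrapper.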

All three stack. A well-routed support classifier with cheap-default plus fallback plus provider-side caching runs at 15–25% of the sticker price of always-sending-to-Opus, with higher availability than any single provider.

What a quality gate looks like

The quality gate is the magic. It has to be cheap to compute and strict enough to catch bad outputs without false-positiving on good ones.

Common gates, in order of reliability (the first is sketched in code after the list):

  • Schema validation for structured output. JSON matches expected shape, required fields present, enums in allowed set. Cheap, deterministic.
  • Validator function for domain outputs. SQL parses and runs against an EXPLAIN, generated URL returns 200, generated email address validates.
  • Confidence self-report. Ask the model for a 0–1 confidence score alongside its output. Rough but works as a weak signal.
  • LLM-as-judge on a sample. Expensive, only worth it for high-stakes paths.
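
The first gate as code for the running classifier example — stdlib only; the eight label names are invented stand-ins for your real categories:

import json

ALLOWED_LABELS = {"BILLING", "BUG", "REFUND", "SHIPPING",
                  "ACCOUNT", "FEATURE", "SPAM", "OTHER"}   # hypothetical label set

def passes_gate(raw: str) -> bool:
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (isinstance(out, dict)
            and out.get("label") in ALLOWED_LABELS
            and isinstance(out.get("confidence"), (int, float))
            and 0.0 <= out["confidence"] <= 1.0)

Anything that fails goes to the escalation model; anything that passes ships. Cheap, deterministic, and it doubles as the "200 but unusable" logger from the reliability section.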

If you can't write a cheap quality gate, the cheap model probably isn't safe as primary for that job. Escalate to frontier as the default and keep the router for fallback.

The cost math

Take a classifier handling 2M calls/month:

  • Always-Opus: $0.015/call × 2M = $30,000/month.
  • Always-Haiku (picked by vibes): $0.0004/call × 2M = $800/month. Pass rate ~94%, but you only know that if somebody measured — a vibes pick means nobody did.
  • Swapped to Gemini 2.5 Flash after the offline head-to-head: $0.0002/call × 2M = $400/month. Pass rate ~91% at roughly half Haiku's $/pass (the head-to-head picked it on exactly that metric). One provider fallback on outage.

Gemini 2.5 Flash wasn't obvious from reading Twitter. It was obvious from running the scorecard in Step 5. That's the whole loop: measure, swap, re-measure next quarter. The difference between $30k/mo and $400/mo is one afternoon of work — or one weekly autopilot cycle.

Full walk-through with code in building an LLM fallback chain in 10 minutes.

Step 7. Re-evaluate on a schedule

You picked your models. You built your router. You shipped. You are not done.

Three things change under you:

  1. New models release every 4–8 weeks. Sometimes one of them is better on your job at your price. You find out by re-running the golden set, not by reading tweets.
  2. In-place model updates silently regress. A provider upgrades claude-sonnet-4-6 to a new checkpoint and your pass rate drops 4 points. No changelog, no version bump. Happens. Watch for it.
  3. Your traffic mix shifts. What started as 60% typical / 40% edge cases becomes 30/70 as users learn what your product can do. The model that won on last year's distribution may lose on this year's.

The fix is a scheduled re-evaluation. Quarterly minimum. Monthly if your volume justifies it.

flowchart TD
    Prod[Production traffic] --> Sample[Sample new<br/>real prompts]
    Sample --> Refresh[Refresh golden set<br/>keep 50% evergreen]
    Refresh --> Rerun[Re-run shortlist<br/>+ any new frontier]
    Rerun --> Compare{Winner<br/>changed?}
    Compare -- no --> Log[Log, revisit next quarter]
    Compare -- yes --> Test[A/B 10% traffic<br/>for 2 weeks]
    Test --> Decide{Holds up in<br/>production?}
    Decide -- no --> Log
    Decide -- yes --> Route[Update router config]

The evergreen 50% of the golden set catches silent regressions. The refreshed 50% catches distribution drift. Together they tell you when to switch and, just as important, when not to, even though a shiny new model is on everyone's timeline.

Try it on your job

Before we hit the traps, put the framework on your own problem. Fill these in, in your head or on paper, and the rest of this post will click harder.

1. What's your job in one sentence?

Input, output, success criterion. If you can't finish this sentence, no model will reliably solve the problem. Go sharpen.

2. What's your hard ceiling on cost per request?

In dollars. If you don't know, pick a number you'd be embarrassed to exceed. $0.002 is a reasonable default for high-volume features; $0.05 for low-volume expensive ones.

3. What's your p95 latency budget, total?

In milliseconds. Interactive UX: under 3,000ms. Form-click: under 5,000ms. Async background: under 30,000ms. Be honest about which bucket you're in.

4. Can your data leave your region? Can it be used for training?

Two yes/no answers, and they write themselves. Either you know, or you need to ask legal before you pick.

5. Which task type dominates?

Extract/classify, transform, generate, reason, tool-use, multimodal, RAG, or code. If the answer is "several of those", you have several jobs. Pick a different model for each.

6. What's your current model, and what does it cost you per call right now?

If you don't know the second number, that's your first todo, not "pick a new model". Measure, then decide.

Based on your answers, where should you probably land?
  • Classify + extract, high volume, tight budget: value-pocket models (Haiku 4.5, Gemini 2.5 Flash, GPT-5 nano). Route with an escalation to a mid-tier model on gate failure.
  • Generative + creative, moderate volume, moderate budget: mid-tier (Sonnet 4.6, Gemini 3.1 Pro, GPT-5.4 mini) as primary, frontier as escalation.
  • Reasoning-heavy or agentic, low volume: frontier (Opus 4.7 with thinking, GPT-5.4, Gemini 3.1 Ultra). Route between two frontier models for fallback, not cost.
  • Self-hosting required: Llama 4.1 70B or DeepSeek V3.2 (weights), Mistral Large 3 for EU-hosted.
  • Long real-context (>300k usable): Gemini 3.1 Pro or Ultra. They're the only ones that don't go lossy past 200k.

This is a starting shortlist, not an answer. Now run Step 5.

Ten traps that waste months

After helping hundreds of solopreneurs audit their LLM stack, the same ten mistakes show up over and over. In no particular order, because each of them is individually capable of doubling your bill or halving your quality.

1. Picking by leaderboard. MMLU rank does not correlate with your pass rate on your golden set at better than 0.3. Picking by benchmark is picking with one eye closed.

2. Trusting advertised context windows. "1M tokens" is a marketing statement. Real usable context is often 20–40% smaller. Test yours at the sizes you actually use.

3. Ignoring thinking tokens. Reasoning models bill you for tokens you never see. A reasoning model on a classification task is a 10× bill inflation for zero quality gain.

4. Using a reasoning model for a classification job. Reasoning is for hard problems. 80% of your traffic is easy. Don't pay reasoning price on easy traffic.

5. No fallback. One provider equals one outage away from a production incident. Two providers equals an order of magnitude more availability for an afternoon of work.

6. Measuring once, never again. In-place updates regress silently. Quarterly re-eval is the cheapest insurance you'll ever buy.

7. Prompt bloat you never prune. System prompts grow by accretion: a line per bug, a line per complaint, never a line removed. Your six-month-old prompt is probably twice as long as it needs to be.

8. JSON retry storms. Soft-failing with a retry loop on bad JSON hides a 15–40% cost inflation. Validate, repair, or switch to native JSON mode. Don't retry blind.

9. "Multimodal" as a feature checkbox. Unless a specific job actually consumes images, audio, or video, picking a multimodal model "for the option" adds cost with no benefit.

10. Locking into one provider's SDK shape. Write against an OpenAI-shaped interface (LiteLLM, OpenRouter, the LLMTest proxy) so switching models is a config change, not a refactor. The model you pick today is not the model you'll want in 18 months.

TL;DR cheat sheet

The whole guide on one page:

STEP 1 · JOB — one-sentence statement: input → output → success. Pass the sharpening test.
STEP 2 · CONSTRAINTS — budget/req · p95 latency · context · privacy floor · modalities · ecosystem.
STEP 3 · 5 DIMENSIONS — quality · cost · speed · reliability · ecosystem. Measure, don't assume.
STEP 4 · SHORTLIST — apply constraints to the map → 2–4 models. Not 8. Not 1.
STEP 5 · HEAD-TO-HEAD — golden set (20+) · blind grade · rubric · scorecard · $/pass.
STEP 6 · ROUTE — cheap primary + escalate on gate fail + fallback on outage.
STEP 7 · RE-EVAL — quarterly re-run. Catch regressions, drift, and new models.

Trap checklist:

  ☐ Not picking by leaderboard
  ☐ Measured real context usage
  ☐ No reasoning model on a classifier
  ☐ Fallback configured
  ☐ Quality gate in place
  ☐ Prompt pruned in last 90 days
  ☐ JSON validated, not blind-retried
  ☐ Provider-neutral SDK shape
  ☐ Quarterly re-eval on the calendar

Landing zone by job type:

  • Classify / extract (high volume) → value pocket (Haiku 4.5, Gemini 2.5 Flash, GPT-5 nano) + escalation.
  • Generate / transform (moderate) → mid-tier (Sonnet 4.6, Gemini 3.1 Pro, GPT-5.4 mini) + frontier fallback.
  • Reason / agent (low volume, high stakes) → frontier (Opus 4.7, GPT-5.4, Gemini 3.1 Ultra) with cross-provider fallback.

Print it. Pin it. Re-run it each quarter.

FAQ

Should I just use GPT-5? For a prototype, sure. It's the lowest-friction path to shipping. For production, no. "Whichever model I have an account with" is how bills hit $40k/month with a 5× cheaper option on the table. Run Step 5.

Is open-source ready for production? Yes for mid-tier jobs, with caveats. Llama 4.1 70B, DeepSeek V3.2, Qwen 3 72B match mid-tier closed models on many tasks. Gaps: tool use (closed still ahead), multimodal (closed still ahead), operational burden (you're on the hook for uptime). Self-host only if privacy or cost math requires it.

How often should I switch models? As often as a better one ships — which, in 2026, is every few weeks. The catch: manual re-testing, prompt re-tuning, and guardrail re-validation are tedious, so most teams only get around to it once or twice a year and leave money (and quality) on the table between releases. That's the exact gap LLMTest closes: re-runs your golden set against every new frontier and value-pocket model automatically, scores the outputs against your rubric, and flags a switch the moment one actually beats your incumbent. Chase measured wins, just don't do the chasing by hand.

My volume is tiny. Does any of this apply? Steps 1, 2, and 5 do. Skip routing (Step 6) until you have enough volume to care. At <10k calls/month, just pick one good model and move on.

What about reasoning models, when are they worth it? When the task has multiple steps that depend on each other: root-cause analysis, multi-file code changes, planning, proofs, agent workflows. Not for: classification, extraction, rewrites, single-shot generation, anything with a clean deterministic success criterion. The rule of thumb: if a competent human does it in one read-through, don't use a reasoning model.

What if my app needs images and text? Pick a multimodal frontier (GPT-5.4, Gemini 3.1 Ultra, Opus 4.7 all handle images). Watch the token accounting carefully: images cost more than you expect, and pricing varies 3× between providers.

Should I fine-tune? Almost never, as a solopreneur. Prompt engineering plus RAG covers 95% of the cases where people reach for fine-tuning, at 1% of the ongoing cost. Fine-tune only after you've exhausted prompting and have >50k high-quality labeled examples.

Where LLMTest fits

Steps 5 and 7 are where we live. Step 6 is partial (we handle the error-fallback half; the quality-gate half is on you).

Concretely:

  • Step 5 — automated head-to-head. Point your code at our OpenAI-shaped proxy (https://llmtest.io/v1, one line change). We log your real traffic per flow and replay samples through every frontier and value-pocket model on demand. Outputs are blind-graded by two judge models with swapped positions to cancel position bias, and you get the exact scorecard shape shown in Step 5 — pass rate, quality score, $/call, $/pass, failure-mode breakdown.
  • Step 6 — provider fallback on errors. If your primary returns 429, 5xx, or times out, we retry on a configured fallback automatically. We also auto-repair broken JSON when you set response_format: { type: "json_object" }. Runtime quality-gate escalation is still on you — gates are too job-specific to ship generically.
  • Step 7 — autopilot re-eval with safety gates. Weekly, autopilot re-runs the head-to-head against any new models, and if a candidate statistically beats your current primary (Wilson lower-bound on pass rate, ≥80% judge agreement, ≥20% cost reduction, no length-bias artifact, no golden-set regression) it swaps the flow over. If the production golden set regresses in the next 24 hours, it auto-reverts and emails you. This is the step that's genuinely tedious to do by hand and the reason most teams skip it.

You write the job statement and seed a golden set. We do the weekly discipline.

If you're picking an LLM today, run the framework above manually. You'll get most of the value. If you ship an AI feature and don't want to re-run it every quarter, try LLMTest. Keep your existing provider. Top up $5 in credits and cancel whenever.

Either way: don't pick by leaderboard. Pick by your own scorecard. That's the whole game.

Ship LLM features without burning your budget.

LLMTest proxies your OpenAI / Anthropic calls, tracks cost per feature, and auto-rewrites prompts to be cheaper while holding quality. Free to start.

Create a free account

Related reading

1 token is not 1 word: LLM conversion rates that predict your bill
Apr 27, 2026 · 6 min read
What is RAG? The 3 components and when not to use it
Apr 22, 2026 · 6 min read
Context windows explained: why your 128k model only gives you 100k
Apr 21, 2026 · 6 min read