Most AI features talk to exactly one model. Pick a provider, add an API key, ship it. The cost only becomes obvious when the bill lands.
A model at $5 per million input tokens is massively overqualified for "summarize this 200-word paragraph." If 60–70% of your prompts are routine tasks (classification, simple Q&A, short rewrites), you're paying frontier prices for work a $1/M model handles just as well.
Prompt routing fixes the mismatch. Send each request to the cheapest model that can handle it. When the cheap model can't, escalate silently. Your application calls one function; the bill reflects actual complexity.
The pattern in three steps
- Send the prompt to a budget model (Haiku 4.5, GPT-4o-mini).
- Evaluate the output against a quality signal.
- If the signal passes, return it. If it fails, resend the same prompt to a frontier model.
The evaluation step is what makes this work in practice. Three signals that hold up in production:
Self-reported confidence. Ask the model to rate its own confidence at the end of its reply. Budget models flag uncertainty reliably when you prompt for it explicitly. They know when they're guessing.
Schema validation. If your feature needs structured output, validate the schema on the cheap model's response. Malformed JSON forces escalation without any extra API calls. This catches most cases where a smaller model hit its reasoning ceiling.
Hard failures. Empty response, refusal, or timeout. These always escalate regardless of the threshold.
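The schema and hard-failure signals can be checked locally, with no extra API call. Below is a minimal sketch, assuming the feature expects a JSON object with a known set of keys. The helper name `shouldEscalate` and the `requiredKeys` parameter are illustrative, not part of any library, and a real feature would probably swap the key check for a proper schema validator. Refusal detection and timeouts are not covered here.

```javascript
// Hypothetical helper: decides whether a budget-model reply should escalate.
// `requiredKeys` is an illustrative stand-in for a real schema check.
function shouldEscalate(raw, requiredKeys = []) {
  // Hard failure: a missing or empty response always escalates.
  if (typeof raw !== "string" || raw.trim() === "") {
    return { escalate: true, reason: "empty" };
  }

  // Schema check, part 1: the reply must parse as a JSON object.
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { escalate: true, reason: "malformed-json" };
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { escalate: true, reason: "not-an-object" };
  }

  // Schema check, part 2: every key the feature depends on must be present.
  const missing = requiredKeys.filter((k) => !(k in parsed));
  if (missing.length > 0) {
    return { escalate: true, reason: `missing:${missing.join(",")}` };
  }

  return { escalate: false, reason: "ok", value: parsed };
}
```

Returning a `reason` alongside the decision makes it easier to see later which signal is driving escalations.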
The code
A complete router in plain Node.js, using the LLMTest proxy:
// router.js
const BASE = "https://llmtest.io/v1";

// Send one chat-completion request and return the reply text ("" if none).
async function chat(model, messages) {
  const r = await fetch(`${BASE}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.LLMTEST_API_KEY}`,
    },
    body: JSON.stringify({ model, messages }),
  });
  if (!r.ok) throw new Error(`HTTP ${r.status}`);
  const { choices } = await r.json();
  // Treat a missing or null message as an empty reply instead of throwing.
  return choices?.[0]?.message?.content ?? "";
}

async function routePrompt(prompt) {
  const cheap = "anthropic/claude-haiku-4-5";
  const frontier = "anthropic/claude-opus-4-8";
  const THRESHOLD = 3;

  // Hard failures on the cheap call (HTTP error, network error) become an
  // empty reply, which scores 1 below and therefore escalates.
  const out = await chat(cheap, [
    {
      role: "system",
      content:
        "Answer the question. End your reply with CONFIDENCE: N (1 = uncertain, 5 = certain).",
    },
    { role: "user", content: prompt },
  ]).catch(() => "");

  // A missing score counts as 1, so unparseable replies escalate too.
  const score = parseInt(out.match(/CONFIDENCE:\s*([1-5])/i)?.[1] ?? "1", 10);
  // Strip the trailing confidence marker before returning the answer.
  const answer = out.replace(/\s*CONFIDENCE:\s*[1-5]\s*$/i, "").trim();

  if (score >= THRESHOLD) return { answer, model: cheap, escalated: false };

  return {
    answer: await chat(frontier, [{ role: "user", content: prompt }]),
    model: frontier,
    escalated: true,
  };
}

export { routePrompt };
THRESHOLD = 3 means: if Haiku rates itself 3, 4, or 5 out of 5, trust the answer and return it. A 1 or 2 triggers escalation. Across mixed production traffic, Haiku typically escalates 20–30% of the time. The exact rate depends entirely on your prompts.
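Swap BASE for the provider's URL if you're not routing through a proxy. The request here uses the OpenAI-style /chat/completions format, so the provider endpoint has to accept that format, and the model IDs may need their provider/ prefix dropped. The routePrompt signature stays the same either way.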
Picking your model tiers
Verified 2026 input prices:
| Tier | Model | Input price |
|---|---|---|
| Budget | anthropic/claude-haiku-4-5 | $1.00/M |
| Budget | openai/gpt-4o-mini | $0.15/M |
| Frontier | anthropic/claude-opus-4-8 | $5.00/M |
| Frontier | openai/gpt-5 | $1.25/M |
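With Haiku and Opus: if Haiku answers 75% of calls and 25% escalate to Opus, the blended input cost is $2.00/M counting only the model that answers, against a $5.00/M baseline if everything went to Opus. That's a 60% reduction. In the router above, escalated calls also pay for the Haiku attempt that failed, so the realistic figure is closer to $2.25/M (1.00 + 0.25 × 5.00), a 55% reduction, assuming similar input lengths on both calls.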
With GPT-4o-mini and GPT-5: the same 75/25 split lands at $0.43/M blended, versus $1.25/M on GPT-5 alone. GPT-4o-mini is the most aggressive budget option. For short summarization and classification-heavy workloads, it can handle 90% of traffic on its own. If latency and throughput factor into your tier selection, the speed and cost ranking of all five sub-$1/M models covers tokens per second, TTFT, and the workloads where each breaks.
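The blended-cost arithmetic in this section is easy to reproduce for your own split. The sketch below assumes both tiers see roughly the same number of input tokens per request. The function names are illustrative. The optional `payForFailedAttempt` flag also counts the budget-model call that escalated requests pay for in the router above.

```javascript
// Prices are per-million-token input prices, as in the table above.
// `escalationRate` is the fraction of requests that end up on the frontier model.
function blendedCost(cheapPrice, frontierPrice, escalationRate, payForFailedAttempt = false) {
  // Without the flag, only the model that answers is billed.
  // With it, every request pays the cheap model and escalations pay both.
  const cheapShare = payForFailedAttempt ? 1 : 1 - escalationRate;
  return cheapShare * cheapPrice + escalationRate * frontierPrice;
}

// Fractional saving relative to sending everything to the frontier model.
function savings(baseline, blended) {
  return (baseline - blended) / baseline;
}

console.log(blendedCost(1.0, 5.0, 0.25));       // Haiku + Opus, answering model only
console.log(blendedCost(1.0, 5.0, 0.25, true)); // same split, counting the failed Haiku call
console.log(blendedCost(0.15, 1.25, 0.25));     // GPT-4o-mini + GPT-5
```

These come out at roughly $2.00/M, $2.25/M, and $0.43/M. Plugging in the escalation rate you measure matters more than the illustrative 75/25 split.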
Calibrating the threshold
The most common mistake is setting the threshold too high. At THRESHOLD = 4, Haiku escalates 40–50% of calls and you lose most of the cost benefit. The budget tier has to win on most prompts to matter.
Calibrate on a real sample of your traffic before deploying. Run 50–100 prompts through the full router with model and escalated logged per request. Look for categories where the cheap model over-escalates. Usually it's one or two distinct task types. Handle those with a hardcoded tier override that skips the confidence check entirely.
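One way to read the calibration logs is to group them by task type and sort by escalation rate. This is a rough sketch, assuming each log entry carries a `category` label that you attach yourself, since the router only returns `model` and `escalated`. The override table and its category name are hypothetical.

```javascript
// Each log entry is assumed to look like { category, model, escalated }.
// `category` is a hypothetical per-request label added by the caller.
function escalationByCategory(logs) {
  const stats = new Map();
  for (const { category, escalated } of logs) {
    const s = stats.get(category) ?? { total: 0, escalated: 0 };
    s.total += 1;
    if (escalated) s.escalated += 1;
    stats.set(category, s);
  }
  // Highest escalation rate first, so override candidates sit at the top.
  return [...stats.entries()]
    .map(([category, s]) => ({ category, ...s, rate: s.escalated / s.total }))
    .sort((a, b) => b.rate - a.rate);
}

// Hypothetical override table: categories listed here skip the confidence check.
const TIER_OVERRIDES = { "contract-analysis": "frontier" };

// Returns the forced tier for a category, or "confidence-check" by default.
function pickTier(category) {
  return TIER_OVERRIDES[category] ?? "confidence-check";
}
```

With 50–100 prompts the per-category counts are small, so treat the rates as a hint about where to look, not a measurement.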
A self-assessed 4 or 5 from Haiku is generally reliable. A 3 is borderline and worth watching. For structured-output use cases where you need something deterministic, skip confidence scoring and use schema validation as the escalation trigger: zero extra calls, catches the failure mode that matters most.
The LLMTest fallbacks documentation covers a complementary approach: a judge-based quality gate that runs after each response and re-routes when the score falls below your configured threshold. The judge catches subtle failures (plausible but wrong answers) that self-confidence scoring misses.
How this fits with a fallback chain
Routing and fallbacks solve different problems. Routing selects which tier to try based on expected complexity. Fallbacks handle the case where a provider is down or throttling within a tier. Production setups typically need both.
"Building an LLM fallback chain in 10 minutes" covers the second layer, including soft failures at HTTP 200 (malformed JSON, empty responses) that a plain availability retry won't catch.
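To make the distinction concrete, here is a minimal sketch of a within-tier fallback that could sit underneath the router. It is an illustration, not the implementation from the linked guide. `withFallback` is a hypothetical helper, and `call` is any async function with the same shape as the `chat` helper in router.js.

```javascript
// `call` is any async function (model, messages) => string, such as the
// `chat` helper in router.js. Models are tried in the order given.
async function withFallback(models, messages, call) {
  let lastError;
  for (const model of models) {
    try {
      const out = await call(model, messages);
      // Soft failure: the call succeeded but nothing usable came back.
      if (typeof out === "string" && out.trim() !== "") {
        return { answer: out, model };
      }
      lastError = new Error(`empty response from ${model}`);
    } catch (err) {
      // Hard failure: provider down, throttled, or otherwise erroring.
      lastError = err;
    }
  }
  // Every model in the tier failed; surface the most recent error.
  throw lastError ?? new Error("no models configured");
}
```

Inside the router, the budget call could then become something like `withFallback(["anthropic/claude-haiku-4-5", "openai/gpt-4o-mini"], messages, chat)`. The confidence check still decides whether to move up a tier.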
For teams running through LLMTest's proxy, the dashboard shows per-request model and cost breakdowns across both tiers automatically. If you're tracking down where the rest of your bill comes from, "the three hidden LLM costs" covers the categories (thinking tokens, JSON retries, prompt bloat) that routing alone won't fix.