How to route LLM prompts in 2026: cheap first, escalate on fail

By LLMTest Team · Jun 14, 2026 · 4 min read
Tags: infra, routing, cost-optimization, nodejs
On this page

  1. The pattern in three steps
  2. The code
  3. Picking your model tiers
  4. Calibrating the threshold
  5. How this fits with a fallback chain

Most AI features talk to exactly one model. Pick a provider, add an API key, ship it. The cost only becomes obvious when the bill lands.

A model at $5 per million input tokens is massively overqualified for "summarize this 200-word paragraph." If 60–70% of your prompts are routine tasks (classification, simple Q&A, short rewrites), you're paying frontier prices for work a $1/M model handles just as well.

Prompt routing fixes the mismatch. Send each request to the cheapest model that can handle it. When the cheap model can't, escalate silently. Your application calls one function; the bill reflects actual complexity.

The pattern in three steps

  1. Send the prompt to a budget model (for example, Haiku 4.5 or GPT-4o-mini).
  2. Evaluate the output against a quality signal.
  3. If the signal passes, return it. If it fails, resend the same prompt to a frontier model.

The evaluation step is what makes this work in practice. Three signals that hold up in production:

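Self-reported confidence. Ask the model to rate its own confidence at the end of its reply. Budget models often flag uncertainty when you prompt for it explicitly: they tend to know when they're guessing. A high score is still not proof of correctness, since a plausible but wrong answer can come with a confident rating.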

Schema validation. If your feature needs structured output, validate the cheap model's response against your schema. Malformed JSON forces escalation, and the check itself costs no extra API calls. This catches most cases where a smaller model hits its reasoning ceiling.
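The schema check can be sketched without a validation library. This is a minimal illustration, not LLMTest's API: validateShape, routeStructured, and the { label, score } shape are hypothetical names, and chat is assumed to be any async function that takes (model, messages) and resolves to the reply text.

```javascript
// Hypothetical shape check: expects {"label": string, "score": number}.
// Returns the parsed object, or null when the reply is unusable.
function validateShape(text) {
  let data;
  try {
    data = JSON.parse(text);
  } catch {
    return null; // malformed JSON
  }
  const ok =
    data !== null &&
    typeof data === "object" &&
    typeof data.label === "string" &&
    typeof data.score === "number";
  return ok ? data : null;
}

// `chat` is injected (assumption): any async (model, messages) => string.
async function routeStructured(chat, prompt, cheap, frontier) {
  const messages = [{ role: "user", content: prompt }];

  const first = validateShape(await chat(cheap, messages));
  if (first) return { data: first, model: cheap, escalated: false };

  // The cheap reply failed validation, so retry once on the frontier tier.
  const second = validateShape(await chat(frontier, messages));
  if (!second) throw new Error("Both tiers returned invalid output");
  return { data: second, model: frontier, escalated: true };
}
```

In a real feature you would likely swap validateShape for whatever schema validator the project already uses; the routing logic around it stays the same.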

Hard failures. Empty response, refusal, or timeout. These always escalate regardless of the threshold.
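One way to detect these hard failures is sketched below. The helper names (isHardFailure, withTimeout, tryCheap), the 10-second default, and the refusal phrases are all assumptions for illustration; refusal detection in particular is a heuristic you would tune against your own traffic. As before, chat is assumed to be any async (model, messages) function that resolves to the reply text.

```javascript
// Heuristic refusal openers; an assumption, not an exhaustive list.
const REFUSAL_PATTERNS = [/^i can(?:'|’)?t help/i, /^i'm sorry, but/i, /^i cannot/i];

// True for empty replies and replies that open with a refusal phrase.
function isHardFailure(text) {
  if (typeof text !== "string" || text.trim() === "") return true;
  const head = text.trim();
  return REFUSAL_PATTERNS.some((re) => re.test(head));
}

// Rejects if `promise` does not settle within `ms` milliseconds.
// Note: this stops waiting but does not cancel the underlying request.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Returns the cheap reply, or null when the caller should escalate.
async function tryCheap(chat, model, messages, ms = 10_000) {
  try {
    const text = await withTimeout(chat(model, messages), ms);
    return isHardFailure(text) ? null : text;
  } catch {
    return null; // HTTP error, network error, or timeout
  }
}
```

A null result means "escalate now", independent of any confidence threshold.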

The code

A complete router in plain Node.js, using the LLMTest proxy:

// router.js
// Requires Node.js 18+ (global fetch) and an ES module context.
const BASE = "https://llmtest.io/v1";

// Sends one chat completion request and returns the reply text.
async function chat(model, messages) {
  const r = await fetch(`${BASE}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.LLMTEST_API_KEY}`,
    },
    body: JSON.stringify({ model, messages }),
  });
  if (!r.ok) throw new Error(`HTTP ${r.status}`);
  const { choices } = await r.json();
  // Treat a missing or null message as an empty reply.
  return choices?.[0]?.message?.content ?? "";
}

async function routePrompt(prompt) {
  const cheap = "anthropic/claude-haiku-4-5";
  const frontier = "anthropic/claude-opus-4-8";
  const THRESHOLD = 3;

  let out = "";
  try {
    out = await chat(cheap, [
      {
        role: "system",
        content:
          "Answer the question. End your reply with CONFIDENCE: N (1 = uncertain, 5 = certain).",
      },
      { role: "user", content: prompt },
    ]);
  } catch {
    // Hard failure on the cheap tier (HTTP or network error): fall through and escalate.
  }

  // Parse and strip the trailing score with one regex so the two stay in sync.
  // A missing score counts as 1, so unscored replies escalate.
  const m = out.match(/\s*CONFIDENCE:\s*([1-5])\D*$/i);
  const score = m ? parseInt(m[1], 10) : 1;
  const answer = (m ? out.slice(0, m.index) : out).trim();

  // An empty answer always escalates, whatever the score.
  if (answer && score >= THRESHOLD) return { answer, model: cheap, escalated: false };

  return {
    answer: await chat(frontier, [{ role: "user", content: prompt }]),
    model: frontier,
    escalated: true,
  };
}

export { routePrompt };

THRESHOLD = 3 means: if Haiku rates itself 3, 4, or 5 out of 5, trust the answer and return it. A 1 or 2 triggers escalation. Across mixed production traffic, Haiku typically escalates 20–30% of the time. The exact rate depends entirely on your prompts.

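If you're not routing through a proxy, point BASE at your provider's OpenAI-compatible endpoint instead. The model identifiers will likely need to change too, since the provider/model form used here is a proxy-style naming convention. The function signature stays the same either way.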

Picking your model tiers

Verified 2026 input prices:

Tier      Model                        Input price (per 1M tokens)
Budget    anthropic/claude-haiku-4-5   $1.00
Budget    openai/gpt-4o-mini           $0.15
Frontier  anthropic/claude-opus-4-8    $5.00
Frontier  openai/gpt-5                 $1.25

With Haiku and Opus: if 75% of calls are answered by Haiku and 25% by Opus, the blended input cost is 0.75 × $1.00 + 0.25 × $5.00 = $2.00/M, against a $5.00/M baseline if everything went to Opus. That's a 60% reduction. With the cheap-first router above, escalated prompts are billed on both tiers, so the realistic figure is $1.00 + 0.25 × $5.00 = $2.25/M, a 55% reduction.

With GPT-4o-mini and GPT-5: the same 75/25 split lands at $0.43/M blended ($0.46/M when escalated prompts pay for both attempts), versus $1.25/M on GPT-5 alone. GPT-4o-mini is the most aggressive budget option. For short summarization and classification-heavy workloads, it can handle 90% of traffic on its own.

If latency and throughput factor into your tier selection, the speed and cost ranking of all five sub-$1/M models covers tokens per second, TTFT, and the workloads where each breaks.
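The blended figures are easy to reproduce with a small helper. blendedCost and its cheapFirst flag are hypothetical names for this sketch; the flag models the case where every prompt is billed on the cheap tier before any escalation, which is how an escalate-on-fail router behaves. Input tokens only, so output pricing and the extra system prompt are ignored.

```javascript
// Expected input cost per million tokens, in USD.
// escalationRate: fraction of prompts that end up on the frontier tier (0 to 1).
// cheapFirst: when true, every prompt is billed on the cheap tier first,
// so escalated prompts pay for both attempts.
function blendedCost(cheapPrice, frontierPrice, escalationRate, cheapFirst = true) {
  const cheapShare = cheapFirst ? 1 : 1 - escalationRate;
  return cheapShare * cheapPrice + escalationRate * frontierPrice;
}

// Haiku + Opus at a 75/25 split.
console.log(blendedCost(1.0, 5.0, 0.25, false).toFixed(2)); // 2.00 (each call billed once)
console.log(blendedCost(1.0, 5.0, 0.25).toFixed(2)); // 2.25 (cheap-first escalation)

// GPT-4o-mini + GPT-5 at the same split, each call billed once.
console.log(blendedCost(0.15, 1.25, 0.25, false).toFixed(3)); // 0.425
```

Plugging in your own measured escalation rate shows quickly whether the budget tier is winning often enough to matter.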

Calibrating the threshold

The most common mistake is setting the threshold too high. At THRESHOLD = 4, Haiku escalates 40–50% of calls and you lose most of the cost benefit. The budget tier has to win on most prompts to matter.

Calibrate on a real sample of your traffic before deploying. Run 50–100 prompts through the full router with model and escalated logged per request. Look for categories where the cheap model over-escalates. Usually it's one or two distinct task types. Handle those with a hardcoded tier override that skips the confidence check entirely.
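A calibration pass over those logs might look like the sketch below. It assumes each log entry also carries a category label, which the router above does not record; escalationByCategory, pickTier, and the override entries are hypothetical names for illustration.

```javascript
// Each log entry is assumed to look like { category, model, escalated }.
// Returns one row per category, highest escalation rate first.
function escalationByCategory(logs) {
  const stats = new Map();
  for (const { category, escalated } of logs) {
    const s = stats.get(category) ?? { total: 0, escalated: 0 };
    s.total += 1;
    if (escalated) s.escalated += 1;
    stats.set(category, s);
  }
  return [...stats]
    .map(([category, s]) => ({ category, total: s.total, rate: s.escalated / s.total }))
    .sort((a, b) => b.rate - a.rate);
}

// Hypothetical override table: categories that skip the confidence check.
const TIER_OVERRIDES = new Map([
  ["legal-review", "frontier"], // always escalates in the sample, so go direct
  ["tagging", "cheap"], // never needs the frontier tier
]);

// "auto" means: use the confidence-based router as usual.
function pickTier(category) {
  return TIER_OVERRIDES.get(category) ?? "auto";
}
```

Categories that sit at the top of the sorted list are the candidates for an override.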

A self-assessed 4 or 5 from Haiku is generally reliable. A 3 is borderline and worth watching. For structured-output use cases where you need something deterministic, skip confidence scoring and use schema validation as the escalation trigger: zero extra calls, catches the failure mode that matters most.

The LLMTest fallbacks documentation covers a complementary approach: a judge-based quality gate that runs after each response and re-routes when the score falls below your configured threshold. The judge catches subtle failures (plausible but wrong answers) that self-confidence scoring misses.

How this fits with a fallback chain

Routing and fallbacks solve different problems. Routing selects which tier to try based on expected complexity. Fallbacks handle the case where a provider is down or throttling within a tier. Production setups typically need both.
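The two layers compose naturally, as in the rough sketch below. It is an illustration under assumptions, not the LLMTest implementation: callWithFallback, routeWithFallback, the tiers object, and the accept callback (your quality signal) are hypothetical, and chat is any async (model, messages) function that throws on provider errors.

```javascript
// Fallback layer: tries each model in order and moves on when a call throws.
async function callWithFallback(chat, models, messages) {
  let lastError;
  for (const model of models) {
    try {
      return { text: await chat(model, messages), model };
    } catch (err) {
      lastError = err; // provider down or throttling: try the next one
    }
  }
  throw lastError ?? new Error("No models configured");
}

// Routing layer: picks the tier; the fallback list handles outages inside it.
// tiers = { cheap: [...models], frontier: [...models] }
// accept(text) returns true when the cheap reply passes the quality signal.
async function routeWithFallback(chat, prompt, tiers, accept) {
  const messages = [{ role: "user", content: prompt }];

  // If the whole cheap tier is unavailable, treat it as a reason to escalate.
  const cheap = await callWithFallback(chat, tiers.cheap, messages).catch(() => null);
  if (cheap && accept(cheap.text)) return { ...cheap, escalated: false };

  const big = await callWithFallback(chat, tiers.frontier, messages);
  return { ...big, escalated: true };
}
```

Keeping the layers separate means an outage on one budget provider shifts traffic to another budget provider first, instead of straight to frontier prices.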

Building an LLM fallback chain in 10 minutes covers the second layer, including soft failures at HTTP 200 (malformed JSON, empty responses) that a plain availability retry won't catch.

For teams running through LLMTest's proxy, the dashboard shows per-request model and cost breakdowns across both tiers automatically. If you're tracking down where the rest of your bill comes from, the three hidden LLM costs covers the categories (thinking tokens, JSON retries, prompt bloat) that routing alone won't fix.

