Your AI feature went down last night. Not because your code was wrong. The model API returned a 503, your retry logic hit the rate limit on the second attempt, and by the time you woke up, half your users had seen an error modal. A fallback chain would have caught all of that without you doing anything.
Here's how to build one in three flavors: LiteLLM for Python shops, OpenRouter for any stack, and LLMTest for when you need a quality gate on top of availability.
The four failure modes you're probably not handling
Most retry logic only catches one of these.
Hard errors (4xx/5xx HTTP status): The API rejects the call outright. Rate limits (429), server errors (500/503), timeouts. Easy to detect, straightforward to retry.
Soft failures (200 OK, garbage output): The API returns HTTP 200 but the response is malformed JSON, an empty string, a refusal, or a hallucinated structure. You try to parse it, crash, and only then find out something went wrong.
Quality degradation: The model answers, but quality has dropped. Maybe the provider is running on overloaded hardware, maybe there was a silent rollback on their end. You won't catch this without running quality checks on each response.
Model retirement: The model ID you're calling got deprecated. Usually there's a grace period with a warning header, but not always.
A fallback chain handles at least the first two. LLMTest handles all four.
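To make that taxonomy actionable, here's a rough triage sketch for a single response. The refusal markers and the assumption that you asked the model for JSON output are illustrative, not something any of the tools below provide, and the last two failure modes can't be judged from one response at all.

import json

# Rough triage for a single model response. The refusal markers and the
# assumption that you asked the model for JSON output are placeholders;
# none of the tools below give you this classification out of the box.
def classify_response(status_code: int, output_text: str) -> str:
    if status_code >= 400:
        return "hard_error"        # 429 / 5xx / timeout: retry or fall back
    if not output_text.strip():
        return "soft_failure"      # 200 OK, empty completion
    if output_text.lstrip().lower().startswith(("i can't", "i cannot")):
        return "soft_failure"      # refusal hiding inside a 200
    try:
        json.loads(output_text)
    except json.JSONDecodeError:
        return "soft_failure"      # 200 OK, but not the structure you asked for
    # Quality degradation and model retirement can't be judged from a single
    # response: they need quality scoring over time and deprecation monitoring.
    return "ok"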
LiteLLM: one config, automatic retries
LiteLLM's Router is the fastest path to fallback for Python apps:
import asyncio

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "primary",
            "litellm_params": {
                "model": "openai/gpt-5",
                "api_key": "sk-...",
            },
        },
        {
            "model_name": "fallback-claude",
            "litellm_params": {
                "model": "anthropic/claude-opus-4-7",
                "api_key": "sk-ant-...",
            },
        },
        {
            "model_name": "fallback-gemini",
            "litellm_params": {
                "model": "gemini/gemini-2.5-pro",
                "api_key": "...",
            },
        },
    ],
    # On a hard failure from "primary", try these groups in order.
    fallbacks=[{"primary": ["fallback-claude", "fallback-gemini"]}],
    retry_after=1,  # wait at least 1s before retrying a failed call
    num_retries=2,  # retry a failed call twice before falling back
)

async def call_model(messages):
    return await router.acompletion(model="primary", messages=messages)

asyncio.run(call_model([{"role": "user", "content": "Hello"}]))
This handles rate limits and 5xx errors automatically. One gap: soft failures pass through. If GPT-5 returns malformed JSON, LiteLLM treats it as a successful call and hands you the broken output. Add your own validation wrapper if you're doing structured extraction.
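A thin wrapper over the router can turn those soft failures into exceptions your own retry or alerting logic can act on. A minimal sketch, assuming the router above and JSON output; the SoftFailure exception and the checks are illustrative, not part of LiteLLM.

import json

class SoftFailure(Exception):
    """Raised when a model returned 200 but the payload is unusable."""

async def call_model_validated(messages):
    response = await router.acompletion(model="primary", messages=messages)
    text = response.choices[0].message.content or ""
    if not text.strip():
        raise SoftFailure("empty completion")
    try:
        return json.loads(text)  # expecting structured output
    except json.JSONDecodeError as exc:
        raise SoftFailure(f"malformed JSON from model: {exc}") from exc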
Also worth checking before you add fallback candidates: models from different providers have different usable context window sizes. If your prompts are long, a fallback with a smaller context budget might silently truncate what the primary handled fine.
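One way to check that up front is to count prompt tokens against each candidate's context limit before you put it in the chain. A rough sketch using LiteLLM's token_counter; the limits in the dict are placeholders, so substitute each provider's current published numbers.

from litellm import token_counter

# Placeholder limits: substitute each provider's current published numbers.
CONTEXT_LIMITS = {
    "openai/gpt-5": 200_000,
    "anthropic/claude-opus-4-7": 200_000,
    "gemini/gemini-2.5-pro": 1_000_000,
}

def fits_every_candidate(messages, reserve_for_output=4_000):
    for model, limit in CONTEXT_LIMITS.items():
        prompt_tokens = token_counter(model=model, messages=messages)
        if prompt_tokens + reserve_for_output > limit:
            print(f"{model}: {prompt_tokens}-token prompt won't fit its context window")
            return False
    return True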
OpenRouter: one extra body parameter
If you're already calling OpenRouter as a gateway, fallback is one parameter away:
// openrouter-fallback.js
async function callWithFallback(messages) {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "openai/gpt-5",
      messages,
      route: "fallback",
      models: [
        "openai/gpt-5",
        "anthropic/claude-opus-4-7",
        "google/gemini-2.5-pro",
      ],
    }),
  });
  if (!res.ok) {
    // Every model in the chain failed, or the request itself was rejected.
    throw new Error(`OpenRouter request failed: ${res.status}`);
  }
  return res.json();
}
The route: "fallback" field tells OpenRouter to try each entry in models, in order, whenever the one before it hard-fails. If OpenRouter is already your gateway, that's essentially zero extra code. The limitation is the same as with LiteLLM: soft failures still land in your app as successful 200s.
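If your stack is Python rather than JavaScript, the same request works through the OpenAI SDK pointed at OpenRouter's base URL, and the model field on the response names the model that actually answered, which is worth logging so a fallback never fires silently. A sketch assuming OpenRouter's OpenAI-compatible endpoint; extra_body is simply how the OpenAI SDK passes non-standard parameters through.

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="openai/gpt-5",
    messages=[{"role": "user", "content": "Hello"}],
    # extra_body carries OpenRouter-specific fields the OpenAI SDK doesn't know about.
    extra_body={
        "route": "fallback",
        "models": [
            "openai/gpt-5",
            "anthropic/claude-opus-4-7",
            "google/gemini-2.5-pro",
        ],
    },
)

# The response echoes the id of the model that actually served the request.
if resp.model != "openai/gpt-5":
    print(f"fallback fired: answered by {resp.model}")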
LLMTest: availability fallback plus a quality gate
The two approaches above route around hard availability failures. LLMTest adds quality-aware routing on top, which handles the "200 OK with garbage" case.
The proxy sits between your code and the model APIs. When a response comes back, it runs a lightweight judge pass (cached for common query patterns). If the primary model answers but scores below your configured quality threshold, the proxy re-routes silently and returns the better result. The routing logic is covered in the LLMTest fallback docs; the full proxy architecture is at /docs/proxy.
The integration is a one-line base URL change:
// llmtest-client.js
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.llmtest.io/v1",
  apiKey: process.env.LLMTEST_API_KEY,
});

async function callModel(messages) {
  return client.chat.completions.create({
    model: "gpt-5",
    messages,
  });
}
The fallback priority and quality threshold are configured in the LLMTest dashboard. Your call sites stay unchanged when you update routing logic.
Three things none of them handle by default
Regardless of which approach you choose, these still need handling on your side:
Billing soft-caps. Some providers silently return shorter or lower-quality responses when you approach a spending limit. There's no HTTP signal for this. You catch it by monitoring output quality or length over time. It's the same category of hidden cost as the three LLM costs nobody talks about: the ones that don't surface in dashboards until they become incidents.
Refusals inside a 200. If a model refuses a request ("I can't help with that"), the HTTP status is still 200. None of the three systems above treats a refusal as a fallback-triggerable failure unless you add a refusal-detection wrapper to your parsing layer; a sketch of one follows below.
Cascading retries. If all models in your chain are failing simultaneously (provider-wide incident, or they all share the same upstream), a naive retry loop will exhaust every option before giving up. Cap total attempts across the entire chain, not per-model.
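Here's a sketch of those last two points together: a refusal check layered onto your parsing, and one attempt budget shared across the whole chain. call_one stands in for whichever client call you actually use, and the refusal markers are placeholders you'd tune for your own traffic.

REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm not able to")

def looks_like_refusal(text: str) -> bool:
    return text.strip().lower().startswith(REFUSAL_MARKERS)

async def call_chain(messages, chain, max_total_attempts=4):
    # One attempt budget for the whole chain: a provider-wide incident should
    # fail fast instead of burning per-model retries on every link in turn.
    attempts = 0
    last_error = None
    for model in chain:
        if attempts >= max_total_attempts:
            break
        attempts += 1
        try:
            text = await call_one(model, messages)  # placeholder: your client call
        except Exception as exc:       # hard error: move to the next model
            last_error = exc
            continue
        if looks_like_refusal(text):   # refusal inside a 200: also move on
            last_error = RuntimeError(f"{model} refused the request")
            continue
        return text
    raise RuntimeError(f"chain exhausted after {attempts} attempts") from last_error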
Which approach fits your setup
LiteLLM: Python stack, just need availability fallback. The Router handles it with minimal config and no additional service to run.
OpenRouter: Any language, already using OpenRouter, need failover with no extra infra. One body parameter away.
LLMTest: Customer-facing output, structured extraction, or anything where a degraded model response causes real downstream damage. The quality gate catches what the other two miss.
Set up fallback routing and quality-aware re-routing in one place: LLMTest takes about a minute to wire up.