How to handle LLM rate limits: 4 production-tested patterns

By LLMTest Team · May 20, 2026 · 5 min read · Tags: infra, rate-limits, reliability, vibe-coders
On this page

  1. Exponential backoff with full jitter
  2. Token budget pre-check
  3. Circuit breaker
  4. Provider failover on sustained rate limits
  5. Why retries make outages worse

Your retry loop fires. It hits a rate limit. It fires again. It hits the same limit. After four attempts it gives up, and your user sees an error modal. This is not a rate limit problem. This is a retry logic problem: the API was telling you to slow down and you kept pushing.

Rate limits are how providers protect themselves from being overloaded. Hitting one is normal; thrashing against one is not. Here are four patterns that work in production, in roughly the order you should add them.

1. Exponential backoff with full jitter

Most apps already have some form of retry logic. The problem is usually the implementation.

Naive backoff doubles the wait between retries: 1s, 2s, 4s, 8s. When a spike of requests hits a rate limit at the same moment, say at the top of a minute when the provider's counter resets, every client waits the same intervals and retries in lockstep. The thundering herd hits the API again in a wave.

Full jitter fixes this by randomizing each delay within the window:

async function withBackoff(fn, options = {}) {
  const { maxRetries = 4, baseDelayMs = 500, maxDelayMs = 30_000 } = options;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxRetries) throw err;

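      // Retry only rate limits (429) and server-side errors (5xx)
      const isRetriable = err.status === 429 || (err.status >= 500 && err.status < 600);
      if (!isRetriable) throw err;

      // Use the server's Retry-After hint when present. Only the seconds form is
      // parsed; a missing or non-numeric value (such as an HTTP date) becomes 0
      // instead of NaN, which would make setTimeout fire immediately.
      const retryAfterSec = Number.parseInt(err.headers?.['retry-after'], 10);
      const retryAfterMs = Number.isFinite(retryAfterSec) ? retryAfterSec * 1000 : 0;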

      // Full jitter: random(0, min(maxDelay, base * 2^attempt))
      const cap = Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt));
      const jitteredDelay = Math.random() * cap;
      const delayMs = Math.max(retryAfterMs, jitteredDelay);

      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

Two details matter here. First, error classification: 429 and 5xx errors are retriable; 400 (bad request), 401 (auth failure), and 422 (invalid payload) are not. Retrying a 400 wastes money, because the problem is the request, not the provider. Second, the Retry-After header: when Anthropic or OpenAI includes one, that is the API telling you when its capacity will free up. Use it as your floor, not a ceiling you override. The header can carry either a number of seconds or an HTTP date; the code above only handles the seconds form.

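The property names (err.status, err.headers) vary by HTTP client. OpenAI's Node SDK exposes both on its API errors, but depending on the SDK version err.headers may be a plain object or a Headers instance that you read with .get('retry-after'), so check which shape you have. With raw fetch you inspect response.status and response.headers.get('retry-after') before throwing.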
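For the raw fetch case, a minimal sketch of that inspection step might look like the following. The helper name fetchOrThrow and its fetchImpl parameter are assumptions made for illustration; the only goal is to throw an error carrying the err.status and err.headers shape that withBackoff reads.

```javascript
// Hypothetical helper: wraps a fetch call so failures look like the errors
// withBackoff expects (err.status plus a plain-object err.headers).
async function fetchOrThrow(url, init = {}, fetchImpl = fetch) {
  const response = await fetchImpl(url, init);
  if (response.ok) return response;

  const err = new Error(`HTTP ${response.status}`);
  err.status = response.status;
  // Copy only the header the retry loop reads. Headers.get is case-insensitive
  // and returns null when the header is missing.
  const retryAfter = response.headers.get('retry-after');
  err.headers = retryAfter === null ? {} : { 'retry-after': retryAfter };
  throw err;
}
```

It would then be passed to the retry loop as withBackoff(() => fetchOrThrow(url, init)).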

2. Token budget pre-check

Most providers enforce two separate limits: requests per minute (RPM) and tokens per minute (TPM). Backoff handles RPM. TPM behaves differently. A single large request can exhaust your minute's token budget in one call, and the 429 only arrives after the round trip.

Pre-checking token usage before sending lets you defer requests that would blow the budget rather than discovering the failure after it happens:

function estimateTokens(messages) {
  const text = messages.map(m => m.content ?? '').join(' ');
  return Math.ceil(text.length / 4); // ~4 chars per English token
}

class TokenBudget {
  constructor(tpm) {
    this.tpm = tpm;
    this.used = 0;
    this.windowStart = Date.now();
  }

  canSend(estimatedTokens) {
    const now = Date.now();
    if (now - this.windowStart >= 60_000) {
      this.used = 0;
      this.windowStart = now;
    }
    return this.used + estimatedTokens <= this.tpm;
  }

  record(tokens) {
    this.used += tokens;
  }
}

The 4-chars-per-token estimate is a rough approximation that works well enough for pre-flight checks. Real tokenizers differ by model and language, so add a 20% margin if your prompts include non-English text or code.
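As a rough sketch of how the margin and the deferral could fit together: the helper names estimateWithMargin and sendWithinBudget, the pollMs parameter, and the send callback are all assumptions for illustration. The budget argument is any object exposing tpm, canSend, and record, such as the TokenBudget class above.

```javascript
// Rough estimate with a safety margin; 1.2 is the 20% margin from the text.
function estimateWithMargin(messages, margin = 1.2) {
  const text = messages.map(m => m.content ?? '').join(' ');
  return Math.ceil((text.length / 4) * margin);
}

// Hypothetical helper: waits until the budget has room, reserves the tokens,
// then sends. `send` is whatever function performs the actual API call.
async function sendWithinBudget(budget, messages, send, pollMs = 250) {
  const tokens = estimateWithMargin(messages);
  if (tokens > budget.tpm) {
    // Waiting would never help: this request alone is larger than one window
    throw new Error('Request exceeds the per-minute token budget on its own');
  }
  while (!budget.canSend(tokens)) {
    await new Promise(resolve => setTimeout(resolve, pollMs));
  }
  budget.record(tokens); // reserve before sending so concurrent callers see it
  return send(messages);
}
```

Reserving before the call, rather than after the response, is a design choice: it keeps two concurrent callers from both passing the check against the same remaining budget.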

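Prompt caching changes this calculation too. Cache reads are billed at about 10% of Anthropic's base input price, and Anthropic's rate-limit documentation says that for most current Claude models cached input tokens do not count toward the input-tokens-per-minute limit (verify this for the model you use). If you cache a large system prompt across requests, the effective token load per call drops significantly. The specifics of how this affects your break-even are in our prompt caching cost breakdown.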

3. Circuit breaker

Backoff assumes the failure is temporary and the next attempt might succeed. That assumption breaks during a sustained provider incident or a prolonged rate limit. A circuit breaker tracks consecutive failures and stops sending requests to a struggling endpoint entirely, rather than hammering it until your users notice.

class CircuitBreaker {
  constructor(options = {}) {
    this.threshold = options.threshold ?? 5;
    this.cooldownMs = options.cooldownMs ?? 60_000;
    this.failures = 0;
    this.state = 'closed';
    this.openedAt = null;
  }

  async call(fn) {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error('Circuit open: provider is in cooldown');
      }
      this.state = 'half-open';
    }

    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

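Three states: closed (normal), open (reject all requests for cooldownMs), half-open (let a probe through to check whether the provider recovered). This minimal version does not cap concurrency in the half-open state: every call that arrives before the first probe settles is also sent. Use one circuit breaker per provider, so that an Anthropic outage cannot fail your OpenAI calls through a shared breaker.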

The threshold and cooldownMs values need tuning for your traffic. Five failures and a 60-second cooldown is a reasonable starting point for most side projects. At scale, track the failure rate (failures as a percentage of attempts in a rolling window) rather than a raw count.
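One way to sketch the rolling-window variant is shown below. The class name FailureRateWindow, the minSamples guard, and the 50% default in shouldOpen are assumptions for illustration, not values from any library.

```javascript
// Hypothetical rolling-window tracker: keeps timestamped outcomes and reports
// the share of failures within the last `windowMs` milliseconds.
class FailureRateWindow {
  constructor({ windowMs = 60_000, minSamples = 20 } = {}) {
    this.windowMs = windowMs;
    this.minSamples = minSamples; // avoid tripping on a handful of requests
    this.samples = []; // entries of the form { at: timestampMs, failed: boolean }
  }

  record(failed, now = Date.now()) {
    this.samples.push({ at: now, failed });
    this.prune(now);
  }

  // Drop samples that have aged out of the window
  prune(now) {
    const cutoff = now - this.windowMs;
    while (this.samples.length > 0 && this.samples[0].at <= cutoff) {
      this.samples.shift();
    }
  }

  failureRate(now = Date.now()) {
    this.prune(now);
    if (this.samples.length === 0) return 0;
    const failures = this.samples.filter(s => s.failed).length;
    return failures / this.samples.length;
  }

  // True when there is enough traffic and the failure share reaches maxRate
  shouldOpen(maxRate = 0.5, now = Date.now()) {
    return this.failureRate(now) >= maxRate && this.samples.length >= this.minSamples;
  }
}
```

A breaker built on this would call record(true) or record(false) after each attempt and open when shouldOpen() returns true, instead of comparing a consecutive-failure counter against a threshold.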

4. Provider failover on sustained rate limits

When a circuit opens, you need somewhere to send the request. Rate-limit-triggered failover differs from availability failover: the provider is up, just throttled. That distinction matters for routing logic:

  • 429 (rate limited): route to another provider, or queue for the next window. Do not retry the same provider immediately.
  • 503 (unavailable): route to another provider now and let the circuit breaker count the failure.
  • 400 or 422: do not route anywhere else. The problem is the request itself.
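The list above can be sketched as a small classification function. The action names are illustrative labels, not part of any SDK, and statuses outside the list are deliberately left unclassified.

```javascript
// Maps an HTTP status to a routing action, following the list above.
function routeDecision(status) {
  // Throttled: the provider is up, so leave it alone until its window resets
  if (status === 429) return 'failover-or-queue';
  // Unavailable: move on immediately and let the circuit breaker count it
  if (status === 503) return 'failover-now';
  // Bad request: no other provider will accept it either
  if (status === 400 || status === 422) return 'fail-fast';
  // Anything else is outside the list above
  return 'unclassified';
}
```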

The pattern that works: maintain a priority-ordered list of providers. Skip any whose circuit is open, or whose Retry-After timestamp has not yet passed. The LLM fallback chain guide covers building the routing layer in detail, with LiteLLM, OpenRouter, and LLMTest configs.

The rate-limit-specific addition is per-provider state. Track which providers are currently throttled and for how long, so your router skips them automatically without needing another failed call to discover they are still at capacity.
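A minimal sketch of that per-provider state, assuming each provider entry exposes a name and an isCircuitOpen() function. The ProviderRouter class and its method names are hypothetical; they are not the API of LiteLLM, OpenRouter, or LLMTest.

```javascript
// Hypothetical router: providers are tried in priority order, skipping any that
// is throttled (Retry-After not yet elapsed) or whose circuit is open.
class ProviderRouter {
  constructor(providers) {
    // providers: [{ name, isCircuitOpen: () => boolean }] in priority order
    this.providers = providers;
    this.throttledUntil = new Map(); // name -> epoch ms when throttling ends
  }

  // Call this when a provider returns 429, using its Retry-After value
  markThrottled(name, retryAfterMs, now = Date.now()) {
    this.throttledUntil.set(name, now + retryAfterMs);
  }

  isThrottled(name, now = Date.now()) {
    const until = this.throttledUntil.get(name);
    if (until === undefined) return false;
    if (now >= until) {
      this.throttledUntil.delete(name); // window passed, forget the entry
      return false;
    }
    return true;
  }

  // Returns the first usable provider, or null when none is available
  pick(now = Date.now()) {
    for (const provider of this.providers) {
      if (provider.isCircuitOpen()) continue;
      if (this.isThrottled(provider.name, now)) continue;
      return provider;
    }
    return null;
  }
}
```

When pick() returns null, the caller can queue the request until the earliest throttle window ends rather than sending it anywhere.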

Why retries make outages worse

Most production LLM incidents are self-inflicted. A slow request causes a timeout. The retry fires immediately and lands on a provider that is already throttling you. The queue backs up. The retries multiply.

The hidden costs behind LLM retries covers the billing side: a 5% JSON failure rate with two retries per failure costs 10% more than your headline token rate, and that is before rate-limit retries stack on top.
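The arithmetic behind that 10% figure can be reproduced in a few lines. This sketch assumes each retry costs the same as the original call and that every failed request uses all of its retries; the function name retryOverhead is made up for illustration.

```javascript
// Extra spend from retries as a fraction of baseline spend, assuming each retry
// costs the same as the original call and every failure uses all its retries.
function retryOverhead(failureRate, retriesPerFailure) {
  return failureRate * retriesPerFailure;
}

const overhead = retryOverhead(0.05, 2);
console.log(`${(overhead * 100).toFixed(0)}% extra spend`); // prints "10% extra spend"
```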

These four patterns work together as a stack. Backoff with jitter handles the short-term spike. Token budget pre-checks stop you from burning your TPM in one call. Circuit breakers prevent thrashing against a struggling provider. Failover keeps requests moving when a single provider is rate-limited.

Adding them in this order keeps the scope manageable. Most apps get 80% of the benefit from the first two alone.


LLMTest routes around rate limits automatically as part of its proxy layer. Get started with the quickstart and see real-time visibility into which providers are throttling you.
