Semantic caching for LLMs: 3 approaches and where each breaks

By LLMTest Team · May 25, 2026 · 6 min read · Tags: infra, caching, cost, production
On this page

  1. Approach 1: Prompt-hash caching
  2. Approach 2: Embedding-based caching
  3. Approach 3: Hybrid caching
  4. Choosing an approach
  5. Wiring it into your LLM stack

Semantic caching sits between your application and the LLM API. On each request it asks: "have we answered something like this before?" When a match exists, the user gets a response in 20ms instead of 3-15 seconds, and you pay nothing in tokens. When there's no match, the request passes through as normal.

Real production hit rates land between 20% and 45% for mixed-workload APIs. FAQ bots with predictable question patterns can reach 60-70%. Whether semantic caching works is not in question. The problem is that each implementation approach has a specific, non-obvious failure mode. Here's what each one is, and what to do about it.

Approach 1: Prompt-hash caching

How it works. Hash the full prompt string (SHA-256 is fine), look up the hash in Redis, and return the cached response on a hit. On a miss, call the LLM, cache the result, and return it.

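import { createHash } from 'node:crypto';

// `redis` is assumed to be an ioredis-style client exposing get() and setex();
// `llmFn` is any async function that takes a prompt and returns a JSON-serializable response.
async function hashCache(prompt, llmFn, redis, ttlSeconds = 3600) {
  // Key on the SHA-256 digest of the exact prompt string.
  const key = 'llm:' + createHash('sha256').update(prompt).digest('hex');
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached); // hit: skip the LLM call
  // Miss: call the model, store the result with a TTL, then return it.
  const response = await llmFn(prompt);
  await redis.setex(key, ttlSeconds, JSON.stringify(response));
  return response;
}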

Where it breaks. Chat applications prepend conversation history to every message. Two different users asking "what's the return policy?" at turn 20 of unrelated conversations produce completely different prompt strings, so they get different hashes and zero cache hits. Open-ended chat workloads typically land at 10-15% hit rate with this approach.

A common fix is to hash only the user's final message, not the full conversation. That drives hit rates up, but then the cached answer ignores context: "what's the price?" means something different in a conversation about Claude than in one about GPT. You end up returning wrong answers with confidence.

Where it actually works well. Narrow-query FAQ bots with a bounded question set. If users ask one of 200 standard questions through a support widget, normalize the input (lowercase, strip punctuation and filler words) and hash the result. Hit rates of 60-70% are achievable. Beyond that context, this approach runs out of ceiling fast.
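The normalization step could look something like the sketch below. The function name normalizeQuery and the filler-word list are illustrative assumptions, not a recommended set; you would tune the list against your own question traffic.

```javascript
// Hypothetical filler-word list; tune this against your own FAQ traffic.
const FILLER_WORDS = new Set(['please', 'hi', 'hello', 'hey', 'thanks', 'um', 'just']);

// Lowercase, strip punctuation, drop filler words, and collapse whitespace
// so that trivially different phrasings produce the same string to hash.
function normalizeQuery(text) {
  return text
    .toLowerCase()
    .replace(/['’]/g, '')              // "what's" -> "whats"
    .replace(/[^\p{L}\p{N}\s]/gu, ' ') // any other punctuation -> space
    .split(/\s+/)
    .filter((word) => word !== '' && !FILLER_WORDS.has(word))
    .join(' ');
}
```

The normalized string is meant for computing the cache key only. The LLM should still receive the user's original text on a miss.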

Approach 2: Embedding-based caching

How it works. Convert the incoming prompt to a vector using an embedding model (text-embedding-3-small from OpenAI costs roughly $0.02 per million tokens). Store that vector in a database with approximate nearest-neighbor search: Redis Vector, pgvector, or Pinecone all work. On each request, embed the prompt, find the nearest cached vector, and return the cached response if cosine similarity exceeds a threshold (typically 0.90 to 0.95).
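As a rough illustration of the lookup step, here is a minimal in-memory sketch. It assumes the prompt has already been embedded, and it scans entries linearly instead of using an approximate nearest-neighbor index. The names cosineSimilarity and findCachedResponse, and the { vector, response } entry shape, are hypothetical.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan over cached entries shaped like { vector, response }.
// A real deployment would delegate this search to a vector index.
function findCachedResponse(queryVector, entries, threshold = 0.92) {
  let best = null;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(queryVector, entry.vector);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  // Return the cached response only when the nearest entry clears the threshold.
  return best !== null && bestScore >= threshold ? best.response : null;
}
```

The default of 0.92 is just one value inside the 0.90 to 0.95 range mentioned above, not a recommendation.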

Where it breaks. Three distinct failure modes show up in production.

The threshold grey zone. At a threshold of 0.92, valid paraphrases get cache misses. "How do I cancel my subscription?" and "What's the process to cancel?" score around 0.88 cosine similarity with most general-purpose embedding models, forcing a fresh LLM call. Drop the threshold to 0.82 to catch those and you start seeing false hits: research on production embedding caches shows that queries as semantically opposite as "sort ascending" and "sort descending" can score 0.84 in vector space, causing the cache to return the wrong answer. There is no single global threshold that eliminates both failure modes simultaneously.

Stale data. A user asks "What's Claude Sonnet 4.6's price?" You cache the answer. Anthropic changes pricing. That cached response is now confidently wrong and will stay wrong until the TTL expires. Most teams set a global TTL (24 hours is common) that is either too short to generate meaningful cost savings or too long to prevent stale answers. The right model is per-category TTLs: short windows (1-4 hours) for time-sensitive facts like pricing or availability, much longer windows (days or indefinite) for stable content like mathematical answers or definitional explanations.

Multi-turn context poisoning. In a 20-turn conversation, the embedding is dominated by the conversation history rather than by the user's last question. Two users deep in different but similarly-themed conversations can hit cosine similarity above 0.87 and swap each other's cached responses. A user asking a follow-up about Python 3 gets an answer cached from a conversation about Python 2.

Approach 3: Hybrid caching

How it works. Run a hash lookup first. If no exact match exists, fall back to vector similarity search. This catches exact repeats cheaply and handles paraphrases without over-tuning the similarity threshold. In a well-tuned hybrid setup, the hash layer catches 15-25% of traffic at essentially zero marginal cost; the embedding layer handles another 20-35% of the traffic that hashing misses.
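The two-layer flow can be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: exactStore is a plain Map keyed by prompt text (standing in for the SHA-256 key in Redis), and embedFn and vectorSearch are hypothetical injected functions.

```javascript
// Two-layer lookup: exact match first, vector similarity only on an exact miss.
// Assumed (hypothetical) dependency shapes:
//   exactStore: Map<string, response>
//   embedFn(prompt) -> Promise<number[]>
//   vectorSearch(vector) -> Promise<{ response, score } | null>
async function hybridLookup(prompt, { exactStore, embedFn, vectorSearch, threshold = 0.92 }) {
  if (exactStore.has(prompt)) {
    return { layer: 'hash', response: exactStore.get(prompt) };
  }
  // The embedding call is only paid for when the exact layer misses.
  const vector = await embedFn(prompt);
  const nearest = await vectorSearch(vector);
  if (nearest && nearest.score >= threshold) {
    return { layer: 'embedding', response: nearest.response };
  }
  // Full miss: the caller falls through to the LLM.
  return { layer: 'miss', response: null };
}
```

Returning the layer that answered makes it easy to track per-layer hit rates separately.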

Where it breaks. Two failure modes specific to the two-layer architecture.

Embedding model drift. When you upgrade your embedding model (from text-embedding-ada-002 to text-embedding-3-large, for example), all cached vectors are in the old model's latent space. New queries generate vectors in the new space, and cosine similarity between old and new vectors is meaningless. You get both false hits and false misses until you reindex the entire cache. For a large production cache with millions of entries, that reindex takes time and wipes out your hit rate during the transition.

Version your cache keys to prevent this: include the embedding model name and version as a prefix (e.g., emb:v2:<hash> vs emb:v1:<hash>). Old entries expire naturally; new entries build up in the versioned namespace.
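One possible way to build such keys is sketched below. The mapping from version label to model and the helper name embeddingCacheKey are assumptions made for illustration; only the emb:v1 / emb:v2 key shape comes from the example above.

```javascript
// Hypothetical registry: which embedding model each namespace version refers to.
const EMBEDDING_NAMESPACES = new Map([
  ['v1', 'text-embedding-ada-002'],
  ['v2', 'text-embedding-3-large'],
]);

// The version new entries are written under; bump it when the model changes.
const ACTIVE_VERSION = 'v2';

// Build a namespaced key so vectors from different models never share a keyspace.
function embeddingCacheKey(promptHash, version = ACTIVE_VERSION) {
  if (!EMBEDDING_NAMESPACES.has(version)) {
    throw new Error(`Unknown embedding namespace: ${version}`);
  }
  return `emb:${version}:${promptHash}`;
}
```

Reads and writes both go through the active version, so entries written under the old prefix are simply never matched again and age out via their TTL.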

Miss-path overhead. On every cache miss, the hybrid approach runs both the hash lookup and the vector search before calling the LLM. For workloads where 80% of requests are novel (low reuse), that adds 20-50ms of latency and an embedding API call to every request that doesn't benefit from caching at all. At 10 million requests per month with a 20% hit rate, you're running 8 million embedding calls that produce no savings. Assuming prompts average about 1,000 tokens, that is 8 billion tokens embedded; at $0.02/M tokens for a typical embedding model, that's $160/month in overhead. Whether that's worth it depends entirely on what you save on the 20% that do hit.
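The overhead figure can be reproduced with a few lines of arithmetic. The average prompt length of 1,000 tokens is an assumption: it is the value at which the request volume, hit rate, and price above come out to $160/month. The function and parameter names are illustrative.

```javascript
// Monthly embedding spend on requests that miss the cache anyway.
function missPathEmbeddingCost({ requestsPerMonth, hitRate, avgPromptTokens, pricePerMillionTokens }) {
  const missedRequests = requestsPerMonth * (1 - hitRate);
  const tokensEmbedded = missedRequests * avgPromptTokens;
  return (tokensEmbedded / 1_000_000) * pricePerMillionTokens;
}

const monthlyOverhead = missPathEmbeddingCost({
  requestsPerMonth: 10_000_000,
  hitRate: 0.2,
  avgPromptTokens: 1_000, // assumed average; measure your own traffic
  pricePerMillionTokens: 0.02,
});
console.log(`~$${monthlyOverhead.toFixed(2)} per month`);
```

Because the cost scales linearly with prompt length, long multi-turn prompts push this number up quickly.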

Choosing an approach

| Workload | Recommended approach | Typical hit rate |
| --- | --- | --- |
| FAQ bot, bounded question set | Prompt-hash (normalized input) | 60-70% |
| Support chatbot, open-ended | Embedding, threshold 0.90-0.93 | 25-40% |
| High-volume API, mixed queries | Hybrid | 30-50% combined |
| Multi-turn chat | Hybrid, cache user turn only | 20-35% |

The most impactful single decision in any embedding-based cache is not the similarity threshold; it's TTL per query category. Group cached responses into "stable" (mathematical answers, code, definitional content) and "dynamic" (pricing, real-world events, current availability). Apply a long TTL to the first group, a short one to the second, and skip caching entirely for queries where the answer is inherently session-specific. That one change cuts the stale-data failure mode more than any threshold tuning will.
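A per-category TTL policy can be as small as a lookup table. The sketch below is illustrative: the category names follow the grouping above, but the specific durations (7 days, 2 hours) and the fallback for unknown categories are assumptions. How a query gets assigned to a category is left out.

```javascript
const HOUR = 3600;
const DAY = 24 * HOUR;

// Hypothetical TTL policy in seconds; null means "do not cache at all".
const TTL_BY_CATEGORY = new Map([
  ['stable', 7 * DAY],   // mathematical answers, code, definitional content
  ['dynamic', 2 * HOUR], // pricing, real-world events, current availability
  ['session', null],     // answers that are inherently session-specific
]);

// Look up the TTL for a category. Unknown categories fall back to the
// short "dynamic" window so an unclassified answer cannot go stale for days.
function ttlFor(category) {
  return TTL_BY_CATEGORY.has(category)
    ? TTL_BY_CATEGORY.get(category)
    : TTL_BY_CATEGORY.get('dynamic');
}
```

A caller would skip the cache write whenever ttlFor returns null and otherwise pass the value as the expiry on the write.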

Wiring it into your LLM stack

Semantic caching sits one layer in front of your LLM gateway, not behind it. You check the cache, and on a miss you pass the request to your API proxy (LLMTest, LiteLLM, or OpenRouter), which then handles failover, retries, and JSON repair before the call reaches the upstream model. That order matters: you want provider failures that could corrupt your cache with partial responses to be caught and recovered by the gateway before anything gets written to the cache store.
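The ordering described above can be sketched as a thin wrapper. This is an assumption-laden illustration: cache is any object with async get and set methods, and gatewayFn is a hypothetical function that calls your proxy and either resolves with a complete response or rejects once its own retries and failover are exhausted.

```javascript
// Cache check first, gateway second, cache write last.
async function cachedCompletion(key, prompt, cache, gatewayFn) {
  const hit = await cache.get(key);
  if (hit !== undefined && hit !== null) return hit;

  // If the gateway rejects, the error propagates from here and the write
  // below never runs, so a failed upstream call cannot land in the cache.
  const response = await gatewayFn(prompt);

  await cache.set(key, response);
  return response;
}
```

The key point is that the cache write happens only after the gateway has returned successfully.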

The LLMTest proxy handles failover and JSON recovery automatically once you point your app at its endpoint. The fallbacks documentation explains how provider-level failures are routed so your application never sees a partial or invalid response.

For the token-level caching that happens inside an LLM call rather than in front of it, prompt caching break-even analysis shows the exact math for when provider-side caching pays off. The LLM rate limits guide covers the capacity pressure that makes semantic caching worth adding in the first place: fewer API calls means lower TPM consumption, which is your first defense against rate-limit-induced fallback chains.

If you want a gateway that sits behind your semantic cache and handles the LLM call side with automatic failover built in, LLMTest's proxy takes about 5 minutes to set up.

