Every major provider now has prompt caching. Anthropic pitches a 90% discount on cached tokens. OpenAI makes it automatic, no setup required. The pitch is hard to argue with: pay full price once, then pay almost nothing on repeat calls. But "almost nothing" has a catch. With Anthropic, you pay 25% extra on the first call to write the cache. That write cost is where the math gets interesting.
The break-even formula
With Anthropic's 5-minute TTL, cache writes cost 1.25× the base input price. Cache reads cost 0.1× (a 90% discount). To find the break-even: how many requests need to hit the same cache entry before you come out ahead?
Call the base cost for your cached tokens P. Without caching, k requests cost k × P. With caching:
- First request: 1.25 × P (write)
- Requests 2 through k: 0.1 × P each (reads)
- Total with caching: P × (1.15 + 0.1k)
Set k × P equal to P × (1.15 + 0.1k) and solve: 0.9k = 1.15, so k ≈ 1.28. Two requests hitting the same cache entry already save you money. You don't need high traffic to make caching worth it; you just need reuse.
For the 1-hour TTL option (which costs 2× base to write rather than 1.25×), break-even rises to 2.1 requests. Still low, but it matters for workloads with sparse traffic.
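If you want to sanity-check the break-even for other pricing tiers, the arithmetic is two lines of code. A minimal sketch, assuming the write and read prices are expressed as multiples of the base input price:

// Requests needed before caching beats paying full price every time.
// writeMultiple: cache-write cost as a multiple of base input price (1.25 for the 5-min TTL, 2 for 1-hour).
// readMultiple: cache-read cost as a multiple of base input price (0.1 for Anthropic).
function cacheBreakEven(writeMultiple: number, readMultiple: number): number {
  // Solve k = writeMultiple + (k - 1) * readMultiple for k.
  return (writeMultiple - readMultiple) / (1 - readMultiple);
}

console.log(cacheBreakEven(1.25, 0.1)); // ≈ 1.28
console.log(cacheBreakEven(2, 0.1));    // ≈ 2.11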
OpenAI's model is structurally different: caching is automatic with no write premium, and the read discount is 50% rather than 90%. There's no break-even calculation to run; any cache hit saves you money. The trade-off is a lower ceiling: the best case saves 50% on the cached tokens rather than 90%.
Where caching pays off
RAG-style retrieval systems. Suppose your app retrieves a 50,000-token document and passes it to Claude Sonnet 4.6 ($3/M input, $0.30/M cache read) with 200 queries per day arriving in bursts:
- Without caching: 200 × 50,000 tokens × $3/M = $30/day
- With caching (1 write, 199 reads within a burst): 50,000 × $3.75/M + 199 × 50,000 × $0.30/M = $3.17/day
- Savings: 89%
The key assumption is "in bursts": the queries have to arrive close enough together that the cache entry never sits idle past its 5-minute TTL (each hit refreshes it). If requests arrive one per hour, that assumption collapses (more on that below).
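To run this comparison for your own numbers, the two cost formulas are easy to script. A sketch, assuming the Sonnet 4.6 rates above ($3/M input, $3.75/M cache write, $0.30/M cache read); swap in your own:

// Daily cost of a cached prefix, given how many requests pay the write price vs. the read price.
function dailyCost(opts: {
  prefixTokens: number;
  requestsPerDay: number;
  cacheWrites: number;    // how many requests pay the write price
  basePerMTok: number;    // $/M input tokens, uncached
  writePerMTok: number;   // $/M cache-write tokens
  readPerMTok: number;    // $/M cache-read tokens
}): { uncached: number; cached: number } {
  const { prefixTokens, requestsPerDay, cacheWrites, basePerMTok, writePerMTok, readPerMTok } = opts;
  const M = 1_000_000;
  const uncached = (requestsPerDay * prefixTokens * basePerMTok) / M;
  const cached =
    (cacheWrites * prefixTokens * writePerMTok) / M +
    ((requestsPerDay - cacheWrites) * prefixTokens * readPerMTok) / M;
  return { uncached, cached };
}

// RAG example: one burst, so a single cache write covers the day.
const rag = dailyCost({
  prefixTokens: 50_000,
  requestsPerDay: 200,
  cacheWrites: 1,
  basePerMTok: 3,
  writePerMTok: 3.75,
  readPerMTok: 0.3,
});
// rag.uncached ≈ 30, rag.cached ≈ 3.17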
Chatbots with a long system prompt. A 5,000-token system prompt with 1,000 conversations per day at three turns each means 3,000 calls carrying the same prefix. If roughly 10 conversations land within any 5-minute window, about 100 cache writes will cover all 3,000 calls. At Sonnet 4.6 pricing: $1.88 in writes + $4.35 in reads = $6.23/day, versus $45/day without caching. That's 86% off from one code change.
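The same dailyCost sketch above reproduces these figures, with 100 of the 3,000 calls paying the write price:

const chatbot = dailyCost({
  prefixTokens: 5_000,
  requestsPerDay: 3_000,
  cacheWrites: 100,
  basePerMTok: 3,
  writePerMTok: 3.75,
  readPerMTok: 0.3,
});
// chatbot.uncached ≈ 45, chatbot.cached ≈ 6.23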
Large tool definitions for agentic apps. If your system sends 20 tool definitions on every call, those definitions can hit the minimum token threshold on their own and they're completely static. Cache them. This is one of the most overlooked caching wins in agentic systems.
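Anthropic's documentation has you cache tools by putting the breakpoint on the last tool definition, which caches the whole tools array as part of the prefix. A sketch under that assumption; the tool itself is a made-up example:

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  tools: [
    // ...19 other static tool definitions...
    {
      name: 'search_orders', // hypothetical tool
      description: 'Search the order database by customer, date range, or status.',
      input_schema: {
        type: 'object',
        properties: { query: { type: 'string' } },
        required: ['query'],
      },
      // Breakpoint on the last tool caches the entire tools array.
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: userMessage }],
});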
Where it costs you more
Below the minimum token threshold. Anthropic won't cache prompts shorter than 4,096 tokens for Opus 4.7 and Haiku 4.5, or 2,048 tokens for Sonnet 4.6. If your system prompt is 1,200 tokens, no cache entry is created, no error is raised, and cache_creation_input_tokens in the response stays at 0. You added cache_control and got nothing. This is the most common confusion when developers first try prompt caching.
Spread-out traffic. Anthropic's 5-minute TTL is generous for bursty workloads, painful for sparse ones. One API call every 10 minutes means every request is a cache write, never a read: you pay 25% extra on every single call for zero benefit. The 1-hour TTL looks like the fix, but it costs 2× to write. At one call every 10 minutes, six requests fit in an hour: one doubled write plus five 0.1× reads comes to roughly 2.5× the base cost, versus 6× without caching. That's still a real saving, but the thinner the traffic gets, the closer you sit to the 2.1-request break-even, and below it the doubled write costs more than the reads save.
Highly dynamic prefixes. Caching works at the prefix level: the cached content must appear at the start of your prompt, unchanged. If your system prompt includes per-user data (name, account tier, conversation history), there's no stable prefix to cache. The cached portion has to be fully static. One pattern that works: put a large static block of instructions first, then append the dynamic per-user content after the cache breakpoint.
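Here's a sketch of that pattern using Anthropic's system content blocks; staticInstructions, userName, and accountTier are illustrative placeholders:

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: staticInstructions,             // large, identical for every user
      cache_control: { type: 'ephemeral' }, // cached prefix ends here
    },
    {
      type: 'text',
      text: `User: ${userName}, tier: ${accountTier}`, // dynamic, never cached
    },
  ],
  messages: [{ role: 'user', content: userMessage }],
});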
One-off scripts and batch jobs. A script that calls the API 10 times with the same large context will pay 1.25× on the first call and 0.1× on the next 9, roughly 78% total savings. Fine. But a batch job that runs once a week with different documents on each run will write a cache entry that never sees a read before it expires. Pure write cost, no savings.
Setting it up
For Anthropic, you opt in explicitly by adding cache_control to the content blocks you want cached:
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: yourSystemPrompt, // must be >= 2,048 tokens for Sonnet 4.6
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: userMessage }],
});
// Verify cache usage in the response:
const { cache_read_input_tokens, cache_creation_input_tokens } = response.usage;
A cache_read_input_tokens value above zero confirms a cache hit. If it stays at 0 despite setting cache_control, count your system prompt tokens; you're likely below the threshold. The conversion from words to tokens isn't straightforward; our breakdown of how LLM tokenizers count tokens across different content types gives you the conversion rates you need to estimate your prompt size.
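If you'd rather measure than estimate, recent versions of the Anthropic SDK expose a token counting endpoint. A sketch, assuming messages.countTokens is available in your SDK version:

const count = await anthropic.messages.countTokens({
  model: 'claude-sonnet-4-6',
  system: [{ type: 'text', text: yourSystemPrompt }],
  messages: [{ role: 'user', content: 'placeholder' }],
});
console.log(count.input_tokens); // compare against the 2,048-token minimum for Sonnet 4.6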
For OpenAI, there's nothing to configure. Caching is on by default for prompts over roughly 1,024 tokens. Check prompt_tokens_details.cached_tokens in the response to confirm hits.
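The equivalent check with the OpenAI Node SDK looks roughly like this; treat the exact field path as an assumption to verify against your SDK version:

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const completion = await openai.chat.completions.create({
  model: 'gpt-4o', // illustrative; any model with automatic caching
  messages: [
    { role: 'system', content: yourSystemPrompt }, // needs ~1,024+ tokens to be cached
    { role: 'user', content: userMessage },
  ],
});

console.log(completion.usage?.prompt_tokens_details?.cached_tokens ?? 0);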
The decision rule
Add caching when your static prefix is above the minimum threshold, multiple requests reuse that same prefix within the TTL window, and the prefix is large enough that even a 90% discount generates meaningful dollar savings.
Skip caching when your traffic is sparse relative to the TTL, your prompts are mostly dynamic, or your system prompt is under the minimum for your chosen model.
Track cache_read_input_tokens and cache_creation_input_tokens in production. A cache hit rate below 50% on a workload you expected to benefit from caching usually means either the TTL is too short for your request cadence or more of your prompt is dynamic than you thought.
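If you're not routing through something that tracks this for you, a few in-memory counters get you a workable hit rate. A minimal sketch, fed with the usage object from each Anthropic response:

const cacheStats = { reads: 0, writes: 0, uncached: 0 };

function recordUsage(usage: {
  input_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}) {
  cacheStats.reads += usage.cache_read_input_tokens ?? 0;
  cacheStats.writes += usage.cache_creation_input_tokens ?? 0;
  cacheStats.uncached += usage.input_tokens;
}

function cacheHitRate(): number {
  const cacheable = cacheStats.reads + cacheStats.writes;
  return cacheable === 0 ? 0 : cacheStats.reads / cacheable;
}

// After each call: recordUsage(response.usage);
// A hit rate well below 0.5 points at a TTL/cadence mismatch or a prefix that isn't as stable as you thought.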
If you route through LLMTest's proxy, cache hit rates and cost-per-call breakdowns appear in your dashboard automatically. The billing surprises that tend to show up before developers reach prompt caching (thinking tokens, JSON retries, and prompt bloat) are covered in our guide to the hidden costs most LLM dashboards don't show.