The idea behind prompt caching is simple: if you're sending the same large chunk of text on every API call (a system prompt, a reference document, a set of tool definitions), the provider can skip reprocessing it after the first time and charge you less. Three major providers support this today. Each does it differently enough that the "which one" decision deserves more than a checkbox comparison.
What prompt caching actually does
An LLM API call is expensive partly because the model processes every input token from scratch, every time. Prompt caching short-circuits that for the stable prefix of your prompt. On a cold call, the provider processes the prefix, stores the resulting KV tensors, and charges you at a write rate. On subsequent calls that share the same prefix, the provider reads from the tensor cache instead of reprocessing, and charges you at a much lower read rate.
The savings show up in your API response under usage fields (exact names differ per vendor). The magnitude is real: on a 10,000-token system prompt with 1,000 daily calls, you pay for one full-price prompt and up to 999 discounted reads, provided calls arrive often enough to keep the cache entry from expiring. On a batch pipeline with a 50,000-token document, caching can cut that document's input cost by 80-90%.
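As a rough sketch of that arithmetic, the function below compares the input cost of a repeated prefix with and without caching. The $3/M price and the 1.25× write and 0.10× read multipliers are illustrative inputs, and the model assumes the best case where every call after the first lands on a warm cache, which real traffic may not achieve.

```typescript
// Illustrative cost model for a repeated prompt prefix.
// All rates are assumptions supplied by the caller, not fixed vendor constants.
interface PrefixWorkload {
  prefixTokens: number;    // size of the stable prefix
  calls: number;           // number of calls sharing that prefix (at least 1)
  pricePerMTok: number;    // base input price in USD per 1M tokens
  writeMultiplier: number; // cost of the cache-populating call, relative to base
  readMultiplier: number;  // cost of each cache hit, relative to base
}

function prefixCost(w: PrefixWorkload): { uncached: number; cached: number; savings: number } {
  const basePerCall = (w.prefixTokens / 1000000) * w.pricePerMTok;
  const uncached = basePerCall * w.calls;
  // Best case: one write, then every later call is a cache hit.
  const cached = basePerCall * (w.writeMultiplier + w.readMultiplier * (w.calls - 1));
  return { uncached, cached, savings: 1 - cached / uncached };
}

const estimate = prefixCost({
  prefixTokens: 10000,
  calls: 1000,
  pricePerMTok: 3,
  writeMultiplier: 1.25,
  readMultiplier: 0.1,
});

// Daily prefix cost in USD without and with caching, plus the fractional saving.
console.log(estimate.uncached.toFixed(2), estimate.cached.toFixed(2), estimate.savings.toFixed(3));
```

Under these assumptions the prefix costs about $30 a day uncached and about $3 cached. Every cache expiry between calls adds another write, so sparse traffic will save less.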
How Anthropic does it: explicit and surgical
Anthropic requires you to mark what you want cached using a cache_control parameter on individual content blocks:
{
  "system": [
    {
      "type": "text",
      "text": "Your long system prompt here...",
      "cache_control": { "type": "ephemeral" }
    }
  ]
}
Two TTL options: 5 minutes (default) or 1 hour. The cost structure for Sonnet 4.6 ($3/M input):
| Action | Rate | Price per 1M tokens |
|---|---|---|
| Normal input | 1× | $3.00 |
| Cache write (5-min) | 1.25× | $3.75 |
| Cache write (1-hr) | 2× | $6.00 |
| Cache read | 0.10× | $0.30 |
The write premium is what makes Anthropic's caching different from the others. You pay 25% more on the first call that populates a 5-minute cache entry, and 2× for the 1-hour entry. At the 5-minute TTL, the first cache read already recovers that premium: each read saves 0.90× against a 0.25× premium, so break-even comes at roughly 1.3 total requests. Two requests hitting the same cache entry and you're already ahead. The 1-hour entry carries a full 1× premium and needs two reads to pay off. The detailed break-even math, including the formula and a RAG workload worked example, is in the prompt caching break-even analysis.
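The break-even point follows directly from the multipliers in the table. A minimal sketch, treating one uncached pass over the prefix as 1 cost unit (the function name is illustrative):

```typescript
// Number of cache reads needed before caching is cheaper than not caching.
// With caching:    writeMultiplier + reads * readMultiplier
// Without caching: 1 + reads
// Setting the two equal gives reads = (writeMultiplier - 1) / (1 - readMultiplier).
function breakEvenReads(writeMultiplier: number, readMultiplier: number): number {
  if (readMultiplier >= 1) {
    throw new Error("cache reads must be cheaper than normal input for caching to pay off");
  }
  return (writeMultiplier - 1) / (1 - readMultiplier);
}

const fiveMinute = breakEvenReads(1.25, 0.1); // 0.25 / 0.9
const oneHour = breakEvenReads(2, 0.1);       // 1 / 0.9

console.log(fiveMinute.toFixed(2), oneHour.toFixed(2));
```

Counting the initial write as a request, 0.28 reads corresponds to about 1.3 total requests. Reads come in whole requests, so one hit recovers the 5-minute premium, while a 1-hour entry needs two.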
What shows up in your API response usage object:
- cache_creation_input_tokens: tokens processed and written to cache (charged at 1.25× or 2×)
- cache_read_input_tokens: tokens fetched from cache (charged at 0.10×)
- input_tokens: tokens after your last breakpoint, processed normally
You can have up to 4 breakpoints per request, which matters when parts of your prompt change at different rates. Tool definitions update less often than conversation history; you can cache the definitions at one breakpoint and not the conversation.
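One way to lay out two breakpoints is sketched below: one on the tool list, one on the system prompt, with the conversation left unmarked. The model id, tool definitions, and helper name are placeholders chosen for illustration, and the function only builds the request body; it does not call the API.

```typescript
// Sketch of a request body with two cache breakpoints.
// Model id, tools, and prompt text are placeholders, not values from the article.
type CacheControl = { type: "ephemeral" };

interface TextBlock { type: "text"; text: string; cache_control?: CacheControl }
interface ToolDef {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: CacheControl;
}

function buildRequest(tools: ToolDef[], systemPrompt: string, userTurn: string) {
  // Breakpoint 1: mark the last tool so the whole tool list is cached as one prefix.
  const cachedTools: ToolDef[] = tools.map((tool, i) =>
    i === tools.length - 1 ? { ...tool, cache_control: { type: "ephemeral" as const } } : tool
  );
  // Breakpoint 2: mark the system prompt, which changes on a different schedule.
  const system: TextBlock[] = [
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
  ];
  return {
    model: "your-model-id", // placeholder
    max_tokens: 1024,
    tools: cachedTools,
    system,
    // The conversation carries no marker, so it is processed as normal input.
    messages: [{ role: "user" as const, content: userTurn }],
  };
}

const request = buildRequest(
  [
    { name: "get_weather", description: "Look up weather", input_schema: { type: "object", properties: {} } },
    { name: "get_time", description: "Look up local time", input_schema: { type: "object", properties: {} } },
  ],
  "Your long system prompt here...",
  "What's the weather in Oslo?"
);

// Count the markers to confirm the request stays within the 4-breakpoint limit.
const breakpoints = JSON.stringify(request).split('"cache_control"').length - 1;
console.log(breakpoints);
```

Keeping the slowest-changing content first means an edit to the system prompt leaves the cached tool prefix intact.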
The minimum token threshold to get a cache entry: 512 for Claude Fable 5, 1,024 for Opus 4.8 and Sonnet 4.6, 4,096 for Haiku 4.5. Below the threshold, no error appears; cache_creation_input_tokens silently returns 0 and you're billed at normal input rates. This no-op trips up developers more than anything else when first enabling caching.
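Because the below-threshold case raises no error, it can help to classify each response from its usage counters. A small sketch using the three field names above; the status labels and function name are this example's own:

```typescript
// Classifies a response by its cache counters.
// Field names follow the usage object described above; the labels are illustrative.
interface AnthropicUsage {
  input_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

type CacheStatus = "hit" | "write" | "none";

function cacheStatus(usage: AnthropicUsage): CacheStatus {
  // A response can report both reads and writes; any read counts as a hit here.
  if ((usage.cache_read_input_tokens ?? 0) > 0) return "hit";
  if ((usage.cache_creation_input_tokens ?? 0) > 0) return "write";
  // Both counters at 0 despite cache_control being set: one likely cause is a
  // prefix below the model's minimum token threshold.
  return "none";
}

const firstCall = cacheStatus({ input_tokens: 40, cache_creation_input_tokens: 5000, cache_read_input_tokens: 0 });
const secondCall = cacheStatus({ input_tokens: 40, cache_creation_input_tokens: 0, cache_read_input_tokens: 5000 });
const tooShort = cacheStatus({ input_tokens: 300, cache_creation_input_tokens: 0, cache_read_input_tokens: 0 });

console.log(firstCall, secondCall, tooShort);
```

Logging a "none" status on requests that carry cache_control is a cheap way to surface the silent no-op early.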
How OpenAI does it: automatic with no write cost
OpenAI's prompt caching is fully automatic on GPT-5.5, GPT-5.4, and GPT-4o-class models. You don't add markers or change your request structure. Any prompt with a stable prefix over 1,024 tokens becomes cache-eligible; the provider detects prefix matches server-side and applies the discount.
For GPT-5.5 ($5/M input): cached input costs $0.50/M (a 90% discount, identical in magnitude to Anthropic's read rate). No write premium.
The trade-off is control. You can't choose exactly what gets cached or request a longer TTL, and you can't pre-warm the cache before traffic hits. Observability is also thinner than Anthropic's: the response reports cache hits as a cached_tokens count under usage.prompt_tokens_details, but there is no separate write counter, so a miss looks the same whether the prefix changed, the entry was evicted, or the prompt fell below the threshold. To check whether caching is firing, watch cached_tokens across calls once your prefix stabilizes, and cross-check against aggregate billing data.
For workloads where simplicity matters more than optimization tuning (prototypes, apps with a stable system prompt you're not trying to micro-optimize), OpenAI's approach requires no code changes.
How Google Gemini does it: you rent the cache by the hour
Gemini's context caching works at a different architectural level. Instead of adding a parameter to your regular call, you create a cache object first via a separate API resource:
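import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
// With the @google/genai SDK, the content to cache and its TTL go inside `config`.
const cache = await ai.caches.create({
  model: 'gemini-2.5-pro',
  config: {
    contents: [{ role: 'user', parts: [{ text: largeSystemContext }] }],
    ttl: '3600s',
  },
});
// cache.name is then referenced in subsequent generation calls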
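A later generation call then points at the cache by name. A minimal sketch, assuming the @google/genai SDK's config.cachedContent field; the helper name, the example cache name, and the question string are illustrative, and the helper only builds the parameters so it can be inspected without a network call:

```typescript
// Builds the parameters for a generation call that reuses an existing cache.
// Assumes the @google/genai SDK, where a cache is referenced via config.cachedContent.
function cachedGenerationParams(cacheName: string, question: string) {
  return {
    model: "gemini-2.5-pro", // should match the model the cache was created for
    contents: question,      // only the new, uncached part of the prompt
    config: { cachedContent: cacheName },
  };
}

const params = cachedGenerationParams("cachedContents/example-id", "Summarize section 3.");
console.log(params.config.cachedContent);

// Intended use with a client instance (not executed here):
//   const response = await ai.models.generateContent(
//     cachedGenerationParams(cache.name, "Summarize section 3.")
//   );
```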
The key difference is the billing model: you pay a per-token, per-hour storage fee regardless of whether any request actually reads the cache. A cache sitting unused overnight still charges you. This is fundamentally different from Anthropic (pay only on write and read) and OpenAI (pay nothing extra at all).
Cache reads are discounted significantly, roughly 75-90% off base input price depending on the model. But you need to add the storage cost to the math before deciding whether caching makes sense for your workload.
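A sketch of that math: net savings are what the read discount saves minus what storage costs over the cache's lifetime. The prices in the example are placeholder figures chosen for illustration, not quoted Google rates, and the model ignores the cost of creating the cache.

```typescript
// Net savings from a stored cache: read discount earned minus storage fee paid.
// All prices are caller-supplied assumptions; check current vendor pricing.
interface StoredCachePlan {
  cachedTokens: number;
  hoursStored: number;
  reads: number;                   // requests that read the cache while it exists
  inputPricePerMTok: number;       // USD per 1M input tokens
  readDiscount: number;            // 0.75 means cached reads cost 25% of base
  storagePricePerMTokHour: number; // USD per 1M tokens per hour
}

function netSavings(p: StoredCachePlan): number {
  const mTok = p.cachedTokens / 1000000;
  const savedOnReads = mTok * p.inputPricePerMTok * p.readDiscount * p.reads;
  const storageFee = mTok * p.storagePricePerMTokHour * p.hoursStored;
  return savedOnReads - storageFee;
}

// Placeholder pricing for a 100K-token cache kept for one hour.
const base = {
  cachedTokens: 100000,
  hoursStored: 1,
  inputPricePerMTok: 1.25,
  readDiscount: 0.75,
  storagePricePerMTokHour: 4.5,
};

const sparse = netSavings({ ...base, reads: 2 }); // storage outweighs the discount
const busy = netSavings({ ...base, reads: 50 });  // discount outweighs storage

console.log(sparse.toFixed(4), busy.toFixed(4));
```

With these placeholder numbers the cache needs about five reads per hour to cover its own storage, which is why utilization decides whether it pays.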
Gemini's minimum context size for caching is higher than the other providers'. Older Gemini 1.5-series models required at least 32,768 tokens; newer 2.5-series models have lower thresholds but still require more than the ~1,024-token minimums on Anthropic and OpenAI.
Side by side: what shows up on your bill
| | Anthropic | OpenAI | Google Gemini |
|---|---|---|---|
| Activation | Explicit cache_control | Automatic | Separate cache API object |
| Write cost | 1.25× (5-min) or 2× (1-hr) | None | None |
| Read cost | 0.10× input (90% off) | 0.10× input (90% off) | ~0.10–0.25× input (varies by model) |
| Storage cost | None | None | Per token per hour |
| TTL options | 5-min or 1-hr | Provider-managed | Configurable, default 1 hr |
| Minimum tokens | 512–4,096 (model-dependent) | 1,024 | 32K+ (older models); lower on 2.5 series |
| Bill field | cache_creation_input_tokens / cache_read_input_tokens | cached_tokens under usage.prompt_tokens_details | Cache storage charge + read charge |
| Pre-warming | Yes (max_tokens: 1) | No | Yes (create object in advance) |
Which approach fits which workload
Use Anthropic's explicit caching when you want control and transparency. RAG pipelines with a large static document prefix, chatbots with a long system prompt, agentic tools with stable tool definitions: all have a clear stable prefix you can mark once. The write premium is minor; the 90% read discount compounds fast. The cache_creation_input_tokens / cache_read_input_tokens fields give you direct observability without instrumentation. For a real-world example of what this does to a monthly bill, the Claude in production cost breakdown shows a $797/month bill cut to $127 with caching plus batch API.
Use OpenAI's automatic caching when you want simplicity over fine-grained optimization. No code changes, no minimum token configuration to manage, just a lower bill when your prefix repeats. If you're on GPT-5.5 or GPT-4o and you're not trying to squeeze the last 10% out of caching efficiency, automatic is fine.
Use Google's context caching when you have very large context (100K+ tokens) that you'll query heavily over a multi-hour window. The per-hour storage fee becomes trivially small compared to the token savings when cache utilization is high. It's the wrong choice for sparse or unpredictable traffic patterns, because you're paying for the cache's existence whether it earns its keep or not.
Prompt caching is provider-side and applies inside individual API calls. For a complementary layer that eliminates API calls entirely on repeated queries, semantic caching in production covers the three application-level approaches and their failure modes.
The LLMTest proxy surfaces cache hit rate and cost-per-call across all providers in one dashboard, without any per-provider instrumentation code.
FAQ
What is prompt caching?
Prompt caching is a provider feature that stores the processed KV tensors for a stable prefix of your prompt. On subsequent calls with the same prefix, the provider reuses the stored result instead of reprocessing, and charges you a discounted rate for those tokens, typically around 90% off base input price.
Does prompt caching change my response quality?
No. The model produces identical outputs whether the prefix was cached or processed fresh. Caching operates on the internal compute representation, not the model's behavior or generation process.
What happens if my prompt is below the minimum token threshold?
Nothing breaks, but no cache entry is created. With Anthropic, cache_creation_input_tokens returns 0 and you're billed at normal rates. OpenAI won't apply the discount. No error is raised in any case, which is why developers often add cache_control and then wonder why costs haven't dropped.
How do I check if caching is actually working?
With Anthropic, check cache_read_input_tokens in the response usage object: a non-zero value means a cache hit occurred. With OpenAI, check cached_tokens under usage.prompt_tokens_details in the response; it should become non-zero once your prefix stabilizes and the cache warms up, and your effective cost-per-call should drop with it. With Gemini, cache reads and storage charges appear as separate line items in your billing console.
Can I cache tool definitions, not just the system prompt?
Yes, on all three providers. Tool definitions are often the highest-ROI caching target because they're large, completely static between requests, and typically above the minimum token threshold on their own. With Anthropic, place cache_control on the last tool definition block. With OpenAI, caching activates automatically. With Gemini, include the tool config in the cached context object.
Does Google's storage cost make caching uneconomical for low-traffic apps?
It can. If you create a large Gemini cache and query it only a few times before the TTL expires, the storage fee may exceed what you saved on reads. Google's context caching is most cost-effective when utilization is high: many reads from the same stored context over a short, predictable window. For low-volume or unpredictable traffic, Anthropic or OpenAI carry less financial risk.