Claude in production 2026: real bill from $797 to $127

You build a documentation assistant. Users ask questions; your app passes a 15,000-token system prompt (the full docs plus a few-shot persona) plus the question to Claude Sonnet 4.6 and returns the answer. At 500 API calls a day, the bill lands at $797/month. That number feels wrong, so you open the Anthropic console and it is right.

Here is how to get it to $127/month with two configuration changes and no architecture rewrite.

The baseline

Scenario: 500 API calls/day on Sonnet 4.6 ($3/1M input, $15/1M output), each call sending:

15,000 tokens: static system prompt (docs, persona, examples)
200 tokens: the user's question
500 tokens output

Token type	Daily volume	Rate	Daily cost
Input (all)	7,600,000	$3/1M	$22.80
Output	250,000	$15/1M	$3.75
Total			$26.55/day

Monthly: $796.50. That 15,000-token system prompt is billed on every single request. It never changes, yet you pay full price for it 500 times a day.

Optimization 1: 5-minute prompt caching

Claude's caching works by tagging the static portion of your prompt with cache_control. Anthropic stores it for five minutes; any request arriving within that window pays $0.30/1M for those cached tokens instead of $3/1M. That is a 90% reduction on every token that repeats.

Writing the cache costs 1.25x the standard input rate ($3.75/1M for Sonnet 4.6). As the detailed break-even math shows, that write cost recovers after just 1.3 cache reads. Two requests sharing the same cache entry and you are already ahead.

At 500 calls/day spread across business hours, you average about one request every three minutes. That sits inside the five-minute TTL for most of the day, but overnight gaps and burst patterns burn more writes. A conservative estimate: 80% cache hit rate.

Token type	Daily volume	Rate	Daily cost
Cache writes (20%)	1,500,000	$3.75/1M	$5.63
Cache reads (80%)	6,000,000	$0.30/1M	$1.80
Non-cached input	100,000	$3/1M	$0.30
Output	250,000	$15/1M	$3.75
Total			$11.48/day

Monthly: $344.40, a 57% reduction. But you can do better.

Optimization 2: switch to the 1-hour cache

Anthropic shortened the default cache TTL from 60 minutes to 5 minutes in early 2026. That change quietly raised effective costs for any app with longer idle periods between requests, including overnight gaps.

The 1-hour TTL is still available as a paid option: cache writes cost 2x the standard input rate ($6/1M for Sonnet 4.6). More expensive per write, but you write far less often. If your traffic is steady during business hours, the 1-hour cache is worth the premium.

At 95% cache hit rate (achievable with consistent daytime usage patterns):

Token type	Daily volume	Rate	Daily cost
Cache writes (5%)	375,000	$6/1M	$2.25
Cache reads (95%)	7,125,000	$0.30/1M	$2.14
Non-cached input	100,000	$3/1M	$0.30
Output	250,000	$15/1M	$3.75
Total			$8.44/day

Monthly: $253.20, 68% less than the baseline. One cache write every 20 calls instead of every five.

Optimization 3: the batch API

If same-hour delivery is acceptable (internal tools, scheduled pipelines, overnight report generation), Anthropic's Message Batches API cuts all token costs by 50%. Most batches complete in under an hour. The 50% discount applies to cache writes, cache reads, and output tokens alike.

Stacked on the 1-hour cache at 95% hit rate:

Token type	Daily volume	Rate	Daily cost
Cache writes (5%)	375,000	$3/1M	$1.13
Cache reads (95%)	7,125,000	$0.15/1M	$1.07
Non-cached input	100,000	$1.50/1M	$0.15
Output	250,000	$7.50/1M	$1.88
Total			$4.23/day

Monthly: $126.90, an 84% reduction from $796.50.

Same 500 calls. Same prompts. Same output quality. The only tradeoff: responses queue and return within the hour rather than in real time.

The full picture

Approach	Monthly cost	vs. baseline
No optimization	$796.50	baseline
5-min cache, 80% hit rate	$344.40	-57%
1-hour cache, 95% hit rate	$253.20	-68%
1-hour cache + batch	$126.90	-84%

The root cause of the $797 baseline is the same hidden cost that drives most inflated LLM bills: prompt bloat. A 15,000-token static prefix repeated on every request is the problem. Caching is the exact countermeasure.

Subscription vs API

Claude's subscription plans cover claude.ai and Claude Code usage — not raw API calls to your application. If you are building a product, you are on the pay-per-token API regardless of which subscription plan you hold.

Plan	Price	Who it suits
Claude Free	$0/mo	Occasional personal use, message-limited
Claude Pro	$20/mo	~45 Sonnet messages per 5-hour window; personal writing and coding
Claude Max 5x	$100/mo	~225 messages/5-hour window; daily heavy Claude Code sessions
Claude Max 20x	$200/mo	~900 messages/5-hour window; all-day agentic coding workflows
API direct	Pay-per-token	Building products, batch workloads, custom integrations

Break-even for solo developers using Claude Code. At Max 5x ($100/month), you get roughly 225 Sonnet-equivalent messages per five-hour window. Running those through the API directly at Sonnet 4.6 rates costs about $0.04 to $0.08 per typical coding exchange (a few thousand input tokens, a few hundred output). At $0.06 average, 225 messages costs roughly $13 per five-hour window. Max 5x becomes cost-effective once you are consistently hitting your Pro limit and your personal coding workflow would otherwise cost over $100/month via the API.

For product builders shipping to real users: no subscription covers production API calls. You need the API, and optimizing it is the only lever you have.

See Anthropic's pricing page for current plan details and any changes to included usage limits.

One note on Opus 4.7

If you are on Opus 4.7 for higher output quality, its new tokenizer produces roughly 35% more tokens from the same input text compared to earlier models. The per-token rate is $5/$25 (input/output) vs $3/$15 for Sonnet 4.6, so the effective cost gap is larger than the rate difference alone suggests. The documentation assistant scenario above would start at roughly $4,200/month on Opus 4.7 with no optimization. Caching matters even more at that baseline.

Start with Sonnet 4.6 and test whether Opus-level output quality is actually needed for your use case. The LLMTest proxy gives you real cost-per-call data across both models so the decision is based on your actual prompts, not estimates.

If you're still working out what to charge for the feature you're optimizing, pricing an AI feature from the margin back shows how to set a per-request token budget that keeps you profitable before you commit to a model.