The three LLM costs nobody talks about (and how to find yours)

By LLMTest Team · Apr 21, 2026 · 4 min read · cost · prompt-engineering · vibe-coders
On this page

  1. Thinking tokens
  2. Retries on broken JSON
  3. Prompt bloat
  4. How to audit your own app

If you're building a side project on top of an LLM, your monthly bill has three line items the dashboard won't show you. They're the reason "I switched to a cheaper model" rarely cuts costs as much as you expect.

This is a quick tour of all three, how to spot them in your own app, and what to do once you do. If you want the headline rates first, check our LLM API pricing comparison.

1. Thinking tokens

Reasoning models — o1, o3, Claude 3.7 Sonnet with thinking, Gemini 2.0 Flash Thinking — bill you for tokens you never see. The model "thinks" for a while, emits a block of hidden reasoning, and then gives you the actual answer. Both halves cost money. The thinking half is often 5–10x longer than the answer.

If you switched from a non-reasoning model to a reasoning one because benchmarks said it was smarter, check your bill. A single "what's the capital of France" style call can cost 3,000 tokens when 20 would have done.
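To see it for yourself, print the usage object on one of your reasoning-model calls. Here's a minimal sketch assuming the OpenAI Python SDK and a model that reports reasoning tokens under completion_tokens_details (field names vary by provider and SDK version):

```python
from openai import OpenAI

client = OpenAI()

# One throwaway call to a reasoning model (o3-mini here is illustrative).
resp = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)

usage = resp.usage
details = usage.completion_tokens_details
reasoning = (details.reasoning_tokens or 0) if details else 0
visible = usage.completion_tokens - reasoning

print(f"visible answer tokens:   {visible}")
print(f"hidden reasoning tokens: {reasoning}")
if usage.completion_tokens:
    print(f"reasoning share: {reasoning / usage.completion_tokens:.0%}")
```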

What to do: for anything that doesn't actually need multi-step reasoning — classification, extraction, short rewrites, simple Q&A — use a non-reasoning model. Reasoning models earn their keep on hard problems, not on the 80% of your traffic that's trivial.
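If you want to make that split explicit in code, a tiny router is enough. This is a sketch, not a recommendation; the model names and task labels are placeholders for whatever your app actually uses:

```python
# Send trivial task types to a cheap non-reasoning model; save the
# reasoning model for calls that genuinely need multi-step work.
CHEAP_MODEL = "gpt-4o-mini"    # placeholder
REASONING_MODEL = "o3-mini"    # placeholder

TRIVIAL_TASKS = {"classification", "extraction", "rewrite", "simple_qa"}

def pick_model(task_type: str) -> str:
    return CHEAP_MODEL if task_type in TRIVIAL_TASKS else REASONING_MODEL
```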

2. Retries on broken JSON

If you use response_format: json_object or just prompt "return JSON" and parse the output yourself, you're probably retrying more than you think. LLMs hallucinate trailing commas, drop closing braces, wrap valid JSON in prose, or invent keys that don't exist in your schema.

Most apps handle this with a retry loop: catch the parse error, re-ask the model, hope it works the second time. Each retry is another full-price call.
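If that sounds familiar, it probably looks something like this (call_llm is a stand-in for your own request function):

```python
import json

def get_json(prompt: str, max_retries: int = 3) -> dict:
    last_error = None
    for attempt in range(1 + max_retries):
        raw = call_llm(prompt)      # placeholder: another full-price call, every time
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err        # broken JSON: pay again and hope
    raise last_error
```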

A rough rule: if you're seeing 5% JSON failures and retrying once, your JSON-mode features cost about 5% more than you think. If you're retrying up to 3 times, it's worse — a small percentage of requests eat 3–4x the tokens.
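If you want the exact number for your own failure rate, the expected-cost math is a few lines (this assumes retries fail independently, which is optimistic):

```python
def expected_calls(failure_rate: float, max_retries: int) -> float:
    # Expected number of full-price calls per request with a simple retry loop.
    calls, p_reach = 0.0, 1.0
    for _ in range(1 + max_retries):
        calls += p_reach
        p_reach *= failure_rate
    return calls

print(expected_calls(0.05, 1))  # ~1.05: about 5% extra spend on average
print(expected_calls(0.05, 3))  # ~1.053: the average barely moves, but
                                # roughly 0.01% of requests now pay for 4 calls
```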

What to do: log every retry. If you're using LLMTest, JSON repair is automatic — the proxy fixes broken JSON before the retry counter even increments. If you're not, at minimum add a schema validator and fail fast on obviously-irrecoverable output instead of letting the model try three more times. Once JSON retries are under control, the next layer is provider-level fallback: routing to a secondary model when the primary returns hard errors or soft failures.
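Here's a rough sketch of that "log it, validate it, fail fast" pattern, assuming Pydantic for the schema. The Extraction model and its fields are made up for illustration:

```python
import json
import logging
from pydantic import BaseModel, ValidationError

log = logging.getLogger("llm.json")

class Extraction(BaseModel):   # illustrative schema, not a real one
    name: str
    amount: float

def parse(raw: str) -> Extraction | None:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        # Malformed JSON is sometimes worth one retry, but count it.
        log.warning("json_retry reason=%s", err)
        return None
    try:
        return Extraction.model_validate(data)
    except ValidationError as err:
        # Wrong keys or types rarely improve on retry: fail fast instead
        # of paying for three more full-price calls.
        log.error("schema_failure reason=%s", err)
        raise
```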

3. Prompt bloat

The prompt you shipped six months ago is probably twice as long as it needs to be.

It happens the same way every time. Something goes wrong. You add a line to the system prompt: "Never respond with markdown." Something else goes wrong. You add another: "Always include a greeting." A week later: "Refuse to discuss competitors." Three months in, your system prompt is a museum of old bugs, and every single call pays for all of them.

Because LLMs are probabilistic, you can't easily tell which lines are actually doing work. Removing one usually doesn't break anything visibly — until it does, six customers later.

What to do: once a quarter, take your longest-lived prompt, strip it down to the minimum you think should work, and A/B it against production on real traffic. If the shorter version wins on quality and costs 30% less, ship it. If it loses, you learned which lines actually matter. This is exactly what autopilot does automatically — it tests shorter variants every week and only applies the ones that pass a quality gate. On the flip side, when a system prompt can't be shortened without hurting quality, caching that static prefix converts those tokens into nearly free reads on every call after the first.
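If you don't have an experiment framework handy, consistent bucketing by user is enough for a first pass. A minimal sketch, with hypothetical prompt file paths and a 10% test slice:

```python
import hashlib

FULL_PROMPT = open("prompts/support_v14.txt").read()          # hypothetical path
STRIPPED_PROMPT = open("prompts/support_minimal.txt").read()  # hypothetical path

def system_prompt_for(user_id: str) -> tuple[str, str]:
    # Hash the user id so each user consistently sees one variant.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < 10:   # 10% of traffic gets the stripped prompt
        return "stripped", STRIPPED_PROMPT
    return "full", FULL_PROMPT
```

Tag every logged response with the variant name so you can compare quality and token counts per variant afterwards.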

How to audit your own app

If you don't want to wait for LLMTest to tell you, here's the manual version:

  1. Pull last month's usage data from your provider dashboard. Note total tokens.
  2. Look up one reasoning model call. If thinking tokens are above 50% of the total for that call, you have a thinking-tokens problem.
  3. Grep your code for JSON.parse (or json.loads) on LLM output. For each hit, count how often it throws. That's your retry rate.
  4. Open your longest system prompt. If it's over 500 tokens, you have a prompt bloat problem (there's a quick way to count after this list). It's very rare that you actually need 500 tokens of instructions.
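For step 4, a quick token count tells you whether you're over the line. A minimal sketch assuming the tiktoken library and an OpenAI-style tokenizer (other providers count slightly differently):

```python
import tiktoken

SYSTEM_PROMPT = open("prompts/system.txt").read()   # hypothetical path

enc = tiktoken.get_encoding("o200k_base")
n_tokens = len(enc.encode(SYSTEM_PROMPT))
print(f"system prompt: {n_tokens} tokens")
if n_tokens > 500:
    print("over 500 tokens: likely prompt bloat")
```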

Each of these is worth 10–30% savings if you act on it. Stack two and you can cut your bill by anywhere from a fifth to a half without changing a single model ID.


If you want this done for you — the proxy, the logging, the automatic prompt rewrites — LLMTest does all three. Free to start. Cancel anytime.

Ship LLM features without burning your budget.

LLMTest proxies your OpenAI / Anthropic calls, tracks cost per feature, and auto-rewrites prompts to be cheaper while holding quality. Free to start.

Create a free account

Related reading

Prompt caching breaks even at 1.3 requests. Here's the math.
Apr 27, 2026 · 5 min read
Best LLM for SQL generation in 2026: GPT-4o-mini wins clean
May 1, 2026 · 7 min read
DeepSeek V4 Pro review: beats GPT-5.5 and costs a fifth of Opus 4.7
Apr 29, 2026 · 6 min read