If your AI feature sometimes forgets what the user said three messages ago, truncates a long PDF, or silently drops half the codebase you pasted into Cursor, the culprit is almost always the context window. It's the single number on a model spec sheet that quietly decides whether your feature works at all.
This is what the context window actually is, what 128k tokens really holds, and why the advertised size is not the size you get.
The short version
A context window is how much text the model can see in a single call. Input and output share the same budget. If the window is 128k tokens and your prompt uses 126k, the model has 2k left to answer with. It is a working-memory cap. When you overflow it, older messages fall out (if your framework handles it) or the call errors (if it doesn't).
Everything the model "knows" about your specific situation (the system prompt, the chat history, the document you attached, the code you pasted, the tool outputs from the last step) gets squeezed into this one budget. Nothing persists between calls unless you put it back in manually.
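In code, that shared budget is plain subtraction. A minimal sketch with illustrative numbers (client is an OpenAI SDK client, set up as in the usage example later in this post):
const contextWindow = 128_000;                 // advertised limit (illustrative)
const promptTokens = 126_000;                  // what the prompt already uses
const headroom = contextWindow - promptTokens; // 2,000 tokens left for the answer

await client.chat.completions.create({
  model,
  messages,
  max_tokens: Math.min(headroom, 1_000), // never request more output than fits
});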
What a token is, in pages
Tokens are chunks of text roughly 3–4 characters long on average for English. One English word is usually 1.3 tokens, though conversion rates shift for code and non-English text. Here's what that means in practice:
| Content | Approximate tokens |
|---|---|
| 1 page of English prose | 400–500 |
| 1 page of code (dense, narrow margins) | 200–400 |
| 1 page of JSON | 300–700 (varies wildly with nesting) |
| 1-hour podcast transcript | 9,000–12,000 |
| A 150-page paperback | 60,000–75,000 |
| The King James Bible (~783k words) | ~1,000,000 |
| A mid-sized TypeScript monorepo | ~500,000 to 2,000,000 |
A 128k context window therefore holds roughly 260 pages of prose, but only a slice of a mid-sized codebase. A 1M window holds a long novel with room to spare; the King James Bible roughly fills it.
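If you just need a back-of-envelope number before reaching for a real tokenizer, the characters-per-token ratio above is enough. A rough sketch (the divide-by-4 heuristic is an approximation for English prose, not an exact count):
// Very rough estimate: English prose averages ~4 characters per token.
// Good enough for "will this fit?", not for billing.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

estimateTokens("A page of prose runs to roughly 2,000 characters."); // ≈ 13 for this short sentence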
Where the 2026 models land
Context size is one of the most-marketed specs, so every major provider advertises aggressively. The shortcut for picking by this dimension is our LLM capabilities matrix, which sorts every major model by context window and lines up the other capabilities next to it. As of April 2026:
- GPT-5.5: 1M total (reviewed April 2026)
- GPT-5: 400k total, 128k output
- Claude 4.7 (Sonnet): 1M total
- Gemini 2.5 Pro: 2M total
- DeepSeek V3: 128k
- Llama 4 Maverick: 10M (on paper)
"On paper" matters. Most long-context benchmarks (including Anthropic's own work on retrieval degradation over long windows) show quality dropping well before the advertised limit. You can browse the full list with pricing and context on the LLMTest models directory. Context size also matters when you're picking fallback models: if a secondary model has a smaller window than your primary, prompts near the limit may silently truncate. See how to build a fallback chain for what to check before assuming a fallback is a drop-in replacement.
The usable window is smaller than the advertised one
Every spec sheet lists the maximum context the model accepts. That is not the size you actually get to use. Subtract:
- Your system prompt. Often 200–2,000 tokens. It grows over time, a problem we covered in the three LLM costs nobody talks about.
- Chat history. If your app keeps the last N turns, that's N times ~300 tokens each.
- Tool definitions and schemas. Function-calling specs, JSON schemas, and MCP tool signatures can easily be 1,000–5,000 tokens.
- The response budget. The model has to fit its answer inside the same window. If you're asking for a 1,000-token answer, reserve 1,000 tokens.
- Quality degradation past a certain fill level. Benchmarks like RULER and Anthropic's needle-in-a-haystack tests show that models are noticeably sharper using the first 30–50% of the window than the last stretch.
On a nominal 128k model, a realistic usable budget for document-heavy tasks is closer to 100k. Push past that and you start seeing the model miss details, contradict itself, or re-read the same section twice.
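Put as arithmetic, with illustrative overheads (substitute your own measurements):
const advertised = 128_000;
const systemPrompt = 1_200;    // grows over time
const history = 10 * 300;      // last 10 turns at roughly 300 tokens each
const toolSchemas = 3_000;     // function-calling specs, MCP signatures
const responseBudget = 1_500;  // room for the answer itself

const usable = advertised - systemPrompt - history - toolSchemas - responseBudget;
// ≈ 119,300 on paper, before the quality drop-off in the back half of the window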
When bigger is worth paying for
Larger context windows don't always cost more per token, but they tempt you to send more tokens, which is the real bill. A few tasks where the bigger window pays for itself:
- Summarizing or extracting from long documents in one shot (contracts, research papers, meeting transcripts) where chunking would lose cross-references.
- Whole-codebase analysis, where "find every call site of this function and tell me which ones need updating" is dramatically better with the tree in context than with retrieval chunks.
- Long conversations where early turns matter: coaching bots, multi-step workflows, anything with "remember what the user said in turn 3".
- Attachments plus reasoning (pasting a 40-page spec AND asking a multi-step question about it).
For everything else, a smaller window with smart context selection almost always wins.
When smaller wins
If you're mostly doing classification, extraction, short chat, or code review on a single file, a 32k model is usually cheaper and sharper than a 200k one. With the bigger window you're paying for capacity you don't use, and it tempts you to stuff in context you shouldn't.
A specific pattern: retrieval-augmented generation. Instead of cramming a 200-page manual into the window, you index it, retrieve the three most relevant sections per query, and feed those into a cheap 32k model. Output quality is usually equal or better (the model isn't drowning in irrelevant text), and in most real workloads cost drops 5–10x.
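A minimal sketch of that pattern, assuming a hypothetical retrieveTopSections helper over whatever index you use and a placeholder model name:
// retrieveTopSections is hypothetical: swap in your vector store or search index.
const userQuestion = "What is the warranty period for water damage?";
const sections: string[] = await retrieveTopSections(userQuestion, { k: 3 });

const answer = await client.chat.completions.create({
  model: "your-small-32k-model", // placeholder: any cheap small-window model
  messages: [
    { role: "system", content: "Answer using only the provided manual sections." },
    { role: "user", content: `${sections.join("\n\n---\n\n")}\n\nQuestion: ${userQuestion}` },
  ],
});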
How to check your own usage
If you're using the OpenAI SDK, every response object has a usage field:
import OpenAI from "openai";
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
const r = await client.chat.completions.create({ model, messages });
console.log(r.usage);
// { prompt_tokens: 3412, completion_tokens: 287, total_tokens: 3699 }
Run this on your real traffic for a week. The 95th percentile of prompt_tokens tells you what your context actually looks like in production, almost always smaller than you'd guess. If your p95 is 8,000 tokens, you're paying for a 128k window and using 6% of it.
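Computing that percentile from whatever you logged takes a few lines. A sketch (the example values stand in for the prompt_tokens you collected):
// prompt_tokens gathered from a week of production calls (example values).
const promptTokenLog = [3412, 5120, 2877, 8050, 4310];

function p95(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95))];
}

console.log(`p95 prompt size: ${p95(promptTokenLog)} tokens`);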
For counting tokens BEFORE you make the call, use tiktoken for OpenAI-compatible tokenizers, Anthropic's client.messages.count_tokens(...) SDK method, or Gemini's countTokens API. All three are fast and free.
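For example, with the tiktoken npm port (a sketch; cl100k_base is the GPT-4-era encoding, so check which encoding your target model actually uses):
import { get_encoding } from "tiktoken";

// cl100k_base matches GPT-4-era models; newer OpenAI models use o200k_base.
const enc = get_encoding("cl100k_base");
const tokenCount = enc.encode("How many tokens is this sentence?").length;
enc.free(); // the WASM encoder needs explicit cleanup
console.log(tokenCount);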
FAQ
How many tokens is a page? About 400 to 500 for a standard page of English prose. Code is denser: 200 to 400 tokens per page depending on formatting. JSON varies wildly with nesting, from 300 to 700.
Is a bigger context window always better? No. Larger windows tempt you to send more tokens, and quality often degrades toward the end of a filled window. For classification, extraction, or short chat, a smaller window is usually cheaper AND sharper.
Do input and output share the same context window? Yes. If the window is 128k and your prompt uses 126k, the model has only 2k left to respond with. Always reserve headroom for the output you expect.
How do I count tokens before sending a request? Use tiktoken for OpenAI models, client.messages.count_tokens(...) for Anthropic, or Gemini's countTokens endpoint. All three give exact counts with no cost.
What happens when I exceed the context window? The API returns an error. Frameworks like LangChain or LlamaIndex will sometimes truncate older messages automatically, which silently drops information the model needed. Always log and alert on truncation.
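A small guard goes a long way here. A sketch against the OpenAI SDK, which reports the overflow as a 400 error with a context_length_exceeded code (other providers name it differently):
try {
  await client.chat.completions.create({ model, messages });
} catch (err: any) {
  if (err?.code === "context_length_exceeded") {
    console.warn("Prompt overflowed the context window");
    // Trim or summarize old turns deliberately here, then retry.
  } else {
    throw err;
  }
}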
When context size is the wrong thing to optimize
Context is a real constraint, but rarely the one that matters most. Latency, output quality, and price-per-token will affect your app more. If you're stuck between two models with similar windows, ignore the spec sheet and test both on your actual traffic. That's what LLMTest does for you automatically: it replays your real inputs through challengers, has a judge model score them, and tells you which one actually performs better. See how our benchmarks work for the methodology.