Every provider charges by the token. If you estimate your bill by word count or character count, you will be wrong, usually in the expensive direction. The conversion rates are simple but vary in ways that matter: by content type, by tokenizer, and by language. Getting them right takes five minutes and prevents surprises on the first invoice.
What a token actually is
LLMs don't read text the way humans do. They read sequences of integers. Every model ships with a tokenizer (a vocabulary of text fragments, typically 32,000 to 200,000 entries) that maps substrings to integers before the model ever sees your input.
The vocabulary is built using Byte Pair Encoding (BPE): an algorithm that starts with individual characters, then repeatedly merges the most frequent adjacent pairs into single entries. Common English words ("the", "and", "function") land as single tokens. Rarer or longer words get split into fragments. "Tokenization" might map to ["Token", "ization"] or ["token", "ization"], depending on which model you are using.
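You can inspect the splits yourself with OpenAI's tiktoken library. Other providers' tokenizers will split the same word differently, so treat this as an illustration of the mechanism rather than a universal answer:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
ids = enc.encode("Tokenization")
# Decode each integer id back to its text fragment to see the split
print([enc.decode([i]) for i in ids])
```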
Two things follow directly from this design:
Vocabulary size changes the ratio. GPT-4o uses a roughly 200,000-entry vocabulary (o200k_base); GPT-4 used the 100,000-entry cl100k_base, and Llama 2 used 32,000 entries. A larger vocabulary can represent more information per token, so GPT-4o needs fewer tokens than Llama 2 to encode the same sentence. When you see a "$1.50 vs $2.00 per million tokens" comparison between models, check whether those token counts came from the same tokenizer. Often they did not.
The tokenizer is model-specific. Claude Sonnet, GPT-4o, and Gemini 2.5 Flash each ship their own tokenizer. The same 800-word email produces a different token count from each. For plain English the differences are typically within 10-15%. For JSON, code, or non-English text they can diverge significantly.
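The vocabulary-size point is easy to see locally by encoding the same text with OpenAI's older 100,000-entry vocabulary and its newer 200,000-entry one. This compares two OpenAI encodings rather than two providers, but it illustrates the same effect:

```python
import tiktoken

text = "The same sentence, encoded twice with two different vocabularies."
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))
# The larger o200k_base vocabulary usually needs the same number of tokens
# or fewer for identical English text.
```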
The conversion rates by content type
English prose:
- 1 token = 0.75 words (or: 1 word = 1.33 tokens)
- 1 token = 4 characters, including spaces and punctuation
- A 500-word email lands at roughly 650-700 tokens in most current models
This is the ratio you will see cited most often, and it holds well for clean, conversational English or standard technical writing.
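Checking that ratio against your own writing takes one tokenizer call. Swap in a representative sample of your prompts and compare the printed ratio to the 1.33 rule of thumb:

```python
import tiktoken

prose = (
    "Thanks for the update on the launch timeline. We reviewed the draft "
    "yesterday and the copy looks good, though the pricing section still "
    "needs a final pass before Friday. Can you send the revised version by "
    "Thursday morning so legal has a full day to review it?"
)
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = len(enc.encode(prose))
words = len(prose.split())
print(f"{words} words -> {tokens} tokens ({tokens / words:.2f} tokens per word)")
```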
Code: Python source code tokenizes more efficiently than prose because Python keywords ("def", "class", "return", "import") appear often enough to become single tokens in most vocabularies. A typical Python file runs around 3.5 characters per token. The caveat is comments and docstrings, which tokenize like prose.
JSON is the expensive case. Structural characters (quotes, colons, commas, braces) do not compress well. A JSON payload you would describe as "50 words of content" can easily consume 200-plus tokens because every key, every separator, and every nested level counts. If your system prompt includes a JSON schema, count the tokens, not the words.
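A quick way to feel the difference is to count a small JSON payload next to a prose sentence carrying similar information. The exact numbers depend on the model, but the JSON side lands well above the 4-characters-per-token prose rate:

```python
import json
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

prose = "Order 1042 for Ada Lovelace ships to London on March 3rd."
payload = json.dumps({
    "order_id": 1042,
    "customer": {"first_name": "Ada", "last_name": "Lovelace"},
    "shipping": {"city": "London", "date": "2025-03-03"},
})

print("prose:", len(enc.encode(prose)), "tokens for", len(prose), "characters")
print("json: ", len(enc.encode(payload)), "tokens for", len(payload), "characters")
```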
Non-English text: Languages sharing the Latin character set (French, Spanish, German) tokenize within 10-20% of English rates. German compound nouns sometimes tokenize expensively because they are long words that the vocabulary splits into parts.
For non-Latin scripts, the picture changes. Chinese and Japanese characters are often 1-2 tokens per character, but since a single character in those languages carries the meaning of a full English word, the tokens-per-concept ratio is comparable to English once you think about meaning rather than raw characters. Arabic script with vowel marks, or Hindi in Devanagari, typically runs 2-4 characters per token, which puts those languages in a worse position than English: expect to pay 30-50% more tokens for equivalent meaning than you would for an English translation.
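To see the script effect in your own numbers, encode rough translations of the same sentence. The translations below are approximate and the counts vary by model; the characters-per-token spread is the point:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
samples = {
    "English": "The weather is very nice today.",
    "Chinese": "今天天气很好。",
    "Hindi": "आज मौसम बहुत अच्छा है।",
}
for lang, text in samples.items():
    n = len(enc.encode(text))
    print(f"{lang}: {len(text)} chars -> {n} tokens ({len(text) / n:.1f} chars/token)")
```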
The patterns that break your estimate
The prose ratio is a starting point, not a guarantee. Three patterns consistently cause underestimates in production:
Structured data in system prompts. If your system prompt includes a schema, enumerated field values, or formatted examples, those sections tokenize at 2-3x the rate of prose. A system prompt that reads as "300 words" might actually be 700 tokens.
Repeated formatting in outputs. When you request structured output (a table, a bulleted list, a JSON array), the formatting tokens accumulate. A 20-row table with 5 columns generates a lot of pipe characters and newlines, all billed.
Identifiers and paths. UUIDs, file paths, API endpoints, and long variable names do not compress. /api/v2/users/8a3f9c2d-1234-5678-abcd-ef0123456789/profile is 58 characters but tokenizes into far more fragments than the character-per-token average suggests.
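That path is easy to check yourself. The exact count depends on the tokenizer, but it comes out far above the 14 or 15 tokens the 4-characters-per-token rule would predict:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
path = "/api/v2/users/8a3f9c2d-1234-5678-abcd-ef0123456789/profile"
ids = enc.encode(path)
print(len(path), "characters ->", len(ids), "tokens")
```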
This is directly related to why prompt bloat costs so much; our post The three LLM costs nobody talks about covers it with concrete auditing steps you can run this week.
A practical counting recipe
Before shipping, measure actual token usage on representative input samples. Every provider lets you do this without paying for model completions:
OpenAI / GPT models: use tiktoken in Python. tiktoken.encoding_for_model("gpt-4o").encode(text) gives you the exact token list. len() gives the count. Runs locally, costs nothing.
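As a copy-pasteable sketch:

```python
import tiktoken  # pip install tiktoken

def count_openai_tokens(text: str, model: str = "gpt-4o") -> int:
    """Exact token count for an OpenAI model, computed locally."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_openai_tokens("How many tokens am I?"))
```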
Claude (Anthropic): the API returns usage.input_tokens and usage.output_tokens in every response. To count before generating, call client.messages.count_tokens(model=..., messages=[...]): it returns a token count without invoking the model. Also free.
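With the Anthropic Python SDK, a pre-flight count looks roughly like this; the model id is illustrative, so substitute whichever Claude model you are actually calling:

```python
import anthropic  # pip install anthropic; requires ANTHROPIC_API_KEY

client = anthropic.Anthropic()
count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",  # illustrative model id
    messages=[{"role": "user", "content": "How many tokens am I?"}],
)
print(count.input_tokens)  # count only; no completion is generated or billed
```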
Gemini: call client.models.count_tokens(model=..., contents=[...]). Same idea: count only, no generation cost.
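With the google-genai Python SDK, the equivalent sketch:

```python
from google import genai  # pip install google-genai; requires GEMINI_API_KEY

client = genai.Client()
response = client.models.count_tokens(
    model="gemini-2.5-flash",
    contents="How many tokens am I?",
)
print(response.total_tokens)  # count only; nothing is generated
```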
For cross-model estimates, if you have measured 10,000 tokens for a prompt on GPT-4o and want a rough sense of what Claude would show, assume plus or minus 15% for English prose and up to plus or minus 25% for code or structured data until you can measure directly.
The LLMTest API returns usage data in a unified format across models. If you are routing through our proxy, you get comparable token counts from every backend in the same response field, without adapting your code per provider.
FAQ
How many tokens is 1,000 words? On standard English prose, 1,000 words is approximately 1,300 to 1,400 tokens. The ratio is 1.33 tokens per word. For code or structured data that number rises; always measure against your actual content type.
Do different LLMs count tokens differently? Yes. Each model ships its own token vocabulary and encoding rules. For plain English the difference is usually under 15%. For JSON, code, or non-English text the gap can reach 25-30%. Always measure against the model you are actually shipping with.
Why does JSON cost so many tokens? JSON's structural characters (quotes, braces, commas, colons) are common in JSON but rare in natural text, so the encoding vocabulary never merged them into efficient multi-character groups. Every field boundary is billed individually. When possible, prefer compact formats such as newline-delimited text or CSV for high-volume outputs.
Is there a simple rule for estimating token costs? For English prose: divide the word count by 0.75 to get a token estimate, multiply by the model's price per million tokens, and divide by one million. For code or structured output, double the prose estimate and measure from there. For non-English text, add 30-50% on top of the prose estimate depending on the script.
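As a sketch in code, with the price left as a parameter you fill in from your model's rate card:

```python
def estimate_cost_usd(word_count: int, price_per_million_tokens: float,
                      content: str = "prose") -> float:
    """Back-of-the-envelope cost estimate from a word count."""
    tokens = word_count / 0.75           # ~1.33 tokens per English word
    if content == "code":
        tokens *= 2                      # structured content tokenizes worse
    elif content == "non_english":
        tokens *= 1.4                    # +30-50% for non-Latin scripts
    return tokens / 1_000_000 * price_per_million_tokens

# 10,000 words of prose at a hypothetical $2.00 per million input tokens
print(f"${estimate_cost_usd(10_000, 2.00):.4f}")
```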
What is the difference between input tokens and output tokens on my bill? Output tokens are typically 3-5x more expensive than input tokens because the model generates them one at a time rather than processing them in parallel. A response of 500 tokens often costs more than a prompt of 2,000 tokens. For cost control, output length matters more than prompt length in most workloads.
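A quick illustration with made-up prices; substitute your model's real input and output rates:

```python
# Hypothetical rates: $2 per million input tokens, $8 per million output tokens
input_price, output_price = 2.00, 8.00

prompt_cost = 2_000 / 1_000_000 * input_price   # 2,000-token prompt
response_cost = 500 / 1_000_000 * output_price  # 500-token response

print(f"prompt: ${prompt_cost:.4f}, response: ${response_cost:.4f}")
# At a 4x output premium the short response already matches the long prompt;
# at 5x it costs more.
```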
Does Anthropic's prompt caching change my token count? No. Cached tokens still appear in usage.cache_read_input_tokens and count the same as uncached input tokens for context-usage purposes. The difference is price: cached input tokens are billed at around 10% of the normal input rate. The count stays the same; only the cost changes. The context window guide covers how caching interacts with your per-call budget.
Measure first, then optimize
Token count is the one number you should know before your first real invoice arrives. Run tiktoken, count_tokens, or countTokens on a representative sample of your prompts. Note the ratio to your word estimate. If it is more than 30% above what you expected, something in your prompt structure is tokenizing expensively and worth auditing.
Try LLMTest free to log input and output tokens per model across all your calls in one place, without instrumenting each provider separately.