Fastest LLMs under $1/M tokens in 2026: speed and cost ranked

Three years ago, $1 per million input tokens was the budget floor. Today it's the ceiling of a tier that includes some genuinely capable models. The gap between "cheap" and "slow and mediocre" has narrowed considerably. What remains is a real choice with real trade-offs: which model under $1/M is actually fastest, and for what kind of workload?

This ranking covers five models priced at or below $1/M input tokens. Pricing is verified as of June 2026. Speed figures come from BenchLM.ai's independent speed tracking and our own runs through the LLMTest proxy.

The field

Model	Input $/M	Output $/M	Context	Speed
Gemini 2.5 Flash-Lite	$0.10	$0.40	1M	Fast
DeepSeek V4	$0.14	$0.28	128K	Mid
GPT-4o-mini	$0.15	$0.60	128K	Fast
Mistral Small 3.1	$0.20	$0.60	128K	Fast
Gemini 2.5 Flash	$0.30	$2.50	1M	Fastest
Claude Haiku 4.5	$1.00	$5.00	200K	Fast

Pricing sources: devtk.ai's June 2026 comparison, OpenRouter model cards, and Anthropic's published pricing.

The table covers six models, not five. Gemini 2.5 Flash-Lite and Mistral Small 3.1 are real options worth knowing about, but they don't have standout characteristics that earn dedicated sections. Flash-Lite at $0.10/M is the cheapest entry in the entire tier; it trades quality for cost on tasks requiring reasoning. Mistral Small 3.1 at $0.20/M is the cleanest EU-hosted option for teams with data-residency requirements. Both are solid choices for the right workload.

The four models below are where the interesting trade-offs live.

Gemini 2.5 Flash: the throughput leader

Gemini 2.5 Flash runs at 204.5 tokens per second on standard prompts, placing it at the top of the non-reasoning model category in BenchLM.ai's June 2026 leaderboard. Time-to-first-token is around 150ms. Nothing else in the under-$1/M tier comes close on raw speed.

The input cost is $0.30/M, which is higher than DeepSeek V4 and GPT-4o-mini but still deep in the budget range relative to frontier models. The issue is output: at $2.50/M output tokens, Flash is priced for short completions. Run it on a pipeline that generates 500 tokens per response at 10,000 requests per day and the output bill alone hits $12.50/day. Run the same volume through DeepSeek V4 and the output cost is $1.40/day.

Flash earns its place on workloads where output is brief: classification, routing decisions, entity extraction, sentiment tagging, short Q&A, structured field extraction from a longer input. For those patterns, the 1-million-token context window and 150ms TTFT are a meaningful advantage. You can pass enormous inputs and get a fast response back.

If you want the same context window at even lower input cost, Flash-Lite at $0.10/M handles the same document-scale inputs. The quality trade-off on reasoning tasks is real, but for extraction on structured documents, many teams find it sufficient.

DeepSeek V4: cheapest output in the tier

At $0.14/M input and $0.28/M output, DeepSeek V4 has the lowest output price of any production-grade model in this tier. That matters more than input cost for long-form workloads: summarization, translation, report drafting, code generation. When output tokens outnumber input tokens by a factor of two or three, the output price dominates the bill.

The quality holds up. Our DeepSeek V4 review found it competitive with frontier-class models on reasoning and coding tasks. The architecture is Mixture of Experts: DeepSeek activates only a fraction of its parameters per token, which keeps compute costs low without the usual quality penalty. For a deeper look at why that trade-off works, the MoE explainer covers the mechanism.

Time-to-first-token runs around 300ms, higher than Flash and with more variance. That's acceptable for async workloads and batch jobs; it shows up in real-time chat or interactive features where users see tokens stream in. If you're building a pipeline that runs nightly or processes queued jobs, the latency is irrelevant and the cost advantage is decisive.

One caveat to name directly: DeepSeek's servers are China-hosted. For personal projects and internal tooling, that's likely fine. For client data, PII, or anything under data-residency requirements, route through a provider that runs DeepSeek's weights on US or EU infrastructure.

GPT-4o-mini: the reliable generalist

GPT-4o-mini is priced at $0.15/M input, $0.60/M output. In our SQL generation benchmark, it beat Claude Sonnet 4.5 (priced 20x higher), GPT-4o, and Gemini 2.5 Flash on six real PostgreSQL tasks. Average end-to-end latency in that run was 1,722ms, which is mid-range for the tier on complex tasks.

What GPT-4o-mini offers that the cheaper options don't is consistency. The uptime record is strong. Structured output comes back well-formed more reliably than smaller alternatives. OpenAI's rate limits are generous at scale. For a solo developer who doesn't want to manage routing logic or think about provider reliability, GPT-4o-mini is the default that rarely embarrasses you.

The $0.60/M output price is on the higher end of the budget tier (DeepSeek V4 output is half that), but the gap closes when you factor in retry rates. A model that returns valid JSON on the first call costs less per successful response than one that requires a retry 10% of the time.

Claude Haiku 4.5: quality ceiling at $1/M

Haiku 4.5 is at the threshold: $1.00/M input, $5.00/M output. The output price is the highest in this group by a significant margin, which prices it out of long-form generation workloads at scale. But the quality ceiling is real.

Our code review benchmark tested Haiku 4.5 against GPT-4o on six real buggy diffs: Haiku won five of six, at roughly one-tenth the cost per call compared to Opus. For CI pipelines, agentic loops, or any task where getting the structure right on the first call matters, Haiku's output quality translates to fewer retries. On coding tasks specifically, fewer retries can make the effective cost competitive with cheaper models that require two or three attempts to produce valid code.

The 200K context window is useful for large file reviews or long document tasks without chunking. Time-to-first-token consistently clocks under 600ms, making it responsive enough for interactive features where quality matters more than raw throughput.

Where each model breaks

Gemini 2.5 Flash breaks on long-form generation volume. The $2.50/M output price is the highest ceiling in the tier; scale it up and the cost curve turns steep fast.

DeepSeek V4 breaks on latency-sensitive real-time applications. The 300ms TTFT and higher jitter make it a poor fit for streaming chat features where users are watching tokens arrive.

GPT-4o-mini breaks on deep reasoning and multi-step planning. It handles most tasks well; it struggles when the task requires sustained logical chains or domain-specific precision beyond its training.

Claude Haiku 4.5 breaks on output volume. At $5/M output tokens, any pipeline generating substantial output per call adds up quickly. If you're running 5M output tokens per day, Haiku costs $25/day versus $1.40/day for DeepSeek V4.

Verdict

For high-volume short-completion workloads (classification, extraction, routing): Gemini 2.5 Flash at $0.30/M input and 204 tok/s. If you need even lower cost and can tolerate the quality trade-off, Flash-Lite at $0.10/M.

For long-form generation on a tight budget: DeepSeek V4 at $0.28/M output, assuming async workload and no data-residency constraints.

For a predictable, well-rounded default across mixed workloads: GPT-4o-mini at $0.15/M input.

For coding tasks, agentic pipelines, or CI use cases where first-call quality reduces retry overhead: Claude Haiku 4.5, with its output cost weighed against your retry rate.

If you're building a routing layer that sends cheap tasks to the budget tier and escalates harder ones to frontier models, the cheap-first escalation pattern covers the production implementation in full. All models in this post are accessible from a single endpoint via LLMTest, with per-call cost and latency tracking so you can measure which tier is actually delivering value. The benchmarks methodology explains how our comparisons are structured if you want to run your own before committing.