DeepSeek released V4 on April 24, 2026, in two variants: V4 Pro (1.6 trillion parameters, 49 billion active per token) and V4 Flash (284 billion total, 13 billion active). Both use a Mixture of Experts architecture, both are open-source under MIT, and both carry a 1 million-token context window. The pricing is the headline: V4 Pro costs $1.74 per million input tokens and $3.48 per million output tokens. V4 Flash goes further at $0.14 input and $0.28 output.
We ran five developer tasks through V4 Pro, GPT-5.5, Claude Opus 4.7, and Llama 4 Maverick using the LLMTest benchmark runner. Two findings stand out: V4 Pro beats GPT-5.5 on quality while costing 4.5 times less, and it is slow, averaging 28 seconds per response. Opus 4.7 remains the quality leader. Llama 4 Maverick won nothing.
What we tested
Five prompts, chosen to reflect what our readers actually build:
- SQL: Top 10 customers by Q2 2026 spend, with NULL handling and fraud exclusions
- MoE explanation: Explain Mixture of Experts to a solo web developer, 150-200 words
- React debug: Find and fix the infinite re-render in a useEffect hook depending on state it also sets
- Pricing copy: Write three distinct headline-plus-subheadline pairs for a developer SaaS pricing page
- Cost math: Calculate cost per 1,000 API requests from specific token counts and published pricing
Each model answered every prompt. Then anthropic/claude-sonnet-4 judged every model pair twice, with positions swapped, to eliminate order bias. Six pairs per prompt, each judged twice, times five prompts gives 60 judging calls total.
Results
| Model | Wins (of 15) | Cost per 5 prompts | Avg latency |
|---|---|---|---|
| Claude Opus 4.7 | 13 (87%) | $0.068 | 5.8 s |
| DeepSeek V4 Pro | 6 (40%) | $0.013 | 28.1 s |
| GPT-5.5 | 5 (33%) | $0.059 | 8.6 s |
| Llama 4 Maverick | 0 (0%) | $0.001 | 8.5 s |
Direct matchups:
- V4 Pro vs GPT-5.5: V4 Pro wins 2, ties 2, loses 0
- V4 Pro vs Opus 4.7: V4 Pro wins 0, ties 0, loses 4
- V4 Pro vs Llama 4 Maverick: V4 Pro wins 4, ties 0, loses 0
Where V4 Pro wins
Against GPT-5.5, V4 Pro's edge is concreteness. On the MoE explanation prompt, where V4 Pro won, the judge's reasoning:
"Assistant A provides specific numbers (37B active vs 671B total parameters) which adds concrete context... Assistant A excels here with specific parameter counts and a very concrete downside example about 'factual recall in niche domains' with clear reasoning about why this occurs (uneven data distribution across experts). Assistant B's downside about 'consistency across long tasks' and 'switching expertise awkwardly' is vaguer and less actionable for a developer."
The same pattern held on the pricing copy prompt: V4 Pro's headlines included specific dollar amounts and implementation details where GPT-5.5 stayed vague. Head to head, V4 Pro finished with two wins, two ties, and no losses against GPT-5.5.
At $3.48 per million output tokens versus GPT-5.5's $30, the quality-adjusted cost advantage is substantial. If you need quality above Llama 4 Maverick but don't need Opus 4.7's precision, V4 Pro beats GPT-5.5 at a fraction of the price.
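If the parameter counts in that judging excerpt are unfamiliar, here is a toy sketch of top-k expert routing that shows why a MoE model's active parameters are a small slice of its total. It is purely illustrative: in a real model the gate scores come from a learned router, and none of this is DeepSeek's actual implementation.

```typescript
// Toy sketch of top-k Mixture of Experts routing -- not DeepSeek's real code.
// Only the k highest-scoring experts run per token, which is why "active"
// parameters (49B for V4 Pro) are a small slice of the total (1.6T).
type Expert = (token: number[]) => number[];

function routeTopK(token: number[], experts: Expert[], gateScores: number[], k: number): number[] {
  // In a real model, gateScores come from a learned router network.
  const chosen = gateScores
    .map((score, i) => ({ score, i }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);

  // Only the chosen experts do any work; blend their outputs by gate score.
  const out = new Array<number>(token.length).fill(0);
  for (const { score, i } of chosen) {
    experts[i](token).forEach((v, d) => { out[d] += score * v; });
  }
  return out;
}
```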
Where Opus 4.7 holds the edge
Opus 4.7 beat V4 Pro 4-0. The clearest example is the React debugging prompt. The judge's reasoning where Opus won:
"Assistant B is slightly more technically precise by explicitly mentioning Object.is for React's dependency comparison and shows a cleaner code fix. Assistant A's explanation is clear but less technically detailed. Both solutions are practically identical in effectiveness, but Assistant B's response demonstrates deeper technical understanding of React's internals while maintaining the same clarity and conciseness."
Opus 4.7 consistently packed more signal per token. V4 Pro averaged 906 output tokens per prompt while Opus averaged 518, yet Opus won more judgments. Longer answers hurt V4 Pro here: when the judge scored for quality of explanation rather than length, the denser Opus responses came out ahead.
Opus also showed better epistemic honesty. On the MoE prompt, Opus acknowledged upfront that DeepSeek V4's architecture wasn't in its training data and clarified it was describing V3's design. The judge credited this accuracy. V4 Pro, by contrast, stated V4 specifics as settled fact. For tasks where hallucinated confidence matters, Opus is the safer choice.
The quality premium has a price. Opus 4.7 costs $5 per million input tokens and $25 per million output tokens, against V4 Pro's $1.74 and $3.48. In our five-prompt benchmark, Opus cost $0.068 to V4 Pro's $0.013, a 5.2x difference. Our Claude Opus 4.7 review covers the tokenizer cost increase that amplifies this gap further.
The latency problem
The most consequential finding is not in the win column. V4 Pro averaged 28.1 seconds per response. Opus 4.7 averaged 5.8 seconds. GPT-5.5 averaged 8.6 seconds.
Part of this is V4 Pro's longer outputs: generating an average of 906 tokens takes more wall time than Opus 4.7's 518. But even accounting for tokens per second, V4 Pro is slower than the alternatives. For batch processing, nightly summarization jobs, or document pipelines that run offline, 28 seconds is irrelevant. For anything user-facing, it caps what you can build. A 28-second wait for a coding suggestion is not a latency budget any user interface survives.
The hidden costs in your LLM bill post covers why latency compounds into abandonment rates and retry overhead in ways that per-token pricing hides. V4 Pro's cost advantage shrinks when you factor in the retry and timeout infrastructure a 28-second tail latency forces you to build.
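To make that concrete, here's the kind of guard a 28-second tail pushes you to write for user-facing calls. The endpoint path, response shape, model slugs, and timeout values are all assumptions for illustration (an OpenAI-compatible route on the llmtest.io proxy, auth omitted), not code from our runner.

```typescript
// Minimal sketch: abort slow calls and fall back to a faster, pricier model.
// Assumptions: OpenAI-compatible /chat/completions on the llmtest.io proxy,
// made-up model slugs and timeouts, auth headers omitted.
async function completeWithBudget(prompt: string): Promise<string> {
  const attempts = [
    { model: "deepseek/v4-pro", timeoutMs: 10_000 },           // cheap first try
    { model: "anthropic/claude-opus-4.7", timeoutMs: 8_000 },  // faster fallback
  ];

  for (const { model, timeoutMs } of attempts) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetch("https://llmtest.io/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
        signal: controller.signal,
      });
      const data = await res.json();
      return data.choices[0].message.content; // OpenAI-style response shape (assumed)
    } catch {
      // Timed out or errored: fall through to the next attempt.
    } finally {
      clearTimeout(timer);
    }
  }
  throw new Error("All attempts blew the latency budget");
}
```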
V4 Flash: nearly as good, 12x cheaper output
In a separate head-to-head using the same prompt set, V4 Flash matched V4 Pro on all three prompts that completed within the runner's 45-second call limit. Three ties, zero wins for either side. The other two prompts timed out due to prompt complexity, which limits the conclusion, but the signal on the completed prompts is that V4 Flash does not meaningfully fall behind V4 Pro on straightforward developer tasks.
At $0.28 per million output tokens versus V4 Pro's $3.48, V4 Flash is 12.4 times cheaper on the output side. If V4 Flash's response latency proves meaningfully better than V4 Pro's in extended testing, it becomes the obvious default for most workloads.
The cost math
At published API rates, for a workload sending 1,000 requests per day at 200 input tokens and 300 output tokens each:
| Model | Daily cost | Monthly cost |
|---|---|---|
| GPT-5.5 | $10.00 | ~$300 |
| Opus 4.7 | $8.50 | ~$255 |
| V4 Pro | $1.39 | ~$42 |
| V4 Flash | $0.11 | ~$3.40 |
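The table is a straightforward per-token multiplication. Here's the arithmetic as a runnable sketch, using the prices quoted in this post; the function itself is just illustrative.

```typescript
// Daily USD cost for `requests` calls, each with fixed input/output token counts.
// Prices are USD per million tokens, taken from the rates quoted in this post.
function dailyCost(requests: number, inTokens: number, outTokens: number,
                   inPrice: number, outPrice: number): number {
  return (requests * inTokens / 1e6) * inPrice + (requests * outTokens / 1e6) * outPrice;
}

// 1,000 requests/day at 200 input + 300 output tokens each:
console.log(dailyCost(1000, 200, 300, 5.00, 25.00)); // Opus 4.7  -> ~$8.50
console.log(dailyCost(1000, 200, 300, 1.74, 3.48));  // V4 Pro    -> ~$1.39
console.log(dailyCost(1000, 200, 300, 0.14, 0.28));  // V4 Flash  -> ~$0.11
```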
The GPT-5.5 benchmarks we promised in our GPT-5.5 review are now in: V4 Pro beats GPT-5.5 on quality at 7x lower daily cost. Whether GPT-5.5's shorter, faster responses offset the quality gap depends on your latency requirements and task type.
V4 Flash at $3.40 per month for 30,000 requests is an extraordinary value if the quality holds on your specific task. Test before routing production traffic.
How this was tested
Benchmark tool: LLMTest runner via server/blog-runner.js. All calls routed through https://llmtest.io/v1.
Judge model: anthropic/claude-sonnet-4, with fallbacks to openai/gpt-4o and google/gemini-2.5-flash-preview if it fails.
Position swap: Every pair judged twice with positions reversed. The final verdict requires both judging passes to agree on direction. Ties occur when the forward and reverse judging passes disagree, or both call it even (a sketch of this rule follows these notes).
Prompts: 5 tasks covering SQL generation, technical explanation, debugging, marketing copy, and cost calculation. Chosen to represent realistic tasks for solo developers, not benchmark-optimized problems.
Total runner cost for this post: $0.51.
Full methodology at /docs/benchmarks.
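For concreteness, here's one way to read the position-swap agreement rule above. This is our reconstruction of the rule as described, not the runner's actual code.

```typescript
type Verdict = "A" | "B" | "even";

// Forward pass sees (modelA, modelB); the reverse pass swaps them, so its "A" is modelB.
// A win is recorded only when both passes point at the same model; anything else is a tie.
function finalVerdict(forward: Verdict, reverse: Verdict): "modelA" | "modelB" | "tie" {
  const f = forward === "A" ? "modelA" : forward === "B" ? "modelB" : "tie";
  const r = reverse === "A" ? "modelB" : reverse === "B" ? "modelA" : "tie";
  return f === r && f !== "tie" ? f : "tie";
}
```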
Verdict
Use V4 Pro if your workload is batch or async, you want quality above GPT-5.5 at a fraction of the price, or you need open-source weights for compliance. Test V4 Flash first: our limited data suggests it matches V4 Pro quality at 12x lower output cost, and a smaller active parameter count often means better latency too.
Use Claude Opus 4.7 if you need maximum precision on developer tasks, sub-10-second latency, or accurate handling of knowledge boundaries. The 5x cost premium is real; so is the quality gap.
Avoid Llama 4 Maverick for structured developer tasks. Zero wins in fifteen matchups is a clear result.
Run your own prompt set against V4 Pro before making a production routing decision. The LLMTest benchmark runner lets you replay your actual workload across any model we proxy, so the comparison is on your tasks, not ours. Sign up to run your own benchmarks.