AI now writes a significant share of the code landing in production repos. The question isn't whether you'll review AI-generated code but whether your reviewer can actually catch what the AI got wrong. We ran six real buggy diffs through four models: Claude Opus 4.7, GPT-4o, Gemini 2.5 Flash, and Claude Haiku 4.5. Opus dominated. But the finding worth talking about is Haiku 4.5 outperforming GPT-4o on five of six prompts, at a lower cost per call than GPT-4o and roughly one-tenth the cost per call of Opus.
How we tested
Each model received the same six code review prompts covering common classes of production bugs: a SQL injection in an Express route, a race condition in a token refresh function, a WebSocket memory leak in a React component, an off-by-one error in a pagination helper, a Stripe webhook missing signature verification, and a logic bug in a tiered discount calculator.
No few-shot examples. No chain-of-thought scaffolding. Each prompt asked the model to identify bugs and security issues, citing the specific line or pattern and explaining why it mattered.
We judged every model pair on each prompt with positions swapped to cancel order bias, using anthropic/claude-sonnet-4 as the judge. The rubric prioritized catching real bugs (security vulnerabilities, logic errors, data loss scenarios) over style nits. A model that led with formatting feedback while missing a SQL injection scored lower than one that caught the injection and left the indentation alone.
Candidates: 4. Prompts: 6. Pairwise matchups: 36 (72 judge calls total, with positions swapped each time).
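The counts follow directly from the setup: four candidates form six unordered pairs, each pair meets on six prompts, and each matchup is judged twice with the slots swapped. A minimal sketch of that schedule, assuming nothing about the actual runner beyond what is described above (`build_judge_calls` is a hypothetical helper name):

```python
from itertools import combinations

MODELS = [
    "anthropic/claude-opus-4-7",
    "anthropic/claude-haiku-4-5",
    "google/gemini-2.5-flash",
    "openai/gpt-4o",
]
NUM_PROMPTS = 6


def build_judge_calls(models, num_prompts):
    """Return one (prompt_index, assistant_a, assistant_b) tuple per judge call."""
    calls = []
    for prompt_index in range(num_prompts):
        for first, second in combinations(models, 2):
            # Each pair is judged twice, with the A/B slots swapped,
            # so a judge that favors one position cannot decide the matchup.
            calls.append((prompt_index, first, second))
            calls.append((prompt_index, second, first))
    return calls


judge_calls = build_judge_calls(MODELS, NUM_PROMPTS)
# A matchup is one unordered pair on one prompt.
matchups = {(p, frozenset((a, b))) for p, a, b in judge_calls}
print(len(matchups), len(judge_calls))  # 36 72
```

Each model appears in 18 of the 36 matchups (three opponents times six prompts), which is where the "of 18" in the results table comes from.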
Results
| Model | Wins (of 18) | Avg cost/call | Avg latency |
|---|---|---|---|
| anthropic/claude-opus-4-7 | 14 | $0.03920 | 23.0 s |
| anthropic/claude-haiku-4-5 | 10 | $0.00416 | 6.2 s |
| google/gemini-2.5-flash | 4 | $0.00424 | 8.7 s |
| openai/gpt-4o | 0 | $0.00537 | 4.7 s |
GPT-4o finished 0-for-18. It lost to Opus 4.7 on all six prompts, to Haiku 4.5 on five of six (one tie), and to Gemini 2.5 Flash on four of six. Claude Haiku 4.5 costs $0.00416 per call; GPT-4o costs $0.00537. The model that charges more per call won zero rounds.
What separated the leaders
SQL injection (Opus 4.7 vs GPT-4o): Both models spotted the textbook SQL injection from string interpolation. Only Opus 4.7 also flagged that the endpoint had no authentication check at all: any caller could fetch any user's record by guessing IDs, and SELECT * would hand them the password hash. The judge's reasoning:
"Assistant A correctly identifies and prioritizes the most critical issues: SQL injection (with specific attack examples), broken access control/IDOR vulnerabilities, and sensitive data exposure from SELECT *. Assistant B covers the SQL injection vulnerability well but completely misses the critical access control issues that A identifies - namely that anyone can request any user's data without authentication, and that sensitive fields like password hashes would be exposed."
WebSocket memory leak (Opus 4.7 vs GPT-4o): Both caught the missing cleanup function. Opus additionally flagged the race condition between old and new sockets: when symbol changes, ws.close() fires, but any message already in flight from the old socket still resolves and can overwrite the new symbol's price in state. The judge:
"Both responses identify the critical WebSocket cleanup bug, but Assistant A provides significantly deeper analysis of the real-world implications. Assistant A correctly identifies and explains the race condition between old and new WebSocket connections - a subtle but important bug that Assistant B completely misses."
Token refresh race condition (Haiku 4.5 vs GPT-4o): The bug in this one is two-layered. The refreshing flag is set inside the if (!refreshing) block, not before it, so two concurrent callers can both pass the check before either sets the flag. But there is also a second problem: while refreshing === true, all subsequent callers return accessToken immediately, which is still null at that point. They don't wait for the in-flight refresh to complete. GPT-4o identified the flag timing issue. Haiku caught both, with the judge noting:
"Assistant B correctly identifies and prioritizes the most critical bug: the race condition that allows multiple concurrent refresh requests, and crucially explains that when refreshing === true, subsequent callers receive the old/null token instead of waiting for the refresh to complete. This is a fundamental logic flaw that would break the entire authentication system."
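One common way to close both gaps is to replace the boolean flag with a single shared in-flight refresh that every caller awaits. The original prompt was JavaScript and is not reproduced here; the following is an analogous Python asyncio sketch, where `TokenManager` and `fetch_token` are hypothetical names and the fetch is a stand-in for a real auth call:

```python
import asyncio


class TokenManager:
    def __init__(self, fetch_token):
        self._fetch_token = fetch_token  # hypothetical coroutine that calls the auth server
        self._access_token = None
        self._refresh_task = None        # the single in-flight refresh, if any

    async def get_token(self):
        if self._access_token is not None:
            return self._access_token
        if self._refresh_task is None:
            # No await sits between the check and the assignment, so on a
            # single-threaded event loop only one caller can start a refresh.
            self._refresh_task = asyncio.create_task(self._refresh())
        # Every caller waits for the same refresh instead of returning None.
        return await self._refresh_task

    async def _refresh(self):
        try:
            self._access_token = await self._fetch_token()
            return self._access_token
        finally:
            # Cleared on success and on failure, so a failed refresh can be retried.
            self._refresh_task = None


async def demo():
    fetch_count = 0

    async def fake_fetch():
        nonlocal fetch_count
        fetch_count += 1
        await asyncio.sleep(0.01)  # simulate network latency
        return "token-1"

    manager = TokenManager(fake_fetch)
    results = await asyncio.gather(*(manager.get_token() for _ in range(5)))
    return fetch_count, results


calls, tokens = asyncio.run(demo())
print(calls, tokens)
```

Five concurrent callers trigger one fetch and all receive the refreshed token. The sketch deliberately omits token expiry, which a real implementation would need.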
The GPT-4o finding
GPT-4o is not a bad model. It generates code that passes tests and handles structured tasks well. But code review requires something different: noticing what is absent. Missing authorization checks. Missing cleanup. Missing error propagation when a flag is set but never reset on failure.
On every prompt where Haiku 4.5 beat GPT-4o, the judge's reasoning followed the same pattern: GPT-4o identified the most visible issue (the one named in the CWE title) and stopped. Haiku kept looking.
Whether that reflects a tuning difference, training data, or just these six prompts is not something one benchmark can answer. What it does say: if you're using GPT-4o for code review on the assumption it's the solid mid-tier default, Haiku 4.5 is worth testing first. It's cheaper and caught more bugs in this run, though at 6.2 seconds versus 4.7 seconds per call it is not faster.
Verdict by use case
For security-critical review (auth, payments, data access): Claude Opus 4.7. It won 14 of its 18 matchups without losing one, and caught IDOR, stale-closure bugs, and missing webhook signature verification that others missed. The 23-second latency and $0.039 per call aren't a problem for a CI gate on high-risk diffs. You're paying for depth, and it delivers.
For general-purpose automated code review: Claude Haiku 4.5. Ten wins, 6.2 seconds per call, $0.0042 per call. It beats GPT-4o on bug detection and matches Gemini Flash on price. At roughly 9x cheaper than Opus per call, the practical pattern is Haiku on every PR with Opus escalation when diffs touch auth or payment logic.
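The Haiku-first, Opus-on-escalation pattern can be sketched as a small routing function keyed on the changed file paths. This is an illustration only: the path patterns are hypothetical placeholders, and `pick_review_model` is not part of any tool mentioned here.

```python
from fnmatch import fnmatch

# Hypothetical patterns; adjust to wherever auth and payment code lives in your repo.
HIGH_RISK_PATTERNS = ["*auth*", "*payment*", "*billing*", "*webhook*"]

DEFAULT_MODEL = "anthropic/claude-haiku-4-5"
ESCALATION_MODEL = "anthropic/claude-opus-4-7"


def pick_review_model(changed_paths):
    """Return the reviewer model for a PR, given its changed file paths."""
    for path in changed_paths:
        # One high-risk file is enough to escalate the whole PR.
        if any(fnmatch(path.lower(), pattern) for pattern in HIGH_RISK_PATTERNS):
            return ESCALATION_MODEL
    return DEFAULT_MODEL


print(pick_review_model(["src/ui/Button.tsx"]))
print(pick_review_model(["src/ui/Button.tsx", "src/payments/stripe_webhook.ts"]))
```

Substring-style patterns are deliberately loose: `*auth*` also matches `author.ts`, which errs toward the more thorough (and more expensive) reviewer rather than away from it.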
If you're on a Google-first stack: Gemini 2.5 Flash is a reasonable third option. It beat GPT-4o 4-0 across six prompts but lost to Haiku 5-0. If you're already routing through Google AI, Flash is a usable code reviewer. It's not the best value available.
GPT-4o for code review: Pass. Use GPT-4o on tasks where it performs: structured JSON extraction, function calling with complex schemas, document summarization. Code review against production bugs is not where it's competitive right now, and the pricing doesn't compensate.
Subscription vs API
| Model | API price (input / output per 1M tokens) | Subscription access |
|---|---|---|
| claude-opus-4-7 | $15 / $75 | Claude Max ($100-200/mo, includes Claude Code) |
| claude-haiku-4-5 | $0.80 / $4 | Claude Pro ($20/mo) or Max ($100-200/mo) |
| gemini-2.5-flash | $0.15 / $0.60 | Google One AI Premium ($19.99/mo) |
| gpt-4o | $2.50 / $10 | ChatGPT Plus ($20/mo) or Pro ($200/mo) |
For individual use (IDE extension, personal PR review), a $20/month subscription covers far more volume than the same dollars in API credits. Claude Pro at $20/month or ChatGPT Plus at $20/month both give you access to strong models within each plan's usage limits, which is typically plenty for one person's review workload.
For product-level automation (a CI pipeline running code review on every PR for a team), the math changes. At 100 reviews per day over a 30-day month, Haiku 4.5 at $0.0042/call costs roughly $12.60/month. Opus 4.7 at $0.039/call runs about $117/month for the same volume. At 3,000 reviews per day, those figures scale to roughly $380 and $3,510. For most teams, Haiku-first with Opus escalation on high-risk files is the practical default. Verify current pricing at Anthropic and Google AI before building a cost model.
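The projection is a straight multiplication of daily volume, per-call cost, and days in the month. A minimal sketch, assuming a 30-day month and the rounded per-call averages from this run (real costs vary with diff size and output length):

```python
def monthly_cost(calls_per_day, cost_per_call, days=30):
    """Projected monthly API spend in dollars."""
    return calls_per_day * cost_per_call * days


haiku = monthly_cost(100, 0.0042)  # rounded average cost per call for Haiku 4.5
opus = monthly_cost(100, 0.039)    # rounded average cost per call for Opus 4.7
print(f"Haiku: ${haiku:.2f}/month, Opus: ${opus:.2f}/month")
# Haiku: $12.60/month, Opus: $117.00/month
```

Swapping in your own daily volume is the only change needed to rerun the estimate for a different team size.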
The break-even point for Opus 4.7 API vs Claude Max subscription: at $0.0392 per call, $200 of API spend buys roughly 5,100 Opus calls per month. Below that volume, the API is cheaper. Above it, Claude Max at $200/month pays for itself.
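The break-even volume is the subscription price divided by the per-call API cost. A short sketch using the average Opus cost per call from the results table (`break_even_calls` is a hypothetical helper, and the comparison assumes flat per-call costs):

```python
import math


def break_even_calls(subscription_price, cost_per_call):
    """Smallest monthly call count at which API spend reaches the subscription price."""
    return math.ceil(subscription_price / cost_per_call)


OPUS_COST_PER_CALL = 0.0392  # average from this benchmark run

print(break_even_calls(200, OPUS_COST_PER_CALL))  # 5103

# API spend on either side of the break-even point:
for calls in (3000, 6000):
    print(calls, round(calls * OPUS_COST_PER_CALL, 2))
```

At 3,000 calls the API bill is well under $200; at 6,000 it is above it.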
How this was tested
Prompts: 6 real production code snippets with deliberate bugs spanning SQL injection, race conditions, memory leaks, authorization gaps, and logic errors. Candidates: anthropic/claude-opus-4-7, anthropic/claude-haiku-4-5, google/gemini-2.5-flash, openai/gpt-4o. Judge: anthropic/claude-sonnet-4 with position-swap methodology (each pair evaluated twice, A vs B and B vs A; winner declared by combined score). Total pairwise matchups: 36 across 6 prompts. Total runner cost: approximately $0.62 (model calls $0.32, judge calls estimated at $0.30). Full methodology at /docs/benchmarks.
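To make the position-swap step concrete, here is one way a combined-score verdict could be computed. The judge's actual output format and score scale are not documented in this article, so the numeric per-assistant scores below are an assumption made purely for illustration:

```python
def combined_verdict(first_pass, second_pass):
    """Combine the two judge calls for one matchup into a verdict.

    first_pass:  (score_a, score_b) with model X shown as Assistant A.
    second_pass: (score_a, score_b) with model Y shown as Assistant A.
    Returns "x", "y", or "tie".
    """
    # Sum each model's score across both orderings (hypothetical scoring scheme).
    x_total = first_pass[0] + second_pass[1]
    y_total = first_pass[1] + second_pass[0]
    if x_total > y_total:
        return "x"
    if y_total > x_total:
        return "y"
    return "tie"


# X scores higher in both orderings: a clear win.
print(combined_verdict((8, 6), (5, 9)))  # x
# The judge simply prefers whoever is in slot A: the bias cancels out.
print(combined_verdict((8, 6), (8, 6)))  # tie
```

The second case shows why the swap matters: a judge with a pure slot preference produces a tie rather than a spurious win.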
For a broader head-to-head of Claude Opus 4.7 against GPT-5.5 on general coding tasks, see our Claude Opus 4.7 vs GPT-5.5 coding benchmark. For the same benchmark format applied to SQL generation across four models, see the SQL generation use-case test. To run your own code review comparisons across providers from one endpoint, LLMTest routes to all four models in this test with per-call cost tracking.