Anthropic released Claude Opus 4.8 on May 28, 2026, less than two months after Opus 4.7. The headline claims: sharper judgment, more honesty about uncertainty, and better long-horizon agentic performance. We ran 12 real-world prompts (5 coding, 4 math/logic, 3 data analysis) through Opus 4.8, Opus 4.7, and GPT-5.5 via the LLMTest runner the same day. The results have one clear story and one honest caveat.
What Anthropic changed
Opus 4.8's official benchmark improvements over 4.7 (all vendor-reported):
| Benchmark | Opus 4.7 | Opus 4.8 | Change |
|---|---|---|---|
| Agentic coding (SWE-Bench Pro) | 64.3% | 69.2% | +4.9pp |
| Multidisciplinary reasoning with tools | 54.7% | 57.9% | +3.2pp |
| Agentic computer use | 82.8% | 83.4% | +0.6pp |
| Knowledge work score | 1,753 | 1,890 | +137 |
Anthropic describes the model as "more likely to flag uncertainties about its work and less likely to make unsupported claims." That framing matters: the improvements are concentrated in agentic, long-running tasks, not single-turn chat. Pricing is unchanged from 4.7: $5 per million input tokens, $25 per million output tokens.
Anthropic also announced "dynamic workflows" alongside 4.8, a research preview that lets Claude deploy hundreds of parallel subagents for complex coding tasks. That feature is separate from the base model and not what we tested here.
How we tested
12 prompts drawn from tasks that solo devs and vibe coders actually run:
- Coding (5): async race condition debug, PostgreSQL window function with ranking, React infinite re-render fix, Python off-by-one sliding window, JavaScript token bucket rate limiter with concurrency safety
- Math/logic (4): birthday paradox with 40 people, SaaS LTV and churn math, knights-and-knaves logic puzzle, binary search complexity with exact CPU fraction
- Data analysis (3): two-proportion A/B test z-test with confidence intervals, pandas groupby ordering bug, cohort retention floor analysis
Each prompt ran through two pairings: Opus 4.8 vs Opus 4.7, then Opus 4.8 vs GPT-5.5. The judge (anthropic/claude-sonnet-4) evaluated each pair twice with positions swapped to control for order bias, then returned a combined verdict. Two prompts in the GPT-5.5 run aborted mid-stream (birthday paradox and cohort retention), leaving 10 valid comparisons for that run.
Total runner cost: $1.54 ($0.83 for run 1, $0.71 for run 2).
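The position-swapped judging described above can be sketched roughly as follows. This is an illustration only: the post does not publish the runner's aggregation logic, so `combineVerdicts`, `tally`, the model ID strings, and the rule that a disagreement between the two orderings counts as a tie are all assumptions.

```javascript
// Hypothetical helper, not the LLMTest runner's actual code.
// Each verdict names the winner by model ID (not by position), or "tie".
function combineVerdicts(verdictAB, verdictBA) {
  // Same verdict in both orderings: it stands.
  if (verdictAB === verdictBA) return verdictAB;
  // Assumption: if the verdict changes when positions swap, position
  // influenced at least one judgment, so the prompt is scored as a tie.
  return "tie";
}

// Count combined outcomes across a list of per-prompt verdict pairs.
function tally(results) {
  const counts = {};
  for (const { verdictAB, verdictBA } of results) {
    const outcome = combineVerdicts(verdictAB, verdictBA);
    counts[outcome] = (counts[outcome] ?? 0) + 1;
  }
  return counts;
}

// Made-up verdicts for illustration, not data from the runs in this post.
const sample = [
  { verdictAB: "opus-4.8", verdictBA: "opus-4.8" },
  { verdictAB: "opus-4.8", verdictBA: "opus-4.7" },
  { verdictAB: "tie", verdictBA: "tie" },
];
console.log(tally(sample)); // one win for opus-4.8, two ties
```

Keying verdicts by model ID rather than by "A" or "B" is what makes the two orderings directly comparable.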
Opus 4.8 vs Opus 4.7: closer than the benchmarks suggest
| Metric | Opus 4.8 | Opus 4.7 |
|---|---|---|
| Wins (of 12 prompts) | 3 | 5 |
| Ties | 4 | n/a |
| Avg latency | 13.8s | 13.6s |
| Input price | $5/M | $5/M |
| Output price | $25/M | $25/M |
Verdict: Opus 4.7 edges 4.8 on these direct-response tasks, 5 wins to 3. The margin is small (4 of 12 prompts ended in ties), and the pattern across the decided prompts is consistent: both models are correct, and wins come down to thoroughness and edge-case coverage.
On the binary search complexity problem, Opus 4.8 won on presentation and verification:
"Both responses arrive at the correct mathematical answers: 30 comparisons for 1 billion elements, 1.5% CPU usage, 40 comparisons for 1 trillion elements, and 2.0% CPU usage. However, Assistant A provides significantly better presentation and pedagogical value with LaTeX equations, includes verification checks (like confirming 2²⁹ < 10⁹ < 2³⁰), and provides a more detailed step-by-step breakdown of the calculations."
On the knights-and-knaves logic puzzle, 4.8 won again on structure:
"Both responses arrive at the correct answer (C is the knight) using sound logical reasoning. They both systematically test each possibility and correctly identify the contradictions that eliminate A and B as candidates. Assistant A provides a more comprehensive presentation with clear section headers, detailed step-by-step verification, and a helpful summary table that makes the final solution easy to verify."
On the token bucket rate limiter, Opus 4.7 took the win by adding a defensive guard clause that 4.8 omitted: checking whether the requested cost exceeds total bucket capacity before processing. Opus 4.8's implementation was correct, just less complete.
The honest takeaway: Anthropic's agentic benchmark gains for 4.8 may well be real, but they apply to orchestrated, multi-step workflows, which we did not test here. On single-turn coding and math tasks, 4.8 and 4.7 are at parity, with a slight edge to 4.7.
Opus 4.8 vs GPT-5.5: a clear win
| Metric | Opus 4.8 | GPT-5.5 |
|---|---|---|
| Wins (of 10 valid) | 8 | 0 |
| Ties | 2 | n/a |
| Errors | 0 | 2 |
| Avg latency | 13.0s | 19.1s |
| Input price | $5/M | $5/M |
| Output price | $25/M | $30/M |
This wasn't close. Opus 4.8 won 8 of 10 valid comparisons, 2 were ties, and GPT-5.5 won none. GPT-5.5 also aborted two responses mid-stream; those prompts are excluded from the 10 valid comparisons, but for a user they amount to losses.
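One rough way to gauge how much weight a 5-3 split and an 8-0 split can carry is an exact two-sided sign test on the decided prompts. This is a check we are adding for illustration, not part of the runner's method; it assumes prompts are independent, drops ties, and treats each decided prompt as a fair coin flip under the null. The function names are ours.

```javascript
// Binomial coefficient C(n, k), built up iteratively; exact for small n.
function choose(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Two-sided exact sign test: the probability of a split at least this
// lopsided if each decided prompt were a 50/50 coin flip. Ties are excluded
// by the caller (pass wins only).
function signTestPValue(winsA, winsB) {
  const n = winsA + winsB;
  if (n === 0) return 1;
  const extreme = Math.max(winsA, winsB);
  let tail = 0;
  for (let k = extreme; k <= n; k++) {
    tail += choose(n, k) / 2 ** n;
  }
  // Double the one-sided tail, capped at 1 for even splits.
  return Math.min(1, 2 * tail);
}

console.log(signTestPValue(5, 3).toFixed(3)); // 0.727 (Opus 4.7 vs 4.8)
console.log(signTestPValue(8, 0).toFixed(4)); // 0.0078 (Opus 4.8 vs GPT-5.5)
```

Under these assumptions, the 5-3 result is well within what chance alone would produce, while an 8-0 sweep is not, which fits the reading that the first comparison is close to a wash and the second is a real gap.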
The sharpest distinction came from the token bucket rate limiter. Opus 4.8 correctly implemented concurrency safety using a promise chain lock. GPT-5.5 left out any concurrency protection and added an incorrect comment claiming safety it hadn't achieved. The judge's reasoning:
"Assistant A correctly identifies the concurrency problem in JavaScript's async environment and implements a proper solution using a promise chain lock. While JavaScript is single-threaded, async operations can interleave at await points, creating race conditions in check-then-act sequences. A's solution serializes access to the critical section, ensuring thread safety. Assistant B claims to be 'safe for concurrent async callers' but provides no actual concurrency protection mechanism. The comment 'No await inside this critical section' misunderstands the problem: the issue isn't awaits within the function, but multiple concurrent calls to the function itself."
Opus 4.8's rate limiter (excerpt):
class TokenBucket {
  constructor({ capacity = 10, refillPerSec = 2 } = {}) {
    this.capacity = capacity;
    this.refillPerMs = refillPerSec / 1000;
    this.tokens = capacity;
    this.lastRefill = Date.now();
    // Promise chain serializes access, preventing race conditions
    // since JS is single-threaded but async ops can interleave.
    this._lock = Promise.resolve();
  }
  // ...
}
GPT-5.5's rate limiter (excerpt):
class TokenBucketRateLimiter {
  constructor({ capacity = 10, refillPerSecond = 2 } = {}) {
    this.capacity = capacity;
    this.refillPerMs = refillPerSecond / 1000;
    this.tokens = capacity;
    this.lastRefill = Date.now();
    // no lock or serialization
  }
  async acquire() {
    // No await inside this critical section: safe for concurrent async callers
    // (incorrect: this comment doesn't make the code safe)
  }
}
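To make the promise chain lock concrete, here is a minimal sketch of the pattern. It is not either model's actual output: `LockedTokenBucket`, `_withLock`, `tryAcquire`, `tryAcquireUnsafe`, and the `beforeCommit` hook are hypothetical names, and the hook exists only to place an `await` between the token check and the deduction. In single-threaded JavaScript, a check-then-act sequence with no `await` in between cannot be interleaved; the race appears once one does.

```javascript
class LockedTokenBucket {
  constructor({ capacity = 10, refillPerSec = 2 } = {}) {
    this.capacity = capacity;
    this.refillPerMs = refillPerSec / 1000;
    this.tokens = capacity;
    this.lastRefill = Date.now();
    this._lock = Promise.resolve(); // tail of the queue of pending calls
  }

  _refill() {
    const now = Date.now();
    const earned = (now - this.lastRefill) * this.refillPerMs;
    this.tokens = Math.min(this.capacity, this.tokens + earned);
    this.lastRefill = now;
  }

  // Run fn only after every previously queued call has settled.
  _withLock(fn) {
    const run = this._lock.then(fn);
    this._lock = run.catch(() => {}); // a rejection must not block the queue
    return run;
  }

  // Serialized: resolves true if `cost` tokens were taken, false otherwise.
  tryAcquire(cost = 1, beforeCommit = async () => {}) {
    return this._withLock(async () => {
      this._refill();
      if (this.tokens < cost) return false;
      await beforeCommit(); // other callers stay queued during this await
      this.tokens -= cost;
      return true;
    });
  }

  // Same logic without the lock: callers can interleave at the await.
  async tryAcquireUnsafe(cost = 1, beforeCommit = async () => {}) {
    this._refill();
    if (this.tokens < cost) return false;
    await beforeCommit(); // a second caller can pass the check here
    this.tokens -= cost;
    return true;
  }
}

const tick = () => new Promise((resolve) => setTimeout(resolve, 5));

async function demo() {
  // One token, no refill, two concurrent callers.
  const safe = new LockedTokenBucket({ capacity: 1, refillPerSec: 0 });
  const locked = await Promise.all([
    safe.tryAcquire(1, tick),
    safe.tryAcquire(1, tick),
  ]);

  const unsafe = new LockedTokenBucket({ capacity: 1, refillPerSec: 0 });
  const unlocked = await Promise.all([
    unsafe.tryAcquireUnsafe(1, tick),
    unsafe.tryAcquireUnsafe(1, tick),
  ]);
  return { locked, unlocked, unsafeTokens: unsafe.tokens };
}

demo().then(({ locked, unlocked, unsafeTokens }) => {
  console.log(locked); // [ true, false ]
  console.log(unlocked, unsafeTokens); // [ true, true ] -1
});
```

With the lock, the second caller sees the bucket already drained. Without it, both callers pass the check before either deducts, and the bucket goes negative.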
On the async cache stale-data debug prompt, GPT-5.5 cut off its response entirely, leaving the user with an unfinished analysis. Opus 4.8 caught the primary bug (no cache expiration), explained why it only shows up in production, identified a secondary truthiness check issue, and provided two concrete fix patterns with working code.
The latency gap matters for interactive use: Opus 4.8 averaged 13.0 seconds vs GPT-5.5's 19.1 seconds, a 32% reduction (put the other way, GPT-5.5 took about 47% longer). On this prompt set, Opus 4.8 is both faster and higher quality at a roughly 17% lower output price ($25/M vs $30/M).
Subscription vs API
Most vibe coders pay for a subscription, not API tokens. Here's what that looks like across both providers, with the break-even math:
| Plan | Monthly sub | What's included | API input | API output |
|---|---|---|---|---|
| Claude Pro | $20 | Opus 4.8 access, 5x free quota | $5/M | $25/M |
| Claude Max | $100-200 | Higher limits, Claude Code, priority | $5/M | $25/M |
| ChatGPT Plus | $20 | GPT-5.5 access | $5/M | $30/M |
| ChatGPT Pro | $200 | Unlimited GPT-5.5, operator tools | $30/M | $180/M |
At identical API input costs, the output gap ($25/M vs $30/M) adds real money at scale. A task generating 1,000 output tokens costs $0.025 with Opus 4.8 vs $0.030 with GPT-5.5 (20% more). At 500 requests/day with 500-token outputs, that is 7.5M output tokens over a 30-day month, or about $37.50/month extra for GPT-5.5, nearly double the price of the Plus subscription itself.
Break-even for Claude Pro ($20/month): At $5/M input and $25/M output, assume an average request uses 400 input tokens and 600 output tokens. That's ($0.002 + $0.015) = $0.017 per call. $20/month ÷ $0.017 ≈ 1,176 calls, or about 39 calls/day over a 30-day month. Above that volume, Claude Pro saves money versus raw API. Below it, API-direct (with prompt caching active) is cheaper. Our post on when prompt caching breaks even covers the math for cached system prompts specifically.
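The break-even arithmetic above is easy to rerun with your own token counts. The sketch below is a small calculator under the same assumptions as the text (flat per-token pricing, no caching discounts, a 30-day month); the function and parameter names are ours.

```javascript
const TOKENS_PER_MILLION = 1_000_000;

// API cost in dollars for one call, given token counts and $/M prices.
function costPerCall({ inputTokens, outputTokens, inputPricePerM, outputPricePerM }) {
  return (
    (inputTokens * inputPricePerM + outputTokens * outputPricePerM) /
    TOKENS_PER_MILLION
  );
}

// Whole number of API calls a monthly subscription fee would pay for.
function breakEvenCalls(monthlyFee, perCallCost) {
  return Math.floor(monthlyFee / perCallCost);
}

// The example from the text: 400 input and 600 output tokens at $5/M and $25/M.
const perCall = costPerCall({
  inputTokens: 400,
  outputTokens: 600,
  inputPricePerM: 5,
  outputPricePerM: 25,
});
const calls = breakEvenCalls(20, perCall);

console.log(perCall.toFixed(3)); // 0.017
console.log(calls); // 1176
console.log((calls / 30).toFixed(1)); // 39.2 calls/day over a 30-day month
```

Longer outputs pull the break-even point down quickly, since output tokens cost five times as much as input tokens at these rates.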
Claude Max at $100 makes sense if you're doing extended Claude Code sessions; that's where the agentic improvements in 4.8 are most likely to surface. For the full cost picture of running Claude in production, including how real teams cut their bills, see Claude in production 2026: from $797 to $127.
Verify current rates at Anthropic's pricing page and OpenAI's API pricing page; both have changed in 2026.
Verdict
Against GPT-5.5: Opus 4.8 wins clearly. 8-0 with 2 ties on coding, math, and data tasks, about 32% lower average latency, and about 17% cheaper per output token. The quality gap on technical tasks is real and consistent across prompt types. If you are using GPT-5.5 for coding or analysis work, the data points toward switching.
Against Opus 4.7: The upgrade is narrower than Anthropic's benchmarks imply for direct-response work. Our runner gave 4.7 a 5-3 win, with 4 ties, on single-turn tasks. The gains in 4.8 are reported for agentic workflows: orchestrated coding tasks, long agent sessions, multi-step tool use. If you run Claude Code or coordinate multi-step pipelines, Anthropic's numbers point to 4.8, though we did not test those workflows here. For single-turn questions in chat or API, 4.7 and 4.8 are effectively a wash.
The pattern tracks with our earlier Opus 4.7 vs GPT-5.5 coding head-to-head, which found Claude winning 10-2 on 15 coding prompts. Opus 4.8 continues that edge, now at 8-0 on a broader task set.
To run this comparison on your own prompts, point your existing OpenAI SDK calls at the LLMTest proxy; A/B routing between models takes one line of config.
How this was tested
Judge model: anthropic/claude-sonnet-4. Each prompt was evaluated twice with model positions swapped to control for order bias; the combined verdict goes to the model that led in both orderings.
Prompt set: 12 prompts across coding (5), math/logic (4), and data analysis (3). All prompts were real tasks, not synthetic benchmarks; the full set is described in the methodology section above.
Runs: Run 1: Opus 4.8 vs Opus 4.7 (12 prompts, 0 errors, 12 valid comparisons). Run 2: Opus 4.8 vs GPT-5.5 (12 prompts, 2 aborted, 10 valid comparisons).
Total cost: $1.54 ($0.83 for run 1, $0.71 for run 2).
Vendor benchmarks: Anthropic reports agentic coding at 69.2% and knowledge work at 1,890 on their internal evaluations. These are single-model scores on vendor-designed tasks, which is a different measurement than a head-to-head on the kinds of prompts this audience actually uses. See our benchmarks methodology for how the LLMTest runner runs comparisons.
Sign up for LLMTest to run head-to-head comparisons on your own prompts.