Anthropic released Claude Fable 5 on June 9, 2026, the first publicly available model from the Mythos class, a tier Anthropic had previously kept off the market. We benchmarked it within a day of release through the LLMTest runner against Opus 4.8 (same-provider predecessor) and GPT-5.5 (cross-provider frontier). Here are the numbers.
What's actually new in Fable 5
Fable 5 is a Mythos-class model with safety guardrails bolted on. The guardrails classify queries before they reach the full model: requests that trip a cybersecurity or biology threshold get rerouted to Opus 4.8 instead. Anthropic says this triggers in under 5% of sessions.
The vendor benchmarks are aggressive:
| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 |
|---|---|---|---|
| SWE-Bench Pro | 80.3% | 69.2% | 58.6% |
| SWE-Bench Verified | 95.0% | n/a | n/a |
| CursorBench (max effort) | 72.9% | n/a | n/a |
| FrontierCode Diamond subset | 29.3% | 13.4% | 5.7% |
These are vendor-reported scores on vendor-designed coding benchmarks. We ran our own prompts.
Run 1: Fable 5 vs Opus 4.8
12 prompts across async debugging, system design, TypeScript generics, performance profiling, SQL, and refactoring. Judge: anthropic/claude-sonnet-4 with position-swap: each pair is scored forward and reverse, and a combined verdict is declared only when both orderings agree. All calls through the LLMTest proxy.
| Metric | Fable 5 | Opus 4.8 |
|---|---|---|
| Wins | 5 | 3 |
| Ties | 4 | n/a |
| Avg latency | 20.1s | 19.3s |
| Cost | $1.51 | n/a |
Verdict: Fable 5, slight edge. The margin is narrower than the SWE-Bench Pro gap (80.3% vs 69.2%) suggests. Single-turn quality differences are smaller than agentic workflow differences. Fable 5 dominated on system design depth and TypeScript type-level code; Opus 4.8 won the Pydantic v2 validator prompt and the multi-turn architecture task.
One prompt produced a striking result: the security code review. Fable 5 returned nothing. The safeguard classifier fired and the response was empty. Opus 4.8 delivered a complete vulnerability analysis. That 5% trigger rate is a real planning constraint if you're building security tooling.
Example 1: Callback hell refactor (Fable 5 wins)
The prompt asked for promisification of a three-level nested callback function. The judge:
"Both responses correctly identify the core approach: promisify the database query function and use async/await with Promise.all to maintain parallelism. However, there's a critical difference in handling the database query results. Assistant A uses array destructuring
const [user] = await query(...)which correctly assumes the database returns an array of rows (standard behavior for most database drivers), while Assistant B treats the result as a direct objectconst user = await query(...). Assistant A also provides better error handling with a custom NotFoundError and more comprehensive explanations of the improvements made... The key difference is that Assistant A's approach is more likely to work correctly with real database drivers, while Assistant B's code would likely fail in practice due to incorrect assumption about query result format."
Example 2: Distributed rate limiter design (Fable 5 wins)
The prompt: design a rate limiter handling 10k req/s across 50 servers with global per-user enforcement. Fable 5 opened with sizing math, walked through algorithm trade-offs (fixed window, sliding window log, token bucket) with specific rejection reasons for each, then delivered a complete implementation including a Lua script for atomic Redis operations. Opus 4.8 produced solid analysis but cut off mid-sentence before delivering the implementation. The judge:
"Assistant A provides a more complete and technically sound solution. It correctly sizes the system, chooses token bucket with solid reasoning, and provides the complete implementation including the critical Lua script for atomic operations. The Redis Cluster setup with hash-tagged keys for proper sharding is technically correct... Assistant B starts well with good algorithm comparison and chooses sliding window counter with reasonable justification. However, it has a critical flaw - the response is incomplete, cutting off mid-sentence with 'If a user is 2'. It doesn't provide the actual implementation details like the Lua script or complete operational logic, which are essential for a production system."
Run 2: Fable 5 vs GPT-5.5
GPT-5.5 timed out on 6 of 12 prompts: system design, TypeScript generic types, transformer architecture, security code review, Pydantic v2, and multi-turn token budget architecture all hit the 45-second runner limit.
Valid comparisons (6 prompts):
| Prompt | Winner |
|---|---|
| React stale closure bug | Tie |
| SQL nth-highest salary | Fable 5 |
| Callback hell refactor | Fable 5 |
| Find duplicate O(n) O(1) | Tie |
| Python async race condition | Fable 5 |
| Python performance profiling | Tie |
Fable 5: 3 wins, 0 losses, 3 ties. Avg latency on valid runs: Fable 5 17.1s, GPT-5.5 19.8s. Cost: $0.64.
The timeouts are not a runner artifact. The 45-second limit is roughly the outer edge of what an interactive API call can tolerate in production. GPT-5.5 was generating detailed responses, long enough that they couldn't finish within that window. On the prompts GPT-5.5 did complete, Fable 5 never lost: three wins and three ties.
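A hard per-call deadline of this kind can be sketched in a few lines. The runner itself aborts calls with an AbortController at 45 seconds; the version below is a Python analogue using `asyncio.wait_for`, and the function names and the stand-in coroutines are hypothetical.

```python
import asyncio


async def call_with_timeout(make_call, timeout_s=45.0):
    """Await make_call() and report ("ok", result) or ("timeout", None)."""
    try:
        result = await asyncio.wait_for(make_call(), timeout=timeout_s)
        return ("ok", result)
    except asyncio.TimeoutError:
        # The pending call is cancelled; the prompt is recorded as a timeout
        # instead of being scored as a loss.
        return ("timeout", None)


async def demo():
    async def fast_model():
        await asyncio.sleep(0.01)  # stands in for a quick completion
        return "done"

    async def slow_model():
        await asyncio.sleep(1.0)   # stands in for a response that runs long
        return "never returned"

    return [
        await call_with_timeout(fast_model, timeout_s=0.5),
        await call_with_timeout(slow_model, timeout_s=0.05),
    ]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Recording a timeout as its own outcome, not as a loss, is what keeps the six valid comparisons separate from the six that never finished.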
Opus 4.8 went 8-0 against GPT-5.5 in an earlier run on coding and math prompts. Fable 5 extends that pattern on the valid comparisons we got.
Subscription vs API
Both models are available through Anthropic's subscription plans and API. Fable 5 is free on Pro/Max plans through June 22, 2026; usage-credit billing (at API rates) kicks in June 23.
| Plan | Monthly cost | What you get |
|---|---|---|
| Claude Pro | $20 | Fable 5, Opus 4.8, Claude Code |
| Claude Max | $100-200 | Higher quotas, Fable 5, extended Claude Code sessions |
| Claude API | Per token | $10/M input, $50/M output, $1/M cached input (90% discount) |
| ChatGPT Plus | $20 | GPT-5.5 access |
| ChatGPT Pro | $200 | GPT-5.5 priority, Codex tier |
| OpenAI API | Per token | $5/M input, $30/M output for GPT-5.5 |
Break-even for Fable 5 API vs subscription:
At a typical interaction (500 input tokens + 800 output tokens), one Fable 5 call costs roughly $0.045. Claude Pro at $20/month covers about 444 calls, or about 15 per day. Above that, usage credits apply or you upgrade to Max. Claude Max at $100/month covers roughly 2,200 calls/month, the right tier for heavy Claude Code sessions or production pipelines.
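The arithmetic behind those figures is easy to rerun with your own token counts. This is a small sketch using the $10/M input and $50/M output rates from the table; the function names are made up for illustration, and it ignores caching and any plan-specific quota rules.

```python
import math

INPUT_USD_PER_M = 10.0   # Fable 5 input rate from the pricing table
OUTPUT_USD_PER_M = 50.0  # Fable 5 output rate from the pricing table


def cost_per_call(input_tokens, output_tokens,
                  input_rate=INPUT_USD_PER_M, output_rate=OUTPUT_USD_PER_M):
    """Dollar cost of one call at per-million-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def calls_per_budget(budget_usd, per_call_usd):
    """Whole number of calls a monthly budget pays for at API rates."""
    return math.floor(budget_usd / per_call_usd)


if __name__ == "__main__":
    per_call = cost_per_call(500, 800)
    print(per_call)                          # 0.045
    print(calls_per_budget(20, per_call))    # 444
    print(calls_per_budget(100, per_call))   # 2222
```

Output tokens dominate: at these rates the 800 output tokens account for $0.04 of the $0.045, so response length matters far more than prompt length for this estimate.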
For caching-heavy workloads, the $1/M cached input price (a 90% discount from the $10/M standard rate) changes this math significantly. See Claude in production cost optimization for worked examples of how that plays out at scale.
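To see how much caching moves the input bill, here is a rough sketch. It assumes the 90% discount applies to the $10/M input rate from the table and that a fixed fraction of input tokens is served from cache; it ignores any cache-write surcharge, and the function name and parameters are hypothetical.

```python
def input_cost_usd(input_tokens, cached_fraction,
                   base_rate_per_m=10.0, cache_discount=0.90):
    """Input-side cost when `cached_fraction` of tokens hit the cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached_tokens = input_tokens * cached_fraction
    fresh_tokens = input_tokens - cached_tokens
    cached_rate = base_rate_per_m * (1.0 - cache_discount)  # assumed: 90% off
    return (fresh_tokens * base_rate_per_m
            + cached_tokens * cached_rate) / 1_000_000


if __name__ == "__main__":
    # One million input tokens, uncached vs. 80% served from cache.
    print(round(input_cost_usd(1_000_000, 0.0), 2))  # 10.0
    print(round(input_cost_usd(1_000_000, 0.8), 2))  # 2.8
```

Caching only touches the input side, so the savings are largest for workloads with long shared prompts and short responses.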
Verify current rates at Anthropic's pricing page before building a cost model; the post-June-22 usage credit rules for Fable 5 are still being finalized.
How this was tested
Judge: anthropic/claude-sonnet-4. Each prompt evaluated twice with model positions swapped (forward and reverse). Combined verdict declared when both evaluations agree; disagreements produce no result for that direction.
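The agreement rule can be sketched as follows. This is an illustration of the rule as described, not LLMTest's actual code: the verdict labels ("A", "B", "tie"), the model placeholders X and Y, and the function names are assumptions, and how the runner reports a disagreement in its final tally is not specified here.

```python
from collections import Counter


def combine_verdicts(forward, reverse):
    """Combine two position-swapped judge verdicts for one prompt.

    forward: verdict when model X is shown as Assistant A.
    reverse: verdict when model X is shown as Assistant B.
    Each verdict is "A", "B", or "tie".
    Returns "X", "Y", "tie", or None when the two passes disagree.
    """
    # Map positional labels back to the underlying models.
    forward_winner = {"A": "X", "B": "Y", "tie": "tie"}[forward]
    reverse_winner = {"A": "Y", "B": "X", "tie": "tie"}[reverse]
    if forward_winner == reverse_winner:
        return forward_winner
    return None  # the judge's pick changed when positions were swapped


def tally(verdict_pairs):
    """Count combined outcomes over (forward, reverse) verdict pairs."""
    counts = Counter()
    for forward, reverse in verdict_pairs:
        outcome = combine_verdicts(forward, reverse)
        counts[outcome if outcome is not None else "no_result"] += 1
    return counts


if __name__ == "__main__":
    print(tally([("A", "B"), ("A", "A"), ("tie", "tie")]))
```

A judge that picks "A" in both orderings is following position, not content, which is why that case yields no result instead of a win.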
Prompt set: 12 prompts across callback refactoring, SQL, algorithm, async debugging, system design, TypeScript generics, transformer explanation, security code review, Pydantic v2, performance optimization, and architecture tasks. All were real tasks drawn from production code, not synthetic patterns.
Run 1 (Fable 5 vs Opus 4.8): 12 prompts, 12 valid comparisons, 0 errors. Fable 5 wins 5, Opus 4.8 wins 3, 4 ties. Total cost: $1.51.
Run 2 (Fable 5 vs GPT-5.5): 12 prompts, 6 valid comparisons, 6 GPT-5.5 timeouts (AbortController at 45s). Fable 5 wins 3, 0 losses, 3 ties. Total cost: $0.64.
Total runner cost: $2.15. Vendor benchmark figures from Anthropic's release post. All runner comparisons are LLMTest-generated on June 10, 2026.
Verdict
Fable 5 is the strongest coding model we have benchmarks for right now, and the vendor SWE-Bench numbers point the same direction as what we saw on real prompts. The 5-3 win over Opus 4.8 is real but narrower than the benchmark gap implies: single-turn quality correlates less directly with agentic coding gains than Anthropic's charts suggest.
The safeguard firing on the security code review is worth knowing before you build. For security tooling, penetration testing scaffolding, or anything touching exploit analysis, expect the classifier to fire more often than the sub-5% session average Anthropic cites, depending on your prompt distribution.
GPT-5.5 at $5/$30 per million tokens costs half as much as Fable 5 on input and three-fifths as much on output. On prompts it finished, it never won. If your workload skews toward short, direct responses where GPT-5.5 doesn't time out, the cost-quality trade-off is worth testing, but our data doesn't show a quality advantage, only a cost one.
The earlier Claude Opus 4.7 vs GPT-5.5 coding head-to-head found Claude winning 10-2 on 15 real coding prompts. Fable 5 extends that lead, at double the output price.
Run the same comparison on your own prompts at LLMTest.