Claude Opus 4.7: genuine coding gains, hidden cost sting

By LLMTest Team · Apr 21, 2026 · 5 min read · Tags: model-release, claude, cost, benchmarks
On this page

  1. What improved in coding and agents
  2. The vision jump
  3. The tokenizer situation
  4. Community reaction
  5. How it compares on spec
  6. Our take

Anthropic released Claude Opus 4.7 on April 16. The coding numbers are real: this is a meaningful upgrade for agentic workflows and multi-file code tasks. But the same release shipped a new tokenizer that can consume up to 35% more tokens on identical inputs. The sticker price didn't change; your bill might.

Here's what actually improved, where skepticism is warranted, and what to check before switching.

What improved in coding and agents

Anthropic publishes numbers from their own evals, so treat them as a best case rather than a guarantee of how the model performs on your workload. That said, the gains are specific enough to take seriously:

  • CursorBench: 70%, up from 58% for Opus 4.6 (a 12-point improvement on real-world coding tasks)
  • SWE-Bench-style multi-file tasks: roughly 6 to 8 points better than 4.6 on complex, cross-file changes
  • Anthropic's 93-task coding benchmark: 13% higher resolution rate over Opus 4.6, including four tasks neither Opus 4.6 nor Sonnet 4.6 could complete at all
  • Multi-step agentic workflows: 14% higher task completion, averaging 7.1 model calls per task versus 16.3 for Opus 4.6 (a 2.3x reduction in invocations for the same work)

That last number matters most for production spend. Fewer model calls mean less total cost on complex agentic workflows, which partially offsets the tokenizer inflation described below.
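The interplay between fewer calls and fatter tokens is worth checking with arithmetic rather than intuition. Here is a back-of-envelope sketch using the per-call counts from the benchmark above; the 20K-in / 2K-out per-call token figures are illustrative assumptions, not measured data:

```python
# Does the drop from 16.3 to 7.1 calls per task offset the up-to-1.35x
# tokenizer inflation? Per-call token counts below are assumptions.

def task_cost(calls: float, tokens_in: float, tokens_out: float,
              in_rate: float = 5.0, out_rate: float = 25.0) -> float:
    """Dollar cost of one task: calls x per-call tokens x $/M rates."""
    return calls * (tokens_in * in_rate + tokens_out * out_rate) / 1e6

# Opus 4.6: 16.3 calls/task, hypothetical 20K in / 2K out per call
old = task_cost(16.3, 20_000, 2_000)
# Opus 4.7: 7.1 calls/task, same prompts inflated 1.35x by the new tokenizer
new = task_cost(7.1, 20_000 * 1.35, 2_000 * 1.35)

print(f"Opus 4.6: ${old:.2f}/task  Opus 4.7: ${new:.2f}/task")
```

Under these assumptions the call-count reduction wins for multi-step agentic work; for single-shot prompts, where there is no call count to reduce, the inflation lands undiluted.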

The model's personality shifted, too. Opus 4.7 is more direct and more literal: it won't silently generalize an instruction from one item to another, uses fewer validation-forward phrases, and uses fewer emoji. If you found 4.6's warmer style slightly sycophantic, this is likely an improvement.

The vision jump

Opus 4.7 is the first Claude model with high-resolution image support. Maximum image resolution increased from 1,568px (about 1.15MP) to 2,576px (3.75MP). On visual navigation benchmarks without tools, Anthropic reports 79.5% for Opus 4.7 versus 57.7% for Opus 4.6, a 22-point improvement.

For UI automation, dense document parsing, and anything involving product screenshots or medical imaging, this is a real capability expansion. Higher-resolution images also consume more tokens per call, so if your workload involves many images, revisit how context windows and token budgets work before estimating costs.
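To gauge the token impact, a quick estimate helps. Anthropic has published an approximation of roughly `(width x height) / 750` tokens per image for earlier Claude models; whether it carries over unchanged to Opus 4.7's higher ceiling is an assumption, so treat this as a worst-case sketch using the longest-edge resolutions from above:

```python
# Rough per-image token estimate via the published (w * h) / 750
# approximation. Applying it to Opus 4.7's new ceiling is an assumption.

def image_tokens(width: int, height: int) -> int:
    return (width * height) // 750

old_cap = image_tokens(1568, 1568)  # square image at the Opus 4.6 edge limit
new_cap = image_tokens(2576, 2576)  # square image at the Opus 4.7 edge limit

print(old_cap, new_cap, round(new_cap / old_cap, 2))
```

A maxed-out image can cost a few times more tokens than before, which compounds with the tokenizer change discussed next.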

The tokenizer situation

Here's where the launch got complicated.

Opus 4.7 ships with a new tokenizer. The per-token price is unchanged: $5 per million input tokens and $25 per million output tokens. But the new tokenizer produces 1x to 1.35x as many tokens as the previous one on identical inputs. The upper end of that range appears most on code, structured data, and non-English text. These are exactly the workloads where developers are most likely to reach for Opus over Sonnet.

In practical terms, heavy coding sessions that cost you $X with Opus 4.6 may cost up to 1.35X with Opus 4.7 on the same prompts. Some Claude Pro subscribers reported hitting usage limits after three questions in an extended coding session, a pattern they hadn't seen with 4.6.

The phrase "unchanged pricing" in Anthropic's announcement is accurate at the per-token level and misleading at the per-request level. Before committing to Opus 4.7 across production, run your most common prompts through both models and compare token counts. This tokenizer effect is a newer version of the same pattern covered in our guide to LLM costs that rarely show up in dashboards.
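The comparison is cheap to run because the Messages API exposes a token-counting endpoint that doesn't bill for inference. A minimal sketch, assuming the `anthropic` Python SDK and placeholder model IDs (check the IDs your account actually exposes); the prompt file name is hypothetical, standing in for your own workload:

```python
import os

# Compare token counts for the same prompt under two Claude tokenizers.
# Model IDs and the prompt file are placeholders; needs ANTHROPIC_API_KEY.

def inflation_ratio(new_tokens: int, old_tokens: int) -> float:
    """Multiplier the new tokenizer applies to this particular prompt."""
    return new_tokens / old_tokens

def count_for(client, model: str, prompt: str) -> int:
    resp = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.input_tokens

if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic
    client = anthropic.Anthropic()
    prompt = open("typical_coding_prompt.txt").read()  # your own workload
    old = count_for(client, "claude-opus-4-6", prompt)  # hypothetical IDs
    new = count_for(client, "claude-opus-4-7", prompt)
    print(f"tokenizer inflation on this prompt: {inflation_ratio(new, old):.2f}x")
```

Run it over a representative sample of your real prompts, not a toy string: the 1x-to-1.35x spread means a single measurement can badly misestimate your blended rate.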

Community reaction

The launch triggered sharp pushback. A Reddit post titled "Opus 4.7 is not an upgrade but a serious regression" collected 2,300 upvotes within days of release. Examples circulated on X: the model reportedly stated that "strawberry" contains two P's, rewrote a resume with a different school name and surname, and described itself as "acting lazily" in certain sessions.

Boris Cherny, head of Claude Code at Anthropic, disputed the regression characterization directly. An Anthropic product manager separately acknowledged the team is "rapidly moving ahead with internal tuning work." Neither response denied the individual failure examples.

The complaints center on "adaptive reasoning," a new feature that lets the model decide how much computation to spend before responding. Critics argue the model underallocates reasoning to problems that warrant it. Anthropic says performance on their benchmarks shows otherwise. This is still being sorted out, and an Anthropic PM's own acknowledgment of tuning work in progress suggests the criticism has traction internally.

How it compares on spec

Model            Input (/M tokens)   Output (/M tokens)   Context         Vision
Claude Opus 4.7  $5.00               $25.00               1M tokens       Up to 3.75MP
Claude Opus 4.6  $5.00               $25.00               1M tokens       Up to 1.15MP
GPT-5.4          $2.50               $15.00               270K standard   Yes
Gemini 3.1 Pro   $2.00               $12.00               1M tokens       Yes

Prices from provider pricing pages, April 2026. Effective Opus 4.7 input cost may run 10 to 35% above listed due to the new tokenizer, varying by content type.

GPT-5.4 is priced at half of Opus 4.7's sticker rate. Gemini 3.1 Pro is under half. The key advantage Opus 4.7 holds over GPT-5.4 for most workloads is the full 1M context at base pricing: GPT-5.4 charges double the standard input rate for tokens beyond its 270K standard window. For large-codebase and long-document workflows, that context advantage is real. For anything under 128K tokens, the cost gap is harder to justify on price alone. OpenAI's GPT-5.5 release, announced the following week, changed the calculus: it matches the 1M context at the same $5 input rate, but raises output to $30 per million.
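The long-context surcharge makes "which model is cheaper on input" depend on request size, so it's worth computing rather than eyeballing. A sketch using the table's rates; the 1.2x tokenizer multiplier for Opus is an assumed midpoint you should replace with your own measured figure:

```python
# Effective input cost per request at different context sizes.
# Rates come from the spec table; the Opus tokenizer multiplier and the
# 2x overflow surcharge structure for GPT-5.4 are stated in the article.

def opus_input_cost(tokens: int, inflation: float = 1.2) -> float:
    """Opus 4.7 input cost in $, with an assumed tokenizer multiplier."""
    return tokens * inflation * 5.00 / 1e6

def gpt54_input_cost(tokens: int, standard_window: int = 270_000) -> float:
    """GPT-5.4 input cost in $: $2.50/M inside the window, 2x beyond it."""
    base = min(tokens, standard_window)
    overflow = max(tokens - standard_window, 0)
    return (base * 2.50 + overflow * 5.00) / 1e6

for n in (100_000, 270_000, 800_000):
    print(f"{n:>7} tokens  Opus 4.7: ${opus_input_cost(n):.2f}  "
          f"GPT-5.4: ${gpt54_input_cost(n):.2f}")
```

Note that input price alone understates Opus's position: output rates, call counts per task, and completion quality all move the total, which is why the agentic numbers earlier matter.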

Our take

The coding and agentic improvements are documented with specific numbers on concrete tasks, and Anthropic's benchmark methodology is well enough understood that the figures are plausible. The vision upgrade is unambiguous.

The tokenizer situation deserves a workload test before any production decision. "Same price" does not mean "same bill," and the difference is large enough to matter at scale. If you're building a routing setup that selectively uses Opus for hard tasks and falls back to cheaper models for lighter work, our LLM fallback chain guide shows how to do that without much ceremony.
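The routing idea is simple enough to sketch in a few lines. Everything here is a placeholder for your own stack: the model names, the difficulty heuristic, and the `call_model` callable are hypothetical, and the full fallback-chain guide covers the production concerns (retries, timeouts, cost tracking) this omits:

```python
# Minimal routing sketch: hard tasks go to Opus 4.7 first, light tasks to a
# cheaper model, each falling back to the other on failure. All names and
# the heuristic below are placeholders, not a real configuration.

CHEAP, EXPENSIVE = "cheap-model", "claude-opus-4-7"

def is_hard(prompt: str) -> bool:
    # Placeholder heuristic: long prompts or refactors get the big model.
    return len(prompt) > 4_000 or "refactor" in prompt.lower()

def route(prompt: str, call_model) -> str:
    chain = [EXPENSIVE, CHEAP] if is_hard(prompt) else [CHEAP, EXPENSIVE]
    last_err = None
    for model in chain:
        try:
            return call_model(model, prompt)
        except Exception as err:  # in production, catch your SDK's error types
            last_err = err
    raise RuntimeError("all models in the chain failed") from last_err

# Stub call for illustration:
print(route("refactor the auth module", lambda model, p: f"{model} handled it"))
```

The heuristic is the part worth investing in: a classifier or even a keyword list tuned on your own traffic decides how much of your spend ever touches Opus pricing.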

The adaptive reasoning story is unresolved. Anthropic's own team acknowledged active tuning work. That's not a reason to avoid Opus 4.7, but it's a reason to test it on your specific use cases before routing high-stakes traffic through it, rather than assuming benchmarks on Anthropic's eval set will translate directly to your workload.

We'll run a full head-to-head against Opus 4.6 and GPT-5.4 with our standard prompt battery once Opus 4.7 is available through LLMTest. Subscribe to the RSS feed to catch that follow-up. Our full benchmark methodology is at /docs/benchmarks.

Ship LLM features without burning your budget.

LLMTest proxies your OpenAI / Anthropic calls, tracks cost per feature, and auto-rewrites prompts to be cheaper while holding quality. Free to start.

Create a free account

Related reading

DeepSeek V4 Pro review: beats GPT-5.5 and costs a fifth of Opus 4.7
Apr 29, 2026 · 6 min read
Best LLM for SQL generation in 2026: GPT-4o-mini wins clean
May 1, 2026 · 7 min read
GPT-5.5 review: 1M context, native computer use, at twice the price
Apr 24, 2026 · 5 min read