OpenAI released GPT-5.5 on April 23, 2026, calling it "a new class of intelligence." Three actual changes stand out from the announcement: a 1M-token context window at base pricing, native computer use built into the model rather than bolted on as a separate capability, and a genuinely omnimodal architecture that handles text, images, audio, and video in one unified system. The price tag is equally notable: standard API rates doubled relative to GPT-5.4, from $2.50 to $5 per million input tokens and from $15 to $30 per million output tokens.
Here is what OpenAI says changed, where those claims still need independent validation, and how the spec sits against the current frontier.
What actually changed
Context window at 1M tokens. GPT-5.4's standard pricing covered 270K tokens, with usage above that threshold billed at double the standard input rate. GPT-5.5 makes the full million tokens the baseline, with no surcharge. For large-codebase ingestion, long legal documents, or extended multi-turn agent sessions, this removes a real billing cliff. Whether the model reliably attends to content placed at the far end of that window is a separate question, and one worth testing against your actual workloads: advertised context length and effective retrieval depth rarely land in the same place, as our context window explainer covers in detail.
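The billing difference on the input side is easy to see with a quick calculation. This is a minimal sketch using the rates quoted in this piece; the exact tiering of GPT-5.4's over-threshold surcharge is an approximation of the scheme described above.

```python
# Input-side cost per call, using the rates quoted in this piece.
def gpt54_input_cost(prompt_tokens: int) -> float:
    base_rate, over_rate, threshold = 2.50, 5.00, 270_000  # $ per million tokens
    base = min(prompt_tokens, threshold)
    over = max(prompt_tokens - threshold, 0)
    return (base * base_rate + over * over_rate) / 1_000_000

def gpt55_input_cost(prompt_tokens: int) -> float:
    return prompt_tokens * 5.00 / 1_000_000  # flat $5/M across the full 1M window

for tokens in (100_000, 270_000, 600_000, 1_000_000):
    print(f"{tokens:>9,} tokens  GPT-5.4: ${gpt54_input_cost(tokens):5.2f}  "
          f"GPT-5.5: ${gpt55_input_cost(tokens):5.2f}")
```

The cliff disappears, but because the base rate doubled, long prompts still cost more in absolute terms under GPT-5.5; the win is predictability, not a lower bill.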
Native computer use. Previous GPT generations offered computer use as a separately configured capability. GPT-5.5 ships it as part of the base model, which means agentic workflows can invoke computer-level actions without reconfiguring mid-task. OpenAI specifically highlights this for complex coding and research pipelines that span multiple tools and environments.
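For a sense of what that looks like at the API level, here is a minimal sketch. It assumes the tool shape of OpenAI's existing computer-use preview carries over to GPT-5.5 unchanged; the model ID, tool type, and parameter names below are assumptions, not confirmed GPT-5.5 API surface.

```python
from openai import OpenAI

client = OpenAI()

# Assumption: GPT-5.5 exposes native computer use through the Responses API with the
# same tool shape as OpenAI's current computer-use preview. The model ID and tool
# parameters here are unverified placeholders.
response = client.responses.create(
    model="gpt-5.5",
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1280,
        "display_height": 800,
        "environment": "browser",
    }],
    input="Open the staging dashboard and export last week's error report as CSV.",
    truncation="auto",
)

# The output interleaves reasoning with computer actions (clicks, typing, screenshot
# requests) that your harness executes and feeds back to the model.
print(response.output)
```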
Natively omnimodal architecture. GPT-5.5 accepts and reasons across text, images, audio, and video inside a single model pass rather than routing to specialized sub-models per modality. OpenAI's claim is that this produces more coherent cross-modal reasoning. The claim is plausible given the broader industry trend toward unified architectures; whether it holds on adversarial multi-modal prompts requires structured testing.
OpenAI also states that GPT-5.5 "maintains the same per-token latency as GPT-5.4 while using fewer tokens to complete similar tasks." That efficiency claim is the most consequential one for real-world cost comparisons. If it holds on your workload, the doubling of sticker price narrows in practice. If it does not, the doubling is exactly what it looks like.
What it costs
Standard GPT-5.5 doubled GPT-5.4's rates on both input and output:
| Model | Input ($/M tokens) | Output ($/M tokens) | Context |
|---|---|---|---|
| GPT-5.5 | $5.00 | $30.00 | 1M tokens |
| GPT-5.5 Pro | $30.00 | $180.00 | 1M tokens |
| GPT-5.4 | $2.50 | $15.00 | 270K tokens |
| Claude Opus 4.7 | $5.00 | $25.00 | 1M tokens |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M tokens |
Prices from provider pricing pages, April 2026.
GPT-5.5 matches Claude Opus 4.7's input rate exactly, then exceeds it on output by 20%. It runs at 2.5x Gemini 3.1 Pro's rate on both sides. GPT-5.5 Pro at $30 input and $180 output is a distinct tier positioned for high-stakes automated workflows where accuracy justifies the per-call premium. At $180 per million output tokens, a response producing 8,000 output tokens costs $1.44. That adds up fast at scale. Batch and Flex pricing are available at half the standard rate, which improves the math for non-latency-sensitive jobs.
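To make the per-call arithmetic concrete, here is a minimal cost helper built from the table above; the token counts in the examples are illustrative.

```python
PRICES = {  # $ per million tokens, from the table above (April 2026)
    "gpt-5.5":         {"input": 5.00,  "output": 30.00},
    "gpt-5.5-pro":     {"input": 30.00, "output": 180.00},
    "gpt-5.4":         {"input": 2.50,  "output": 15.00},
    "claude-opus-4.7": {"input": 5.00,  "output": 25.00},
    "gemini-3.1-pro":  {"input": 2.00,  "output": 12.00},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The Pro-tier example from the text: 8,000 output tokens at $180/M is $1.44,
# before counting any input tokens.
print(call_cost("gpt-5.5-pro", input_tokens=0, output_tokens=8_000))        # 1.44
print(call_cost("gpt-5.5", input_tokens=200_000, output_tokens=8_000))      # 1.24
```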
The efficiency argument is where this gets interesting. OpenAI claims fewer tokens accomplish equivalent work, and because the input rates match Opus 4.7's, the comparison hinges on output. A reduction of roughly 17% in output tokens would bring GPT-5.5's effective output cost level with Opus 4.7's $25 per million; a 25-30% reduction would put it clearly below for output-heavy workloads. Without independent replication, though, the 2x sticker increase is what you should budget for. The hidden compounding effects of prompt bloat and retry overhead, covered in our LLM cost breakdown, apply to GPT-5.5 just as they did to every predecessor.
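The break-even point implied by those rates is a one-line calculation. A sketch, assuming an output-heavy workload where the matching $5/M input rates roughly cancel out:

```python
gpt55_output, opus47_output = 30.00, 25.00  # $ per million output tokens

# Output-token reduction at which GPT-5.5's effective output spend matches Opus 4.7,
# assuming the identical input rates cancel.
break_even = 1 - opus47_output / gpt55_output
print(f"break-even reduction: {break_even:.1%}")  # ~16.7%

for reduction in (0.10, 0.17, 0.25, 0.30):
    effective = gpt55_output * (1 - reduction)
    print(f"{reduction:.0%} fewer output tokens -> effective ${effective:.2f}/M "
          f"vs Opus 4.7 at ${opus47_output:.2f}/M")
```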
What is not yet verified
Token efficiency. "Fewer tokens to accomplish the same task" is OpenAI's own claim about a model they built. Token efficiency figures from a creator are exactly the ones most worth stress-testing on real workloads. We will measure this in our prompt battery once GPT-5.5 is routed through LLMTest.
Long-context retrieval quality. The move from 270K to 1M tokens is meaningful only if the model attends to content placed near the far end. The common pattern with large-context models is that recall quality degrades past a certain depth, often well before the nominal limit. This requires needle-in-a-haystack testing at 500K, 750K, and 900K token offsets to establish where the practical ceiling sits.
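A minimal sketch of the kind of depth sweep we mean. The filler text, the four-characters-per-token approximation, and the model ID are placeholders; a real harness should count tokens with the model's own tokenizer.

```python
from openai import OpenAI

client = OpenAI()

NEEDLE = "The vault access code for project HELIOS is 4417-omega."
QUESTION = "What is the vault access code for project HELIOS? Reply with the code only."
FILLER = "Quarterly metrics were reviewed and no anomalies were found. "
CHARS_PER_TOKEN = 4  # rough approximation; use the model's tokenizer in practice

def build_haystack(total_tokens: int, needle_offset_tokens: int) -> str:
    # Pad with filler before and after the planted fact to hit the target depth.
    before = FILLER * (needle_offset_tokens * CHARS_PER_TOKEN // len(FILLER))
    after = FILLER * ((total_tokens - needle_offset_tokens) * CHARS_PER_TOKEN // len(FILLER))
    return before + NEEDLE + " " + after

def recalls_needle(model: str, total_tokens: int, offset_tokens: int) -> bool:
    doc = build_haystack(total_tokens, offset_tokens)
    resp = client.chat.completions.create(
        model=model,  # placeholder model ID
        messages=[{"role": "user", "content": doc + "\n\n" + QUESTION}],
    )
    return "4417-omega" in resp.choices[0].message.content

for offset in (500_000, 750_000, 900_000):
    print(offset, recalls_needle("gpt-5.5", total_tokens=950_000, offset_tokens=offset))
```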
Regression baseline. When Anthropic shipped Claude Opus 4.7, community testing turned up failure modes on tasks the predecessor handled reliably, tied to a new adaptive reasoning feature. The same pattern appears regularly at frontier model launches: improvements on benchmark-targeted tasks can coincide with regressions in adjacent ones. Running a fixed prompt set against both GPT-5.4 and GPT-5.5 is the safest way to confirm you are moving forward before routing production traffic through the new model.
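A sketch of what that fixed-set comparison can look like, assuming simple string-match grading; real suites will want graders matched to their own task types.

```python
from openai import OpenAI

client = OpenAI()

# A fixed prompt set with cheap pass/fail checks. The point is to run the identical
# set against both model IDs and compare pass rates before switching traffic.
PROMPT_SET = [
    {"prompt": 'Return the JSON {"status": "ok"} and nothing else.', "expect": '"status"'},
    {"prompt": "What is 17 * 23? Reply with the number only.", "expect": "391"},
]

def pass_rate(model: str) -> float:
    passed = 0
    for case in PROMPT_SET:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        if case["expect"] in resp.choices[0].message.content:
            passed += 1
    return passed / len(PROMPT_SET)

for model in ("gpt-5.4", "gpt-5.5"):  # placeholder model IDs
    print(model, f"{pass_rate(model):.0%}")
```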
Cross-modal coherence. The omnimodal claim is testable in principle. We will include prompts that require reasoning across at least two modalities in our standard battery.
How the frontier map shifted
Two weeks before GPT-5.5, the cost tiers looked like this: Gemini 3.1 Pro at $2/$12 was the budget frontier option; GPT-5.4 at $2.50/$15 offered more headroom with 270K context; Claude Opus 4.7 at $5/$25 justified its premium mainly through the 1M context window for long-document and large-codebase work.
GPT-5.5 at $5/$30 now occupies a new tier. It removes GPT-5.4's context ceiling at the cost of a 2x price increase, and it cedes the cost argument to Gemini 3.1 Pro while matching Opus 4.7's context range. For teams currently on GPT-5.4, the migration math is: does GPT-5.5's claimed token efficiency save enough per task to offset the per-token increase? That is workload-specific. For teams building multi-provider setups, our LLM fallback chain guide covers how to route intelligently across providers without coupling your call sites to specific model IDs. That matters here because GPT-5.5's full 1M context at base pricing removes one of GPT-5.4's comparative weaknesses against Claude Opus 4.7.
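The decoupling idea reduces to a small routing table: call sites ask for a role, and one mapping decides which model backs it. The roles, model IDs, and provider stub below are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Route:
    primary: str
    fallbacks: tuple[str, ...] = ()

# Call sites reference a role, never a model ID. Migrating a role from GPT-5.4 to
# GPT-5.5, or adding Opus 4.7 as a fallback, is an edit here rather than a code change.
ROUTES = {
    "long-context-analysis": Route(primary="gpt-5.5", fallbacks=("claude-opus-4.7",)),
    "cheap-summarization":   Route(primary="gemini-3.1-pro", fallbacks=("gpt-5.4",)),
}

def call_provider(model: str, prompt: str) -> str:
    # Stub for illustration: replace with the relevant provider SDK call per model ID.
    return f"[{model}] response to: {prompt[:40]}"

def complete(role: str, prompt: str) -> str:
    route = ROUTES[role]
    for model in (route.primary, *route.fallbacks):
        try:
            return call_provider(model, prompt)
        except Exception:
            continue  # fall through to the next model on provider errors
    raise RuntimeError(f"every model failed for role {role!r}")

print(complete("long-context-analysis", "Summarize the attached 800K-token contract set."))
```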
Our take
The context window expansion, native computer use, and unified modality are documented capabilities that matter for production agentic workflows. "A new class of intelligence" is marketing language; "1M context at $5/M input, $30/M output" is a fact you can work with.
A switch from GPT-5.4 to GPT-5.5 requires a workload-specific cost analysis before any production commit. The efficiency claim is plausible, not proven. A switch from Opus 4.7 requires side-by-side quality testing on your specific task types. At the same input rate but 20% higher output, it is not obviously better or worse, only different.
Those benchmarks are now published. In our DeepSeek V4 vs GPT-5.5 head-to-head, V4 Pro beat GPT-5.5 on developer tasks while costing 4.5 times less per request. Our full testing methodology is at /docs/benchmarks.