LLM Context Windows and Capabilities in 2026
What every major LLM can actually do. Context limits, max output, image and audio inputs, tool calling, JSON mode, prompt caching, batch APIs, and training cutoffs. GPT-5.5, Claude Opus 4.7, Gemini 2.5 Pro, Llama 4, DeepSeek, Grok, plus the smaller variants. We're LLMTest, the AI proxy that runs every model through real benchmarks so you don't have to.
Verified 2026-05-02 · Need pricing? See the LLM API pricing comparison.
Don't pick the perfect model. Ship it rough.
LLMTest is an AI proxy. On every call, we auto-pick the cheapest model that hits your quality bar. We also rewrite weak prompts, handle fallbacks when an API goes down, and run weekly benchmarks across 340+ models so we know what's actually working right now. Drop it in once. Ship features instead of memorizing capability matrices.
Start optimizing

| Model | Context | Max Out | Vision | Audio | Tools | JSON | Cache | Batch | Cutoff |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 (Flagship) · openai/gpt-5.5 | 1.1M | 16K | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2026-01 |
| GPT-5 (Flagship) · openai/gpt-5 | 400K | 16K | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2025-10 |
| GPT-4.1 (Flagship) · openai/gpt-4.1 | 1M | 16K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2024-06 |
| Claude Opus 4.7 (Flagship) · anthropic/claude-opus-4.7 | 1M | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2026-01 |
| Claude Opus 4 (Flagship) · anthropic/claude-opus-4 | 200K | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-04 |
| Gemini 2.5 Pro (Flagship) · google/gemini-2.5-pro | 1M | 66K | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2025-06 |
| Grok 4 (Flagship) · x-ai/grok-4 | 256K | 8K | ✓ | — | ✓ | ✓ | — | — | 2025-11 |
| Sonar Pro (Flagship) · perplexity/sonar-pro | 200K | 8K | — | — | — | ✓ | — | — | Live web |
| o3 (Reasoning) · openai/o3 | 200K | 100K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2024-06 |
| o3-mini (Reasoning) · openai/o3-mini | 200K | 100K | — | — | ✓ | ✓ | ✓ | ✓ | 2023-10 |
| GPT-5 Mini (Mid) · openai/gpt-5-mini | 400K | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-10 |
| GPT-4o (Mid) · openai/gpt-4o | 128K | 16K | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2023-10 |
| Claude Sonnet 4.6 (Mid) · anthropic/claude-sonnet-4.6 | 1M | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-12 |
| Claude Sonnet 4 (Mid) · anthropic/claude-sonnet-4 | 1M | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-04 |
| Gemini 2.5 Flash (Mid) · google/gemini-2.5-flash | 1M | 66K | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2025-06 |
| Mistral Medium 3 (Mid) · mistralai/mistral-medium-3 | 131K | 8K | — | — | ✓ | ✓ | — | ✓ | 2025-03 |
| GPT-5 Nano (Small) · openai/gpt-5-nano | 400K | 4K | — | — | ✓ | ✓ | ✓ | ✓ | 2025-10 |
| GPT-4o Mini (Small) · openai/gpt-4o-mini | 128K | 16K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2023-10 |
| Claude Haiku 4.5 (Small) · anthropic/claude-haiku-4.5 | 200K | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-09 |
| Gemini 2.5 Flash Lite (Small) · google/gemini-2.5-flash-lite | 1M | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-06 |

Sorted by tier, then recency. Context and max-output figures are tokens. Capability data curated from provider docs and tested on the LLMTest proxy.
What these capabilities actually mean for production
Every column above looks like a feature checklist. In practice, only two or three of them will matter for any one product. Here's the short version of which to care about.
- Context window caps how much input plus output can fit in one call. Skip the marketing number and ask whether you'll actually pack 200K tokens into one prompt. Most production flows use under 8K.
- Max output is the hard cap on a single response, even when the context is 1M. If you're generating long form (reports, code files, transcripts), this is the column that bites. A pre-flight budget check covering both limits follows this list.
- Vision and audio are still uneven. Vision is broad now; audio input is rare outside the Gemini and GPT lines. If you need either, this column does the gating.
- Tools and JSON are table stakes for agent loops and structured extraction. A "no" here means you'll be parsing markdown by hand or wrapping a fragile prompt. A JSON-mode sketch follows this list.
- Cache is the column that quietly saves real money. If your system prompt is stable across calls, prompt caching cuts cost 60% to 80%; see the caching sketch after this list for what turning it on looks like.
- Batch matters for offline workloads: half-price async jobs that complete inside 24 hours. If your work isn't user-facing, you should probably be using it; a batch-submission sketch follows this list.
- Cutoff tells you what the model doesn't know. For anything date-sensitive, pair it with retrieval; see the RAG basics guide.
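First, the budget check promised above. This is a minimal sketch using OpenAI's tiktoken tokenizer, with limits hand-copied from the table; the `LIMITS` dict and `fits()` helper are illustrative, not an LLMTest API, and tiktoken's counts are only an approximation for non-OpenAI models.

```python
# Pre-flight check: will this prompt plus the response you want
# fit inside the model's context window and max-output cap?
import tiktoken

# Hand-copied from the table above; illustrative values only.
LIMITS = {
    "openai/gpt-5": {"context": 400_000, "max_out": 16_000},
    "openai/gpt-4o-mini": {"context": 128_000, "max_out": 16_000},
}

def fits(model: str, prompt: str, want_out: int) -> bool:
    enc = tiktoken.get_encoding("o200k_base")  # approximate for non-OpenAI models
    used = len(enc.encode(prompt))
    cap = LIMITS[model]
    # Both limits must hold: the per-response cap, and input + output in context.
    return want_out <= cap["max_out"] and used + want_out <= cap["context"]

print(fits("openai/gpt-4o-mini", "Summarize this ticket: ...", 2_000))  # True
```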
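For the tools/JSON column, here's what JSON mode looks like on an OpenAI-compatible endpoint via the official openai Python SDK. The schema in the system prompt is a made-up example; note that JSON mode requires the word "JSON" to appear somewhere in your messages.

```python
# Structured extraction with JSON mode on an OpenAI-compatible API.
import json
from openai import OpenAI

client = OpenAI()  # base_url can point at a proxy instead

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # the "JSON" column in the table
    messages=[
        # JSON mode requires "JSON" to appear in the prompt.
        {"role": "system", "content": 'Reply with JSON: {"name": string, "email": string}'},
        {"role": "user", "content": "Contact: Jane Doe <jane@example.com>"},
    ],
)
contact = json.loads(resp.choices[0].message.content)
print(contact["email"])
```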
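The caching sketch, shown Anthropic-style since each provider wires it differently: you mark the stable system block with `cache_control`, and subsequent calls that reuse it read from cache at a discount. The model id and policy text are placeholders, and Anthropic enforces a minimum cacheable size (on the order of 1K tokens, varying by model), so tiny prompts won't cache.

```python
# Prompt caching, Anthropic-style: flag the stable system prompt as
# cacheable so repeat calls stop paying full input price for it.
import anthropic

client = anthropic.Anthropic()

STABLE_SYSTEM = "You are a support agent for Acme. Policies: ..."  # imagine several KB

msg = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": "Where is order #1234?"}],
)
print(msg.content[0].text)
```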
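And the batch sketch, using OpenAI's Batch API as the concrete example since its shape (a JSONL file of requests, a 24-hour window, roughly half price) is the pattern other providers follow. File names and prompts are illustrative.

```python
# Half-price offline inference with the OpenAI Batch API:
# write one request per line to a JSONL file, upload, submit, poll.
import json
from openai import OpenAI

client = OpenAI()

rows = [
    {
        "custom_id": f"ticket-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Classify ticket {i}: ..."}],
        },
    }
    for i in range(3)
]
with open("batch.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in rows)

uploaded = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=uploaded.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the standard window
)
print(job.id, job.status)  # poll later with client.batches.retrieve(job.id)
```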
For a deeper read on picking, see How to choose an LLM in 2026: the definitive guide. Or check the pricing comparison when you're ready to put dollar numbers on each row.