LLM Context Windows and Capabilities in 2026
What every major LLM can actually do. Context limits, max output, image and audio inputs, tool calling, JSON mode, prompt caching, batch APIs, and training cutoffs. GPT-5.5, Claude Opus 4.7, Gemini 2.5 Pro, Llama 4, DeepSeek, Grok, plus the smaller variants. We're LLMTest, the AI proxy that runs every model through real benchmarks so you don't have to.
Verified 2026-05-02 · Need pricing? See the LLM API pricing comparison.
Don't pick the perfect model. Ship it rough.
LLMTest is an AI proxy. On every call, we auto-pick the cheapest model that hits your quality bar. We also rewrite weak prompts, handle fallbacks when an API goes down, and run weekly benchmarks across 340+ models so we know what's actually working right now. Drop it in once. Ship features instead of memorizing capability matrices.
Start optimizing

| Model | Context | Max Out | Vision | Audio | Tools | JSON | Cache | Batch | Cutoff |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 (Flagship) · openai/gpt-5.5 | 1.1M | 16K | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2026-01 |
| GPT-5 (Flagship) · openai/gpt-5 | 400K | 16K | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2025-10 |
| GPT-4.1 (Flagship) · openai/gpt-4.1 | 1M | 16K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2024-06 |
| Claude Opus 4.7 (Flagship) · anthropic/claude-opus-4.7 | 1M | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2026-01 |
| Claude Opus 4 (Flagship) · anthropic/claude-opus-4 | 200K | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-04 |
| Gemini 2.5 Pro (Flagship) · google/gemini-2.5-pro | 1M | 66K | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2025-06 |
| Grok 4 (Flagship) · x-ai/grok-4 | 256K | 8K | ✓ | — | ✓ | ✓ | — | — | 2025-11 |
| Sonar Pro (Flagship) · perplexity/sonar-pro | 200K | 8K | — | — | — | ✓ | — | — | Live web |
| o3 (Reasoning) · openai/o3 | 200K | 100K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2024-06 |
| o3-mini (Reasoning) · openai/o3-mini | 200K | 100K | — | — | ✓ | ✓ | ✓ | ✓ | 2023-10 |
| GPT-5 Mini (Mid) · openai/gpt-5-mini | 400K | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-10 |
| GPT-4o (Mid) · openai/gpt-4o | 128K | 16K | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2023-10 |
| Claude Sonnet 4.6 (Mid) · anthropic/claude-sonnet-4.6 | 1M | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-12 |
| Claude Sonnet 4 (Mid) · anthropic/claude-sonnet-4 | 1M | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-04 |
| Gemini 2.5 Flash (Mid) · google/gemini-2.5-flash | 1M | 66K | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2025-06 |
| Mistral Medium 3 (Mid) · mistralai/mistral-medium-3 | 131K | 8K | — | — | ✓ | ✓ | — | ✓ | 2025-03 |
| GPT-5 Nano (Small) · openai/gpt-5-nano | 400K | 4K | — | — | ✓ | ✓ | ✓ | ✓ | 2025-10 |
| GPT-4o Mini (Small) · openai/gpt-4o-mini | 128K | 16K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2023-10 |
| Claude Haiku 4.5 (Small) · anthropic/claude-haiku-4.5 | 200K | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-09 |
| Gemini 2.5 Flash Lite (Small) · google/gemini-2.5-flash-lite | 1M | 8K | ✓ | — | ✓ | ✓ | ✓ | ✓ | 2025-06 |

Sorted by tier, then recency. Context and max-output figures are tokens. Capability data curated from provider docs and tested on the LLMTest proxy.
What these capabilities actually mean for production
Every column above looks like a feature checklist. In practice, only two or three of them will matter for any one product. Here's the short version of which to care about.
- Context window caps how much input plus output can fit in one call. Skip the marketing number and ask whether you'll actually pack 200K tokens into one prompt. Most production flows use under 8K.
- Max output is the hard cap on a single response, even when the context is 1M. If you're generating long form (reports, code files, transcripts), this is the column that bites. A pre-flight budget check covering both limits follows this list.
- Vision and audio are still uneven. Vision is broad now; audio input is rare outside the Gemini and GPT lines. If you need either, this column does the gating.
- Tools and JSON are table stakes for agent loops and structured extraction. A "no" here means you'll be parsing markdown by hand or wrapping a fragile prompt. A JSON-mode sketch follows this list.
- Cache is the column that quietly saves real money. If your system prompt is stable across calls, prompt caching cuts cost 60% to 80%; see the caching sketch after this list for what turning it on looks like.
- Batch matters for offline workloads: half-price async jobs that complete inside 24 hours. If your work isn't user-facing, you should probably be using it; a batch-submission sketch follows this list.
- Cutoff tells you what the model doesn't know. For anything date-sensitive, pair it with retrieval; see the RAG basics guide.
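First, the budget check promised above. This is a minimal sketch using OpenAI's tiktoken tokenizer, with limits hand-copied from the table; the `LIMITS` dict and `fits()` helper are illustrative, not an LLMTest API, and tiktoken's counts are only an approximation for non-OpenAI models.

```python
# Pre-flight check: will this prompt plus the response you want
# fit inside the model's context window and max-output cap?
import tiktoken

# Hand-copied from the table above; illustrative values only.
LIMITS = {
    "openai/gpt-5": {"context": 400_000, "max_out": 16_000},
    "openai/gpt-4o-mini": {"context": 128_000, "max_out": 16_000},
}

def fits(model: str, prompt: str, want_out: int) -> bool:
    enc = tiktoken.get_encoding("o200k_base")  # approximate for non-OpenAI models
    used = len(enc.encode(prompt))
    cap = LIMITS[model]
    # Both limits must hold: the per-response cap, and input + output in context.
    return want_out <= cap["max_out"] and used + want_out <= cap["context"]

print(fits("openai/gpt-4o-mini", "Summarize this ticket: ...", 2_000))  # True
```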
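For the tools/JSON column, here's what JSON mode looks like on an OpenAI-compatible endpoint via the official openai Python SDK. The schema in the system prompt is a made-up example; note that JSON mode requires the word "JSON" to appear somewhere in your messages.

```python
# Structured extraction with JSON mode on an OpenAI-compatible API.
import json
from openai import OpenAI

client = OpenAI()  # base_url can point at a proxy instead

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # the "JSON" column in the table
    messages=[
        # JSON mode requires "JSON" to appear in the prompt.
        {"role": "system", "content": 'Reply with JSON: {"name": string, "email": string}'},
        {"role": "user", "content": "Contact: Jane Doe <jane@example.com>"},
    ],
)
contact = json.loads(resp.choices[0].message.content)
print(contact["email"])
```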
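The caching sketch, shown Anthropic-style since each provider wires it differently: you mark the stable system block with `cache_control`, and subsequent calls that reuse it read from cache at a discount. The model id and policy text are placeholders, and Anthropic enforces a minimum cacheable size (on the order of 1K tokens, varying by model), so tiny prompts won't cache.

```python
# Prompt caching, Anthropic-style: flag the stable system prompt as
# cacheable so repeat calls stop paying full input price for it.
import anthropic

client = anthropic.Anthropic()

STABLE_SYSTEM = "You are a support agent for Acme. Policies: ..."  # imagine several KB

msg = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": "Where is order #1234?"}],
)
print(msg.content[0].text)
```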
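And the batch sketch, using OpenAI's Batch API as the concrete example since its shape (a JSONL file of requests, a 24-hour window, roughly half price) is the pattern other providers follow. File names and prompts are illustrative.

```python
# Half-price offline inference with the OpenAI Batch API:
# write one request per line to a JSONL file, upload, submit, poll.
import json
from openai import OpenAI

client = OpenAI()

rows = [
    {
        "custom_id": f"ticket-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Classify ticket {i}: ..."}],
        },
    }
    for i in range(3)
]
with open("batch.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in rows)

uploaded = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=uploaded.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the standard window
)
print(job.id, job.status)  # poll later with client.batches.retrieve(job.id)
```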
For a deeper read on picking, see How to choose an LLM in 2026: the definitive guide. Or check the pricing comparison when you're ready to put dollar numbers on each row.