Best LLM for OCR and document parsing in 2026: GPT-5.5 wins

By LLMTest Team · Jun 8, 2026 · 7 min read
Tags: use-case, benchmarks, document-parsing, structured-output
On this page

  1. What we tested
  2. Results
  3. Three prompts, up close
  4. Invoice extraction: where GPT-5.5 edges Opus 4.8
  5. Grocery receipt: why Haiku 4.5 costs more than you think
  6. W-9 form: where the budget models diverge
  7. How the judge decided
  8. How this was tested
  9. Subscription vs API
  10. When to pick what

The hard part of an OCR pipeline isn't reading the characters. Tesseract does that, or your PDF library, or whatever vision API you bolt on first. The hard part is what comes next: taking that noisy, error-laden text and getting a consistent, schema-compliant JSON object out the other side, every time, without hallucinating fields or mangling numbers.

We ran four current LLMs through six real-world document extraction tasks to find out which one you should trust at that step. The same structured-output discipline that separates models in SQL generation plays out here, with one extra wrinkle: OCR artifacts that look like intentional abbreviations until you know the document type.

What we tested

Six prompts, each giving the model raw OCR text and asking for a specific JSON schema. No markdown in the output, no prose, just the object:

  1. Grocery receipt: extract line items, subtotal, tax, total from Whole Foods OCR output with F00DS and other zero/letter swaps
  2. B2B vendor invoice: parse vendor, line items with quantities and prices, due date, and tax rate from a scan of an ACME Supplies invoice
  3. Handwritten prescription form: correct OCR artifacts (3->e, 1->i/l) and extract two prescriptions with dosage, sig, and refill count from a doctor's handwritten Rx
  4. Restaurant receipt with auto-gratuity: separate food, bar items, tax, and the 20% auto-gratuity from a Nobu Dallas check
  5. Amazon packing slip: extract order ID, ship-to address, three line items with ASINs, and tracking number
  6. IRS W-9 form: pull EIN, certifier name and title, and tax classification from a form scan containing artifacts such as class1fication and Br0wn

Candidates:

| Model | Tier |
| --- | --- |
| anthropic/claude-opus-4-8 | Frontier |
| openai/gpt-5.5 | Frontier |
| anthropic/claude-haiku-4-5 | Budget |
| openai/gpt-4o-mini | Budget |

Rubric: JSON validity, OCR correction accuracy (no hallucination), schema compliance, numeric precision, no fabricated fields.
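The structural parts of this rubric can be checked mechanically before a judge model is involved. Below is a minimal sketch, not the benchmark's actual scorer: it assumes the grocery-receipt field names quoted in the judge's notes (store_name, store_id, date, items, subtotal, tax, total) and assumes that subtotal plus tax should equal total. The function name `check_receipt` is hypothetical.

```python
import json
from decimal import Decimal

# Field names taken from the grocery-receipt schema quoted in the judge's notes.
EXPECTED_KEYS = {"store_name", "store_id", "date", "items", "subtotal", "tax", "total"}

def check_receipt(raw: str) -> list[str]:
    """Return a list of rubric violations (an empty list means none were found)."""
    try:
        # parse_float=Decimal keeps amounts like 12.50 exact instead of binary floats.
        doc = json.loads(raw, parse_float=Decimal)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(doc, dict):
        return ["top-level value is not an object"]

    problems = []
    missing = EXPECTED_KEYS - set(doc)
    extra = set(doc) - EXPECTED_KEYS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unrequested fields: {sorted(extra)}")
    if not missing:
        amounts = [doc[k] for k in ("subtotal", "tax", "total")]
        if not all(isinstance(a, (int, Decimal)) and not isinstance(a, bool) for a in amounts):
            problems.append("non-numeric amount")
        # Assumption: on a simple receipt, subtotal + tax should equal total.
        elif amounts[0] + amounts[1] != amounts[2]:
            problems.append("subtotal + tax != total")
    return problems
```

OCR correction accuracy and hallucination still need a reference answer or a judge; this only covers validity, schema compliance, and basic numeric consistency.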

Results

| Rank | Model | Wins (of 18) | Avg cost/call | Avg latency |
| --- | --- | --- | --- | --- |
| 1 | GPT-5.5 | 10 | $0.0099 | 6,195 ms |
| 2 | Claude Opus 4.8 | 8 | $0.0109 | 4,980 ms |
| 3 | GPT-4o-mini | 4 | $0.0002 | 5,835 ms |
| 4 | Claude Haiku 4.5 | 1 | $0.0019 | 2,223 ms |

Total runner cost: $0.64 across 4 candidates, 6 prompts, 72 pairwise judge calls.

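Head-to-head matrix (each cell is the row model's wins-losses against the column model; each pairing covers six prompts):

| Row model | vs Opus 4.8 | vs GPT-5.5 | vs GPT-4o-mini | vs Haiku 4.5 |
| --- | --- | --- | --- | --- |
| Opus 4.8 | N/A | 0-3 (3 ties) | 4-1 (1 tie) | 4-0 (2 ties) |
| GPT-5.5 | 3-0 (3 ties) | N/A | 3-0 (3 ties) | 4-0 (2 ties) |
| GPT-4o-mini | 1-4 (1 tie) | 0-3 (3 ties) | N/A | 3-1 (2 ties) |
| Haiku 4.5 | 0-4 (2 ties) | 0-4 (2 ties) | 1-3 (2 ties) | N/A |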

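GPT-5.5 never lost a matchup: it went 3-0 against both Opus 4.8 and GPT-4o-mini and 4-0 against Haiku 4.5, with every other prompt tied. Claude Opus 4.8 was close behind, losing only to GPT-5.5 and once to GPT-4o-mini. The real gap was at the bottom: Haiku 4.5 lost to GPT-4o-mini 3-1 and won a single matchup out of its 18.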
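The win totals in the results table follow directly from the head-to-head records. Here is a quick tally as a cross-check. Each record is (wins, losses, ties) for the first model named, with tie counts taken from the matrix cells that list them; the helper name `total_wins` is just for illustration.

```python
# Head-to-head records copied from the matrix: (wins, losses, ties) for the first model.
HEAD_TO_HEAD = {
    ("GPT-5.5", "Claude Opus 4.8"): (3, 0, 3),
    ("GPT-5.5", "GPT-4o-mini"): (3, 0, 3),
    ("GPT-5.5", "Claude Haiku 4.5"): (4, 0, 2),
    ("Claude Opus 4.8", "GPT-4o-mini"): (4, 1, 1),
    ("Claude Opus 4.8", "Claude Haiku 4.5"): (4, 0, 2),
    ("GPT-4o-mini", "Claude Haiku 4.5"): (3, 1, 2),
}

def total_wins(records):
    """Sum each model's wins across all of its pairings."""
    wins = {}
    for (first, second), (won, lost, _ties) in records.items():
        wins[first] = wins.get(first, 0) + won
        wins[second] = wins.get(second, 0) + lost  # the first model's loss is the second's win
    return wins

# Every pairing covers six prompts, so wins + losses + ties must be 6.
assert all(sum(record) == 6 for record in HEAD_TO_HEAD.values())

wins = total_wins(HEAD_TO_HEAD)
print(wins)
# {'GPT-5.5': 10, 'Claude Opus 4.8': 8, 'GPT-4o-mini': 4, 'Claude Haiku 4.5': 1}
```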

Three prompts, up close

Invoice extraction: where GPT-5.5 edges Opus 4.8

Both frontier models handled INV0lCE, lnvoice, and Handl1ng flawlessly. Both extracted all six line items with correct quantities and prices. The judge noted:

"Both responses demonstrate excellent OCR error correction, properly interpreting 'INV0lCE' as 'INVOICE', 'lnvoice' as 'Invoice', and 'Handl1ng' as 'Handling'. Both responses contain identical and accurate data extraction. All fields are present and complete, numeric values are correctly parsed, and both stick strictly to the document content without fabrication. The JSON in both responses is valid and matches the requested schema perfectly. The only difference between the responses is formatting: Assistant A provides nicely formatted, human-readable JSON with proper indentation and line breaks, while Assistant B provides compact, minified JSON on a single line. However, the user specifically requested 'Output ONLY the JSON object with no markdown formatting' and emphasized extracting structured data for processing purposes, which typically favors compact output."

GPT-5.5 won because it omitted whitespace. Opus 4.8 pretty-printed. Both produced valid JSON, but one interpretation of "no markdown formatting" is "no decorative whitespace either." In a production parser you'd JSON.parse() both identically, but the judge scored strict instruction adherence.
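To make the "parses identically" point concrete, here is a small Python sketch in which `json.loads` plays the role of JSON.parse(). The field names and values are made up for illustration and are not taken from the benchmark invoice.

```python
import json

# Made-up record; not the benchmark's invoice data.
record = {"invoice_number": "INV-1001", "total": 118.5}

pretty = json.dumps(record, indent=2)                # indented, multi-line
compact = json.dumps(record, separators=(",", ":"))  # minified, single line

# Whitespace between tokens is not significant in JSON, so both parse to equal objects.
same_data = json.loads(pretty) == json.loads(compact)
```

The difference only matters to a reader, or to a judge scoring strict instruction adherence, and to byte counts at high volume.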

Grocery receipt: why Haiku 4.5 costs more than you think

Opus 4.8 versus Haiku 4.5 on a Whole Foods receipt:

"Assistant A provides clean, valid JSON without any markdown formatting as explicitly requested. Assistant B wraps the JSON in markdown code blocks, which directly violates the instruction 'Output ONLY the JSON object with no markdown formatting.' Schema Compliance: the user requested a specific schema with fields store_name, store_id, date, items, subtotal, tax, and total. Assistant A follows this schema exactly. Assistant B includes these fields but adds several unrequested fields (location, time, transaction_id, tax_rate), which while potentially useful, goes beyond what was asked for."

Haiku's habit of wrapping output in code blocks showed up in five of six prompts. In a production pipeline, every one of those is a JSON.parse() failure, which means a retry, a repair call, or a dropped document. The hidden cost of JSON retries in LLM apps adds up fast at any real volume.

W-9 form: where the budget models diverge

GPT-4o-mini beat Haiku 4.5 3-1. On the W-9, the judge flagged Haiku for the same pattern: markdown wrapping and extra unrequested fields. GPT-4o-mini's output wasn't perfect either; it sometimes restructured bill_to into a nested object when a flat string was requested, but it followed the no-markdown rule consistently.
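Schema drift of the bill_to kind is easy to miss because the output is still valid JSON. Below is a rough sketch of a type check that would catch it, assuming the prompt asked for bill_to as a flat string; the `EXPECTED_TYPES` map and `find_type_drift` helper are hypothetical.

```python
import json

# Assumption: the prompt asked for bill_to as a flat string.
EXPECTED_TYPES = {"bill_to": str}

def find_type_drift(raw: str, expected=EXPECTED_TYPES) -> dict:
    """Map each drifting field to the Python type name the parsed value actually has."""
    doc = json.loads(raw)
    return {
        field: type(doc[field]).__name__
        for field, wanted in expected.items()
        if field in doc and not isinstance(doc[field], wanted)
    }
```

A nested object in place of the string shows up as `{"bill_to": "dict"}`, which a pipeline can route to a repair step instead of failing downstream.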

How the judge decided

The most important signal across all 36 matchups wasn't OCR correction accuracy. Every model handled the obvious artifacts: 1->i/l, 3->e, 0->O, dropped apostrophes. Even Haiku correctly identified that Amoxicill1n is Amoxicillin and Br0wn is Brown.

The differentiator was instruction following on output format:

  • GPT-5.5: clean JSON, zero markdown wrapping, 0/6 violations
  • Claude Opus 4.8: clean JSON, 0/6 violations
  • GPT-4o-mini: clean JSON, 0/6 violations (occasional schema drift in field structure)
  • Claude Haiku 4.5: markdown wrapping, 5/6 violations

If your parsing code calls JSON.parse() on the raw model output, Haiku 4.5 will fail on roughly five calls out of six (5/6, about 83%, in this run) unless you strip the code fences first. That's a middleware requirement the other three models don't impose.
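The fence-stripping middleware can be very small. Here is a minimal Python sketch, assuming the model wraps the whole response in a single fenced block with an optional language tag; `parse_model_json` is a hypothetical helper name, and `json.loads` stands in for JSON.parse().

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly

# Matches one fenced block with an optional language tag such as "json".
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)

def parse_model_json(raw: str):
    """Parse model output as JSON, unwrapping a surrounding code fence if present."""
    match = _FENCED.match(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload)
```

Unfenced output passes straight through to the parser, so the same wrapper can sit in front of all four models. It does not remove extra unrequested fields; that needs a separate schema check.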

How this was tested

Judge model: anthropic/claude-sonnet-4 with position-swap dual judging (each pair run forward and reverse to cancel order bias). Six prompts, four candidates, 36 pairwise matchups, 72 total judge calls. Total runner cost: $0.64. Full methodology at /docs/benchmarks.
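Position-swap judging can be sketched as follows. This is an illustration under stated assumptions, not the runner's code: the `Judge` callable signature is hypothetical, and the rule that a win counts only when both orderings agree (with any disagreement scored as a tie) is an assumption about how the two verdicts are combined.

```python
from itertools import combinations
from typing import Callable

# Hypothetical judge signature: given (prompt, first_answer, second_answer),
# return "first", "second", or "tie". A real judge would call the judge model.
Judge = Callable[[str, str, str], str]

def dual_judge(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Run the pair in both orders and return "a", "b", or "tie"."""
    forward = judge(prompt, answer_a, answer_b)  # A shown first
    reverse = judge(prompt, answer_b, answer_a)  # B shown first
    forward_pick = {"first": "a", "second": "b"}.get(forward, "tie")
    reverse_pick = {"first": "b", "second": "a"}.get(reverse, "tie")
    # Assumption: a win counts only when both orderings agree; otherwise it is a tie.
    return forward_pick if forward_pick == reverse_pick else "tie"

# The call counts in the text: 4 candidates give 6 pairs, times 6 prompts, times 2 orders.
models = ["claude-opus-4-8", "gpt-5.5", "claude-haiku-4-5", "gpt-4o-mini"]
pairs = list(combinations(models, 2))
matchups = len(pairs) * 6   # 36
judge_calls = matchups * 2  # 72
```

Under this rule, a judge that always prefers whichever answer it sees first produces a tie rather than a win, which is the order bias the swap is meant to cancel.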

Subscription vs API

For a production document-processing pipeline, you need the API. Subscriptions (Claude Max, ChatGPT Pro) cover chat UI access, not API calls. Here's the cost picture at three volumes, extrapolated from actual per-call costs in the runner:

| Volume | GPT-5.5 API | Opus 4.8 API | GPT-4o-mini API | Haiku 4.5 API |
| --- | --- | --- | --- | --- |
| 100 docs/day | ~$30/mo | ~$33/mo | ~$0.60/mo | ~$5.70/mo |
| 1,000 docs/day | ~$297/mo | ~$327/mo | ~$6/mo | ~$57/mo |
| 10,000 docs/day | ~$2,970/mo | ~$3,270/mo | ~$60/mo | ~$570/mo |
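These figures can be reproduced from the average per-call costs in the results table. The sketch below assumes one API call per document and a 30-day month, which is what the numbers above imply; it ignores retries and any variation in document length.

```python
# Average cost per call in USD, from the results table.
COST_PER_CALL = {
    "GPT-5.5": 0.0099,
    "Claude Opus 4.8": 0.0109,
    "GPT-4o-mini": 0.0002,
    "Claude Haiku 4.5": 0.0019,
}

def monthly_cost(model: str, docs_per_day: int, days: int = 30) -> float:
    """Estimated monthly spend, assuming one call per document."""
    return round(COST_PER_CALL[model] * docs_per_day * days, 2)

for volume in (100, 1_000, 10_000):
    row = {model: monthly_cost(model, volume) for model in COST_PER_CALL}
    print(volume, row)
```

Swap in your own measured per-call cost to extrapolate for documents that are longer or shorter than the six test prompts.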

Verify current per-token rates on OpenAI's pricing page and Anthropic's pricing page before budgeting.

The frontier premium is $297 vs $6 at 1,000 docs/day. That's hard to justify unless your error rate from a cheaper model crosses a business threshold. GPT-4o-mini won 4 of 18 matchups, meaning it makes real mistakes on ambiguous schemas and complex multi-item documents. At $6/month, you might absorb a manual-review queue for the failures. At 10,000 docs/day, the calculus shifts.

Both Claude Max ($100-$200/month) and ChatGPT Pro ($200/month) include chat access to the frontier models. If your team does occasional ad-hoc invoice extraction through the chat UI, the subscription math works. For a pipeline, it doesn't.

When to pick what

GPT-5.5 is the call for production pipelines where strict schema compliance matters and you can't afford a format-enforcement middleware layer. It wins the most matchups and outputs clean JSON by default.

Claude Opus 4.8 is worth considering for complex, unusual document types where OCR artifacts are severe or field names are ambiguous. It responds about 20% faster than GPT-5.5 (4,980 ms vs 6,195 ms average), costs about 10% more per call ($0.0109 vs $0.0099), and its OCR correction is equally strong. If you're already using Opus for other tasks in your pipeline, it handles document extraction cleanly.

GPT-4o-mini makes sense for high-volume, budget-sensitive pipelines where documents are structurally predictable: standard invoice layouts, common receipt formats, structured forms. At roughly 50x cheaper than GPT-5.5, a small re-parse budget for the occasional schema drift is still economically favorable.
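The re-parse argument is simple arithmetic. Here is a back-of-the-envelope sketch, assuming each retry costs the same as a first call and ignoring latency and manual-review labor; the function names and the 10% retry rate in the example are illustrative, not measured.

```python
GPT_4O_MINI = 0.0002  # average USD per call, from the results table
GPT_5_5 = 0.0099

def breakeven_calls(cheap_cost: float, premium_cost: float) -> float:
    """Number of cheap-model calls that cost the same as one premium call."""
    return premium_cost / cheap_cost

def cost_with_retries(cost_per_call: float, retry_rate: float) -> float:
    """Expected cost per document if a fraction retry_rate of documents need one extra call."""
    return cost_per_call * (1 + retry_rate)

ratio = breakeven_calls(GPT_4O_MINI, GPT_5_5)             # about 49.5
with_retries = cost_with_retries(GPT_4O_MINI, 0.10)       # illustrative 10% retry rate
```

Even a much higher retry rate leaves GPT-4o-mini far below the GPT-5.5 per-call cost; the real constraint is how many wrong extractions slip through without triggering a retry at all.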

Claude Haiku 4.5 is worth avoiding for JSON extraction unless you add format enforcement middleware. Its 2,223 ms average latency is the fastest in this group, and it competes well on other tasks. But for structured output, its code-block wrapping behavior is a consistent source of parse failures that the other three models simply don't produce.

To run your own document types through all four models at once, sign up for LLMTest. The same runner that produced this benchmark is the product.
