Best LLM for SQL generation in 2026: GPT-4o-mini wins clean

By LLMTest Team · May 1, 2026 · 7 min read · use-case, sql, benchmarks, cost
On this page

  1. The setup: a real schema, six hard prompts
  2. Results
  3. Three examples with judge reasoning
     • Prompt 1: Q1 orders per customer
     • Prompt 6: category revenue share with running total
     • Prompt 4: month-over-month revenue growth
  4. When to use each model
  5. Subscription vs API
  6. How this was tested

The conventional wisdom says you pay for quality. Claude Sonnet 4.5 costs $3 per million input tokens. GPT-4o-mini costs $0.15. We ran six SQL generation tasks across four models on a real PostgreSQL schema, judged every pair with positions swapped to eliminate order bias, and GPT-4o-mini came out on top. Here's what that means for your stack.

The setup: a real schema, six hard prompts

We used a five-table e-commerce schema: customers (with a nullable email column), categories, products, orders, and order_items. The order_items table stores a denormalized unit_price rather than joining back to products, which is common in schemas that need to preserve historical pricing.
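
For orientation, here's a minimal sketch of that schema, reconstructed from the columns the queries in this post actually reference. The benchmark's real DDL is longer; treat this as an approximation, not the test fixture:

-- Reconstructed sketch: names and types inferred from the queries below,
-- not copied from the benchmark's DDL.
CREATE TABLE customers (
    customer_id   SERIAL PRIMARY KEY,
    customer_name TEXT NOT NULL,
    email         TEXT                    -- nullable by design
);

CREATE TABLE categories (
    category_id   SERIAL PRIMARY KEY,
    category_name TEXT NOT NULL
);

CREATE TABLE products (
    product_id    SERIAL PRIMARY KEY,
    category_id   INT REFERENCES categories(category_id)
);

CREATE TABLE orders (
    order_id      SERIAL PRIMARY KEY,
    customer_id   INT REFERENCES customers(customer_id),
    order_date    DATE NOT NULL,
    status        TEXT NOT NULL           -- e.g. 'completed'
);

CREATE TABLE order_items (
    order_id      INT REFERENCES orders(order_id),
    product_id    INT REFERENCES products(product_id),
    quantity      INT NOT NULL,
    unit_price    NUMERIC(10,2) NOT NULL  -- denormalized to preserve historical pricing
);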

Six prompts, in increasing difficulty:

  1. Q1 orders per customer: aggregate with date filtering, descending sort
  2. Products never ordered: LEFT JOIN or NOT EXISTS approach, requires a join to categories (see the sketch after this list)
  3. Top 5 spenders in last 30 days: multi-table join, computed aggregate, nullable email in result
  4. Month-over-month revenue growth, 2025: CTE plus LAG window function, NULL for January's percentage
  5. Customers with orders but no email: aggregate with NULL filtering
  6. Category revenue share with running cumulative total: window functions over a LEFT JOIN aggregate
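
To give a flavor of the difficulty curve, prompt 2 admits both shapes the list mentions. A minimal NOT EXISTS sketch against the schema above (our illustration, not any model's output):

-- Products that appear in no order_items row, with their category
SELECT p.product_id, c.category_name
FROM products p
JOIN categories c ON p.category_id = c.category_id
WHERE NOT EXISTS (
    SELECT 1
    FROM order_items oi
    WHERE oi.product_id = p.product_id
);

The LEFT JOIN variant instead joins order_items and keeps rows where oi.product_id IS NULL; both are correct against this schema.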

Each model received the full schema DDL and one task. The instruction was explicit: return only the SQL, no markdown fences.

Candidates:

  • openai/gpt-4o-mini (the budget tier)
  • anthropic/claude-sonnet-4-5 (Anthropic's mid-range frontier model)
  • openai/gpt-4o (OpenAI's standard frontier)
  • google/gemini-2.5-flash (Google's fast, low-cost alternative)

All calls routed through the LLMTest proxy. Judge model: anthropic/claude-sonnet-4 with dual-position evaluation.

Results

Model                        Wins (of 18 matchups)   Avg cost/call   Avg latency
openai/gpt-4o-mini           9                       $0.00011        1,722 ms
anthropic/claude-sonnet-4-5  8                       $0.00322        11,154 ms
google/gemini-2.5-flash      5                       $0.00046        2,764 ms
openai/gpt-4o                4                       $0.00180        1,628 ms

36 total pairwise results (6 pairs per prompt, 6 prompts). The win counts sum to 26, not 36, because ten matchups ended without a decisive winner across the two judge positions. GPT-4o-mini and Claude Sonnet 4.5 split their direct head-to-head exactly 3-3; GPT-4o-mini's overall lead came from dominating GPT-4o and Gemini. Claude's 8 wins concentrated in prompts where NULL handling and comprehensive result sets mattered.

One practical note before the examples: Claude wrapped its output in markdown fences despite the prompt saying not to. GPT-4o, GPT-4o-mini, and Gemini all returned raw SQL. If you're piping model output directly into a query executor, Claude's formatting habit requires a stripping step in production.

Three examples with judge reasoning

Prompt 1: Q1 orders per customer

GPT-4o-mini (winner):

SELECT c.customer_name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2026-01-01' AND o.order_date < '2026-04-01'
GROUP BY c.customer_name
ORDER BY order_count DESC;

Claude Sonnet 4.5 (lost this round):

SELECT c.customer_name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
  AND o.order_date >= '2026-01-01'
  AND o.order_date < '2026-04-01'
GROUP BY c.customer_id, c.customer_name
HAVING COUNT(o.order_id) > 0
ORDER BY order_count DESC;

Judge reasoning:

"Both queries correctly identify Q1 2026 and use appropriate date filtering. However, Assistant A [Claude] uses a LEFT JOIN with date conditions in the join clause, then filters out customers with zero orders using HAVING COUNT(o.order_id) > 0, which effectively makes it equivalent to an INNER JOIN but with unnecessary complexity. Assistant B [GPT-4o-mini] uses a straightforward INNER JOIN with WHERE clause filtering, which is cleaner and more direct for this specific requirement."

Claude reached for the LEFT JOIN pattern that's useful when you want zero-order customers in the result. The task didn't ask for those, and the HAVING clause patching around them added noise without adding value.

Prompt 6: category revenue share with running total

Here the verdict flipped. The task asked for window functions over a LEFT JOIN aggregate, with a cumulative revenue column sorted by revenue descending.

Claude Sonnet 4.5 (winner) used LEFT JOINs throughout, COALESCE on NULL revenue values from sparse categories, and explicit ROWS BETWEEN in the window spec:

SELECT
  c.category_name,
  COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS total_revenue,
  ROUND(
    COALESCE(SUM(oi.quantity * oi.unit_price), 0) * 100.0 /
    SUM(SUM(oi.quantity * oi.unit_price)) OVER (),
    2
  ) AS percentage_share,
  SUM(SUM(oi.quantity * oi.unit_price)) OVER (
    ORDER BY SUM(oi.quantity * oi.unit_price) DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_revenue
FROM categories c
LEFT JOIN products p ON c.category_id = p.category_id
LEFT JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY c.category_id, c.category_name
ORDER BY total_revenue DESC;

GPT-4o-mini used INNER JOINs, an approach that silently drops any category with zero orders.
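
Reconstructed for illustration (not the model's verbatim output), the INNER JOIN shape looks roughly like this:

SELECT
    c.category_name,
    SUM(oi.quantity * oi.unit_price) AS total_revenue,
    ROUND(SUM(oi.quantity * oi.unit_price) * 100.0 /
          SUM(SUM(oi.quantity * oi.unit_price)) OVER (), 2) AS percentage_share,
    SUM(SUM(oi.quantity * oi.unit_price)) OVER (
        -- default RANGE frame; Claude's explicit ROWS BETWEEN avoids tie ambiguity
        ORDER BY SUM(oi.quantity * oi.unit_price) DESC
    ) AS cumulative_revenue
FROM categories c
JOIN products p ON c.category_id = p.category_id      -- INNER: empty categories vanish here
JOIN order_items oi ON p.product_id = oi.product_id   -- INNER: never-ordered products vanish here
GROUP BY c.category_id, c.category_name
ORDER BY total_revenue DESC;

Any category with no matching order_items rows disappears before the GROUP BY even runs, which is exactly the failure the judge flagged.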

Judge reasoning:

"Assistant A [Claude] uses LEFT JOINs which correctly handles categories that might have no products or products with no orders, ensuring all categories appear in the results. It also uses COALESCE to handle NULL values that could arise from the LEFT JOINs, preventing issues with percentage calculations. Assistant B [GPT-4o-mini] uses INNER JOINs which will exclude categories with no orders — a significant data integrity issue for reporting queries."

When the task requires showing absence (categories with zero revenue), INNER JOINs hide it. Claude's instinct for LEFT JOINs on analytics queries is the right one.

Prompt 4: month-over-month revenue growth

Both models reached for CTEs with LAG window functions for the percentage calculation. GPT-4o-mini used DATE_TRUNC for month grouping and a precise open-ended date range for the year filter:

WHERE o.status = 'completed'
  AND o.order_date >= '2025-01-01'
  AND o.order_date < '2026-01-01'

Claude Sonnet 4.5 used EXTRACT(YEAR FROM o.order_date) = 2025 and formatted months as strings with TO_CHAR directly in the GROUP BY. That is correct, but it groups on a VARCHAR rather than a timestamp, which complicates downstream ordering or joins. GPT-4o-mini won this one on date-handling precision, though the margin was narrow.
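
Assembled from the pieces described above, the winning shape looks roughly like this (a sketch, not GPT-4o-mini's verbatim output):

WITH monthly AS (
    SELECT
        DATE_TRUNC('month', o.order_date) AS month,
        SUM(oi.quantity * oi.unit_price) AS revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    WHERE o.status = 'completed'
      AND o.order_date >= '2025-01-01'
      AND o.order_date < '2026-01-01'
    GROUP BY DATE_TRUNC('month', o.order_date)
)
SELECT
    month,
    revenue,
    ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0 /
          LAG(revenue) OVER (ORDER BY month), 2) AS growth_pct  -- NULL for January: no prior month
FROM monthly
ORDER BY month;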

When to use each model

GPT-4o-mini is the right default for production SQL generation on standard reporting queries: aggregates, filters, straightforward joins, date math. It writes clean INNER JOIN syntax and, at $0.00011 per call, costs about 29x less than Claude Sonnet 4.5 on the same prompts. Start here, and only route up if you hit edge cases.

Claude Sonnet 4.5 is the right choice when your schema has meaningful NULLs or when reporting queries must surface zero-activity rows (categories with no sales, products with no orders). The LEFT JOIN instinct and COALESCE discipline pay for themselves on analytics over sparse data. It also averaged 11 seconds per response versus 1.7 seconds for GPT-4o-mini; factor that into latency budgets. If you want to shrink Claude's cost through caching on shared system prompts or schemas, the prompt caching break-even analysis walks through when caching covers the write cost.

Gemini 2.5 Flash is a solid middle path: fast at 2.7 seconds, cheap at $0.00046 per call, and correct on most prompts. It trailed on the most complex window function queries but tracked GPT-4o-mini closely everywhere else. Worth a spot in a tiered routing setup.

GPT-4o finished last despite being positioned as a stronger model, losing to both budget alternatives on three of six prompts. This isn't a verdict on GPT-4o in general; benchmark scope matters. If you're routing SQL generation through it on the assumption that more expensive is better, our model selection guide covers why that heuristic breaks down in practice.

Subscription vs API

All four providers also sell consumer and IDE subscriptions:

Model               API price (input/output per M tokens)   Subscription access
gpt-4o-mini         $0.15 / $0.60                            ChatGPT Plus ($20/mo) or Pro ($200/mo)
gpt-4o              $2.50 / $10.00                           ChatGPT Plus ($20/mo) or Pro ($200/mo)
claude-sonnet-4-5   $3.00 / $15.00                           Claude Pro ($20/mo) or Max ($100-200/mo), includes Claude Code
gemini-2.5-flash    $0.15 / $0.60 (non-thinking)             Google One AI Premium ($19.99/mo)

For personal development work using Claude Code or Cursor, the subscription almost always wins: $20/month buys effectively unmetered interactive usage within the plan's rate limits. $20 in API credits for claude-sonnet-4-5 covers about 6,200 calls at the cost we observed here ($0.00322/call), or roughly 207 calls per day over a 30-day month. Many individual developers blow through that in a single coding session.

For a product making API calls on behalf of users, the calculus flips. At 10,000 calls per day and the per-call costs observed here, claude-sonnet-4-5 runs about $970/month versus roughly $33/month for gpt-4o-mini. If your SQL queries are standard joins and aggregates, start with gpt-4o-mini and route complex analytics queries selectively to Claude.

Verify current pricing before committing: Anthropic, OpenAI, Google AI.

How this was tested

Six SQL generation prompts across a five-table PostgreSQL schema with nullable columns and denormalized pricing. Four candidates: openai/gpt-4o-mini, anthropic/claude-sonnet-4-5, openai/gpt-4o, google/gemini-2.5-flash. Every pairwise combination judged by anthropic/claude-sonnet-4 with positions swapped to eliminate order bias: 36 pairwise results, 72 total judge calls. Total runner cost: $0.4788.

No few-shot examples. No chain-of-thought prompting. Each model received schema DDL plus a natural-language task. Full methodology at /docs/benchmarks.

Different schemas, SQL dialects, or prompt styles will shift the rankings. If your workload uses stored procedures, dialect-specific syntax, or hundred-table schemas, the only reliable number is one you measure yourself. Run your own SQL prompts through the LLMTest runner and see what your data says.

Ship LLM features without burning your budget.

LLMTest proxies your OpenAI / Anthropic calls, tracks cost per feature, and auto-rewrites prompts to be cheaper while holding quality. Free to start.

Create a free account

Related reading

DeepSeek V4 Pro review: beats GPT-5.5 and costs a fifth of Opus 4.7
Apr 29, 2026 · 6 min read
Claude Opus 4.7: genuine coding gains, hidden cost sting
Apr 21, 2026 · 5 min read
Prompt caching breaks even at 1.3 requests. Here's the math.
Apr 27, 2026 · 5 min read