Claude Opus 4.7 and GPT-5.5 cost the same per input token ($5/M). One of them is measurably better at French translation. The other costs about 6x more per call than Gemini 2.5 Flash, which matches it win for win across our test set. GPT-4o-mini (the model that beats pricier options on SQL and code tasks) finishes with zero wins and a grammar error in plain sight. We ran four models through six French translation tasks, judged every pair with positions swapped to reduce order bias, and here's what the numbers show.
The setup: six tasks, four candidates
Six prompts covering the dimensions French translation actually demands:
- False cognates email: a casual business email seeded with "actually," "eventually," "sensible," and "demand" to catch models that translate by surface similarity rather than meaning
- Idiom conversion: a startup message dense with English idioms ("get the ball rolling," "on the back burner," "bite the bullet," "beat around the bush") requiring French equivalents, not literal renders
- Literary register: an August passage requiring literary French: the right verb forms, the languid rhythm, the personification of winter
- Technical API documentation: rate limits, exponential backoff, max_retries; tests precision and the handling of established technical terms
- Formal French to English: a rejection letter with phrases like "mûre réflexion" and "donner suite à" that must land as natural professional English
- SaaS marketing copy: French tech marketing has its own conventions; "Shippez" is accepted, "cramer son budget" is idiomatic; English patterns transplanted directly feel foreign
Candidates:
- anthropic/claude-opus-4-7 (Anthropic's frontier model)
- openai/gpt-5.5 (OpenAI's current frontier)
- google/gemini-2.5-flash (Google's fast, lower-cost option)
- openai/gpt-4o-mini (the budget tier)
All calls routed through the LLMTest proxy. Judge: anthropic/claude-sonnet-4 with position-swapped dual evaluation to eliminate presentation bias.
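To make the position-swapped evaluation concrete, here is a minimal sketch of how two judge passes could be combined. The `Judge` interface, the numeric scores, and the summing rule are assumptions for illustration; the article does not specify how LLMTest combines the forward and reverse verdicts.

```python
from typing import Callable, Tuple

# Hypothetical judge interface (not LLMTest's actual API): given a prompt and
# two answers in display order, return a (first_score, second_score) tuple.
Judge = Callable[[str, str, str], Tuple[float, float]]


def judge_pair(judge: Judge, prompt: str, answer_a: str, answer_b: str,
               tie_margin: float = 0.0) -> str:
    """Score a pair in both display orders and return 'A', 'B', or 'tie'."""
    fwd_first, fwd_second = judge(prompt, answer_a, answer_b)  # A shown first
    rev_first, rev_second = judge(prompt, answer_b, answer_a)  # B shown first

    # Assumed combination rule: credit each candidate with its score from
    # both positions, so a purely positional preference cancels out.
    score_a = fwd_first + rev_second
    score_b = fwd_second + rev_first

    if abs(score_a - score_b) <= tie_margin:
        return "tie"
    return "A" if score_a > score_b else "B"


# A toy judge that always favors whichever answer it sees first.
def biased(prompt: str, first: str, second: str) -> Tuple[float, float]:
    return (8.0, 6.0)


print(judge_pair(biased, "Translate into French.", "réponse A", "réponse B"))  # tie
```

With a judge that only rewards position, the two passes cancel and the pair comes out as a tie, which is the behavior the swap is meant to produce.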
Results
| Model | Wins (18 matchups per model) | Avg cost/call | Avg latency |
|---|---|---|---|
| anthropic/claude-opus-4-7 | 10 | $0.0121 | 9,497 ms |
| openai/gpt-5.5 | 7 | $0.0076 | 5,940 ms |
| google/gemini-2.5-flash | 7 | $0.0012 | 2,916 ms |
| openai/gpt-4o-mini | 0 | $0.0001 | 2,743 ms |
36 total pairwise matchups (6 candidate pairs × 6 prompts), 72 judge calls. 12 ties in total.
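The matchup arithmetic can be checked with a few lines of Python. This is only a counting sketch using the candidate names listed earlier; it does not call any model.

```python
from itertools import combinations

candidates = [
    "anthropic/claude-opus-4-7",
    "openai/gpt-5.5",
    "google/gemini-2.5-flash",
    "openai/gpt-4o-mini",
]
num_prompts = 6

pairs = list(combinations(candidates, 2))  # unordered pairs: C(4, 2) = 6
matchups = len(pairs) * num_prompts        # every pair is judged on every prompt
judge_calls = matchups * 2                 # forward pass plus position-swapped pass

print(len(pairs), matchups, judge_calls)   # 6 36 72
```

Each model appears in three of the six pairs, so it takes part in 18 of the 36 matchups.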
The headline number isn't Claude's lead over GPT-5.5: it's that Gemini 2.5 Flash ties GPT-5.5 at 7 wins while costing 6x less per call. And GPT-4o-mini, which won our SQL generation benchmark, fails to win a single matchup on any of the six translation prompts. The task matters more than the model tier.
Head-to-head breakdown:
| Matchup | Wins (A vs B) | Ties |
|---|---|---|
| Claude vs GPT-5.5 | 1 – 1 | 4 |
| Claude vs Gemini 2.5 Flash | 3 – 1 | 2 |
| Claude vs GPT-4o-mini | 6 – 0 | 0 |
| GPT-5.5 vs Gemini 2.5 Flash | 2 – 2 | 2 |
| GPT-5.5 vs GPT-4o-mini | 4 – 0 | 2 |
| Gemini 2.5 Flash vs GPT-4o-mini | 4 – 0 | 2 |
Claude and GPT-5.5 directly tied 4 of 6 prompts. Claude's overall win lead comes entirely from stronger performance against Gemini and GPT-4o-mini, not from beating GPT-5.5.
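As a consistency check, the per-model win totals can be rebuilt from the head-to-head rows. The sketch below copies the numbers from the table; the tuple layout is just a convenient representation chosen for this example.

```python
from collections import Counter

# (model_a, model_b, wins_a, wins_b, ties), copied from the head-to-head table.
head_to_head = [
    ("Claude", "GPT-5.5", 1, 1, 4),
    ("Claude", "Gemini 2.5 Flash", 3, 1, 2),
    ("Claude", "GPT-4o-mini", 6, 0, 0),
    ("GPT-5.5", "Gemini 2.5 Flash", 2, 2, 2),
    ("GPT-5.5", "GPT-4o-mini", 4, 0, 2),
    ("Gemini 2.5 Flash", "GPT-4o-mini", 4, 0, 2),
]

wins = Counter()
total_ties = 0
for model_a, model_b, wins_a, wins_b, ties in head_to_head:
    # Every pair is judged on six prompts, so each row must sum to 6.
    assert wins_a + wins_b + ties == 6
    wins[model_a] += wins_a
    wins[model_b] += wins_b
    total_ties += ties

print(dict(wins))
print(total_ties)  # 12
```

The totals come out to 10, 7, 7, and 0 wins with 12 ties, matching the results table, and Claude's 3-win lead over GPT-5.5 is accounted for by its matchups against Gemini 2.5 Flash and GPT-4o-mini.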
Three prompts with judge reasoning
Prompt 2 (idioms): the grammar test and the idiom gap
The idiom prompt asked for French equivalents of five English idioms. GPT-4o-mini opened with a grammar error: "tenir tous les parties prenantes informées" (parties is feminine, requiring toutes). Claude rendered the same phrase correctly and found "en veilleuse" (literally "on the pilot light") for "on the back burner", a more natural French idiom than GPT-4o-mini's "mis sur la glace."
The judge on the forward evaluation:
"Assistant A provides a more comprehensive and accurate translation with better idiom adaptation. Key strengths include using 'lancer la machine' for 'get the ball rolling,' 'en veilleuse' for 'on the back burner' (which is a perfect idiomatic match), 'prendre le taureau par les cornes' for 'bite the bullet,' and 'flancher' for 'drop the ball.' The translation flows naturally and maintains appropriate register."
Combined winner: Claude. The grammar error was disqualifying for GPT-4o-mini.
Prompt 3 (literary passage): the personification of winter
The August passage required literary French: "l'on" instead of "on," subjunctive verb forms, a rhythm that doesn't feel translated. Claude rendered "things you would never permit in winter" as "que l'hiver ne tolérerait jamais" (winter as an agent). GPT-5.5 used "que l'on ne s'autoriserait jamais en hiver," grammatically equivalent, but the seasonal personification is gone.
The forward judge:
"Assistant A shows superior command of French literary style with choices like 's'écoule autrement' (flows differently) instead of a more literal rendering, and 'que l'hiver ne tolérerait jamais' which elegantly personifies winter rather than using a mechanical translation. The phrase 'partout où elles veulent bien nous mener' captures the languid, permissive quality of August conversations beautifully."
The reverse evaluation found GPT-5.5's version slightly better. The two passes disagreed, yielding a contested verdict that went to Claude under our combined-score methodology. Combined winner: Claude (narrow margin).
Prompt 5 (formal French to English): where GPT-5.5 edges Claude
The formal rejection letter asked for natural professional English from French. The phrase "vos démarches" is a French catch-all for professional pursuits, broader than "your job search." Claude rendered it as "job search." GPT-5.5 chose "future endeavors," which is more faithful to the scope. GPT-5.5 also translated "donner suite à votre demande" as "take your application further" versus Claude's "move forward with your application," a smaller difference, but "take further" is the more idiomatic phrase in formal rejection language.
The judge:
"Assistant A opens with 'Dear Applicant' which is more natural and specific than Assistant B's 'Dear Sir or Madam.' However, Assistant B better captures 'donner suite à votre demande' with 'take your application further,' which is more natural than A's 'move forward with your application.' Assistant B also translates 'vos démarches' more appropriately as 'future endeavors' rather than the overly specific 'job search,' since the French term is broader."
Combined winner: GPT-5.5. On formal document translation, the frontier models differ mainly in their feel for register, not accuracy.
When to use each model
Claude Opus 4.7 is the right pick when translation quality has direct business impact: marketing copy, literary content, anything where a stilted phrase or a false cognate would reach a reader. It's the clearest winner on idiom-heavy text and literary register. The trade-off is latency (9.5 seconds average) and cost: at $0.0121/call, a 10,000-request/month translation feature costs about $121.
GPT-5.5 performs essentially identically to Claude head to head: four of the six prompts tied and each model won one. Its one specific edge is on formal document translation, where its feel for professional English register is slightly tighter. At $0.0076/call and 5.9 seconds average latency, it's 37% cheaper and 37% faster than Claude for near-identical quality on most tasks. For a product that needs reliable professional translation without the literary requirements, GPT-5.5 is worth considering.
Gemini 2.5 Flash is the value pick for any use case where volume matters. It takes 7 wins (the same as GPT-5.5) at $0.0012/call. A 10,000-call/month translation pipeline costs $12 versus $76 for GPT-5.5 or $121 for Claude. Latency is 2.9 seconds, the fastest of the three models that won any matchup (GPT-4o-mini is marginally quicker at 2.7 seconds). The quality gap relative to GPT-5.5 in direct matchups is within noise (2-2-2 across six prompts). For localization pipelines, customer-facing multilingual content at scale, or any scenario where translation volume is the main constraint, Gemini 2.5 Flash is the model to route to. Our code review benchmark showed Gemini finishing behind Claude and Haiku on that task; translation reverses the story entirely.
GPT-4o-mini should not be used for French translation. The grammar error in the idiom prompt alone would disqualify it for any production use. This isn't a general verdict on the model: it excels on SQL generation and similarly structured tasks. But translation is a domain where the gap between frontier and budget tiers is real and visible.
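The monthly figures and percentage gaps quoted in this section follow directly from the results table. A small sketch of that arithmetic, assuming a flat 10,000 calls per month and the average per-call numbers measured here (your own prompts will have different token counts):

```python
# Average cost per call (USD) and average latency (ms), from the results table.
results = {
    "anthropic/claude-opus-4-7": (0.0121, 9497),
    "openai/gpt-5.5": (0.0076, 5940),
    "google/gemini-2.5-flash": (0.0012, 2916),
    "openai/gpt-4o-mini": (0.0001, 2743),
}


def monthly_cost(cost_per_call: float, calls_per_month: int) -> float:
    """Projected monthly spend in USD, rounded to cents."""
    return round(cost_per_call * calls_per_month, 2)


def percent_lower(value: float, baseline: float) -> float:
    """How much lower `value` is than `baseline`, as a percentage."""
    return (1 - value / baseline) * 100


for model, (cost, latency_ms) in results.items():
    print(f"{model}: ${monthly_cost(cost, 10_000):.2f}/month, {latency_ms} ms")

claude_cost, claude_latency = results["anthropic/claude-opus-4-7"]
gpt_cost, gpt_latency = results["openai/gpt-5.5"]
print(f"GPT-5.5 vs Claude: {percent_lower(gpt_cost, claude_cost):.0f}% cheaper, "
      f"{percent_lower(gpt_latency, claude_latency):.0f}% lower latency")
```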
Subscription vs API
All three providers sell subscriptions that give access to their models without per-token API billing:
| Model | API price (input / output per M) | Subscription |
|---|---|---|
| claude-opus-4-7 | $5 / $25 | Claude Pro $20/mo, Max $100–200/mo (includes Claude Code) |
| gpt-5.5 | $5 / $30 | ChatGPT Plus $20/mo, Pro $200/mo |
| gemini-2.5-flash | $0.30 / $2.50 | Google One AI Premium $20/mo |
| gpt-4o-mini | $0.15 / $0.60 | ChatGPT Plus $20/mo |
Consumer subscriptions give you the chat UI and model access for personal work; the API is for building features into your own product. If you're routing translation calls through code, subscriptions don't apply; you pay per token.
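For personal use, the break-even on Claude is roughly 1,650 translation calls per month at the $0.0121 average cost per call measured here. Below that volume, pay-per-token API access costs less than a $20 subscription; above it, the subscription is the cheaper way to translate by hand. For a product making 10,000+ calls per month, the economics shift sharply: $121/month for Claude versus $12/month for Gemini 2.5 Flash.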
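The break-even point is a single division. The sketch below assumes the $20/month subscription price from the table and the $0.0121 average per-call cost from the results; the function name is ours, and real per-call costs depend on prompt and output length.

```python
def break_even_calls(subscription_usd: float, cost_per_call_usd: float) -> float:
    """Number of API calls whose total cost equals the subscription price."""
    return subscription_usd / cost_per_call_usd


calls = break_even_calls(20.0, 0.0121)
print(round(calls))  # 1653
```

Below that volume the per-token API bill stays under $20. The comparison only applies to personal use, since subscriptions cannot be called from code.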
Verify current pricing before committing, as rates change: Anthropic, OpenAI, Google AI.
How this was tested
Six French translation prompts across four candidate models: anthropic/claude-opus-4-7, openai/gpt-5.5, google/gemini-2.5-flash, openai/gpt-4o-mini. Each candidate pair was judged by anthropic/claude-sonnet-4 with positions swapped: 36 pairwise comparisons, 72 judge calls total. Custom rubric weighted on accuracy, naturalness, register, and idiom handling, with a specific penalty for false cognate errors. No few-shot examples, no chain-of-thought prompting. Each model saw the same prompt, unaltered. Total runner cost: $0.59.
French prose tokenizes at roughly 15-20% above English rates, worth factoring into cost estimates if you're running high-volume translation. Our token-to-word conversion guide covers the language-by-language ratios. Full benchmark methodology at /docs/benchmarks.
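To factor the tokenization overhead into an estimate, scale an English token count by the 15-20% range before applying the output price. In the sketch below, the 400-token English baseline is a made-up example value, and the $25/M output price is the claude-opus-4-7 rate from the pricing table.

```python
def french_output_cost(english_tokens: int, price_per_million_usd: float,
                       overhead: float) -> float:
    """Estimated output cost in USD once French token overhead is applied."""
    french_tokens = english_tokens * (1 + overhead)
    return french_tokens * price_per_million_usd / 1_000_000


ENGLISH_TOKENS = 400  # hypothetical length of the equivalent English output
OUTPUT_PRICE = 25.0   # USD per million output tokens (claude-opus-4-7, per the table)

low = french_output_cost(ENGLISH_TOKENS, OUTPUT_PRICE, 0.15)
high = french_output_cost(ENGLISH_TOKENS, OUTPUT_PRICE, 0.20)
print(f"${low:.4f} to ${high:.4f} per call")  # $0.0115 to $0.0120 per call
```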
The rankings reflect this prompt set and rubric. Different content types (legal documents, poetry, highly technical material) may shift the standings. To test your own translation prompts across these models, run them through LLMTest and see what your specific inputs actually produce.