LLMTest

Benchmarks

LLMTest benchmarks use pairwise comparison with position swap — the same method used by Chatbot Arena and MT-Bench — to find models that match or beat your current one.

How it works

  1. Samples collected — either from your real traffic or seeded manually via seed_samples. Up to 10 samples per benchmark.
  2. Challengers selected — based on your optimize_for goal, the system picks up to 5 relevant alternatives from 340+ models. Smart selection considers price, provider diversity, and model freshness.
  3. Baseline runs — your current model generates outputs for all samples.
  4. Challengers run in parallel — each challenger generates outputs for the same samples, concurrently.
  5. Pairwise judging — an AI judge (Claude Sonnet) compares each challenger's output against the baseline in pairs. Each comparison runs twice with positions swapped (A/B then B/A) to control for first-position bias; a minimal sketch of this loop follows the list.
  6. Results ranked — win/loss/tie records, cost savings, and latency deltas are computed and ranked by your optimization goal.
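
Step 5 is the part that is easiest to get wrong, so here is a minimal runnable sketch of the swap-and-aggregate loop. One caveat: LLMTest's exact aggregation policy isn't stated above, so this follows the common MT-Bench convention of counting a win only when the same model wins both orderings and scoring disagreements as ties. The judge_stub is a stand-in for the real LLM judge (Claude Sonnet in LLMTest) so the sketch runs on its own.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]
Judge = Callable[[str, str, str], Verdict]

def judge_stub(prompt: str, out_a: str, out_b: str) -> Verdict:
    # Stand-in for the real LLM judge: just prefers the longer answer,
    # which is enough to exercise the swap logic below.
    if len(out_a) != len(out_b):
        return "A" if len(out_a) > len(out_b) else "B"
    return "tie"

def compare_with_swap(prompt: str, baseline: str, challenger: str,
                      judge: Judge) -> Verdict:
    """Judge twice with positions swapped (A/B then B/A).

    Returns "A" if the challenger wins both orderings, "B" if the
    baseline wins both, and "tie" otherwise (an explicit tie, or the
    two verdicts disagreeing, which signals position bias).
    """
    first = judge(prompt, challenger, baseline)   # challenger in slot A
    second = judge(prompt, baseline, challenger)  # challenger in slot B
    if first == "A" and second == "B":
        return "A"
    if first == "B" and second == "A":
        return "B"
    return "tie"

# Tally one challenger's record over a handful of samples
# (illustrative strings, not real model outputs).
samples = [
    ("prompt 1", "baseline answer", "challenger answer that is longer"),
    ("prompt 2", "a detailed baseline reply here", "short"),
    ("prompt 3", "same length", "same length"),
]
wins = ties = losses = 0
for prompt, base_out, chal_out in samples:
    verdict = compare_with_swap(prompt, base_out, chal_out, judge_stub)
    wins += verdict == "A"
    ties += verdict == "tie"
    losses += verdict == "B"
print(f"won {wins}, tied {ties}, lost {losses} of {len(samples)}")
# -> won 1, tied 1, lost 1 of 3
```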

Optimization goals

Set via the optimize_for parameter:

cost
Finds cheaper models. Selects challengers with lower prices. Ranks by cost savings.

quality
Finds better models. Includes pricier models. Ranks by win rate.

speed
Finds faster models. Prioritizes fast inference providers (Groq, Fireworks, Together). Ranks by latency.

balanced
Default. Tests a diverse spread of challengers across price tiers. Ranks by overall value.
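
To make the "ranks by" lines concrete, here is a sketch of plausible sort keys for each goal. The record field names and the equal-weight "balanced" blend are assumptions for illustration, not LLMTest's exact scoring; the numbers are made up.

```python
# Illustrative records: higher cost_savings_pct / latency_gain_ms means
# cheaper / faster than the baseline.
results = [
    {"model": "model-a", "win_rate": 0.70, "cost_savings_pct": 80, "latency_gain_ms": 718},
    {"model": "model-b", "win_rate": 0.55, "cost_savings_pct": 20, "latency_gain_ms": 90},
]

sort_keys = {
    "cost":    lambda r: r["cost_savings_pct"],
    "quality": lambda r: r["win_rate"],
    "speed":   lambda r: r["latency_gain_ms"],
    # "balanced" needs a composite score; an equal-weight blend of
    # roughly normalized terms is one simple choice.
    "balanced": lambda r: (r["win_rate"]
                           + r["cost_savings_pct"] / 100
                           + r["latency_gain_ms"] / 1000),
}

for goal, key in sort_keys.items():
    ranked = sorted(results, key=key, reverse=True)  # best first
    print(f"{goal:8s} -> {[r['model'] for r in ranked]}")
```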

Reading results

Each challenger gets a record like:

qwen/qwen3.5-flash: won 7, tied 3, lost 0 of 10 | 80% cheaper, 718ms faster
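
Reading left to right: win/tie/loss counts out of the total samples judged, then cost and latency deltas relative to the baseline. If you want to post-process these records, a small parser is sketched below. It assumes the exact shape of the line above; real output may phrase the deltas differently (e.g. for challengers that are slower or pricier), so treat the regex as a starting point.

```python
import re
from dataclasses import dataclass

@dataclass
class ChallengerRecord:
    model: str
    wins: int
    ties: int
    losses: int
    total: int
    cost_savings_pct: float  # percent cheaper than the baseline
    latency_gain_ms: float   # milliseconds faster than the baseline

    @property
    def win_rate(self) -> float:
        return self.wins / self.total

RECORD_RE = re.compile(
    r"(?P<model>\S+): won (?P<w>\d+), tied (?P<t>\d+), lost (?P<l>\d+) "
    r"of (?P<n>\d+) \| (?P<cost>\d+)% cheaper, (?P<lat>\d+)ms faster"
)

def parse_record(line: str) -> ChallengerRecord:
    m = RECORD_RE.fullmatch(line)
    if m is None:
        raise ValueError(f"unrecognized record: {line!r}")
    return ChallengerRecord(m["model"], int(m["w"]), int(m["t"]),
                            int(m["l"]), int(m["n"]),
                            float(m["cost"]), float(m["lat"]))

rec = parse_record("qwen/qwen3.5-flash: won 7, tied 3, lost 0 of 10 "
                   "| 80% cheaper, 718ms faster")
print(f"{rec.model}: {rec.win_rate:.0%} win rate")  # 70% win rate
```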

Pre-launch vs post-launch

Before launch, when you have no real traffic, seed the benchmark manually with seed_samples and iterate on model choice early. Once you're live, samples come from your real traffic, so results reflect the prompts your users actually send.

Testing specific models

Want to test a specific model against your baseline? Use the challengers parameter:

"Benchmark my code-reviewer flow against claude-sonnet-4 and gpt-4o-mini"