LLMTest benchmarks use pairwise comparison with position swap — the same method used by Chatbot Arena and MT-Bench — to find models that match or beat your current one.
How it works
Samples collected — either from your real traffic or seeded manually via seed_samples. Up to 10 samples per benchmark.
Challengers selected — based on your optimize_for goal, the system picks up to 5 relevant alternatives from 340+ models. Smart selection considers price, provider diversity, and model freshness.
Baseline runs — your current model generates outputs for all samples.
Challengers run in parallel — each challenger generates outputs for the same samples, concurrently.
Pairwise judging — an AI judge (Claude Sonnet) compares each challenger's output against the baseline in pairs. Each comparison runs twice with positions swapped (A/B then B/A) to eliminate first-position bias.
Results ranked — win/loss/tie records, cost savings, and latency deltas are computed and ranked by your optimization goal.
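The six steps above can be sketched in code. This is a minimal illustration, not the actual implementation: the function names, callable shapes, and verdict labels are assumptions; the real system calls model and judge APIs rather than plain functions.

```python
import concurrent.futures

def pairwise_judge(judge, prompt, baseline_out, challenger_out):
    # Judge twice with positions swapped (A/B then B/A) to cancel first-position bias.
    first = judge(prompt, a=challenger_out, b=baseline_out)
    second = judge(prompt, a=baseline_out, b=challenger_out)
    if first == "a" and second == "b":
        return "win"    # challenger preferred from both positions
    if first == "b" and second == "a":
        return "loss"   # baseline preferred from both positions
    return "tie"        # equivalent quality, or the verdict flipped with position

def run_pairwise_benchmark(samples, baseline_model, challengers, judge):
    # Step: baseline runs once over all samples.
    baseline_outputs = [baseline_model(s) for s in samples]
    results = {}
    # Step: challengers generate outputs for the same samples, concurrently.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {
            pool.submit(lambda m=m: [m(s) for s in samples]): name
            for name, m in challengers.items()
        }
        for fut in concurrent.futures.as_completed(futures):
            name = futures[fut]
            verdicts = [
                pairwise_judge(judge, s, b, c)
                for s, b, c in zip(samples, baseline_outputs, fut.result())
            ]
            results[name] = {v: verdicts.count(v) for v in ("win", "tie", "loss")}
    return results
```

Note that a "win" requires the challenger to be preferred from both positions; a single flipped verdict collapses to a tie, which is what makes the position swap robust against first-position bias.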
Optimization goals
cost (optional)
Finds cheaper models. Selects challengers with lower prices. Ranks by cost savings.
quality (optional)
Finds better models. Includes pricier models. Ranks by win rate.
speed (optional)
Finds faster models. Prioritizes fast inference providers (Groq, Fireworks, Together). Ranks by latency.
balanced (optional)
Default. Tests a diverse spread of challengers across price tiers. Ranks by overall value.
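The goals above differ only in how results are ordered. A sketch of that ranking step, assuming hypothetical record fields (`win_rate`, `cost_savings`, `latency_delta_ms`) and a simple composite for "overall value" — the real schema and weighting may differ:

```python
def rank_challengers(records, optimize_for="balanced"):
    """Order challenger records by goal. Field names here are
    illustrative, not the system's real schema."""
    sort_key = {
        "cost": lambda r: r["cost_savings"],                       # biggest savings first
        "quality": lambda r: r["win_rate"],                        # highest win rate first
        "speed": lambda r: r["latency_delta_ms"],                  # biggest speedup first
        "balanced": lambda r: (r["win_rate"], r["cost_savings"]),  # one plausible "overall value"
    }[optimize_for]
    return sorted(records, key=sort_key, reverse=True)
```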
Reading results
Each challenger gets a record like:
qwen/qwen3.5-flash: won 7, tied 3, lost 0 of 10 | 80% cheaper, 718ms faster
Wins — the challenger produced a better response than your baseline
Ties — quality was equivalent, or the judge's verdict flipped when positions were swapped
Losses — the baseline was better
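As a concrete reading of the record format, here is a small helper that reproduces the summary line shown above from its parts; the field names are illustrative:

```python
def format_record(name, wins, ties, losses, cheaper_pct, faster_ms):
    # Builds a summary line like the example above from a challenger's stats.
    total = wins + ties + losses
    return (f"{name}: won {wins}, tied {ties}, lost {losses} of {total} | "
            f"{cheaper_pct}% cheaper, {faster_ms}ms faster")
```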
Pre-launch vs post-launch
Pre-launch: No real traffic yet? Use seed_samples to create test prompts, then call run_benchmark with currentModel set to the model you plan to use.
Post-launch: The proxy captures real prompts. The system auto-detects your current model and benchmarks against alternatives using your actual traffic patterns.
Testing specific models
Want to test a specific model against your baseline? Use the challengers parameter:
"Benchmark my code-reviewer flow against claude-sonnet-4 and gpt-4o-mini"