LLMTest

Benchmarks

LLMTest benchmarks use pairwise comparison with position swap — the same method used by Chatbot Arena and MT-Bench — to find models that match or beat your current one.

How it works

  1. Samples collected — either from your real traffic or seeded manually via seed_samples. Up to 10 samples per benchmark.
  2. Challengers selected — based on your optimize_for goal, the system picks up to 5 relevant alternatives from 340+ models. Smart selection considers price, provider diversity, and model freshness.
  3. Baseline runs — your current model generates outputs for all samples.
  4. Challengers run in parallel — each challenger generates outputs for the same samples, concurrently.
  5. Pairwise judging — an AI judge (Claude Sonnet) compares each challenger's output against the baseline in pairs. Each comparison runs twice with positions swapped (A/B then B/A) to control for first-position bias; a minimal sketch of this loop follows the list.
  6. Results ranked — win/loss/tie records, cost savings, and latency deltas are computed and ranked by your optimization goal.
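
Step 5 is the part that is easiest to get wrong, so here is a minimal runnable sketch of the swap-and-aggregate loop. One caveat: LLMTest's exact aggregation policy isn't stated above, so this follows the common MT-Bench convention of counting a win only when the same model wins both orderings and scoring disagreements as ties. The judge_stub is a stand-in for the real LLM judge (Claude Sonnet in LLMTest) so the sketch runs on its own.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]
Judge = Callable[[str, str, str], Verdict]

def judge_stub(prompt: str, out_a: str, out_b: str) -> Verdict:
    # Stand-in for the real LLM judge: just prefers the longer answer,
    # which is enough to exercise the swap logic below.
    if len(out_a) != len(out_b):
        return "A" if len(out_a) > len(out_b) else "B"
    return "tie"

def compare_with_swap(prompt: str, baseline: str, challenger: str,
                      judge: Judge) -> Verdict:
    """Judge twice with positions swapped (A/B then B/A).

    Returns "A" if the challenger wins both orderings, "B" if the
    baseline wins both, and "tie" otherwise (an explicit tie, or the
    two verdicts disagreeing, which signals position bias).
    """
    first = judge(prompt, challenger, baseline)   # challenger in slot A
    second = judge(prompt, baseline, challenger)  # challenger in slot B
    if first == "A" and second == "B":
        return "A"
    if first == "B" and second == "A":
        return "B"
    return "tie"

# Tally one challenger's record over a handful of samples
# (illustrative strings, not real model outputs).
samples = [
    ("prompt 1", "baseline answer", "challenger answer that is longer"),
    ("prompt 2", "a detailed baseline reply here", "short"),
    ("prompt 3", "same length", "same length"),
]
wins = ties = losses = 0
for prompt, base_out, chal_out in samples:
    verdict = compare_with_swap(prompt, base_out, chal_out, judge_stub)
    wins += verdict == "A"
    ties += verdict == "tie"
    losses += verdict == "B"
print(f"won {wins}, tied {ties}, lost {losses} of {len(samples)}")
# -> won 1, tied 1, lost 1 of 3
```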

Optimization goals

Set via the optimize_for parameter:

cost
Finds cheaper models. Selects challengers with lower prices. Ranks by cost savings.

quality
Finds better models. Includes pricier models. Ranks by win rate.

speed
Finds faster models. Prioritizes fast inference providers (Groq, Fireworks, Together). Ranks by latency.

balanced
Default. Tests a diverse spread of challengers across price tiers. Ranks by overall value.
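
To make the "ranks by" lines concrete, here is a sketch of plausible sort keys for each goal. The record field names and the equal-weight "balanced" blend are assumptions for illustration, not LLMTest's exact scoring; the numbers are made up.

```python
# Illustrative records: higher cost_savings_pct / latency_gain_ms means
# cheaper / faster than the baseline.
results = [
    {"model": "model-a", "win_rate": 0.70, "cost_savings_pct": 80, "latency_gain_ms": 718},
    {"model": "model-b", "win_rate": 0.55, "cost_savings_pct": 20, "latency_gain_ms": 90},
]

sort_keys = {
    "cost":    lambda r: r["cost_savings_pct"],
    "quality": lambda r: r["win_rate"],
    "speed":   lambda r: r["latency_gain_ms"],
    # "balanced" needs a composite score; an equal-weight blend of
    # roughly normalized terms is one simple choice.
    "balanced": lambda r: (r["win_rate"]
                           + r["cost_savings_pct"] / 100
                           + r["latency_gain_ms"] / 1000),
}

for goal, key in sort_keys.items():
    ranked = sorted(results, key=key, reverse=True)  # best first
    print(f"{goal:8s} -> {[r['model'] for r in ranked]}")
```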

Reading results

Each challenger gets a record like:

qwen/qwen3.5-flash: won 7, tied 3, lost 0 of 10 | 80% cheaper, 718ms faster
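
Reading left to right: win/tie/loss counts out of the total samples judged, then cost and latency deltas relative to the baseline. If you want to post-process these records, a small parser is sketched below. It assumes the exact shape of the line above; real output may phrase the deltas differently (e.g. for challengers that are slower or pricier), so treat the regex as a starting point.

```python
import re
from dataclasses import dataclass

@dataclass
class ChallengerRecord:
    model: str
    wins: int
    ties: int
    losses: int
    total: int
    cost_savings_pct: float  # percent cheaper than the baseline
    latency_gain_ms: float   # milliseconds faster than the baseline

    @property
    def win_rate(self) -> float:
        return self.wins / self.total

RECORD_RE = re.compile(
    r"(?P<model>\S+): won (?P<w>\d+), tied (?P<t>\d+), lost (?P<l>\d+) "
    r"of (?P<n>\d+) \| (?P<cost>\d+)% cheaper, (?P<lat>\d+)ms faster"
)

def parse_record(line: str) -> ChallengerRecord:
    m = RECORD_RE.fullmatch(line)
    if m is None:
        raise ValueError(f"unrecognized record: {line!r}")
    return ChallengerRecord(m["model"], int(m["w"]), int(m["t"]),
                            int(m["l"]), int(m["n"]),
                            float(m["cost"]), float(m["lat"]))

rec = parse_record("qwen/qwen3.5-flash: won 7, tied 3, lost 0 of 10 "
                   "| 80% cheaper, 718ms faster")
print(f"{rec.model}: {rec.win_rate:.0%} win rate")  # 70% win rate
```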

Pre-launch vs post-launch

Before launch, when you have no real traffic, seed the benchmark manually with seed_samples and iterate on model choice early. Once you're live, samples come from your real traffic, so results reflect the prompts your users actually send.

Testing specific models

Want to test a specific model against your baseline? Use the challengers parameter:

"Benchmark my code-reviewer flow against claude-sonnet-4 and gpt-4o-mini"