MT-Bench
A multi-turn conversation benchmark where models are judged by a strong "LLM-as-judge" on 80 open-ended questions across writing, reasoning, math, coding, and roleplay.
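MT-Bench, introduced by LMSYS alongside Chatbot Arena, consists of 80 two-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, STEM knowledge, and humanities knowledge. A judge LLM (typically GPT-4-class) scores each answer from 1 to 10, and the MT-Bench score is the average of those scores.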
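The aggregation step can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring script: the record layout (`category`, `turn`, `score` keys) and the function name `mt_bench_score` are assumptions made for the example, and the scores in the toy input are made up.

```python
from collections import defaultdict
from statistics import mean


def mt_bench_score(judgments):
    """Average 1-10 judge scores overall, per turn, and per category.

    `judgments` is an iterable of dicts with the hypothetical keys
    'category', 'turn' (1 or 2), and 'score'.
    """
    all_scores = []
    by_turn = defaultdict(list)
    by_category = defaultdict(list)
    for j in judgments:
        score = j["score"]
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range: {score}")
        all_scores.append(score)
        by_turn[j["turn"]].append(score)
        by_category[j["category"]].append(score)
    return {
        "overall": mean(all_scores),
        "per_turn": {t: mean(s) for t, s in by_turn.items()},
        "per_category": {c: mean(s) for c, s in by_category.items()},
    }


# Toy input: two questions with two turns each (made-up scores).
toy = [
    {"category": "math", "turn": 1, "score": 8.0},
    {"category": "math", "turn": 2, "score": 6.0},
    {"category": "writing", "turn": 1, "score": 9.0},
    {"category": "writing", "turn": 2, "score": 9.0},
]
result = mt_bench_score(toy)
print(result["overall"], result["per_turn"], result["per_category"])
```

Reporting the per-turn and per-category means next to the overall average is a design choice for this sketch; it makes it easier to see whether a model loses points on follow-up turns or in one category.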
Its main value was that its scores correlated well with human preference in the early instruction-tuning era. By 2026, MT-Bench is largely saturated: frontier models cluster between 9.0 and 9.5, and Chatbot Arena Elo ratings and harder benchmarks have taken over for differentiation.
It is still useful for evaluating mid-tier and open-source models, where the score spread is meaningful, and as a quick sanity check during fine-tuning.
Frequently asked
Why is MT-Bench saturated?
Frontier models have all crossed 9.0/10, and the judge cannot reliably distinguish answers above that level. As the signal-to-noise ratio dropped, the field moved to harder evaluations (Chatbot Arena, GAIA, ARC-AGI).
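One rough way to see the signal-to-noise problem is to compare the gap between two models' mean scores with the standard error of that gap. The sketch below assumes the two score lists are independent samples, which is a simplification because MT-Bench reuses the same questions for every model. The function name `gap_in_standard_errors` and all scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def gap_in_standard_errors(scores_a, scores_b):
    """Difference in mean score divided by the standard error of that difference.

    Treats the two lists as independent samples (a simplifying assumption).
    """
    se_a = stdev(scores_a) / sqrt(len(scores_a))
    se_b = stdev(scores_b) / sqrt(len(scores_b))
    return (mean(scores_a) - mean(scores_b)) / sqrt(se_a**2 + se_b**2)


# Made-up judge scores for three hypothetical models.
top_a = [10, 9, 10, 9]   # mean 9.5
top_b = [9, 10, 9, 9]    # mean 9.25
mid_c = [5, 6, 5, 6]     # mean 5.5

close_gap = gap_in_standard_errors(top_a, top_b)
wide_gap = gap_in_standard_errors(top_a, mid_c)
print(f"{close_gap:.2f} {wide_gap:.2f}")
```

In this toy data the two high-scoring models are separated by less than one standard error, while the gap to the mid-tier model is many standard errors wide, which matches the pattern described above.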