Evaluation · also: MMLU, Massive Multitask Language Understanding

MMLU: definition and how it works in 2026

MMLU (Massive Multitask Language Understanding) is a 57-subject multiple-choice benchmark spanning STEM, humanities, social sciences, law, and ethics. It has been the default measure of "general knowledge" for LLMs since 2020.

MMLU (Hendrycks et al., 2020) tests broad world knowledge with high-school-to-professional-level multiple-choice questions across 57 subjects. For half a decade it was the standard "how smart is this model" headline number.
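Because every question is multiple-choice, scoring reduces to counting matching answer letters. The sketch below is a minimal illustration, not the official evaluation code: the function name `mmlu_accuracy` and the `(subject, predicted, gold)` record layout are assumptions made for this example. It reports both a per-question average and a per-subject average, since evaluation setups may differ in which one they publish as the headline number.

```python
from collections import defaultdict


def mmlu_accuracy(records):
    """Score multiple-choice answers.

    `records` is an iterable of (subject, predicted_letter, gold_letter)
    tuples. This layout is hypothetical, chosen for illustration only.
    Returns (per_subject, micro, macro) where:
      - per_subject maps each subject to its accuracy,
      - micro is accuracy over all questions pooled together,
      - macro is the unweighted mean of the per-subject accuracies.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        # Compare letters case-insensitively, ignoring stray whitespace.
        if predicted.strip().upper() == gold.strip().upper():
            correct[subject] += 1

    per_subject = {s: correct[s] / total[s] for s in total}
    micro = sum(correct.values()) / sum(total.values())
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro


# Toy data: two subjects with different numbers of questions.
sample = [
    ("astronomy", "A", "A"),
    ("astronomy", "B", "C"),
    ("astronomy", "d", "D"),
    ("astronomy", "C", "C"),
    ("professional_law", "B", "B"),
    ("professional_law", "A", "D"),
]
per_subject, micro, macro = mmlu_accuracy(sample)
print(per_subject, round(micro, 4), round(macro, 4))
```

With unevenly sized subjects the two averages diverge (here 4 of 6 questions are correct overall, but the subject-level mean is lower), so a score comparison is only meaningful when both models were averaged the same way.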

By 2026 MMLU is mostly saturated: frontier models score 88–92%, at or around the estimated human expert accuracy of ~89.8%. Successor benchmarks (MMLU-Pro, GPQA Diamond, Humanity's Last Exam) raise the difficulty and reduce contamination.

For agent buyers, an MMLU score tells you the model has the background knowledge for general use. It does NOT tell you the model can act, plan, or use tools — those need agent benchmarks (AgentBench, GAIA, SWE-bench).

Frequently asked

Why is MMLU still cited if it is saturated?

Inertia and comparability. Every model has an MMLU score, so it remains the easiest cross-model comparison even though the signal at the top is thin.

What replaced MMLU?

No single benchmark did. MMLU-Pro covers harder multiple-choice, GPQA Diamond covers grad-level science, Humanity's Last Exam sets the new ceiling, and AgentBench/GAIA measure actual agent ability.
