Benchmark
A publicly shared, standardized eval suite used to compare models and agents on the same set of tasks, such as SWE-bench, MMLU, or GAIA.
Benchmarks let you compare apples to apples across model versions and agent designs. SWE-bench measures coding-agent capability on real GitHub issues; GAIA measures general agent reasoning; HumanEval measures function-level code generation.
The trap is over-fitting. Once a benchmark becomes the target, vendors optimize for it specifically, and gains on the leaderboard stop implying gains in practice. By the time SWE-bench scores hit 80%, real-world coding-agent quality had diverged from the benchmark numbers.
Use benchmarks as a *floor*, not a ceiling. An agent that scores poorly is suspect; an agent that scores well still needs your own evals to confirm fit.
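The floor-not-ceiling rule can be read as a two-stage gate. The sketch below is only an illustration: the `Candidate` type, the `verdict` function, and both thresholds are hypothetical, and scores are assumed to be fractions between 0 and 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    benchmark_score: float                  # public benchmark score, 0.0 to 1.0
    own_eval_score: Optional[float] = None  # pass rate on your own tasks, if run


def verdict(c: Candidate, floor: float = 0.5, own_bar: float = 0.8) -> str:
    """Treat the public benchmark as a floor, then defer to your own evals.

    Both thresholds are placeholder values, not recommendations.
    """
    if c.benchmark_score < floor:
        return "reject"           # a poor public score makes the agent suspect
    if c.own_eval_score is None:
        return "needs own evals"  # a good public score alone confirms nothing
    return "accept" if c.own_eval_score >= own_bar else "reject"
```

Note that a high benchmark score never produces "accept" by itself; only your own eval result can.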
Frequently asked
Which benchmarks matter most for choosing an agent?
SWE-bench Verified for coding agents, GAIA for general-purpose agents, and τ-bench for tool-use reliability. Always check the date: benchmark numbers more than six months old are stale.
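If you track reported scores in a spreadsheet or script, the date check is easy to automate. This is a minimal sketch that assumes you record the date each score was published; `is_stale` and the 183-day cutoff (roughly six months) are illustrative choices, not a standard.

```python
from datetime import date

STALE_AFTER_DAYS = 183  # roughly six months; an assumed cutoff


def is_stale(reported_on: date, today: date,
             max_age_days: int = STALE_AFTER_DAYS) -> bool:
    """Return True when a benchmark result is older than the cutoff."""
    return (today - reported_on).days > max_age_days
```

For example, `is_stale(date(2024, 1, 1), date(2024, 9, 1))` is `True`, since the result is 244 days old.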