📊
Evaluation terms
Benchmarks, evals, and the metrics that matter.
- 📊 Evaluation: Benchmark
A publicly shared, standardized eval suite used to compare models and agents on the same set of tasks, for example SWE-bench, MMLU, or GAIA.
- 📊 Evaluation: Deflection rate
In support agents: the percentage of customer contacts the agent resolves fully without escalating to a human.
- 📊 Evaluation: Eval
A systematic test that measures agent performance on a fixed set of inputs — the agent equivalent of a test suite.
- 📊 Evaluation: Hallucination
When an LLM generates content that sounds plausible but is factually wrong or fabricated — a citation that doesn't exist, a function that isn't in the API.
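
Two of the terms above, eval and deflection rate, reduce to simple arithmetic over counts. The sketch below is illustrative only. The names `EVAL_CASES`, `run_eval`, `deflection_rate`, and `toy_agent` are hypothetical, exact string match is assumed as the pass criterion, and real evals often use more lenient or task-specific scoring.

```python
from typing import Callable

# Hypothetical fixed eval set: (input, expected output) pairs.
EVAL_CASES = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def run_eval(agent: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the agent's output matches exactly."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in cases if agent(prompt) == expected)
    return passed / len(cases)


def deflection_rate(resolved_without_human: int, total_contacts: int) -> float:
    """Percentage of contacts the agent resolved fully with no escalation."""
    if total_contacts <= 0:
        raise ValueError("total_contacts must be positive")
    if not 0 <= resolved_without_human <= total_contacts:
        raise ValueError("resolved count must be between 0 and total_contacts")
    return 100.0 * resolved_without_human / total_contacts


def toy_agent(prompt: str) -> str:
    """Stand-in agent (assumption): answers from a small lookup table."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


if __name__ == "__main__":
    print(f"eval pass rate: {run_eval(toy_agent, EVAL_CASES):.2f}")
    print(f"deflection rate: {deflection_rate(150, 200):.1f}%")
```

Because the eval set is fixed, two agents (or two versions of the same agent) can be compared on identical inputs, which is what makes the result comparable from run to run.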