📊

Evaluation terms

Benchmarks, evals, and the metrics that matter.

📊Evaluation
Agent observability
Specialized observability for AI agents — tracing the agent's reasoning, tool calls, sub-agent communication, state changes, and decision points across a multi-step run.
📊Evaluation
Agent sandbox
An isolated execution environment — usually a container, microVM, or browser profile — where an agent can run code, browse, and act without affecting the host system or shared state.
📊Evaluation
AgentBench
A multi-environment benchmark suite for LLM-as-agent performance — covers OS, database, web shopping, knowledge graph, card game, and lateral-thinking tasks across 8 environments.
📊Evaluation
AI alignment
The research and engineering practice of ensuring AI systems pursue the goals their designers intend — covering training-time techniques like RLHF and constitutional AI as well as deployment-time guardrails.
📊Evaluation
AI bias
Systematic errors in AI outputs that disadvantage specific groups, perspectives, or topics — caused by biased training data, biased reward signals, or biased evaluation criteria.
📊Evaluation
AI content moderation
The classifier and policy layer that filters input to and output from an LLM agent — blocks unsafe categories (CSAM, self-harm, malware), enforces brand voice, and flags PII.
📊Evaluation
AI evals
Systematic test suites for AI systems — input/expected-output pairs run automatically to catch regressions when models or prompts change.
📊Evaluation
AI governance
The framework of policies, controls, and review processes that ensure AI systems are deployed safely, ethically, and in compliance with regulation — covers risk management, audit trails, and stakeholder accountability.
📊Evaluation
AI safety
The research and engineering discipline focused on making AI systems behave reliably, refuse harmful requests, and fail gracefully under unexpected inputs — covering both training-time alignment and deployment-time guardrails.
📊Evaluation
AI watermarking
Techniques that embed a detectable signal in AI-generated text, images, audio, or video so downstream systems can identify content as machine-generated.
📊Evaluation
Answer relevance
The RAG eval metric that scores whether the answer actually addresses the user's question. Catches the "perfectly grounded but useless" failure mode.
📊Evaluation
ARC-AGI
François Chollet's benchmark for measuring fluid intelligence — agents must induce a transformation rule from a few input/output grid examples and apply it. Designed to resist memorization.
📊Evaluation
Benchmark
A publicly-shared, standardized eval suite used to compare models and agents across a uniform task — SWE-bench, MMLU, GAIA, etc.
📊Evaluation
Citation quality
An eval metric for systems that cite sources — measures whether citations resolve to real documents, point to the supporting passage, and match the cited claim.
📊Evaluation
Deflection rate
In support agents: the percentage of customer contacts the agent resolves fully without escalating to a human.
📊Evaluation
EU AI Act
The European Union's regulatory framework for AI systems — categorizes AI by risk level (prohibited, high-risk, limited risk, minimal risk) and imposes obligations based on category. Phased into force 2024–2027.
📊Evaluation
Eval
A systematic test that measures agent performance on a fixed set of inputs — the agent equivalent of a test suite.
📊Evaluation
Faithfulness
The RAG eval metric that scores whether the answer's claims are supported by the retrieved context — the standard RAGAS metric and a near-synonym for groundedness.
📊Evaluation
GAIA benchmark
A 466-question benchmark from Meta + Hugging Face that tests general-purpose AI assistants on real-world tasks requiring web browsing, file handling, and multi-step reasoning.
📊Evaluation
Groundedness
A RAG eval metric measuring whether the generated response is supported by the retrieved context. Distinct from factual accuracy — the answer could be grounded in a wrong source.
📊Evaluation
Guardrails (AI)
Constraints and filters layered around an LLM that prevent it from producing harmful, off-topic, or policy-violating outputs — applied at input, output, or both.
📊Evaluation
Hallucination
When an LLM generates content that sounds plausible but is factually wrong or fabricated — a citation that doesn't exist, a function that isn't in the API.
📊Evaluation
HumanEval
A code-generation benchmark from OpenAI: 164 Python programming problems with unit tests, used to measure an LLM's ability to generate correct code from a natural-language description.
📊Evaluation
Jailbreak (AI)
A prompting technique that bypasses an LLM's safety guardrails to make it produce content the model was trained to refuse.
📊Evaluation
LLM as a judge
An evaluation pattern where a stronger LLM scores another LLM's outputs — replacing or supplementing human review when exact-match grading is infeasible.
📊Evaluation
LLM observability
The practice of monitoring, tracing, and debugging LLM-powered systems in production — capturing prompts, completions, latency, cost, and errors across every call.
📊Evaluation
MMLU
Massive Multitask Language Understanding — a 57-subject multiple-choice benchmark spanning STEM, humanities, social sciences, law, and ethics. The default measure of "general knowledge" for LLMs since 2020.
📊Evaluation
Model card
A short structured document published with an AI model — declares intended uses, training data overview, performance across subgroups, known limitations, and risk factors.
📊Evaluation
MT-Bench
A multi-turn conversation benchmark where models are judged by a strong "LLM-as-judge" on 80 open-ended questions across writing, reasoning, math, coding, and roleplay.
📊Evaluation
Prompt injection
An attack where malicious instructions are smuggled into an LLM's input — through user prompts, web pages, documents, or tool outputs — causing the agent to ignore its real instructions.
📊Evaluation
RAGAS
An open-source RAG evaluation framework — the de facto standard in 2026 for measuring faithfulness, answer-relevance, context-precision, and context-recall.
📊Evaluation
Red teaming
A structured testing practice where adversaries actively try to break an AI system — finding jailbreaks, hallucinations, harmful outputs, or unsafe tool calls before attackers do.
📊Evaluation
SWE-bench
A benchmark from Princeton that tests coding agents on real GitHub issues — given the bug report and repo, the agent must produce a patch that passes the project's tests.
📊Evaluation
WebArena
A benchmark of realistic web-task scenarios (e-commerce, social, content management) where agents are scored on completing real multi-step user goals through a real browser.