Evaluation terms
Benchmarks, evals, and the metrics that matter.
- Agent observability
Specialized observability for AI agents — tracing the agent's reasoning, tool calls, sub-agent communication, state changes, and decision points across a multi-step run.
- Agent sandbox
An isolated execution environment — usually a container, microVM, or browser profile — where an agent can run code, browse, and act without affecting the host system or shared state.
- AgentBench
A multi-environment benchmark suite for LLM-as-agent performance, spanning 8 environments that include operating system, database, knowledge graph, card game, lateral-thinking puzzle, and web shopping tasks.
- AI alignment
The research and engineering practice of ensuring AI systems pursue the goals their designers intend — covering training-time techniques like RLHF and constitutional AI as well as deployment-time guardrails.
- AI bias
Systematic errors in AI outputs that disadvantage specific groups, perspectives, or topics — caused by biased training data, biased reward signals, or biased evaluation criteria.
- AI content moderation
The classifier and policy layer that filters input to and output from an LLM agent — blocks unsafe categories (CSAM, self-harm, malware), enforces brand voice, and flags PII.
- AI evals
Systematic test suites for AI systems — input/expected-output pairs run automatically to catch regressions when models or prompts change.
- AI governance
The framework of policies, controls, and review processes that ensure AI systems are deployed safely, ethically, and in compliance with regulation — covers risk management, audit trails, and stakeholder accountability.
- AI safety
The research and engineering discipline focused on making AI systems behave reliably, refuse harmful requests, and fail gracefully under unexpected inputs — covering both training-time alignment and deployment-time guardrails.
- AI watermarking
Techniques that embed a detectable signal in AI-generated text, images, audio, or video so downstream systems can identify content as machine-generated.
- Answer relevance
The RAG eval metric that scores whether the answer actually addresses the user's question. Catches the "perfectly grounded but useless" failure mode.
- ARC-AGI
François Chollet's benchmark for measuring fluid intelligence — agents must induce a transformation rule from a few input/output grid examples and apply it. Designed to resist memorization.
- Benchmark
A publicly shared, standardized eval suite used to compare models and agents on a uniform set of tasks, such as SWE-bench, MMLU, or GAIA.
- Citation quality
An eval metric for systems that cite sources — measures whether citations resolve to real documents, point to the supporting passage, and match the cited claim.
- Deflection rate
In support agents: the percentage of customer contacts the agent resolves fully without escalating to a human.
- EU AI Act
The European Union's regulatory framework for AI systems — categorizes AI by risk level (prohibited, high-risk, limited risk, minimal risk) and imposes obligations based on category. Phased into force 2024–2027.
- Eval
A systematic test that measures agent performance on a fixed set of inputs — the agent equivalent of a test suite.
- Faithfulness
The RAG eval metric that scores whether the answer's claims are supported by the retrieved context. It is one of the core RAGAS metrics and a near-synonym for groundedness.
- GAIA benchmark
A 466-question benchmark from Meta and Hugging Face that tests general-purpose AI assistants on real-world tasks requiring web browsing, file handling, and multi-step reasoning.
- Groundedness
A RAG eval metric measuring whether the generated response is supported by the retrieved context. Distinct from factual accuracy — the answer could be grounded in a wrong source.
- Guardrails (AI)
Constraints and filters layered around an LLM that prevent it from producing harmful, off-topic, or policy-violating outputs — applied at input, output, or both.
- Hallucination
When an LLM generates content that sounds plausible but is factually wrong or fabricated — a citation that doesn't exist, a function that isn't in the API.
- HumanEval
A code-generation benchmark from OpenAI: 164 Python programming problems with unit tests, used to measure an LLM's ability to generate correct code from a natural-language description.
- Jailbreak (AI)
A prompting technique that bypasses an LLM's safety guardrails to make it produce content the model was trained to refuse.
- LLM as a judge
An evaluation pattern where a stronger LLM scores another LLM's outputs — replacing or supplementing human review when exact-match grading is infeasible.
- LLM observability
The practice of monitoring, tracing, and debugging LLM-powered systems in production — capturing prompts, completions, latency, cost, and errors across every call.
- MMLU
Massive Multitask Language Understanding — a 57-subject multiple-choice benchmark spanning STEM, humanities, social sciences, law, and ethics. The default measure of "general knowledge" for LLMs since 2020.
- Model card
A short structured document published with an AI model — declares intended uses, training data overview, performance across subgroups, known limitations, and risk factors.
- MT-Bench
A multi-turn conversation benchmark in which a strong LLM judge scores models on 80 open-ended questions across categories including writing, reasoning, math, coding, and roleplay.
- Prompt injection
An attack where malicious instructions are smuggled into an LLM's input — through user prompts, web pages, documents, or tool outputs — causing the agent to ignore its real instructions.
- RAGAS
An open-source RAG evaluation framework, the de facto standard in 2026 for measuring faithfulness, answer relevance, context precision, and context recall.
- Red teaming
A structured testing practice where adversaries actively try to break an AI system — finding jailbreaks, hallucinations, harmful outputs, or unsafe tool calls before attackers do.
- SWE-bench
A benchmark from Princeton that tests coding agents on real GitHub issues — given the bug report and repo, the agent must produce a patch that passes the project's tests.
- WebArena
A benchmark of realistic web tasks (e-commerce, social, content management) where agents are scored on completing multi-step user goals through a real browser.
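
Two of the definitions above reduce to simple arithmetic, so a small sketch may help make them concrete. The Python below is a minimal illustration, not a reference implementation: it scores a tiny eval suite of input/expected-output pairs by exact match (as in the AI evals entry) and computes a deflection rate as a percentage (as in the Deflection rate entry). The names `EVAL_CASES`, `toy_system`, `pass_rate`, and `deflection_rate`, and all the numbers, are hypothetical and chosen only for this example.

```python
from typing import Callable

# Hypothetical eval suite: (input, expected output) pairs.
EVAL_CASES = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def toy_system(prompt: str) -> str:
    """Stand-in for a model or agent call; deliberately gets one case wrong."""
    canned = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "6"}
    return canned.get(prompt, "")


def pass_rate(system: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases whose output exactly matches the expected string."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(1 for prompt, expected in cases if system(prompt).strip() == expected)
    return passed / len(cases)


def deflection_rate(resolved_without_human: int, total_contacts: int) -> float:
    """Percentage of contacts fully resolved without escalating to a human."""
    if total_contacts <= 0:
        raise ValueError("total_contacts must be positive")
    if not 0 <= resolved_without_human <= total_contacts:
        raise ValueError("resolved count must be between 0 and total_contacts")
    return 100.0 * resolved_without_human / total_contacts


if __name__ == "__main__":
    # Re-running the same suite after a model or prompt change exposes regressions.
    print(f"pass rate: {pass_rate(toy_system, EVAL_CASES):.2f}")
    print(f"deflection rate: {deflection_rate(420, 600):.1f}%")
```

Exact-match grading is only one possible scoring rule. Where outputs are open-ended, the comparison inside `pass_rate` would typically be swapped for a rubric or an LLM-as-a-judge call.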