Evaluation terms
Benchmarks, evals, and the metrics that matter.
- Agent observability
Specialized observability for AI agents: tracing the agent's reasoning, tool calls, sub-agent communication, state changes, and decision points across a multi-step run.
- Agent sandbox
An isolated execution environment (usually a container, microVM, or browser profile) where an agent can run code, browse, and act without affecting the host system or shared state.
- AgentBench
A multi-environment benchmark suite for LLM-as-agent performance. Its 8 environments include operating system, database, web shopping, knowledge graph, card game, and lateral-thinking tasks.
- AI alignment
The research and engineering practice of ensuring AI systems pursue the goals their designers intend, covering training-time techniques like RLHF and constitutional AI as well as deployment-time guardrails.
- AI bias
Systematic errors in AI outputs that disadvantage specific groups, perspectives, or topics, caused by biased training data, biased reward signals, or biased evaluation criteria.
- AI content moderation
The classifier and policy layer that filters input to and output from an LLM agent: it blocks unsafe categories (CSAM, self-harm, malware), enforces brand voice, and flags PII.
- AI evals
Systematic test suites for AI systems: input/expected-output pairs run automatically to catch regressions when models or prompts change.
- AI governance
The framework of policies, controls, and review processes that ensure AI systems are deployed safely, ethically, and in compliance with regulation. It covers risk management, audit trails, and stakeholder accountability.
- AI safety
The research and engineering discipline focused on making AI systems behave reliably, refuse harmful requests, and fail gracefully under unexpected inputs, covering both training-time alignment and deployment-time guardrails.
- AI watermarking
Techniques that embed a detectable signal in AI-generated text, images, audio, or video so downstream systems can identify content as machine-generated.
- Answer relevance
The RAG eval metric that scores whether the answer actually addresses the user's question. Catches the "perfectly grounded but useless" failure mode.
- ARC-AGI
François Chollet's benchmark for measuring fluid intelligence: agents must induce a transformation rule from a few input/output grid examples and apply it. Designed to resist memorization.
- Benchmark
A publicly shared, standardized eval suite used to compare models and agents on the same tasks, such as SWE-bench, MMLU, and GAIA.
- Citation quality
An eval metric for systems that cite sources: it measures whether citations resolve to real documents, point to the supporting passage, and match the cited claim.
- Deflection rate
In support agents: the percentage of customer contacts the agent resolves fully without escalating to a human.
- EU AI Act
The European Union's regulatory framework for AI systems. It categorizes AI by risk level (prohibited, high-risk, limited risk, minimal risk) and imposes obligations based on category. Phased into force 2024–2027.
- Eval
A systematic test that measures agent performance on a fixed set of inputs; the agent equivalent of a test suite.
- Faithfulness
The RAG eval metric that scores whether the answer's claims are supported by the retrieved context. It is a standard RAGAS metric and a near-synonym for groundedness.
- GAIA benchmark
A 466-question benchmark from Meta and Hugging Face that tests general-purpose AI assistants on real-world tasks requiring web browsing, file handling, and multi-step reasoning.
- Groundedness
A RAG eval metric measuring whether the generated response is supported by the retrieved context. Distinct from factual accuracy: the answer could be grounded in a wrong source.
- Guardrails (AI)
Constraints and filters layered around an LLM that prevent it from producing harmful, off-topic, or policy-violating outputs, applied at input, output, or both.
- Hallucination
When an LLM generates content that sounds plausible but is factually wrong or fabricated, such as a citation that doesn't exist or a function that isn't in the API.
- HumanEval
A code-generation benchmark from OpenAI: 164 Python programming problems with unit tests, used to measure an LLM's ability to generate correct code from a natural-language description.
- Jailbreak (AI)
A prompting technique that bypasses an LLM's safety guardrails to make it produce content the model was trained to refuse.
- LLM as a judge
An evaluation pattern where a stronger LLM scores another LLM's outputs, replacing or supplementing human review when exact-match grading is infeasible.
- LLM observability
The practice of monitoring, tracing, and debugging LLM-powered systems in production: capturing prompts, completions, latency, cost, and errors across every call.
- MMLU
Massive Multitask Language Understanding: a 57-subject multiple-choice benchmark spanning STEM, humanities, social sciences, law, and ethics. The default measure of "general knowledge" for LLMs since 2020.
- Model card
A short structured document published with an AI model. It declares intended uses, training data overview, performance across subgroups, known limitations, and risk factors.
- MT-Bench
A multi-turn conversation benchmark where models are judged by a strong "LLM-as-judge" on 80 open-ended questions across writing, reasoning, math, coding, and roleplay.
- Prompt injection
An attack where malicious instructions are smuggled into an LLM's input (through user prompts, web pages, documents, or tool outputs), causing the agent to ignore its real instructions.
- RAGAS
An open-source RAG evaluation framework, the de facto standard in 2026 for measuring faithfulness, answer-relevance, context-precision, and context-recall.
- Red teaming
A structured testing practice where adversaries actively try to break an AI system, finding jailbreaks, hallucinations, harmful outputs, or unsafe tool calls before attackers do.
- SWE-bench
A benchmark from Princeton that tests coding agents on real GitHub issues: given the bug report and repo, the agent must produce a patch that passes the project's tests.
- WebArena
A benchmark of realistic web-task scenarios (e-commerce, social, content management) where agents are scored on completing real multi-step user goals through a real browser.
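
Two of the definitions above reduce to simple arithmetic, so a small sketch may help make them concrete. The Python below is an illustration only, not any framework's API: `EVAL_CASES`, `toy_model`, `run_eval`, and `deflection_rate` are hypothetical names, the cases are made up, and exact-match grading is assumed (real evals often need fuzzier scoring, such as an LLM judge).

```python
from typing import Callable

# Hypothetical eval set: (input, expected output) pairs.
EVAL_CASES = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def run_eval(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the output exactly matches the expected text."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return passed / len(cases)


def deflection_rate(resolved_without_human: int, total_contacts: int) -> float:
    """Percentage of customer contacts resolved fully without escalating to a human."""
    if total_contacts <= 0:
        raise ValueError("total_contacts must be positive")
    return 100.0 * resolved_without_human / total_contacts


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call: a lookup table with one deliberate mistake."""
    answers = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "6"}
    return answers.get(prompt, "")


if __name__ == "__main__":
    print("pass rate:", run_eval(toy_model, EVAL_CASES))
    print("deflection rate (%):", deflection_rate(45, 60))
```

Rerunning `run_eval` after each model or prompt change and comparing the pass rate with the previous run is what turns a set of cases into a regression check.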