📊Evaluationalso: answer relevance, response relevance, rag answer relevance

Answer relevancedefinition and how it works in 2026

Answer relevance: The RAG eval metric that scores whether the answer actually addresses the user's question. Catches the "perfectly grounded but useless" failure mode.

Answer relevance measures whether the response answers what was asked. The RAGAS implementation: take the answer, ask an LLM to generate "what question would this answer be answering?", then measure the similarity between that generated question and the original question. High similarity = relevant; low = the answer went sideways.

Why this matters: a system can be perfectly grounded (every claim supported by retrieved docs) yet useless — "the document says many things about X" is grounded but doesn't answer "what is the deflection rate of X?" Answer-relevance catches that failure.

Production RAG evals typically track three top-level metrics in tandem: faithfulness (is the answer supported?), answer-relevance (does it answer the question?), and context-precision (was the right context retrieved?). Each tracks a distinct failure mode; gaps in any one degrade user experience.

Frequently asked

Why measure relevance separately from faithfulness?+

They catch different failure modes. Faithful + irrelevant: "The document discusses many topics." Relevant + unfaithful: "The deflection rate is 95%" (no source). You want both to be high; tracking them separately exposes which one breaks.

What's a target answer-relevance score?+

>0.85 for a production system. Below 0.7 usually means the prompt isn't directing the model to answer specifically, or the retrieval is so off-topic that the model gives up and meanders.

Frequently asked

Related terms