📊 Evaluation · also: llm as a judge, llm as judge, llm judge

LLM as a judge

An evaluation pattern where a stronger LLM scores another LLM's outputs — replacing or supplementing human review when exact-match grading is infeasible.

LLM-as-a-judge is the practical answer to "how do I grade open-ended outputs at scale?" You give a frontier model the task prompt, the response, and a rubric, and ask it to score. The judge model does what a human grader would, faster and cheaper.
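The prompt-and-score loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the template wording, the 1-5 scale, the `SCORE:` reply format, and the `call_model` callable (a stand-in for whatever client sends a prompt to the judge model and returns its text) are all assumptions made for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = """You are grading a response against a rubric.

Task prompt:
{task}

Response:
{response}

Rubric:
{rubric}

Reply with a line of the form "SCORE: <integer 1-5>" followed by a short justification."""


def build_judge_prompt(task: str, response: str, rubric: str) -> str:
    """Combine the three inputs the judge needs into a single prompt."""
    return JUDGE_TEMPLATE.format(task=task, response=response, rubric=rubric)


def parse_score(judge_reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract the integer score, or return None if it is missing or out of range."""
    match = re.search(r"SCORE:\s*(\d+)", judge_reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def judge(
    task: str,
    response: str,
    rubric: str,
    call_model: Callable[[str], str],  # hypothetical: prompt in, judge text out
) -> Optional[int]:
    """Score one response with the judge model behind `call_model`."""
    return parse_score(call_model(build_judge_prompt(task, response, rubric)))
```

Returning `None` for unparseable or out-of-range replies keeps malformed judge outputs from silently entering the score distribution; the caller can retry or route those cases to a human.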

The pattern works well for relative comparison (response A vs response B), rubric-based scoring (does the response satisfy criteria 1-5), and pass/fail safety checks. It works less well for fine-grained correctness in domains where the judge model itself is unreliable, such as specialized medicine or niche legal areas.

Best practices in 2026: use a judge model that is stronger than the one being evaluated, calibrate it with a few human-graded examples in the prompt (few-shot), prefer pairwise comparisons over absolute scoring when possible, and sample 5–10% of judge outputs for human spot-checks. The pattern is now mainstream: Helicone, Braintrust, and LangSmith all have built-in LLM-judge runners.
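Two of these practices, pairwise comparison and human spot-check sampling, are easy to sketch in code. The example below is illustrative only: the prompt wording, the `WINNER:` reply format, the `call_model` callable, and the fixed random seed are assumptions, and it does not use the API of any of the tools named above.

```python
import random
import re
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical pairwise template; the wording is an assumption.
PAIRWISE_TEMPLATE = """Task prompt:
{task}

Response A:
{a}

Response B:
{b}

Which response better completes the task? Reply with "WINNER: A" or "WINNER: B"."""


def pairwise_winner(
    task: str,
    a: str,
    b: str,
    call_model: Callable[[str], str],  # hypothetical: prompt in, judge text out
) -> Optional[str]:
    """Ask the judge to pick between two responses; None if the reply is unparseable."""
    reply = call_model(PAIRWISE_TEMPLATE.format(task=task, a=a, b=b))
    match = re.search(r"WINNER:\s*([AB])\b", reply)
    return match.group(1) if match else None


def sample_for_spot_check(
    judgments: Sequence[T], fraction: float = 0.05, seed: int = 0
) -> List[T]:
    """Pick a random subset of judge outputs (at least one) for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not judgments:
        return []
    k = max(1, round(len(judgments) * fraction))
    rng = random.Random(seed)  # seeded so the audit sample is reproducible
    return rng.sample(list(judgments), k)
```

With 200 judged items and `fraction=0.05`, `sample_for_spot_check` selects 10 items for a human reviewer.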

Frequently asked

How accurate is LLM-as-a-judge?

On open-ended quality scoring, frontier judge models correlate with human raters at 0.7–0.85. On factual correctness in specialized domains where the judge is also weak, the correlation drops to 0.4–0.6. Use human spot-checks to calibrate.
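To check where a particular judge falls in these ranges, one option is to score a set of items with both the judge and human raters and compute the correlation between the two. The sketch below uses Pearson correlation as an assumption; the text above does not say which coefficient the figures refer to, and a rank correlation may suit ordinal rubric scores better.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between paired judge scores and human scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / sqrt(var_x * var_y)
```

Calling `pearson(judge_scores, human_scores)` on the human spot-check sample gives a running estimate of how far the judge can be trusted on that task.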

Should the judge be a different model from the one being judged?

Yes when possible — same-model judging tends to favor outputs from the same family. Use a model from a different vendor or a stronger reasoning model as judge for cleaner signals.
