aiagentrank.io
📊 Evaluation
also: guardrails ai, ai guardrails, llm guardrails

Guardrails (AI)

Constraints and filters layered around an LLM that prevent it from producing harmful, off-topic, or policy-violating outputs — applied at input, output, or both.

Guardrails are the production-runtime layer that catches what training-time alignment missed. An input guardrail filters out prompt injections and harmful queries before they reach the model. An output guardrail classifies the model's response and blocks or rewrites unsafe content before it reaches the user.
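The input and output checks described above can be sketched as a thin wrapper around the model call. This is a minimal illustration, not a production design: the phrase lists, the `GuardedResult` type, and the `guarded_call` helper are hypothetical names introduced here, and a real deployment would likely replace the substring checks with trained classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical marker lists, for illustration only. Substring matching is far
# too weak for real traffic; it stands in for a proper classifier here.
INJECTION_MARKERS = ("ignore previous instructions", "reveal your system prompt")
UNSAFE_OUTPUT_MARKERS = ("here is the admin password",)

REFUSAL = "Sorry, I can't help with that."


@dataclass
class GuardedResult:
    text: str
    blocked_at: Optional[str]  # "input", "output", or None if nothing fired


def input_guardrail(prompt: str) -> bool:
    """Return True if the prompt may be passed on to the model."""
    lowered = prompt.lower()
    return not any(marker in lowered for marker in INJECTION_MARKERS)


def output_guardrail(response: str) -> bool:
    """Return True if the response may be shown to the user."""
    lowered = response.lower()
    return not any(marker in lowered for marker in UNSAFE_OUTPUT_MARKERS)


def guarded_call(prompt: str, model: Callable[[str], str]) -> GuardedResult:
    # Input guardrail: a rejected prompt never reaches the model.
    if not input_guardrail(prompt):
        return GuardedResult(REFUSAL, "input")
    response = model(prompt)
    # Output guardrail: a rejected response never reaches the user.
    if not output_guardrail(response):
        return GuardedResult(REFUSAL, "output")
    return GuardedResult(response, None)
```

Here `model` is any callable that maps a prompt string to a response string, so the same wrapper can sit in front of a hosted API client or a local model. This sketch blocks unsafe output; rewriting it instead would be a different policy in the same place.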

The common techniques: NeMo Guardrails for declarative policy specs, Llama Guard and ShieldGemma for classifying prompts and responses, regex/heuristic filters for PII redaction, and tool-call allowlists for agent action gating. Modern stacks layer several together; no single guardrail catches everything.
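As one concrete instance of the regex/heuristic layer, the sketch below redacts two kinds of PII. The patterns and the `redact_pii` helper are assumptions made for illustration; they are deliberately simple and will miss many real-world formats, which is one reason this layer is usually combined with others.

```python
import re

# Illustrative patterns only: a basic email shape and a US SSN shape.
# Real redaction needs broader coverage (phone numbers, addresses, names).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_pii(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact_pii("Mail jane.doe@example.com, SSN 123-45-6789."))
# prints: Mail [EMAIL], SSN [SSN].
```

The same function can run on the prompt before it reaches the model and on the response before it reaches the user.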

For agents specifically, guardrails are non-negotiable when the agent has tools. The agent can be jailbroken in fifty ways you have not thought of; the guardrail layer ensures that even if jailbroken, the agent cannot execute the worst actions (delete data, send unauthorized emails, charge cards).
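A tool-call allowlist is one way to get that guarantee, because the check runs in ordinary code outside the model and a jailbroken prompt cannot talk its way past it. The sketch below is a minimal version under stated assumptions: `ToolGate`, `ToolCallBlocked`, and the tool names are hypothetical and not taken from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet


class ToolCallBlocked(Exception):
    """Raised when the agent requests a tool that is not allowlisted."""


@dataclass(frozen=True)
class ToolGate:
    tools: Dict[str, Callable[..., Any]]  # every tool the runtime knows about
    allowed: FrozenSet[str]               # the subset this agent may invoke

    def call(self, name: str, **kwargs: Any) -> Any:
        # Deny by default: unknown tools and non-allowlisted tools both fail.
        if name not in self.allowed or name not in self.tools:
            raise ToolCallBlocked(f"tool {name!r} is not allowlisted")
        return self.tools[name](**kwargs)


# Hypothetical tools: one harmless, one destructive.
gate = ToolGate(
    tools={
        "search_docs": lambda query: f"results for {query}",
        "delete_records": lambda table: f"deleted {table}",
    },
    allowed=frozenset({"search_docs"}),
)

print(gate.call("search_docs", query="refund policy"))

try:
    gate.call("delete_records", table="users")
except ToolCallBlocked as err:
    print(f"blocked: {err}")
```

The agent loop would route every model-requested tool call through `gate.call`, so whatever the model is persuaded to ask for, the destructive tool is never executed.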

Frequently asked

What is the difference between guardrails and alignment?

Alignment is built into the model at training time. Guardrails are bolted on at deployment time. Use both — guardrails catch what alignment misses, and alignment makes guardrails simpler.

Do open-source guardrail libraries work?

For input filtering and PII redaction: yes, mature. For output classification of nuanced policy violations: partially. Llama Guard and ShieldGemma help, but commercial stacks (Aporia, Lasso, Lakera) still lead on coverage.
