Constitutional AI
An alignment technique developed by Anthropic where the model is trained to follow a written set of principles ("a constitution") rather than per-example human preferences — produces safer behavior without massive human-labeling effort.
Constitutional AI replaces some of the human-labeling work in RLHF with model-generated critiques. The pipeline: write a "constitution" of principles (be helpful, avoid harm, do not encourage illegal activity, etc.); have the model critique its own responses against the constitution; train the model on the corrected responses.
The technique was pioneered by Anthropic for the Claude family and underpins Claude's consistency on safety boundaries. In 2026, variants of constitutional AI are used at most frontier labs as a complement to or partial replacement for RLHF.
For agent builders, you usually consume the result rather than implementing it. The practical impact: Claude refuses certain requests in consistent, principle-grounded ways rather than via brittle keyword filters.
Frequently asked
Can I write my own constitution for my agent?+
You can write principles into your system prompt. True constitutional AI training requires running the technique during model training, which is beyond most teams. The system-prompt approach is the practical version.
Is constitutional AI better than RLHF?+
Different trade-offs. RLHF requires expensive human labeling but is highly responsive to preferences. Constitutional AI is cheaper and more principle-grounded but harder to fine-tune for specific behaviors. Most modern alignment combines both.