🏗️ Architecture | also: model distillation, knowledge distillation, distillation

Model distillation: definition and how it works in 2026

Model distillation: A training technique that transfers knowledge from a large "teacher" model to a smaller "student" model by training the student to match the teacher's outputs — produces a faster, cheaper model that retains most of the teacher's capability.

Distillation is how you get capable small models. The pipeline: run a large model (the teacher, e.g., GPT-5 or Claude Opus) on a diverse dataset; collect its outputs; then train a smaller model (the student, e.g., a 7B or 13B parameter model) to predict those outputs. The student inherits much of the teacher's "knowledge" at a fraction of the size and inference cost.
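The first two pipeline steps (run the teacher, collect outputs) can be sketched as below. This is a minimal illustration, not any vendor's actual pipeline: `build_distillation_dataset`, `to_jsonl`, and `toy_teacher` are hypothetical names, and the toy teacher simply uppercases its input where a real pipeline would call a large model.

```python
import json
from typing import Callable, Dict, Iterable, List


def build_distillation_dataset(
    teacher: Callable[[str], str],
    prompts: Iterable[str],
) -> List[Dict[str, str]]:
    """Run the teacher on each prompt and keep (prompt, completion) pairs."""
    records = []
    for prompt in prompts:
        completion = teacher(prompt)
        if completion.strip():  # skip empty teacher outputs
            records.append({"prompt": prompt, "completion": completion})
    return records


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records one JSON object per line, a common training-data layout."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Stand-in teacher for illustration only (assumption): a real teacher is a large model.
def toy_teacher(prompt: str) -> str:
    return prompt.upper()


dataset = build_distillation_dataset(toy_teacher, ["add 2 and 3", "name a prime"])
print(to_jsonl(dataset))
```

The resulting prompt/completion pairs are what the student is then trained to predict; the training step itself depends on the framework and is left out here.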

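In 2026, distillation is everywhere. Small commercial tiers such as GPT-5 mini, Claude Haiku, and Gemini Flash are generally described as distilled from their larger siblings, although vendors publish few training details. Open-source distillations (DeepSeek R1 distilled into Llama 8B, plus Qwen-based distillations) bring much of the teacher's reasoning ability to models small enough for edge devices.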

For agent builders, distilled models are the cost-effective option for routine work. Use frontier models for planning and verification, and distilled models for routine tool calls and bulk inference.
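One way an agent might apply this split is to route each step by its kind. The sketch below is an assumption about how such a router could look: the step kinds, the `Step` type, and the model names `"frontier-model"` and `"distilled-model"` are placeholders, not real model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str    # hypothetical labels: "plan", "verify", "tool_call", "bulk"
    prompt: str


# Step kinds sent to the frontier model, following the split described above.
FRONTIER_KINDS = {"plan", "verify"}


def pick_model(
    step: Step,
    frontier: str = "frontier-model",    # placeholder name (assumption)
    distilled: str = "distilled-model",  # placeholder name (assumption)
) -> str:
    """Return the model to use: frontier for planning/verification, distilled otherwise."""
    return frontier if step.kind in FRONTIER_KINDS else distilled


steps = [
    Step("plan", "Break the task into sub-tasks"),
    Step("tool_call", "Format arguments for the search tool"),
    Step("verify", "Check the final answer"),
]
routing = [pick_model(s) for s in steps]
```

Keeping the routing rule in one small function makes it easy to change which step kinds justify the more expensive model.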

Frequently asked

How does distillation differ from fine-tuning?

Fine-tuning adjusts a model on labeled task data. Distillation trains a smaller model to imitate a larger one on diverse inputs. Distillation transfers general capability; fine-tuning transfers specific task behavior.
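The difference shows up in the training target. The sketch below contrasts a hard-label cross-entropy loss (typical of supervised fine-tuning) with one common distillation formulation, the temperature-softened KL divergence between teacher and student distributions. This is an illustrative assumption: the source does not say which loss any particular model uses, and when only the teacher's text is available, distillation is often done by training on that text instead. All function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label_loss(student_logits: List[float], label: int) -> float:
    """Cross-entropy against one labeled class (supervised fine-tuning target)."""
    return -math.log(softmax(student_logits)[label])


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's softened prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
student = [2.0, 1.5, 0.2]
print(hard_label_loss(student, label=0))
print(distillation_loss(student, teacher))
```

The hard-label loss only tells the student which class is correct; the soft-target loss also conveys how the teacher ranks the alternatives.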

Is the student model as good as the teacher?

For routine tasks the student is usually very close, often within about 5% of the teacher. For the hardest reasoning tasks there is a noticeable gap. Use distilled models for bulk work and call frontier models for the hard cases.
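A simple way to act on this is a fallback: try the distilled model first and escalate only when it looks unsure. The sketch below assumes the distilled model can return some confidence score alongside its answer; that score, the `0.8` threshold, and both toy model functions are hypothetical stand-ins.

```python
from typing import Callable, Tuple


def answer_with_fallback(
    prompt: str,
    distilled: Callable[[str], Tuple[str, float]],
    frontier: Callable[[str], str],
    min_confidence: float = 0.8,  # illustrative threshold (assumption)
) -> Tuple[str, str]:
    """Return (answer, model_used), escalating to the frontier model on low confidence."""
    answer, confidence = distilled(prompt)
    if confidence >= min_confidence:
        return answer, "distilled"
    return frontier(prompt), "frontier"


# Toy stand-ins for illustration only: short prompts count as "easy".
def toy_distilled(prompt: str) -> Tuple[str, float]:
    confidence = 0.95 if len(prompt) < 20 else 0.4
    return f"small:{prompt}", confidence


def toy_frontier(prompt: str) -> str:
    return f"large:{prompt}"


easy = answer_with_fallback("2 + 2?", toy_distilled, toy_frontier)
hard = answer_with_fallback("Prove the lemma in full detail", toy_distilled, toy_frontier)
```

With this shape, most traffic stays on the cheaper model and the frontier model is only paid for on the cases that need it.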

Related terms