Small language model
A capable LLM in the 1B–13B parameter range, trained to approach frontier-model quality on specific tasks while running on consumer hardware or at a fraction of frontier cost.
Small language models (SLMs) are the practical answer to "I want LLM quality at 1/100th the cost." In 2026, well-trained 3B–8B models — Phi, Llama 3.x, Qwen, Mistral Small, Gemma — match or exceed 2023 70B-class quality on many tasks.
SLMs win on three axes: cost per token (often 10–100× cheaper), latency (5–20× faster), and privacy (can run on-device or in your VPC). They lose on long-tail knowledge, hardest reasoning, and edge cases where the frontier model's broader prior matters.
In production, the smart play is hybrid: route easy traffic to an SLM, escalate hard queries to a frontier model. Done well, this can drop spend 60–90% while improving median latency. Done poorly, the routing layer becomes the bottleneck.
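One way to picture the hybrid setup is a confidence-gated router. The sketch below is illustrative only: `slm` and `frontier` are hypothetical callables standing in for whatever clients you use, the SLM is assumed to return a confidence score alongside its answer, and the 0.8 threshold is an arbitrary starting point rather than a recommended value.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Routed:
    model: str   # which tier produced the answer: "slm" or "frontier"
    answer: str


def route(
    query: str,
    slm: Callable[[str], Tuple[str, float]],  # hypothetical: (answer, confidence in [0, 1])
    frontier: Callable[[str], str],           # hypothetical: returns an answer
    threshold: float = 0.8,                   # assumed cutoff; tune on your own eval set
) -> Routed:
    """Try the SLM first and escalate only when its confidence is low."""
    answer, confidence = slm(query)
    if confidence >= threshold:
        return Routed("slm", answer)
    # Escalation: the SLM call has already been paid for; the frontier call is extra.
    return Routed("frontier", frontier(query))


def spend_reduction(escalation_rate: float, slm_to_frontier_cost: float) -> float:
    """Fractional saving versus sending every query to the frontier model.

    Costs are normalised so that one frontier call costs 1.0. Every query
    pays the SLM cost; escalated queries additionally pay the frontier cost.
    """
    hybrid_cost = slm_to_frontier_cost + escalation_rate * 1.0
    return 1.0 - hybrid_cost
```

With made-up numbers, an SLM at 1/20th of frontier cost and a 20% escalation rate, `spend_reduction(0.2, 0.05)` comes out at about 0.75. The same formula shows how quickly a high escalation rate erodes the saving, which is why the routing signal matters more than the model choice.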
Frequently asked
How do I know if an SLM is good enough for my task?
Run your eval on it. Tasks with clear inputs and outputs (extraction, classification, structured generation) often work well. Tasks needing broad world knowledge or hard reasoning often need a frontier model.
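"Run your eval" can be as small as an exact-match accuracy check against labeled examples. The following is a minimal sketch under stated assumptions: `model` is any hypothetical callable from prompt to text, exact match after trimming and lowercasing is an assumed metric that suits classification or extraction, and the 2-point `max_gap` is an arbitrary tolerance, not a standard.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, expected output)


def accuracy(model: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Share of examples where the model output matches the expected label."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        1
        for prompt, expected in examples
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(examples)


def good_enough(slm_acc: float, frontier_acc: float, max_gap: float = 0.02) -> bool:
    """Assumed acceptance rule: the SLM trails the frontier baseline by at most max_gap."""
    return frontier_acc - slm_acc <= max_gap
```

Running `accuracy` once for the SLM and once for the frontier model on the same examples gives the two numbers `good_enough` compares. For free-form generation, exact match is too strict and a task-specific scorer would take its place.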
Can SLMs run on my laptop?
Yes: 3B–8B models run on consumer hardware with 8–16 GB RAM via Ollama, llama.cpp, or MLX. Quantization to 4–5 bits per weight (Q4/Q5) trades a few quality points for roughly a 3–4× cut in weight memory relative to 16-bit weights.
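The memory arithmetic behind that answer is simple enough to check by hand. This sketch counts weight storage only, as parameters times bits per weight; it deliberately ignores the KV cache, activations, and runtime overhead, and it treats Q4 as exactly 4 bits even though real quantized formats store extra scaling data and land somewhat higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


fp16_gb = weight_memory_gb(8, 16)  # 8B model at 16-bit weights
q4_gb = weight_memory_gb(8, 4)     # same model at an idealised 4 bits per weight
print(f"FP16: {fp16_gb:.1f} GB, Q4: {q4_gb:.1f} GB, ratio: {fp16_gb / q4_gb:.1f}x")
```

Under these assumptions an 8B model needs about 16 GB at 16-bit precision and about 4 GB at 4 bits, which is why a quantized 8B model fits on a 16 GB laptop while the unquantized one does not.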