LoRA
Low-Rank Adaptation — a parameter-efficient fine-tuning technique that updates a small number of additional weights instead of the full model, cutting compute and storage cost by 100×+ with minimal accuracy loss.
LoRA solves the cost problem of fine-tuning frontier-class models. Instead of updating billions of parameters, LoRA freezes the base model and trains a small set of low-rank "adapter" matrices that modify the model's behavior. The adapters are a fraction of the size — typically 0.1%–2% of the base model — and can be swapped in and out without re-loading the base.
For practical impact: training cost drops from hundreds of GPU-hours to single-digit hours; storage drops from tens of GBs to tens of MBs per fine-tune. This makes per-customer or per-domain fine-tunes economically viable.
In 2026, LoRA is the default fine-tuning approach for Llama, Qwen, Mistral, and other open-weight models. Hosted vendors (Anthropic, OpenAI) support LoRA-style fine-tuning behind their APIs without exposing the technique directly.
Frequently asked
Is LoRA worse than full fine-tuning?+
Marginally on hardest tasks; equivalent on most. The ~100× compute savings usually outweigh the small accuracy gap. For production use, LoRA is the right default.
Can I stack multiple LoRAs?+
Yes — composable LoRAs is an active 2026 research area. You can train per-task or per-domain adapters and swap them in based on the request. Most production stacks have not reached this point yet but the tooling is improving.