🏗️Architecturealso: lora, low-rank adaptation, low rank adaptation

LoRAdefinition and how it works in 2026

LoRA: Low-Rank Adaptation — a parameter-efficient fine-tuning technique that updates a small number of additional weights instead of the full model, cutting compute and storage cost by 100×+ with minimal accuracy loss.

LoRA solves the cost problem of fine-tuning frontier-class models. Instead of updating billions of parameters, LoRA freezes the base model and trains a small set of low-rank "adapter" matrices that modify the model's behavior. The adapters are a fraction of the size — typically 0.1%–2% of the base model — and can be swapped in and out without re-loading the base.

For practical impact: training cost drops from hundreds of GPU-hours to single-digit hours; storage drops from tens of GBs to tens of MBs per fine-tune. This makes per-customer or per-domain fine-tunes economically viable.

In 2026, LoRA is the default fine-tuning approach for Llama, Qwen, Mistral, and other open-weight models. Hosted vendors (Anthropic, OpenAI) support LoRA-style fine-tuning behind their APIs without exposing the technique directly.

Frequently asked

Is LoRA worse than full fine-tuning?+

Marginally on hardest tasks; equivalent on most. The ~100× compute savings usually outweigh the small accuracy gap. For production use, LoRA is the right default.

Can I stack multiple LoRAs?+

Yes — composable LoRAs is an active 2026 research area. You can train per-task or per-domain adapters and swap them in based on the request. Most production stacks have not reached this point yet but the tooling is improving.

Frequently asked

Related terms