Local LLM
A large language model running entirely on hardware you control — your laptop, your server, or your data center — with no calls to external APIs.
Local LLMs are the deployment pattern for teams that need air-gapped data, predictable cost, or offline capability. Llama, Qwen, Mistral, and DeepSeek release strong open-weights models monthly; tools like Ollama, LM Studio, vLLM, and llama.cpp make them runnable on consumer GPUs or even Apple Silicon.
The 2026 quality bar: 70B-class open models are competitive with frontier closed models for routine coding, summarization, and structured-output tasks. They lag on multi-step reasoning, long-horizon planning, and the trickiest tool-use scenarios — but the gap is narrowing every quarter.
For agents, local LLMs make sense in three cases: (1) data sensitivity prevents API calls, (2) per-token cost at scale exceeds the amortized hardware cost, (3) offline operation is a requirement. Otherwise, frontier APIs usually win on quality per dollar.
Cline, Codex CLI, and Fixie can all run against a local Ollama or vLLM endpoint instead of a hosted API — giving you a fully local agent stack.
See open-source agentsFrequently asked
What hardware do I need to run a local LLM?+
A 7B model runs on any laptop with 16GB RAM. A 70B model needs 48GB+ VRAM (Apple M3 Max, RTX 4090 + offload, or a dual-GPU rig). Production deployments use H100/A100 servers.
Is a local LLM as good as Claude or GPT?+
For routine tasks, yes — Qwen 2.5 72B, Llama 3.3 70B, and DeepSeek-V3 are competitive. For frontier reasoning, planning, and tool use, frontier closed models still lead by a meaningful margin in 2026.
Can I run a local LLM for an agent?+
Yes. Cline, Codex CLI, and most open-source agents support OpenAI-compatible local endpoints (Ollama, vLLM, LM Studio). Expect 5–20× slower throughput than a frontier API.