🏗️ Architecture
also: inference, llm inference, model inference

Inference: definition and how it works in 2026

Inference: The process of running a trained LLM to produce outputs — the production phase, distinct from training. Inference is what you pay for when you use an LLM API.

Inference is "using the model" as opposed to "training the model." Every LLM API call is an inference operation. The compute cost and latency you observe are properties of the inference setup, and output quality depends on it too (for example, through quantization and sampling settings), not only on the underlying model parameters.

Inference performance depends on several factors: model size, quantization, KV-cache management, batch size, hardware (GPU/TPU/CPU), and inference engine (vLLM, TGI, TensorRT-LLM, llama.cpp). For self-hosted setups, optimizing inference is where most cost savings live.
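To make the interaction between model size, quantization, KV-cache size, and batch size concrete, here is a minimal back-of-the-envelope sketch. It assumes a standard transformer decoder that stores one key and one value vector per layer, per KV head, per token. The function names and the 7B-parameter model shape (32 layers, 32 KV heads, head dimension 128) are illustrative assumptions, not figures from any particular engine.

```python
def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to hold the model weights at a given precision."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Memory needed for the KV cache of a standard transformer decoder.

    The leading 2 accounts for storing both keys and values in every layer.
    bytes_per_elem=2 assumes a 16-bit cache (an assumption; engines differ).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    GIB = 2**30
    # Hypothetical 7B-parameter model shape, assumed for illustration only.
    shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

    # Quantization shrinks the weights roughly in proportion to bit width.
    for bits in (16, 8, 4):
        print(f"weights @ {bits}-bit: {weight_bytes(7e9, bits) / GIB:.1f} GiB")

    # The KV cache grows linearly with both context length and batch size.
    for batch in (1, 8, 32):
        kv = kv_cache_bytes(seq_len=4096, batch_size=batch, **shape)
        print(f"KV cache @ batch {batch}, 4096 tokens: {kv / GIB:.1f} GiB")
```

Under these assumptions, the 16-bit KV cache at batch size 8 and 4,096 tokens comes to 16 GiB, more than the roughly 13 GiB of 16-bit weights. That is why KV-cache management and batch size sit alongside quantization in the list of factors.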

In 2026, "inference-time compute" (also called test-time compute) has become a key dimension of model performance: reasoning models like o3 and Claude with extended thinking spend dramatically more compute at inference on harder problems, generating intermediate reasoning tokens before the final answer. The trade-off is higher latency and cost in exchange for accuracy.
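The latency side of that trade-off can be sketched with a simple model: total response time is roughly the time to first token plus the time to decode every generated token, including any reasoning tokens. The function name and all numbers below (0.5 s time to first token, 50 tokens per second, 4,000 reasoning tokens) are assumptions chosen for illustration; real systems vary widely.

```python
def response_latency_s(ttft_s: float, reasoning_tokens: int,
                       answer_tokens: int, decode_tokens_per_s: float) -> float:
    """Rough end-to-end latency: time to first token plus sequential decode time.

    Reasoning tokens are decoded like any other output token, so they add
    directly to latency even if they are never shown to the user.
    """
    if decode_tokens_per_s <= 0:
        raise ValueError("decode_tokens_per_s must be positive")
    return ttft_s + (reasoning_tokens + answer_tokens) / decode_tokens_per_s


if __name__ == "__main__":
    # Assumed figures, for illustration only.
    plain = response_latency_s(0.5, reasoning_tokens=0, answer_tokens=300,
                               decode_tokens_per_s=50.0)
    thinking = response_latency_s(0.5, reasoning_tokens=4000, answer_tokens=300,
                                  decode_tokens_per_s=50.0)
    print(f"without reasoning: {plain:.1f} s")
    print(f"with reasoning:    {thinking:.1f} s")
```

With these assumed numbers the same 300-token answer takes 6.5 s without reasoning and 86.5 s with it. Because output tokens are typically billed, the extra reasoning tokens raise cost as well as latency.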

Frequently asked

Why is inference more expensive than I expect?

Long contexts, expensive models, and inefficient prompts compound. The fix: prompt caching, smaller models for routine work, quantization for self-hosted setups, aggressive context trimming.
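As a rough illustration of how these factors compound, the sketch below estimates the cost of one API request from its token counts. The `Pricing` class, the `request_cost` function, and every price in it are hypothetical placeholders, not any provider's actual rates. It assumes a common billing shape where cached prompt tokens are charged at a discounted input rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in dollars per million tokens (hypothetical values)."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def request_cost(p: Pricing, prompt_tokens: int, cached_tokens: int,
                 output_tokens: int) -> float:
    """Estimated dollar cost of one request.

    cached_tokens is the part of the prompt served from a prompt cache;
    the remainder is billed at the full input rate.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    total = (uncached * p.input_per_mtok
             + cached_tokens * p.cached_input_per_mtok
             + output_tokens * p.output_per_mtok)
    return total / 1_000_000


if __name__ == "__main__":
    # Placeholder prices, assumed for illustration only.
    pricing = Pricing(input_per_mtok=3.0, cached_input_per_mtok=0.3,
                      output_per_mtok=15.0)
    cold = request_cost(pricing, prompt_tokens=20_000, cached_tokens=0,
                        output_tokens=500)
    warm = request_cost(pricing, prompt_tokens=20_000, cached_tokens=18_000,
                        output_tokens=500)
    trimmed = request_cost(pricing, prompt_tokens=5_000, cached_tokens=0,
                           output_tokens=500)
    print(f"no caching:      ${cold:.4f}")
    print(f"90% cached:      ${warm:.4f}")
    print(f"trimmed context: ${trimmed:.4f}")
```

In this sketch the 20,000-token prompt dominates the bill, so both caching most of it and trimming it cut the per-request cost by well over half. Multiplied across many requests, that is where unexpected spend usually comes from.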

What is the fastest inference engine for self-hosted LLMs?

vLLM for throughput-oriented serving. TensorRT-LLM for absolute peak performance on NVIDIA. llama.cpp for CPU and Apple Silicon. SGLang for the most advanced routing and prefix sharing.

