KV cache
A transformer inference optimization that stores key/value attention tensors from previous tokens so they do not need to be recomputed on every new token.
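The KV cache is why LLM inference is fast enough to be useful. Without it, generating each token would mean recomputing keys and values for the entire prior sequence and re-running attention over it, so the cost of each step would grow quadratically with sequence length. With it, each token's keys and values are computed once and reused; at every step only the new token gets fresh keys and values, and its query attends over the cached ones, so the per-step cost grows only linearly.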
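The reuse described above can be sketched with a toy single-head, single-layer attention loop in plain Python. This is an illustration under simplifying assumptions, not how any production engine is written: the helper names (`matvec`, `attend`, `decode_without_cache`, `decode_with_cache`) and the random weights are hypothetical, and real models repeat this per layer and per head on GPU tensors.

```python
import math
import random

def matvec(x, W):
    """Row vector x (length n) times matrix W (n x m) -> list of length m."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query over the given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

def decode_without_cache(tokens, Wq, Wk, Wv):
    """Re-project keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [matvec(x, Wk) for x in prefix]
        values = [matvec(x, Wv) for x in prefix]
        projections += 2 * t
        outputs.append(attend(matvec(tokens[t - 1], Wq), keys, values))
    return outputs, projections

def decode_with_cache(tokens, Wq, Wk, Wv):
    """Project each token's key and value once, append to the cache, reuse."""
    k_cache, v_cache, outputs, projections = [], [], [], 0
    for x in tokens:
        k_cache.append(matvec(x, Wk))
        v_cache.append(matvec(x, Wv))
        projections += 2
        outputs.append(attend(matvec(x, Wq), k_cache, v_cache))
    return outputs, projections

random.seed(0)
dim, steps = 4, 6
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(steps)]
Wq, Wk, Wv = ([[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
              for _ in range(3))

slow_out, slow_proj = decode_without_cache(tokens, Wq, Wk, Wv)
fast_out, fast_proj = decode_with_cache(tokens, Wq, Wk, Wv)
```

Both loops produce the same attention outputs. For six tokens the uncached loop performs 42 key/value projections against 12 for the cached loop, and the gap widens as the sequence grows.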
For agents, the KV cache can be the difference between a single-call latency on the order of 200 ms and one on the order of 20 s on long traces. Tools like vLLM, TensorRT-LLM, and SGLang optimize KV-cache management aggressively: paging, prefix sharing across requests, and eviction policies for memory pressure.
In 2026, KV-cache reuse across requests (sometimes called "prompt caching" when surfaced as a vendor feature) can cut inference cost by 50–90% for agents that reuse a long system prompt. Anthropic, OpenAI, and Google all expose this as a billing optimization; using it well is now table stakes.
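As a rough illustration of where savings in that range come from, the arithmetic can be sketched as below. The function name `relative_cost`, the token counts, and the `cached_price_ratio` of 0.1 are all hypothetical values chosen for the example, not any vendor's actual pricing; the sketch also covers input tokens only and ignores output tokens and any surcharge for writing to the cache.

```python
def relative_cost(prefix_tokens, fresh_tokens, cached_price_ratio):
    """Input cost of a request with a cached prefix, relative to no caching.

    cached_price_ratio is the assumed price of a cached input token as a
    fraction of the normal input-token price (hypothetical value).
    """
    uncached = prefix_tokens + fresh_tokens
    cached = prefix_tokens * cached_price_ratio + fresh_tokens
    return cached / uncached

# Hypothetical agent call: a 9,000-token shared system prompt plus
# 1,000 tokens that are new on this turn.
saving = 1 - relative_cost(prefix_tokens=9000, fresh_tokens=1000,
                           cached_price_ratio=0.1)
print(f"input-cost saving: {saving:.0%}")
```

Under these assumed numbers the saving on input tokens is about 81%. The result is driven almost entirely by how much of each request is a shared prefix, which is why long, stable system prompts benefit most.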
Frequently asked
What is the difference between KV cache and prompt caching?
KV cache is the inference-time data structure; prompt caching is the vendor-facing feature that reuses the KV cache across requests with the same prompt prefix. Prompt caching is built on KV caching.
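One way to picture the relationship is a lookup from token prefixes to stored KV entries. The sketch below is a toy model under stated assumptions: the `PrefixCache` class, its block size, and the `fake_kv` payload are hypothetical, and it does not reflect the internals of vLLM, SGLang, or any vendor's service. Real systems typically hash blocks rather than store whole prefixes as keys, and hold GPU tensors rather than strings.

```python
class PrefixCache:
    """Toy prefix cache: maps block-aligned token prefixes to KV payloads."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.store = {}  # tuple of prefix tokens -> placeholder KV payload

    def insert(self, tokens, compute_kv):
        """Store KV for every full block of `tokens` that is not cached yet."""
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = tuple(tokens[:end])
            if key not in self.store:
                self.store[key] = compute_kv(tokens[end - self.block_size:end])

    def longest_hit(self, tokens):
        """Return how many leading tokens already have cached KV."""
        hit = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if tuple(tokens[:end]) not in self.store:
                break
            hit = end
        return hit

def fake_kv(block):
    # Stand-in for the real key/value tensors of one block (hypothetical).
    return f"kv{block}"

system_prompt = list(range(8))                 # 8 shared prompt tokens
request_a = system_prompt + [100, 101, 102, 103]
request_b = system_prompt + [200, 201]

cache = PrefixCache(block_size=4)
cache.insert(request_a, fake_kv)               # first request fills the cache
reused = cache.longest_hit(request_b)          # second request reuses the prefix
to_compute = len(request_b) - reused
```

Here the second request finds the 8 system-prompt tokens already cached and only needs keys and values for its 2 new tokens. The match is on an exact prefix, so anything that changes early in the prompt invalidates everything after it.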
How big does the KV cache get?
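It depends on the number of layers, the number of KV heads, the head dimension, the context length, and the numeric precision. For a 70B model with a 32K context, a Llama-2-70B-style layout (80 layers, 8 KV heads via grouped-query attention, head dimension 128) at 16-bit precision needs roughly 10 GB per concurrent request; the same model with full multi-head attention (64 KV heads) would need roughly 80 GB. This is why long-context inference is GPU-memory-bound and why paged-attention systems (vLLM, SGLang) matter so much in production.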
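A back-of-the-envelope estimate multiplies 2 (keys and values) by layers, KV heads, head dimension, tokens, and bytes per value. The sketch below applies that to an assumed Llama-2-70B-style shape (80 layers, head dimension 128, 8 KV heads with grouped-query attention or 64 with full multi-head attention) at 16-bit precision. The function name `kv_cache_bytes` is hypothetical, and real figures vary with architecture, quantization, and allocator overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV-cache size for one sequence.

    The leading factor of 2 accounts for storing both a key tensor and a
    value tensor in every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

GIB = 1024 ** 3
context = 32 * 1024  # 32K tokens

# Assumed shapes: grouped-query attention (8 KV heads) vs. full
# multi-head attention (64 KV heads), both at 2 bytes per value.
gqa_gib = kv_cache_bytes(80, 8, 128, context) / GIB
mha_gib = kv_cache_bytes(80, 64, 128, context) / GIB
print(f"GQA: {gqa_gib:.1f} GiB, MHA: {mha_gib:.1f} GiB")  # GQA: 10.0 GiB, MHA: 80.0 GiB
```

The estimate scales linearly with both context length and the number of concurrent requests, which is why serving systems allocate the cache in pages instead of reserving the maximum for every request.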