Test-time compute
Spending more compute at inference time — longer reasoning, more samples, search — to get higher accuracy without retraining the model.
Test-time compute is arguably the most important shift of 2025–2026 in how models are run. Instead of making the base model bigger, you let it think for longer. OpenAI o3, Claude Sonnet with extended thinking, and Gemini 2.5 with Deep Think all spend far more compute per query on hard problems and get markedly better results.
The patterns: longer chain-of-thought traces, parallel sampling with selection, self-critique loops, and search over candidate solutions. A reasoning model on a hard problem may emit 50,000 internal tokens before answering, and on that problem it can beat a non-reasoning model that answers in 200.
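As a rough illustration of parallel sampling with selection, the sketch below draws several independent answers and keeps the most common one (majority voting). The solver is a hypothetical stand-in for a model call, not a real API: it returns the correct answer with a fixed probability and a random wrong answer otherwise. The names make_noisy_solver and accuracy, and the 0.5 success rate, are assumptions made for the example.

```python
import random
from collections import Counter
from typing import Callable


def majority_vote(sample: Callable[[], str], n: int) -> str:
    """Draw n independent answers and return the most common one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [sample() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def make_noisy_solver(
    correct: str, wrong: list[str], p_correct: float, rng: random.Random
) -> Callable[[], str]:
    """Hypothetical stand-in for a model call: right with probability p_correct."""
    def sample() -> str:
        if rng.random() < p_correct:
            return correct
        return rng.choice(wrong)
    return sample


def accuracy(n_samples: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which voting over n_samples answers is correct."""
    rng = random.Random(seed)
    solver = make_noisy_solver("42", ["41", "43", "44"], 0.5, rng)
    hits = sum(majority_vote(solver, n_samples) == "42" for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for n in (1, 5, 15):
        print(f"samples={n:2d}  accuracy={accuracy(n):.3f}")
```

In this toy setup the wrong answers are spread across several values while the correct ones agree, so voting over more samples raises accuracy at a cost that grows linearly with the sample count. Real models can make correlated errors, in which case voting helps less.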
For agents, test-time compute is the lever you tune by problem difficulty. Routine tool calls run on fast non-reasoning models. Planning, evaluation, and verification steps run on reasoning models with generous thinking budgets. Cost climbs steeply with the thinking budget, but on hard steps accuracy climbs faster.
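One way to picture this routing is a small function that maps each agent step to a model tier and a thinking budget. This is a minimal sketch under stated assumptions: the model names, the step kinds, the difficulty score, and the budget numbers are all hypothetical and do not correspond to any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    model: str            # hypothetical model identifier
    thinking_budget: int  # max reasoning tokens; 0 means no extended reasoning


# Hypothetical tiers and step kinds, chosen for illustration only.
FAST = ModelChoice(model="fast-model", thinking_budget=0)
REASONING_STEPS = {"plan", "evaluate", "verify"}


def choose_model(step_kind: str, difficulty: float) -> ModelChoice:
    """Pick a model and thinking budget for one agent step.

    difficulty is a caller-supplied estimate in [0, 1].
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    if step_kind not in REASONING_STEPS:
        # Routine steps such as tool calls stay on the fast tier.
        return FAST
    # Scale the budget with difficulty: 2,000 tokens up to 32,000 tokens.
    budget = 2_000 + int(difficulty * 30_000)
    return ModelChoice(model="reasoning-model", thinking_budget=budget)


if __name__ == "__main__":
    for kind, diff in [("tool_call", 0.9), ("plan", 0.2), ("verify", 1.0)]:
        print(kind, diff, choose_model(kind, diff))
```

Keeping the routing decision in one place makes the budget easy to tune. In practice the difficulty estimate would have to come from somewhere, for example a heuristic or a cheap classifier, which is outside this sketch.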
Frequently asked
Is test-time compute the same as chain-of-thought?
CoT is one technique within the broader test-time-compute paradigm. Test-time compute also covers parallel sampling, search, self-critique, and any other inference-side method that trades latency for accuracy.
How much does test-time compute cost?
A burst of reasoning on a hard problem can cost 50–100× a single non-reasoning completion. The savings come from getting the right answer once instead of debugging wrong answers ten times.
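The trade-off can be made concrete with a back-of-the-envelope model. It assumes each attempt succeeds independently with a fixed probability, so the expected number of attempts is 1/p, and that every failed attempt adds a fixed overhead for debugging or review. All numbers below are made up for illustration: costs are in units of one non-reasoning completion, and the success rates and overhead are assumptions, not measurements.

```python
def expected_cost_per_solve(
    attempt_cost: float, p_success: float, failure_overhead: float
) -> float:
    """Expected total cost until the first correct answer.

    Assumes independent attempts with success probability p_success,
    so expected attempts = 1 / p_success and expected failures = 1 / p_success - 1.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    expected_attempts = 1.0 / p_success
    expected_failures = expected_attempts - 1.0
    return attempt_cost * expected_attempts + failure_overhead * expected_failures


if __name__ == "__main__":
    # Hypothetical inputs: fast model costs 1 unit and succeeds 10% of the time;
    # reasoning model costs 75 units and succeeds 90% of the time.
    for overhead in (0.0, 20.0):
        fast = expected_cost_per_solve(1.0, 0.1, overhead)
        slow = expected_cost_per_solve(75.0, 0.9, overhead)
        print(f"overhead={overhead:5.1f}  fast={fast:7.2f}  reasoning={slow:7.2f}")
```

Under these assumed numbers the reasoning model only comes out ahead once failed attempts carry a real cost of their own. If wrong answers are free to detect and discard, ten cheap attempts remain cheaper than one 75× call.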