🧰Capabilitiesalso: multimodal, vision capability

Visiondefinition and how it works in 2026

Vision: An agent capability for understanding images, screenshots, and video — letting the model reason over visual content.

Vision is what lets an agent read a Figma file, debug a UI by looking at it, OCR a receipt, or drive a browser by understanding screenshots instead of DOM trees.

In 2026, vision is standard in frontier models. The differentiation is no longer "can it see" but "can it reason about fine detail" — reading small text, counting elements, understanding charts.

Vision-driven browser use is the killer app: instead of parsing brittle DOM, the agent looks at the screenshot the way a human would and decides where to click.