ARC-AGI
François Chollet's benchmark for measuring fluid intelligence — agents must induce a transformation rule from a few input/output grid examples and apply it. Designed to resist memorization.
ARC-AGI presents a few input/output grid examples and asks the model to produce the output for a new input. Each task tests a different abstract rule (rotate, recolor, count, fill, etc.). Humans solve ~85%; LLMs historically scored <10%.
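The task structure described above can be sketched in a few lines of Python. This is a minimal illustration, not an actual ARC solver: the `train`/`test` and `input`/`output` keys follow the layout of the public ARC task files, while the three candidate transformations, the `CANDIDATES` table, and the `induce_rule` helper are hypothetical names made up for this example. Real tasks draw on a far larger and open-ended space of rules.

```python
from typing import Callable, Dict, List, Optional

Grid = List[List[int]]  # each cell holds a color index


def rotate_cw(g: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]


def flip_horizontal(g: Grid) -> Grid:
    """Mirror the grid left to right."""
    return [row[::-1] for row in g]


def recolor_1_to_2(g: Grid) -> Grid:
    """Replace every cell of color 1 with color 2."""
    return [[2 if cell == 1 else cell for cell in row] for row in g]


# Hypothetical, deliberately tiny hypothesis space (an assumption of this sketch).
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": flip_horizontal,
    "recolor_1_to_2": recolor_1_to_2,
    "rotate_cw": rotate_cw,
}


def induce_rule(train: List[dict]) -> Optional[str]:
    """Return the first candidate consistent with every training pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(pair["input"]) == pair["output"] for pair in train):
            return name
    return None


# A toy task: two demonstration pairs and one test input.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [3, 0]], "output": [[3, 0], [0, 0]]},
    ],
    "test": [{"input": [[0, 5], [0, 0]]}],
}

rule = induce_rule(task["train"])
prediction = CANDIDATES[rule](task["test"][0]["input"]) if rule else None
print(rule, prediction)
```

The first demonstration pair alone is consistent with both a horizontal flip and a clockwise rotation; only the second pair rules out the flip. That ambiguity is why a task supplies several examples, and why the answer has to be induced rather than recalled.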
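The 2024 ARC Prize offered a prize pool of more than $1M, with the grand prize reserved for a solution reaching 85% on the private evaluation set under a fixed compute budget. In December 2024, OpenAI's o3 scored 87.5% on the semi-private evaluation set in a high-compute configuration (75.7% in its lower-compute configuration), marking the first time a frontier model approached human performance. That run exceeded the prize's compute limits, so it did not qualify for the prize. ARC-AGI-2 (2025) reset the bar: frontier models initially scored below 10% again.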
For agent buyers, ARC-AGI is a signal of fluid reasoning, not domain knowledge. A high ARC-AGI score suggests the model adapts to novel patterns; a low score does not mean the model is useless, only that it leans more on memorized patterns than on adapting to new ones.
Frequently asked
Is ARC-AGI a measure of AGI?
It is a measure of one component of fluid intelligence. Beating ARC-AGI is necessary but not sufficient for AGI. Chollet himself frames it as a benchmark that "tests one thing AGI would do, not all of them."
Why is ARC-AGI-2 so much harder than ARC-AGI-1?
ARC-AGI-2 was designed specifically to resist the test-time-compute strategies that cracked ARC-AGI-1. The puzzles remain simple for humans but are harder to brute-force with search.
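To give a rough feel for what brute-forcing with search costs, here is a toy enumeration in Python. It is only a sketch under stated assumptions: the three primitives and the helpers `run`, `brute_force`, and `search_space_size` are hypothetical, and nothing here describes how ARC-AGI-2 was built or how any real solver works. It shows only the general point that the number of candidate programs grows exponentially with the number of composed steps a rule needs.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple

Grid = List[List[int]]
Transform = Callable[[Grid], Grid]


def rotate_cw(g: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]


def flip_horizontal(g: Grid) -> Grid:
    """Mirror the grid left to right."""
    return [row[::-1] for row in g]


def transpose(g: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(row) for row in zip(*g)]


# Hypothetical primitive set (an assumption of this sketch).
PRIMITIVES: List[Transform] = [rotate_cw, flip_horizontal, transpose]


def run(program: Sequence[Transform], grid: Grid) -> Grid:
    """Apply each step of the program in order."""
    for step in program:
        grid = step(grid)
    return grid


def brute_force(
    pairs: List[Tuple[Grid, Grid]], max_depth: int
) -> Tuple[Optional[List[str]], int]:
    """Try every composition up to max_depth; return (program, candidates tried)."""
    tried = 0
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            tried += 1
            if all(run(program, src) == dst for src, dst in pairs):
                return [step.__name__ for step in program], tried
    return None, tried


def search_space_size(num_primitives: int, max_depth: int) -> int:
    """Count all compositions of length 1..max_depth."""
    return sum(num_primitives ** d for d in range(1, max_depth + 1))


# One demonstration pair whose rule is a 180-degree rotation.
pairs = [([[1, 2], [3, 4]], [[4, 3], [2, 1]])]
program, tried = brute_force(pairs, max_depth=3)
print(program, tried)
print(search_space_size(3, 4), search_space_size(20, 4))
```

With three primitives, four steps of depth means 120 candidate programs; with twenty primitives the same depth means 168,420. Rules that need more primitives or longer compositions therefore get expensive to enumerate very quickly.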