aiagentrank.io
📊 Evaluation
Also: arc agi, arc-agi benchmark, abstract reasoning corpus

ARC-AGI

François Chollet's benchmark for measuring fluid intelligence — agents must induce a transformation rule from a few input/output grid examples and apply it. Designed to resist memorization.

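Each ARC-AGI task shows a few input/output pairs of small colored grids and asks the model to produce the output grid for a new input. Every task tests a different abstract rule (rotate, recolor, count, fill, etc.), so the rule has to be inferred from the examples rather than recalled. Humans solve roughly 85% of tasks; LLMs historically scored below 10%.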
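As a rough illustration of the task shape described above, here is a minimal Python sketch. It assumes the layout used by the public ARC task files (a "train" list and a "test" list of input/output pairs, with grids stored as lists of integer color codes). The toy task and the helper names `rotate_cw`, `fits_examples`, and `score` are invented for this example and are not part of any official tooling.

```python
from typing import Callable

Grid = list[list[int]]  # each cell holds an integer color code

# Toy task, invented for illustration. Hidden rule: rotate 90 degrees clockwise.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]], "output": [[3, 0], [0, 0]]},
    ],
}


def rotate_cw(grid: Grid) -> Grid:
    """Candidate rule: rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def fits_examples(rule: Callable[[Grid], Grid], task: dict) -> bool:
    """True if the rule reproduces every demonstration pair exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


def score(rule: Callable[[Grid], Grid], task: dict) -> float:
    """Fraction of test inputs whose predicted grid matches cell for cell."""
    tests = task["test"]
    correct = sum(rule(pair["input"]) == pair["output"] for pair in tests)
    return correct / len(tests)


if __name__ == "__main__":
    print(fits_examples(rotate_cw, task))
    print(score(rotate_cw, task))
```

The comparison is exact: a predicted grid with a single wrong cell counts as a miss, which is part of why partial pattern matching earns no credit in this sketch.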

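ARC Prize 2024 offered a prize pool of more than $1M, with the grand prize reserved for a solution reaching 85% on ARC-AGI-1 under a fixed compute budget. In late 2024, OpenAI's o3 scored 87.5% on the semi-private evaluation set in a high-compute configuration that exceeded that budget, marking the first time a frontier model approached human performance. ARC-AGI-2 (2025) reset the bar: current frontier models score ~10% again.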

For agent buyers, ARC-AGI is a signal of fluid reasoning, not domain knowledge. A high ARC-AGI score means the model adapts to novel patterns; a low score does not mean the model is useless, only that it leans more on memorized patterns than on adapting to new ones.

Frequently asked

Is ARC-AGI a measure of AGI?

It is a measure of one component of fluid intelligence. Beating ARC-AGI is necessary but not sufficient for AGI. Chollet himself frames it as "tests one thing AGI would do, not all of them."

Why is ARC-AGI-2 so much harder than ARC-AGI-1?

ARC-AGI-2 was specifically designed against the test-time-compute strategies that cracked ARC-AGI-1. The puzzles are still simple for humans but harder to brute-force with search.
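To make "brute-force with search" concrete, here is a toy Python sketch of enumerating short programs over a tiny set of grid operations until one reproduces every example pair. The four primitives, the `PRIMITIVES` table, and the `search` function are hypothetical and chosen only for illustration; this is not the method of any actual ARC-AGI entry.

```python
from itertools import product

Grid = list[list[int]]


def rotate_cw(g: Grid) -> Grid:
    return [list(r) for r in zip(*g[::-1])]


def flip_h(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def flip_v(g: Grid) -> Grid:
    return [list(row) for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    return [list(r) for r in zip(*g)]


# Hypothetical mini-DSL: a program is a sequence of primitive names.
PRIMITIVES = {
    "rotate_cw": rotate_cw,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}


def apply_program(names: tuple[str, ...], grid: Grid) -> Grid:
    """Apply each named primitive in order."""
    for name in names:
        grid = PRIMITIVES[name](grid)
    return grid


def search(train_pairs: list[tuple[Grid, Grid]], max_depth: int = 2):
    """Return the first program, shortest first, that fits all pairs, else None."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            if all(apply_program(names, i) == o for i, o in train_pairs):
                return names
    return None


if __name__ == "__main__":
    # Example pair for a 180-degree rotation, which needs two steps in this DSL.
    pairs = [([[1, 2], [3, 4]], [[4, 3], [2, 1]])]
    print(search(pairs))
```

The number of candidates at a given depth is the number of primitives raised to that depth (4 at depth 1, 16 at depth 2 here), so tasks whose rules need longer or richer compositions quickly become expensive to enumerate.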
