2026-03-26 | AI benchmark | AGI | ARC Prize | AI competition

Every top AI just failed a test any human can pass

ARC-AGI-3 gave AI video-game-like puzzles humans find easy. ChatGPT-level AIs scored under 1%. A $2M prize challenges anyone to close the gap.


The short version: A new AI test called ARC-AGI-3 just dropped — and it exposed a massive gap between what AI can do and how humans think. The world's most powerful AIs, including ChatGPT, Claude, and Gemini, scored under 1% on interactive puzzles that most humans find fun and easy. A $2 million prize is now open to anyone who can build an AI that actually learns like we do.

Under 1% — On Games Humans Find Fun

[Image: ARC-AGI-3 benchmark launch announcement]

The ARC Prize Foundation — a nonprofit backed by leaders from OpenAI, Google, xAI, and Anthropic — just launched ARC-AGI-3, and the results are humbling for the AI industry.

They gave the world's most advanced AIs a set of interactive, video-game-like challenges. No written instructions. No stated goals. Just a screen, some controls, and a puzzle to figure out — the same way you'd pick up a new game on your phone.

Humans completed these challenges with ease. Many found them genuinely fun. The best AI agent in the world? 12.58% efficiency. The biggest names in AI — ChatGPT, Claude, Gemini — scored under 1%.

It Looks Like a Video Game — That's the Point

[Image: ARC-AGI-3 arcade-style game environments]

Previous AI tests were like multiple-choice exams: static puzzles with clear right answers. ARC-AGI-3 is completely different. It's the first interactive AI benchmark (a benchmark is a standardized test that measures how capable an AI really is), and it works like a video game.

Each environment drops the AI into an unfamiliar world with no instructions, no rules, and no stated objectives. The AI has to:

Explore — poke around the environment and observe what happens
Discover — figure out the goal on its own
Learn — build a mental model of how the world works
Solve — use what it learned to complete the challenge efficiently

Think of it this way: when you pick up a new game, you tap around, notice patterns, and quickly figure out what you're supposed to do. That's exactly what ARC-AGI-3 tests — and it's exactly what today's AI can't do.
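To make that loop concrete, here's a minimal Python sketch of the explore-discover-learn-solve cycle. The ToyGame environment below is a made-up stand-in, not the real ARC-AGI-3 API: a cursor on a 5x5 grid with a hidden goal cell and no instructions.

    import random

    class ToyGame:
        """Hypothetical stand-in for an ARC-AGI-3-style game:
        a cursor on a 5x5 grid, a hidden goal, no stated rules."""
        def __init__(self):
            self.pos, self.goal = (0, 0), (4, 4)

        def step(self, action):
            dx, dy = {"up": (0, -1), "down": (0, 1),
                      "left": (-1, 0), "right": (1, 0)}[action]
            x, y = self.pos
            self.pos = (min(4, max(0, x + dx)), min(4, max(0, y + dy)))
            return self.pos, self.pos == self.goal  # (observation, done?)

    env = ToyGame()
    world_model = {}                 # learn: (state, action) -> next state
    state, done = env.pos, False
    while not done:
        action = random.choice(["up", "down", "left", "right"])  # explore
        next_state, done = env.step(action)
        world_model[(state, action)] = next_state
        state = next_state
    print(f"goal discovered after learning {len(world_model)} transitions")

A real agent would replace the random choice with a policy that exploits the learned model. Making that replacement work in an unfamiliar world is exactly what the benchmark measures.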

The full benchmark spans 1,000+ levels across 150+ environments, covering everything from map navigation to pattern matching to volume adjustment puzzles. During a preview phase, over 1,200 human players completed 3,900+ games, and most reported genuinely enjoying them.

12.58% vs 100% — The Results Are In

[Image: StochasticGoose AI agent attempting an ARC-AGI-3 challenge]

The winner was an agent called StochasticGoose. It used CNN-based reinforcement learning — essentially training the AI to predict what happens next based on what it sees on screen, similar to how a self-driving car learns to read road signs. It managed 12.58% efficiency and completed 18 levels.
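The foundation hasn't published StochasticGoose's architecture here, so treat the following as a rough illustration only: a PyTorch sketch of CNN-based next-frame prediction, with made-up sizes (a 16-color palette, 6 actions, 32x32 frames) standing in for the real specs.

    import torch
    import torch.nn as nn

    NUM_COLORS, NUM_ACTIONS = 16, 6          # assumed sizes, not the real spec

    class NextFramePredictor(nn.Module):
        """Predict the next screen from the current frame plus the action."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(NUM_COLORS + NUM_ACTIONS, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, NUM_COLORS, 1),  # per-pixel color logits
            )

        def forward(self, frame, action):
            # frame: (B, NUM_COLORS, H, W) one-hot; action: (B,) action ids
            B, _, H, W = frame.shape
            planes = torch.zeros(B, NUM_ACTIONS, H, W)
            planes[torch.arange(B), action] = 1.0  # tile the action spatially
            return self.net(torch.cat([frame, planes], dim=1))

    model = NextFramePredictor()
    frame = torch.zeros(1, NUM_COLORS, 32, 32)
    logits = model(frame, torch.tensor([2]))
    # train against the frame actually observed after taking the action
    observed = torch.randint(0, NUM_COLORS, (1, 32, 32))  # placeholder target
    loss = nn.functional.cross_entropy(logits, observed)
    loss.backward()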

The runner-up, Blind Squirrel, took a different approach — mapping every possible state of the game into a decision tree. It scored 6.71% and completed 13 levels.
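Blind Squirrel's actual code isn't reproduced here either; as a generic sketch of the state-mapping idea, here is a breadth-first search that records every reachable state of the toy grid game from the earlier sketch and reads a winning action sequence back out of the tree.

    from collections import deque

    def simulate(state, action):
        """Deterministic step function for the toy 5x5 grid game."""
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x, y = state
        return (min(4, max(0, x + dx)), min(4, max(0, y + dy)))

    def solve(start, goal, actions=("up", "down", "left", "right")):
        frontier, parent = deque([start]), {start: None}
        while frontier:
            state = frontier.popleft()
            if state == goal:                 # walk the tree back to the start
                plan = []
                while parent[state]:
                    state, action = parent[state]
                    plan.append(action)
                return plan[::-1]
            for a in actions:
                nxt = simulate(state, a)
                if nxt not in parent:         # map each state exactly once
                    parent[nxt] = (state, a)
                    frontier.append(nxt)

    print(solve((0, 0), (4, 4)))              # eight moves, found exhaustively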

[Image: Blind Squirrel AI agent attempting ARC-AGI-3 pattern matching]

But here's the real headline: the large language models behind ChatGPT, Claude, and Gemini — the AIs millions of people use daily — scored under 1%. These are the same AIs that can write essays, solve math problems, and generate code. Put them in an environment where they need to learn from experience? They're essentially lost.

The key finding: Systematic state tracking (carefully mapping what's happening step by step) mattered far more than raw model size. A small, focused AI with a good exploration strategy beat massive language models that cost billions to train. Bigger isn't always smarter.
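How the winning agents tracked state isn't spelled out in the announcement. One common lightweight version of the idea is to fingerprint every frame the agent has seen and steer toward actions expected to produce a state it hasn't; a sketch, with world_model assumed to map (state, action) pairs to fingerprints of outcomes:

    import hashlib
    import random

    seen = set()                  # fingerprints of every state visited so far

    def frame_hash(frame_bytes):
        """Fingerprint a raw frame so visited states can be tracked exactly."""
        return hashlib.sha1(frame_bytes).hexdigest()

    def pick_action(state, actions, world_model):
        """Prefer actions whose known outcome is a state we haven't seen.
        Untried actions return None from the lookup and count as novel."""
        novel = [a for a in actions if world_model.get((state, a)) not in seen]
        return random.choice(novel or list(actions))

This kind of bookkeeping fits in a dozen lines, which is the point of the finding: it apparently mattered more than parameter count.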

A $2 Million Open Challenge

The ARC Prize 2026 competition is now open with $2 million in total prizes across three tracks:

ARC-AGI-3 Track — Build an AI that can learn in interactive environments (the new video-game-style test)
ARC-AGI-2 Track — Tackle the traditional static reasoning puzzles from the previous version
Paper Prize — Publish research that explains why certain approaches work better than others

There's one big catch: all winning solutions must be open-sourced under permissive licenses. And during evaluation on Kaggle (the world's largest data science competition platform), no internet access is allowed — teams can't call ChatGPT, Claude, or any cloud AI. The AI must run entirely on local hardware.
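In practice, the no-internet rule means model weights must be bundled with the submission and loaded from disk. A sketch using the Hugging Face transformers library; the model directory is a placeholder path, not a real competition dataset:

    import os
    os.environ["HF_HUB_OFFLINE"] = "1"   # fail fast if anything touches the network

    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_DIR = "/kaggle/input/my-open-weight-model"   # placeholder path

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, local_files_only=True)

    prompt = tokenizer("Describe the current game frame:", return_tensors="pt")
    out = model.generate(**prompt, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))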

[Image: Human performance data across ARC-AGI-3 preview games showing clear AI gap]

Key dates:

  • June 30 & September 30: Milestone checkpoint submissions
  • November 2: Final submission deadline
  • December 4: Results announced

Play the Games Yourself

You don't need to be a researcher or engineer to try this. ARC Prize has a public game set where anyone can play the ARC-AGI-3 challenges directly in their browser. See how you compare to the best AI — spoiler: you'll almost certainly win.

For developers who want to build and test AI agents against the benchmark, the toolkit is open-source and installs in one command:

pip install arckit

It supports multiple AI backends including OpenAI, Anthropic, Google, and open-weight models — though competition submissions must run locally without any cloud APIs.

As the ARC Prize Foundation puts it: "As long as there is a gap between AI and human learning, we do not have AGI." With humans at 100% and the best AI at 12.58%, that gap is still enormous. The $2 million question: can anyone close it by December?

