Every top AI just scored under 1% on a test any human aces
ARC-AGI-3 drops GPT-5.4 to 0.26%, Claude to 0.25%, and Grok to 0%. Untrained humans score 100%. A $700,000 grand prize (part of a prize pool of over $2 million) awaits any AI that can match a regular person.
On March 25, 2026, the ARC Prize Foundation released a new benchmark called ARC-AGI-3 — and the results immediately spread across the AI research world. Every major frontier AI model scored below 1%. Untrained humans scored 100%. The gap between human and artificial intelligence, at least on this test, isn't narrowing. It is a cliff.
GPT-5.4 (OpenAI's latest model with a 1-million-token context window) scored 0.26%. Claude Opus 4.6 (Anthropic's most powerful model) scored 0.25%. Gemini 3.1 Pro Preview (Google's frontier model) hit 0.37% — the best of the group, and still essentially zero. Grok-4.20 scored 0.00%. Not 0.01%. Zero. All 135 environments were solved successfully by every human tester, none of whom had any prior training or instructions.
ARC-AGI-3 Leaderboard — March 2026
- 🧑 Untrained Humans: 100%
- 🤖 Gemini 3.1 Pro Preview: 0.37%
- 🤖 GPT-5.4: 0.26%
- 🤖 Claude Opus 4.6: 0.25%
- 🤖 Grok-4.20: 0.00%
$700,000 grand prize for any AI matching human performance. Still unclaimed.
What ARC-AGI-3 Actually Tests
Previous AI benchmarks (standardized tests used to measure and compare AI systems) have a fundamental weakness: the more widely used a benchmark becomes, the more AI systems get specifically trained to pass it. Once a benchmark gets “saturated” — meaning AI reaches near-perfect scores — it stops measuring real intelligence and starts measuring memorization.
ARC-AGI-3 was designed to be un-saturatable. Here is how it works:
The benchmark places an AI agent (an AI system that takes actions in an environment, not just answers questions) into a handcrafted, interactive game environment it has never seen before. The game has no instructions. No rulebook. No tutorial. No stated goal. The agent sees a visual display, can perform actions, and observes what changes — and that is it.
Think of it like being dropped into an alien video game and told to win, with no idea what winning means or how the controls work. Humans naturally explore, test hypotheses, adapt, and figure it out within minutes. Current AI systems largely do not.
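To make that concrete, here is a minimal sketch of the kind of observe-act-observe loop such an agent faces. Everything in it is a toy stand-in invented for illustration: the ToyEnv rules, the method names, and the random agent are assumptions, not the actual ARC-AGI-3 games or agent API.

```python
import random

class ToyEnv:
    """Toy stand-in: reach the rightmost cell of a 1-D grid. The agent is never told this."""
    def __init__(self):
        self.position = 0
        self.available_actions = ["left", "right"]

    def reset(self):
        self.position = 0
        return self.observe()

    def observe(self):
        # The agent only ever sees a rendered frame, never the rules or the goal.
        return ["A" if i == self.position else "." for i in range(6)]

    def step(self, action):
        self.position += 1 if action == "right" else -1
        self.position = max(0, min(5, self.position))
        return self.observe()

    @property
    def solved(self):
        return self.position == 5

def play(env, choose_action, max_actions=100):
    """Run one episode: observe, act, observe what changed, repeat."""
    frame = env.reset()
    for actions_taken in range(1, max_actions + 1):
        frame = env.step(choose_action(frame, env.available_actions))
        if env.solved:
            return True, actions_taken
    return False, max_actions

# A purely random explorer: no hypotheses, no memory of what worked.
solved, n = play(ToyEnv(), lambda frame, actions: random.choice(actions))
print(f"solved={solved} after {n} actions")
```

A human dropped into even this trivial game would infer "move right" after an action or two; the random agent shows how action counts balloon when there is no hypothesis-forming step, which is exactly what the scoring system below punishes.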
The benchmark contains 135 unique environments, each handcrafted by human designers to be:
- Solvable by any normal human within a few minutes with zero preparation
- Novel enough that AI cannot have memorized the solution from training data
- Interactive — requiring real-time action and adaptation, not just a single typed answer
- Scalable — later levels within each environment get progressively harder
The Scoring System That Punishes Brute Force
ARC-AGI-3 uses a metric called RHAE (Relative Human Action Efficiency). This scoring system does not just ask "did the AI succeed?" — it asks "how efficiently did the AI succeed, compared to a human?"
The formula: (number of human actions needed) divided by (number of AI actions taken), then squared. The squaring is deliberate and brutal: it aggressively penalizes AI systems that “solve” problems by blindly trying thousands of actions until one works.
Here is the math in plain terms: if a human solves an environment in 10 moves and an AI takes 100 moves to stumble to the same answer, the AI's score is not 10%. It is (10 divided by 100) squared = 1%. This makes brute-force guessing essentially worthless and rewards genuine problem-solving.
The human baseline itself is carefully controlled: each environment's score is based on the second-best performance out of ten first-time players. This filters out lucky outliers while maintaining realistic human competence as the bar to clear.
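For readers who want the arithmetic spelled out, here is a small sketch of the scoring rule as described above. The function names, the 100% cap, and the sample action counts are illustrative assumptions; only the squared ratio and the second-best-of-ten baseline come from the benchmark's description.

```python
def human_baseline(first_time_action_counts):
    """Baseline = second-best result (second-lowest action count) from ten first-time players."""
    return sorted(first_time_action_counts)[1]

def rhae(human_actions, ai_actions):
    """Relative Human Action Efficiency for one solved environment, as a fraction.
    Capped at 100% here as an assumption; an unsolved environment would simply score 0."""
    return min(1.0, (human_actions / ai_actions) ** 2)

# Worked example from the article: a human needs 10 moves, the AI stumbles through 100.
print(f"{rhae(10, 100):.2%}")   # 1.00%

# Hypothetical baseline: ten first-time players' action counts; the second-lowest sets the bar.
players = [8, 10, 11, 12, 12, 13, 15, 17, 20, 42]
print(human_baseline(players))  # 10
```

The squaring is what makes the penalty so steep: taking twice as many actions as the human baseline cuts the score to a quarter, and taking ten times as many cuts it to a hundredth.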
Why Even the Best AI Systems Fail
The core issue is not processing power or knowledge — frontier AI models have both in abundance. The problem is adaptability in genuinely unfamiliar situations.
Today's AI, including GPT-5.4 and Claude Opus 4.6, excels at tasks resembling patterns in its training data. Writing emails, coding, summarizing documents, answering questions — these all have strong precedents. But ARC-AGI-3 deliberately removes this advantage by placing AI in environments with zero precedent.
There is a revealing data point from the research teams: when engineers built heavily custom “harnesses” (specialized wrappers designed specifically to target certain ARC-AGI-3 environments), Claude Opus 4.6 could be pushed to 97.1% on one specific environment. That same harness scored 0% on a different environment. The engineering solved one puzzle — it learned nothing about how to approach new ones. This is the exact limitation ARC-AGI-3 is designed to expose.
The difference being measured here: task-specific optimization (what current AI does extremely well) versus general adaptive intelligence (what humans do naturally in any new situation). ARC-AGI-3 only tests the second one. And no AI has come close.
The $2 Million Prize — And Why It Sits Unclaimed
ARC Prize 2026 is running on Kaggle (the popular data science and AI competition platform) with a total prize pool of over $2 million, structured as follows:
- $700,000 grand prize: Any AI agent that matches untrained human performance on the hidden private evaluation set
- Additional prizes for open-source contributions, partial progress, and ARC-AGI-2 solutions
- All winning solutions must be published as open-source code — no proprietary black-box wins allowed
The competition is open to everyone: individual researchers, university labs, and commercial AI companies. OpenAI, Google DeepMind, and Anthropic are all eligible to enter. Because winning solutions must be open-sourced, any breakthrough in general AI reasoning becomes public knowledge rather than corporate property.
You can also play the games yourself — for free, in your browser. The ARC-AGI-3 public game set at arcprize.org lets anyone experience the benchmark firsthand. It is genuinely fascinating: try the environments that every top AI has failed, and notice how quickly your human brain figures them out.
What This Reveals About AI You Use Every Day
ARC-AGI-3 will not change the capabilities of ChatGPT, Claude, Gemini, or Copilot for everyday tasks — those tools remain just as useful as they were before the benchmark launched. But understanding what it reveals helps you use AI more effectively:
- AI is a pattern engine, not a thinker. It excels when your problem resembles something in its training data, and struggles in genuinely novel territory.
- Confident answers are not always correct answers. AI sounds certain even in unfamiliar situations — which is exactly where it is most likely to fail.
- Human oversight remains critical in any new, complex, or high-stakes situation. AI is a powerful starting point, not the final authority.
The good news: ARC-AGI-3 signals precisely where the next major AI breakthrough needs to come from. Several research teams are actively building systems that can learn new rules on the fly — not just pattern-match against training data. When that capability arrives, the AI tools used for everyday work will become dramatically more powerful.
Until then, the $700,000 prize sits unclaimed. The 135 environments remain unsolved by machines. And any random person with a laptop could outperform the world's most advanced AI systems at every single one of them.
The full ARC-AGI-3 launch post and live leaderboard are available at arcprize.org for anyone who wants to track progress or submit an AI agent.