AI for Automation
2026-05-04 · AI research · benchmarks · Anthropic · autonomous AI · Jack Clark

Claude at 93.9%: Jack Clark says AI builds AI by 2028.

Jack Clark puts 60%+ odds on AI automating its own research by 2028, backed by benchmarks that jumped from 2% to 93.9% in 3 years.


Three years ago, no AI system could reliably fix a bug in a production codebase. Today, Claude Mythos Preview scores 93.9% on SWE-Bench (a standardized test that measures how well an AI can resolve real GitHub issues by writing code) — up from Claude 2's 2% in late 2023. That roughly 47-fold leap just pushed Anthropic's co-founder past a threshold he had been resisting for months: he now assigns a 60%+ probability to AI systems that design and train their own successors arriving before the end of 2028.

Jack Clark, who co-founded Anthropic and publishes the weekly Import AI newsletter (now at issue 455, free to read at jack-clark.net), describes this as a "reluctant view" — not a milestone to celebrate, but an alarm worth sounding. His evidence is drawn almost entirely from public benchmarks, making his argument unusually verifiable for anyone willing to look at the data.

Every Benchmark Points the Same Direction

Clark's case rests on a mosaic of measurements, each with known flaws, all accelerating simultaneously. The three most significant data series:

  • SWE-Bench (the software engineering test — an AI is given a real open-source GitHub issue and must write code that resolves it): 2% in late 2023 with Claude 2 → 93.9% in May 2026 with Claude Mythos Preview. Clark calls this result "effectively saturating the benchmark."
  • CORE-Bench (research reproducibility — an AI is given a published scientific paper and must replicate its core experiments from scratch, using only the paper and available data): 21.5% with GPT-4o in September 2024 → 95.5% with Opus 4.5 in December 2025. A 4.4× improvement in just 15 months.
  • MLE-Bench (machine learning competition performance — AI attempts 75 real Kaggle data science problems that were previously solved only by human contestants): 16.9% with OpenAI o1 in October 2024 → 64.4% with Gemini 3 in February 2026.
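The fold-improvements in the list above are easy to verify from the article's own figures. A quick sketch (all numbers taken directly from the list; no outside data):

```python
# Benchmark scores quoted in this article: (starting score %, latest score %).
benchmarks = {
    "SWE-Bench":  (2.0, 93.9),   # late 2023 (Claude 2) -> May 2026 (Mythos Preview)
    "CORE-Bench": (21.5, 95.5),  # Sep 2024 (GPT-4o)    -> Dec 2025 (Opus 4.5)
    "MLE-Bench":  (16.9, 64.4),  # Oct 2024 (o1)        -> Feb 2026 (Gemini 3)
}

for name, (start, end) in benchmarks.items():
    print(f"{name}: {start}% -> {end}% ({end / start:.1f}x)")
```

The ratios come out near 47×, 4.4×, and 3.8×, matching the improvements cited in the text.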

Clark is careful about the limits of benchmarks. He notes that roughly 6% of ImageNet's validation labels (the massive labeled image collection used to train and evaluate most modern visual AI) are incorrect or ambiguous — so even a 99% accuracy score does not mean perfect perception. Every benchmark has similar blind spots. But when three independent evaluations covering code, science reproduction, and data competitions all show the same steep upward curve over the same short window, the aggregate signal is hard to explain away.

AI benchmark progression — from 2% to 93.9% over three years of rapid development

The Task Horizon: From 30 Seconds to 12 Hours — and Climbing

Benchmark percentages measure accuracy on specific tasks. METR (an AI safety evaluation organization that specializes in measuring how long and how complex a task an AI can complete without any human help) tracks something different: the maximum sustained autonomy window. Their data shows a striking progression:

  • 2022 — GPT-3.5: 30-second tasks
  • Early 2024: 4-minute tasks
  • Mid-2024: 40-minute tasks
  • Late 2025: 6-hour tasks
  • 2026 — Opus 4.6: 12-hour tasks
  • End of 2026 (projected by Ajeya Cotra, METR): approximately 100-hour tasks
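A back-of-envelope sketch of what the METR series above implies, using only the two endpoints (the intermediate points are not needed for the growth rate):

```python
import math

# Endpoints of the METR task-horizon series quoted above.
start_seconds = 30        # 2022, GPT-3.5: 30-second tasks
end_seconds = 12 * 3600   # 2026, Opus 4.6: 12-hour tasks
years = 4

expansion = end_seconds / start_seconds        # total growth factor
doublings = math.log2(expansion)               # how many doublings that is
months_per_doubling = years * 12 / doublings   # implied doubling time

print(f"{expansion:.0f}x expansion over {years} years")
print(f"{doublings:.1f} doublings, roughly one every {months_per_doubling:.1f} months")
```

This reproduces the 1,440× figure discussed below and implies the autonomy window has been doubling roughly every four and a half months.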

The move from 30 seconds to 12 hours over four years represents a 1,440× expansion in reliable autonomous task length. If the 100-hour projection holds, AI systems would be capable of completing the equivalent of a full work-week of focused technical work — design, implementation, testing, and iteration — without a single human checkpoint. Clark treats this trajectory as conservative, not optimistic.

Training Speedups: The Number That Closes the Loop

Perhaps the most consequential — and least discussed — data series is how quickly AI can now accelerate AI training itself. When a machine learning engineer optimizes a training run (adjusting the parameters that govern how efficiently a model learns from data — things like learning rate schedules, batch sizes, and memory usage), a skilled human achieves roughly a 4× speedup in 4 to 8 hours of hands-on work.
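To make concrete what "optimizing a training run" means, here is a minimal, purely illustrative sketch of one such knob: a learning-rate schedule with linear warmup and cosine decay. The specific values (peak rate, step counts) are hypothetical, not taken from any lab's actual recipe:

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=1000):
    """Learning rate at a given step: linear warmup, then cosine decay.

    All values are illustrative; real recipes tune schedules like this
    jointly with batch sizes, memory layout, and much else.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # ramp up linearly from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # decay toward 0

print(lr_at_step(500, 10_000))     # mid-warmup: half the peak rate
print(lr_at_step(1_000, 10_000))   # peak: 3e-4
print(lr_at_step(10_000, 10_000))  # end of training: ~0
```

Adjusting knobs like this is part of the 4-to-8-hour job the article describes; the speedup figures that follow measure how well AI systems now do that same job.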

AI systems crossed that human baseline and kept accelerating sharply:

  • May 2025 — Claude Opus 4: 2.9× speedup (below human performance)
  • November 2025 — Claude Opus 4.5: 16.5× speedup (4× above human)
  • February 2026 — Claude Opus 4.6: 30× speedup
  • April 2026 — Claude Mythos Preview: 52× speedup
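The gap these figures imply against the roughly 4× human baseline can be read off directly (numbers from the list above):

```python
# Training-optimization speedups quoted above, vs. the ~4x human baseline.
human_speedup = 4.0
ai_speedups = {
    "Opus 4 (May 2025)":          2.9,
    "Opus 4.5 (Nov 2025)":       16.5,
    "Opus 4.6 (Feb 2026)":       30.0,
    "Mythos Preview (Apr 2026)": 52.0,
}

for model, speedup in ai_speedups.items():
    print(f"{model}: {speedup}x ({speedup / human_speedup:.1f}x the human baseline)")
```

The last line yields the 13-fold gap discussed in the next paragraph.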

The recursive implication is what concerns Clark most. AI that optimizes AI training makes the next model train faster and cheaper. That model is then better at training optimization — and so on. At 52× versus a human's 4×, the gap is already 13-fold. Clark estimates a working proof-of-concept where "a model end-to-end trains its own successor" arrives within 1 to 2 years, before full autonomous R&D. That intermediate milestone is the technical threshold that makes planning for everything downstream coherent.

Research Automation Already Running in Production

Clark doesn't rely on projections alone. Three production-level signals are already visible in 2026:

Multi-agent pipelines: Claude Code and OpenCode (AI coding environments — software where AI models plan, write, test, and review work by managing other AI sub-agents in coordinated chains) now run layered agent workflows in live production codebases. Clark writes: "The vast majority of people I meet at frontier labs and around Silicon Valley now code entirely through AI systems." The engineer's role is shifting from author to approver.

GPU kernel automation: GPU kernels (the low-level code instructions that control how fast a model trains on a graphics chip — the difference between a 2-hour and a 12-hour training run on identical hardware) have historically required engineers with deep chip-level specialization. Meta is now using large language models to generate Triton kernel code (the programming syntax for writing these hardware-level instructions). Clark describes this moving "from curiosity to competitive research area."

Automated Alignment Research proof-of-concept: Anthropic ran an internal experiment where AI agents tackled AI safety research problems — the kind that requires not just execution skill but judgment about what problems are worth investigating. The agents beat human-designed baselines. Clark is explicit about the limits: "small scale and doesn't (yet) generalize to a production model." But the directional finding is confirmed: AI agents can exceed human researchers on judgment-intensive safety tasks at small scale.


The Gap That Still Exists — and the Trajectory That Closes It

Clark is not arguing that AI is already a better researcher than humans. Post-training models (AI systems evaluated on specialized research task suites after their primary training completes) currently score 25–28% on research capability benchmarks, against a human researcher baseline of 51%. A meaningful gap remains.

The specific deficit is creative direction. Clark says explicitly: "AI cannot yet invent radical new ideas." The current generation excels at what he calls "brick-by-brick" work — iterating within established methods, reproducing existing experiments faster, optimizing known frameworks. Recognizing that a field needs a new paradigm, generating a genuinely novel hypothesis, or charting a research direction no one has attempted — those capabilities remain unconfirmed and possibly out of reach for the current architecture.

Whether that gap closes is the pivotal uncertainty. The pessimistic reading is that the creativity deficit caps AI research autonomy below the threshold that matters. The realistic reading is that the gap from 25% to 51% follows roughly the same curve as SWE-Bench (2% to 93.9%), CORE-Bench (21.5% to 95.5%), and MLE-Bench — curves that have historically closed a comparable performance gap within 12 to 18 months of the benchmark being properly defined.
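One rough way to sanity-check the 12-to-18-month reading, using only figures from this article (the framing is mine, not Clark's): assume research-capability scores compound at the same monthly rate CORE-Bench did.

```python
import math

# CORE-Bench went from 21.5% to 95.5% in 15 months (figures quoted above).
monthly_growth = (95.5 / 21.5) ** (1 / 15)   # compounded gain per month, ~1.10x

# Closing the research gap means lifting ~26.5% (midpoint of 25-28%) to the
# 51% human-researcher baseline, i.e. roughly doubling the score.
months_needed = math.log(51 / 26.5) / math.log(monthly_growth)
print(f"~{months_needed:.0f} months at the CORE-Bench growth rate")
```

At that rate the gap closes in well under a year, so the 12-to-18-month figure is, if anything, conservative, though benchmark growth rarely stays this clean as scores approach saturation.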

Clark's Warning — in His Own Words

Clark has published 455 issues of Import AI, always in a measured, evidence-first tone. Issue 455 reads differently. Here is the passage that makes the piece unusual:

"I now believe we are living in the time that AI research will be end-to-end automated. If that happens, we will cross a Rubicon into a nearly-impossible-to-forecast future. I don't know how to wrap my head around it. It's a reluctant view because the implications are so large that I feel dwarfed by them, and I'm not sure society is ready for the kinds of changes implied by achieving automated AI R&D."

— Jack Clark, Import AI Issue #455, May 2026

He plans to spend most of 2026 working through the implications publicly — not from a position of certainty, but because someone with access to the data and the credibility to be heard needs to say it clearly. The 60% probability he assigned is not deterministic. But it is his honest estimate, grounded in public numbers, from someone who co-built one of the most careful AI labs in the world.

For developers, researchers, and team leads using AI tools today, the practical question is not whether to panic — it is whether the capability curve you are already riding has a steeper destination than your roadmap accounts for. The benchmarks Clark tracks are public and free to verify. The Import AI newsletter is free at jack-clark.net. If benchmark terminology is unfamiliar, start with the AI automation fundamentals guide — understanding what SWE-Bench and CORE-Bench actually measure is the minimum context needed to follow the argument Clark is making. The 60% is not a certainty, but the past three years of benchmark data make it the most defensible planning horizon available right now.
