2026-03-24 | Claude Code | autoresearch | AI research | machine learning | Karpathy

He gave Claude Code an old research project — it found the bug he'd missed

A machine learning researcher used Karpathy's autoresearch framework with Claude Code to revive a shelved project. The AI ran 42 experiments in one Saturday and found a critical parameter bug the human never caught, cutting the model's mean rank (a lower-is-better retrieval metric) by 54%.

A machine learning researcher has shown that an AI agent can do more than speed up research: it can also catch mistakes a human overlooked. Yogesh Kumar gave Claude Code his old, shelved research project and let it run largely autonomously for a Saturday. The result: 42 experiments, a 54% reduction in mean rank (the project's lower-is-better retrieval metric), and the discovery of a critical bug the researcher had missed the entire time.

The experiment used Andrej Karpathy's autoresearch framework, a tool that lets AI agents run training experiments on their own. It already has more than 52,000 GitHub stars, and Kumar's blog post shows what happens when a regular researcher (not a famous AI scientist) tries it on a real project.

The autoresearch agent loop: hypothesize, edit, train, evaluate, commit or revert, repeat

The setup: one GPU, one Saturday, zero babysitting

Kumar's project was called eCLIP — a model that learns to match images with text descriptions. Think of it like teaching a computer to look at a painting and understand what it depicts. He'd built it for analyzing 11,000 Japanese woodblock prints (ukiyo-e art) with text descriptions, but the results had plateaued and the project sat on a shelf.

The autoresearch setup was deliberately simple:

Hardware: One NVIDIA RTX 4090 GPU (a gaming-grade card)
Time per experiment: ~3 minutes
Total experiments: 42 runs (13 kept, 29 thrown out)
Human intervention: Nearly zero for the first 90%
Agent: Claude Code, restricted to editing a single training file

The agent followed a strict loop: form a hypothesis → change one thing → train → measure → keep or discard → repeat. It could only modify one file (train.py) and run one script. No internet access, no installing packages, no pushing code. A tightly sandboxed AI researcher.
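The keep-or-discard rule at the heart of that loop can be sketched in a few lines of Python. This is a minimal illustration, not code from autoresearch or from Kumar's project: `run_loop`, `LoopState`, and the toy metric table are hypothetical names and made-up values, and `train_and_eval` stands in for a real training run.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopState:
    best_metric: float                       # lower is better (e.g. mean rank)
    kept: List[str] = field(default_factory=list)
    discarded: List[str] = field(default_factory=list)


def run_loop(
    baseline: float,
    proposals: List[str],
    train_and_eval: Callable[[str], float],
) -> LoopState:
    """Keep a proposed change only if it lowers the metric; otherwise discard it."""
    state = LoopState(best_metric=baseline)
    for change in proposals:
        metric = train_and_eval(change)      # stands in for one short training run
        if metric < state.best_metric:
            state.best_metric = metric       # "commit": this result is the new baseline
            state.kept.append(change)
        else:
            state.discarded.append(change)   # "revert": go back to the last kept version
    return state


# Toy stand-in for training: a fixed table of made-up metric values.
toy_results = {"change_a": 95.0, "change_b": 97.0, "change_c": 80.0}
state = run_loop(100.0, list(toy_results), toy_results.__getitem__)
print(state.kept, state.discarded, state.best_metric)
```

Because every change is judged against the best result so far, a change that would have beaten the original baseline can still be discarded if an earlier change already did better, as `change_b` is here.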

The breakthrough: Claude found a 2-year-old mistake

This is where the story gets interesting. The single biggest improvement, a 113-point drop in mean rank (the accuracy metric, where lower is better), came from Claude discovering that Kumar had accidentally clamped a "temperature" parameter (a number that controls how confidently the model makes predictions) at the wrong value.

Kumar had hard-coded a limit of 2 on this parameter years ago. Claude removed the restriction, and the model's performance immediately jumped. This one fix accounted for more improvement than all other changes combined.
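To see why a cap like this can hurt, here is a rough sketch of one plausible mechanism. It assumes the clamped parameter acts as a multiplier on image-text similarity scores before a softmax, as in CLIP-style contrastive training; the article does not say exactly how the parameter enters Kumar's loss, and the similarity values and the uncapped scale of 20 are made up for illustration.

```python
import math
from typing import List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def match_probabilities(
    similarities: List[float],
    scale: float,
    max_scale: Optional[float] = None,
) -> List[float]:
    """Scale similarity scores, optionally capping the scale, then apply softmax."""
    if max_scale is not None:
        scale = min(scale, max_scale)        # the hard-coded cap (assumed form)
    return softmax([scale * s for s in similarities])


# Hypothetical cosine similarities; index 0 is the true image-text match.
sims = [0.9, 0.5, 0.4, 0.1]
capped = match_probabilities(sims, scale=20.0, max_scale=2.0)
uncapped = match_probabilities(sims, scale=20.0)
print(f"capped:   p(true match) = {capped[0]:.3f}")
print(f"uncapped: p(true match) = {uncapped[0]:.3f}")
```

With the cap in place the model cannot put much more than half of its probability on the correct match in this toy case, even though that match has clearly the highest similarity. Without the cap it can be almost certain.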

Progress chart showing the sharp improvement after Claude fixed the temperature parameter bug

As Kumar tells it, the AI didn't just optimize his code; it found a mistake he'd been blind to. The kind of thing a fresh pair of eyes catches immediately, except those eyes belonged to an AI running at 3 AM.

The final results tell the whole story

Before autoresearch:

Mean Rank: 344.68 (lower is better — measures how accurately the model matches images to text)
Top-1 accuracy: ~17%
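For readers unfamiliar with these metrics, the sketch below shows how mean rank and top-k accuracy are commonly computed for image-to-text retrieval. It assumes the standard definitions; Kumar's actual evaluation code is not shown in the article, and the function names and the small similarity matrix are hypothetical.

```python
from typing import List


def ranks_of_true_match(sim: List[List[float]]) -> List[int]:
    """sim[i][j] is the similarity of image i to text j; text i is the true match for image i."""
    ranks = []
    for i, row in enumerate(sim):
        better = sum(1 for score in row if score > row[i])
        ranks.append(better + 1)             # rank 1 means the true text scored highest
    return ranks


def mean_rank(ranks: List[int]) -> float:
    """Average position of the true match (lower is better)."""
    return sum(ranks) / len(ranks)


def top_k_accuracy(ranks: List[int], k: int) -> float:
    """Fraction of queries whose true match lands in the top k (higher is better)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


sim = [
    [0.9, 0.2, 0.1],   # image 0: true text is ranked 1st
    [0.8, 0.6, 0.7],   # image 1: true text (0.6) is ranked 3rd
    [0.3, 0.5, 0.4],   # image 2: true text (0.4) is ranked 2nd
]
ranks = ranks_of_true_match(sim)
print(ranks, mean_rank(ranks), top_k_accuracy(ranks, 1), top_k_accuracy(ranks, 2))
```

Mean rank is sensitive to every query, including the badly ranked ones, while top-k accuracy only counts whether the true match made the cut.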

After 42 experiments:

Mean Rank: 157.43 — a 54% improvement
Top-5 accuracy: 53% (about 3x the ~17% Top-1 figure above, though Top-1 and Top-5 are different metrics and not directly comparable)
Full dataset test: Mean Rank dropped to just 34.30
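The headline percentage follows directly from the two mean rank figures. The short calculation below uses only numbers quoted in this article; the 113-point figure for the temperature fix is treated as rounded.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a lower-is-better metric fell."""
    return (before - after) / before


mean_rank_before = 344.68
mean_rank_after = 157.43
clamp_fix_gain = 113.0   # mean rank points attributed to the temperature fix (rounded)

total_gain = mean_rank_before - mean_rank_after
reduction = relative_reduction(mean_rank_before, mean_rank_after)
clamp_share = clamp_fix_gain / total_gain

print(f"total drop: {total_gain:.2f} points")
print(f"relative reduction: {reduction:.1%}")
print(f"share from the clamp fix: {clamp_share:.1%}")
```

The drop is 187.25 points, or about 54.3%, and the temperature fix alone accounts for roughly 60% of it, which is consistent with the claim that it outweighed all the other changes combined.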

What worked — and where AI still struggles

Kumar broke the experiment into phases, and the pattern is instructive for anyone thinking about using AI for their own projects:

Phase 1-2: Hyperparameter tuning — Smooth. The agent methodically tested different learning rates, batch sizes, and projection dimensions. Steady improvements with minimal human input.

Phase 3: The bug fix — The single biggest win. Claude identified the clamped temperature parameter and removed the artificial constraint. −113 mean rank in one change.

Phase 4-5: Architectural changes and "moonshot" ideas — Diminishing returns. The agent tried attention mechanisms, different loss functions, and web-searched ideas. Most experiments were reverted. Claude occasionally forgot its sandbox restrictions and tried to run unauthorized commands.

Kumar's key takeaway: "The first 90% of the work was super smooth and barely needed my intervention. The last 10% was a slog." This matches a pattern many AI-assisted developers report — AI excels at well-defined optimization but struggles with genuinely creative leaps.
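The hyperparameter phase is the part of this workflow that is easiest to picture in code. The sketch below generates "change one thing" variants of a base configuration, which is the kind of well-defined search the agent handled smoothly. The parameter names echo the ones mentioned above, but the specific values, the `single_change_variants` helper, and the config layout are hypothetical rather than taken from Kumar's train.py.

```python
from typing import Any, Dict, Iterator, List, Tuple


def single_change_variants(
    base: Dict[str, Any],
    options: Dict[str, List[Any]],
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (changed_key, config) pairs that differ from the base in exactly one setting."""
    for name, values in options.items():
        for value in values:
            if value == base[name]:
                continue                     # skip the value the base already uses
            variant = dict(base)             # copy so the base config is never mutated
            variant[name] = value
            yield name, variant


# Hypothetical starting point and candidate values.
base_config = {"learning_rate": 1e-4, "batch_size": 64, "projection_dim": 256}
search_options = {
    "learning_rate": [3e-5, 1e-4, 3e-4],
    "batch_size": [32, 64, 128],
    "projection_dim": [128, 256, 512],
}

variants = list(single_change_variants(base_config, search_options))
for changed, config in variants:
    print(changed, config)
```

Changing one setting per run keeps each experiment attributable: if the metric moves, there is only one candidate cause.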

Why this matters beyond one researcher's weekend

Karpathy's autoresearch framework has 52,200 GitHub stars, but most coverage focuses on large-scale, well-funded experiments. Kumar's story is different: it's a solo researcher, one consumer GPU, one Saturday. No cloud costs. No team.

The implications are significant:

Old projects get a second life — shelved research can be revived with AI doing the grunt work of parameter tuning
AI as a code reviewer — Claude didn't just optimize; it found a human error that had been hiding in plain sight
Accessible to anyone — you don't need a cluster of GPUs or a $309 cloud bill. A gaming GPU and one Saturday is enough

Heatmap visualization showing how the eCLIP model focuses on specific regions of Japanese woodblock prints

Try autoresearch on your own project

If you have an NVIDIA GPU and a Python project with a clear evaluation metric, you can try this yourself:

# Install Karpathy's autoresearch framework
git clone https://github.com/karpathy/autoresearch
cd autoresearch

# Install dependencies
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

# Prepare data and run a single test experiment
uv run prepare.py
uv run train.py

Then point Claude Code at the program.md file and let it iterate. The key constraint: make sure your agent can only edit one file and has a clear metric to optimize. Kumar's experience shows that tight sandboxing is essential — without it, the agent wanders off-track.
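One simple way to enforce the single-file constraint is to check which files changed after each agent step and reject anything outside an allowlist. The sketch below is an assumption about how such a guard could look, not part of autoresearch: `disallowed_changes` and `ALLOWED_FILES` are hypothetical names, and in practice the list of changed paths might come from a command such as `git diff --name-only`.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set

# Hypothetical allowlist: the only file the agent may modify.
ALLOWED_FILES: Set[str] = {"train.py"}


def disallowed_changes(
    changed_paths: Iterable[str],
    allowed: Set[str] = ALLOWED_FILES,
) -> List[str]:
    """Return the changed paths that fall outside the allowlist, sorted for stable output."""
    normalized = [PurePosixPath(p).as_posix() for p in changed_paths]
    return sorted(p for p in normalized if p not in allowed)


# Example: paths as they might appear in a list of changed files.
violations = disallowed_changes(["train.py", "prepare.py", "data/labels.csv"])
if violations:
    print("Reject this step; files outside the sandbox were modified:", violations)
else:
    print("OK: only allowed files changed.")
```

A check like this only covers file edits. Restricting which commands the agent may run, as Kumar's setup did, needs a separate control.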

Kumar's full blog post and code are available on his website and the eCLIP GitHub repository.
