AI for Automation
2026-03-23 | AI testing, Claude, GPT, AI limitations, physics, experiment

He tested 7 AIs with a cup of coffee — all got it wrong

A blogger poured boiling water into a mug and asked 7 AI models to predict the temperature. Claude came closest, but none nailed it — here's what that tells us about AI.


Can the world's smartest AIs predict how fast a cup of coffee cools down? A blogger at Dynomight decided to find out — and the answer is humbling.

The experiment was dead simple: pour 8 oz of boiling water into a ceramic mug sitting at room temperature (20°C), then measure the actual temperature every few seconds for an hour. Before doing that, ask seven leading AI models to predict what would happen.


Seven AIs, Seven Wrong Answers

Every model received the same prompt: given the water, mug, and room specs, write an equation that predicts the water temperature over time. Here is the lineup and what each prediction cost:

Claude 4.6 Opus — $0.61 per prediction — closest to reality
GPT 5.4 — $0.11 per prediction
Gemini 3.1 Pro — $0.09 per prediction
Kimi K2.5 — $0.01 per prediction
Qwen3-235B — $0.009 per prediction — cheapest
GLM-4.7 — $0.03 per prediction
DeepSeek & Grok — both refused to answer

All the models that responded produced equations with exponential decay (a math pattern where something drops fast at first, then slows down — like how a hot drink cools quickly in the first few minutes, then barely changes). That part was right. The details? Not so much.
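For readers who want to see that shape, here is a minimal sketch of single-exponential cooling (Newton's law of cooling). The 100°C start and 20°C room come from the experiment setup; the time constant `tau_min` of 30 minutes is an assumed value chosen only for illustration, not a figure from any model's answer or from the blog.

```python
import math

def newton_cooling(t_min, t_start=100.0, t_room=20.0, tau_min=30.0):
    """Temperature in °C after t_min minutes, using one exponential term.

    tau_min is an assumed time constant; a real mug would need it fitted.
    """
    return t_room + (t_start - t_room) * math.exp(-t_min / tau_min)

# Print the curve at a few times to show the fast-then-slow drop.
for t in (0, 1, 5, 15, 30, 60):
    print(f"{t:>2} min: {newton_cooling(t):5.1f} °C")
```

With any positive time constant, the drop over the first five minutes is larger than the drop over any later five-minute window, which is the "fast at first, then slower" pattern every responding model got right.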

Where the Real Mug Hit the Real Thermometer

When the blogger actually ran the experiment with a digital thermometer, the measured curve matched none of the predictions.

Chart comparing LLM predictions vs actual coffee cooling over 60 minutes

The water cooled much faster in the first minute than any model predicted: the cold ceramic mug was acting like a heat sponge, pulling warmth out of the water on contact. After that, it cooled much more slowly than predicted, as the now-warm mug stopped soaking up heat and started insulating the water.

Zoomed chart of first 5 minutes — LLM predictions vs reality

In short: none of the AIs captured what your hand already knows — that the mug itself changes the game.
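One rough way to picture the mug's effect is to add a second, faster exponential term on top of the slow one. The sketch below does that. Every number in it apart from the 100°C start and 20°C room (the size of the fast drop and both time constants) is a made-up illustration value, not a measurement from the experiment and not the equation any model produced.

```python
import math

def two_stage_cooling(t_min, t_start=100.0, t_room=20.0,
                      fast_drop=15.0, tau_fast=0.5, tau_slow=40.0):
    """Temperature in °C after t_min minutes, using two exponential terms.

    fast_drop: degrees lost quickly to the cold mug (assumed value).
    tau_fast:  time constant in minutes of that quick loss (assumed value).
    tau_slow:  time constant in minutes of the slow loss to the room (assumed value).
    """
    slow_drop = (t_start - t_room) - fast_drop
    return (t_room
            + fast_drop * math.exp(-t_min / tau_fast)
            + slow_drop * math.exp(-t_min / tau_slow))

# The first minute loses far more heat than any later minute.
for t in (0, 1, 2, 5, 10, 30, 60):
    print(f"{t:>2} min: {two_stage_cooling(t):5.1f} °C")
```

The fast term dies out within a couple of minutes, leaving only the slow term, which gives the steep early plunge followed by a gentler slope.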

Claude Won, but It Wasn't Cheap

Claude 4.6 Opus came closest to reality, recognizing that two processes run at different rates at the same time (the mug warming up and the water cooling down). But it cost about 68 times as much as the cheapest model (Qwen3-235B, at under a penny).
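The price gap is simple arithmetic on the per-prediction costs listed earlier. A small sketch, using only those listed prices:

```python
# Per-prediction costs in US dollars, as listed in the lineup above.
prices = {
    "Claude 4.6 Opus": 0.61,
    "GPT 5.4": 0.11,
    "Gemini 3.1 Pro": 0.09,
    "Kimi K2.5": 0.01,
    "Qwen3-235B": 0.009,
    "GLM-4.7": 0.03,
}

# Find the cheapest model, then express every price as a multiple of it.
cheapest = min(prices, key=prices.get)
ratios = {name: price / prices[cheapest] for name, price in prices.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:<16} ${prices[name]:.3f}  {ratio:5.1f}x the cheapest")
```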

As the author put it: "They may take our math, but they'll somewhat more slowly take our fine motor control."

The Hacker News discussion (99 points, 40 comments) was split. Some engineers pointed out this is a solved physics problem with century-old equations — the surprise isn't that LLMs can approximate it, but that none perfectly matched textbook solutions they were surely trained on. Others noted the impressive part: the best models recognized two separate timescales in the cooling, suggesting they grasped deeper thermal physics concepts.

What This Actually Tells Us About AI

This experiment captures something important about where AI stands in early 2026:

AI is great at patterns, shaky on physics. Every model knew the general shape of the answer (temperature drops exponentially). None captured the messy real-world details — the cold mug, evaporation from the surface, convection currents in the water.

Price doesn't guarantee accuracy. Claude's $0.61 answer was best, but Kimi's $0.01 answer wasn't dramatically worse. The cheapest model (Qwen3 at nine-tenths of a cent) gave a simpler equation that was still in the ballpark.

Two AIs just gave up. DeepSeek and Grok refused to even attempt the problem — a reminder that AI models have very different comfort zones.

Try It Yourself

This is one of those experiments anyone can replicate at home. Boil water, pour it into a mug, and ask your AI of choice to predict the temperature after 5 minutes. Then grab a thermometer and check. You'll likely find the same thing: AI gets the trend right but misses the details that make reality messy.
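If you log several readings rather than one, a few lines of Python can score the prediction. This is only a sketch: `mean_abs_error`, `my_readings`, and `ai_prediction` are hypothetical names, and the equation inside `ai_prediction` is a placeholder (with an assumed 30-minute time constant) to swap for whatever your AI actually gives you.

```python
import math

def mean_abs_error(readings, predict):
    """Average absolute gap in °C between thermometer readings and a prediction.

    readings: list of (minutes_elapsed, measured_celsius) pairs.
    predict:  function mapping minutes elapsed to a predicted temperature in °C.
    """
    if not readings:
        raise ValueError("need at least one reading")
    return sum(abs(temp - predict(t)) for t, temp in readings) / len(readings)

def ai_prediction(t_min):
    # Placeholder equation: replace with the one your AI produced.
    return 20.0 + 80.0 * math.exp(-t_min / 30.0)

# Fill this in with your own (minutes, °C) thermometer readings.
my_readings = []

if my_readings:
    error = mean_abs_error(my_readings, ai_prediction)
    print(f"Average miss: {error:.1f} °C")
```

Mean absolute error is just one reasonable scoring choice; it reads directly as "how many degrees off, on average".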

The full experiment, charts, and equations are on Dynomight's blog.
