AI for Automation
2026-03-24 · AI on device · iPhone · local AI · ANEMLL · privacy · Apple Neural Engine

Someone just ran a 400B AI model on an iPhone 17 Pro

A developer demonstrated a 397-billion-parameter AI model running entirely on an iPhone 17 Pro using SSD streaming — no cloud, no internet. The catch: 0.6 tokens per second.


A developer just proved that a 397-billion-parameter AI model can run entirely on an iPhone 17 Pro — no cloud, no internet connection, completely private. The demo, shared by the open-source project ANEMLL (pronounced "animal"), hit the front page of Hacker News and sparked a debate about whether edge AI (running AI directly on your device instead of sending data to remote servers) is ready for prime time.

[Image: iPhone 17 Pro running a 400B parameter AI model locally]

A 400B brain inside 12 GB of RAM

The model used is Qwen3.5-397B-A17B — a Mixture of Experts (MoE) model. Think of MoE like a company with a huge roster of specialists where only the relevant team shows up for each task: the model has 397 billion parameters in total, but only about 17 billion are active for any given token. That's the trick: the phone never needs all 397 billion parameters at once.
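The routing idea can be sketched in a few lines. This is a toy illustration, not ANEMLL's implementation or Qwen3.5's actual architecture — the expert count, gating scheme, and dimensions here are all made up for clarity:

```python
# Toy Mixture-of-Experts router: a gate scores the experts, and only the
# top-k experts' parameters are used for a given token.
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # real MoE models use far more
TOP_K = 2         # experts activated per token
DIM = 16          # toy hidden size

# Each expert is a small feed-forward matrix; a gating matrix scores them.
experts = [rng.normal(size=(DIM, DIM)) for _ in range(NUM_EXPERTS)]
gate = rng.normal(size=(DIM, NUM_EXPERTS))

def moe_forward(x):
    """Route token vector x to the top-k experts and mix their outputs."""
    scores = x @ gate                      # one score per expert
    top = np.argsort(scores)[-TOP_K:]      # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()               # softmax over the chosen experts
    # Only the selected experts' parameters are touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

out = moe_forward(rng.normal(size=DIM))
print(out.shape)  # (16,)
```

The payoff is the ratio: every token runs through only TOP_K of NUM_EXPERTS experts, which is why a 397B-parameter model can get away with touching roughly 17B parameters per token.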

Even so, the compressed model requires over 200 GB of storage — far more than the iPhone's 12 GB of RAM can hold. ANEMLL solves this by streaming model weights directly from the phone's SSD to the GPU, loading only the parts it needs for each word it generates. It's like reading a 200-volume encyclopedia by only opening the exact page you need.
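The streaming trick itself is conceptually simple: memory-map the weight file so the OS pages in only the slices each step actually reads. Here is a minimal sketch of that idea — the file layout, names, and layer math are hypothetical, not ANEMLL's actual format:

```python
# Sketch of on-demand weight streaming via memory-mapping. Nothing is
# loaded into RAM until a slice of the mapped file is actually touched.
import numpy as np

DIM = 64
NUM_LAYERS = 4
PATH = "weights.bin"  # hypothetical single-file weight dump

# Write some fake per-layer weight matrices to disk first.
np.random.default_rng(0).normal(
    size=(NUM_LAYERS, DIM, DIM)
).astype(np.float32).tofile(PATH)

# Memory-map the file instead of reading it all into memory.
weights = np.memmap(PATH, dtype=np.float32, mode="r",
                    shape=(NUM_LAYERS, DIM, DIM))

def forward(x):
    """Run x through the layers, paging each layer's weights in on demand."""
    for layer in range(NUM_LAYERS):
        x = np.tanh(x @ weights[layer])  # only this layer's pages are read
    return x

print(forward(np.ones(DIM, dtype=np.float32)).shape)  # (64,)
```

Scale the same pattern to 200+ GB of expert weights on a fast SSD and you get the demo's trade-off: RAM stays small, but every token pays SSD-read latency — hence 0.6 tokens per second.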

The numbers:
Model size: 397 billion parameters (17B active per token)
Speed: 0.6 tokens/second (~1 word every 2 seconds)
RAM used: 12 GB LPDDR5X (iPhone 17 Pro)
Storage needed: 200+ GB on device
Internet required: None — fully offline

Impressive tech, impractical speed

Let's be honest: 0.6 tokens per second is painfully slow. A simple response takes 30+ seconds to appear. For comparison, cloud-based AI like Claude or ChatGPT generates 50-100 tokens per second — roughly 100x faster.
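The arithmetic behind that gap is worth making explicit. Assuming a modest reply of about 100 tokens (roughly 75 words — an assumption, not a figure from the demo):

```python
# Back-of-envelope latency at the article's quoted generation rates.
def seconds_for(tokens, tok_per_sec):
    return tokens / tok_per_sec

REPLY_TOKENS = 100  # assumed length of a short reply

print(round(seconds_for(REPLY_TOKENS, 0.6)))  # 167 s on-device at 0.6 tok/s
print(round(seconds_for(REPLY_TOKENS, 50)))   # 2 s on a 50 tok/s cloud service
```

Nearly three minutes versus two seconds for the same answer — which is the entire "technically stunning, practically useless" debate in two numbers.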

Hacker News commenters didn't hold back. One noted the irony of "billions of calculations producing vague pleasantries," while others referenced Douglas Adams' predictions about computers taking forever to say obvious things. The consensus: technically stunning, practically useless — for now.

But smaller models already run at usable speeds on the same hardware. ANEMLL achieves 47-62 tokens/second on 1B models and about 9 tokens/second on 8B models using Apple's Neural Engine — the dedicated AI chip inside every recent iPhone and Mac.

Why running AI on your phone matters

Speed aside, this demo proves an important principle: your private data never leaves your device. No API calls, no cloud servers, no subscriptions. Every conversation stays on your phone.

For anyone worried about privacy — medical questions, financial planning, personal journaling — local AI is the ultimate solution. And unlike cloud services, it works on airplanes, in tunnels, and in countries with restricted internet.

ANEMLL supports these models at practical speeds on iPhone/Mac:
Meta LLaMA 3.2 (1B, 8B)
Qwen 3 (0.6B to 8B)
Google Gemma 3 (270M to 4B)
DeepSeek R1 (8B distilled)

Try it yourself

ANEMLL is fully open-source and has an iOS/macOS chat app available on TestFlight. If you have a Mac with Apple Silicon:

# Install ANEMLL on Mac
brew install uv
git clone https://github.com/Anemll/Anemll.git
cd Anemll
./create_uv_env.sh
source env-anemll/bin/activate
./install_dependencies.sh

# Test with a small model first
./tests/conv/test_qwen_simple.sh

For iPhone, join the ANEMLL Chat TestFlight beta to try it with smaller, fast models.

The trajectory that matters

Two years ago, running any AI model on a phone was science fiction. Today, 8B models run at conversational speed, and 400B models technically work. If hardware keeps improving at Apple's pace — better Neural Engines, faster SSDs, more RAM — today's party trick becomes tomorrow's default.

The real question isn't whether 0.6 tokens/second is useful. It's whether your next phone will run AI that's actually smarter than cloud services — without ever sending a single byte of your data to anyone.

