Google Gemma 4 Runs Free on Mac with Ollama v0.20.0
On April 2, 2026, Ollama shipped v0.20.0 — and with it, support for Google's brand-new Gemma 4 local LLM, added to the list of things you can run completely free on your own computer. No monthly subscription, no usage fees, and no data leaving your device. That combination of zero cost and full privacy is exactly why local AI automation has been gaining ground on cloud alternatives.
Gemma 4 in Four Sizes — Pick the Right Local LLM for Your Machine
Gemma 4 ships in four distinct configurations, each targeting a different hardware tier. Google designed the smaller variants specifically for edge devices — laptops, desktops with integrated graphics, even single-board computers.
- E2B (Effective 2B) — 2 billion active parameters (the "weight count" of an AI model, which directly determines RAM requirements). Runs comfortably on 8 GB of RAM. Best for quick tasks, low-latency Q&A, and battery-constrained devices.
- E4B (Effective 4B) — 4 billion active parameters. Noticeably better reasoning than E2B, still manageable on a 12–16 GB machine.
- 26B MoE (Mixture of Experts) — 26 billion total parameters, but only ~4 billion activate per inference pass (a routing architecture where the model selects the most relevant "expert" sub-networks per query, delivering 26B quality at 4B compute cost). The sweet spot for most users.
- 31B Dense — 31 billion parameters, all active simultaneously. Maximum quality, requires roughly 20–24 GB of RAM.
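Which variant fits your machine comes down to simple arithmetic: weights dominate memory use, and Ollama's default quantization stores each parameter in roughly half a byte. The sketch below is a back-of-the-envelope estimator, not an official sizing tool — the 0.56 bytes-per-parameter figure (typical of a Q4-class quantization) and the fixed overhead are assumptions, and for the MoE variant all 26B weights must fit in RAM even though only ~4B are active per pass.

```python
def estimate_ram_gb(params_billion: float, bytes_per_param: float = 0.56,
                    overhead_gb: float = 1.5) -> float:
    """Back-of-the-envelope RAM estimate in GiB for a quantized model.

    bytes_per_param ~0.56 approximates a Q4-class quantization (an
    assumption, not Ollama's published figure); overhead_gb covers the
    KV cache and runtime buffers.
    """
    weights_gb = params_billion * 1e9 * bytes_per_param / (1024 ** 3)
    return round(weights_gb + overhead_gb, 1)

# Note: for the MoE variant, compute cost tracks the ~4B ACTIVE
# parameters, but all 26B weights still have to fit in memory.
for name, size in [("e2b", 2), ("e4b", 4), ("26b", 26), ("31b", 31)]:
    print(f"gemma4:{name:4s} ~{estimate_ram_gb(size)} GB")
```

These estimates line up with the tiers above: the E2B variant fits an 8 GB machine with room to spare, while the dense 31B model pushes past 17 GB before any context is loaded — hence the 20–24 GB recommendation.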
To pull and run any variant today, open a terminal:
ollama run gemma4:e2b # Smallest — works on 8 GB RAM
ollama run gemma4:e4b # Mid-range
ollama run gemma4:26b # MoE — best efficiency balance
ollama run gemma4:31b # Largest — needs 24 GB+ RAM
Ollama Audio Input: A Local Feature, Not a Cloud Perk
The headline technical addition in v0.20.0 is support for Gemma 4's audio_tower tensors — the naming convention (the internal labeling system that tells Ollama how to load a model's audio-processing component) that enables multimodal input combining text and audio. Until now, processing voice locally alongside text meant patching model files by hand or reaching for separate tools.
In practical terms: you can feed Gemma 4 an audio file alongside a text prompt, and the model processes both without sending a byte of voice data to an external server. Full documentation on supported audio formats and latency benchmarks is still being finalized by the Ollama team, but the underlying infrastructure is now in place for developers to start building on.
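Since the audio documentation is still being finalized, any client code today is necessarily speculative. The sketch below builds a request body for Ollama's real `/api/generate` endpoint, assuming audio follows the same base64 pattern Ollama already documents for vision models' `images` field — the `audio` field name is a guess, not a confirmed part of the API.

```python
import base64
import json

def build_audio_request(model: str, prompt: str, audio_bytes: bytes) -> str:
    """Build a JSON body for Ollama's /api/generate endpoint.

    ASSUMPTION: the "audio" field mirrors the documented base64 "images"
    field used for vision models; the final field name for audio input
    has not yet been published by the Ollama team.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "audio": [base64.b64encode(audio_bytes).decode("ascii")],
        "stream": False,
    }
    return json.dumps(payload)

# Usage, once the local server is running:
#   body = build_audio_request("gemma4:26b", "Summarize this meeting.",
#                              open("meeting.wav", "rb").read())
#   ...then POST body to http://localhost:11434/api/generate
```

Because everything stays on localhost:11434, the voice recording never leaves the machine — the privacy argument from the intro, applied end to end.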
Apple Silicon MLX Got a Serious Tune-Up
Version 0.19.0 (released March 31, 2026, two days before v0.20.0) made a foundational shift: Ollama's Apple Silicon runner migrated to MLX (Apple's own machine learning framework, designed for the unified memory architecture in M-series chips). Unified memory means the CPU and GPU share one physical RAM pool — so a MacBook with 16 GB can devote most of that pool to AI inference (running the model), instead of splitting it between separate CPU and GPU memory.
v0.20.0 cleaned up three significant issues introduced during that transition:
- KV cache memory leak fixed — A KV cache (key-value cache: the memory structure that stores previously computed conversation context, so the model doesn't re-read every prior message on each new token) was silently accumulating RAM during long sessions. Over an hour of use, this could exhaust available memory entirely.
- Periodic snapshot scheduling added — The MLX runner now takes memory checkpoints during prompt prefill (the phase where the model reads and encodes your input before generating output), preventing gradual memory buildup across multiple back-to-back queries.
- Anthropic-compatible API cache hit rate improved — If you call Ollama through a client that speaks the Claude API format, repeated identical prompts now reuse cached computation instead of re-processing from scratch — noticeably faster in agent loops.
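To see why a leaking KV cache exhausts RAM so quickly, it helps to put numbers on it: the cache holds one key and one value vector per token, per layer, per KV head. The formula below is standard transformer arithmetic; the model dimensions plugged in are purely illustrative, not Gemma 4's actual (unpublished) configuration.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache footprint: 2 (one key + one value vector) x layers x
    KV heads x head dimension x tokens x bytes (fp16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative numbers (NOT Gemma 4's real config): a 32-layer model
# with 8 KV heads of dimension 128, holding a 32k-token context.
gb = kv_cache_bytes(32, 8, 128, 32_768) / 1024 ** 3
print(f"KV cache: {gb:.1f} GB")   # 4.0 GB for a single long session
```

A multi-gigabyte buffer that accumulates instead of being released is exactly the failure mode described above: a few back-to-back long sessions, and a 16 GB machine runs dry.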
Five Ollama v0.20.0 Fixes That Were Quietly Breaking Real Workflows
Alongside Gemma 4, v0.20.0 and its release candidates patched several issues that were actively affecting users running other popular models:
- Qwen3.5 tool calling corrupted — Qwen is Alibaba's open-weight model family. A bug was injecting tool-call instructions (structured commands that tell the model to invoke external functions) into the model's internal chain-of-thought output, producing garbled responses when tools were enabled.
- Grok flash attention crash — Flash attention (a GPU memory optimization that computes attention scores in smaller tiles instead of materializing the full attention matrix) was incorrectly enabled for xAI's Grok models. On most hardware this caused hard crashes. Now disabled for Grok by default.
- qwen3-next:80b loading failure resolved — The 80-billion parameter variant simply refused to load. Fixed in this release.
- SentencePiece BPE tokenizer added — A tokenizer (the preprocessing system that splits raw text into the numeric tokens an AI model actually reads) used by several newer model families is now natively supported, expanding compatibility.
- MLX add_bos_token now respected — Some models require a special beginning-of-sequence token (a marker telling the model "a new input starts here") that the MLX pipeline was silently ignoring, causing subtly wrong outputs on affected models.
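For readers unfamiliar with what a BPE tokenizer actually does, the core idea fits in a few lines: repeatedly find the most frequent adjacent pair of tokens and merge it into a new, longer token. The sketch below illustrates that generic merge step — it is a teaching toy, not Ollama's SentencePiece implementation, which also handles vocabulary files, byte fallback, and whitespace markers.

```python
from collections import Counter

def bpe_merge_step(tokens: list[str]) -> list[str]:
    """One BPE step: find the most frequent adjacent pair and merge
    every occurrence of it into a single token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)   # fuse the winning pair
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single characters and run a few merge rounds:
tokens = list("low lower lowest")
for _ in range(4):
    tokens = bpe_merge_step(tokens)
print(tokens)
```

After a handful of rounds, frequent substrings like "low" have fused into single tokens — which is why a model trained with one tokenizer produces garbage when the runtime splits text differently, and why native SentencePiece BPE support matters for compatibility.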
The Bigger Picture: Local AI Automation Is Catching Up Fast
Google released Gemma 4 in early 2026 to challenge open-source leaders like Meta's Llama and Mistral AI. Ollama's team shipped full compatibility within days — including the audio multimodal capability that even some enterprise cloud tools haven't added yet. That turnaround speed is becoming a competitive signal in itself.
The Apple Silicon angle matters beyond MacBook enthusiasts. Over 100 million Macs are estimated to be in active use. If running a 26-billion-parameter model locally becomes as reliable as streaming a YouTube video — which the MLX work is pushing toward — then the economics of cloud AI subscriptions (typically $20–$30/month per tool) start to look very different for individual users and small teams.
If you haven't set up Ollama yet, you can get started with local AI in under 10 minutes. For hands-on guides on using Gemma 4 for specific tasks like code review, document summarization, or voice transcription, the AI automation guides section has step-by-step walkthroughs — no prior technical background required.