Ollama v0.30.0 Drops GGML: llama.cpp + MLX Now Native
Ollama v0.30.0: llama.cpp replaces GGML, Apple Silicon gains native MLX acceleration. Pre-release rc15 is live — test your local AI models now.
Ollama v0.30.0, now in pre-release as rc15, makes a fundamental architectural change: it drops GGML (the old computation library that sat as a translation layer between Ollama and its AI engine) and talks directly to llama.cpp. For Mac users running Apple Silicon (M1 through M4), models now accelerate through MLX, Apple's own machine learning framework built specifically for its chips. For anyone running AI models locally without paying for cloud subscriptions, this is a meaningful shift in how the engine under the hood works.
Cutting Out the Middleman: How llama.cpp Native Support Changes Local AI
Before v0.30.0, Ollama's internal chain looked like this: your model request → Ollama → GGML → llama.cpp → output. GGML (an open-source C library for machine learning math operations — essentially the numeric engine underneath local AI) sat in the middle as an abstraction layer. Ollama built on top of GGML rather than calling llama.cpp directly.
Version 0.30.0 collapses that chain: requests now go Ollama → llama.cpp → output. The team's own description: "This version of Ollama will change the architecture to directly support llama.cpp instead of building on top of GGML."
Why does removing one layer matter? llama.cpp (a C++ implementation that has become the de facto standard for running quantized AI models locally — it's what the whole open-source model community tests against first) is the reference implementation. By standardizing directly on it, Ollama gets several concrete benefits:
- All future llama.cpp optimizations automatically apply without re-implementing them through a GGML wrapper
- Model compatibility improves — developers who test against llama.cpp directly will now get identical behavior in Ollama
- One fewer translation layer means fewer places for edge-case bugs to creep in at component boundaries
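From the user's side, the collapsed chain is invisible: a "model request" is still just a call to Ollama's HTTP API, and the routing to llama.cpp happens inside the server. As a concrete illustration (the model name is a placeholder for whatever you have pulled locally):

curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

If the refactor works as described, this call behaves the same on v0.23.3 and on rc15; only the path the request travels inside the server is shorter.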
Apple Silicon Gets Native MLX Speed — What This Means for Local LLMs
For Mac users, the most tangible change in v0.30.0 is the switch to MLX. MLX (Apple's open-source machine learning framework, built from scratch specifically for Apple Silicon's unified memory design) replaces Ollama's previous acceleration approach on M-series chips.
The technical reason this is a real improvement: Apple Silicon chips don't separate CPU and GPU memory like traditional computers. Everything shares one pool of RAM. MLX was designed to exploit this — it keeps model data in unified memory and lets the CPU, GPU, and Neural Engine (Apple's dedicated on-chip AI accelerator) all access it simultaneously without data-copy overhead. Large language models (LLMs — the AI systems that power text generation, coding help, and chatbots) that require moving gigabytes of weights around benefit significantly from this architecture.
The v0.30.0 changelog also includes an MLX thread affinity update, meaning the framework now makes smarter decisions about which CPU performance cores handle which tasks. The release also fixes a Metal library compilation bug under the macOS 26 SDK (software development kit) that had been causing failures when compiling GPU-accelerated model code in earlier pre-release builds.
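You can watch the unified-memory story play out without any profiling tools. Ollama's built-in ollama ps command lists each loaded model's resident size and where it is executing; a quick check might look like this (llama3.2 is just an example model):

ollama run llama3.2 "hello"   # load a model with a one-shot prompt
ollama ps                     # SIZE = unified-memory footprint, PROCESSOR = CPU/GPU placement

On an M-series Mac you would expect the PROCESSOR column to report 100% GPU for any model that fits comfortably in memory.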
GGUF Format: Finally a First-Class Citizen
GGUF (the file standard for packaging AI model weights alongside their configuration metadata, introduced as the successor to the older GGML binary format) becomes native in v0.30.0. Previously, GGUF support required workarounds; now it is baked directly into the architecture as a fully supported format.
This matters for anyone pulling models from Hugging Face (the largest open-source AI model repository, hosting hundreds of thousands of models). GGUF has become the dominant format for quantized models, compressed AI models that trade minor accuracy reductions for dramatically smaller file sizes. The arithmetic is straightforward: a 70B-parameter model stored at 16-bit precision needs roughly 70 billion × 2 bytes = 140GB, while 4-bit quantization cuts that to about 0.5 bytes per parameter, or around 35-40GB once metadata overhead is included. With native GGUF support in v0.30.0, pointing Ollama at any of these files no longer requires conversion steps; it just works.
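In recent Ollama releases there are two standard ways to get a GGUF file running: pulling directly from Hugging Face, or importing a local file through a Modelfile. Both are sketched below; the repository name and quantization tag are examples, not recommendations:

# pull a quantized GGUF straight from Hugging Face
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

# or import a local .gguf file via a Modelfile
echo 'FROM ./my-model.gguf' > Modelfile
ollama create my-model -f Modelfile
ollama run my-model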
Ollama v0.30.0 rc15: What's Not Working Yet
Version 0.30.0 is explicitly pre-release, and two model types are currently non-functional:
- laguna-xs.2 — incompatible with the new v0.30.0 architecture in current builds
- llama3.2-vision — vision model support (models that accept both image and text input) is planned but not yet implemented in the new architecture
The pace of iteration signals how active the work is: 6 release candidates (rc10 through rc15) shipped in just 4 days. The team is specifically asking early testers to report on three areas:
- Performance — is your model noticeably faster or slower than on v0.23.3? (a benchmarking sketch follows this list)
- Crashes and errors — anything that did not happen on the previous stable build
- Memory usage — whether RAM consumption has improved or increased for your models
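For the performance and memory questions, you can get comparable numbers out of the same /api/generate call shown earlier. The non-streaming response includes total_duration, eval_count, and eval_duration fields (durations in nanoseconds), so a rough tokens-per-second figure falls out directly; this sketch assumes jq is installed:

# run the identical prompt on v0.23.3 and on rc15, then compare
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize the plot of Hamlet in three sentences.",
  "stream": false
}' | jq '{total_seconds: (.total_duration / 1e9), tokens_per_second: (.eval_count / (.eval_duration / 1e9))}'

Run it several times and discard the first result, since the initial call includes model-load time; compare medians rather than single runs, and pair it with ollama ps for the memory side.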
The current stable release remains v0.23.3, which shipped on May 13, 2026 with its own improvements: app updater security hardening, Windows update flow fixes with new opt-in CI testing, and reliability improvements to the integration test suite across 7+ merged pull requests.
How to Test Ollama v0.30.0-rc15 on Your Machine Today
The Ollama team provides install scripts with explicit version pinning (locking the installer to a specific version number so it does not accidentally pull an untested newer build):
Mac or Linux — run in Terminal:
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.30.0-rc15 sh
Windows — run in PowerShell:
$env:OLLAMA_VERSION="0.30.0-rc15"; irm https://ollama.com/install.ps1 | iex
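On either platform, it is worth confirming that the pinned build actually installed before you start testing:

ollama --version    # should report 0.30.0-rc15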
If you rely on llama3.2-vision or laguna-xs.2, stay on v0.23.3 for now — those models will not run on rc15. But if you run text-only models on an M-series Mac, the MLX acceleration switch is genuinely worth benchmarking against your current setup. Load a model, run a prompt, time it, and compare. The team's rapid RC cycle means your feedback can land in the next build within 24 hours. To set up Ollama from scratch or optimize your local AI automation stack, explore our local AI guides — or start from the beginning with our AI setup walkthrough.