Ollama v0.19.0: Free GitHub Copilot Alternative for VS Code
Ollama v0.19.0 lets you replace GitHub Copilot with any free local AI model in VS Code — no subscription, no cloud. Apple Silicon MLX support added.
GitHub Copilot costs $19 a month. Ollama — the free, open-source tool that runs AI models on your own laptop — just changed that. Version 0.19.0, released March 31, 2026, connects directly to Visual Studio Code through the Copilot extension and lets you swap the cloud assistant for any local model you already own. No API key. No monthly bill. No internet connection required.
The same release adds preview support for Apple's MLX framework (Apple's machine learning engine, purpose-built for M-series chips), cutting inference latency on MacBooks without any extra setup. Two days later, v0.20.0-rc0 landed on April 2, 2026, fixing critical issues with Google's Gemma4 and Alibaba's Qwen3.5 models. These aren't incremental tweaks — Ollama is systematically dismantling every reason developers still pay for cloud AI subscriptions, from AI automation pipelines to everyday vibe coding.
The $19/Month Problem Ollama Just Solved
GitHub Copilot Individual costs $19/month. Copilot Business runs $19 per user per month. For a five-person engineering team, that's $1,140 a year just for AI autocomplete. Ollama v0.19.0 turns your existing hardware into a zero-cost replacement.
The integration is remarkably direct. Ollama hooks into the GitHub Copilot extension already installed in VS Code — no new extension, no extra tooling. Once Ollama is running locally, you open VS Code's Copilot settings and select any Ollama model from a dropdown menu. From that moment, every code suggestion routes through your machine instead of Microsoft's servers.
The practical benefits go well beyond the $0 price tag:
- Complete privacy: Your code never leaves your machine. For anyone handling proprietary software, medical records, or financial data — industries where cloud data leaks carry legal consequences — this isn't just convenient, it's essential.
- Lower latency: Local inference (running the AI on your own CPU or GPU rather than a remote server) cuts the round-trip delay of a cloud API call. On modern hardware, suggestions often feel faster than the cloud version.
- Unrestricted model choice: You're not locked into GitHub's curated list. Pull Qwen2.5-Coder, Llama 3.3, Mistral, Gemma4, or any other model from Ollama's library with one command and use it as your coding assistant immediately.
- Offline capability: No internet? Still works. The model runs entirely on your hardware.
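The plumbing behind those benefits is Ollama's local HTTP API, which listens on port 11434 by default. As a minimal sketch of the request shape (the endpoint and field names follow Ollama's documented /api/generate interface; the model name is just an example, use whatever you have pulled):

```python
import json

# Ollama serves its API on localhost by default; nothing leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """Build the JSON body for a non-streaming local completion request."""
    payload = {
        "model": model,    # any pulled model, e.g. "qwen2.5-coder:7b"
        "prompt": prompt,
        "stream": False,   # return one JSON object instead of a token stream
    }
    return json.dumps(payload).encode("utf-8")

body = build_request("qwen2.5-coder:7b",
                     "Write a Python function that reverses a string.")
print(body.decode())
```

POSTing that body to OLLAMA_URL with any HTTP client returns the completion in the reply's "response" field, and the request never leaves localhost.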
This move follows a pattern Ollama has been executing for over a year: take a workflow that required cloud dependency, and make it run locally without friction. The VS Code integration is the biggest friction point Ollama has eliminated yet — it places local AI directly inside the editor that the majority of professional developers use daily. Previously, VS Code users who wanted local AI had to install separate extensions, configure custom endpoints, and fight compatibility issues. Now it's a single dropdown selection in a settings panel they already know.
Apple Silicon Gets Its Own AI Engine
The second major addition in v0.19.0 is MLX support — and it matters for the tens of millions of Mac users who already own M-series hardware without realizing how much AI performance they've been leaving on the table.
MLX is Apple's own machine learning framework (a software library that tells the chip how to run AI calculations efficiently), engineered specifically for M1, M2, M3, and M4 processors. Before this, Ollama used GGML — a general-purpose model format and runner that works across Windows, Mac, and Linux. GGML works well everywhere, but it wasn't built for what makes Apple Silicon unusual: unified memory architecture (a design where the CPU and GPU share a single pool of RAM instead of maintaining separate memory banks).
In a standard Windows or Linux PC, moving data between the CPU and GPU takes time — it's a genuine bottleneck during AI inference (the process of generating tokens from a model). On Apple Silicon, both chips share a single memory pool. A 32GB MacBook Pro has a full 32GB available for AI inference with zero transfer overhead. MLX is written to exploit this directly. GGML was designed for universality, not Apple-specific optimization.
The Ollama team added two reliability mechanisms alongside the new MLX runner:
- Periodic snapshots during prompt processing: The MLX runner now saves its working state at regular intervals. This prevents "token exhaustion" (a failure mode where the model runs out of available context window and crashes mid-response), which was causing silent failures on long coding sessions before the fix.
- Memory leak fix: A bug in the snapshot handling code was causing memory to grow unbounded during extended use — meaning a long VS Code session would eventually slow to a crawl. This was caught and patched between v0.19.0 and the April 2 release of v0.20.0-rc0, a two-day turnaround that signals the team is actively monitoring real-world usage.
One important caveat: MLX support in Ollama is currently marked as preview. That means it's functional for most workflows, but edge cases exist. The feature is stable enough for daily use — the Ollama team is collecting real-world feedback to finalize it for a stable release. If you hit issues, falling back to the GGML runner is as simple as setting an environment variable.
Gemma4, Qwen3.5, and the Bug-Fix Sprint
While VS Code integration and MLX grabbed the headlines, v0.20.0-rc0 also resolved long-standing problems affecting two of the most popular model families in Ollama's catalog.
Gemma4 Gets Proper Architecture Support
Google's Gemma4 uses a Mixture-of-Experts (MoE) architecture — a design where the model routes each query to specialized "expert" sub-networks rather than activating all its parameters at once. This makes large models faster and cheaper to run, but MoE models require precise handling of how their internal "gate" layers split calculations. Ollama's GGML runner had a bug in the fused gate_up split logic, causing Gemma4 to produce incorrect outputs or fail silently. The v0.20.0-rc0 fix resolves this, making Gemma4 properly stable under Ollama for the first time.
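Ollama's runner code isn't reproduced here, but the class of bug is easy to picture. Many transformer MLPs store the gate and up projections as one fused weight matrix and split it at inference time; slicing the wrong halves silently corrupts every output while still producing plausible-looking numbers. A NumPy sketch of the correct split (illustrative only, with made-up dimensions, not Gemma4's real architecture or Ollama's GGML code):

```python
import numpy as np

d_model, d_ff = 8, 16
rng = np.random.default_rng(0)

# Fused weight: gate and up projections stored side by side, shape (d_model, 2*d_ff).
w_fused = rng.standard_normal((d_model, 2 * d_ff))

def silu(x):
    return x / (1.0 + np.exp(-x))

def mlp_fused(x, w_fused):
    """SwiGLU-style MLP: split the fused matrix, gate half modulates the up half."""
    w_gate, w_up = np.split(w_fused, 2, axis=1)  # split along the feed-forward axis
    return silu(x @ w_gate) * (x @ w_up)

def mlp_separate(x, w_gate, w_up):
    """Reference computation with the two projections kept separate."""
    return silu(x @ w_gate) * (x @ w_up)

x = rng.standard_normal((1, d_model))
w_gate, w_up = np.split(w_fused, 2, axis=1)

# The fused path must agree with the separate reference; swapping the halves
# (the flavor of bug fixed in the gate_up split logic) would break this.
assert np.allclose(mlp_fused(x, w_fused), mlp_separate(x, w_gate, w_up))
```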
Qwen3.5 Tool Calls Now Work Correctly
Qwen3.5 (Alibaba's reasoning-capable model series) had a frustrating bug for anyone using it in automated workflows: tool calls (structured commands the AI issues to invoke external functions, like running a script or querying a database) were appearing inside the model's "thinking" output instead of as clean, parseable instructions. This broke every Qwen3.5-powered agent task silently. The multiline tool-call argument parsing fix in v0.20.0-rc0 resolves it — Qwen3.5 can now reliably drive AI automation without garbled output.
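The failure mode is easier to see with a concrete sketch. Below is a toy parser of my own, not Ollama's implementation: the <tool_call> tag format and the run_query function are invented for illustration. It extracts a tool call whose JSON arguments span multiple lines, which is exactly the shape a single-line parser would silently miss:

```python
import json
import re

# Toy model output: one tool call whose JSON arguments span several lines.
raw_output = """<tool_call>
{
  "name": "run_query",
  "arguments": {
    "sql": "SELECT id FROM users",
    "limit": 10
  }
}
</tool_call>"""

def parse_tool_call(text: str):
    """Pull the first <tool_call> block out of model output and decode its JSON.

    re.DOTALL lets the match span newlines; a parser that only matched
    single-line arguments would return None for raw_output above.
    """
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

call = parse_tool_call(raw_output)
print(call["name"], call["arguments"]["limit"])
```

When the model instead emits that block inside its thinking text, an agent framework looking for a clean tool-call channel sees nothing to execute, which is why the bug broke automation silently.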
Other notable fixes across both releases:
- Flash attention (a memory-efficient algorithm for speeding up the core attention computation in transformer models) was incorrectly enabled for Grok models — causing subtly wrong outputs. Disabled in v0.18.4.
- KV cache hit rates improved for Anthropic-compatible API endpoints (the interface that lets tools designed for Claude — including Claude Code — also work with local Ollama models), reducing redundant computation on long conversations.
- qwen3-next:80b — an 80-billion parameter model requiring careful memory management — now loads properly after a previous loader failure.
- The macOS "model is out of date" false notification is fixed — no more phantom warnings interrupting your workflow.
- ollama launch pi now bundles an integrated web search plugin, giving you a search-augmented local AI with a single command.
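The KV cache improvement is about reuse: if a new request shares a prefix with an earlier one (a long system prompt, the conversation so far), the attention state already computed for that prefix can be reused instead of recomputed. A deliberately simplified sketch of prefix reuse (conceptual only, nothing like Ollama's actual cache code):

```python
# Toy prefix cache: maps a prompt prefix to its (pretend) computed state.
cache: dict[str, str] = {}
hits = misses = 0

def process(prompt: str) -> str:
    """Reuse the 'state' of the longest cached prefix; compute only the rest."""
    global hits, misses
    for n in range(len(prompt), 0, -1):          # try the longest prefix first
        prefix = prompt[:n]
        if prefix in cache:
            hits += 1
            state = cache[prefix] + f"+compute({prompt[n:]!r})"
            break
    else:
        misses += 1                               # nothing reusable: full compute
        state = f"compute({prompt!r})"
    cache[prompt] = state                         # remember the full prompt too
    return state

process("SYSTEM: be helpful. USER: hi")
process("SYSTEM: be helpful. USER: hi ASSISTANT: hello USER: bye")  # prefix hit
print(hits, misses)
```

On a long conversation, every turn after the first can hit the cached prefix, which is where the reduced redundant computation comes from.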
How to Set Up Ollama as a Free Local AI in VS Code (Under 5 Minutes)
Whether you're after the VS Code integration or want to try MLX on a Mac, setup is minimal:
```shell
# 1. Install or update Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a coding-optimized model for VS Code
ollama pull qwen2.5-coder:7b

# 3. Or try Google's Gemma4 (now with proper architecture support)
ollama pull gemma4

# 4. Verify your models are ready
ollama list

# 5. Launch the search-augmented assistant
ollama launch pi
```
For VS Code: with Ollama running locally and the GitHub Copilot extension installed, go to VS Code's settings, search for "Copilot model," and select your Ollama model from the dropdown. No restart required — the switch is instant.
For Apple Silicon: the MLX runner activates automatically on M-series chips once you're on v0.19.0 or later. No configuration needed. Ollama detects your hardware and routes accordingly. Test it by pulling a larger model and comparing time-to-first-token against your previous experience.
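You don't have to eyeball that comparison: a non-streaming /api/generate response includes nanosecond timing fields alongside the text. A small helper for turning them into readable numbers (the field names and units follow Ollama's API; the sample values below are invented for illustration):

```python
# Example metrics from a non-streaming /api/generate response.
# Field names and nanosecond units follow Ollama's API; values are made up.
sample_response = {
    "prompt_eval_count": 26,              # tokens in the processed prompt
    "prompt_eval_duration": 130_000_000,  # ns spent processing the prompt
    "eval_count": 120,                    # generated tokens
    "eval_duration": 2_400_000_000,       # ns spent generating
}

def tokens_per_second(resp: dict) -> float:
    """Generation speed: generated tokens divided by generation time in seconds."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

def time_to_first_token_s(resp: dict) -> float:
    """Rough time-to-first-token: prompt processing time in seconds."""
    return resp["prompt_eval_duration"] / 1e9

print(f"{tokens_per_second(sample_response):.1f} tok/s, "
      f"TTFT ~ {time_to_first_token_s(sample_response):.2f}s")
```

Run the same model and prompt before and after upgrading, and the eval_duration numbers give you a direct GGML-versus-MLX comparison.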
The gap between "local AI" and "cloud AI" used to be measured in friction: more setup steps, worse tool integration, slower models. Ollama v0.19.0 closes that gap at two of its widest points — the editor where developers spend their day, and the chip architecture that powers one in four developer laptops. If you're still paying $19/month for Copilot, try the local setup first — you might not renew.