2026-04-30 · Ollama · NVIDIA TensorRT · local AI inference · local LLM · AI automation · open source AI · quantized AI models · Apple Silicon AI

Ollama NVIDIA TensorRT: Run Quantized AI Models Locally

Ollama's v0.22 releases add NVIDIA TensorRT support for local AI inference and bring Nemotron 3 Omni and Poolside's Laguna XS.2 to the model library: three production releases shipped in 72 hours.


In 72 hours, Ollama shipped three production releases advancing local AI inference — and quietly crossed a line that separates hobbyist tools from enterprise infrastructure. The addition of NVIDIA TensorRT Model Optimizer support means quantized models (models compressed to use less memory without losing much accuracy) can now run through Ollama for the first time. That is not a minor version bump. It is a signal that the project is moving upmarket.

For the developers and AI engineers who use Ollama to run large language models (LLMs — AI systems that understand and generate text) on their own hardware without sending data to the cloud, this week's v0.22.0 and v0.22.1 releases are worth paying attention to. Three releases in eight days. Seventeen pull requests merged. Two new enterprise-grade models added. And a bug that silently corrupted tokenizer output — patched.

[Image: Ollama GitHub repository showing NVIDIA TensorRT support and local AI inference updates]

From Hobby Runner to Enterprise Local AI Inference Layer

The headline change in v0.22.1 is NVIDIA TensorRT Model Optimizer import support (PR #15566). TensorRT (NVIDIA's toolkit for compressing and accelerating AI models on its GPUs) was previously the domain of specialized inference servers — vLLM, TensorRT-LLM, or custom deployment pipelines. Bringing it into Ollama means developers can now import quantized model checkpoints (the compressed, optimized versions used in production environments) directly through the same CLI (command-line interface — the text-based terminal you type commands into) they already use to run models locally.

This matters in practical terms: quantized models are typically 2–4× smaller than their full-precision counterparts while delivering comparable output quality. An enterprise team that previously needed a dedicated TensorRT pipeline — specialist knowledge, custom tooling — now has a simpler path through Ollama's familiar one-command interface.
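The release notes don't spell out the exact import command, but a minimal sketch, assuming the flow follows Ollama's existing Modelfile convention (a FROM line pointing at a local checkpoint, then ollama create), might look like this. The directory and model names below are placeholders, not confirmed syntax for TensorRT Model Optimizer checkpoints:

# Hypothetical import flow (names and paths are placeholders;
# check the v0.22.1 release notes for the exact syntax)
echo "FROM ./my-quantized-checkpoint" > Modelfile
ollama create my-quantized-model -f Modelfile

# Run the imported model through the usual CLI
ollama run my-quantized-model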

Two enterprise-grade models also landed in v0.22.0:

  • NVIDIA Nemotron 3 Omni — NVIDIA's own open enterprise model, now available in the Ollama model library with a single ollama pull command. The fact that a chipmaker chose Ollama as a primary distribution channel says something about where the project sits in the ecosystem.
  • Poolside Laguna XS.2 — the first open-weight coding model from Poolside (an AI coding startup backed by enterprise investment), making its debut through Ollama's distribution channel. Open-weight means the model's trained parameters are publicly available, so anyone can run, study, or build on top of it.

MLX Batching — The Speed Gain Apple Silicon Users Will Actually Feel

On the performance side, v0.22.1 introduces MLX runner sampler batching across multiple sequences (PR #15736). MLX is Apple's machine learning framework built for M-series chips (Apple Silicon — the processors inside MacBook Pro, Mac Mini, and Mac Studio models from late 2020 onward). Sampler batching means the system can now process multiple inference requests simultaneously rather than queuing them one at a time. For anyone running Ollama as a local inference server — serving multiple browser tabs, multiple agents, or concurrent API calls — this is a meaningful throughput improvement.
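The batching happens inside the runner, so nothing changes in how the server is called. As a rough illustration of the kind of concurrent load that benefits, here is a sketch using Ollama's standard /api/generate endpoint and the existing OLLAMA_NUM_PARALLEL setting; the model name is simply the one pulled later in this article:

# Allow the local server to accept several requests in parallel (existing Ollama setting)
OLLAMA_NUM_PARALLEL=4 ollama serve &

# Fire two concurrent generations; with sampler batching, the MLX runner
# can work across both sequences instead of queuing them one at a time
curl -s http://localhost:11434/api/generate -d '{"model":"nemotron3","prompt":"Summarize MLX batching."}' &
curl -s http://localhost:11434/api/generate -d '{"model":"nemotron3","prompt":"Explain top-k sampling."}' &
wait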

Earlier v0.21.1 changes compound this further. The release notes describe fused top-P and top-K sampling in a single sort pass. Top-P and top-K (probabilistic filters that control how the model selects its next word from a ranked list of candidates) previously required separate computation steps; fusing them into one sort reduces latency per token. Repeat penalties are also now applied at the sampler level, not as a post-processing step.
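These sampler settings are already user-tunable per request through the options field of Ollama's API; the fusion only changes how cheaply they are applied internally. For reference, a request that sets all three looks roughly like this (the model name is a placeholder):

# Set top-p, top-k, and repeat penalty per request via the options field
curl -s http://localhost:11434/api/generate -d '{
  "model": "nemotron3",
  "prompt": "Write a haiku about local inference.",
  "options": {"top_p": 0.9, "top_k": 40, "repeat_penalty": 1.1}
}'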

MLX prompt tokenization was also moved into request handler goroutines (independent processing threads that run in parallel), so tokenizing incoming prompts no longer blocks ongoing generation. For multi-user or multi-session setups, this change reduces head-of-line blocking — the problem where one slow operation stalls everything behind it.

The GLM4 MoE (mixture-of-experts) model got a dedicated optimization too: a fused sigmoid router head (a specialized component that routes each token to the correct expert sub-network within the model) now computes in fewer passes, yielding model-specific throughput gains.

[Image: Ollama v0.22.0 GitHub release notes listing Nemotron 3 Omni and Laguna XS.2 enterprise AI models]

Agentic AI, Kimi K2.6, and the OpenAI Compatibility Layer

Beyond inference performance, Ollama is expanding its role as infrastructure for agentic AI — systems where the AI doesn't just answer questions but takes multi-step actions: browsing, writing code, calling external tools, and passing results between steps. Kimi K2.6, integrated via the Ollama CLI in v0.21.1, is designed for what the release notes describe as "long horizon agentic execution tasks through a multi-agent system." Multi-agent means multiple AI model instances working in parallel or sequence, each responsible for a different part of a larger task.

The OpenAI API compatibility layer received a targeted fix in v0.21.3: the reasoning_effort parameter — used in OpenAI's o1 and o3 model series to control how deeply the model deliberates before responding — now maps to Ollama's internal think parameter. Developers who built on OpenAI's interface and migrated to local Ollama setups no longer need to manually translate that parameter. It passes through automatically.
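As a quick illustration, a request against Ollama's OpenAI-compatible endpoint can now carry the parameter unchanged; the model name below is a placeholder and the payload shape follows OpenAI's convention:

# Send reasoning_effort through the OpenAI-compatible endpoint;
# v0.21.3+ maps it to Ollama's internal think parameter
curl -s http://localhost:11434/v1/chat/completions -d '{
  "model": "kimi-k2.6",
  "messages": [{"role": "user", "content": "Plan a three-step refactor."}],
  "reasoning_effort": "high"
}'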

The MLX runner also gained logprobs support in v0.21.1. Logprobs (log-probabilities — the numerical confidence score assigned to each possible next word before the model makes its selection) enable research workflows, fine-tuning pipelines, and advanced sampling strategies that require visibility into the model's internal probability distribution.
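A client would request them in the usual OpenAI style, roughly as sketched below. Whether this surfaces through the OpenAI-compatible endpoint or only Ollama's native API isn't stated in the release notes, so treat the request shape as an assumption:

# Request per-token log-probabilities (standard OpenAI-style parameters;
# exposure through Ollama's compatibility layer is assumed here, not confirmed)
curl -s http://localhost:11434/v1/chat/completions -d '{
  "model": "nemotron3",
  "messages": [{"role": "user", "content": "Hello"}],
  "logprobs": true,
  "top_logprobs": 5
}'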

Three Bug Fixes That Remove Silent Problems

Bug fixes rarely make headlines, but three across this release window address failures that could silently degrade real workflows:

  • Desktop app startup crash (PR #15657): Active inference sessions were being killed when the Ollama desktop app initialized. Running Ollama as a background service while also opening the app interface no longer terminates in-progress requests. This was reproducible and hit anyone who opened the desktop app while a session was running.
  • Tokenizer BPE offset corruption (PR #15844): A bug in multi-regex BPE handling (byte-pair encoding — the standard method most LLMs use to break text into processable chunks before generation) caused incorrect character position tracking. Silently wrong tokenization is worse than an obvious error because inference continues, but output quality degrades in hard-to-diagnose ways.
  • macOS model picker stale display (v0.21.1): The chat interface was showing the previous model name after switching conversations. Gemma 4 structured output in think=false mode was also corrected in the same release.

Install Ollama v0.22.1 and Run the New AI Models

Updating to v0.22.1 takes one command on macOS and Linux. The new models — Nemotron 3 Omni, Laguna XS.2, and Kimi K2.6 — are then available immediately from the Ollama model library with no additional configuration, no API key, and no cloud sign-up required.

# Update Ollama to v0.22.1 (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull NVIDIA Nemotron 3 Omni (enterprise open model)
ollama pull nemotron3

# Pull Poolside Laguna XS.2 (open-weight coding model)
ollama pull laguna-xs:xs.2

# Run Kimi K2.6 for multi-step agentic tasks
ollama run kimi-k2.6

# Verify version
ollama --version

If you are running an OpenAI-compatible client (a tool originally built for ChatGPT's API, such as Open WebUI or LibreChat), the new reasoning_effort mapping means that parameter passes through to local models without modification. No code changes required on the client side.
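For clients built on the official OpenAI SDKs, the switch is typically just a base-URL change. One common pattern, assuming the client honors the standard environment variables, is:

# Point an OpenAI-SDK-based client at the local Ollama server
# (assumes the client reads these standard environment variables)
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"   # any non-empty string; Ollama does not check it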

The 17 pull requests merged across these three releases — from 7 contributors including jessegross, dhiltgen, hoyyeva, ParthSareen, and madflow — represent a clear directional shift. Ollama is no longer optimizing only for the solo developer who wants to chat with a model offline. It is adding the quantization import pipeline, enterprise model distribution, parallel batching infrastructure, and multi-agent integration that production deployments require. To learn how local AI inference fits into practical automation workflows, the AI automation guides cover setup from scratch on any hardware.

