vLLM v0.18.0: GPU-Free Open-Source LLM Serving Is Here
vLLM v0.18.0 enables GPU-free LLM serving, adds gRPC & 6 new AI models — 213 contributors built the biggest open-source AI infrastructure release of 2026.
vLLM v0.18.0 delivers GPU-free preprocessing for open-source LLM serving, cutting infrastructure costs for every team running AI at scale. The open-source inference engine (software that serves AI responses at scale) trusted by Meta, Microsoft, and Hugging Face just shipped its most ambitious release, letting teams preprocess AI inputs on regular CPU servers, with no expensive GPU required for that stage. Built by 213 contributors with 445 commits merged, it is the largest coordinated update in the project's history.
For engineering teams building AI automation pipelines and paying cloud GPU bills, this matters more than any benchmark. Splitting the pipeline (the full sequence of steps that turn a user's question into an AI response) frees expensive GPU hardware from routine prep work — and that translates directly into lower infrastructure costs at every scale.
vLLM GPU-Free LLM Serving: How It Works
A typical LLM serving pipeline has two phases: preprocessing — tokenizing text (converting words into numbers the AI understands), formatting prompts, and preparing inputs — and inference (the actual AI computation where responses are generated). Until v0.18.0, both phases required GPU access.
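To make the two phases concrete, here is a minimal sketch of what the CPU-only preprocessing stage does. The chat template and tokenizer below are toy stand-ins for illustration, not vLLM's real implementation:

```python
# Conceptual sketch of the preprocessing phase: format a prompt, then
# turn it into token IDs. Toy code only -- real systems use learned
# subword tokenizers, not whitespace splitting.

def format_prompt(system: str, user: str) -> str:
    """Wrap messages in a hypothetical chat template."""
    return f"<|system|>{system}<|user|>{user}<|assistant|>"

def tokenize(text: str) -> list[int]:
    """Toy tokenizer: give each distinct whitespace-split word an ID."""
    vocab: dict[str, int] = {}
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

prompt = format_prompt("Be concise.", "What is vLLM?")
token_ids = tokenize(prompt)
print(token_ids)  # everything above ran on plain CPU
```

Nothing in this stage touches a GPU; only the resulting token IDs need to reach the inference node, which is what makes the split possible.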
The new vllm launch render command runs the entire preprocessing phase on CPU hardware. This architectural split delivers four concrete wins:
- Lower cost: CPU server time runs at a fraction of GPU rates ($0.05/hr vs $2–$4/hr for comparable cloud instances)
- Independent scaling: Add preprocessing nodes during traffic spikes without expanding GPU capacity
- Simpler architecture: Eliminates external preprocessing services teams previously had to build themselves
- Better GPU utilization: GPUs focus exclusively on high-value inference work
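The cost bullet above can be put in rough numbers. This back-of-envelope sketch uses the rates quoted in the article; the 15% preprocessing share of wall-clock time is a made-up assumption for illustration:

```python
# Illustrative cost model for the CPU/GPU split. Rates come from the
# article; PREP_SHARE is a hypothetical assumption, not a benchmark.
CPU_RATE = 0.05    # $/hr for a CPU instance (article's figure)
GPU_RATE = 3.00    # $/hr, midpoint of the article's $2-$4 GPU range
PREP_SHARE = 0.15  # assumed fraction of serving time spent preprocessing

def hourly_cost(split: bool) -> float:
    """Cost of one hour of serving, with or without the split."""
    if not split:
        return GPU_RATE  # the GPU handles both phases
    # Preprocessing moves to CPU; billable GPU time shrinks accordingly.
    return (1 - PREP_SHARE) * GPU_RATE + PREP_SHARE * CPU_RATE

saving = hourly_cost(split=False) - hourly_cost(split=True)
print(f"${saving:.2f}/hr saved per GPU-equivalent")
```

Under these assumptions the split saves roughly $0.44 per GPU-hour; the real figure depends entirely on how much of your workload is preprocessing.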
The impact is largest for MoE deployments (Mixture of Experts — AI architectures that activate only selected specialist sub-networks per request, making them more compute-efficient than full-parameter models). In those setups, preprocessing overhead has historically consumed GPU cycles better spent on actual generation work.
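The MoE idea in parentheses above can be sketched in a few lines: a gate scores the experts and only the top-k actually run, so most parameters stay idle on any given request. Purely illustrative; real gates are learned neural layers, not fixed score lists:

```python
# Toy Mixture-of-Experts routing: score experts, run only the top-k.

def route(scores: list[float], k: int = 2) -> list[int]:
    """Return indices of the k highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def moe_forward(x: float, experts, scores, k: int = 2) -> float:
    """Run only the selected experts and average their outputs."""
    active = route(scores, k)
    return sum(experts[i](x) for i in active) / len(active)

experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
scores = [0.1, 0.7, 0.05, 0.6]  # hypothetical gate scores for one token
result = moe_forward(5.0, experts, scores)  # only experts 1 and 3 execute
```

Because half the experts never execute here, the compute per token is a fraction of a dense model's, which is exactly why shaving preprocessing off the GPU matters most for these deployments.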
gRPC: A Faster Lane for AI Requests
The second major infrastructure addition is gRPC serving (a high-speed binary communication protocol — as opposed to HTTP/REST, which sends data as verbose human-readable text and carries more overhead per call). A new --grpc flag enables this mode alongside the existing HTTP interface, so teams can migrate incrementally.
HTTP (the protocol your browser uses to load web pages) handles occasional requests fine. For high-frequency machine-to-machine calls — a recommendation engine, content moderation system, or customer service bot making thousands of LLM requests per second — gRPC reduces round-trip latency (the time between sending a request and receiving a reply) by transmitting data in compact binary format. The switch is a single flag; the gains are real at production scale.
# Serve with gRPC support enabled
vllm serve meta-llama/Llama-3-8B-Instruct --grpc
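The "compact binary format" claim is easy to see with a toy comparison: the same request serialized as JSON text versus a packed binary layout. The struct layout below is invented for illustration; it is not vLLM's actual gRPC/protobuf schema:

```python
# Same request payload, two encodings: JSON text vs packed binary.
# The binary layout here is a toy stand-in for protobuf-style encoding.
import json
import struct

token_ids = [101, 2009, 2003, 1037, 3231, 102]
request = {"max_tokens": 128, "temperature": 0.7, "token_ids": token_ids}

json_bytes = json.dumps(request).encode("utf-8")
# Pack: uint32 max_tokens, float32 temperature, length-prefixed ID array.
binary = struct.pack(
    f"<IfI{len(token_ids)}I", 128, 0.7, len(token_ids), *token_ids
)
print(len(json_bytes), len(binary))  # binary is markedly smaller
```

The binary payload is 36 bytes here versus well over twice that for the JSON text. Per call the difference is trivial; at thousands of calls per second, the encoding, decoding, and bandwidth savings compound.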
Smarter Caching and Faster LLM Inference
Two performance upgrades in v0.18.0 target the bottlenecks production teams hit most often in high-traffic deployments:
KV Cache with CPU Offloading
KV Cache (Key-Value Cache — GPU memory storage where the AI saves intermediate calculations from earlier in a conversation, avoiding costly recomputation) now features smart CPU offloading via a new FlexKV backend. Frequently reused memory blocks stay on GPU; less-used blocks migrate automatically to cheaper CPU RAM.
In practice: longer conversations and larger batch sizes (processing multiple user requests simultaneously) become feasible without requiring more GPU memory. Teams building multi-turn chat applications will feel this most immediately — especially those supporting sessions with 50+ message turns.
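The offloading policy can be sketched as a two-tier cache: a fixed number of hot blocks live in the "GPU" tier, and the coldest block spills to a "CPU" tier when capacity is exceeded. This toy LRU policy illustrates the idea only; it is not FlexKV's actual algorithm:

```python
# Conceptual two-tier KV-cache: hot blocks on "GPU", cold blocks
# offloaded to "CPU" RAM. Toy LRU policy for illustration.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_blocks: int):
        self.gpu = OrderedDict()  # block_id -> data, in LRU order
        self.cpu = {}             # overflow tier (cheap host RAM)
        self.capacity = gpu_blocks

    def access(self, block_id, data=None):
        """Touch a block: promote it to GPU, evict the coldest if full."""
        if block_id in self.cpu:
            data = self.cpu.pop(block_id)   # promote back from CPU
        elif block_id in self.gpu:
            data = self.gpu.pop(block_id)   # refresh its LRU position
        self.gpu[block_id] = data
        if len(self.gpu) > self.capacity:
            cold_id, cold = self.gpu.popitem(last=False)
            self.cpu[cold_id] = cold        # offload the coldest block

cache = TieredKVCache(gpu_blocks=2)
for blk in ["a", "b", "a", "c"]:  # "b" is coldest when "c" arrives
    cache.access(blk, data=f"kv:{blk}")
print(sorted(cache.gpu), sorted(cache.cpu))
```

After the accesses above, blocks "a" and "c" occupy the GPU tier while "b" sits in CPU RAM, ready to be promoted if the conversation revisits it.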
GPU-Native Speculative Decoding
Speculative decoding (a technique where a small draft model pre-generates likely text that the main model verifies in parallel — significantly speeding up output generation without changing the final result) now runs directly on GPU and integrates with the async scheduler (a system that processes multiple requests simultaneously rather than one at a time). This removes a synchronous execution bottleneck present in earlier versions.
Eagle3 speculative decoding models are now available specifically for Qwen3.5 and Kimi K2.5 MLA, providing out-of-the-box speed improvements for teams already using those model families — no custom configuration needed.
Ray Is Gone — A Leaner Stack for Most Teams
Ray (an open-source distributed computing framework used to coordinate tasks across multiple machines in a cluster) has been removed as a default dependency. Until now, installing vLLM automatically pulled in the entire Ray ecosystem — a heavyweight framework carrying cluster management overhead most users never needed.
For teams running vLLM on a single server or small cluster — the majority of production deployments — removing Ray means:
- Faster install: no Ray overhead on pip install vllm
- Smaller Docker images (containers) for containerized production environments
- Fewer dependency conflicts in complex Python machine learning environments
- Explicit opt-in: teams that genuinely need Ray for multi-node setups can still install it manually
This signals a clear architectural direction: vLLM is optimizing for standalone, self-contained deployments over distributed Ray cluster configurations. Most production teams were carrying infrastructure weight they never used. Simpler stacks are more maintainable stacks.
6 New Model Families — and One Known Bug
v0.18.0 adds support for 6 new model architectures, expanding vLLM's coverage across multilingual and multimodal use cases:
- Sarvam MoE — mixture-of-experts model optimized for Indian language tasks
- OLMo Hybrid — Allen Institute's architecture combining transformer and state-space components
- HyperCLOVAX-SEED — Naver's Korean-specialized large language model family
- Kimi-Audio-7B — Moonshot AI's 7-billion-parameter audio-language model
- ColPali — document retrieval with visual page understanding for PDF and image inputs
- ERNIE pooling — Baidu's embedding model (a tool that converts text into numbers AI can understand for similarity search) for semantic retrieval
LoRA support (Low-Rank Adaptation — a method for fine-tuning AI models efficiently using a fraction of normal compute) also expands significantly. Whisper LoRA enables transcription model customization, and FP8 LoRA (8-bit floating point compression — shrinks model weights to save GPU memory and boost throughput) gains a dense kernel optimization for production deployments handling heavy request volumes.
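The "fraction of normal compute" in the LoRA definition above is simple arithmetic: instead of updating a full d_out x d_in weight matrix, LoRA trains two low-rank factors B (d_out x r) and A (r x d_in). The dimensions below are illustrative, not tied to any specific model in this release:

```python
# Parameter-count arithmetic behind LoRA: two low-rank factors replace
# updates to the full weight matrix. Shapes are illustrative.

def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters if the whole matrix is fine-tuned."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for the B (d_out x r) and A (r x d_in) pair."""
    return d_out * r + r * d_in

d_out = d_in = 4096  # a typical transformer projection size
r = 16               # a commonly used LoRA rank
ratio = lora_params(d_out, d_in, r) / full_params(d_out, d_in)
print(f"LoRA trains {ratio:.2%} of the full matrix's parameters")
```

At these sizes LoRA touches well under 1% of the matrix's parameters, which is why combining it with FP8 weights is attractive for memory-constrained production serving.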
One bug to check before upgrading: Teams running Qwen3.5 on NVIDIA B200 GPUs with FP8 KV cache quantization (memory compression to 8-bit format to reduce GPU RAM usage) face confirmed accuracy degradation in this release. Hold off on upgrading that specific configuration until a patch lands. Separately, the CUBLAS errors reported in v0.17.0 are now resolved — but the fix requires upgrading to the PyTorch 2.10.0 wheel alongside vLLM.
How to Upgrade — and Where to Start
# Upgrade to vLLM v0.18.0
pip install vllm==0.18.0
# Standard HTTP serving (unchanged from v0.17)
vllm serve meta-llama/Llama-3-8B-Instruct
# New: high-performance gRPC serving
vllm serve meta-llama/Llama-3-8B-Instruct --grpc
# New: GPU-free preprocessing node
vllm launch render
# Eagle3 speculative decoding for Qwen3.5
vllm serve Qwen/Qwen3.5-7B-Instruct --speculative-model Qwen/Qwen3.5-7B-eagle3
Full release notes and migration details live on the vLLM v0.18.0 GitHub release page. Notably, 61 of the 213 contributors were brand new to the project — a strong signal that vLLM's production user base is expanding well beyond its UC Berkeley origins. If your team is evaluating open-source AI infrastructure, the AI for Automation deployment guides walk through GPU-free rendering, gRPC setup, and multi-model serving configurations. Start with GPU-free rendering first — it is the lowest-risk upgrade path for any team already running vLLM in production today.