GreenBoost lets you run AI models bigger than your GPU
A new open-source Linux tool transparently extends GPU memory using system RAM and NVMe storage, letting you run 32 GB AI models on a 12 GB graphics card at up to 60 tokens per second.
If you've ever tried running a large AI model locally and hit the dreaded "out of memory" wall, there's now a way around it — without buying a new graphics card.
GreenBoost is a new open-source Linux tool that transparently extends your GPU's memory (VRAM) by borrowing from your computer's regular RAM and even your SSD storage. The result: you can run AI models that are 2–5x larger than your GPU would normally allow.
The key numbers
On an RTX 5070 with just 12 GB of VRAM, GreenBoost enabled running a 32 GB model at 25–60 tokens per second — fast enough for real-time conversation. Without it, the model simply wouldn't load at all.
How it works — the three-tier memory trick
Think of your computer's storage like a filing system with three drawers, each progressively larger but slower:
Tier 1 — GPU VRAM (12 GB): The fast drawer right on your desk. Handles the most active calculations at ~336 GB/s.
Tier 2 — System RAM (up to 51 GB): The filing cabinet across the room. Stores overflow data at ~32 GB/s via PCIe 4.0.
Tier 3 — NVMe SSD (up to 64 GB): The storage room down the hall. Emergency overflow at ~1.8 GB/s.
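A back-of-envelope estimate shows why spilling to RAM is slow but workable. Assuming a bandwidth-bound dense model whose full weights are streamed once per generated token (the streaming assumption is ours; the capacity and bandwidth figures are the article's), the per-token time is just each tier's share of the weights divided by that tier's bandwidth:

```python
# Back-of-envelope: time per token for a 32 GB dense model split across tiers,
# assuming every weight byte is read once per generated token (bandwidth-bound).
# Bandwidth figures are the article's; the streaming model itself is an assumption.

def tokens_per_second(split_gb: dict[str, float], bandwidth_gbps: dict[str, float]) -> float:
    """Estimate tok/s when each tier streams its share of the weights per token."""
    seconds_per_token = sum(split_gb[tier] / bandwidth_gbps[tier] for tier in split_gb)
    return 1.0 / seconds_per_token

bandwidth = {"vram": 336.0, "ram": 32.0, "nvme": 1.8}  # GB/s, from the article

# 32 GB model: 12 GB stays in VRAM, the remaining 20 GB spills to RAM.
print(tokens_per_second({"vram": 12.0, "ram": 20.0}, bandwidth))   # ~1.5 tok/s

# The same spill landing on NVMe instead would be far slower.
print(tokens_per_second({"vram": 12.0, "nvme": 20.0}, bandwidth))  # ~0.09 tok/s
```

The ~1.5 tok/s estimate lands in the same ballpark as the 2–5 tok/s measured for the basic Ollama + GreenBoost setup below, consistent with RAM bandwidth becoming the bottleneck once the model spills.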
GreenBoost installs two components: a Linux kernel module that manages the RAM pool, and a CUDA shim (a thin wrapper layer) that intercepts memory requests from AI software. When a model asks for more memory than the GPU has, GreenBoost transparently routes the overflow to system RAM — no code changes needed in your AI tools.
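The interception idea can be sketched in a few lines (this is a toy illustration, not GreenBoost's actual code; GreenBoost does this at the CUDA level in C, and the class and tier names here are invented):

```python
# Toy model of a memory shim: the caller asks one allocator for memory and
# never learns which tier actually backed the request. Illustrative only;
# GreenBoost intercepts CUDA allocation calls, not Python ones.

class TieredAllocator:
    def __init__(self, capacities_gb: dict[str, float]):
        self.free = dict(capacities_gb)   # remaining capacity per tier
        self.tiers = list(capacities_gb)  # fastest tier first

    def alloc(self, size_gb: float) -> str:
        """Return the tier that backs this allocation, spilling as needed."""
        for tier in self.tiers:
            if self.free[tier] >= size_gb:
                self.free[tier] -= size_gb
                return tier
        raise MemoryError("all tiers exhausted")

shim = TieredAllocator({"vram": 12.0, "ram": 51.0, "nvme": 64.0})
print(shim.alloc(10.0))  # vram
print(shim.alloc(10.0))  # ram  (only 2 GB of VRAM left)
print(shim.alloc(50.0))  # nvme (RAM can no longer fit 50 GB)
```

The application only ever sees successful allocations, which is why tools like Ollama need no changes: from their point of view, the GPU simply has more memory.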
The clever part: it works with existing AI software like Ollama without any modifications. You just install GreenBoost once, and your GPU suddenly "appears" to have much more memory than it physically does.
Real-world performance — from unusable to usable
The developer tested the GLM-4.7-Flash model (a large open-weight language model from China) on an RTX 5070:
| Setup | Speed |
| --- | --- |
| Ollama + GreenBoost (basic) | 2–5 tok/s |
| + memory compression | 4–8 tok/s |
| ExLlamaV3 + GreenBoost cache | 8–20 tok/s |
| FP8 quantization (16 GB footprint) | 10–25 tok/s |
| EXL3 2-bit quantization (8 GB) | 25–60 tok/s |
The sweet spot: combining GreenBoost with model quantization (a technique that shrinks models by reducing numerical precision) delivers genuinely usable speeds. The EXL3 2-bit configuration hits 25–60 tokens per second — that's fast enough that responses feel nearly instant.
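The arithmetic behind those footprints is simple: weight size is roughly parameter count × bits per weight ÷ 8. The 16-billion-parameter figure below is inferred from the article's 32 GB baseline at 16-bit precision, not a number the article states:

```python
# Weight footprint ≈ parameters × bits_per_weight / 8 bytes.
# The 16B-parameter count is inferred from the article's 32 GB @ 16-bit
# baseline; real footprints run higher (mixed-precision layers, KV cache).

def weight_gb(params_billion: float, bits: float) -> float:
    return params_billion * bits / 8  # billions of params × bits/8 = GB

for bits, label in [(16, "FP16 baseline"), (8, "FP8"), (2, "EXL3 2-bit")]:
    print(f"{label:>13}: {weight_gb(16, bits):.0f} GB")
```

The FP8 estimate matches the article's 16 GB figure. The 2-bit estimate (4 GB) comes in below the reported 8 GB, which is consistent with quantized models keeping embeddings and other sensitive layers at higher precision and reserving room for the KV cache.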
Bundled optimization tools
GreenBoost doesn't just extend memory — it bundles several optimization tools that work together:
ExLlamaV3 — Optimized inference engine with native GreenBoost integration
kvpress — Compresses the model's working memory (KV cache) in real time
NVIDIA ModelOpt — Shrinks models using FP8 and INT4 quantization
Unsloth + LoRA — Fine-tune 30B parameter models on consumer hardware
Who is this for?
If you run AI models locally — whether for privacy, cost savings, or experimentation — GreenBoost removes the biggest bottleneck: GPU memory. A $500 GPU can now handle models that previously required $2,000+ hardware.
If you're curious about local AI but thought your hardware wasn't enough — this changes the equation. A gaming PC with 64 GB of RAM and a mid-range NVIDIA GPU becomes a capable AI workstation.
Caveats to know: It's Linux-only (Ubuntu 26.04+ tested), requires NVIDIA GPUs (Blackwell, Ada Lovelace, Ampere), and is still early software with 34 commits since its February 2026 creation. The NVMe tier is noticeably slower for random access patterns.
Try it yourself
git clone https://gitlab.com/IsolatedOctopi/nvidia_greenboost.git
cd nvidia_greenboost
sudo ./greenboost_setup.sh full-install
sudo ./greenboost_setup.sh diagnose
The installer auto-detects your GPU, RAM, CPU, and NVMe to calculate optimal settings. Once installed, just run your AI tools normally — GreenBoost handles the rest transparently.
Why this matters now
AI models keep getting bigger, but consumer GPU memory hasn't kept pace. The RTX 5070 shipped with just 12 GB of VRAM — the same amount as cards from years ago. Meanwhile, state-of-the-art open models routinely need 16–70+ GB.
GreenBoost bridges that gap using hardware you already own. It's not magic — there's a speed trade-off when data spills to RAM — but combined with quantization tools, the results are surprisingly practical. The project uses documented CUDA APIs, not hacks, and runs alongside official NVIDIA drivers without replacing them.
Created by developer Ferran Duarri under the GPL v2 license, with Red Hat/Rocky Linux support contributed by Alan Sill. The project appeared on Hacker News today and is actively developed on GitLab.