VRAM Extender
Plan how to run larger AI models by extending GPU memory with system RAM and NVMe storage. Enter your hardware specs to see which models fit and how much offloading you'll need.
⚙️Hardware Configuration
~26GB usable for offloading
Slower but extends capacity
GPU VRAM (12GB) + RAM offload (26GB) + NVMe swap (50GB)
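The capacity formula above is just the sum of the three tiers, checked in order from fastest to slowest. A minimal sketch of that logic, using the example figures from this page (12 GB VRAM, 26 GB usable RAM, 50 GB NVMe swap) as assumed inputs:

```python
# Example tier sizes taken from the configuration shown above (assumptions).
GPU_VRAM_GB = 12
RAM_OFFLOAD_GB = 26
NVME_SWAP_GB = 50

def total_capacity_gb(vram: float, ram: float, nvme: float) -> float:
    """Total memory available for model weights across all three tiers."""
    return vram + ram + nvme

def placement(model_size_gb: float, vram: float, ram: float, nvme: float) -> str:
    """Report which tiers a model of the given size would spill into."""
    if model_size_gb <= vram:
        return "GPU only"
    if model_size_gb <= vram + ram:
        return "GPU + RAM offload"
    if model_size_gb <= vram + ram + nvme:
        return "GPU + RAM + NVMe swap"
    return "does not fit"

print(total_capacity_gb(GPU_VRAM_GB, RAM_OFFLOAD_GB, NVME_SWAP_GB))  # 88
print(placement(40, GPU_VRAM_GB, RAM_OFFLOAD_GB, NVME_SWAP_GB))      # GPU + RAM + NVMe swap
```

With these numbers, a 40 GB model exceeds the 38 GB of combined VRAM + RAM, so it would need NVMe swap to run at all.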
🤖Model Compatibility
💡How VRAM Extension Works
1. GPU VRAM
Fastest memory. Model weights here give maximum inference speed. Typical speeds: 50-100+ tokens/sec.
2. System RAM
Offload overflow to RAM via llama.cpp, Ollama, or LM Studio. Moderate slowdown (~5-10x slower than GPU).
3. NVMe Swap
Last resort for massive models. Very slow but enables running models that don't fit in RAM. Use with GreenBoost or similar.
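The three tiers above behave like a greedy fill: layers go to VRAM first, overflow goes to RAM, and whatever remains falls through to NVMe. A rough sketch of that split, assuming (hypothetically) a uniform per-layer size; real models have unevenly sized layers, embeddings, and KV cache, so treat this as an estimate only:

```python
def split_layers(n_layers: int, layer_size_gb: float,
                 vram_gb: float, ram_gb: float) -> tuple[int, int, int]:
    """Greedily assign layers to GPU VRAM first, then system RAM;
    any remainder spills to NVMe swap. Assumes uniform layer size."""
    gpu = min(n_layers, int(vram_gb // layer_size_gb))
    ram = min(n_layers - gpu, int(ram_gb // layer_size_gb))
    nvme = n_layers - gpu - ram
    return gpu, ram, nvme

# Hypothetical example: an 80-layer model at ~0.5 GB per layer,
# with 12 GB VRAM and 26 GB usable RAM.
print(split_layers(80, 0.5, 12, 26))  # (24, 52, 4)
```

In llama.cpp terms, the first number corresponds to what you would pass as the GPU layer count; keeping that as high as possible is what preserves most of the inference speed.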
Tip: Use quantized models (Q4_K_M) to cut memory needs by roughly 75% versus FP16 with minimal quality loss. For best performance, keep at least the "hot" layers (attention) in GPU VRAM.
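The quantization saving follows directly from bits per weight: FP16 stores 16 bits per parameter, while Q4_K_M averages roughly 4.5 bits (an approximation; the exact figure varies by tensor mix). A quick estimate of weight memory:

```python
def model_size_gb(n_params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: params * bits / 8.
    Ignores KV cache and activation memory, which add on top."""
    return n_params_billions * bits_per_weight / 8

# Hypothetical 70B-parameter model:
fp16_gb = model_size_gb(70, 16)    # 140.0 GB at FP16
q4_gb = model_size_gb(70, 4.5)     # ~39.4 GB at ~4.5 bits/weight
print(fp16_gb, q4_gb)
```

At ~4.5 bits per weight the reduction is about 72%; pure 4-bit storage would give exactly 75%, which is where the page's "~75%" figure comes from.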
🛠️Recommended Tools
🦙 Ollama
Easiest way to run local LLMs. Automatic GPU/RAM offloading, simple CLI.
🖥️ LM Studio
GUI for running local models. Visual GPU/RAM split control.
⚡ llama.cpp
Core engine for CPU+GPU inference. Fine-grained control over offloading.
🔧 GPU VRAM Extender
Experimental driver-level VRAM extension using system memory.
Memory calculations are estimates. Actual usage varies by model architecture, context length, and batch size.
Based on Q4_K_M quantization. Other quantizations will have different memory requirements.