Understanding LLM Memory: VRAM, KV Cache & GPU Monitoring
The Three Components of VRAM Usage
When running a local LLM, VRAM is used for three things:
1. Model Weights (Fixed Size)
- The actual neural network parameters
- Size depends on model + quantization
- Example: Qwen 7B Q4 = ~4.5GB
- Does NOT change during conversation
2. KV Cache (Grows During Conversation)
- Stores the attention keys/values for the conversation context
- Grows with every token in the context (prompt and generated)
- Longer conversation = bigger cache
- This is why long chats use more VRAM (a rough size estimate is sketched after this list)
3. CUDA Overhead (~0.5GB)
- GPU driver requirements
- Compute buffers
- Relatively constant
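How fast component 2 grows can be estimated from the model's architecture. Below is a minimal sketch; the layer count, KV-head count, and head dimension are assumed placeholder values (not the exact Qwen 7B configuration), and runtimes such as llama.cpp/Ollama can quantize the cache to shrink it further:

```python
# Rough KV cache size estimate (illustrative sketch).
# The architecture numbers below are assumed placeholders,
# NOT the exact Qwen 7B configuration.

def kv_cache_bytes(n_tokens: int,
                   n_layers: int = 32,       # transformer layers (assumed)
                   n_kv_heads: int = 8,      # KV heads (fewer than query heads with GQA)
                   head_dim: int = 128,      # dimension per attention head (assumed)
                   bytes_per_value: int = 2  # fp16 cache; less if the cache is quantized
                   ) -> int:
    # Every cached token stores one K and one V vector per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token

for tokens in (512, 4096, 16384):
    gb = kv_cache_bytes(tokens) / 1024**3
    print(f"{tokens:>6} tokens -> ~{gb:.2f} GB KV cache")
```

With these assumed dimensions the output lines up roughly with the table below: a few hundred tokens of context cost well under 0.1 GB, while a very long chat reaches a couple of GB.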
Example VRAM Breakdown: Qwen 7B Q4
| Component | Idle | Short Chat | Long Chat |
|---|---|---|---|
| Model Weights | 4.5 GB | 4.5 GB | 4.5 GB |
| KV Cache | 0.1 GB | 0.5 GB | 2.5 GB |
| CUDA Overhead | 0.5 GB | 0.5 GB | 0.5 GB |
| Total | 5.1 GB | 5.5 GB | 7.5 GB |
Why This Matters
- A 4.5GB model can peak at 7-8GB during use
- Always leave headroom for KV cache growth
- Very long conversations can exhaust VRAM and crash the runtime
- Context window size affects maximum KV cache
Multi-GPU Behavior
With Ollama:
- Model placement across GPUs is decided at load time
- Splits across GPUs if needed
- Does NOT dynamically rebalance after loading
- If VRAM fills up → it crashes or spills over to much slower system RAM
Best Practice:
- Choose a model that fits with ~2GB of headroom (see the fit check sketched after this list)
- For 8GB GPU: use 7B Q4 models max
- For 14GB (dual GPU): use 13B Q4 models max
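A minimal sketch of that rule of thumb. The model and overhead figures come from the breakdown above; the 2 GB headroom is the guideline from this list, and the GPU sizes are example values:

```python
# Headroom check: will a model fit with room left for KV cache growth?
# All sizes in GB; the model and overhead figures are example values.

MODEL_GB = 4.5         # e.g. a 7B model at Q4 quantization
CUDA_OVERHEAD_GB = 0.5
HEADROOM_GB = 2.0      # reserved for KV cache growth in long chats

def fits(total_vram_gb: float) -> bool:
    needed = MODEL_GB + CUDA_OVERHEAD_GB + HEADROOM_GB
    return total_vram_gb >= needed

for vram in (8, 12, 16):
    verdict = "fits" if fits(vram) else "too tight"
    print(f"{vram} GB GPU: {verdict} "
          f"(needs ~{MODEL_GB + CUDA_OVERHEAD_GB + HEADROOM_GB:.1f} GB)")
```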
GPU Monitoring Tools
nvidia-smi (built-in)

```
watch -n 0.5 nvidia-smi
```

Basic VRAM and utilization stats.
nvtop (recommended - pretty graphs)

```
sudo apt install nvtop
nvtop
```

Shows real-time graphs of GPU activity, memory, and temperature.
gpustat (compact)

```
pip install gpustat
gpustat -i 0.5
```

One-liner status updates.
What the Graphs Show
When you ask a question:
1. GPU utilization spikes (processing)
2. VRAM increases (KV cache growing)
3. After the response: GPU utilization drops, VRAM stays higher
4. New question: spikes again, VRAM grows more
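To log that pattern yourself during a chat, here is a minimal polling sketch built on nvidia-smi's query flags (the 0.5-second interval and the plain-text output format are arbitrary choices):

```python
# Poll VRAM usage so you can watch the KV cache grow between questions.
# Requires nvidia-smi on PATH; works with one or more GPUs.
import subprocess
import time

def vram_used_mib() -> list[int]:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.strip().splitlines()]

while True:
    print(time.strftime("%H:%M:%S"), vram_used_mib(), "MiB used per GPU")
    time.sleep(0.5)
```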
Key Takeaways
- Model size ≠ VRAM needed - Always add 1-3GB for KV cache
- Longer conversations use more VRAM
- nvtop is your friend - Install it to see what's happening
- Headroom matters - Don't max out your VRAM with the model alone
Last updated: 2025-12-08