Understanding LLM Memory: VRAM, KV Cache & GPU Monitoring
The Three Components of VRAM Usage
When running a local LLM, VRAM is used for three things:
1. Model Weights (Fixed Size)
- The actual neural network parameters
- Size depends on model + quantization
- Example: Qwen 7B Q4 = ~4.5GB
- Does NOT change during conversation
2. KV Cache (Grows During Conversation)
- Stores the attention keys/values for the conversation context
- Grows with every token in the context (prompt and generated)
- Longer conversation = bigger cache
- This is why long chats use more VRAM (a rough size estimate is sketched after this list)
3. CUDA Overhead (~0.5GB)
- GPU driver requirements
- Compute buffers
- Relatively constant
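How fast component 2 grows can be estimated from the model's architecture. Below is a minimal sketch; the layer count, KV-head count, and head dimension are assumed placeholder values (not the exact Qwen 7B configuration), and runtimes such as llama.cpp/Ollama can quantize the cache to shrink it further:

```python
# Rough KV cache size estimate (illustrative sketch).
# The architecture numbers below are assumed placeholders,
# NOT the exact Qwen 7B configuration.

def kv_cache_bytes(n_tokens: int,
                   n_layers: int = 32,       # transformer layers (assumed)
                   n_kv_heads: int = 8,      # KV heads (fewer than query heads with GQA)
                   head_dim: int = 128,      # dimension per attention head (assumed)
                   bytes_per_value: int = 2  # fp16 cache; less if the cache is quantized
                   ) -> int:
    # Every cached token stores one K and one V vector per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token

for tokens in (512, 4096, 16384):
    gb = kv_cache_bytes(tokens) / 1024**3
    print(f"{tokens:>6} tokens -> ~{gb:.2f} GB KV cache")
```

With these assumed dimensions the output lines up roughly with the table below: a few hundred tokens of context cost well under 0.1 GB, while a very long chat reaches a couple of GB.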
Example VRAM Breakdown: Qwen 7B Q4
| Component | Idle | Short Chat | Long Chat |
|---|---|---|---|
| Model Weights | 4.5 GB | 4.5 GB | 4.5 GB |
| KV Cache | 0.1 GB | 0.5 GB | 2.5 GB |
| CUDA Overhead | 0.5 GB | 0.5 GB | 0.5 GB |
| Total | 5.1 GB | 5.5 GB | 7.5 GB |
Why This Matters
- A 4.5GB model can peak at 7-8GB during use
- Always leave headroom for KV cache growth
- Very long conversations can exhaust VRAM and crash the runtime
- Context window size affects maximum KV cache
Multi-GPU Behavior
With Ollama:
- Model placement across GPUs is decided at load time
- Splits across GPUs if needed
- Does NOT dynamically rebalance after loading
- If VRAM fills up → it crashes or spills over to much slower system RAM
Best Practice:
- Choose a model that fits with ~2GB of headroom (see the fit check sketched after this list)
- For 8GB GPU: use 7B Q4 models max
- For 14GB (dual GPU): use 13B Q4 models max
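A minimal sketch of that rule of thumb. The model and overhead figures come from the breakdown above; the 2 GB headroom is the guideline from this list, and the GPU sizes are example values:

```python
# Headroom check: will a model fit with room left for KV cache growth?
# All sizes in GB; the model and overhead figures are example values.

MODEL_GB = 4.5         # e.g. a 7B model at Q4 quantization
CUDA_OVERHEAD_GB = 0.5
HEADROOM_GB = 2.0      # reserved for KV cache growth in long chats

def fits(total_vram_gb: float) -> bool:
    needed = MODEL_GB + CUDA_OVERHEAD_GB + HEADROOM_GB
    return total_vram_gb >= needed

for vram in (8, 12, 16):
    verdict = "fits" if fits(vram) else "too tight"
    print(f"{vram} GB GPU: {verdict} "
          f"(needs ~{MODEL_GB + CUDA_OVERHEAD_GB + HEADROOM_GB:.1f} GB)")
```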
GPU Monitoring Tools
nvidia-smi (built-in)

```
watch -n 0.5 nvidia-smi
```

Basic VRAM and utilization stats.
nvtop (recommended - pretty graphs)

```
sudo apt install nvtop
nvtop
```

Shows real-time graphs of GPU activity, memory, and temperature.
gpustat (compact)

```
pip install gpustat
gpustat -i 0.5
```

One-liner status updates.
What the Graphs Show
When you ask a question:
1. GPU utilization spikes (processing)
2. VRAM increases (KV cache growing)
3. After the response: GPU utilization drops, VRAM stays higher
4. New question: spikes again, VRAM grows more
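To log that pattern yourself during a chat, here is a minimal polling sketch built on nvidia-smi's query flags (the 0.5-second interval and the plain-text output format are arbitrary choices):

```python
# Poll VRAM usage so you can watch the KV cache grow between questions.
# Requires nvidia-smi on PATH; works with one or more GPUs.
import subprocess
import time

def vram_used_mib() -> list[int]:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.strip().splitlines()]

while True:
    print(time.strftime("%H:%M:%S"), vram_used_mib(), "MiB used per GPU")
    time.sleep(0.5)
```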
Key Takeaways
- Model size ≠ VRAM needed - Always add 1-3GB for KV cache
- Longer conversations use more VRAM
- nvtop is your friend - Install it to see what's happening
- Headroom matters - Don't max out your VRAM with the model alone
Last updated: 2025-12-08