Understanding LLM Memory: VRAM, KV Cache & GPU Monitoring

The Three Components of VRAM Usage

When running a local LLM, VRAM is used for three things:

1. Model Weights (Fixed Size)

  • The actual neural network parameters
  • Size depends on model + quantization
  • Example: Qwen 7B Q4 = ~4.5GB
  • Does NOT change during conversation

2. KV Cache (Grows During Conversation)

  • Stores the conversation context
  • Grows with each token generated
  • Longer conversation = bigger cache
  • This is why long chats use more VRAM (a rough estimate is sketched below)

3. CUDA Overhead (~0.5GB)

  • GPU driver requirements
  • Compute buffers
  • Relatively constant
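
These three components can be estimated with simple arithmetic. The sketch below uses illustrative placeholder values for the layer count, KV-head count, and head dimension rather than Qwen's exact architecture; substitute the numbers from your model's config to get a real figure.

# Rough VRAM estimate (placeholder architecture values, fp16 KV cache)
WEIGHTS_GB=4.5                        # size of the quantized model file
LAYERS=32; KV_HEADS=32; HEAD_DIM=128  # check your model's config for real values
BYTES_PER_VALUE=2                     # fp16
CTX_TOKENS=4096                       # tokens currently in the conversation
KV_BYTES=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * CTX_TOKENS ))  # K and V
echo "KV cache: $(( KV_BYTES / 1024 / 1024 )) MiB at $CTX_TOKENS tokens"
echo "Total = $WEIGHTS_GB GB weights + KV cache + ~0.5 GB CUDA overhead"

With these placeholder values the KV cache works out to about 0.5 MB per token, so a 4,096-token conversation adds roughly 2 GB on top of the weights.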

Example VRAM Breakdown: Qwen 7B Q4

Component        Idle     Short Chat   Long Chat
Model Weights    4.5 GB   4.5 GB       4.5 GB
KV Cache         0.1 GB   0.5 GB       2.5 GB
CUDA Overhead    0.5 GB   0.5 GB       0.5 GB
Total            5.1 GB   5.5 GB       7.5 GB

Why This Matters

  • A 4.5GB model can peak at 7-8GB of VRAM during use
  • Always leave headroom for KV cache growth
  • Very long conversations can exhaust VRAM and crash the backend
  • Context window size caps the maximum KV cache (see the num_ctx example below)
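
With Ollama, the knob that caps the KV cache is the num_ctx parameter (the context window). The model names below are only examples; use whatever model you actually run.

# Limit the context window for one interactive session
ollama run qwen:7b
>>> /set parameter num_ctx 4096

# Or bake it into a custom model via a Modelfile
# FROM qwen:7b
# PARAMETER num_ctx 4096
# then: ollama create qwen7b-4k -f Modelfile

A smaller num_ctx means the KV cache stops growing sooner, at the cost of the model forgetting earlier parts of the conversation.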

Multi-GPU Behavior

With Ollama:

  • Layer placement is decided at load time
  • Splits layers across GPUs if the model does not fit on one
  • Does NOT dynamically rebalance once loaded
  • If VRAM fills up mid-conversation, generation can crash or spill into much slower system RAM (verify the actual placement as shown below)
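
To confirm where a model actually ended up, Ollama's own status command shows the CPU/GPU split, and nvidia-smi can list memory per card:

# How Ollama placed the currently loaded model (e.g. "100% GPU" or a CPU/GPU split)
ollama ps

# Per-GPU memory usage
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv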

Best Practice:

  • Choose a model that fits with ~2GB of headroom (see the check below)
  • For an 8GB GPU: 7B Q4 models at most
  • For 14GB (dual GPU): 13B Q4 models at most
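
A quick pre-flight check of free VRAM; the 2 GB figure is the headroom rule of thumb from above.

# Free VRAM per GPU, in MiB
nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits

# Rule of thumb: (model file size) + ~2 GB must fit in the free figure,
# e.g. a 4.5 GB Q4 model wants roughly 6.5 GB free before loading.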

GPU Monitoring Tools

nvidia-smi (built-in)

watch -n 0.5 nvidia-smi

Basic VRAM and utilization stats
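
For a narrower view than the full nvidia-smi table, query mode prints only the fields you care about and -l 1 refreshes every second:

nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1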

nvtop (recommended)

sudo apt install nvtop
nvtop

Shows real-time graphs of GPU activity, memory use, and temperature

gpustat (compact)

pip install gpustat
gpustat -i 0.5

One-liner status updates

What the Graphs Show

When you ask a question:

  1. GPU utilization spikes (the prompt is processed and tokens are generated)
  2. VRAM increases (the KV cache is growing)
  3. After the response: GPU utilization drops, but VRAM stays higher
  4. New question: utilization spikes again and VRAM grows further
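
To capture that pattern over a whole conversation, log the same readings to a file while you chat (the file name is just an example):

# Append utilization and memory once per second while the chat runs
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1 >> gpu_log.csv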

Key Takeaways

  1. Model size ≠ VRAM needed - Always add 1-3GB for KV cache
  2. Longer conversations use more VRAM
  3. nvtop is your friend - Install it to see what's happening
  4. Headroom matters - Don't max out your VRAM with the model alone
