LARS Unsloth Training Investigation
Problem Summary
The 30B abliterated model (huihui-ai/Huihui-Qwen3-Coder-30B-A3B-Instruct-abliterated) crashes with a CUDA out-of-memory error during training setup.
Root Cause
Single-GPU loading: Unsloth uses only GPU 0 (23.5 GB), even though 2x RTX 3090s are available (~47 GiB total). The 30B model with 4-bit quantization needs ~24 GB for loading plus optimizer states, which exceeds single-GPU capacity (rough estimate below).
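For intuition, a back-of-envelope estimate; the per-term overhead figures below are assumptions for illustration, not measurements:

```python
# Rough single-GPU VRAM estimate for a 30B model loaded in 4-bit.
# All overhead figures are guesses, not measured values.
params = 30e9
weights_gib = params * 0.5 / 2**30    # ~4 bits per weight -> ~14 GiB
quant_meta_gib = weights_gib * 0.10   # scales/zero-points, fp16 norms, embeddings
dequant_gib = 2.0                     # transient fp16 buffers during on-the-fly quantization
activations_gib = 4.0                 # batch- and seq-length-dependent
total_gib = weights_gib + quant_meta_gib + dequant_gib + activations_gib
print(f"~{total_gib:.1f} GiB")        # already close to a 3090's 23.5 GiB ceiling
```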
Error Details
```
CUDA out of memory. Tried to allocate 2.00 MiB.
GPU 0 has a total capacity of 23.56 GiB of which 1024.00 KiB is free.
10.35 GiB allocated by PyTorch, 12.89 GiB reserved but unallocated.
```
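The large "reserved but unallocated" figure points at allocator fragmentation rather than pure exhaustion. PyTorch's stock `torch.cuda` stats can confirm where the memory went:

```python
import torch

dev = torch.device("cuda:0")
# Tensors currently in use vs. memory the caching allocator is holding on to.
print(f"allocated: {torch.cuda.memory_allocated(dev) / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(dev) / 2**30:.2f} GiB")
# Per-pool breakdown, useful for spotting fragmentation.
print(torch.cuda.memory_summary(dev, abbreviated=True))
```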
Investigation Points for Tomorrow
Enable Multi-GPU in Unsloth
- Check if `device_map='auto'` spreads the model across both GPUs (sketch below)
- May need `accelerate` library configuration
- Look into FSDP (Fully Sharded Data Parallel)
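A minimal sharding sanity check, assuming plain `transformers` + `bitsandbytes`; this deliberately bypasses Unsloth (which may pin everything to one GPU), and the `max_memory` caps are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huihui-ai/Huihui-Qwen3-Coder-30B-A3B-Instruct-abliterated",
    quantization_config=bnb,
    device_map="auto",                    # accelerate shards layers across visible GPUs
    max_memory={0: "22GiB", 1: "22GiB"},  # leave headroom on each 3090
)
print(model.hf_device_map)                # confirm layers landed on both GPUs
```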
Try Smaller Model First
- Use a 7B or 14B abliterated model for testing (smoke-test sketch below)
- Verify the training pipeline works before scaling up
- Models to try: `huihui-ai/Qwen2.5-Coder-7B-Instruct-abliterated`
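A minimal 7B smoke test, assuming the current Unsloth API (`FastLanguageModel.from_pretrained` / `get_peft_model`); the LoRA rank and target modules are placeholder choices:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="huihui-ai/Qwen2.5-Coder-7B-Instruct-abliterated",
    max_seq_length=1024,   # halved from 2048, per the memory notes
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # placeholder LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth's offloaded checkpointing
)
```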
Memory Optimization Options
- Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (see below)
- Use `gradient_checkpointing` more aggressively
- Reduce `max_seq_length` from 2048 to 1024
- Use `load_in_8bit=True` instead of 4-bit (less memory during quantization)
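The allocator flag only takes effect if it is set before CUDA initializes, so in a script it has to precede the first `torch` import:

```python
import os

# expandable_segments lets the allocator grow blocks instead of fragmenting;
# it targets exactly the "reserved but unallocated" gap seen in the OOM above.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402 -- intentionally after the env var
```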
Alternative: CPU Offloading
- DeepSpeed ZeRO Stage 3 can offload to CPU/NVMe
- Slower, but allows training larger models (config sketch below)
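A minimal ZeRO-3 CPU-offload config sketch using standard DeepSpeed keys; it is normally passed to the HF `Trainer` via `TrainingArguments(deepspeed=...)` or saved as JSON, and whether it composes cleanly with Unsloth is untested here:

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}
```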
Infrastructure Notes
- Local-AI server: 100.89.34.86 (user: lars, pass: LARS25)
- 2x RTX 3090 (24 GB each; ~47 GiB usable in total)
- HuggingFace cache moved to /data/models/huggingface (229GB NVMe)
- Root disk was 100% full, now 50% after cache move
- Ollama models work fine (inference-only, pre-quantized)
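To keep future downloads on the NVMe cache, the environment variable can be set before any HF import. Assumption: the moved directory is laid out as an `HF_HOME`; if only the hub cache itself was moved, `HF_HUB_CACHE` is the right variable instead:

```python
import os

# Must be set before importing transformers/huggingface_hub to take effect.
os.environ["HF_HOME"] = "/data/models/huggingface"
```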
Models Available
- huihui_ai/qwen3-coder-abliterated:latest (Ollama, 18GB)
- qwen3:30b-a3b (Ollama, 18GB)
- qwen2.5-coder:32b (Ollama, 19GB)
- HuggingFace: 48GB safetensors (30B full weights)
Key Insight
Ollama loads fast (~5 s) because its models are pre-quantized single files. HuggingFace/Unsloth loads slowly because it quantizes 48 GB of safetensors on the fly (see the caching sketch below).
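One way to pay that quantization cost only once: quantize with plain `transformers`, then save the 4-bit checkpoint and reload it directly on later runs. Assumptions: a recent `transformers`/`bitsandbytes` pair that supports 4-bit serialization, and the output path is hypothetical:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "huihui-ai/Huihui-Qwen3-Coder-30B-A3B-Instruct-abliterated",
    quantization_config=bnb,
    device_map="auto",
)
# Hypothetical output path; later runs load this directory and skip quantization.
model.save_pretrained("/data/models/quantized/qwen3-coder-30b-4bit")
```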
Test Script Location
/tmp/lars_unsloth_test.py (also copied to local-ai:/tmp/)
Next Session Actions
- Try 7B model first to verify training works
- Configure multi-GPU support with accelerate
- Test with memory optimizations
- If 30B still fails, consider 14B abliterated model