LARS Training Investigation - December 28

LARS Unsloth Training Investigation

Problem Summary

The 30B abliterated model (huihui-ai/Huihui-Qwen3-Coder-30B-A3B-Instruct-abliterated) crashes with a CUDA out-of-memory error during training setup.

Root Cause

Single-GPU Loading: Unsloth loads everything onto GPU 0 (23.5 GiB) even though 2x RTX 3090s are available (~47 GiB total). The 30B model in 4-bit quantization needs roughly 24 GB at load time (about 15 GB for the 4-bit weights — 30B params × 0.5 bytes — plus optimizer states and activations), which exceeds a single GPU's capacity.

Error Details

CUDA out of memory. Tried to allocate 2.00 MiB.
GPU 0 has a total capacity of 23.56 GiB of which 1024.00 KiB is free.
10.35 GiB allocated by PyTorch, 12.89 GiB reserved but unallocated.

Investigation Points for Tomorrow

  1. Enable Multi-GPU in Unsloth
     • Check if device_map='auto' spreads across GPUs (see the sketch after this item)
     • May need accelerate library configuration
     • Look into FSDP (Fully Sharded Data Parallel)
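
  A minimal sketch of that check, using plain transformers rather than Unsloth's loader — so it only shows whether the 4-bit weights can be sharded at all; Unsloth's own loader may still pin everything to GPU 0:

      # Check whether the 4-bit 30B model can be split across both 3090s.
      import torch
      from transformers import AutoModelForCausalLM, BitsAndBytesConfig

      bnb = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_compute_dtype=torch.bfloat16,
      )
      model = AutoModelForCausalLM.from_pretrained(
          "huihui-ai/Huihui-Qwen3-Coder-30B-A3B-Instruct-abliterated",
          quantization_config=bnb,
          device_map="auto",      # let accelerate place layers on GPU 0 and GPU 1
      )
      print(model.hf_device_map)  # inspect which layers landed on which device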

  2. Try Smaller Model First (sketch below)
     • Use a 7B or 14B abliterated model for testing
     • Verify the training pipeline works before scaling up
     • Model to try: huihui-ai/Qwen2.5-Coder-7B-Instruct-abliterated
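
  A sketch of the 7B smoke test, assuming the standard Unsloth API (the LoRA values are starting points, not tuned):

      # Load the 7B model in 4-bit and attach a LoRA adapter for a quick
      # end-to-end training run before retrying the 30B model.
      from unsloth import FastLanguageModel

      model, tokenizer = FastLanguageModel.from_pretrained(
          model_name="huihui-ai/Qwen2.5-Coder-7B-Instruct-abliterated",
          max_seq_length=2048,
          load_in_4bit=True,
      )
      model = FastLanguageModel.get_peft_model(
          model,
          r=16,
          lora_alpha=16,
          target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj"],
          use_gradient_checkpointing="unsloth",  # recompute activations to save VRAM
      )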

  3. Memory Optimization Options (combined sketch below)
     • Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
     • Use gradient_checkpointing more aggressively
     • Reduce max_seq_length from 2048 to 1024
     • Use load_in_8bit=True instead of 4-bit (may lower peak memory during on-the-fly quantization)
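
  The knobs above combined into one sketch (output_dir is a hypothetical path; batch sizes are illustrative):

      import os
      # Must be set before CUDA initializes (or exported in the shell beforehand).
      os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

      from transformers import TrainingArguments

      args = TrainingArguments(
          output_dir="/data/models/checkpoints",  # hypothetical
          per_device_train_batch_size=1,
          gradient_accumulation_steps=8,   # keep the effective batch size reasonable
          gradient_checkpointing=True,     # trade compute for activation memory
          bf16=True,
      )
      # Pair with max_seq_length=1024 in the model load to roughly halve
      # activation memory versus 2048.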

  4. Alternative: CPU Offloading (sketch below)
     • DeepSpeed ZeRO Stage 3 can offload optimizer states and parameters to CPU/NVMe
     • Slower, but allows larger models
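
  A minimal ZeRO Stage 3 offload config, written as a Python dict so it can be passed straight to TrainingArguments(deepspeed=ds_config); values are illustrative, and NVMe offload would additionally need an nvme_path:

      # Shard optimizer states and parameters (stage 3) and push both to CPU RAM.
      ds_config = {
          "zero_optimization": {
              "stage": 3,
              "offload_optimizer": {"device": "cpu", "pin_memory": True},
              "offload_param": {"device": "cpu", "pin_memory": True},
          },
          "train_micro_batch_size_per_gpu": 1,
          "gradient_accumulation_steps": 8,
          "bf16": {"enabled": True},
      }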

Infrastructure Notes

  • Local-AI server: 100.89.34.86 (user: lars, pass: LARS25)
  • 2x RTX 3090 (24 GB each; ~47 GiB usable total)
  • HuggingFace cache moved to /data/models/huggingface (229GB NVMe)
  • Root disk was 100% full, now 50% after cache move
  • Ollama models work fine (inference-only, pre-quantized)

Models Available

  • huihui_ai/qwen3-coder-abliterated:latest (Ollama, 18GB)
  • qwen3:30b-a3b (Ollama, 18GB)
  • qwen2.5-coder:32b (Ollama, 19GB)
  • HuggingFace: 48GB safetensors (30B full weights)

Key Insight

Ollama loads fast (~5 seconds) because its models are pre-quantized single files. HuggingFace/Unsloth loads slowly because it quantizes 48 GB of safetensors on the fly.
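
If that on-the-fly quantization is the bottleneck, one option is to quantize once, save the 4-bit checkpoint, and reload it on later runs. A sketch, assuming a transformers/bitsandbytes version recent enough to serialize 4-bit weights (the save path is hypothetical):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(
        "huihui-ai/Huihui-Qwen3-Coder-30B-A3B-Instruct-abliterated",
        quantization_config=bnb,
        device_map="auto",
    )
    # One-time cost: writes the already-quantized weights to disk.
    model.save_pretrained("/data/models/qwen3-coder-30b-4bit")
    # Later runs reload the 4-bit weights directly, skipping re-quantization:
    # AutoModelForCausalLM.from_pretrained("/data/models/qwen3-coder-30b-4bit")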

Test Script Location

/tmp/lars_unsloth_test.py (also copied to local-ai:/tmp/)

Next Session Actions

  1. Try 7B model first to verify training works
  2. Configure multi-GPU support with accelerate
  3. Test with memory optimizations
  4. If 30B still fails, consider 14B abliterated model