LARS Unsloth Training Investigation
Problem Summary
The 30B abliterated model (huihui-ai/Huihui-Qwen3-Coder-30B-A3B-Instruct-abliterated) crashes with a CUDA out-of-memory error during training setup.
Root Cause
Single-GPU loading: Unsloth uses only GPU 0 (23.5 GB), even though 2x RTX 3090s are available (~47 GiB total). The 30B model with 4-bit quantization needs ~24 GB for loading plus optimizer states, which exceeds single-GPU capacity (rough estimate below).
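For intuition, a back-of-envelope estimate; the per-term overhead figures below are assumptions for illustration, not measurements:

```python
# Rough single-GPU VRAM estimate for a 30B model loaded in 4-bit.
# All overhead figures are guesses, not measured values.
params = 30e9
weights_gib = params * 0.5 / 2**30    # ~4 bits per weight -> ~14 GiB
quant_meta_gib = weights_gib * 0.10   # scales/zero-points, fp16 norms, embeddings
dequant_gib = 2.0                     # transient fp16 buffers during on-the-fly quantization
activations_gib = 4.0                 # batch- and seq-length-dependent
total_gib = weights_gib + quant_meta_gib + dequant_gib + activations_gib
print(f"~{total_gib:.1f} GiB")        # already close to a 3090's 23.5 GiB ceiling
```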
Error Details
```
CUDA out of memory. Tried to allocate 2.00 MiB.
GPU 0 has a total capacity of 23.56 GiB of which 1024.00 KiB is free.
10.35 GiB allocated by PyTorch, 12.89 GiB reserved but unallocated.
```
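The large "reserved but unallocated" figure points at allocator fragmentation rather than pure exhaustion. PyTorch's stock `torch.cuda` stats can confirm where the memory went:

```python
import torch

dev = torch.device("cuda:0")
# Tensors currently in use vs. memory the caching allocator is holding on to.
print(f"allocated: {torch.cuda.memory_allocated(dev) / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(dev) / 2**30:.2f} GiB")
# Per-pool breakdown, useful for spotting fragmentation.
print(torch.cuda.memory_summary(dev, abbreviated=True))
```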
Investigation Points for Tomorrow
Enable Multi-GPU in Unsloth
- Check if `device_map='auto'` spreads the model across both GPUs (sketch below)
- May need `accelerate` library configuration
- Look into FSDP (Fully Sharded Data Parallel)
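A minimal sharding sanity check, assuming plain `transformers` + `bitsandbytes`; this deliberately bypasses Unsloth (which may pin everything to one GPU), and the `max_memory` caps are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huihui-ai/Huihui-Qwen3-Coder-30B-A3B-Instruct-abliterated",
    quantization_config=bnb,
    device_map="auto",                    # accelerate shards layers across visible GPUs
    max_memory={0: "22GiB", 1: "22GiB"},  # leave headroom on each 3090
)
print(model.hf_device_map)                # confirm layers landed on both GPUs
```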
Try Smaller Model First
- Use a 7B or 14B abliterated model for testing (smoke-test sketch below)
- Verify the training pipeline works before scaling up
- Models to try: `huihui-ai/Qwen2.5-Coder-7B-Instruct-abliterated`
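A minimal 7B smoke test, assuming the current Unsloth API (`FastLanguageModel.from_pretrained` / `get_peft_model`); the LoRA rank and target modules are placeholder choices:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="huihui-ai/Qwen2.5-Coder-7B-Instruct-abliterated",
    max_seq_length=1024,   # halved from 2048, per the memory notes
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # placeholder LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth's offloaded checkpointing
)
```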
Memory Optimization Options
- Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (see below)
- Use `gradient_checkpointing` more aggressively
- Reduce `max_seq_length` from 2048 to 1024
- Use `load_in_8bit=True` instead of 4-bit (less memory during quantization)
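The allocator flag only takes effect if it is set before CUDA initializes, so in a script it has to precede the first `torch` import:

```python
import os

# expandable_segments lets the allocator grow blocks instead of fragmenting;
# it targets exactly the "reserved but unallocated" gap seen in the OOM above.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402 -- intentionally after the env var
```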
Alternative: CPU Offloading
- DeepSpeed ZeRO Stage 3 can offload to CPU/NVMe
- Slower, but allows training larger models (config sketch below)
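A minimal ZeRO-3 CPU-offload config sketch using standard DeepSpeed keys; it is normally passed to the HF `Trainer` via `TrainingArguments(deepspeed=...)` or saved as JSON, and whether it composes cleanly with Unsloth is untested here:

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}
```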
Infrastructure Notes
- Local-AI server: 100.89.34.86 (user: lars, pass: LARS25)
- 2x RTX 3090 (24 GB each; ~47 GiB usable in total)
- HuggingFace cache moved to /data/models/huggingface (229GB NVMe)
- Root disk was 100% full, now 50% after cache move
- Ollama models work fine (inference-only, pre-quantized)
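To keep future downloads on the NVMe cache, the environment variable can be set before any HF import. Assumption: the moved directory is laid out as an `HF_HOME`; if only the hub cache itself was moved, `HF_HUB_CACHE` is the right variable instead:

```python
import os

# Must be set before importing transformers/huggingface_hub to take effect.
os.environ["HF_HOME"] = "/data/models/huggingface"
```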
Models Available
- huihui_ai/qwen3-coder-abliterated:latest (Ollama, 18GB)
- qwen3:30b-a3b (Ollama, 18GB)
- qwen2.5-coder:32b (Ollama, 19GB)
- HuggingFace: 48GB safetensors (30B full weights)
Key Insight
Ollama loads fast (~5 s) because its models are pre-quantized single files. HuggingFace/Unsloth loads slowly because it quantizes 48 GB of safetensors on the fly (see the caching sketch below).
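One way to pay that quantization cost only once: quantize with plain `transformers`, then save the 4-bit checkpoint and reload it directly on later runs. Assumptions: a recent `transformers`/`bitsandbytes` pair that supports 4-bit serialization, and the output path is hypothetical:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "huihui-ai/Huihui-Qwen3-Coder-30B-A3B-Instruct-abliterated",
    quantization_config=bnb,
    device_map="auto",
)
# Hypothetical output path; later runs load this directory and skip quantization.
model.save_pretrained("/data/models/quantized/qwen3-coder-30b-4bit")
```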
Test Script Location
/tmp/lars_unsloth_test.py (also copied to local-ai:/tmp/)
Next Session Actions
- Try 7B model first to verify training works
- Configure multi-GPU support with accelerate
- Test with memory optimizations
- If 30B still fails, consider 14B abliterated model