Local AI Training Infrastructure Research
Date: 2025-12-27
Purpose: Research for setting up a self-hosted AI training pipeline for LARS and custom Nexus AI models
Executive Summary
This document covers the research for implementing a fully local, self-hosted AI training infrastructure. The goal is to:
1. Fine-tune local AI models (like LARS) with custom identity and tool knowledge
2. Potentially fork Llama for a custom "Nexus AI" base model
3. Run training continuously on local servers without cloud dependencies
Part 1: Unsloth - Primary Training Framework
What is Unsloth?
Unsloth is an open-source (Apache 2.0 license) framework for fine-tuning and reinforcement learning of LLMs. It is specifically optimized for efficient, low-memory training on NVIDIA GPUs.
Key Features
- 2x faster training with 70% less VRAM compared to standard methods
- No accuracy degradation (uses no approximation methods)
- Supports LoRA, QLoRA, and full fine-tuning
- Direct export to Ollama, GGUF, llama.cpp, vLLM
- Supports 100+ models including Qwen, Llama, DeepSeek, Gemma
Can It Run Locally? YES!
Installation:
```bash
# Linux/WSL
pip install unsloth

# Docker (recommended for servers)
docker pull unsloth/unsloth
```
GPU Requirements:
- Minimum: CUDA compute capability 7.0
- Supported: V100, T4, RTX 20/30/40 series, A100, H100, L40
- Also supports AMD and Intel GPUs
- Our dual RTX 3090s are perfect for this (quick check below)
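A quick way to confirm the hardware meets that minimum, assuming PyTorch with CUDA is already installed (an RTX 3090 reports compute capability 8.6):

```python
import torch

# List each visible GPU and its CUDA compute capability.
# Unsloth needs 7.0+; an RTX 3090 reports (8, 6), i.e. 8.6.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} (compute capability {major}.{minor})")
```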
Export to Ollama Workflow
```python
# After training in Unsloth
model.save_pretrained_gguf("output_dir", tokenizer, quantization_method="q8_0")

# Or for smaller size:
model.save_pretrained_gguf("output_dir", tokenizer, quantization_method="q4_k_m")
```
The exported GGUF file can then be loaded directly into Ollama.
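One way to do that load, sketched with hypothetical paths and the model tag `lars`: write a minimal Modelfile pointing at the GGUF and register it with `ollama create`. The exact GGUF filename Unsloth produces depends on the quantization method, so adjust the path:

```python
import subprocess
from pathlib import Path

# Hypothetical paths/names; adjust to the actual export directory and desired model tag.
gguf_path = Path("output_dir/model.Q4_K_M.gguf")
modelfile = Path("output_dir/Modelfile")

# Minimal Modelfile: point Ollama at the exported GGUF. A TEMPLATE directive matching
# the training chat template should also be added (see the next section).
modelfile.write_text(f"FROM {gguf_path.resolve()}\n")

# Register (or replace) the model in the local Ollama instance.
subprocess.run(["ollama", "create", "lars", "-f", str(modelfile)], check=True)
```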
Important: Chat Template Consistency
When exporting to Ollama, you MUST use the SAME chat template that was used during training. Mismatched templates cause gibberish output or infinite generation loops.
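One way to keep them consistent is to pin the template once on the tokenizer at training time and mirror the exact same format in the Ollama Modelfile's TEMPLATE directive. A minimal sketch using Unsloth's chat-template helper; the model repo and the ChatML template name here are assumptions, not recommendations:

```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the base model once (repo name is an assumption), then pin a single chat template.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-32B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)
# The Ollama Modelfile's TEMPLATE must render this same format, or the exported
# model may emit gibberish or loop forever.
tokenizer = get_chat_template(tokenizer, chat_template="chatml")  # template name is an assumption

# Render training examples and inference prompts through the same path:
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Who are you?"}],
    tokenize=False,
    add_generation_prompt=True,
)
```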
Part 2: Alternatives to Unsloth
1. Axolotl
- Best for: Beginners, multi-GPU setups (Unsloth is single-GPU only)
- License: Open source
- Strengths: Community-driven, rapid new model support
- Use case: If we need to use BOTH 3090s simultaneously
- GitHub: https://github.com/OpenAccess-AI-Collective/axolotl
2. LLaMA Factory
- Best for: Zero-code training via Web UI
- License: Apache 2.0
- Strengths: Supports 100+ LLMs, Gradio Web UI, no coding required
- Use case: Quick experiments, non-technical training
- GitHub: https://github.com/hiyouga/LLaMA-Factory
3. SWIFT (by ModelScope)
- Best for: Multi-GPU training at scale
- License: Open source
- Strengths: 300+ LLM support, full workflow (pretrain → fine-tune → RLHF → deploy)
- Use case: Enterprise-scale training
4. Torchtune
- Best for: PyTorch developers wanting deep customization
- License: BSD (PyTorch ecosystem)
- Strengths: Pure PyTorch, AMD/NVIDIA support, memory efficient
Comparison Table
| Framework | Multi-GPU | Web UI | License | Best For |
|---|---|---|---|---|
| Unsloth | No | No | Apache 2.0 | Speed/efficiency |
| Axolotl | Yes | No | Open | Multi-GPU |
| LLaMA Factory | Yes | Yes | Apache 2.0 | No-code |
| SWIFT | Yes | Partial | Open | Enterprise |
| Torchtune | Yes | No | BSD | Customization |
Part 3: Forking Llama for Custom Nexus AI
Llama 3 License Terms
Llama uses the Meta Llama 3 Community License - NOT true open source, but it permits:
- Commercial use ✓
- Fine-tuning ✓
- Creating derivative works ✓
- Distribution ✓
Requirements:
1. Must include "Llama" at the beginning of any derivative model name (e.g., "Llama-3-Nexus")
2. Must display "Built with Llama" on related materials
3. Must include a copy of the license with any distribution
4. Must comply with the Acceptable Use Policy
How to Fork Llama
```bash
# Clone the repo
git clone https://github.com/meta-llama/llama3.git

# Request access to model weights at llama.com
# Run download script after approval
bash download.sh
```
Fully Permissive Alternatives
If we want true open source without attribution requirements:
- Grok (xAI) - Apache 2.0, fully permissive
- Mixtral 8x22B - Apache 2.0, fully permissive
- Qwen - Various models with permissive licenses
Part 4: Recommended Architecture for Nexus
Option A: Dedicated Training Server
Run Unsloth on a dedicated server (could be the Nexus server or a backup) that:
1. Continuously monitors training data/feedback
2. Periodically retrains/fine-tunes LARS
3. Exports the new GGUF to the local-ai server
4. Ollama hot-reloads the updated model
Option B: On-Demand Training on local-ai
Run Unsloth directly on the local-ai server (dual 3090s) when needed:
1. Collect training data via normal LARS usage
2. Schedule training during off-hours
3. Export directly to Ollama on the same machine
Recommended: Option A
A dedicated training server prevents GPU contention with inference.
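A rough shape of that pipeline, as a hedged sketch: the feedback directory, thresholds, and the `train_lars.py` / `deploy_lars.py` scripts are placeholders (the training step itself is sketched in Part 5, the GGUF-to-Ollama step in Part 1):

```python
import subprocess
import time
from pathlib import Path

FEEDBACK_DIR = Path("/data/lars-feedback")   # assumption: where usage feedback lands
MIN_NEW_EXAMPLES = 500                       # assumption: retrain threshold
CHECK_INTERVAL_S = 6 * 60 * 60               # check every 6 hours

def count_new_examples() -> int:
    """Count feedback files not yet folded into a training run."""
    return len(list(FEEDBACK_DIR.glob("*.jsonl")))

while True:
    if count_new_examples() >= MIN_NEW_EXAMPLES:
        # Hypothetical scripts: fine-tune on the dedicated server, then push the
        # exported GGUF to local-ai and re-run `ollama create` there.
        subprocess.run(["python", "train_lars.py"], check=True)
        subprocess.run(["python", "deploy_lars.py"], check=True)
    time.sleep(CHECK_INTERVAL_S)
```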
Training Data Sources for LARS Identity
- Curated Q&A pairs establishing LARS identity (example format sketched after this list)
- Tool usage examples ("When asked to read a file, call read_file tool")
- Nexus system knowledge
- Coding style preferences
- Voice/personality traits
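A sketch of what one identity record could look like in a simple chat-message JSONL layout; the phrasing, fields, and file name are placeholders rather than a finalized schema:

```python
import json

# Hypothetical identity examples; the wording is illustrative only.
examples = [
    {"messages": [
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I'm LARS, the local assistant running on the Nexus servers."},
    ]},
    {"messages": [
        {"role": "user", "content": "What model are you based on?"},
        {"role": "assistant", "content": "I'm a locally fine-tuned model served through Ollama on Nexus hardware."},
    ]},
]

with open("lars_identity.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```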
Part 5: Implementation Plan
Phase 1: Setup
- Install Unsloth on local-ai server (or dedicated training server)
- Create training dataset for LARS identity
- Run initial fine-tune on Qwen 2.5 Coder 32B (see the sketch after this list)
- Export to GGUF and test in Ollama
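A rough outline of that initial fine-tune, pieced together from Unsloth's tutorials; the model repo, hyperparameters, and file names are assumptions, and argument names can shift between Unsloth/TRL releases:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit quantized base model (repo name is an assumption).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-32B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters (QLoRA-style) so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Identity dataset from Part 4, rendered to plain text with the pinned chat template.
dataset = load_dataset("json", data_files="lars_identity.jsonl", split="train")
dataset = dataset.map(
    lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)}
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,          # placeholder; tune against eval quality
        learning_rate=2e-4,
        logging_steps=10,
    ),
)
trainer.train()

# Export for Ollama (see Part 1 for the Modelfile / `ollama create` step).
model.save_pretrained_gguf("output_dir", tokenizer, quantization_method="q4_k_m")
```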
Phase 2: Tool Knowledge
- Create tool-calling training examples (example record after this list)
- Fine-tune LARS to understand Gateway tools
- Test tool invocation accuracy
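One possible shape for a tool-calling record; the tool schema and call syntax here are assumptions and would need to match however the Gateway actually exposes tools to LARS:

```python
import json

# Hypothetical record: the assistant answers with a structured tool call instead of prose.
tool_example = {"messages": [
    {"role": "system", "content": "You can call tools. Respond with a JSON tool call when appropriate."},
    {"role": "user", "content": "Read /etc/hostname and tell me what it says."},
    {"role": "assistant", "content": json.dumps({
        "tool": "read_file",
        "arguments": {"path": "/etc/hostname"},
    })},
]}

with open("lars_tools.jsonl", "a") as f:
    f.write(json.dumps(tool_example) + "\n")
```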
Phase 3: Continuous Learning
- Set up feedback collection from LARS usage
- Implement periodic retraining pipeline
- A/B test new models vs the current one (see the sketch below)
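A minimal way to A/B two candidates once both are registered in Ollama is to send the same held-out prompts to its local HTTP API and compare answers side by side. A sketch, assuming Ollama's default port and two hypothetical model tags:

```python
import requests

PROMPTS = ["Who are you?", "Read the file /etc/hostname."]  # held-out eval prompts (placeholders)
MODELS = ["lars-current", "lars-candidate"]                  # hypothetical Ollama model tags

for prompt in PROMPTS:
    for model in MODELS:
        # Ollama's chat endpoint; stream=False returns one JSON object per request.
        r = requests.post(
            "http://localhost:11434/api/chat",
            json={"model": model, "messages": [{"role": "user", "content": prompt}], "stream": False},
            timeout=300,
        )
        r.raise_for_status()
        print(f"[{model}] {prompt}\n{r.json()['message']['content']}\n")
```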
Resources
Official Documentation
- Unsloth: https://unsloth.ai/docs
- Unsloth GitHub: https://github.com/unslothai/unsloth
- Llama 3 GitHub: https://github.com/meta-llama/llama3
- LLaMA Factory: https://github.com/hiyouga/LLaMA-Factory
Tutorials
- Unsloth + Ollama Tutorial: https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/tutorial-how-to-finetune-llama-3-and-use-in-ollama
- Saving to GGUF: https://docs.unsloth.ai/basics/running-and-saving-models/saving-to-gguf
- Saving to Ollama: https://docs.unsloth.ai/basics/running-and-saving-models/saving-to-ollama
Key Takeaways
- Unsloth CAN run locally - Apache 2.0 license, pip install, Docker available
- Our hardware is perfect - Dual RTX 3090s exceed requirements
- Llama can be forked - With attribution requirements ("Built with Llama")
- Qwen may be better - More permissive license, already running on LARS
- Training pipeline is achievable - Unsloth → GGUF → Ollama is well-documented