Local LLM Integration Guide
Overview
This guide covers integrating local LLMs into the Nexus 3.0 ecosystem using Ollama and the Local MCP Server.
Architecture
Hardware Requirements (Practice Rig)
- GPU: 8GB+ VRAM minimum (14GB recommended for 7B models)
- CPU: Any modern multi-core
- RAM: 16GB+ system RAM
- Storage: 50GB+ for models
Hardware Requirements (Production/Client)
- GPU: NVIDIA RTX Pro 6000 (96GB VRAM)
- CPU: AMD Threadripper (64 cores)
- RAM: 128GB+
- Storage: 1TB+ NVMe
Software Stack
Server (Headless Linux)
- Ubuntu Server 24.04 LTS
- NVIDIA Driver 550+
- CUDA Toolkit
- Ollama
- Docker (for Open WebUI)
Client (Windows Desktop)
- LM Studio - GUI for model testing
- Connects to the remote Ollama API on the server
Network Setup
- Tailscale for secure mesh networking
- Ollama API on port 11434
- Open WebUI on port 3000 (optional)
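For a quick check of the remote API from a client on the tailnet, something like the following works. This is a minimal sketch: llm-server stands in for your Tailscale MagicDNS hostname (or Tailscale IP), and the model tag is illustrative.

```python
# Minimal check of a remote Ollama instance over Tailscale.
# "llm-server" is an illustrative MagicDNS hostname; substitute your node's name or IP.
import requests

OLLAMA_URL = "http://llm-server:11434"

# List the models the server has pulled (GET /api/tags).
tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10).json()
for m in tags.get("models", []):
    print(m["name"], m.get("size"))

# Run a single non-streaming generation (POST /api/generate).
resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": "qwen2.5:7b", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```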
Model Recommendations by VRAM
| VRAM | Model | Tokens/sec (est) |
|---|---|---|
| 8GB | Qwen2.5-1.5B | 60+ |
| 8GB | Qwen2.5-7B-Q4 | 20-30 |
| 14GB | Qwen2.5-7B-Q8 | 25-35 |
| 14GB | Qwen2.5-14B-Q4 | 15-20 |
| 24GB | Llama2-13B-Q8 | 20-30 |
| 48GB | Llama2-70B-Q4 | 10-15 |
| 96GB | Llama2-70B-Q8 | 15-25 |
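To fetch one of these models at a specific quantization, the same HTTP API exposes a pull endpoint. A sketch, assuming current Ollama API semantics; the tag below is an example, so check the Ollama model library for the exact tags available for your chosen model and quantization.

```python
# Pull a quantized model tag appropriate for the available VRAM.
# The tag (qwen2.5:7b-instruct-q4_K_M) is illustrative; verify it exists in the Ollama library.
import requests

OLLAMA_URL = "http://llm-server:11434"  # same illustrative host as above

resp = requests.post(
    f"{OLLAMA_URL}/api/pull",
    json={"model": "qwen2.5:7b-instruct-q4_K_M", "stream": False},
    timeout=None,  # large downloads can take a while
)
print(resp.json())  # a single status object when the pull completes
```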
Local MCP Server
The local MCP server exposes these tools:
- local.chat - Conversational completion
- local.complete - Text completion
- local.models - List available models
- local.status - Report GPU memory usage and generation speed (tokens/sec)
- local.embed - Generate embeddings
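How these tools are wired to Ollama is implementation-specific. Below is a minimal sketch assuming the FastMCP helper from the official Python MCP SDK, proxying to Ollama's /api/chat and /api/tags endpoints. The underscored tool names (local_chat, local_models), the host, and the default model are illustrative assumptions, not the actual Nexus implementation.

```python
# Sketch of a local MCP server exposing Ollama-backed tools via FastMCP (pip install mcp).
# Tool names, host, and default model are illustrative.
import requests
from mcp.server.fastmcp import FastMCP

OLLAMA_URL = "http://localhost:11434"
DEFAULT_MODEL = "qwen2.5:7b"  # illustrative default

mcp = FastMCP("local-llm")

@mcp.tool(name="local_chat")
def local_chat(prompt: str, model: str = DEFAULT_MODEL) -> str:
    """Conversational completion via the local Ollama instance."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
        timeout=120,
    )
    return resp.json()["message"]["content"]

@mcp.tool(name="local_models")
def local_models() -> list[str]:
    """List the models available on the local Ollama instance."""
    tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10).json()
    return [m["name"] for m in tags.get("models", [])]

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```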
What Small Models (1.5B-7B) Can Do
- Tool calling (add contact, create track, search Nexus)
- Short voice responses
- Structured data extraction
- Company knowledge Q&A (after fine-tuning)
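Structured extraction in particular works well when the model is constrained to JSON. A hedged sketch using Ollama's JSON mode ("format": "json"); the model tag, field names, and sample text are illustrative.

```python
# Sketch: structured data extraction with a small local model using Ollama's JSON mode.
# Model tag, field names, and sample text are illustrative.
import json
import requests

OLLAMA_URL = "http://llm-server:11434"

prompt = (
    "Extract the contact as JSON with keys name, company, email.\n"
    "Text: 'Met Dana Reyes from Corlera, reach her at dana@corlera.example.'"
)

resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": "qwen2.5:1.5b", "prompt": prompt, "format": "json", "stream": False},
    timeout=60,
)
contact = json.loads(resp.json()["response"])
print(contact["name"], contact["company"], contact["email"])
```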
What Requires Larger Models
- Long document summarization
- Complex multi-step reasoning
- For these tasks, route requests to the Claude API instead of the local model
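A sketch of that routing, using the anthropic Python SDK. The model name is illustrative; use whichever Claude model your account provides.

```python
# Sketch: send a long-document summarization request to the Claude API
# instead of the local model. Model name is illustrative.
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def summarize(document: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Summarize this document:\n\n{document}"}],
    )
    return message.content[0].text
```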
Training / Fine-Tuning
Use LoRA (Low-Rank Adaptation):
- Base model stays frozen
- Train a small adapter (~50-100MB)
- Works completely offline
- Tools: Unsloth, Axolotl, LLaMA-Factory
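The tools above wrap roughly the same settings. A minimal sketch of a LoRA adapter configuration with Hugging Face peft, where the base model, rank, and target modules are illustrative choices.

```python
# Sketch of a LoRA adapter configuration with Hugging Face peft.
# Base model, rank, and target modules are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

lora = LoraConfig(
    r=16,                    # adapter rank; small ranks keep the adapter in the tens of MB
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # base weights stay frozen; only adapter weights train
```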
Training Data Ideas
- Corlera company information
- User preferences and style
- Nexus tool usage patterns
- Contact and project context
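One way this data might be captured (an assumption, not the actual Nexus pipeline) is as instruction/response pairs in JSONL, a format the fine-tuning tools listed above can ingest. Field names and contents below are placeholders.

```python
# Sketch: write training examples as instruction/response pairs in JSONL.
# Field names, tool names, and contents are placeholders.
import json

examples = [
    {
        "instruction": "Who founded Corlera and what does it do?",
        "response": "Fill in with the actual company information the model should learn.",
    },
    {
        "instruction": "Add a contact named Dana Reyes to Nexus.",
        "response": '{"tool": "nexus.add_contact", "args": {"name": "Dana Reyes"}}',
    },
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```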