Local LLM Integration Guide
Overview
This guide covers integrating local LLMs into the Nexus 3.0 ecosystem using Ollama and the Local MCP Server.
Architecture
Hardware Requirements (Practice Rig)
- GPU: 8GB+ VRAM minimum (14GB recommended for 7B models)
- CPU: Any modern multi-core
- RAM: 16GB+ system RAM
- Storage: 50GB+ for models
Hardware Requirements (Production/Client)
- GPU: NVIDIA RTX Pro 6000 (96GB VRAM)
- CPU: AMD Threadripper (64 cores)
- RAM: 128GB+
- Storage: 1TB+ NVMe
Software Stack
Server (Headless Linux)
- Ubuntu Server 24.04 LTS
- NVIDIA Driver 550+
- CUDA Toolkit
- Ollama
- Docker (for Open WebUI)
Client (Windows Desktop)
- LM Studio - GUI for model testing
- Connects to the remote Ollama API on the server
Network Setup
- Tailscale for secure mesh networking
- Ollama API on port 11434
- Open WebUI on port 3000 (optional)
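For a quick check of the remote API from a client on the tailnet, something like the following works. This is a minimal sketch: llm-server stands in for your Tailscale MagicDNS hostname (or Tailscale IP), and the model tag is illustrative.

```python
# Minimal check of a remote Ollama instance over Tailscale.
# "llm-server" is an illustrative MagicDNS hostname; substitute your node's name or IP.
import requests

OLLAMA_URL = "http://llm-server:11434"

# List the models the server has pulled (GET /api/tags).
tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10).json()
for m in tags.get("models", []):
    print(m["name"], m.get("size"))

# Run a single non-streaming generation (POST /api/generate).
resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": "qwen2.5:7b", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```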
Model Recommendations by VRAM
| VRAM | Model | Tokens/sec (est) |
|---|---|---|
| 8GB | Qwen2.5-1.5B | 60+ |
| 8GB | Qwen2.5-7B-Q4 | 20-30 |
| 14GB | Qwen2.5-7B-Q8 | 25-35 |
| 14GB | Qwen2.5-14B-Q4 | 15-20 |
| 24GB | Llama2-13B-Q8 | 20-30 |
| 48GB | Llama2-70B-Q4 | 10-15 |
| 96GB | Llama2-70B-Q8 | 15-25 |
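To fetch one of these models at a specific quantization, the same HTTP API exposes a pull endpoint. A sketch, assuming current Ollama API semantics; the tag below is an example, so check the Ollama model library for the exact tags available for your chosen model and quantization.

```python
# Pull a quantized model tag appropriate for the available VRAM.
# The tag (qwen2.5:7b-instruct-q4_K_M) is illustrative; verify it exists in the Ollama library.
import requests

OLLAMA_URL = "http://llm-server:11434"  # same illustrative host as above

resp = requests.post(
    f"{OLLAMA_URL}/api/pull",
    json={"model": "qwen2.5:7b-instruct-q4_K_M", "stream": False},
    timeout=None,  # large downloads can take a while
)
print(resp.json())  # a single status object when the pull completes
```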
Local MCP Server
The local MCP server exposes these tools:
- local.chat - Conversational completion
- local.complete - Text completion
- local.models - List available models
- local.status - Report GPU memory usage and generation speed (tokens/sec)
- local.embed - Generate embeddings
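How these tools are wired to Ollama is implementation-specific. Below is a minimal sketch assuming the FastMCP helper from the official Python MCP SDK, proxying to Ollama's /api/chat and /api/tags endpoints. The underscored tool names (local_chat, local_models), the host, and the default model are illustrative assumptions, not the actual Nexus implementation.

```python
# Sketch of a local MCP server exposing Ollama-backed tools via FastMCP (pip install mcp).
# Tool names, host, and default model are illustrative.
import requests
from mcp.server.fastmcp import FastMCP

OLLAMA_URL = "http://localhost:11434"
DEFAULT_MODEL = "qwen2.5:7b"  # illustrative default

mcp = FastMCP("local-llm")

@mcp.tool(name="local_chat")
def local_chat(prompt: str, model: str = DEFAULT_MODEL) -> str:
    """Conversational completion via the local Ollama instance."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
        timeout=120,
    )
    return resp.json()["message"]["content"]

@mcp.tool(name="local_models")
def local_models() -> list[str]:
    """List the models available on the local Ollama instance."""
    tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10).json()
    return [m["name"] for m in tags.get("models", [])]

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```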
What Small Models (1.5B-7B) Can Do
- Tool calling (add contact, create track, search Nexus)
- Short voice responses
- Structured data extraction
- Company knowledge Q&A (after fine-tuning)
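Structured extraction in particular works well when the model is constrained to JSON. A hedged sketch using Ollama's JSON mode ("format": "json"); the model tag, field names, and sample text are illustrative.

```python
# Sketch: structured data extraction with a small local model using Ollama's JSON mode.
# Model tag, field names, and sample text are illustrative.
import json
import requests

OLLAMA_URL = "http://llm-server:11434"

prompt = (
    "Extract the contact as JSON with keys name, company, email.\n"
    "Text: 'Met Dana Reyes from Corlera, reach her at dana@corlera.example.'"
)

resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": "qwen2.5:1.5b", "prompt": prompt, "format": "json", "stream": False},
    timeout=60,
)
contact = json.loads(resp.json()["response"])
print(contact["name"], contact["company"], contact["email"])
```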
What Requires Larger Models
- Long document summarization
- Complex multi-step reasoning
- For these tasks, route requests to the Claude API instead of the local model
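A sketch of that routing, using the anthropic Python SDK. The model name is illustrative; use whichever Claude model your account provides.

```python
# Sketch: send a long-document summarization request to the Claude API
# instead of the local model. Model name is illustrative.
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def summarize(document: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Summarize this document:\n\n{document}"}],
    )
    return message.content[0].text
```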
Training / Fine-Tuning
Use LoRA (Low-Rank Adaptation):
- Base model stays frozen
- Train a small adapter (~50-100MB)
- Works completely offline
- Tools: Unsloth, Axolotl, LLaMA-Factory
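The tools above wrap roughly the same settings. A minimal sketch of a LoRA adapter configuration with Hugging Face peft, where the base model, rank, and target modules are illustrative choices.

```python
# Sketch of a LoRA adapter configuration with Hugging Face peft.
# Base model, rank, and target modules are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

lora = LoraConfig(
    r=16,                    # adapter rank; small ranks keep the adapter in the tens of MB
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # base weights stay frozen; only adapter weights train
```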
Training Data Ideas
- Corlera company information
- User preferences and style
- Nexus tool usage patterns
- Contact and project context
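One way this data might be captured (an assumption, not the actual Nexus pipeline) is as instruction/response pairs in JSONL, a format the fine-tuning tools listed above can ingest. Field names and contents below are placeholders.

```python
# Sketch: write training examples as instruction/response pairs in JSONL.
# Field names, tool names, and contents are placeholders.
import json

examples = [
    {
        "instruction": "Who founded Corlera and what does it do?",
        "response": "Fill in with the actual company information the model should learn.",
    },
    {
        "instruction": "Add a contact named Dana Reyes to Nexus.",
        "response": '{"tool": "nexus.add_contact", "args": {"name": "Dana Reyes"}}',
    },
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```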