Custom Voice Training

Custom Voice Training for Piper

Goal

Train local Piper voices that match the InWorld voices (Lena, LARS) for a consistent offline experience.

Requirements

  • NVIDIA GPU for training (CUDA)
  • ~50GB disk space
  • 5+ minutes of clean audio per voice
  • 16kHz or 22.05kHz mono WAV files (see the format check below)
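
A quick way to verify a clip meets this format, assuming sox is installed (soxi ships with it):

soxi -r wavs/clip_001.wav   # sample rate: expect 16000 or 22050
soxi -c wavs/clip_001.wav   # channels: expect 1 (mono)
soxi -b wavs/clip_001.wav   # bit depth: expect 16

# Convert a non-conforming clip
sox raw_input.wav -r 22050 -c 1 -b 16 wavs/clip_001.wav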

Training Tools

GitHub: https://github.com/domesticatedviking/TextyMcSpeechy

  • Easy voice creation workflow
  • Works with RVC voices
  • Can listen to the model during training
  • Runs offline on a Raspberry Pi after training

Manual Training

  1. Collect audio samples with transcripts
  2. Download pre-trained checkpoint (medium quality)
  3. Fine-tune with your data (see the piper_train sketch below)
  4. Export to ONNX format
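
If you go manual rather than through TextyMcSpeechy, these steps map onto the upstream piper_train modules. A sketch based on rhasspy/piper's TRAINING.md; verify module and flag names against the piper checkout you actually train with, as they can change between versions:

# Preprocess the LJSpeech-format dataset into a training dir (after steps 1-2)
python3 -m piper_train.preprocess \
  --language en-us \
  --input-dir dataset/ \
  --output-dir training/ \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050

# Fine-tune from the downloaded medium checkpoint (step 3)
python3 -m piper_train \
  --dataset-dir training/ \
  --accelerator gpu \
  --devices 1 \
  --batch-size 32 \
  --validation-split 0.0 \
  --num-test-examples 0 \
  --max_epochs 10000 \
  --resume_from_checkpoint en_US-lessac-medium.ckpt \
  --checkpoint-epochs 1 \
  --precision 32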

Data Format

  • Audio: 16-bit mono WAV, 16kHz or 22.05kHz
  • Text: LJSpeech format (metadata.csv)
  • Structure: wavs/filename.wav|transcript text
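
For example, two rows of metadata.csv (pipe-delimited, no header row; transcript text is illustrative):

wavs/clip_001.wav|Hello, my name is Lena. How can I help you today?
wavs/clip_002.wav|Could you repeat that a little more slowly, please?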

Process for Cloning InWorld Voices

Step 1: Collect Samples

  • Generate diverse text samples through InWorld
  • Save each audio clip together with its transcript (see the scripted sketch after Step 2)
  • Aim for 30-60 minutes total; 5 minutes is the bare minimum
  • Cover a range of emotions, speaking rates, and tones

Step 2: Prepare Dataset

dataset/
  wavs/
    clip_001.wav
    clip_002.wav
    ...
  metadata.csv
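
One way to produce this layout during Step 1: iterate over a prompt list, synthesize each line, and append the matching metadata.csv row as you go. The synthesis command here is a placeholder for whatever InWorld export path you actually use:

i=1
while IFS= read -r text; do
  id=$(printf 'clip_%03d' "$i")
  # Placeholder command: substitute your real InWorld synthesis/export step
  your_inworld_tts "$text" > "dataset/wavs/${id}.wav"
  # Keep transcripts free of '|' since it is the metadata delimiter
  printf '%s|%s\n' "wavs/${id}.wav" "$text" >> dataset/metadata.csv
  i=$((i+1))
done < prompts.txt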

Step 3: Train Model

Use the AI server (local-ai) with the RTX 3090s:

python train.py \
  --dataset dataset/ \
  --checkpoint en_US-lessac-medium.ckpt \
  --output lena_custom.onnx
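
If train.py here is a wrapper that stops at a Lightning checkpoint, the upstream export is a separate module call, and the .onnx.json sidecar Piper needs is the config.json written during preprocessing (again per rhasspy/piper's TRAINING.md; paths below are illustrative):

python3 -m piper_train.export_onnx \
  training/lightning_logs/version_0/checkpoints/epoch=XXXX-step=XXXXXX.ckpt \
  lena_custom.onnx   # substitute your actual checkpoint filename
cp training/config.json lena_custom.onnx.json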

Step 4: Deploy

Copy trained .onnx + .json to: /opt/mcp-servers/voice/piper_models/
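
A quick smoke test after copying, assuming the piper CLI is on PATH (it reads input text from stdin and picks up the .onnx.json sidecar automatically):

echo 'Smoke test: one two three.' | \
  piper --model /opt/mcp-servers/voice/piper_models/lena_custom.onnx \
        --output_file /tmp/lena_test.wav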

Future Work

  1. Clone Lena (female) voice from InWorld samples
  2. Clone LARS (male) voice from InWorld samples
  3. Update voice MCP to use custom models
  4. Seamless cloud/local voice consistency