Custom Voice Training for Piper
Goal
Train local Piper voices that match the InWorld voices (Lena, LARS) for a consistent offline experience.
Requirements
- NVIDIA GPU for training (CUDA)
- ~50GB disk space
- 5+ minutes of clean audio per voice as a floor (30-60 minutes gives a closer match; see Step 1)
- 16kHz or 22.05kHz mono WAV files (a conversion example follows this list)
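If source clips arrive in some other format, a stock ffmpeg command (assuming ffmpeg is installed; filenames here are illustrative) resamples them to 16-bit mono at the rate Piper's medium-quality models use:

```sh
ffmpeg -i raw_clip.wav -ar 22050 -ac 1 -sample_fmt s16 clip_001.wav
```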
Training Tools
TextyMcSpeechy (Recommended)
GitHub: https://github.com/domesticatedviking/TextyMcSpeechy
- Easy voice creation workflow
- Works with RVC voices
- Can listen to the model during training
- Works offline on a Raspberry Pi after training
Manual Training
- Collect audio samples with transcripts
- Download a pre-trained checkpoint to fine-tune from (medium quality; see the example after this list)
- Fine-tune the checkpoint on your data
- Export the result to ONNX format
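Piper's pre-trained checkpoints are published in the rhasspy/piper-checkpoints dataset on Hugging Face. The exact file path below is an assumption based on the repo's layout at the time of writing, so browse the dataset if the URL 404s:

```sh
wget -O en_US-lessac-medium.ckpt \
  "https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/en/en_US/lessac/medium/epoch%3D2164-step%3D1355540.ckpt"
```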
Data Format
- Audio: 16-bit mono WAV, 16kHz or 22.05kHz
- Text: LJSpeech format (metadata.csv)
- Structure: one line per clip, pipe-separated (audio path, then transcript):

```
wavs/filename.wav|transcript text
```
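For example, the first rows of metadata.csv for the dataset in Step 2 might look like this (transcripts are illustrative). Note that some trainers expect a bare file ID rather than a relative path in the first field, so check your tool's documentation:

```
wavs/clip_001.wav|Hello, my name is Lena.
wavs/clip_002.wav|The weather today is sunny with a light breeze.
```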
Process for Cloning InWorld Voices
Step 1: Collect Samples
- Generate diverse text samples through InWorld
- Save each audio clip with transcript
- Aim for 30-60 minutes total
- Cover various emotions, speeds, tones
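A minimal collection sketch in Python, assuming you already have some way to synthesize audio through InWorld: the synthesize() function below is a hypothetical placeholder, and the point is the bookkeeping, i.e. every saved clip gets a matching metadata.csv row:

```python
from pathlib import Path

def synthesize(text: str) -> bytes:
    """Hypothetical placeholder: replace with a real call that returns WAV bytes."""
    raise NotImplementedError("wire this to InWorld (or any TTS) yourself")

# Short prompt list for illustration; real collection should cover
# varied emotions, speeds, and tones across 30-60 minutes of audio.
PROMPTS = [
    "Hello, my name is Lena.",
    "Could you repeat that more slowly, please?",
    "That is wonderful news!",
]

def collect(dataset_dir: str = "dataset") -> None:
    wav_dir = Path(dataset_dir) / "wavs"
    wav_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for i, text in enumerate(PROMPTS, start=1):
        wav_path = wav_dir / f"clip_{i:03d}.wav"
        wav_path.write_bytes(synthesize(text))
        rows.append(f"wavs/{wav_path.name}|{text}")  # LJSpeech-style row
    (Path(dataset_dir) / "metadata.csv").write_text("\n".join(rows) + "\n")

if __name__ == "__main__":
    collect()
```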
Step 2: Prepare Dataset
```
dataset/
    wavs/
        clip_001.wav
        clip_002.wav
        ...
    metadata.csv
```
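With upstream piper_train installed (from the rhasspy/piper repo), the dataset is first preprocessed into the trainer's working format. The flags follow Piper's training guide, with paths adjusted to this layout:

```sh
python -m piper_train.preprocess \
  --language en-us \
  --input-dir dataset/ \
  --output-dir training_dir/ \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050
```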
Step 3: Train Model
Use the AI server (local-ai) with its RTX 3090s. Fine-tuning resumes from the downloaded checkpoint; the flags follow Piper's training guide, with batch size and epochs tuned to your GPU and data:

```sh
python -m piper_train \
  --dataset-dir training_dir/ \
  --accelerator gpu --devices 1 \
  --batch-size 32 \
  --max_epochs 3000 \
  --resume_from_checkpoint en_US-lessac-medium.ckpt \
  --checkpoint-epochs 1
```
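After training, export the checkpoint to ONNX and keep the JSON config alongside it. The lightning_logs path is where PyTorch Lightning writes checkpoints; the exact epoch/step filename depends on your run, hence the glob:

```sh
python -m piper_train.export_onnx \
  training_dir/lightning_logs/version_0/checkpoints/*.ckpt \
  lena_custom.onnx
cp training_dir/config.json lena_custom.onnx.json
```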
Step 4: Deploy
Copy the trained .onnx and its matching .onnx.json to:
/opt/mcp-servers/voice/piper_models/
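A quick smoke test with the piper CLI (assuming piper is on the PATH) confirms the deployed model loads and speaks:

```sh
echo "Hello from the new local voice." | \
  piper --model /opt/mcp-servers/voice/piper_models/lena_custom.onnx \
        --output_file test.wav
```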
Resources
- Piper TTS: https://github.com/rhasspy/piper (TRAINING.md covers fine-tuning in detail)
- TextyMcSpeechy: https://github.com/domesticatedviking/TextyMcSpeechy
Future Work
- Clone Lena (female) voice from InWorld samples
- Clone LARS (male) voice from InWorld samples
- Update voice MCP to use custom models
- Seamless cloud/local voice consistency