Custom Voice Training for Piper
Goal
Train local Piper voices that match the InWorld voices (Lena, LARS) for a consistent offline experience.
Requirements
- NVIDIA GPU for training (CUDA)
- ~50GB disk space
- 5+ minutes of clean audio per voice as a floor (30-60 minutes gives a closer match; see Step 1)
- 16kHz or 22.05kHz mono WAV files (a conversion example follows this list)
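If source clips arrive in some other format, a stock ffmpeg command (assuming ffmpeg is installed; filenames here are illustrative) resamples them to 16-bit mono at the rate Piper's medium-quality models use:

```sh
ffmpeg -i raw_clip.wav -ar 22050 -ac 1 -sample_fmt s16 clip_001.wav
```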
Training Tools
TextyMcSpeechy (Recommended)
GitHub: https://github.com/domesticatedviking/TextyMcSpeechy
- Easy voice creation workflow
- Works with RVC voices
- Can listen to the model during training
- Works offline on a Raspberry Pi after training
Manual Training
- Collect audio samples with transcripts
- Download a pre-trained checkpoint to fine-tune from (medium quality; see the example after this list)
- Fine-tune the checkpoint on your data
- Export the result to ONNX format
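Piper's pre-trained checkpoints are published in the rhasspy/piper-checkpoints dataset on Hugging Face. The exact file path below is an assumption based on the repo's layout at the time of writing, so browse the dataset if the URL 404s:

```sh
wget -O en_US-lessac-medium.ckpt \
  "https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/en/en_US/lessac/medium/epoch%3D2164-step%3D1355540.ckpt"
```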
Data Format
- Audio: 16-bit mono WAV, 16kHz or 22.05kHz
- Text: LJSpeech format (metadata.csv)
- Structure: one line per clip, pipe-separated (audio path, then transcript):

```
wavs/filename.wav|transcript text
```
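For example, the first rows of metadata.csv for the dataset in Step 2 might look like this (transcripts are illustrative). Note that some trainers expect a bare file ID rather than a relative path in the first field, so check your tool's documentation:

```
wavs/clip_001.wav|Hello, my name is Lena.
wavs/clip_002.wav|The weather today is sunny with a light breeze.
```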
Process for Cloning InWorld Voices
Step 1: Collect Samples
- Generate diverse text samples through InWorld
- Save each audio clip with transcript
- Aim for 30-60 minutes total
- Cover various emotions, speeds, tones
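A minimal collection sketch in Python, assuming you already have some way to synthesize audio through InWorld: the synthesize() function below is a hypothetical placeholder, and the point is the bookkeeping, i.e. every saved clip gets a matching metadata.csv row:

```python
from pathlib import Path

def synthesize(text: str) -> bytes:
    """Hypothetical placeholder: replace with a real call that returns WAV bytes."""
    raise NotImplementedError("wire this to InWorld (or any TTS) yourself")

# Short prompt list for illustration; real collection should cover
# varied emotions, speeds, and tones across 30-60 minutes of audio.
PROMPTS = [
    "Hello, my name is Lena.",
    "Could you repeat that more slowly, please?",
    "That is wonderful news!",
]

def collect(dataset_dir: str = "dataset") -> None:
    wav_dir = Path(dataset_dir) / "wavs"
    wav_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for i, text in enumerate(PROMPTS, start=1):
        wav_path = wav_dir / f"clip_{i:03d}.wav"
        wav_path.write_bytes(synthesize(text))
        rows.append(f"wavs/{wav_path.name}|{text}")  # LJSpeech-style row
    (Path(dataset_dir) / "metadata.csv").write_text("\n".join(rows) + "\n")

if __name__ == "__main__":
    collect()
```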
Step 2: Prepare Dataset
```
dataset/
    wavs/
        clip_001.wav
        clip_002.wav
        ...
    metadata.csv
```
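With upstream piper_train installed (from the rhasspy/piper repo), the dataset is first preprocessed into the trainer's working format. The flags follow Piper's training guide, with paths adjusted to this layout:

```sh
python -m piper_train.preprocess \
  --language en-us \
  --input-dir dataset/ \
  --output-dir training_dir/ \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050
```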
Step 3: Train Model
Use the AI server (local-ai) with its RTX 3090s. Fine-tuning resumes from the downloaded checkpoint; the flags follow Piper's training guide, with batch size and epochs tuned to your GPU and data:

```sh
python -m piper_train \
  --dataset-dir training_dir/ \
  --accelerator gpu --devices 1 \
  --batch-size 32 \
  --max_epochs 3000 \
  --resume_from_checkpoint en_US-lessac-medium.ckpt \
  --checkpoint-epochs 1
```
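After training, export the checkpoint to ONNX and keep the JSON config alongside it. The lightning_logs path is where PyTorch Lightning writes checkpoints; the exact epoch/step filename depends on your run, hence the glob:

```sh
python -m piper_train.export_onnx \
  training_dir/lightning_logs/version_0/checkpoints/*.ckpt \
  lena_custom.onnx
cp training_dir/config.json lena_custom.onnx.json
```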
Step 4: Deploy
Copy the trained .onnx and its matching .onnx.json to:
/opt/mcp-servers/voice/piper_models/
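A quick smoke test with the piper CLI (assuming piper is on the PATH) confirms the deployed model loads and speaks:

```sh
echo "Hello from the new local voice." | \
  piper --model /opt/mcp-servers/voice/piper_models/lena_custom.onnx \
        --output_file test.wav
```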
Resources
- Piper TTS: https://github.com/rhasspy/piper (TRAINING.md covers fine-tuning in detail)
- TextyMcSpeechy: https://github.com/domesticatedviking/TextyMcSpeechy
Future Work
- Clone Lena (female) voice from InWorld samples
- Clone LARS (male) voice from InWorld samples
- Update voice MCP to use custom models
- Seamless cloud/local voice consistency