Inworld AI TTS - ElevenLabs Alternative

Research Date: 2025-12-28
Track Project: 390b2ed1
Status: Evaluation in progress

Executive Summary

Inworld AI TTS is dramatically cheaper than ElevenLabs:
- Inworld: $5 per 1 million characters
- ElevenLabs Pro: $100/month
- Savings: up to 95%, assuming about 1 million characters per month ($5 vs $100)

Current Promotion: FREE until December 31, 2025!

Pricing Comparison

Service	Cost	Notes
Inworld AI	$5 / 1M chars	~$0.25 per audio hour
ElevenLabs Pro	$100/month	Fixed monthly
Inworld Promo	FREE	Until Dec 31, 2025

Inworld also includes:
- 2 million free characters for new users
- Free zero-shot voice cloning
- No per-character charge during promo
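The per-hour figure and the savings percentage can be sanity-checked with a short calculation. This is a rough sketch, not vendor math: the 50,000 characters per audio hour is an assumption back-derived from the "~$0.25 per audio hour" note, and the function names are illustrative.

```python
INWORLD_USD_PER_MILLION_CHARS = 5.0   # from the pricing table
ELEVENLABS_PRO_USD_PER_MONTH = 100.0  # from the pricing table
CHARS_PER_AUDIO_HOUR = 50_000         # assumption: implied by ~$0.25 per audio hour


def inworld_cost(chars: int) -> float:
    """Pay-as-you-go cost in USD for a given number of characters."""
    return chars / 1_000_000 * INWORLD_USD_PER_MILLION_CHARS


def savings_vs_flat_plan(chars_per_month: int) -> float:
    """Fractional saving versus the flat monthly fee (negative if Inworld costs more)."""
    return 1 - inworld_cost(chars_per_month) / ELEVENLABS_PRO_USD_PER_MONTH
```

Under these numbers the 95% saving holds at about 1 million characters per month, and the two options cost the same at 20 million characters per month. The comparison ignores any usage limits on the flat plan.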

API Structure

Endpoints

POST https://api.inworld.ai/tts/v1/voice          # Standard
POST https://api.inworld.ai/tts/v1/voice:stream   # Streaming

Authentication

Basic auth with Base64-encoded API key:

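import os

# Assumption: the key copied from the Portal is already Base64-encoded.
# INWORLD_API_KEY is a hypothetical environment variable name.
base64_api_key = os.environ["INWORLD_API_KEY"]

headers = {
    "Authorization": f"Basic {base64_api_key}",
    "Content-Type": "application/json"
}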

Request Format

{
    "text": "Hello world",
    "voiceId": "Ashley",
    "modelId": "inworld-tts-1"
}

Response Format

{
    "result": {
        "audioContent": "<base64-encoded-audio>"
    }
}
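Putting the endpoint, header, request, and response shapes together, a call to the standard endpoint might look like the sketch below. It uses only the standard library and only the fields shown in this note; the function names are illustrative, and the response shape is taken from the example above rather than verified against the live API.

```python
import base64
import json
import urllib.request

INWORLD_TTS_URL = "https://api.inworld.ai/tts/v1/voice"


def build_request(text: str, api_key_b64: str, voice_id: str = "Ashley",
                  model_id: str = "inworld-tts-1") -> urllib.request.Request:
    """Build the POST request from the header and payload fields in this note."""
    payload = {"text": text, "voiceId": voice_id, "modelId": model_id}
    return urllib.request.Request(
        INWORLD_TTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {api_key_b64}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_audio(response_body: bytes) -> bytes:
    """Decode the Base64 audio from a response shaped like the example."""
    body = json.loads(response_body)
    return base64.b64decode(body["result"]["audioContent"])


def synthesize(text: str, api_key_b64: str) -> bytes:
    """Send the request and return raw audio bytes (performs a network call)."""
    with urllib.request.urlopen(build_request(text, api_key_b64), timeout=30) as resp:
        return extract_audio(resp.read())
```

Keeping `build_request` and `extract_audio` separate from the network call lets the voice server unit-test payload handling without an API key.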

Available Models

Model	ID	Features
Inworld TTS	`inworld-tts-1`	Rich, expressive, low-latency
Inworld TTS Max	`inworld-tts-1-max`	More expressive, better multilingual

Supported Languages: English, German, Spanish, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Russian, Chinese (12 total)

Voice Cloning

Instant Cloning (All Users)

Only 5-15 seconds of audio needed
Up to 3 samples
Formats: wav, mp3, webm
Max 16MB total
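Before uploading samples, the limits above can be checked locally. The following is a minimal sketch with illustrative names; it assumes "16MB" means 16 MiB and does not check the 5-15 second duration, which would require decoding the audio.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".wav", ".mp3", ".webm"}
MAX_SAMPLES = 3
MAX_TOTAL_BYTES = 16 * 1024 * 1024  # assumption: "16MB" read as 16 MiB


def check_clone_samples(files: dict[str, int]) -> list[str]:
    """Return a list of problems for {filename: size_in_bytes}; empty means OK."""
    problems = []
    if not files:
        problems.append("no samples provided")
    if len(files) > MAX_SAMPLES:
        problems.append(f"too many samples: {len(files)} > {MAX_SAMPLES}")
    for name in files:
        if Path(name).suffix.lower() not in ALLOWED_SUFFIXES:
            problems.append(f"unsupported format: {name}")
    total = sum(files.values())
    if total > MAX_TOTAL_BYTES:
        problems.append(f"total size {total} bytes exceeds {MAX_TOTAL_BYTES}")
    return problems
```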

Professional Cloning (Contact Sales)

30+ minutes of audio
Higher quality output
For unique voices/accents

Expression & Markup Features

Emphasis (use asterisks)

We *need* a beach vacation

The word "need" will be emphasized.

Non-Verbal Tags

[breathe] [clear_throat] [cough] [laugh] [sigh] [yawn]

Example:

[clear_throat] Did you hear what I said? [sigh] You never listen!
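Since only the six tags listed above are documented here, it may be worth catching unknown tags before text is sent. A small sketch, assuming any other bracketed word is a mistake (the helper name is illustrative, and how the service treats unknown tags is not covered in this note):

```python
import re

SUPPORTED_TAGS = {"breathe", "clear_throat", "cough", "laugh", "sigh", "yawn"}
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def unsupported_tags(text: str) -> list[str]:
    """Return bracketed tags in the text that are not in the supported list."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in SUPPORTED_TAGS]
```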

Text Normalization (Speak Out)

Type	Written	Spoken
Phone	(123)456-7891	one two three, four five six...
Date	5/6/2025	may sixth twenty twenty five
Time	12:55 PM	twelve fifty-five PM
Email	test@example.com	test at example dot com
Money	$5,342.29	five thousand three hundred...
Math	2+2=4	two plus two equals four
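If text has to be pre-normalized on our side before sending, two of the simpler rows in the table can be sketched as follows. This is an illustration only: it covers email addresses and single-digit arithmetic, the helper names are hypothetical, and dates, times, money, and phone numbers would need a proper normalization library.

```python
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
OPERATORS = {"+": "plus", "-": "minus", "=": "equals"}


def speak_email(address: str) -> str:
    """Spell out '@' and '.' in an email address."""
    return address.replace("@", " at ").replace(".", " dot ")


def speak_simple_math(expr: str) -> str:
    """Spell out an expression made only of single digits and + - = signs."""
    words = []
    for ch in expr.replace(" ", ""):
        if ch in "0123456789":
            words.append(DIGITS[int(ch)])
        elif ch in OPERATORS:
            words.append(OPERATORS[ch])
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return " ".join(words)
```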

Natural Speech

Add filler words for realism: "uh", "um", "well", "like"

Migration Path from ElevenLabs

What Changes

API endpoint URL
Authentication method (Bearer → Basic)
Request payload structure
Voice IDs (need to map or clone)

What Stays Similar

Send text, get base64 audio back
Streaming support available
Voice cloning available
Queue-based playback works the same

Migration Steps

1. Create Inworld account
2. Generate API key in Portal
3. Clone Nexus/LARS voices (instant cloning)
4. Update voice server to use Inworld API
5. Test with both services in parallel
6. Switch over when satisfied

Potential Voice Protocol Updates

If we switch to Inworld, add to workflow:

**inworld_formatting**: Use *asterisks* for emphasis. Normalize numbers/dates to spoken form. Add [sigh], [laugh] etc. for expression.

Resources

Pricing: https://inworld.ai/pricing
TTS Docs: https://docs.inworld.ai/docs/tts/tts
Quickstart: https://docs.inworld.ai/docs/quickstart-tts
Best Practices: https://docs.inworld.ai/docs/tts/best-practices/generating-speech
Voice Cloning: https://docs.inworld.ai/docs/tts/voice-cloning

Recommendation

Strong candidate for migration.

Pros:
- 95% cost savings ($5/1M vs $100/mo)
- Currently FREE through Dec 2025
- Voice cloning included free
- Expression markup (asterisks, non-verbal tags)
- Streaming support
- 12 languages

Cons:
- Need to clone our voices (quick process)
- Different API structure (minor code changes)
- Less mature ecosystem than ElevenLabs

Next Step: Create Inworld account, clone Nexus voice, test quality.