Text-to-Audio in 2025: The Technical Landscape Engineers Actually Need
Open-source models now match proprietary ones in blind tests. Here's how to choose between them.
Text-to-speech technology has reached an inflection point. In blind listening tests, open-source models like Kokoro, Orpheus, and Chatterbox are now indistinguishable from proprietary services like ElevenLabs for most use cases.
That changes the economics of everything from voice assistants to audiobook production.
This article breaks down what's actually happening in TTS, the technical tradeoffs you need to understand, and which models make sense for different applications.
The Two Categories That Matter
Modern TTS models fall into two distinct categories serving fundamentally different purposes:
Real-time models — Cartesia Sonic, ElevenLabs Flash, Kokoro. These prioritize streaming audio generation, producing speech as text arrives. Essential for conversational AI where latency determines whether dialogue feels natural or awkward. Often sacrifice some prosodic quality for speed.
Quality-first models — Orpheus, Dia, VibeVoice. These optimize for naturalness, emotional range, and multi-speaker dialogue. Better for audiobooks, podcasts, and content where you can wait a few seconds for superior output.
The key metric dividing them: Time to First Byte (TTFB). Real-time models target sub-200ms TTFB. Quality models may take seconds but produce audio you'd mistake for human recordings.
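If you want to benchmark this yourself, TTFB is just the delay until a streaming call yields its first audio chunk. A minimal sketch, assuming a synthesize_stream callable that stands in for whatever streaming interface your provider or model exposes:

import time

def measure_ttfb(synthesize_stream, text):
    # Seconds from request to first audio chunk.
    # synthesize_stream is any callable yielding audio chunks,
    # e.g. a thin wrapper around your TTS provider's streaming endpoint.
    start = time.perf_counter()
    for chunk in synthesize_stream(text):
        return time.perf_counter() - start  # first chunk arrived: that's TTFB
    raise RuntimeError("stream produced no audio")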
The Open-Source Models Worth Knowing
Kokoro (82M parameters)
The speed champion. Processes text in under 0.3 seconds across all tested input lengths. Apache-licensed, meaning you can deploy it commercially without restrictions.
Languages: 8 (American/British English, French, Korean, Japanese, Mandarin, and others)
Voices: 54 preset options
Cost: ~$0.65-0.80 per million characters via hosted APIs, or free for self-hosting
Limitation: No voice cloning. Lower naturalness than larger models.
Kokoro's 82M parameter count makes it runnable on modest hardware. That's the entire point — it's optimized for edge deployment and cost-sensitive applications where you need TTS at scale.
Orpheus (3B, 1B, 400M, 150M variants)
Built on Llama, Orpheus excels at emotionally nuanced speech. It supports tag-based emotion control — you can specify that a line should sound excited, sad, or whispered.
Latency: ~200ms TTFB
License: MIT (fully permissive)
Voice cloning: Zero-shot, meaning it can replicate a voice from a short sample
Best for: Virtual assistants, gaming dialogue, audiobooks requiring emotional range
The configurable model sizes (150M to 3B) let you trade off quality against compute requirements. The 150M variant runs on consumer hardware; the 3B variant needs serious GPU resources but produces the most human-like output.
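To make the tag-based control concrete, here's the shape of a tagged prompt. The tag names below are illustrative; check the Orpheus release notes for the exact supported vocabulary:

# illustrative tagged inputs; the exact tag set depends on the Orpheus release
prompts = [
    "I can't believe we actually shipped it on time <laugh> for once.",
    "It's been a long week <sigh> but we're finally done.",
]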
Chatterbox (500M parameters)
Developed by Resemble AI and built on a 0.5B-parameter Llama backbone. In side-by-side evaluations against ElevenLabs, listeners consistently preferred Chatterbox or couldn't tell the two apart.
Voice cloning: 5-second samples
License: MIT
Latency: Sub-200ms on suitable hardware
Notable: Configurable expressiveness via emotion prompts
If you're new to TTS and want a starting point that balances quality, speed, and ease of use, Chatterbox is the current recommendation from most practitioners.
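Getting started looks roughly like this; a sketch based on the chatterbox-tts package's documented usage (file paths are placeholders, and the API may drift between releases):

# pip install chatterbox-tts
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# built-in default voice
wav = model.generate("Welcome to the show.")
ta.save("default_voice.wav", wav, model.sr)

# zero-shot cloning from a ~5-second reference clip
wav = model.generate("Welcome to the show.", audio_prompt_path="reference.wav")
ta.save("cloned_voice.wav", wav, model.sr)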
Dia (1.6B parameters)
Purpose-built for multi-speaker dialogue. Unlike most TTS models, Dia generates flowing conversations between speakers from a single text script, including nonverbal elements like laughter, coughing, and sighing.
Speakers: Uses [S1] and [S2] tags to differentiate voices
Language: English only
Best for: Podcasts, audio dramas, game dialogues, conversational interfaces
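A Dia script looks something like the sketch below; the [S1] and [S2] tags come from the model card, while the parenthetical nonverbal cues are illustrative:

# speaker tags alternate voices; parenthetical cues like (laughs) render as audio
script = (
    "[S1] Did you catch the new episode? "
    "[S2] I did. (laughs) That ending caught me completely off guard. "
    "[S1] (sighs) Don't spoil it, I'm still halfway through."
)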
VibeVoice
The long-form specialist. Can synthesize up to 90 minutes of speech with up to four distinct speakers, something most TTS models can't handle at all. Its 7.5Hz speech tokenizer runs at a far lower frame rate than typical acoustic tokenizers, which is what makes generation at that length tractable.
Best for: Full podcast episodes, audiobook chapters, any content where you need extended multi-speaker output without stitching together shorter clips.
The Proprietary Benchmarks
For comparison, here's where the major commercial APIs stand:
ElevenLabs Flash v2.5 — Sub-100ms TTFB, 30+ languages, 5-second voice cloning. The current quality/speed leader for commercial use. ~$0.18-0.30 per 1,000 characters depending on plan.
Cartesia Sonic 2.0 — ~40ms TTFB in turbo mode. Fastest commercial option. 15 realistic voices with instant voice cloning.
OpenAI TTS (GPT-4o mini) — ~250ms TTFB, 32 languages. Expression customizable via prompting. Limited to 10 voices.
Deepgram Aura-2 — Sub-200ms, simple per-character pricing. Only 2 languages, no voice cloning. Enterprise-focused.
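For a sense of the integration effort, a commercial API call is a few lines. A sketch using the OpenAI Python SDK (the model and voice names are assumptions based on current documentation):

# pip install openai  (expects OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",  # assumed current mini TTS model name
    voice="alloy",
    input="Hello from the benchmark section.",
) as response:
    response.stream_to_file("out.mp3")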
The Five Axes of TTS Selection
When evaluating models, these are the dimensions that actually matter:
1. Naturalness
How human does it sound? Community voting platforms like TTS Arena provide Elo ratings based on thousands of blind comparisons. Kokoro v1.0 currently has a 44% win rate — meaning it wins against other models in nearly half of head-to-head tests.
2. Voice Cloning Quality
Zero-shot cloning replicates a voice from seconds of reference audio. Quality is measured via speaker similarity scores — how well the generated voice matches the original when tested by automatic speaker recognition systems.
3. Word Error Rate (WER)
How accurately does the synthesized speech transcribe back to text? A crude but useful proxy for intelligibility. Tested by running TTS output through a speech-to-text model and comparing results.
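The round trip is easy to script with any open STT model; a sketch using the openai-whisper and jiwer packages (the WAV path is a placeholder for your model's output):

# pip install openai-whisper jiwer
import whisper
import jiwer

reference = "The quick brown fox jumps over the lazy dog."
stt = whisper.load_model("base")
hypothesis = stt.transcribe("tts_output.wav")["text"]  # placeholder file

print(f"WER: {jiwer.wer(reference.lower(), hypothesis.lower()):.2%}")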
4. Latency (RTFx)
Real-Time Factor X measures how fast a model generates audio relative to playback length. An RTFx of 10 means generating 1 second of audio takes 0.1 seconds of compute.
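In code, RTFx is just audio duration divided by wall-clock compute time:

import time

def rtfx(generate, text, sample_rate=24_000):
    # generate is a stand-in for any TTS call returning raw samples
    start = time.perf_counter()
    audio = generate(text)
    compute_seconds = time.perf_counter() - start
    return (len(audio) / sample_rate) / compute_seconds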
5. Parameter Count
Directly correlates with compute cost. Kokoro at 82M can run on a laptop. Orpheus 3B needs a serious GPU. For edge deployment or cost-sensitive applications, this is often the deciding factor.
Practical Deployment Patterns
Voice agents / chatbots: Prioritize latency. Cartesia Sonic or ElevenLabs Flash for commercial; Kokoro for self-hosted.
Audiobook narration: Prioritize naturalness and emotion. Orpheus 3B or Dia for dialogue-heavy content.
Podcast generation: VibeVoice for long-form multi-speaker, Dia for shorter conversational segments.
Multilingual applications: ElevenLabs (32 languages) or Google TTS (50+ languages) for breadth. XTTS-v2 for self-hosted multilingual with voice cloning.
Edge / embedded: Kokoro or Mimic 3 for resource-constrained environments.
The Voice Cloning Landscape
Zero-shot voice cloning has become table stakes. Here's how the options compare:
ElevenLabs: 5-second samples, highest clone fidelity (2.83% WER), industry benchmark
XTTS-v2: 6-second samples, 17 languages, cross-language cloning (clone an English voice, output in French). Non-commercial license only.
Chatterbox: 5-second samples, MIT licensed, quality approaching ElevenLabs
Orpheus: Zero-shot cloning with emotion control, MIT licensed
The licensing matters: XTTS-v2 uses the Coqui Public Model License, which restricts commercial use. If you're building a product, Chatterbox or Orpheus (both MIT) are safer choices.
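For reference, cross-language cloning with XTTS-v2 goes through the Coqui TTS library; a sketch (the reference clip path is a placeholder, and the non-commercial license applies to the weights):

# pip install TTS  (Coqui; XTTS-v2 weights carry the non-commercial CPML)
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# clone from an English reference sample, synthesize in French
tts.tts_to_file(
    text="Bonjour, ceci est un test de clonage de voix.",
    speaker_wav="english_reference.wav",  # ~6-second reference clip
    language="fr",
    file_path="cloned_fr.wav",
)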
Cost Comparison
Rough pricing as of late 2025:
ElevenLabs: ~$0.18-0.30 per 1K characters
OpenAI TTS: ~$0.015 per 1K characters (standard)
Google TTS: ~$0.016 per 1K characters (Neural)
Deepgram Aura: ~$0.015 per 1K characters
Kokoro (hosted): ~$0.0007 per 1K characters
Kokoro (self-hosted): free (compute costs only)
The difference between ElevenLabs at $0.30/1K and Kokoro at $0.0007/1K is roughly 400x. At scale, that's the difference between a viable product and bankruptcy.
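A back-of-the-envelope makes the gap tangible. At 100 million characters a month, using the rates above:

monthly_chars = 100_000_000

rates_per_1k = {  # USD per 1,000 characters, from the table above
    "ElevenLabs": 0.30,
    "OpenAI TTS": 0.015,
    "Kokoro (hosted)": 0.0007,
}

for name, rate in rates_per_1k.items():
    print(f"{name}: ${monthly_chars / 1_000 * rate:,.0f}/month")
# ElevenLabs: $30,000/month vs. Kokoro hosted: $70/month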
Running Models Locally
Self-hosting TTS is straightforward for experimentation but complex at scale. The basic pattern:
# Kokoro example (API as in the hexgrad/kokoro README)
# pip install kokoro soundfile  (plus the espeak-ng system package)
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')  # 'a' = American English
# the pipeline is a generator yielding (graphemes, phonemes, audio) chunks at 24kHz
for i, (gs, ps, audio) in enumerate(pipeline("Your text here", voice='af_heart')):
    sf.write(f'out_{i}.wav', audio, 24000)
For production deployment, you'll need to handle GPU orchestration, load balancing, and request queuing. Platforms like Modal, Replicate, and DeepInfra provide hosted inference that handles the infrastructure complexity.
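At its simplest, the serving layer is a thin HTTP wrapper around a model loaded once at startup; a hedged FastAPI sketch around the Kokoro pipeline above (queuing, batching, and GPU orchestration are exactly the parts those platforms take off your hands):

# pip install fastapi uvicorn soundfile kokoro
import io
import numpy as np
import soundfile as sf
from fastapi import FastAPI
from fastapi.responses import Response
from kokoro import KPipeline

app = FastAPI()
pipeline = KPipeline(lang_code="a")  # load once, reuse across requests

@app.post("/tts")
def tts(text: str):
    # concatenate the generator's chunks into a single 24kHz clip
    chunks = [np.asarray(audio) for _, _, audio in pipeline(text, voice="af_heart")]
    buf = io.BytesIO()
    sf.write(buf, np.concatenate(chunks), 24_000, format="WAV")
    return Response(content=buf.getvalue(), media_type="audio/wav")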
What's Coming
The trajectory is clear: open-source TTS quality is converging with proprietary offerings while costs collapse. Models are getting smaller (Kokoro proves 82M parameters can compete with billion-parameter models) and faster (sub-100ms is becoming standard).
The next frontier is controllability — fine-grained control over emotion, pacing, emphasis, and style without retraining. Models like Zonos already offer controls for happiness, fear, sadness, and anger via simple parameters.
For most applications, the question is no longer whether to use AI-generated speech, but which model fits your latency, quality, and cost constraints.
Quick Reference
Fastest: Cartesia Sonic (~40ms) or Kokoro (~300ms)
Most natural: ElevenLabs, Orpheus 3B, Chatterbox
Best voice cloning: ElevenLabs, XTTS-v2 (non-commercial)
Multi-speaker dialogue: Dia, VibeVoice
Cheapest at scale: Kokoro self-hosted
Most languages: Google TTS (50+), ElevenLabs (32)
All pricing and specifications current as of November 2025. The TTS landscape evolves rapidly — verify current capabilities before production deployment.

