Updated Feb 24, 2024
Voice Quality
In head-to-head evaluations, our blinded human tests showed that Sonic-2 was preferred over ElevenLabs's Flash V2 model by a significant margin (61.4% vs 38.6%).
Blinded human evaluation is a method in which evaluators assess the outputs of generative voice models without knowing which model produced them, which helps reduce bias. Outputs from the different models are presented anonymously, so evaluators' preconceived notions about specific models or their developers cannot influence the assessment.
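As a rough illustration of the blinding idea described above, the sketch below hides model identity behind neutral labels and tallies preferences afterward. It is a minimal sketch, not the evaluation pipeline used for these results: the function names, the `"X"`/`"Y"` labels, and the `model_a`/`model_b` identifiers are all hypothetical.

```python
import random


def make_blinded_trial(sample_a, sample_b, rng):
    """Shuffle two samples and return (shown, key).

    `shown` maps the neutral labels "X"/"Y" to audio samples (what the
    evaluator sees); `key` maps the same labels back to the hypothetical
    model names and is kept hidden until votes are tallied.
    """
    entries = [("model_a", sample_a), ("model_b", sample_b)]
    rng.shuffle(entries)  # random presentation order removes position cues
    shown = {"X": entries[0][1], "Y": entries[1][1]}
    key = {"X": entries[0][0], "Y": entries[1][0]}
    return shown, key


def preference_share(votes, keys):
    """Unblind the votes and return each model's share of preferences.

    `votes` is a list of "X"/"Y" choices; `keys` is the matching list of
    hidden key dicts produced by make_blinded_trial.
    """
    counts = {"model_a": 0, "model_b": 0}
    for vote, key in zip(votes, keys):
        counts[key[vote]] += 1
    total = sum(counts.values())
    return {model: count / total for model, count in counts.items()}


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    shown, key = make_blinded_trial("a.wav", "b.wav", rng)
    print(shown)  # the evaluator sees only the neutral labels
    print(preference_share(["X"], [key]))
```

A headline figure such as 61.4% vs 38.6% would correspond to the two values returned by a tally like `preference_share` over all trials.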
Latency
We report two latency figures, model latency and Time to First Audio (TTFA), measured from Asia, the US, and Europe. For each provider, we calculate the 90th percentile (P90) from 100 measurements.
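To make the P90 figure concrete, here is a minimal sketch of computing it from a batch of latency samples. The text does not say which percentile definition is used, so this assumes the nearest-rank method; the function name `p90_nearest_rank` is hypothetical.

```python
def p90_nearest_rank(samples_ms):
    """Return the 90th percentile of latency samples (nearest-rank method).

    Assumption: nearest-rank is used; other definitions (for example,
    linear interpolation) can give slightly different values.
    """
    if not samples_ms:
        raise ValueError("need at least one measurement")
    ordered = sorted(samples_ms)
    n = len(ordered)
    # Integer ceiling of 0.9 * n, avoiding floating-point rounding.
    rank = (n * 90 + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # With 100 measurements, P90 is the 90th smallest value.
    measurements = list(range(1, 101))  # stand-in latencies in ms
    print(p90_nearest_rank(measurements))  # prints 90
```

Under this definition, with 100 measurements the 10 slowest requests do not affect the reported value, while anything in the fastest 90 can.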
| Features | Cartesia | ElevenLabs |
| --- | --- | --- |
| Latency | 40 ms for the Sonic Turbo model, 90 ms for the Sonic 2.0 model | 75 ms for the lower-quality Flash model, and 300 ms+ for the full model |
| Voice Quality | Consistently rated as more natural, expressive, and realistic in blinded human evaluations | Lower ratings for depth and reliability in human evaluations |
| Character Limits | Infinite request length | Limited to 40k characters per request |
| Instant Cloning | Requires 3 seconds of audio | Requires 10 seconds of audio |
| Professional Voice Cloning | Requires 30 minutes of audio | Requires 60 minutes of audio |
| Pronunciation Accuracy | IPA support with strong contextual understanding | IPA support, isolated pronunciation |
| Voice Customizations | Fully customizable voice with speed and emotion controls, plus synthetic voice mixing and design | Stability, similarity, and style exaggeration controls |
| Telephony Optimization | 8 kHz audio, telephony-optimized voices | 8 kHz audio, telephony-optimized voices |
| Flexible Deployments | Supports both on-prem and on-device deployments | No on-device or on-prem support |
| Languages Supported | 15 languages with extensive dialect coverage | 32 languages |
| Concurrency | Up to 15 on highest self-serve tier (60 parallel conversations), custom for enterprise | Up to 15 on highest self-serve tier, custom for enterprise |
Pronunciation Accuracy
Best-in-class pronunciation: get complex phone numbers, addresses, and IDs right every time.
Voice Cloning
Cartesia requires only 3 seconds of audio recording to create high-quality instant voice clones, while ElevenLabs needs 30 seconds.
Voice Design
Cartesia stands out as the only provider offering emotion and speed modulation features, enabling refined voice adjustments.
| Plan | Cartesia | ElevenLabs |
| --- | --- | --- |
| Free | $0 per month with 10k free credits | $0 per month with 10k characters |
| Pro | $5 per month with 100k credits | $5 per month with 30k characters |
| Startup | $49 per month with 1.25M credits | $11 per month with 100k characters |
| Scale | $299 per month with 8M credits | $99 per month with 500k characters |
| Enterprise | Trusted by Fortune 500 companies | $330 per month with 2M characters |