Updated Feb 24, 2024

Voice Quality
In head-to-head evaluations, our blinded human tests showed that Sonic-2 was preferred over ElevenLabs's Flash V2 model by a significant margin (61.4% vs 38.6%).
Blinded human evaluation is a method in which evaluators assess the outputs of generative voice models without knowing which model produced them, which helps reduce bias. Outputs from the different models are presented anonymously, so evaluators' preconceived notions about specific models or their developers cannot influence their assessment.
62% preferred Cartesia over ElevenLabs (20%)
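The blinded procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the evaluation harness actually used: the names `make_blinded_trials`, `tally_preferences`, `model_x`, and `model_y` are hypothetical, and each trial is assumed to be a simple two-way preference between clips of the same prompt.

```python
import random
from collections import Counter


def make_blinded_trials(pairs, seed=0):
    """Turn (prompt_id, clip_from_model_x, clip_from_model_y) tuples into
    anonymized trials. Evaluators only see clips labeled "A" and "B";
    the answer key is kept separately and used only after voting."""
    rng = random.Random(seed)
    trials, key = [], {}
    for prompt_id, clip_x, clip_y in pairs:
        labeled = [("model_x", clip_x), ("model_y", clip_y)]
        rng.shuffle(labeled)  # randomize which model is shown as "A"
        trials.append({"prompt_id": prompt_id,
                       "A": labeled[0][1],
                       "B": labeled[1][1]})
        key[prompt_id] = {"A": labeled[0][0], "B": labeled[1][0]}
    return trials, key


def tally_preferences(votes, key):
    """votes: iterable of (prompt_id, "A" or "B").
    Unblinds each vote with the key and returns the preference share per model."""
    counts = Counter(key[prompt_id][choice] for prompt_id, choice in votes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes to tally")
    return {model: counts[model] / total for model in ("model_x", "model_y")}
```

Keeping the answer key out of the trial records is the point of the design: whatever is shown to evaluators carries no model name, and the mapping is applied only when votes are tallied.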


Latency
We report two latency metrics, model latency and Time to First Audio (TTFA), measured from Asia, the US, and Europe. For each provider, we calculate the 90th percentile (P90) from 100 measurements.
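As a rough illustration of this measurement, the sketch below times how long a streaming request takes to deliver its first audio chunk and reports the P90 over repeated runs. It makes several assumptions not stated on this page: `start_stream` is a hypothetical callable standing in for a provider's streaming request, and the percentile uses the nearest-rank method, which may differ from the method behind the published figures.

```python
import time


def p90(samples_ms):
    """90th percentile by the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one measurement")
    ordered = sorted(samples_ms)
    # 1-based rank = ceil(0.90 * n), computed with integers to avoid float error.
    rank = (90 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def time_to_first_audio_ms(start_stream):
    """Milliseconds from issuing the request until the first audio chunk arrives.

    `start_stream` is a hypothetical callable that sends the request and
    returns an iterator of audio chunks.
    """
    t0 = time.perf_counter()
    stream = start_stream()
    next(iter(stream))  # block until the first chunk is available
    return (time.perf_counter() - t0) * 1000.0


def benchmark_ttfa(start_stream, runs=100):
    """Collect `runs` TTFA measurements and return their P90 in milliseconds."""
    return p90([time_to_first_audio_ms(start_stream) for _ in range(runs)])
```

With 100 measurements, the nearest-rank P90 is simply the 90th smallest value, so a handful of slow outliers do not move the reported number the way they would move a maximum.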

Sonic-2 achieves a model latency of just 40ms

Stable latency between 128 and 135ms

State Space Model (SSM) architecture

40ms: Cartesia's Sonic-2 model latency
| Features | Cartesia | ElevenLabs |
| --- | --- | --- |
| Latency | 40ms for the Sonic Turbo model, 90ms for the Sonic 2.0 model | 75ms for the lower-quality Flash model, and 300ms+ for the full model |
| Voice Quality | Consistently rated as more natural, expressive, and realistic in blinded human evaluations | Lower depth and reliability ratings in human evaluations |
| Character Limits | Infinite request length | Limited to 40k characters per request |
| Instant Cloning | Requires 3 seconds of audio | Requires 10 seconds of audio |
| Professional Voice Cloning | Requires 30 minutes of audio | Requires 60 minutes of audio |
| Pronunciation Accuracy | IPA support with strong contextual understanding | IPA support, isolated pronunciation |
| Voice Customizations | Fully customizable voice with speed and emotion controls, plus synthetic voice mixing and design | Stability, similarity, and style exaggeration controls |
| Telephony Optimization | 8kHz audio, telephony-optimized voices | 8kHz audio, telephony-optimized voices |
| Flexible Deployments | Supports both on-prem and on-device deployments | No on-device or on-prem support |
| Languages Supported | 15 languages with extensive dialect coverage | 32 languages |
| Concurrency | Up to 15 on highest self-serve tier (60 parallel conversations), custom for enterprise | Up to 15 on highest self-serve tier, custom for enterprise |
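To make the character-limit row concrete: when a provider caps request length, long text has to be split on the client before synthesis. The sketch below shows one way to do that, using the 40k-character figure from the comparison above as the default. The function name `split_for_limit` and the sentence-boundary heuristic are assumptions made for illustration, not part of either provider's SDK.

```python
import re


def split_for_limit(text, max_chars=40_000):
    """Split text into chunks of at most max_chars characters,
    preferring to break at sentence boundaries."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent as its own request. Splitting at sentence boundaries keeps prosody breaks in natural places, though audio stitched from separate requests may still sound less continuous than a single long request.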
Pronunciation Accuracy
Best-in-class pronunciations: Get complex phone numbers, addresses, and IDs right every time.

[Audio samples: Cartesia vs. ElevenLabs]
Voice Cloning
Cartesia requires only 3 seconds of audio recording to create high-quality instant voice clones, while ElevenLabs needs 30 seconds.
[Audio samples, two sets: source recording, Cartesia clone, ElevenLabs clone]
Voice Design
Cartesia stands out as the only provider offering emotion and speed modulation features, enabling refined voice adjustments.

[Audio samples]
Cartesia, no dials vs. ElevenLabs, no dials
Cartesia, Positivity Emotion Dial vs. ElevenLabs, Highest Stability Dial
Cartesia, Speed Dial vs. ElevenLabs, Highest Similarity Dial

| Plan | Cartesia | ElevenLabs |
| --- | --- | --- |
| Free | $0 per month with 10k free credits | $0 per month with 10k characters |
| Pro | $5 per month with 100k credits | $5 per month with 30k characters |
| Startup | $49 per month with 1.25M credits | $11 per month with 100k characters |
| Scale | $299 per month with 8M credits | $99 per month with 500k characters |
| Enterprise | Trusted by Fortune 500 companies | $330 per month with 2M characters |
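One way to read the plan prices above is as a cost per 1,000 included units. The sketch below does that arithmetic with the figures from the pricing comparison. Note the assumptions: Cartesia quotes credits and ElevenLabs quotes characters, and this page does not say how a credit maps to a character, so the two columns are not necessarily directly comparable. The row labels are the ones used in the comparison, and the helper names are hypothetical.

```python
# (monthly price in USD, included units) per plan, copied from the comparison.
# Cartesia units are credits; ElevenLabs units are characters.
CARTESIA = {
    "Free": (0, 10_000),
    "Pro": (5, 100_000),
    "Startup": (49, 1_250_000),
    "Scale": (299, 8_000_000),
}
ELEVENLABS = {
    "Free": (0, 10_000),
    "Pro": (5, 30_000),
    "Startup": (11, 100_000),
    "Scale": (99, 500_000),
    "Enterprise": (330, 2_000_000),
}


def usd_per_1k_units(price_usd, units):
    """Monthly price divided by included units, scaled to 1,000 units."""
    return price_usd / units * 1000


def rate_table(plans):
    """Map each plan name to its USD cost per 1,000 included units."""
    return {tier: round(usd_per_1k_units(price, units), 4)
            for tier, (price, units) in plans.items()}


if __name__ == "__main__":
    for name, plans in (("Cartesia", CARTESIA), ("ElevenLabs", ELEVENLABS)):
        for tier, rate in rate_table(plans).items():
            print(f"{name:<11}{tier:<11}${rate:.4f} per 1k units")
```

The Cartesia Enterprise row is left out because the comparison lists no price for it.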
What Cartesia Customers Say
How does voice cloning work?
What is the latency for voice generation?
Can I customize the cloned voice?
How many languages does Cartesia support?






