Cartesia vs. ElevenLabs
September 19, 2024
We compared Cartesia and ElevenLabs, two providers in the text-to-speech (TTS) sector, to support an informed choice between them. The evaluation covers several key performance indicators, including:
- Text-to-speech output quality and latency
- Comprehensive feature set analysis
- Pricing structure and value proposition
The intent of this report is to equip decision-makers with the necessary insights to determine which solution aligns best with their organizational requirements.
This comparison is particularly valuable for real-time conversational use cases. The adoption of voice agents for conversational AI has increased significantly, driven by advancements in pronunciation accuracy, reduced latency, and more realistic voices.
By systematically examining these critical aspects, we aim to support your strategic decision-making process in selecting the most suitable TTS technology provider for your business needs.
Quick Overview
Feature | Cartesia | ElevenLabs |
---|---|---|
Latency | 95 ms + network time | 300 ms + network time |
Voice Quality | Consistently rated in human evals as more natural and realistic | Rated lower for depth and reliability in human evals |
Character Limits | Infinite request length | 40k characters for Turbo v2.5, request stitching |
Instant Cloning | Requires 10 seconds of audio | Requires 30 seconds of audio |
Professional Voice Cloning | Requires 10 minutes of audio | Requires 30 minutes (3 hours optimal) of audio |
Pronunciation Accuracy | IPA Support, strong contextual understanding | IPA Support, isolated pronunciation |
Voice Customizations | Speed and emotion controls + synthetic voice mixing and design | Stability, similarity, and style exaggeration controls |
Telephony Optimization | 8kHz audio, telephony optimized voices | 8kHz audio |
On-Device | Real-time generation on-device | None |
Languages Supported | 13 | 32 |
Concurrency | Up to 15 on highest self-serve tier, custom for enterprise | Up to 15 on highest self-serve tier, custom for enterprise |
Voice Quality
We conducted both a human preference ranking and an automated evaluation using NISQA, a widely recognized model for assessing speech quality and naturalness.
Expert human evaluators were enlisted from Mechanical Turk to review 50 diverse transcripts spanning general customer support, healthcare, payment information, maintenance, and other common conversational scenarios. Each transcript was evaluated three times, each time by a different annotator who had passed a baseline test, to compare the two TTS providers. Rankings were based on quality preference, defined by voice clarity, naturalness, and emotional sensitivity. Out of 50 transcripts, Cartesia was preferred 36 times and ElevenLabs 14 times. Similarly, NISQA awarded Cartesia an overall quality score of 4.7, compared to 4.38 for ElevenLabs.
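The per-transcript tally can be sketched as a majority vote over the three annotator judgments. The text does not say how the three judgments were combined, so majority voting is an assumption here. The function names, label strings, and sample votes are hypothetical, and only the 36 and 14 totals come from the study.

```python
from collections import Counter

def majority_preference(votes):
    """Return the label chosen by most annotators for one transcript."""
    label, _count = Counter(votes).most_common(1)[0]
    return label

def tally(per_transcript_votes):
    """Count how many transcripts each provider wins by majority vote."""
    return Counter(majority_preference(v) for v in per_transcript_votes)

# Hypothetical votes for three transcripts (three annotators each);
# these are NOT the study's raw data.
sample = [
    ["cartesia", "cartesia", "elevenlabs"],
    ["elevenlabs", "elevenlabs", "cartesia"],
    ["cartesia", "cartesia", "cartesia"],
]
sample_wins = tally(sample)

# Totals reported in the text: 36 vs. 14 out of 50 transcripts.
reported = {"cartesia": 36, "elevenlabs": 14}
total = sum(reported.values())
shares = {name: wins / total for name, wins in reported.items()}
```

With the reported totals, `shares` works out to a 72% preference for Cartesia and 28% for ElevenLabs.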
Independent platforms have produced similar evaluation results. According to LabelBox, a leading data labeling provider, Cartesia ranked first 27.93% of the time, compared to only 10.68% for ElevenLabs, in an evaluation of six providers (Figure 1). In LabelBox's Speech Generation rankings on September 9, 2024, Cartesia also had the highest Elo rating, with ElevenLabs in fourth place (Figure 2). Note also that ElevenLabs is a customer of LabelBox while Cartesia is not. Cartesia also holds the highest win rate, 75%, on the Text-to-Speech Arena run by Artificial Analysis, a leading provider of independent model evaluations (Figure 3). The Arena conducts blind human preference tests across different speech providers.
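For readers unfamiliar with Elo ratings, the standard update rule for pairwise comparisons can be sketched as follows. The K-factor of 32 and the starting ratings in the usage note are illustrative assumptions; the text does not describe the parameters either leaderboard actually uses.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, a_won, k=32.0):
    """Return both ratings after one blind comparison.

    k is an assumed K-factor, not a value taken from any leaderboard.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

Two equally rated voices each have an expected score of 0.5, so a single win moves the winner up by `k / 2` points and the loser down by the same amount.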
Latency
We measure latency using the Time to First Audio (TTFA) metric and report the 90th percentile of 100 TTFA measurements for each provider. With a TTFA of 199 ms, Cartesia's Sonic model is significantly faster than ElevenLabs at 832 ms on the self-serve tier (ElevenLabs reserves slightly lower latency for enterprise customers). Cartesia’s Sonic model is built on State Space Models (SSMs), a radically more efficient architecture that allows for far greater latency optimization than the traditional transformer architecture used by ElevenLabs and other providers. For a side-by-side comparison, watch the latency demonstration in this video on X.
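A measurement of this kind can be sketched in a few lines. This is a minimal illustration, not the benchmark harness behind the numbers above. `start_stream` is a hypothetical callable standing in for whichever client call opens a streaming TTS request. The nearest-rank percentile is one of several common conventions, and the text does not say which one was used.

```python
import math
import time

def time_to_first_audio(start_stream):
    """Seconds from issuing the request until the first audio chunk arrives.

    `start_stream` is a hypothetical zero-argument callable that sends the
    TTS request and returns an iterable of audio chunks.
    """
    t0 = time.perf_counter()
    stream = start_stream()
    next(iter(stream))  # block until the first chunk is available
    return time.perf_counter() - t0

def percentile_nearest_rank(samples, pct):
    """Smallest sample with at least pct% of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

Calling `time_to_first_audio` 100 times per provider and passing the results to `percentile_nearest_rank(samples, 90)` gives a P90 TTFA figure in seconds.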
Pronunciation Accuracy
Cartesia and ElevenLabs exhibit slight differences in sentence pronunciation. Cartesia excels at accurately pronouncing challenging content, such as acronyms, phone numbers, and uncommon words, and supports the International Phonetic Alphabet (IPA) for specialized use cases, like prescription drug names in healthcare. While ElevenLabs also offers reasonably accurate pronunciation, it shows less contextual awareness. For example, when pronouncing an abbreviated date like "Dec. 25, 2022," Cartesia delivers a more human-like pronunciation of "December," whereas ElevenLabs tends to interpret it more literally.
ElevenLabs:
Cartesia:
Voice Cloning
Cartesia provides unlimited instant voice cloning. In contrast, ElevenLabs restricts cloning capabilities with tiered plans allowing for 10, 30, 160, or 660 custom voices. Additionally, Cartesia utilizes advanced embedding technology to deliver consistent, high-quality voice clones. Cartesia preserves accents and voice quality even if the original audio clip is noisy. Cartesia's voice mixing and design capabilities also open up a more comprehensive set of diverse voices. In the following samples of Engelbart's clones, it's apparent that Cartesia produces a significantly less noisy and higher quality clone than ElevenLabs.
Source Audio:
ElevenLabs:
Cartesia:
Controllability
Cartesia stands out as the only provider offering emotion and speed modulation features, enabling refined voice adjustments while preserving a natural and seamless auditory experience. Cartesia also allows you to localize the voice to match different accents: for instance, you can start with an American voice and have it speak in a French accent. In comparison, ElevenLabs only offers controls for stability, similarity, and style exaggeration, none of which gives clear control over the voice. In the following example, where the speaker is addressing a customer complaint, we compare the effect of each provider's dials: the ElevenLabs voice sounds similar whichever dial is applied, while Cartesia's emotion and speed dials produce very noticeable changes.
ElevenLabs:
- No Dials:
- Highest Stability Dial:
- Highest Similarity Dial:
Cartesia:
- No Dials:
- Positivity Emotion Dial:
- Speed Dial:
Maximum Request Length and Prosody
Cartesia provides unlimited context length, in contrast to ElevenLabs, which restricts users to generating up to 40,000 characters per request. Extended context length allows for more consistent prosody, enhancing conversational and narrative experiences.
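When a provider caps request length, long text has to be split on the client before synthesis. A minimal sketch of sentence-boundary splitting follows. The function name and the simple regex sentence splitter are assumptions for illustration, not part of either provider's SDK. The default of 40,000 is the per-request limit cited above.

```python
import re

def split_for_tts(text, max_chars=40_000):
    """Split text into chunks of at most max_chars, breaking at sentence
    ends where possible and hard-splitting only oversized sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then becomes a separate request, and the seams between chunks are where prosody can become less consistent.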
Conclusion
Cartesia is a premium text-to-speech service that is both developer-first and perfect for real-time conversation use cases. Companies partner with Cartesia for their speech needs over other providers for three main reasons:
- Industry-Leading Latency: Cartesia is the only provider with end-to-end latency consistently under 200 ms across all languages. This enables response times that match the natural pacing of human conversation.
- Realistic Voices: Cartesia’s voices are nearly indistinguishable from human speech. The library features voices with a conversational tone with natural pausing and intonation learned from the context of the audio.
- Accurate Pronunciation: Cartesia nails the pronunciation of challenging content such as acronyms, phone numbers, and rare words. It also supports IPA (International Phonetic Alphabet) for specialized use cases such as prescription drug names in the healthcare sector.
If you’re building real-time voice agents, Sonic is purpose-built for you and offers significantly superior performance.
For any questions on our benchmarks or how to evaluate your TTS options, please reach out to support@cartesia.ai.