How does voice cloning work?

Voice cloning uses AI models to analyze and replicate a person's voice. Given a short audio sample, the system learns the unique characteristics of the voice, such as tone, pitch, and accent, and can then generate new speech that sounds like the original speaker. Cartesia's technology can create a high-quality clone from just 3 seconds of audio, making it efficient and user-friendly.

Which provider has the fastest text-to-speech voice model?

Cartesia's voice AI has a latency of under 40 ms. This lets users interact in real time, making it well suited to applications like customer support and gaming. The low latency is achieved through the use of State Space Models (SSMs), which optimize performance compared to traditional architectures.

Can I customize the voice output?

Yes, Cartesia allows for extensive customization of voice outputs. Users can adjust pitch, speed, and emotion to create a more engaging audio experience. This level of control is particularly beneficial for applications that require specific tonal qualities or emotional expressions, enhancing the overall effectiveness of the voice AI.

Updated Feb 19, 2025

Comparing ElevenLabs and Deepgram Voice AI Models

Q: What's a better alternative to ElevenLabs and Deepgram?

Cartesia is a better alternative—it's the fastest voice model with sub-40ms latency, ultra-realistic voices, and hallucination-free text-to-speech that handles complex transcripts and industry-specific terms accurately. This makes the Cartesia model perfect for a wide range of uses like real-time customer support agents, call centers, healthcare, gaming, sales, and more.

Explore the differences between ElevenLabs and Deepgram. Learn more about pricing, model performance, and product features.



Both platforms offer advanced voice AI capabilities, but ElevenLabs excels at fast voice generation, while Deepgram can create more natural voices.

| Feature | ElevenLabs | Deepgram |
| --- | --- | --- |
| Latency | 75 ms for the lower quality Flash Model, and 300 ms+ for the full model | Less than 250 ms |
| Voice Quality | Natural and realistic, widely used by all types of content creators | Human-like tone, rhythm, and emotion |
| Character Limits | Limited to 40k characters per request | Limited to 2k characters per request |
| Instant Cloning | Requires 10 seconds of audio | No voice cloning support |
| Professional Voice Cloning | Requires 60 minutes of audio | No voice cloning feature |
| Pronunciation Accuracy | IPA support but isolated pronunciation | No IPA support |
| Voice Customizations | Stability, similarity, and style exaggeration controls | Customization options may be limited |
| Telephony Optimization | 8 kHz audio, telephony-optimized voices | Audio quality may not meet all telephony needs |
| Flexible Deployments | No on-device or on-prem support | Supports on-premise and limited on-device capabilities |
| Languages Supported | 32 | English only |
| Concurrency | Up to 15 on highest self-serve tier, custom for enterprise | Up to 2 concurrent requests |
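The per-request character limits listed above mean that longer scripts have to be split before they are sent for synthesis. The following is a minimal sketch of one way to do that, preferring sentence boundaries. The function name `chunk_text` is hypothetical, the limit of 2,000 assumes "2k" means exactly 2,000 characters, and no provider API is called here.

```python
import re

def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

# Assumed limit taken from the "2k characters per request" row.
script = "Hello there. " * 400
requests_needed = chunk_text(script, max_chars=2000)
print(len(requests_needed), "requests needed")
```

Splitting at sentence boundaries rather than at a fixed offset avoids cutting a word or phrase in half, which would otherwise be audible at the seam between two synthesized clips.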

Looking for an ElevenLabs and Deepgram Alternative?

Voice Clone with 3s of Audio

Cartesia's voice cloning captures emotional depth from just 3 seconds of audio, with professional-grade cloning available from 60-minute samples.

Ultra-Realistic Voices

Experience lifelike voices that sound almost identical to human speech—perfect for creating engaging content and interactive voice agents.

Hallucination-Free Text-to-Speech

Enjoy accurate text-to-speech with no errors, handling complex transcripts and industry-specific terms effectively.

Enterprise Ready

Enterprise-grade reliability with 99.9% uptime, SOC2 compliance, and full on-premises support.

How they stack up

Voice Quality Comparison

In terms of speech naturalness, ElevenLabs scored Medium in 44.98% of cases, while Deepgram scored High in 57.78% of cases, making Deepgram's voices more natural than ElevenLabs'.

ElevenLabs, however, demonstrated higher pronunciation accuracy at 81.97%, whereas Deepgram's pronunciation accuracy was notably lower at 64.43%.

Overall, ElevenLabs excels in accuracy, while Deepgram shows promise in producing more natural-sounding speech.

Latency Evaluation Insights

In our latency evaluation, we measured the Time to First Audio (TTFA) for both ElevenLabs and Deepgram.

By calculating the 90th percentile score from 100 TTFA measurements for each provider, we found that ElevenLabs had a TTFA of 135ms, indicating a quick response time. Deepgram, while slightly slower, still performed well with a TTFA of 150ms.

This evaluation highlights ElevenLabs' advantage in low-latency performance, making it a strong choice for applications requiring immediate audio feedback.
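The 90th-percentile calculation described above can be sketched as follows. The sample values are synthetic placeholders, not the measurements behind the figures reported here, and the nearest-rank percentile definition is an assumption, since the evaluation does not state which method was used.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Synthetic TTFA measurements in milliseconds (placeholder data): 101..200.
ttfa_ms = [100 + i for i in range(1, 101)]
p90 = percentile_nearest_rank(ttfa_ms, 90)
print(f"P90 TTFA: {p90} ms")
```

Reporting the 90th percentile rather than the mean shows the latency that 9 out of 10 requests stay under, which is usually closer to what a user of a real-time voice agent experiences.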

Hallucination Rate Analysis

The hallucination rate was assessed for both ElevenLabs and Deepgram to determine how often the models generated inaccurate or nonsensical outputs.

ElevenLabs demonstrated a lower hallucination rate, with a word error rate (WER) of 2.83%, indicating strong performance in generating coherent speech. Deepgram had a higher WER of 5.67%, suggesting a greater tendency for inaccuracies.

This evaluation underscores ElevenLabs' strength in producing reliable outputs, while Deepgram may need further refinement to reduce hallucination occurrences.
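WER is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a transcript of the generated audio, divided by the number of reference words. The sketch below illustrates that standard definition. How this evaluation transcribed the audio and normalized the text is not described, so the lowercasing and whitespace tokenization here are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("fox" -> "box") and one insertion ("high") over 5 reference words.
wer = word_error_rate("the quick brown fox jumps", "the quick brown box jumps high")
print(f"WER: {wer:.2%}")
```

Because insertions count as errors, WER can exceed 100% when the generated audio contains many extra words, which is one reason it is used as a proxy for hallucinated speech.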

Voice Design Control Test

In evaluating voice design controllability, ElevenLabs and Deepgram were assessed based on their ability to adapt voice characteristics.

ElevenLabs scored high in context awareness, with 63.37% of cases showing excellent adaptation to tone and emphasis. Deepgram, while performing adequately, had a lower context awareness score of 53.18%.

Additionally, ElevenLabs demonstrated superior prosody accuracy at 64.57%, compared to Deepgram's 55.52%. This evaluation highlights ElevenLabs' advantage in providing users with more control over voice design.

Explore Pricing Comparisons for ElevenLabs and Deepgram

ElevenLabs

Free - $0 per month with 10k characters

Starter - $5 per month with 30k characters

Creator - $11 per month with 100k characters

Pro - $99 per month with 500k characters

Scale - $330 per month with 2M characters

Deepgram

Free - $200 of credit

Growth - $4k+ / year with discounted credits

Enterprise - $15k+ / year

Custom solutions for large-scale needs

Trusted by leading enterprises. Speaking from experience.


“Cartesia’s state-space models bring enterprise-grade speed and quality to our AI Voice Agents… making it possible for businesses to deploy secure, scalable voice agents that can understand, act, and adapt in real time.”

Ravi Krishnamurthy

VP Product

“Cartesia Sonic 3.5 has become one of the top-performing models for us by combining low latency with natural pacing… helping us deliver strong voice quality across a growing set of languages where other models often fall short.”

Lydia Zarcone

Voice Product Manager

“With Cartesia's Sonic model, users can interact with a wide range of high-quality, human-like voices in multiple languages, enhancing their experience on our platform.”

Spencer Chan

Head of Poe Product

“Cartesia is leading the charge of building efficient, multimodal models from first principles, starting with their Sonic TTS model.”

Vipul Ved Prakash

CEO

“We didn’t switch to Sonic 3.5 because it was incrementally better, we switched because nothing else came close… we’ve seen a 2.9% lift in our conversion and a 12.2% increase in customer engagement.”

Akshay Ramaswamy

Staff Product Manager

“Sonic 3.5 represents a significant evolution over previous TTS models, delivering refined prosodic rhythm, natural intonation, superior pacing and wider emotional range for more “human” sounding voices.”

Tyler Brown

Voice Expert

