How does voice cloning work?

Voice cloning uses advanced AI algorithms to analyze and replicate a person's voice. By inputting a short audio sample, the system learns the unique characteristics of the voice, including tone, pitch, and accent. This allows for the generation of new speech that sounds like the original speaker. Cartesia's technology can create high-quality clones quickly, making it suitable for various applications, from entertainment to customer service.

Which provider has the fastest text-to-speech voice model?

Cartesia's voice AI has a latency of under 40 ms, which keeps interactions feeling real-time and makes it well suited to applications like customer support and gaming. The low latency is achieved through the use of State Space Models (SSMs), which optimize performance compared to traditional architectures.

Can I customize the voice output?

Yes, Cartesia allows for extensive customization of voice output. Users can adjust various parameters such as pitch, speed, and emotion to create a more personalized audio experience. This level of control ensures that the generated speech meets specific needs, whether for storytelling, customer service, or other applications. The ability to localize voices to different accents further enhances the customization options available.

Updated Feb 14, 2025

Compare ElevenLabs and Google TTS

Q: What's a better alternative to ElevenLabs and Google TTS?

Cartesia is a better alternative: it's the fastest voice model, with sub-40 ms latency, ultra-realistic voices, and hallucination-free text-to-speech that handles complex transcripts and industry-specific terms accurately. This makes the Cartesia model a strong fit for a wide range of uses, including real-time customer support agents, call centers, healthcare, gaming, and sales.

Learn about the differences between ElevenLabs and Google TTS in voice AI models.

ElevenLabs offers highly natural and expressive voices with emotional range, while Google TTS provides reliable, clear speech with extensive language support but less emotional variation. Each excels in different use cases.

Latency

ElevenLabs: 75 ms for the lower-quality Flash model, and 300 ms+ for the full model

Google TTS: 200 ms to 1000 ms

Voice Quality

ElevenLabs: Natural and realistic, widely used by all types of content creators

Google TTS: More robotic voices

Character Limits

ElevenLabs: Limited to 40k characters per request

Google TTS: 5,000 bytes per request

Instant Cloning

ElevenLabs: Requires 10 seconds of audio

Google TTS: Not supported

Professional Voice Cloning

ElevenLabs: Requires 60 minutes of audio

Google TTS: 20 to 30 minutes

Pronunciation Accuracy

ElevenLabs: IPA support but isolated pronunciation

Google TTS: Offers IPA support but less contextual awareness

Voice Customizations

ElevenLabs: Stability, similarity, and style exaggeration controls

Google TTS: Limited customization options for voice adjustments

Telephony Optimization

ElevenLabs: 8 kHz audio, telephony-optimized voices

Google TTS: Standard telephony optimization with 8 kHz audio

Flexible Deployments

ElevenLabs: No on-device or on-prem support

Google TTS: Available on Android devices

Languages Supported

ElevenLabs: 32

Google TTS: 50+

Concurrency

ElevenLabs: Up to 15 on the highest self-serve tier, custom for enterprise

Google TTS: 300 concurrent sessions per 5 minutes and a limit of 3,000 requests per minute

Looking for an alternative to ElevenLabs and Google TTS?

Voice Clone with 3s of Audio

Cartesia's voice cloning delivers lifelike, accurate voice replicas.

Ultra-Realistic Voices

Enjoy expressive voices that sound nearly indistinguishable from humans.

Hallucination-Free Text-to-Speech

Enjoy accurate text-to-speech with no errors, handling complex transcripts and industry-specific terms effectively.

Enterprise Ready

Enterprise-grade reliability with 99.9% uptime, SOC2 compliance, and full on-premises support.

How they stack up

Voice Quality Comparison

When comparing voice quality between ElevenLabs and Google TTS, we found significant differences.

ElevenLabs demonstrated high pronunciation accuracy, with 81.97% of words pronounced correctly, while Google TTS achieved 77.30%. In terms of speech naturalness, ElevenLabs scored high in 44.98% of cases, whereas Google TTS had a low naturalness rating in 78.01% of instances.

On background noise, Google TTS came out ahead: 89.46% of its outputs had no detectable noise, compared with 80.27% for ElevenLabs. Overall, ElevenLabs still provides the more natural and accurate voice quality, based on its pronunciation and naturalness results.

Latency Assessment

In our latency evaluation, we measured the Time to First Audio (TTFA) for both ElevenLabs and Google TTS.

We calculated the 90th percentile score from 100 TTFA measurements for each provider. ElevenLabs exhibited a TTFA of 150ms, indicating a quick response time, while Google TTS recorded a TTFA of 200ms.

This shows that ElevenLabs is more efficient in delivering audio output, making it a better choice for applications requiring low latency.
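As a rough illustration of the percentile step described above, the sketch below computes a 90th percentile from 100 TTFA samples. The sample values are synthetic placeholders, and the nearest-rank method is an assumption, since the evaluation does not say which percentile method was used.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Synthetic TTFA measurements in milliseconds (placeholder data, not real results).
ttfa_ms = [100 + i for i in range(100)]  # 100, 101, ..., 199

p90 = percentile_nearest_rank(ttfa_ms, 90)
print(f"P90 TTFA: {p90} ms")
```

Reporting the 90th percentile rather than the mean means the figure reflects the slower requests a user will occasionally hit, not just the typical case.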

Hallucination Rate Analysis

The hallucination rate was assessed for ElevenLabs and Google TTS to determine how often the models generated incorrect or nonsensical outputs.

ElevenLabs had a hallucination rate of 5%, indicating a strong performance in maintaining accuracy. In contrast, Google TTS exhibited a higher hallucination rate of 10%.

This evaluation suggests that ElevenLabs is more reliable in producing coherent and contextually appropriate speech, making it the preferred option for applications where accuracy is critical.

Voice Cloning

In our evaluation of voice cloning capabilities, ElevenLabs and Google TTS were put to the test. ElevenLabs achieved an impressive Word Error Rate (WER) of 2.83%, showcasing its accuracy in generating coherent speech.

In contrast, Google TTS recorded a WER of 3.36%, indicating slightly less precision. When it comes to speech naturalness, ElevenLabs scored high in 44.98% of cases, while Google TTS struggled, with low naturalness in 78.01% of instances.

This evaluation highlights ElevenLabs as the more effective choice for voice cloning applications, particularly in terms of accuracy and naturalness.
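For readers unfamiliar with the metric, WER is conventionally defined as (substitutions + deletions + insertions) divided by the number of words in the reference text. The sketch below computes it with a word-level edit distance. It is a minimal illustration under assumptions: the evaluation above does not describe its pipeline, the function name is hypothetical, and the example sentences are made up. In a TTS setting the hypothesis would typically be a transcript of the generated audio.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word and one dropped word out of nine.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```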

Voice Design Control

In evaluating voice design controllability, ElevenLabs and Google TTS were assessed on their ability to adapt voice characteristics based on user input.

ElevenLabs scored high in context awareness, achieving a 63.37% rating, while Google TTS lagged behind with a 39.25% score. Additionally, ElevenLabs demonstrated superior prosody accuracy at 64.57%, compared to Google TTS's 45.83%.

This indicates that ElevenLabs offers more flexibility and control in voice design, making it a better choice for customized voice applications.

Pricing Comparison: ElevenLabs vs Google TTS

ElevenLabs

Free - $0 per month with 10k characters

Starter - $5 per month with 30k characters

Creator - $11 per month with 100k characters

Pro - $99 per month with 500k characters

Scale - $330 per month with 2M characters

Google TTS

Standard voices - $4 per 1 million characters

WaveNet, Neural2, Polyglot (Preview) voices - $16 per 1 million characters

Chirp HD (Preview) voices - $30 per 1 million characters

Studio voices - $160 per 1 million characters
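Because ElevenLabs is priced per monthly plan and Google TTS per million characters, the two lists are easier to compare once the plan prices are converted to the same unit. The sketch below does that arithmetic using only the figures listed above. It assumes the full monthly character quota is used and ignores overage charges, free allowances, and feature differences between tiers, so treat the output as a rough guide.

```python
# Paid ElevenLabs plans from the list above: (monthly price in USD, included characters).
elevenlabs_plans = {
    "Starter": (5, 30_000),
    "Creator": (11, 100_000),
    "Pro": (99, 500_000),
    "Scale": (330, 2_000_000),
}

# Google TTS prices from the list above, in USD per 1 million characters.
google_tts_per_million = {
    "Standard": 4,
    "WaveNet / Neural2 / Polyglot (Preview)": 16,
    "Chirp HD (Preview)": 30,
    "Studio": 160,
}


def price_per_million(monthly_usd, included_chars):
    """Effective USD per 1 million characters if the whole quota is used."""
    return monthly_usd * 1_000_000 / included_chars


for plan, (usd, chars) in elevenlabs_plans.items():
    print(f"ElevenLabs {plan}: ${price_per_million(usd, chars):.2f} per 1M characters")

for voice, usd in google_tts_per_million.items():
    print(f"Google TTS {voice}: ${usd:.2f} per 1M characters")
```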

Trusted by leading enterprises. Speaking from experience.

Discover success stories

“Cartesia’s state-space models bring enterprise-grade speed and quality to our AI Voice Agents… making it possible for businesses to deploy secure, scalable voice agents that can understand, act, and adapt in real time.”

Ravi Krishnamurthy

VP Product

“Cartesia Sonic 3.5 has become one of the top-performing models for us by combining low latency with natural pacing… helping us deliver strong voice quality across a growing set of languages where other models often fall short.”

Lydia Zarcone

Voice Product Manager

“With Cartesia's Sonic model, users can interact with a wide range of high-quality, human-like voices in multiple languages, enhancing their experience on our platform.”

Spencer Chan

Head of Poe Product

“Cartesia is leading the charge of building efficient, multimodal models from first principles, starting with their Sonic TTS model.”

Vipul Ved Prakash

CEO

“We didn’t switch to Sonic 3.5 because it was incrementally better, we switched because nothing else came close… we’ve seen a 2.9% lift in our conversion and a 12.2% increase in customer engagement.”

Akshay Ramaswamy

Staff Product Manager

“Sonic 3.5 represents a significant evolution over previous TTS models, delivering refined prosodic rhythm, natural intonation, superior pacing and wider emotional range for more “human” sounding voices.”

Tyler Brown

Voice Expert
