How does voice cloning work?

Voice cloning uses AI models to analyze and replicate a person's voice. Given a short audio sample, the system learns the voice's distinguishing characteristics, including tone, pitch, and accent, and can then generate new speech that sounds like the original speaker. Cartesia's technology can create high-quality clones quickly, making it suitable for applications ranging from entertainment to customer service.

What is the latency of Cartesia's TTS?

Cartesia's TTS system boasts a latency of just 199 ms, measured using the Time to First Audio (TTFA) metric. This low latency is crucial for real-time applications, ensuring that users receive audio responses almost instantaneously. The efficiency of Cartesia's architecture, based on State Space Models, allows for this rapid performance, making it ideal for interactive voice applications where timing is essential.

Can I customize the voice output?

Yes, Cartesia allows for extensive customization of voice output. Users can adjust various parameters such as pitch, speed, and emotion to create a more personalized audio experience. This level of control ensures that the generated speech meets specific needs, whether for storytelling, customer service, or other applications. The ability to localize voices to different accents further enhances the customization options available.

What languages does Cartesia support?

Cartesia supports seamless speech in 13 languages, including English, Spanish, French, German, Japanese, and more. This multilingual capability allows users to reach a global audience, making it easier to create content that resonates with diverse populations. The platform is continually updated, with more languages being added to enhance accessibility and usability for users worldwide.

Updated Feb 14, 2025

Compare Cartesia and Google TTS

Learn about the differences between Cartesia and Google TTS in voice AI models.


Explore features like latency and voice quality for both models.

Latency

Cartesia: 40 ms for the Sonic Turbo model, 90 ms for the Sonic 2 model

Google TTS: 200 ms to 1000 ms

Voice Quality

Cartesia: Consistently rated as more natural, expressive, and realistic in blinded human evaluations

Google TTS: Voice quality is often rated lower, lacking depth and reliability

Character Limits

Cartesia: Infinite request length

Google TTS: 5,000 bytes per request

Instant Cloning

Cartesia: Requires 3 seconds of audio

Google TTS: Not supported

Professional Voice Cloning

Cartesia: Requires 30 minutes of audio

Google TTS: 20 to 30 minutes

Pronunciation Accuracy

Cartesia: IPA support with strong contextual understanding

Google TTS: Offers IPA support but less contextual awareness

Voice Customizations

Cartesia: Fully customizable voice with speed and emotion controls, plus synthetic voice mixing and design

Google TTS: Limited customization options for voice adjustments

Telephony Optimization

Cartesia: 8 kHz audio, telephony-optimized voices

Google TTS: Standard telephony optimization with 8 kHz audio

Flexible Deployments

Cartesia: Supports both on-prem and on-device deployments

Google TTS: Available on Android devices

Languages Supported

Cartesia: 15 languages with extensive dialect coverage

Google TTS: 50+

Concurrency

Cartesia: Up to 15 on highest self-serve tier (60 parallel conversations), custom for enterprise

Google TTS: 300 concurrent sessions per 5 minutes and a limit of 3,000 requests per minute
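A per-request byte limit such as the 5,000 bytes listed for Google TTS under Character Limits means long text has to be split before it is sent. The sketch below is one possible way to do that, assuming the limit applies to the UTF-8 encoding of the text; the function name `split_by_byte_limit` and the sentence-splitting rule are illustrative choices, not part of either vendor's API.

```python
import re
from typing import List


def split_by_byte_limit(text: str, max_bytes: int = 5000) -> List[str]:
    """Split text into pieces of at most max_bytes UTF-8 bytes each,
    preferring to break between sentences."""
    if max_bytes < 4:
        raise ValueError("max_bytes must fit at least one UTF-8 character (4 bytes)")

    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""

    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
            continue

        # The sentence does not fit next to what we have: flush the current chunk.
        if current:
            chunks.append(current)

        # Rebuild from this sentence, cutting on character boundaries
        # only if the sentence alone exceeds the limit.
        piece, piece_size = "", 0
        for ch in sentence:
            ch_size = len(ch.encode("utf-8"))
            if piece_size + ch_size > max_bytes:
                chunks.append(piece)
                piece, piece_size = "", 0
            piece += ch
            piece_size += ch_size
        current = piece

    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    long_text = "This is a sample sentence for synthesis. " * 400
    parts = split_by_byte_limit(long_text)
    print(len(parts), "requests, largest is",
          max(len(p.encode("utf-8")) for p in parts), "bytes")
```

Measuring bytes rather than characters matters for non-ASCII text, where one character can occupy up to four bytes.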

Cartesia - Advanced AI Voice Capabilities

High-Quality Voice Cloning

Cartesia's voice cloning delivers lifelike, accurate replicas.

Ultra-Realistic Voices

Enjoy expressive voices that sound nearly indistinguishable from humans.

No Hallucinations

Experience clear audio without errors, ensuring authentic voice output.

Enterprise Ready

Enterprise-grade reliability with 99.9% uptime, SOC2 compliance, and full on-premises support.

How they stack up

Voice Quality Comparison

In terms of voice quality, Cartesia consistently outperforms Google TTS. Cartesia's Sonic model has been rated 4.7 in independent evaluations, while Google TTS scores lower. Cartesia's voices are noted for their naturalness and emotional sensitivity, making them well suited to real-time applications. The State Space Model architecture used by Cartesia allows for better clarity and depth in voice generation, giving a more engaging user experience than Google TTS.

Latency Analysis

Latency is a critical factor in voice AI performance. Cartesia measures latency using the Time to First Audio (TTFA) metric, achieving a remarkable 199 ms. In contrast, Google TTS has a higher latency, which can hinder real-time interactions. Cartesia's Sonic model is built on State Space Models (SSMs), allowing for greater latency optimization compared to traditional transformer architectures. This efficiency ensures that users experience faster response times, making Cartesia a preferred choice for applications requiring low latency.
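The Time to First Audio metric described above can be measured on the client side roughly as follows. This is a minimal sketch, assuming the TTS client exposes its output as an iterable of audio byte chunks; `fake_tts_stream` is a hypothetical stand-in used only to make the example runnable, and it is not Cartesia's or Google's API.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio_ms(start_stream: Callable[[], Iterable[bytes]]) -> float:
    """Return the milliseconds between issuing the request and
    receiving the first non-empty audio chunk."""
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # ignore empty keep-alive chunks
            return (time.perf_counter() - started) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical streaming TTS call: waits, then yields audio chunks."""
    time.sleep(delay_s)        # simulated network and model time before first audio
    yield b"\x00\x01" * 160    # first audio chunk
    yield b"\x00\x01" * 160    # later chunks do not affect TTFA


if __name__ == "__main__":
    ttfa = time_to_first_audio_ms(fake_tts_stream)
    print(f"TTFA: {ttfa:.1f} ms")
```

Because the timer starts before the request is issued, a measurement taken this way includes network round-trip time, so it will generally be higher than a figure measured at the server.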

Hallucination Rate Review

Cartesia excels in minimizing hallucination rates in voice generation. With its advanced AI voice cloning technology, it ensures crystal-clear audio without errors, maintaining authenticity in voice replication. This is a significant advantage over Google TTS, which may produce less reliable outputs. Cartesia's commitment to high-quality voice cloning means users can trust the accuracy and clarity of the generated voices, making it a reliable choice for various applications.

Voice Cloning Showdown

When it comes to voice cloning, Cartesia can create an instant clone from just 3 seconds of audio and offers unlimited instant voice cloning, whereas Google TTS does not support instant cloning. Cartesia's embedding technology aims for high-quality voice replication, preserving accents and voice quality even in noisy conditions. Additionally, Cartesia's voice mixing and design capabilities provide a diverse range of voices, making it a strong choice for voice cloning applications.

Voice Design Control

Cartesia stands out by offering unique features like emotion and speed modulation, allowing users to refine voice adjustments while maintaining a natural sound. This level of control enables users to customize voices to match specific needs, such as localizing an American voice to speak with a French accent. In contrast, Google TTS provides limited control options, which may not meet the diverse requirements of users looking for tailored voice experiences.

Pricing Comparison: Cartesia vs Google TTS

Cartesia

Free - $0 per month with 20K free credits

Pro - $5 per month with 100K credits

Startup - $49 per month with 1.25M credits

Scale - $299 per month with 8M credits

Enterprise - trusted by Fortune 500 companies

Google TTS

Standard voices - $4 per 1 million characters

WaveNet, Neural2, Polyglot (Preview) voices - $16 per 1 million characters

Chirp HD (Preview) voices - $30 per 1 million characters

Studio voices - $160 per 1 million characters
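The two price lists use different units (monthly credit bundles versus a per-character rate), so a small calculation helps put them side by side. The sketch below uses only the figures from the pricing list above. It does not assume how many characters one Cartesia credit buys, since this page does not state that, so the Cartesia figure is expressed per million credits rather than per million characters. All function and variable names are illustrative.

```python
# Monthly price in USD and included credits, copied from the pricing list.
CARTESIA_PLANS = {
    "Pro": (5.0, 100_000),
    "Startup": (49.0, 1_250_000),
    "Scale": (299.0, 8_000_000),
}

# USD per 1 million characters, copied from the pricing list.
GOOGLE_RATES_PER_MILLION_CHARS = {
    "Standard": 4.0,
    "WaveNet / Neural2 / Polyglot (Preview)": 16.0,
    "Chirp HD (Preview)": 30.0,
    "Studio": 160.0,
}


def cartesia_price_per_million_credits(plan: str) -> float:
    """Effective USD price of 1 million credits if the whole bundle is used."""
    price, credits = CARTESIA_PLANS[plan]
    return price * 1_000_000 / credits


def google_cost(characters: int, tier: str) -> float:
    """USD cost of synthesizing the given number of characters on a tier."""
    return characters * GOOGLE_RATES_PER_MILLION_CHARS[tier] / 1_000_000


if __name__ == "__main__":
    for plan in CARTESIA_PLANS:
        print(f"Cartesia {plan}: "
              f"${cartesia_price_per_million_credits(plan):.2f} per 1M credits")
    for tier in GOOGLE_RATES_PER_MILLION_CHARS:
        print(f"Google {tier}: ${google_cost(2_500_000, tier):.2f} for 2.5M characters")
```

A direct comparison additionally requires the credit-to-character conversion from Cartesia's own pricing documentation.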

Trusted by leading enterprises. Speaking from experience.


“Cartesia’s state-space models bring enterprise-grade speed and quality to our AI Voice Agents… making it possible for businesses to deploy secure, scalable voice agents that can understand, act, and adapt in real time.”

Ravi Krishnamurthy

VP Product

“Cartesia Sonic 3.5 has become one of the top-performing models for us by combining low latency with natural pacing… helping us deliver strong voice quality across a growing set of languages where other models often fall short.”

Lydia Zarcone

Voice Product Manager

“With Cartesia's Sonic model, users can interact with a wide range of high-quality, human-like voices in multiple languages, enhancing their experience on our platform.”

Spencer Chan

Head of Poe Product

“Cartesia is leading the charge of building efficient, multimodal models from first principles, starting with their Sonic TTS model.”

Vipul Ved Prakash

CEO

“We didn’t switch to Sonic 3.5 because it was incrementally better, we switched because nothing else came close… we’ve seen a 2.9% lift in our conversion and a 12.2% increase in customer engagement.”

Akshay Ramaswamy

Staff Product Manager

“Sonic 3.5 represents a significant evolution over previous TTS models, delivering refined prosodic rhythm, natural intonation, superior pacing and wider emotional range for more ‘human’ sounding voices.”

Tyler Brown

Voice Expert

