How does voice cloning work?

Voice cloning uses AI models to analyze and replicate a person's voice. Given a short audio sample, the system learns the unique characteristics of the voice, including tone, pitch, and accent, and then generates new speech that sounds like the original speaker. Cartesia's technology is built for high fidelity and accuracy, so the cloned voice closely matches the original.

What is the latency for voice generation?

Cartesia's voice generation has a time-to-first-audio latency of 40ms on the Sonic Turbo model and 90ms on the Sonic 2 model, so speech starts playing almost immediately after a request. Low latency matters most in applications requiring real-time interaction, such as customer support and gaming, where delays disrupt the user experience.

Can I customize the cloned voice?

Yes, Cartesia allows users to customize the cloned voice by adjusting parameters such as pitch, speed, and emotion. This flexibility enables users to create a voice that fits their specific needs, whether for storytelling, customer service, or other applications. The customization options enhance the overall user experience by making the generated speech more relatable and engaging.

How many languages does Cartesia support?

Cartesia supports seamless speech in 13 languages, including English, Spanish, French, German, Japanese, and more. This multilingual capability allows users to reach a global audience, making it easier to create content that resonates with diverse populations. The platform continuously adds more languages, ensuring that users can communicate effectively across different regions.

Updated Feb 14, 2025

Compare Cartesia and OpenAI TTS Voice Models

Discover key differences between Cartesia and ElevenLabs voice AI models.

Explore the differences between Cartesia and OpenAI TTS voice models. Discover features, pricing, and performance metrics.

Latency

Cartesia: 40ms for the Sonic Turbo model, 90ms for the Sonic 2 model

OpenAI TTS: 200 ms + network time, slower response times

Voice Quality

Cartesia: Consistently rated as more natural, expressive, and realistic in blinded human evaluations

OpenAI TTS: Lower quality ratings in human evaluations, and limited emotional control

Character Limits

Cartesia: Infinite request length

OpenAI TTS: Limited character count for longer texts

Instant Cloning

Cartesia: Requires 3 seconds of audio

OpenAI TTS: Requires 13 seconds of audio

Professional Voice Cloning

Cartesia: Requires 30 minutes of audio

OpenAI TTS: Requires 60 minutes of audio

Pronunciation Accuracy

Cartesia: IPA support with strong contextual understanding

OpenAI TTS: Less contextual awareness in pronunciation

Voice Customizations

Cartesia: Slider control for speed and emotion, plus synthetic voice mixing and design

OpenAI TTS: Basic controls for speed, emotion, and similarity

Telephony Optimization

Cartesia: 8kHz audio, telephony-optimized voices

OpenAI TTS: Standard audio quality without optimization

Flexible Deployments

Cartesia: Supports both on-prem and on-device deployments

OpenAI TTS: No on-device generation available

Languages Supported

Cartesia: 15 languages with extensive dialect coverage

OpenAI TTS: 57 languages

Concurrency

Cartesia: Up to 15 on highest self-serve tier (60 parallel conversations), custom for enterprise

OpenAI TTS: 3-200 per minute

Cartesia - Advanced AI Voice Capabilities

Voice Clone

Cartesia's voice cloning delivers lifelike, accurate voice replication with unmatched fidelity.

Ultra-Realistic Voices

With a time-to-first-audio of just 40ms, Sonic provides the fastest and most realistic voice generation available.

No Hallucinations

Cartesia's AI text to speech eliminates errors, ensuring authentic and clear audio output every time.

Enterprise Ready

Enterprise-grade reliability with 99.9% uptime, SOC2 compliance, and full on-premises support.

How they stack up

Latency

We measure latency using the Time to First Audio (TTFA) metric and calculate the 90th percentile score from 100 TTFA measurements for each provider. With a TTFA of 199 ms, Cartesia's Sonic model is significantly faster than ElevenLabs' 832 ms at the self-serve tier (with ElevenLabs reserving slightly lower latency for enterprise customers). Cartesia’s Sonic model is built on a radically more efficient architecture, known as State Space Models (SSMs), which allows for far greater latency optimization than the traditional transformer architecture used by ElevenLabs and other providers. For a side by side comparison, watch the latency demonstration in this video on X.
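The measurement described above (100 TTFA samples per provider, reported as the 90th percentile) can be sketched roughly as follows. This is a minimal illustration, not Cartesia's actual benchmark harness: `fake_tts_stream` is a hypothetical stand-in for a provider's streaming call, and the nearest-rank percentile definition is an assumption, since the page does not say which percentile method was used.

```python
import math
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio(start_stream: Callable[[], Iterator[bytes]]) -> float:
    """Seconds from issuing the request until the first audio chunk arrives."""
    started = time.perf_counter()
    stream = start_stream()  # hypothetical: would issue the TTS request
    next(stream)             # blocks until the first chunk is available
    return time.perf_counter() - started


def percentile_nearest_rank(samples: Iterable[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("need at least one sample")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fake_tts_stream() -> Iterator[bytes]:
    """Stand-in for a real streaming TTS call (assumption, not a real API):
    waits briefly, then yields one chunk of silent audio bytes."""
    time.sleep(0.001)
    yield b"\x00" * 320


def run_benchmark(n: int = 100) -> float:
    """Collect n TTFA samples in milliseconds and return their 90th percentile."""
    ttfas_ms = [time_to_first_audio(fake_tts_stream) * 1000 for _ in range(n)]
    return percentile_nearest_rank(ttfas_ms, 90)


if __name__ == "__main__":
    print(f"p90 TTFA: {run_benchmark():.1f} ms")
```

Reporting the 90th percentile rather than the mean means the figure reflects the slower requests a user will occasionally hit, not just the typical case. To benchmark a real provider, `fake_tts_stream` would be replaced with a function that opens that provider's streaming connection and yields audio chunks as they arrive.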

Pronunciation Accuracy

Using the Time to First Audio (TTFA) metric, Cartesia's Sonic model achieves a TTFA of 199 ms, significantly faster than ElevenLabs' 832 ms. This efficiency is due to Cartesia's advanced State Space Models (SSMs), optimizing latency beyond traditional architectures.

Voice Cloning

Cartesia provides unlimited instant voice cloning. In contrast, ElevenLabs restricts cloning capabilities with tiered plans allowing for 10, 30, 160, or 660 custom voices. Additionally, Cartesia utilizes advanced embedding technology to deliver consistent, high-quality voice clones. Cartesia preserves accents and voice quality even if the original audio clip is noisy. Cartesia's voice mixing and design capabilities also open up a more comprehensive set of diverse voices. In the following samples of Engelbart's clones, it's apparent that Cartesia produces a significantly less noisy and higher quality clone than ElevenLabs.

Voice Design Controllability

Cartesia stands out as the only provider offering emotion and speed modulation features, enabling refined voice adjustments while preserving a natural and seamless auditory experience. Cartesia also allows you to localize the voice to match different accents; for instance, you can start with an American voice and have it speak in a French accent. In comparison, ElevenLabs only offers controls for stability, similarity, and style exaggeration, none of which offers clear control over the voice. In the following example, where the speaker is addressing a customer complaint, we compare the dial effects from ElevenLabs and Cartesia: the ElevenLabs voice sounds similar regardless of which dials are applied, while Cartesia's emotion and speed dials produce very noticeable changes.

Pricing Plans for Cartesia and OpenAI TTS

Cartesia

Free - $0 per month with 20K free credits

Pro - $5 per month with 100K credits

Startup - $49 per month with 1.25M credits

Scale - $299 per month with 8M credits

Enterprise - trusted by Fortune 500 companies

OpenAI TTS

TTS - $15 per 1M characters

TTS HD - $30 per 1M characters
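To make the listed prices easier to compare, the short sketch below works out the effective price per 1M credits on each paid Cartesia plan and the cost of a given character count at the OpenAI per-character rates. It uses only the figures quoted above. The page does not state how many characters one Cartesia credit covers, so the code deliberately does not convert credits to characters; the function and variable names are illustrative only.

```python
# Paid Cartesia plans as listed above: (USD per month, credits included).
CARTESIA_PLANS = {
    "Pro": (5, 100_000),
    "Startup": (49, 1_250_000),
    "Scale": (299, 8_000_000),
}

# OpenAI rates as listed above: USD per 1M characters.
OPENAI_USD_PER_MILLION_CHARS = {"TTS": 15.0, "TTS HD": 30.0}


def usd_per_million_credits(monthly_usd: float, credits: int) -> float:
    """Effective plan price normalized to 1M credits."""
    return monthly_usd * 1_000_000 / credits


def openai_cost_usd(characters: int, model: str) -> float:
    """Cost of synthesizing `characters` characters at the listed rate."""
    return characters * OPENAI_USD_PER_MILLION_CHARS[model] / 1_000_000


if __name__ == "__main__":
    for plan, (usd, credits) in CARTESIA_PLANS.items():
        print(f"{plan}: ${usd_per_million_credits(usd, credits):.2f} per 1M credits")
    for model in OPENAI_USD_PER_MILLION_CHARS:
        print(f"{model}: ${openai_cost_usd(250_000, model):.2f} for 250K characters")
```

The per-credit price falls as the plan size grows, which is the main thing this normalization makes visible. A direct cost comparison between the two providers would additionally need the credit-to-character ratio, which is not given here.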

Trusted by leading enterprises. Speaking from experience.

“Cartesia’s state-space models bring enterprise-grade speed and quality to our AI Voice Agents… making it possible for businesses to deploy secure, scalable voice agents that can understand, act, and adapt in real time.”

Ravi Krishnamurthy

VP Product

“Cartesia Sonic 3.5 has become one of the top-performing models for us by combining low latency with natural pacing… helping us deliver strong voice quality across a growing set of languages where other models often fall short.”

Lydia Zarcone

Voice Product Manager

“With Cartesia's Sonic model, users can interact with a wide range of high-quality, human-like voices in multiple languages, enhancing their experience on our platform.”

Spencer Chan

Head of Poe Product

“Cartesia is leading the charge of building efficient, multimodal models from first principles, starting with their Sonic TTS model.”

Vipul Ved Prakash

CEO

“We didn’t switch to Sonic 3.5 because it was incrementally better, we switched because nothing else came close… we’ve seen a 2.9% lift in our conversion and a 12.2% increase in customer engagement.”

Akshay Ramaswamy

Staff Product Manager

“Sonic 3.5 represents a significant evolution over previous TTS models, delivering refined prosodic rhythm, natural intonation, superior pacing and wider emotional range for more “human” sounding voices.”

Tyler Brown

Voice Expert
