How Cartesia powers the world's most responsive AI avatars

The company
Cerebrium builds serverless infrastructure that lets AI teams deploy applications in minutes. As a team, they're excited about real-time AI and the possibilities it unlocks for simulating realistic human interactions.
They showcased practical applications of AI avatars in a demo they built for sales training and user interviews.
Here's the Cerebrium demo:
Building the world's most responsive AI avatar
Cerebrium had a vision: an AI avatar that could train sales reps and coach job seekers with the responsiveness of a real person.
Every millisecond of latency counts because humans are impatient, and long pauses are a dead giveaway that you’re talking to an AI. That’s why, when building their tech stack for this demo, Cerebrium sought to optimize latency while maintaining the quality of a natural human conversation.
Cerebrium combined three key technologies to create their groundbreaking demo:
Mistral 7B - language model
Tavus - AI avatars
Cartesia - low-latency, ultra-realistic voice API
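To make the flow concrete, here is a minimal, hypothetical sketch of the turn loop such a stack implies: the language model drafts a reply, the voice API turns it into speech, and the avatar lip-syncs the audio. The helper functions below are illustrative stand-ins, not the actual Mistral, Cartesia, or Tavus SDK calls, and this is not Cerebrium's implementation.

```python
import time

def generate_reply(history, user_text):
    """Stand-in for a call to a Mistral 7B chat endpoint."""
    return f"(coach's reply to: {user_text})"

def synthesize_speech(text):
    """Stand-in for a streaming Cartesia text-to-speech request."""
    return text.encode("utf-8")  # pretend these bytes are PCM audio

def drive_avatar(audio):
    """Stand-in for pushing audio to a Tavus avatar session for lip-synced video."""
    pass

def handle_turn(history, user_text):
    start = time.perf_counter()
    reply = generate_reply(history, user_text)   # 1. LLM drafts the coach's response
    audio = synthesize_speech(reply)             # 2. TTS converts the response to speech
    drive_avatar(audio)                          # 3. Avatar renders lip-synced video
    history.extend([
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": reply},
    ])
    print(f"end-to-end turn time: {(time.perf_counter() - start) * 1000:.1f} ms")

handle_turn([], "I'm not sure your product fits our budget.")
```

In a production pipeline these stages would typically stream and overlap rather than run one after another, which is how the end-to-end response can stay within the sub-500 ms target described below.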
With a model latency of less than 100 ms to first audio, Cartesia is the fastest generative voice solution on the market. It also offers the most realistic, natural voices, as confirmed by third-party model evaluation platforms like Artificial Analysis, which runs blind human preference tests across the major text-to-speech providers.
Cartesia is also the only provider that offers fine-grained voice design controls, such as speed and emotion. This lets users of the demo practice sales and interview scenarios in which the person they're speaking to might be angry, speak too fast, and so on, as sketched in the example request below.
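As an illustration, a text-to-speech request for an "angry, fast-talking customer" scenario might carry these controls alongside the transcript. The field names below (model_id, the voice controls for speed and emotion, output_format) are assumptions made for the sake of the example; consult Cartesia's API reference for the exact request schema.

```python
# Illustrative request payload for an "angry, fast-talking customer" persona.
# Field names here are assumptions, not a definitive schema.
request = {
    "model_id": "sonic",                 # assumed model identifier
    "transcript": "I've already explained this twice. Why is my invoice still wrong?",
    "voice": {
        "mode": "id",
        "id": "YOUR_VOICE_ID",           # placeholder voice ID
        "controls": {
            "speed": "fast",             # speak quickly, like an impatient caller
            "emotion": ["anger:high"],   # lean the delivery toward frustration
        },
    },
    "output_format": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
}
```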
We're the only provider able to balance speed with quality because we built our voice API on state space models (SSMs), a fundamentally more efficient architecture for AI models.
The result? Voice interactions that feel indistinguishable from speaking to a human coach. The avatar can:
Respond to user input in under 500 ms end to end
Adjust its tone from angry customer to supportive coach
Handle complex, context-aware conversations
Experience it yourself: https://coaching.cerebrium.ai/
Why latency matters
Engagement: Faster responses mean more natural conversations. Users stay engaged longer.
Scalability: Lower latency means you can handle more concurrent users without sacrificing quality.
User experience: In a world of instant gratification, even small delays can lead to frustration and drop-offs.
Whether you're building a customer service bot, a virtual assistant, or the next big thing in EdTech, speed is the key to a realistic experience.
Ready to supercharge your AI with Cartesia's voices?
Cerebrium's demo is just the beginning. Here are other examples of use cases our community is building with our voices today:
A language learning app that responds instantly to pronunciation errors
A mental health chatbot that picks up on emotional cues in real-time
A study buddy that quizzes you on the content of dense research papers
For a deeper dive into the technical details of how Cerebrium built their avatar, check out their in-depth blog post.

