LiveKit
LiveKit Pioneers Next‑Generation AI Agents with Sonic
"Internet applications weren’t built for how we’ll use computers in the future. Those computers will see, hear, and speak like we do. We’ll interact with them like we do with each other. We designed LiveKit’s Agents framework to make it easy to build applications for this new paradigm. Cartesia—pioneers of the SSM architecture—shared our belief that real-time, multimodal AI models would be at the center of computing, making them the perfect Agents launch partner."
- Russ, CEO & LiveKit co-founder
About the company
LiveKit is a leading real-time platform that enables developers to build video, voice, and data capabilities into their applications. Founded in 2021 as an open-source project using WebRTC, LiveKit is a pioneering force in real-time voice and video AI. Today, thousands of developers and teams rely on LiveKit's infrastructure, from individual developers to major enterprises, such as OpenAI, that are pushing the boundaries of real-time voice.
Introduction
When most companies saw video calling simply as a tool for online meetings, LiveKit's founders envisioned a future where real-time voice and video would be critical to any application. This vision culminated in the January 2024 launch of LiveKit Agents, a groundbreaking framework for building programmable, multimodal AI agents.
The Challenge
LiveKit set out to create AI agents with human-level reasoning capabilities, requiring solutions to several complex technical challenges:
Low Latency: For real-time reasoning, agents needed hyper-responsive processing with human-level response times. Because LiveKit orchestrates multiple modalities such as video and text in addition to audio, it needs to transport data between the server and the client as quickly as possible.
Natural Voices: LiveKit’s agents often replace humans in interactions like phone calls, so speech quality needs to be lifelike.
Multilingual: Agents needed to operate seamlessly across multiple languages, as LiveKit’s customers are distributed across the world.
Scalability: High-stakes use cases like voice agents for 911 emergency dispatch required support for high volumes of concurrent users with guaranteed uptime and performance during peak usage.
Context Window Limitations: LiveKit’s long-running agents need to maintain coherent conversations and context while processing multimodal streams of audio, video, and text in real time. Transformer models struggled with this: they required constant reloading of context and faced hard limits on sequence length that made continuous interaction impossible.
The Solution
LiveKit chose to partner with Cartesia because Sonic is built on State Space Models (SSMs), a fundamentally new architecture for AI that aligned with LiveKit’s vision for next-generation AI agents.
Advanced Architecture: SSMs allow Sonic to maintain state and process streaming data natively, enabling agents to maintain context over hours or days of interaction without performance degradation.
Ultra-Low Latency: Cartesia offered the lowest-latency model on the market, at under 100 ms to first audio, allowing multimodal LiveKit agents to consistently achieve superior end-to-end performance.
Natural Voice Generation: Cartesia's models produce contextually aware, human-like speech that maintains consistency across long conversations.
Multilingual: Cartesia offers 14 languages, with the same industry-leading latency, quality and accuracy as its English model.
Enterprise-Grade Scalability: Robust infrastructure supports high-volume, mission-critical applications.
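Time to first audio, the latency figure quoted above, is the delay between requesting speech and receiving the first audio chunk. A minimal sketch of measuring it against any streaming source follows. `fake_tts_stream` and its delays are hypothetical stand-ins, not Cartesia's API, and the numbers it produces say nothing about Sonic itself.

```python
import asyncio
import time
from typing import AsyncIterator, Tuple


async def fake_tts_stream(text: str, first_chunk_delay: float = 0.05) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call; yields one chunk per word."""
    await asyncio.sleep(first_chunk_delay)  # simulated model + network time
    for word in text.split():
        yield word.encode("utf-8")  # placeholder bytes, not real audio
        await asyncio.sleep(0.01)   # simulated gap between chunks


async def time_to_first_audio(stream: AsyncIterator[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total chunks received)."""
    start = time.perf_counter()
    first = None
    chunks = 0
    async for _ in stream:
        if first is None:
            first = time.perf_counter() - start  # record only the first arrival
        chunks += 1
    if first is None:
        raise ValueError("stream produced no audio")
    return first, chunks


ttfa, n_chunks = asyncio.run(time_to_first_audio(fake_tts_stream("hello from the agent")))
print(f"time to first audio: {ttfa * 1000:.0f} ms over {n_chunks} chunks")
```

Swapping `fake_tts_stream` for a real streaming client would leave `time_to_first_audio` unchanged, since it only assumes an async iterator of audio chunks.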
The Results
Cartesia is available today as an integration for LiveKit Agents. The partnership has enabled LiveKit to deploy its agents across diverse, demanding use cases, including:
Emergency Services: AI voice agents for 911 emergency dispatch, requiring perfect reliability and natural interaction.
Gaming: AI-driven NPCs providing immersive gaming experiences.
Autonomous Vehicles: Real-time telemetry processing and decision-making.
Enterprise Solutions: Seamless integration of voice and image capabilities with existing AI systems.
The benefit of using LiveKit and Cartesia together is that users get the network quality of LiveKit's global WebRTC network together with the speed of Cartesia's Sonic text-to-speech model.
Developers can run Agents workers on their own laptops during development, and deploy them to their own servers for production use. Agents themselves connect to the Cartesia API and use the LiveKit network to stream audio to users.
The result is a system where developers get a state-of-the-art text-to-speech model for their voice agent while keeping business logic, such as RAG or data lookup, in code that they control. The simplest proof-of-concept voice agent is just fifty lines of code (not counting whitespace or comments).
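As a rough illustration of that division of labor (not the LiveKit Agents API, and not the fifty-line example itself), one conversational turn can be sketched as speech-to-text, developer-owned logic, then streaming text-to-speech. Every function name below is a hypothetical stub: `transcribe` and `synthesize` stand in for hosted models, and `send_audio` stands in for the media network.

```python
import asyncio
from typing import AsyncIterator, Callable, List

# All names below are hypothetical stand-ins, not LiveKit or Cartesia APIs.


async def transcribe(audio: bytes) -> str:
    """Stand-in speech-to-text: treats the incoming bytes as UTF-8 text."""
    return audio.decode("utf-8")


def lookup_reply(user_text: str) -> str:
    """Developer-owned business logic (for example RAG or a data lookup)."""
    faq = {"hours": "We are open 9 to 5.", "location": "We are at 1 Main St."}
    for key, answer in faq.items():
        if key in user_text.lower():
            return answer
    return "Sorry, I do not know that yet."


async def synthesize(text: str) -> AsyncIterator[bytes]:
    """Stand-in streaming TTS: yields one placeholder chunk per word."""
    for word in text.split():
        await asyncio.sleep(0)  # yield control, as a network call would
        yield word.encode("utf-8")


async def handle_turn(audio_in: bytes, send_audio: Callable[[bytes], None]) -> str:
    """One conversational turn: hear, decide, speak."""
    user_text = await transcribe(audio_in)
    reply = lookup_reply(user_text)
    async for chunk in synthesize(reply):
        send_audio(chunk)  # a real agent would stream this to the user
    return reply


sent: List[bytes] = []
reply = asyncio.run(handle_turn(b"What are your hours?", sent.append))
print(reply)
```

The point of the shape is that only `lookup_reply` needs to change as the product evolves. The speech components on either side can be replaced without touching it.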
"Russ and David at LiveKit share Cartesia's vision around the need for new architectures to allow multimodal AI to reach its true potential. With Livekit, our customers can build sophisticated agents that take phone calls in just a few hours. We're thankful for the opportunity to power their Agents to accomplish more complex reasoning tasks than ever before.
- Karan Goel, CEO, Cartesia
"Using LiveKit and Cartesia enables us to effectively deploy reliable, conversational agents with natural voices at scale to serve our customers 24/7 by improving patient access and experience."
- Jeffrey Liu, CEO at Assort Health
What our customers say
Join the growing list of companies opting for Sonic.
"We're thrilled to partner with Cartesia - their technology has dramatically improved the accuracy and reliability of our call center agents. Beyond just providing best-in-class voice AI, the Cartesia team has been a true partner in helping us transform 24/7 patient support for over 215,000 patients. Their support has been instrumental in making exceptional care accessible anytime, anywhere."
- Jeffrey Liu, Founder and co-CEO, Assort Health
"This partnership represents a transformative moment in enterprise AI adoption," said Melissa Gordon, CEO of Rasa. "By combining Rasa’s strengths in enterprise conversational AI with Cartesia's innovative voice technology, we're fundamentally changing how enterprises can deploy and scale AI assistants across their organizations."
"Together AI's mission has always been to provide developers with the most powerful and efficient tools for building AI applications," says Vipul Ved Prakash, Together AI's CEO. "Cartesia is leading the charge of building efficient, multimodal models from first principles, starting with their Sonic TTS model. By integrating Sonic into our platform, we're enabling developers to create sophisticated multi-modal applications that leverage the most advanced and lowest latency voice model available today, all while maintaining the simplicity and reliability our users expect."