LiveKit pioneers next‑generation AI agents with Sonic

LiveKit is a leading real-time platform that enables developers to build video, voice, and data capabilities into their applications. Founded in 2021 as an open-source project built on WebRTC, LiveKit is a pioneering force in real-time voice and video AI. Today, thousands of developers and teams rely on LiveKit's infrastructure, from individual developers to major enterprises, such as OpenAI, that are pushing the boundaries of real-time voice.

When most companies saw video calling simply as a tool for online meetings, LiveKit's founders envisioned a future where real-time voice and video would be critical to any application. This vision culminated in the January 2024 launch of LiveKit Agents, a groundbreaking framework for building programmable, multimodal AI agents.

The challenge

LiveKit set out to create AI agents with human-level reasoning capabilities, requiring solutions to several complex technical challenges:

Low Latency: For real-time reasoning, agents needed hyper-responsive processing with human-level response times. Because LiveKit orchestrates multiple AI modalities, such as video and text in addition to audio, it needs to transport data between the server and the client as quickly as possible.
Natural Voices: LiveKit’s agents often replace humans in interactions like phone calls, so speech quality needs to be incredibly lifelike.
Multilingual: Agents needed to operate seamlessly across multiple languages because LiveKit’s customers are distributed across the world.
Scalability: High-stakes use cases like voice agents for 911 emergency dispatch required support for high volumes of concurrent users with guaranteed uptime and performance during peak usage.
Context Window Limitations: LiveKit’s long-running agents need to maintain coherent conversations and context while processing multimodal streams of audio, video, and text in real time. Transformer models struggled with this: they required constant reloading of context and faced hard limits on sequence length that made continuous interaction impossible.
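
The low-latency requirement above is usually tracked as time to first audio: how long a caller waits before the first chunk of speech arrives. The following is a minimal sketch of measuring that for any chunked stream, using only the Python standard library. `fake_tts_stream` is a hypothetical stand-in for a real streaming response, and its 50 ms delay is an illustrative value, not a measurement of any product.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[float, Iterator[bytes]]:
    """Return (seconds until the first chunk arrived, iterator over all chunks).

    Raises StopIteration if the stream yields nothing.
    """
    start = time.perf_counter()
    it = iter(chunks)
    first = next(it)  # blocks until the producer yields its first chunk
    elapsed = time.perf_counter() - start

    def replay() -> Iterator[bytes]:
        # Hand the first chunk back so the caller still sees the full stream.
        yield first
        yield from it

    return elapsed, replay()


def fake_tts_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming text-to-speech response."""
    time.sleep(delay_s)  # simulated model and network delay before first audio
    for i in range(n_chunks):
        yield bytes([i]) * 4


latency_s, audio = time_to_first_chunk(fake_tts_stream())
chunks = list(audio)
print(f"first chunk after {latency_s * 1000:.0f} ms, {len(chunks)} chunks total")
```

Because the helper accepts any iterable of byte chunks, the same measurement could be pointed at a real audio stream without changing the timing logic.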

The solution

LiveKit chose to partner with Cartesia because Sonic is built on State Space Models (SSMs), a fundamentally new architecture for AI that aligned closely with LiveKit’s vision for next-generation AI agents.

Advanced Architecture: SSMs allow Sonic to maintain state and process streaming data natively, enabling agents to maintain context over hours or days of interaction without performance degradation.
Ultra-Low Latency: Cartesia offered the lowest-latency model on the market at sub-100 ms to first audio, allowing multimodal LiveKit agents to consistently achieve superior end-to-end performance.
Natural Voice Generation: Cartesia's models produce contextually aware, human-like speech that maintains consistency across long conversations.
Multilingual: Cartesia offers 14 languages, with the same industry-leading latency, quality and accuracy as its English model.
Enterprise-Grade Scalability: Robust infrastructure supports high-volume, mission-critical applications.
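
To make the "maintain state and process streaming data natively" point concrete, here is a toy sketch of the textbook linear state space recurrence, h_t = a * h_(t-1) + b * x_t with output y_t = c · h_t. It is not Sonic's architecture, which this article does not describe. The class name and parameter values are illustrative assumptions. The property it shows is that the state has a fixed size no matter how long the input stream runs.

```python
from typing import Iterable, Iterator, List


class DiagonalSSM:
    """Toy diagonal linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = sum(c * h_t)."""

    def __init__(self, a: List[float], b: List[float], c: List[float]) -> None:
        if not (len(a) == len(b) == len(c)):
            raise ValueError("a, b, and c must have the same length")
        self.a, self.b, self.c = a, b, c
        self.h = [0.0] * len(a)  # fixed-size state carried between steps

    def step(self, x: float) -> float:
        """Consume one input sample, update the state, and return one output."""
        self.h = [a * h + b * x for a, b, h in zip(self.a, self.b, self.h)]
        return sum(c * h for c, h in zip(self.c, self.h))

    def stream(self, xs: Iterable[float]) -> Iterator[float]:
        """Process an arbitrarily long stream one sample at a time."""
        for x in xs:
            yield self.step(x)


# Illustrative parameters only; a trained model would learn these.
ssm = DiagonalSSM(a=[0.5, 0.9], b=[1.0, 1.0], c=[1.0, 0.5])
outputs = list(ssm.stream([1.0, 0.0, 0.0]))
print(outputs, "state size:", len(ssm.h))
```

Each step reads only the previous state and the new sample, so memory use does not grow with the length of the conversation. A model that attends over its whole input history, by contrast, has to keep or reload that history.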

The results

Cartesia is available today as an integration for LiveKit Agents. The partnership has enabled LiveKit to deploy its agents across diverse, demanding use cases, including:

Emergency Services: AI voice agents for 911 emergency dispatch, requiring perfect reliability and natural interaction.
Gaming: AI-driven NPCs providing immersive gaming experiences.
Autonomous Vehicles: Real-time telemetry processing and decision-making.
Enterprise Solutions: Seamless integration of voice and image capabilities with existing AI systems.

For developers

The benefit of using LiveKit and Cartesia together is that users get the network quality of LiveKit's global WebRTC network combined with the speed of Cartesia's Sonic text-to-speech model.

Developers can run Agents workers on their own laptops during development, and deploy them to their own servers for production use. Agents themselves connect to the Cartesia API and use the LiveKit network to stream audio to users.
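
Because a worker talks to both the Cartesia API and the LiveKit network, it needs credentials for each, whether it runs on a laptop or a server. Below is a minimal preflight sketch that reports which settings are missing before the worker starts. The environment variable names are assumptions about what the LiveKit SDK and the plugins read by default. The Deepgram and OpenAI entries are only needed when those plugins are used, as they are in the example further down.

```python
import os
from typing import List, Mapping, Sequence

# Assumed variable names; check each plugin's documentation before relying on them.
REQUIRED_ENV_VARS: Sequence[str] = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "CARTESIA_API_KEY",
    "DEEPGRAM_API_KEY",  # speech-to-text plugin in the example
    "OPENAI_API_KEY",    # LLM plugin in the example
)


def missing_env_vars(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_ENV_VARS
) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Running a check like this at startup makes a misconfigured deployment fail with a clear message instead of an authentication error in the middle of a call.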

The result is a system where developers get a state-of-the-art text-to-speech model for their voice agent while keeping business logic, such as RAG or data lookup, in code that they control. The simplest proof-of-concept voice agent is just fifty lines of code (not counting whitespace or comments):

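# Imports assume the livekit-agents 0.x package layout with the Cartesia,
# Deepgram, OpenAI, and Silero plugins installed, plus python-dotenv.
import logging

from dotenv import load_dotenv
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    JobProcess,
    WorkerOptions,
    cli,
    llm,
)
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import cartesia, deepgram, openai, silero

# Load API keys and the LiveKit URL from a local env file
load_dotenv(dotenv_path=".env.local")
logger = logging.getLogger("voice-agent")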


def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    initial_chat_context = llm.ChatContext().append(
        role="system",
        text=(
            "You are a voice assistant created by LiveKit. Your interface with users will be voice. "
            "You should use short and concise responses, and avoid usage of unpronounceable punctuation. "
        ),
    )

    logger.info(f"connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # Wait for the first participant to connect
    participant = await ctx.wait_for_participant()
    logger.info(f"starting voice assistant for participant {participant.identity}")

    # Set up the Agent
    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(
            model="sonic",
            voice="794f9389-aac1-45b6-b726-9d9369183238",
        ),
        chat_ctx=initial_chat_context,
    )

    agent.start(ctx.room, participant)

    # Once connected, we start by hardcoding a greeting
    await agent.say(f"Hey {participant.identity}! How can I help you today?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )
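
As a sketch of the "RAG or data lookup in code that they control" idea, the snippet below retrieves the best-matching fact from a small in-memory knowledge base and appends it to a system prompt. Everything in it is hypothetical: the knowledge base, the function names, and the word-overlap scoring, which stands in for a real retrieval system. The resulting string could be passed as the `text` of the system message in the example above. Wiring retrieval into each conversation turn is not shown.

```python
import re
from typing import Dict, List, Set

# Hypothetical knowledge base; a real agent would query a database or vector store.
KNOWLEDGE_BASE: Dict[str, str] = {
    "hours": "The clinic is open from 9 am to 5 pm on weekdays.",
    "parking": "Free parking is available behind the building.",
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: Dict[str, str], top_k: int = 1) -> List[str]:
    """Return up to top_k documents that share at least one word with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        docs.values(),
        key=lambda text: len(query_words & tokenize(text)),
        reverse=True,
    )
    return [text for text in ranked[:top_k] if query_words & tokenize(text)]


def build_system_prompt(base: str, query: str, docs: Dict[str, str]) -> str:
    """Append retrieved facts to the base prompt, or return it unchanged."""
    snippets = retrieve(query, docs)
    if not snippets:
        return base
    return base + " Relevant facts: " + " ".join(snippets)


prompt = build_system_prompt(
    "You are a voice assistant.", "Where can I find parking?", KNOWLEDGE_BASE
)
print(prompt)
```

Keeping the lookup in ordinary application code means it can be tested and changed independently of the speech and language models.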

"Internet applications weren’t built for how we’ll use computers in the future. Those computers will see, hear, and speak like we do. We’ll interact with them like we do with each other. We designed LiveKit’s Agents framework to make it easy to build applications for this new paradigm. Cartesia—pioneers of the SSM architecture—shared our belief that real-time, multimodal AI models would be at the center of computing, making them the perfect Agents launch partner."

Russ d'Sa, CEO & co-founder, LiveKit

"Russ and David at LiveKit share Cartesia's vision around the need for new architectures to allow multimodal AI to reach its true potential. With LiveKit, our customers can build sophisticated agents that take phone calls in just a few hours. We're thankful for the opportunity to power their Agents to accomplish more complex reasoning tasks than ever before."
Karan Goel, CEO, Cartesia

"Using LiveKit and Cartesia enables us to effectively deploy reliable, conversational agents with natural voices at scale to serve our customers 24/7 by improving patient access and experience."
Jeffery Liu, CEO at Assort Health