Announcing our seed round
December 12, 2024
We’re excited to announce our $27M seed round, led by Index Ventures with participation from Lightspeed, Factory, Conviction, General Catalyst, A*, SV Angel, and 90 amazing angel investors.
At Cartesia, our mission is to build real-time intelligence with long memory that runs wherever you are. Artificial intelligence has gone through an extraordinary transition by scaling Transformer models. However, these models have significant limitations: they can only process and generate short spans of information at a time (minutes of audio, seconds of video), can't persist state efficiently between interactions, and are too expensive to run in real time on most hardware. The next generation of AI will require innovation in the base architectures of intelligence.
Over the past few years, we've pioneered new architectural advances (e.g., S4 and Mamba) toward this vision. These new architectures have several important properties: their computational cost scales linearly with sequence length rather than quadratically, they learn to compress long sequences of data into a fixed-size state, and they are fast and efficient at inference.
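To make the contrast with Transformers concrete, here is a minimal sketch (illustrative only, not Cartesia's code; the function and variable names are ours) of the linear recurrence at the heart of these models: each step updates a fixed-size state, so cost grows linearly with sequence length and memory stays constant no matter how long the input is.

```python
# Illustrative sketch of a discretized linear state-space recurrence,
# the core idea behind S4/Mamba-style models.
import numpy as np

def ssm_scan(A, B, C, u):
    """Compute y_t = C h_t with h_t = A h_{t-1} + B u_t over an input sequence u."""
    d_state = A.shape[0]
    h = np.zeros(d_state)          # fixed-size state: the model's "working memory"
    ys = []
    for u_t in u:                  # one constant-cost update per step -> O(T) overall
        h = A @ h + B * u_t
        ys.append(C @ h)
    return np.array(ys)

# Toy example: a scalar input sequence of length 1000, 4-dimensional state.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                # stable dynamics so the state doesn't blow up
B = rng.standard_normal(4)
C = rng.standard_normal(4)
print(ssm_scan(A, B, C, rng.standard_normal(1000)).shape)  # (1000,)
```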
We're charting a path to a future with models that continuously ingest massive context about the world, compress it into working memory, and generate across multiple modalities in real time on a device near you.
A large part of our mission is bringing these cutting-edge models to our customers building the future of real-time AI applications. This year, we released Sonic, the fastest hyper-realistic voice generation model, and brought it on-device. Sonic is in production with thousands of customers ranging from individual creators and startups to large enterprises, powering the next generation of voice agents, digital media, and assistants.
Building long-context generative models for rich multimodal signals like audio and video remains challenging: models are hard to control and quickly go off track. Over the past few months, we've created a new SSM architecture for multi-stream models that continuously reason and generate over multiple data streams of different modalities in parallel. This enables us to train end-to-end models with highly efficient streaming inference and unprecedented control across modalities.
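As a rough illustration of the multi-stream idea (a hypothetical sketch under our own assumptions, not the production architecture), imagine two modality streams advancing in lockstep while sharing a single fixed-size recurrent state, with each stream decoded from that shared state at every step:

```python
# Hypothetical multi-stream sketch: two streams (say, text and audio features)
# are fused into one shared fixed-size state and generated in parallel.
import numpy as np

rng = np.random.default_rng(0)
d_state, d_text, d_audio, T = 16, 8, 4, 100

A = 0.95 * np.eye(d_state)                        # shared state dynamics
B_text = rng.standard_normal((d_state, d_text))   # per-stream input projections
B_audio = rng.standard_normal((d_state, d_audio))
C_text = rng.standard_normal((d_text, d_state))   # per-stream output heads
C_audio = rng.standard_normal((d_audio, d_state))

h = np.zeros(d_state)
text_in = rng.standard_normal((T, d_text))
audio_in = rng.standard_normal((T, d_audio))

for t in range(T):
    # One shared update fuses both streams into the same working memory...
    h = A @ h + B_text @ text_in[t] + B_audio @ audio_in[t]
    # ...and both streams are decoded from it in parallel at every step.
    text_out, audio_out = C_text @ h, C_audio @ h
```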
We’ve trained an end-to-end voice generation model with our multi-stream architecture that enables fine-grained control over text to prevent hallucinations, while maintaining the unmatched realism of end-to-end generation. This represents a fundamental improvement over previous architectures for end-to-end audio generation, which can struggle to follow complex, long, and repetitive transcripts. This is broadly important for our users, and particularly so for those building voice agents where correctness is critical. Here are a few examples:
Are you available at any of the following times? 10:00am, 10:05am, 10:10am, 10:15am, 10:20am, 10:25am, 10:30am, 10:35am, 10:40am, 10:45am, 10:50am, or 10:55am?
My phone number is 8888888888.
My email address is HELLOWORLD at CARTESIA dot AI.
How much wood could a woodchuck chuck if a woodchuck could chuck wood? A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood.
You can try this new model in the playground today under Sonic Preview, and it will be available via our real-time API over the next few weeks.
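For reference, a text-to-speech request against the API might look roughly like the following. This is a sketch only: the endpoint path, header names, field names, and identifiers here are assumptions rather than documented values, so check the API docs and playground for the exact request shape.

```python
# Hypothetical usage sketch; all request details below are assumptions.
import os
import requests

resp = requests.post(
    "https://api.cartesia.ai/tts/bytes",              # assumed endpoint
    headers={
        "X-API-Key": os.environ["CARTESIA_API_KEY"],  # assumed header name
        "Cartesia-Version": "2024-06-10",             # assumed version header
        "Content-Type": "application/json",
    },
    json={
        "model_id": "sonic-preview",                  # assumed model identifier
        "transcript": "My phone number is 8888888888.",
        "voice": {"mode": "id", "id": "YOUR_VOICE_ID"},  # placeholder voice id
        "output_format": {"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
    },
)
resp.raise_for_status()
with open("sonic_preview.wav", "wb") as f:
    f.write(resp.content)
```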
We’re continuing to scale our multi-stream SSM architectures across multiple input and output modalities, and have an ambitious plan to deliver the next generation of real-time multimodal AI. If you’re excited about our work, please consider joining us.