Ink: The fastest and most accurate speech-to-text model
Ranked #1 on accuracy, built for voice agents with semantic endpointing and industry-leading latency.
One transcription model for every environment your business takes you to
Trains rumble past, announcements crackle overhead. Ink-2 transcribes every word the caller says.
Noisy city
Built for Voice Agents
Four capabilities that make Ink the transcription layer production agents rely on.
Dates, alphanumerics, IDs
Accuracy
Heard right the first time.
In practice
In a voice agent, the transcript is the foundation everything else builds on. A transcription error undermines the LLM input and takes the interaction in the wrong direction.
The inverse is equally true — accuracy compounds, and a precise transcript means a better response and a call that resolves.
Ink-2's approach
Ink has the lowest Word Error Rate (WER) of any streaming STT model, natively handling structured data — phone numbers, dates, emails, currencies, and UUIDs. Built for real-world audio settings — telephony, background noise, varied accents, and more.
Cartesia: 00:00 → +5.5s LLM begins drafting → +9.3s LLM refines draft → first action ready (9.5s head start)
Competitor: 00:00 → +15.0s LLM starts after turn.end
Conversational flow
Knows when you start and finish.
In practice
A conversation has two critical moments: when a caller starts talking and when they finish. Miss the start and the agent misses the turn entirely. Detect the end too early and the agent jumps in mid-thought. The right transcription model gets both right without the wait.
Ink-2's approach
Ink-2 is built with native turn detection — turn.start and turn.end signaled directly by the model, with no external VAD to integrate or maintain. For lower latency, turn.eager_end gives your LLM a head start before the turn is confirmed complete.
Semantic endpointing determines turn end by meaning, not silence — so pauses mid-thought don't trigger the agent prematurely.
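As a rough illustration of how an agent might consume these signals, here is a minimal Python sketch. Only the event names turn.start, turn.eager_end, and turn.end come from the description above. The dictionary message shape, the "type" and "transcript" fields, the TurnHandler class, and the idea that a resumed caller shows up as another turn.start are all assumptions made for the example, not the documented Ink API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TurnHandler:
    """Hypothetical helper: tracks one caller turn and decides when to draft or respond."""
    log: list = field(default_factory=list)
    draft: Optional[str] = None  # speculative reply, if one is in progress

    def on_event(self, event: dict) -> None:
        kind = event["type"]  # assumed field name
        if kind == "turn.start":
            # Caller began (or resumed) speaking: a speculative draft is now stale.
            if self.draft is not None:
                self.log.append("discard draft")
                self.draft = None
            self.log.append("listening")
        elif kind == "turn.eager_end":
            # Turn is probably over: start drafting early, but do not speak yet.
            self.draft = f"draft for: {event['transcript']}"
            self.log.append("start draft")
        elif kind == "turn.end":
            # Turn confirmed complete: draft now if nothing was started, then respond.
            if self.draft is None:
                self.draft = f"draft for: {event['transcript']}"
                self.log.append("start draft")
            self.log.append("respond")
            self.draft = None


handler = TurnHandler()
events = [
    {"type": "turn.start"},
    {"type": "turn.eager_end", "transcript": "my order number is"},
    {"type": "turn.start"},  # caller kept talking after a pause
    {"type": "turn.eager_end", "transcript": "my order number is A12"},
    {"type": "turn.end", "transcript": "my order number is A12"},
]
for ev in events:
    handler.on_event(ev)
print(handler.log)
```

In this sketch the agent only speaks on turn.end. The eager signal is used purely to start work early, so a wrong guess costs a discarded draft rather than an interruption.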
Speed
The caller stops talking.
The agent starts thinking.
In practice
When transcription is fast and consistent, the agent's response feels immediate. One slow transcript in ten means one call in ten where that readiness breaks. Nine great calls don't cancel out the one that didn't feel right.
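The "one in ten" point can be made concrete with a small Python sketch. The latency samples and the 0.3s budget below are hypothetical values chosen for illustration, not measurements of Ink or any other model. The sketch shows how an average can sit under budget while one call in ten still misses it.

```python
def slow_fraction(latencies_s, budget_s):
    """Return the share of transcripts whose latency exceeds the budget."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    slow = sum(1 for t in latencies_s if t > budget_s)
    return slow / len(latencies_s)


# Hypothetical samples: nine fast transcripts and one slow one (seconds).
samples = [0.1] * 9 + [1.2]
budget = 0.3  # hypothetical per-transcript latency budget, in seconds

mean_latency = sum(samples) / len(samples)
share_slow = slow_fraction(samples, budget)

print(f"mean latency: {mean_latency:.2f}s")
print(f"share over budget: {share_slow:.0%}")
```

Tracking the share of slow transcripts (or a high percentile) alongside the mean is one way to catch the calls that did not feel right.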
Ink-2's approach
Ink is the fastest streaming ASR model, running on a custom inference engine purpose-built for real-time conversation. Time to final transcript is 0.1s, and turn.eager_end further reduces the gap between the last word and the first response.
Cost
Quality that doesn't cost more as you grow.
In practice
Voice is the most natural interface for communication. Getting cost and quality right at scale enables voice everywhere — the default interface across every agentic interaction.
Ink-2's approach
Ink's State Space Model architecture delivers 10-100x the throughput of transformers — lower compute cost at scale, with no quality tradeoffs. Ongoing optimization of our model stack means better unit economics as you scale.
Enterprise-grade security. From Cloud to Local.
- HIPAA compliant
- SOC 2 Type 2
- GDPR
- PCI
FAQs
Frontier research, deployed in every conversation.





