Ink: The fastest and most accurate speech-to-text model

Ranked #1 on accuracy, built for voice agents with semantic endpointing and industry-leading latency.

One transcription model for every environment your business takes you to

Trains rumble past, announcements crackle overhead. Ink-2 transcribes every word the caller says.

Noisy city

Built for Voice Agents

Four capabilities that make Ink the transcription layer production agents rely on.

Dates, alphanumerics, IDs

Accuracy

Heard right the first time.

In practice

In a voice agent, the transcript is the foundation everything else builds on. A transcription error undermines the LLM input and takes the interaction in the wrong direction.

The inverse is equally true — accuracy compounds, and a precise transcript means a better response and a call that resolves.


Ink-2's approach

Ink has the lowest Word Error Rate (WER) of any streaming STT model and natively handles structured data: phone numbers, dates, emails, currencies, and UUIDs. It is built for real-world audio settings such as telephony, background noise, and varied accents.
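For readers less familiar with the metric, WER is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below is a generic illustration of that formula using word-level edit distance. It is not Cartesia's evaluation code, and the function name and the lowercase/whitespace normalization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

One misheard character in a spelled-out ID counts as a single substitution here, which is why structured data such as IDs and phone numbers can weigh so heavily on whether a call resolves.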

Figure: turn-detection timeline

Cartesia
  00:00
  +5.5s  turn.eager_end: LLM begins drafting
  +9.3s  turn.eager_end: LLM refines draft
  turn.end: first action ready (9.5s head start)

Competitor
  00:00
  +15.0s  turn.end: LLM starts after turn.end

Conversational flow

Knows when you start and finish.

In practice

A conversation has two critical moments — when a caller starts talking and when they finish. Miss the start and the agent misses the turn entirely. Trigger too early on the end and the agent jumps in mid-thought. The right transcription model gets both right without the wait.


Ink-2's approach

Ink-2 is built with native turn detection — turn.start and turn.end signaled directly by the model, with no external VAD to integrate or maintain. For lower latency, turn.eager_end gives your LLM a head start before the turn is confirmed complete.

Semantic endpointing determines turn end by meaning, not silence — so pauses mid-thought don't trigger the agent prematurely.
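As a rough sketch of how an agent might consume these signals: start a speculative LLM draft on turn.eager_end, refresh it if another eager end arrives, and commit only on turn.end. The event names come from the text above. The `TurnEvent` and `DraftingAgent` classes, the `transcript` field, and the log labels are hypothetical and are not part of any Cartesia SDK.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnEvent:
    kind: str             # "turn.start", "turn.eager_end", or "turn.end"
    transcript: str = ""  # hypothetical payload field


@dataclass
class DraftingAgent:
    draft_for: Optional[str] = None  # transcript the current draft is based on
    log: List[str] = field(default_factory=list)

    def handle(self, event: TurnEvent) -> Optional[str]:
        """Return the transcript to answer once the turn is confirmed, else None."""
        if event.kind == "turn.start":
            # Caller started speaking: drop any stale draft and listen.
            self.draft_for = None
            self.log.append("listening")
        elif event.kind == "turn.eager_end":
            # Turn is probably over: start (or refresh) a speculative draft.
            self.log.append("draft" if self.draft_for is None else "redraft")
            self.draft_for = event.transcript
        elif event.kind == "turn.end":
            # Turn confirmed: reuse the draft only if it matches the final transcript.
            reused = self.draft_for == event.transcript
            self.log.append("commit:reused" if reused else "commit:fresh")
            self.draft_for = None
            return event.transcript
        return None
```

Feeding the agent a turn.start, two turn.eager_end events, and a matching turn.end would leave the log as listening, draft, redraft, commit:reused. The draft is kept speculative until turn.end so that a caller who pauses and then continues never hears a reply to half a sentence.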

Figure: latency comparison
  Sonic: 82ms
  Blink of an eye: 100ms
  Human response threshold: 150ms

Speed

The caller stops talking.
The agent starts thinking.

In practice

When transcription is fast and consistent, the agent's response feels immediate. One slow transcript in ten means one call in ten where that immediacy breaks. Nine great calls don't cancel out the one that didn't feel right.


Ink-2's approach

Ink is the fastest streaming ASR model, running on a custom inference engine purpose-built for real-time conversation. Time to final transcript is 0.1s, and turn.eager_end further reduces the gap between the last word and the first response.

Cost

Quality that doesn't cost more as you grow.

In practice

Voice is the most natural interface for communication. Getting cost and quality right at scale enables voice everywhere — the default interface across every agentic interaction.


Ink-2's approach

Ink's State Space Model architecture delivers 10-100x the throughput of transformers — lower compute cost at scale, with no quality tradeoffs. Ongoing optimization of our model stack means better unit economics as you scale.

Enterprise-grade security. From Cloud to Local.

  • HIPAA compliant

  • SOC 2 Type 2

  • GDPR

  • PCI

FAQs

Why is Ink-2 the best STT for voice agents?
Do I still need Krisp or Silero with Ink-2?
Can Ink-2 run on-prem or in my own cloud (VPC)?
What languages does Ink-2 support?
How much does Ink-2 cost?
When should I contact Sales?

Frontier research, deployed in every conversation.