Question 1

Why is Ink-2 the best STT for voice agents?

Accepted Answer

Ink-2 is Cartesia's streaming speech-to-text model, purpose-built for production voice agents. Like our TTS, it's built on the State Space Model (SSM) architecture pioneered by our founding team at Stanford, which is what enables Ink-2 to deliver three things at once:

Lowest word error rate (WER) of any streaming STT model. Ink-2 outperforms Deepgram Flux, Soniox RT-V4, AssemblyAI RT Pro, ElevenLabs Scribe-2-realtime, and other production streaming models across line recordings, accented speech, noisy conditions, and earnings calls, including on alphanumerics like phone numbers, emails, and UUIDs.

Best-in-class turn detection, built in. Ink-2 has model-integrated end-of-turn detection, meaning you don't need a separate turn-taking model like Silero in your stack. Fewer dependencies, lower latency, more accurate turn-taking.

Built-in noise robustness. Ink-2 is robust to background noise without requiring Krisp or other audio filters, removing cost and latency from your stack.
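For readers who want to check WER on their own audio, the metric is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below is a generic illustration of that definition, not Cartesia's benchmark code. It assumes plain whitespace tokenization and no text normalization; the function name `word_error_rate` is hypothetical, and real evaluations usually normalize casing, punctuation, and number formats first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()  # assumption: whitespace tokenization, no normalization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "call me at five five five one two three four"
hypothesis = "call me at five five nine one two three four"
print(word_error_rate(reference, hypothesis))  # one substitution in ten words
```

Alphanumeric strings such as phone numbers are a demanding case for this metric because every digit is its own word, so a single misheard digit counts as a full error.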

Question 2

Do I still need Krisp or Silero with Ink-2?

Accepted Answer

No. This is one of the main reasons teams switch to Ink-2:

Turn-taking is built in. Most STT models require an external turn detector (like Silero) to know when the user is done speaking. Ink-2 handles this natively, which is both faster and more accurate than running a separate model.

Noise robustness is built in. Most voice agents stack Krisp or similar audio filters on top of their STT to remove background noise, adding cost, latency, and complexity. Ink-2 is robust to background noise out of the box.

Fewer models in your stack means lower latency, lower cost, and fewer points of failure.

Question 3

Can Ink-2 run on-prem or in my own cloud (VPC)?

Accepted Answer

Yes, Ink-2 can be deployed:

On-prem inside your data center, including air-gapped environments

In your own VPC on AWS, GCP, or Azure

Via OEM licensing for embedding Ink-2 directly into your product

This makes Ink-2 viable for government contracting, regulated industries (healthcare, financial services, insurance), and customers with data sovereignty or residency requirements. On-prem and OEM deployments are available under enterprise contracts.

Question 4

What languages does Ink-2 support?

Accepted Answer

Ink-2 currently supports English. Multilingual Ink-2 will be available soon.

Question 5

How much does Ink-2 cost?

Accepted Answer

Ink-2 is priced at 3 credits/second. Pricing details and volume discounts are on the pricing page at https://www.cartesia.ai/pricing. Enterprise contracts include custom pricing based on usage and deployment model.

Cartesia also offers a startup grants program at https://www.cartesia.ai/startups with credits for qualifying early-stage companies.
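To get a feel for what 3 credits/second means in practice, here is a rough back-of-the-envelope sketch. It assumes billing is strictly linear in audio duration, with no rounding, minimums, or volume discounts; those details are not stated here and may differ, so treat the pricing page as authoritative. The function names are hypothetical.

```python
CREDITS_PER_SECOND = 3  # rate quoted above


def credits_for_audio(seconds: float) -> float:
    """Credits consumed by `seconds` of audio, assuming linear billing."""
    return seconds * CREDITS_PER_SECOND


def audio_hours_for_credits(credits: float) -> float:
    """Hours of audio that `credits` would cover, assuming linear billing."""
    return credits / CREDITS_PER_SECOND / 3600


print(credits_for_audio(5 * 60))                      # 5-minute call -> 900 credits
print(credits_for_audio(3600))                        # 1 hour of audio -> 10800 credits
print(round(audio_hours_for_credits(50_000_000)))     # 50M credits -> about 4630 hours
```

Under these assumptions, 50M credits corresponds to roughly 4,600 hours of transcribed audio.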

Question 6

When should I contact Sales?

Accepted Answer

Reach out to the Cartesia team if any of the following apply:

You're running high-volume production workloads (>50M credits)

You need on-prem, VPC, or OEM deployment

You need a BAA, zero data retention, or other contractual compliance terms for healthcare, financial services, or regulated sectors

You're in government, federal, or public sector procurement

For everything else (evaluation, prototyping, smaller production workloads), the self-serve plans and technical documentation at https://docs.cartesia.ai/build-with-cartesia/stt-models/latest will get you started.

Ink: The fastest and most accurate speech-to-text model

One transcription model for every environment your business takes you to

Built for Voice Agents

Heard right the first time.

Knows when you start and finish.

The caller stops talking.
The agent starts thinking.

Quality that doesn't cost more as you grow.

Enterprise-grade security. From Cloud to Local.
