Voice & Speech · AI & Automation

Talk to your software.
Speech that acts.

Real-time voice systems that hear a caller, transcribe every word, understand the intent, and respond out loud — fast enough to feel like a conversation. Streaming ASR, Claude for understanding, natural speech back, built for low latency and shipped to production.


/01 Drag the volume · watch the pipeline keep up

Live · streaming speech, on the wire

Accuracy 98%

⏺ "Move my appointment to next Tuesday."

Speech minutes / day · 2,400 min

Transcription accuracy · 98%

Handle time cut · −54%

Minutes transcribed · 120M

Median time-to-speak · 0.6s

02 — Outcomes

Voice that did the work.

A ledger of named systems where the line that moved was a call answered, a note captured, an intent resolved — handle time, accuracy, CSAT, accessibility. 6 of 30 shown · ledger updates as systems scale.

Northwind Care

IVR replacement · Inbound support

Streaming ASR plus a Claude voice agent answers and resolves routine calls end-to-end — no phone-tree menus, no hold music

−54% · Average handle time

Cobalt Clinical

Meeting notes · Visit transcription

Diarized transcription of patient visits with speaker labels, then a structured summary drafted for the clinician to sign off

98% · Word accuracy

Vera Bank

Voice agent · Account servicing

Barge-in voice agent handles balance, transfers, and disputes with verification, escalating to a human only when grounding is weak

71% · Calls self-served

Lumen Live

Accessibility · Live captions

Low-latency live captioning for events and webinars, streaming partial transcripts under a second behind the speaker

0.6s · Caption latency

Drift Retail

Drive-thru · Order taking

Noise-robust ASR with a constrained menu grammar takes orders in the lane, confirms out loud, and posts straight to the POS

+38% · Orders per hour

Forge Field

Voice notes · Hands-free logging

Field techs dictate job logs hands-free; speech is transcribed, structured into the work-order fields, and synced offline-first

+62% · Reports completed

03 — The speech loop, live

It doesn't just hear.
It understands, then speaks.

Press play and watch the loop run: audio streams in as a waveform, the transcript appears word-by-word, the intent is detected, a response is composed, and a voice speaks it back. Toggle diarization to label speakers; switch latency mode to trade accuracy for speed.

Live audio · streaming · 16 kHz

ASR · Whisper / Deepgram

ASR → Understand → Speak

Streaming ASR emits partial words as the caller speaks, so understanding starts before they finish — that's what keeps a voice agent feeling like a conversation, not a form.
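A minimal sketch of that idea, assuming the ASR delivers `(text, is_final)` tuples as the transcript grows. The `detect_intent` keyword lookup and its intent names are hypothetical stand-ins for the real understanding step, not the production logic.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical keyword table; a real system would use a classifier or an LLM.
INTENT_KEYWORDS = {
    "reschedule_appointment": ("move my appointment", "reschedule"),
    "check_balance": ("balance",),
}

def detect_intent(text: str) -> Optional[str]:
    """Return the first intent whose keyword appears in the text, else None."""
    lowered = text.lower()
    for intent, phrases in INTENT_KEYWORDS.items():
        if any(phrase in lowered for phrase in phrases):
            return intent
    return None

def early_intents(
    partials: Iterable[Tuple[str, bool]],
) -> Iterator[Tuple[str, str, bool]]:
    """Yield (intent, transcript_so_far, is_final).

    An intent is reported the first time it shows up in a partial, and again
    on the final transcript so downstream code can confirm or revise it.
    """
    announced: Optional[str] = None
    for text, is_final in partials:
        intent = detect_intent(text)
        if intent and (intent != announced or is_final):
            announced = intent
            yield intent, text, is_final

# Simulated partial results for one utterance (assumed shape, not a vendor API).
stream = [
    ("Move", False),
    ("Move my", False),
    ("Move my appointment", False),
    ("Move my appointment to next Tuesday.", True),
]
events = list(early_intents(stream))
```

Here the intent is available after the third partial, before the caller has said which day they want, so tool lookups can start while the rest of the sentence is still arriving.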

04 — Anatomy of the pipeline

Built like a pipeline,
not a phone tree.

Quality hides in the stages between the microphone and the speaker — how you capture, where you transcribe, who decides the intent, how fast you speak back. This is the room we work in: each stage measured, each tool chosen for a reason.

Voice pipeline · Northwind Care Agent

WER 2.0% · Intents 94% · p95 0.6s

| Stage | Tool | What it does | Signal |
| --- | --- | --- | --- |
| Capture | WebRTC · Twilio Media | Stream low-latency audio from browser or phone over the wire | 16 kHz |
| VAD | Silero VAD · WebRTC VAD | Detect speech vs. silence so we transcribe only real utterances | gated |
| Transcribe | Whisper · Deepgram | Streaming ASR emits partial then final words with timestamps | 2.0% WER |
| Diarize | pyannote · Deepgram | Attribute each segment to speaker A or B for multi-party audio | A / B |
| Understand | Claude Opus 4.8 · 1M ctx | Classify intent, fill slots, decide the next action with tool use | 94% |
| Respond | Claude · structured outputs | Compose a grounded reply or call a tool, abstain when unsure | grounded |
| Speak | ElevenLabs · streaming TTS | Synthesize natural speech, streamed so the reply starts instantly | 0.6s |
| Evaluate | WER harness · LLM-as-judge | Score accuracy, intent resolution & latency every release | 98% |

Legend: green · measured & in target; live · the stage running in the demo above; amber · watch, above latency budget
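One way the Transcribe and Diarize stages can be joined is by timestamp overlap: each ASR word goes to the speaker turn it overlaps most. The sketch below assumes words and turns arrive as simple `(start, end)` records in seconds; the `Word` and `Turn` types are illustrative and not the output format of pyannote or Deepgram.

```python
from typing import List, NamedTuple, Tuple

class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float

class Turn(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float

def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_words(words: List[Word], turns: List[Turn]) -> List[Tuple[str, str]]:
    """Attach a speaker label to each word by maximum time overlap.

    Words that overlap no turn at all are labelled "unknown" rather than guessed.
    """
    labelled = []
    for word in words:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in turns:
            shared = overlap(word.start, word.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labelled.append((best_speaker, word.text))
    return labelled

words = [Word("Hi", 0.0, 0.3), Word("there", 0.3, 0.6), Word("Hello", 1.0, 1.4)]
turns = [Turn("A", 0.0, 0.8), Turn("B", 0.9, 1.5)]
labelled = label_words(words, turns)
```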

05 — Ship to production

Scope the call.

Before a single byte of audio is captured, we map what callers actually say, what counts as a resolved intent, and where the latency budget bites. Then we build an evaluation set — the utterances we'll grade every release against.

/ Week 00 · Scope & eval set

Utterances · 120 real calls pulled from recordings, IVR logs, and interviews

Intents · The actions callers actually want, mapped to tools and slots

Eval set · Golden audio with reference transcripts & expected intents

Target · WER ≤ 5%, p95 < 800 ms, abstain when confidence is low
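For reference, word error rate is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The sketch below is a plain dynamic-programming version with naive normalization (lowercasing and whitespace splitting only); a real harness would also normalize punctuation and numerals, and the function name is our own.

```python
from typing import List

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate(
    "move my appointment to next tuesday",
    "move my appointment to next thursday",
)
meets_target = wer <= 0.05  # the WER target stated above
```

One wrong word out of six gives a WER of about 16.7%, which would miss the 5% target on this single utterance; the target is meant to be scored across the whole eval set.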

Capture & transcribe.

Stream real audio — browser, phone, or field device — gate it with voice-activity detection, then transcribe with streaming ASR so partial words land before the caller finishes. Noisy audio transcribes to noise; this is where most voice systems quietly fail.

/ Week 01 · Capture & ASR

Capture · WebRTC · Twilio Media Streams · 16 kHz PCM

VAD · Silero · WebRTC VAD · endpointing on silence

ASR · Whisper · Deepgram · streaming partials

Noise · RNNoise · gain control · echo cancellation

Buffer · Redis stream · backpressure on slow downstream
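To illustrate gating and endpointing on silence, here is a deliberately simple energy-threshold detector. It is a stand-in for a trained model such as Silero VAD, not a replacement for one, and the `threshold` and `hangover_frames` values are assumptions chosen for the example.

```python
import math
from typing import Iterable, Iterator, List, Sequence

Frame = Sequence[float]  # one short block of PCM samples scaled to [-1.0, 1.0]

def rms(frame: Frame) -> float:
    """Root-mean-square energy of a frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))

def utterances(
    frames: Iterable[Frame],
    threshold: float = 0.02,
    hangover_frames: int = 3,
) -> Iterator[List[Frame]]:
    """Group frames into utterances, closing one after a run of silent frames.

    Leading silence is dropped. Pauses shorter than `hangover_frames` stay
    inside the utterance so a brief hesitation does not split a sentence.
    """
    current: List[Frame] = []
    pending: List[Frame] = []  # silent frames seen since the last speech frame
    for frame in frames:
        if rms(frame) >= threshold:
            current.extend(pending)  # the pause was short: keep it
            pending = []
            current.append(frame)
        elif current:
            pending.append(frame)
            if len(pending) >= hangover_frames:
                yield current  # endpoint reached: hand the utterance to ASR
                current, pending = [], []
    if current:
        yield current

loud, quiet = [0.5] * 160, [0.0] * 160
frames = [quiet, loud, loud, quiet, loud, quiet, quiet, quiet, loud]
segments = list(utterances(frames))
```

The hangover length is the main tuning knob: too short and callers get cut off mid-sentence, too long and every turn pays the extra wait before transcription is finalized.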

Diarize & understand.

Attribute each segment to a speaker, then hand the transcript to Claude to classify intent, fill slots, and decide the next action. Hybrid from day one — fast classifiers for the common asks, Claude for the fuzzy ones.

/ Week 02 · Diarize & understand

Diarization · pyannote · speaker A / B labels

Intent · Claude · structured outputs, strict tools

Slots · dates, accounts, amounts · validated

Grounding · cite from account data, not memory

Confidence gate · tuned against the eval set
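The hybrid routing and confidence gate described above can be sketched as follows. Both classifiers are passed in as plain callables returning `(intent, confidence)`; the stub functions, intent names, and threshold values are hypothetical, and in production the fallback would wrap a Claude call rather than a keyword check.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Classifier = Callable[[str], Tuple[str, float]]  # returns (intent, confidence)

@dataclass
class Decision:
    intent: str
    confidence: float
    source: str  # "fast", "llm", or "abstain"

def route(
    utterance: str,
    fast_classifier: Classifier,
    llm_fallback: Classifier,
    fast_threshold: float = 0.85,    # assumed value; tune against the eval set
    abstain_threshold: float = 0.5,  # assumed value; tune against the eval set
) -> Decision:
    """Try the cheap classifier first, fall back to the LLM, abstain if both are unsure."""
    intent, confidence = fast_classifier(utterance)
    if confidence >= fast_threshold:
        return Decision(intent, confidence, "fast")

    intent, confidence = llm_fallback(utterance)
    if confidence >= abstain_threshold:
        return Decision(intent, confidence, "llm")

    # Neither path is confident enough: hand the call to a person.
    return Decision("handoff_to_human", confidence, "abstain")

# Stub classifiers for illustration only.
def fast_stub(utterance: str) -> Tuple[str, float]:
    return ("check_balance", 0.95) if "balance" in utterance.lower() else ("unknown", 0.2)

def llm_stub(utterance: str) -> Tuple[str, float]:
    return ("dispute_charge", 0.7) if "charge" in utterance.lower() else ("unknown", 0.1)

common = route("What's my balance?", fast_stub, llm_stub)
fuzzy = route("I don't recognise this charge", fast_stub, llm_stub)
unclear = route("uh, hmm", fast_stub, llm_stub)
```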

Respond & speak.

Compose the reply — answer, confirm, or hand to a human — then stream it through text-to-speech so the voice starts before the sentence is finished. Barge-in lets the caller interrupt; an honest "let me get a person" beats a confident wrong answer.

/ Week 03 · Respond & speak
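A rough sketch of how a reply can start playing before it is fully composed: buffer the incoming text, release each complete sentence to text-to-speech as soon as it closes, and stop when the caller barges in. The sentence splitter is a naive regex (it will mis-split abbreviations such as "Dr."), and the `barge_in` event is an assumed signal from the VAD, not part of any TTS vendor's API.

```python
import re
import threading
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")

def speakable_chunks(tokens: Iterable[str], barge_in: threading.Event) -> Iterator[str]:
    """Yield complete sentences as they become available.

    Stops without yielding the remainder once `barge_in` is set, so the
    agent falls silent when the caller starts talking.
    """
    buffer = ""
    for token in tokens:
        if barge_in.is_set():
            return  # caller interrupted: drop the rest of the reply
        buffer += token
        parts = _SENTENCE_BREAK.split(buffer)
        for sentence in parts[:-1]:
            yield sentence  # ready to send to TTS now
        buffer = parts[-1]  # incomplete tail waits for more tokens
    if buffer.strip() and not barge_in.is_set():
        yield buffer.strip()

reply_tokens = ["Sure, ", "I can ", "move that. ", "Next Tuesday ", "at 3 pm ", "works."]

uninterrupted = list(speakable_chunks(reply_tokens, threading.Event()))

# Simulate the caller interrupting after the first sentence has been spoken.
interrupt = threading.Event()
chunks = speakable_chunks(reply_tokens, interrupt)
first = next(chunks)
interrupt.set()
remainder = list(chunks)
```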

Evaluate honestly.

Run the frozen eval set on every change and score it: word error rate, intent resolution, p95 latency, abstain rate. No "sounded fine in the demo." A number that moves, or the change doesn't ship.

/ Week 04 · Evaluate

Word error · 2.0% · accurate transcription on real audio

Intent resolved · 94% · the right action chosen and filled

p95 latency · 0.6s · speech-to-speech round trip

Abstain rate · 5% · handed to a human rather than guessed
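A release gate over these numbers might look like the sketch below. It uses the targets from the scope step (WER ≤ 5%, p95 < 800 ms) and a nearest-rank percentile; the function names and report shape are our own illustration, and the sample latencies are made up for the example.

```python
from typing import Dict, Sequence

# Targets from the scope step: WER <= 5%, p95 latency < 800 ms.
MAX_WER = 0.05
MAX_P95_SECONDS = 0.8

def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: the smallest value with at least 95% of samples at or below it."""
    if not values:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def release_gate(
    wer: float,
    latencies_s: Sequence[float],
    abstained: int,
    total_turns: int,
) -> Dict[str, object]:
    """Summarize one eval run and decide whether the change may ship."""
    latency = p95(latencies_s)
    return {
        "wer": wer,
        "p95_latency_s": latency,
        "abstain_rate": abstained / total_turns,
        "ship": wer <= MAX_WER and latency < MAX_P95_SECONDS,
    }

# Made-up latencies: one slow outlier does not move the p95 past the budget.
latencies = [0.4] * 18 + [0.6, 1.2]
report = release_gate(wer=0.02, latencies_s=latencies, abstained=1, total_turns=20)
```

The abstain rate is reported but not gated here, since whether a higher rate is good or bad depends on what the agent would otherwise have guessed.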

Ship & monitor.

We go live with caching for cost, guardrails for safety, and logging on every turn. We watch WER and latency as call volume grows, catch drift before callers do, and keep the system listening as accents, noise, and asks change underneath it.

/ Ongoing · Ship & monitor

Prompt caching · system

Turn logging

Latency alerts

PII redaction

Barge-in by default

Abstain on low confidence

Weekly WER run

Human handoff path
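As an example of the PII redaction step, transcripts can be scrubbed before a turn is written to the log. The patterns below are illustrative only and far from exhaustive (they cover simple email, card-number, and phone shapes); a production system would need locale-aware rules and review.

```python
import re
from typing import List, Pattern, Tuple

# Order matters: more specific shapes are replaced before looser ones.
PII_PATTERNS: List[Tuple[str, Pattern[str]]] = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
    # At least nine characters of digits and phone punctuation, digit at both ends.
    ("PHONE", re.compile(r"\+?\d[\d ()-]{7,}\d")),
]

def redact(text: str) -> str:
    """Replace each matched span with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text

logged = redact("Call me on 555-010-1234 or mail sam@example.com")
```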

06 — Why it compounds

An evaluated voice system improves.

Every eval run feeds the next: missed words sharpen the audio pipeline, failed intents tighten the prompts, slow turns justify the streaming work. Ship-and-forget voice drifts as accents, noise, and new asks pile up. Evaluated and tended, accuracy compounds.

Eval-driven by Stack Innovations — accuracy climbs as the audio pipeline and prompts tighten

Ship-and-forget — plateaus, then drifts as accents, noise, and edge cases pile up

Representative of a typical 12-month engagement · transcription accuracy on a frozen evaluation set.

Stop the phone tree.
Start talking. Out loud.
