Voice & Speech · AI & Automation

Talk to your software.
Speech that acts.

Real-time voice systems that hear a caller, transcribe every word, understand the intent, and respond out loud — fast enough to feel like a conversation. Streaming ASR, Claude for understanding, natural speech back, built for low latency and shipped to production.

[Interactive demo: drag the volume and watch the pipeline keep up. Live streaming speech, on the wire. Accuracy 98%. Sample utterance: "Move my appointment to next Tuesday." Readouts: words per day, WER (accuracy), intents resolved, p95 latency.]
Speech minutes / day · 2,400 min
Transcription accuracy · 98%
Handle time cut · −54%
Minutes transcribed · 120M
Median time-to-speak · 0.6s
02 — Outcomes

Voice that did the work.

A ledger of named systems where the line that moved was a call answered, a note captured, an intent resolved — handle time, accuracy, CSAT, accessibility. 6 of 30 shown · ledger updates as systems scale.

Northwind Care
IVR replacement · Inbound support
Streaming ASR plus a Claude voice agent answers and resolves routine calls end-to-end — no phone-tree menus, no hold music
−54% · Average handle time
Cobalt Clinical
Meeting notes · Visit transcription
Diarized transcription of patient visits with speaker labels, then a structured summary drafted for the clinician to sign off
98% · Word accuracy
Vera Bank
Voice agent · Account servicing
Barge-in voice agent handles balance, transfers, and disputes with verification, escalating to a human only when grounding is weak
71% · Calls self-served
Lumen Live
Accessibility · Live captions
Low-latency live captioning for events and webinars, streaming partial transcripts under a second behind the speaker
0.6s · Caption latency
Drift Retail
Drive-thru · Order taking
Noise-robust ASR with a constrained menu grammar takes orders in the lane, confirms out loud, and posts straight to the POS
+38% · Orders per hour
Forge Field
Voice notes · Hands-free logging
Field techs dictate job logs hands-free; speech is transcribed, structured into the work-order fields, and synced offline-first
+62% · Reports completed
03 — The speech loop, live

It doesn't just hear.
It understands, then speaks.

Press play and watch the loop run: audio streams in as a waveform, the transcript appears word-by-word, the intent is detected, a response is composed, and a voice speaks it back. Toggle diarization to label speakers; switch latency mode to trade accuracy for speed.

[Interactive demo: live 16 kHz audio streams through three stages. ASR (Whisper / Deepgram) transcribes, Understand detects the intent (Claude), and Speak voices the response (ElevenLabs). Readouts: words, WER, intent, latency.]
Streaming ASR emits partial words as the caller speaks, so understanding starts before they finish — that's what keeps a voice agent feeling like a conversation, not a form.
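A minimal sketch of that idea, assuming the ASR engine emits events that carry the text so far plus a final flag. The `AsrEvent` shape, the `RollingTranscript` helper, and the keyword hints are hypothetical illustrations, not the API of Whisper, Deepgram, or any other engine.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class AsrEvent:
    text: str       # transcript of the current segment so far
    is_final: bool  # True once the engine stops revising this segment


@dataclass
class RollingTranscript:
    committed: list = field(default_factory=list)  # finalized segments
    pending: str = ""                              # latest partial, may still change

    def apply(self, event: AsrEvent) -> str:
        """Fold one event in and return the best transcript so far."""
        if event.is_final:
            self.committed.append(event.text)
            self.pending = ""
        else:
            self.pending = event.text
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)


# Hypothetical keyword hints; a real system would use a trained classifier.
INTENT_HINTS = {
    "reschedule": ("appointment", "reschedule"),
    "balance": ("balance",),
}


def guess_intent(text: str) -> Optional[str]:
    lowered = text.lower()
    for intent, hints in INTENT_HINTS.items():
        if any(hint in lowered for hint in hints):
            return intent
    return None


def first_guess(events):
    """Return (event_index, intent) for the earliest event that yields a guess."""
    transcript = RollingTranscript()
    for index, event in enumerate(events):
        intent = guess_intent(transcript.apply(event))
        if intent is not None:
            return index, intent
    return None


EVENTS = [
    AsrEvent("Move my", False),
    AsrEvent("Move my appointment", False),
    AsrEvent("Move my appointment to next Tuesday.", True),
]
```

With these sample events, `first_guess(EVENTS)` returns a guess at the second partial, one event before the final transcript arrives. An early guess is only a head start: the final transcript should still confirm it before any action is taken.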
04 — Anatomy of the pipeline

Built like a pipeline,
not a phone tree.

Quality hides in the stages between the microphone and the speaker — how you capture, where you transcribe, who decides the intent, how fast you speak back. This is the room we work in: each stage measured, each tool chosen for a reason.

Voice pipeline · Northwind Care Agent
WER 2.0% · Intents 94% · p95 0.6s
Stage | Tool | What it does | Signal
Capture | WebRTC · Twilio Media | Stream low-latency audio from browser or phone over the wire | 16 kHz
VAD | Silero VAD · WebRTC VAD | Detect speech vs. silence so we transcribe only real utterances | gated
Transcribe | Whisper · Deepgram | Streaming ASR emits partial then final words with timestamps | 2.0% WER
Diarize | pyannote · Deepgram | Attribute each segment to speaker A or B for multi-party audio | A / B
Understand | Claude Opus 4.8 · 1M ctx | Classify intent, fill slots, decide the next action with tool use | 94%
Respond | Claude · structured outputs | Compose a grounded reply or call a tool, abstain when unsure | grounded
Speak | ElevenLabs · streaming TTS | Synthesize natural speech, streamed so the reply starts instantly | 0.6s
Evaluate | WER harness · LLM-as-judge | Score accuracy, intent resolution & latency every release | 98%
Status key: green = measured & in target · live = the stage running in the demo above · amber = watch, above latency budget
05 — Ship to production

Scope the call.

Before a single byte of audio is captured, we map what callers actually say, what counts as a resolved intent, and where the latency budget bites. Then we build an evaluation set — the utterances we'll grade every release against.

/ Week 00 · Scope & eval set
Utterances · 120 real calls pulled from recordings, IVR logs, and interviews
Intents · The actions callers actually want, mapped to tools and slots
Eval set · Golden audio with reference transcripts & expected intents
Target · WER ≤ 5%, p95 < 800 ms, abstain when confidence is low
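The two numeric targets can be written down as a release check. This is a sketch under stated assumptions: the function names are hypothetical, and p95 is computed with the nearest-rank method, which is one common definition among several.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_targets(wer, latencies_ms, max_wer=0.05, max_p95_ms=800.0):
    """True when WER is at most 5% and p95 latency is under 800 ms."""
    return wer <= max_wer and percentile(latencies_ms, 95) < max_p95_ms
```

For example, twenty turns with nineteen at 500 ms and one at 900 ms still pass, because the single slow turn sits above the 95th percentile. Two slow turns out of twenty push p95 to 900 ms and fail the check.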

Capture & transcribe.

Stream real audio — browser, phone, or field device — gate it with voice-activity detection, then transcribe with streaming ASR so partial words land before the caller finishes. Noisy audio transcribes to noise; this is where most voice systems quietly fail.

/ Week 01 · Capture & ASR
Capture · WebRTC · Twilio Media Streams · 16 kHz PCM
VAD · Silero · WebRTC VAD · endpointing on silence
ASR · Whisper · Deepgram · streaming partials
Noise · RNNoise · gain control · echo cancellation
Buffer · Redis stream · backpressure on slow downstream
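Endpointing on silence can be illustrated with a deliberately simple energy gate. This is a stand-in for the idea only: Silero VAD and WebRTC VAD use far better speech detectors than an RMS threshold, and the threshold and hangover values below are assumptions chosen for the example.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))


def segment_utterances(frames, threshold=500.0, hangover_frames=3):
    """Yield one list of frames per utterance.

    An utterance starts at the first frame at or above the threshold and ends
    after `hangover_frames` consecutive quiet frames. Shorter pauses stay inside
    the utterance; the quiet frames that trigger the endpoint are dropped.
    """
    current, pause = [], []
    for frame in frames:
        if rms(frame) >= threshold:
            current.extend(pause)  # a short pause belongs to the utterance
            pause = []
            current.append(frame)
        elif current:
            pause.append(frame)
            if len(pause) >= hangover_frames:
                yield current
                current, pause = [], []
    if current:
        yield current  # the stream ended mid-utterance
```

The hangover is the tuning knob: too short and a caller's breath splits one request in two, too long and the agent waits awkwardly before answering.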

Diarize & understand.

Attribute each segment to a speaker, then hand the transcript to Claude to classify intent, fill slots, and decide the next action. Hybrid from day one — fast classifiers for the common asks, Claude for the fuzzy ones.

/ Week 02 · Diarize & understand
Diarization · pyannote · speaker A / B labels
Intent · Claude · structured outputs, strict tools
Slots · dates, accounts, amounts · validated
Grounding · cite from account data, not memory
Confidence gate tuned against the eval set
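A sketch of the hybrid routing and confidence gate described above. Everything here is illustrative: the phrase rules, the thresholds, and the `llm_classify` callable are hypothetical, and the callable stands in for a real model call that returns an intent and a confidence.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Decision:
    intent: str
    confidence: float
    source: str  # "fast", "llm", or "handoff"


# Hypothetical phrase rules for the common asks.
FAST_RULES = {
    "reschedule": ("move my appointment", "reschedule"),
    "balance": ("balance",),
}


def fast_classify(text: str) -> Tuple[str, float]:
    lowered = text.lower()
    for intent, phrases in FAST_RULES.items():
        if any(phrase in lowered for phrase in phrases):
            return intent, 0.95
    return "unknown", 0.0


def route(
    text: str,
    llm_classify: Callable[[str], Tuple[str, float]],
    accept_at: float = 0.9,      # assumed threshold, tuned against the eval set
    abstain_below: float = 0.6,  # assumed threshold, tuned against the eval set
) -> Decision:
    intent, confidence = fast_classify(text)
    if confidence >= accept_at:
        return Decision(intent, confidence, "fast")
    # Fuzzy ask: fall back to the slower, smarter classifier.
    intent, confidence = llm_classify(text)
    if confidence < abstain_below:
        return Decision("human_handoff", confidence, "handoff")
    return Decision(intent, confidence, "llm")
```

Passing the model call in as a function keeps the gate testable offline: the eval set can replay recorded model outputs while the two thresholds are tuned.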

Respond & speak.

Compose the reply — answer, confirm, or hand to a human — then stream it through text-to-speech so the voice starts before the sentence is finished. Barge-in lets the caller interrupt; an honest "let me get a person" beats a confident wrong answer.

/ Week 03 · Respond & speak
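Barge-in comes down to racing playback against a "caller spoke" signal and cancelling whichever loses. The sketch below uses Python's standard `asyncio`; the chunk list, the 10 ms sleep standing in for sending audio, and the function names are assumptions for illustration rather than any TTS vendor's API.

```python
import asyncio


async def speak(chunks, played):
    """Play chunks one at a time, recording each one that went out."""
    for chunk in chunks:
        await asyncio.sleep(0.01)  # stand-in for streaming one audio chunk
        played.append(chunk)


async def speak_with_barge_in(chunks, caller_spoke: asyncio.Event):
    """Return (played_chunks, interrupted). Playback stops once the event is set."""
    played = []
    playback = asyncio.create_task(speak(chunks, played))
    interrupt = asyncio.create_task(caller_spoke.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = interrupt in done and playback not in done
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()
    # Collect cancellations so no task is left pending.
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played, interrupted


async def run_demo(interrupt_after=None):
    """Speak ten chunks; optionally simulate the caller talking after a delay."""
    caller_spoke = asyncio.Event()
    if interrupt_after is not None:
        asyncio.get_running_loop().call_later(interrupt_after, caller_spoke.set)
    chunks = [f"chunk-{i}" for i in range(10)]
    return await speak_with_barge_in(chunks, caller_spoke)
```

Running `asyncio.run(run_demo(0.025))` stops playback part-way through, while `asyncio.run(run_demo())` plays all ten chunks. In a live system the event would be set by the VAD stage, and the list of chunks already played tells the agent how much of its reply the caller actually heard.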

Evaluate honestly.

Run the frozen eval set on every change and score it: word error rate, intent resolution, p95 latency, abstain rate. No "sounded fine in the demo." A number that moves, or the change doesn't ship.

/ Week 04 · Evaluate
Word error · 2.0% · accurate transcription on real audio
Intent resolved · 94% · the right action chosen and filled
p95 latency · 0.6s · speech-to-speech round trip
Abstain rate · 5% · handed to a human rather than guessed
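Word error rate is the standard metric behind the first line: substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. Below is a minimal sketch using word-level edit distance. The lowercasing and whitespace split are simplifying assumptions; a production harness usually also normalizes punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word missing
                    curr[j - 1] + 1,     # insertion: extra hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, transcribing "move my appointment to next tuesday" as "move my appointment to next thursday" is one substitution in six words, a WER of about 16.7%. Because insertions count as errors, WER can exceed 100% on very poor output.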

Ship & monitor.

Live with caching for cost, guardrails for safety, and logging on every turn. We watch WER and latency as call volume grows, catch drift before callers do, and keep the system listening as accents, noise, and asks change underneath it.

/ Ongoing · Ship & monitor
Prompt caching · system
Turn logging
Latency alerts
PII redaction
Barge-in by default
Abstain on low confidence
Weekly WER run
Human handoff path
06 — Why it compounds

An evaluated voice system improves.

Every eval run feeds the next: missed words sharpen the audio pipeline, failed intents tighten the prompts, slow turns justify the streaming work. Ship-and-forget voice drifts as accents, noise, and new asks pile up. Evaluated and tended, accuracy compounds.

[Chart: transcription accuracy on a frozen evaluation set, representative of a typical 12-month engagement.]
Eval-driven by Stack Innovations: accuracy climbs as the audio pipeline and prompts tighten.
Ship-and-forget: plateaus, then drifts as accents, noise, and edge cases pile up.
07 — Tools · honest kit

The kit, shown.

The models, services, and tools we actually wire together to capture, transcribe, diarize, understand, and speak. No mystery framework — just the kit that keeps voice fast and accurate.

ASR · Whisper
ASR · Deepgram
Understanding · Claude Opus 4.8
TTS · ElevenLabs
Transport · WebRTC
Telephony · Twilio
Endpointing · Silero VAD
Diarization · pyannote
Cache · Redis
Serving · FastAPI
Runtime · Python
Evaluation · WER harness
Start the build

Stop the phone tree.
Start talking. Out loud.

A free voice audit to start — bring a stack of your real call recordings and a list of real asks, and we'll show you what a streaming, speaking system would transcribe and resolve, and where today's IVR falls down. A prototype, not a pitch.

Get a voice audit