ChatGPT made text-based AI seem almost trivially easy. Type a question, get an answer. So why is it still so hard for an AI to hold a natural voice conversation over the phone?
Building an AI call center that actually works requires solving problems across acoustics, linguistics, machine learning, distributed systems and telecommunications simultaneously.
The Fundamental Difference: Real-Time vs. Turn-Based
Text-based AI operates in a comfortable turn-based paradigm. You type. You wait. The AI responds. A few seconds of latency is perfectly acceptable, even expected.
Voice is fundamentally different.
Human conversation happens in real time. Overlapping speech. Subtle timing cues. An expectation of immediate response. When you ask someone a question, you expect them to start responding within 300-500 milliseconds.
Any longer feels awkward. Much longer, and people assume the connection dropped.
This real-time requirement turns every optimisation in the AI pipeline from a "nice to have" into an absolute necessity.
The Latency Stack: Death by a Thousand Milliseconds
A voice AI system involves multiple sequential processes. Each adds latency.
Audio Capture and Transmission (50-150ms): Sound travels from the caller's mouth to their phone's microphone, gets encoded, and is transmitted across the network. In Australia, where distances are vast and infrastructure varies, this alone introduces significant delays.
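To see where those milliseconds come from, here is a rough back-of-envelope sum in Python. The frame size, jitter buffer, and one-way network delay are illustrative assumptions, not measurements of any particular network.

```python
# Rough budget for the capture-and-transmission leg.
# All three figures below are illustrative assumptions.

FRAME_MS = 20            # typical VoIP codecs packetise audio in 20 ms frames
JITTER_BUFFER_MS = 60    # receive-side buffering to smooth uneven packet arrival
ONE_WAY_NETWORK_MS = 40  # path delay; varies widely with distance and carrier

capture_and_transmit = FRAME_MS + JITTER_BUFFER_MS + ONE_WAY_NETWORK_MS
print(f"Capture + transmission ~ {capture_and_transmit} ms")  # ~120 ms
```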
Voice Activity Detection (30-100ms): The system must determine when someone has finished speaking. Too quick, and you cut people off mid-sentence. Too slow, and the conversation feels sluggish.
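Here is a minimal sketch of that trade-off, assuming 16-bit PCM audio arriving in 20ms frames. `SILENCE_RMS` and `END_OF_TURN_MS` are placeholder values to tune, not recommendations, and real systems typically use trained VAD models rather than a raw energy threshold.

```python
import struct

FRAME_MS = 20         # duration of each frame fed to the detector
SILENCE_RMS = 500     # energy threshold; tune per microphone and codec (assumption)
END_OF_TURN_MS = 400  # silence required before we decide the caller has finished

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian PCM frame."""
    n = len(frame) // 2
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return (sum(s * s for s in samples) / max(n, 1)) ** 0.5

def end_of_turn(recent_frames: list[bytes]) -> bool:
    """True once END_OF_TURN_MS of consecutive low-energy frames have arrived."""
    needed = END_OF_TURN_MS // FRAME_MS
    tail = recent_frames[-needed:]
    return len(tail) == needed and all(rms(f) < SILENCE_RMS for f in tail)
```

Shrink `END_OF_TURN_MS` and you interrupt callers mid-thought; grow it and every reply starts with an awkward pause.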
Speech-to-Text Transcription (100-500ms): Converting audio to text requires acoustic models that process sound waves and language models that interpret probable words. Real-time transcription must balance accuracy against speed.
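One common way to claw back time here is to act on streaming partial results instead of waiting for the final transcript. The sketch below is engine-agnostic: `partial_hypotheses` stands in for whatever streaming recogniser you use, and the stability heuristic is an assumption, not a rule.

```python
from typing import Iterable

def first_stable_transcript(partial_hypotheses: Iterable[str],
                            stable_repeats: int = 3) -> str:
    """Return the first hypothesis that stays unchanged for `stable_repeats`
    consecutive partial results, rather than waiting for the final transcript.
    Faster to respond, but slightly more likely to act on a misrecognition."""
    last, streak = "", 0
    for hypothesis in partial_hypotheses:
        if hypothesis == last:
            streak += 1
            if streak >= stable_repeats:
                return hypothesis
        else:
            last, streak = hypothesis, 1
    return last
```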
Large Language Model Processing (200-2000ms): The AI "brain" understands context, formulates a response, and generates appropriate text. Modern LLMs are powerful but computationally expensive. Every additional feature (personality, context memory, tool use) adds processing time.
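Because most LLM APIs stream tokens, the usual mitigation is to start speaking as soon as the first complete sentence arrives rather than waiting for the full reply. A minimal sketch, where `token_stream` and `speak` are hypothetical stand-ins for the LLM client and the hand-off to text-to-speech:

```python
import re
from typing import Callable, Iterable

def stream_reply_to_tts(token_stream: Iterable[str],
                        speak: Callable[[str], None]) -> None:
    """Forward each complete sentence to TTS while the model is still generating,
    so the caller hears something long before the full reply exists."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush on sentence boundaries; keep the unfinished remainder.
        while (match := re.search(r"[.!?]\s", buffer)):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            speak(sentence.strip())
    if buffer.strip():
        speak(buffer.strip())
```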
Text-to-Speech Synthesis (100-300ms): Converting text back to natural-sounding speech requires neural networks that model prosody, emphasis, and intonation. Robotic text-to-speech is fast but off-putting. Natural speech takes more processing.
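The same streaming idea applies on the synthesis side: play audio chunks as they are produced instead of waiting for the whole utterance. In the sketch below, `synthesise_chunks` and `play` are hypothetical stand-ins; the point is simply that perceived delay is set by the first chunk, not the last.

```python
from typing import Callable, Iterable

def speak_incrementally(text: str,
                        synthesise_chunks: Callable[[str], Iterable[bytes]],
                        play: Callable[[bytes], None]) -> None:
    """Start playback as soon as the first audio chunk is synthesised;
    the caller's perceived wait is the time to that first chunk."""
    for chunk in synthesise_chunks(text):
        play(chunk)
```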
Audio Transmission Back (50-150ms): The synthesised audio travels back across the network to the caller's ear.
Add these together: anywhere from 500ms to over 3 seconds of latency.
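Summing the per-stage figures quoted above makes that range concrete (these are the article's illustrative numbers, not benchmarks):

```python
# Best-case and worst-case latency for each stage, in milliseconds.
stages_ms = {
    "capture_and_transmission": (50, 150),
    "voice_activity_detection": (30, 100),
    "speech_to_text":           (100, 500),
    "llm_processing":           (200, 2000),
    "text_to_speech":           (100, 300),
    "return_transmission":      (50, 150),
}

best = sum(low for low, _ in stages_ms.values())
worst = sum(high for _, high in stages_ms.values())
print(f"End-to-end: {best} ms best case, {worst} ms worst case")
# End-to-end: 530 ms best case, 3200 ms worst case
```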
