Sorry I Didn't Catch That: Why Latency Makes People Hang Up on AI Call Centers
Voxworks Team
You're talking to an automated system. You answer a question, and there's an uncomfortable pause.
Then: "Sorry, I didn't catch that. Could you repeat that?"
You repeat yourself. Another pause. Same response. You hang up. The experience is frustrating, impersonal, and futile, and it's the single biggest reason AI call center systems fail. It almost always comes down to one factor: latency.
What Latency Actually Is
In voice AI, latency is the delay between when you finish speaking and when the AI responds. It encompasses everything in between:
Your voice travels from phone to AI system
System detects you've stopped talking
Speech converts to text using a Speech-To-Text (STT) AI model
AI processes text using a prompt and generates response using a Large Language Model (LLM)
Response converts back to speech using a Text-To-Speech (TTS) model
Speech travels back to your phone
Each step takes time. Add them up. That's latency.
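To make the arithmetic concrete, here's a minimal latency-budget sketch in Python. Every per-stage figure is an assumed, illustrative number, not a measurement of any real system:

```python
# A minimal latency-budget sketch. Each stage's delay (in milliseconds)
# is an illustrative assumption, not a benchmark of any real platform.
pipeline_ms = {
    "network (caller -> system)": 40,
    "voice activity detection": 120,
    "speech-to-text (STT)": 250,
    "LLM processing": 600,
    "text-to-speech (TTS)": 180,
    "network (system -> caller)": 40,
}

total = sum(pipeline_ms.values())
for stage, ms in pipeline_ms.items():
    print(f"{stage:>30}: {ms:4d} ms")
print(f"{'total latency':>30}: {total:4d} ms")  # 1230 ms in this example
```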
How Latency Destroys Conversations
Human conversation operates on precise timing. Research shows:
Normal turn-taking: 200-300ms
Perceived natural pause: Under 500ms
Noticeable delay: 500-800ms
Uncomfortable delay: 800-1200ms
Conversation breakdown: Over 1200ms
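Those thresholds can be read as a simple lookup; this toy function just mirrors the figures above:

```python
def perceived_delay(latency_ms: float) -> str:
    """Map a response delay to the perception bands listed above."""
    if latency_ms < 500:
        return "natural pause"
    if latency_ms < 800:
        return "noticeable delay"
    if latency_ms < 1200:
        return "uncomfortable delay"
    return "conversation breakdown"

print(perceived_delay(650))   # noticeable delay
print(perceived_delay(1400))  # conversation breakdown
```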
To be clear, we're not obsessing over latency because we want a system that merely sounds like a person, or because we're single-minded about shaving milliseconds off a meaningless technical metric. We obsess over latency because once an AI call center's latency exceeds 800ms, the system's effectiveness at doing its job starts to break down.
User Perception Drives Conversation Quality
When you substitute a machine for a human operator, the system needs to maintain the same or higher standard of service in its allocated role. IVR call menus have existed for decades, yet you wouldn't replace your receptionist with one, because there's a fundamental difference in the quality of communication between a human conversation and a machine one.
AI fundamentally can meet the standards of a human operator; however, a poorly optimised voice system will still lead customers to respond to the conversation in a robotic way. Only once suspension of disbelief is achieved in the caller's mind can the conversational quality return to a human standard.
Speakers Start Overlapping
In normal conversation, we anticipate when someone will finish and begin formulating our response. Long AI delays throw off this timing.
The human starts speaking again, thinking the AI didn't hear them. But the AI was about to respond. Now both are talking simultaneously. Confusion ensues.
This experience is clunky and frustrating for the caller, completely undermining the 'X' in 'CX'.
Speech Recognition Degrades
Modern speech recognition works best with clean, complete utterances. When humans fill silence with "um," repeat themselves, or add clarifications because they think they weren't heard, recognition accuracy drops.
The very behaviour latency causes (repetition) makes the latency problem worse.
Frustration Escalates
Each delayed response builds frustration. By the third or fourth long pause, callers are primed to hang up. Their threshold for any additional friction approaches zero.
Trust Evaporates
Conversations require trust that you're being heard and understood. Latency destroys this trust. Each pause signals "maybe this system doesn't understand me."
Once trust is gone, callers won't engage meaningfully even if subsequent responses are perfect. And sometimes responses aren't perfect (see What Is LLM Hallucination?).
Conversely, if the system feels sharp and responsive, customers will start to lean on that snappiness to get the answers they want. They'll volunteer information they otherwise wouldn't bother with, and if they know it's a system, they'll push its limits to extract maximum value from the service you're offering, leading to better CX.
Where Latency Comes From
Understanding the sources helps explain why it's hard to fix:
Network Latency (50-300ms)
Voice data must travel from the caller to the AI system and back. Distance matters: a round trip to offshore servers adds far more delay than a hop to a local Australian data centre (see the comparison below).
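You can get a feel for this yourself by timing TCP handshakes to nearby and offshore endpoints; a connect costs roughly one round trip. The hostnames below are placeholders, not real services:

```python
import socket
import time

def tcp_connect_rtt_ms(host: str, port: int = 443, samples: int = 5) -> float:
    """Estimate network round-trip time by timing TCP handshakes.

    A TCP connect costs roughly one round trip, so the best of a few
    samples approximates pure network latency to that endpoint.
    """
    best = float("inf")
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

# Substitute the endpoints of the platforms you're comparing; these
# hostnames are illustrative placeholders.
for host in ("voice-api.sydney.example", "voice-api.us-east.example"):
    print(f"{host}: ~{tcp_connect_rtt_ms(host):.0f} ms")
```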
Voice Activity Detection (30-250ms)
The system must detect when you've stopped talking. In the real world this is much trickier than it sounds:
Pauses within sentences (breathing, thinking)
Background noise that might be speech
Trailing sounds at sentence ends
Cross-talk with other people
Conservative VAD waits longer to be sure, adding latency. Aggressive VAD cuts you off mid-sentence, creating different problems (see What Is VAD (Voice Activity Detection)?).
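To make the trade-off concrete, here's a toy energy-based end-of-speech detector. The `hang_ms` parameter is the knob in question: raise it and the system waits longer before declaring the turn over (conservative, more latency); lower it and it risks cutting callers off mid-pause (aggressive). All names and thresholds are illustrative:

```python
import numpy as np

def end_of_speech(frames: np.ndarray, frame_ms: int = 20,
                  energy_threshold: float = 0.01, hang_ms: int = 400) -> bool:
    """Toy end-of-speech detector over a window of audio frames.

    Declares the turn finished only after `hang_ms` of consecutive
    low-energy frames. A large hang time is conservative (adds latency);
    a small one is aggressive (risks cutting the caller off).
    """
    hang_frames = hang_ms // frame_ms
    if len(frames) < hang_frames:
        return False
    recent = frames[-hang_frames:]
    energies = (recent ** 2).mean(axis=1)   # mean energy per frame
    return bool((energies < energy_threshold).all())

rng = np.random.default_rng(0)
silence = rng.normal(0, 0.01, size=(30, 320))   # ~600ms of near-silence
print(end_of_speech(silence))                   # True: the turn is over
```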
Speech-to-Text (100-500ms)
Converting audio to text requires:
Audio buffering and processing
Acoustic model inference
Language model application
Punctuation and formatting
Real-time streaming is faster but less accurate. Batch processing is more accurate but slower. Some systems settle somewhere in between. We employ a hybrid approach that takes the best of both worlds.
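One possible shape of such a hybrid (purely illustrative, not our production pipeline): act on fast partial transcripts immediately, then reconcile with a slower, more accurate final pass once the utterance is complete.

```python
from dataclasses import dataclass

@dataclass
class SttEvent:
    text: str
    is_final: bool

def stt_stream(events):
    """Stand-in for a streaming STT API yielding partial then final results."""
    yield from events

# Simulated recognition of "what time do you open"
events = [
    SttEvent("what time", False),
    SttEvent("what time do you", False),
    SttEvent("what time do you open", True),   # slower, corrected final pass
]

for event in stt_stream(events):
    if event.is_final:
        print(f"final:   {event.text}")   # reconcile with the accurate pass
    else:
        print(f"partial: {event.text}")   # start downstream work early
```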
LLM Processing (200-2000ms)
Often the biggest latency source. The AI must:
Parse and understand input
Consider conversation context
Generate appropriate response
Format it for speech
Larger, more capable models take longer; simpler models are faster but less helpful. The quality-speed tradeoff is fundamental.
Furthermore, output latency grows roughly linearly with the amount of text fed into the model, so knowledge bases and context windows must be carefully managed.
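A back-of-envelope model shows why. The per-token cost below is an assumed figure for illustration, not a benchmark of any particular model:

```python
# Back-of-envelope prompt-processing model. The per-token cost is an
# assumed illustrative figure, not a measurement of any real model.
MS_PER_INPUT_TOKEN = 0.15

for context_tokens in (500, 2_000, 8_000):
    prefill_ms = context_tokens * MS_PER_INPUT_TOKEN
    print(f"{context_tokens:>5} tokens of context -> "
          f"~{prefill_ms:.0f} ms before the first output token")
# Dumping an 8,000-token knowledge base costs 16x a focused 500-token context.
```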
Text-to-Speech (100-300ms)
Converting text to natural speech requires:
Text analysis and normalisation
Prosody prediction (how to say it)
Neural synthesis
Audio encoding
Higher-quality voices take longer and need specific fine-tuning to extract the best performance. Robotic voices from cheaper open-source models are faster, but they're off-putting to the user and not viable in real-world settings.
Total: 480-3,350ms
Add up the ranges and you see the challenge. Even well-optimised systems struggle to achieve sub-500ms latency. Poorly optimised systems can exceed 3 seconds.
Why Australian Businesses Face Bigger Challenges
Minimum viable latency (US platform):
Network: 300ms (round trip)
VAD: 50ms
STT: 200ms
LLM: 400ms
TTS: 150ms
Total: 1,100ms minimum
Minimum viable latency (Australian platform):
Network: 20ms (round trip)
VAD: 50ms
STT: 200ms
LLM: 400ms
TTS: 150ms
Total: 820ms minimum
That 280ms difference doesn't sound like much, but it's the difference between "slightly slow" and "noticeably awkward". It pushes conversations from tolerable to frustrating, and it can make a system unworkable in the real world.
We believe this is the main reason Australian businesses haven't yet tapped the productivity gains from automated voice systems that we're seeing offshore. The tech fundamentally didn't work in Australia until now.
How We Reduce Latency
Now that you understand the scale and complexity of this challenge, you can see why Voxworks was created.
Our team has worked extremely hard to fix this one problem and provide the ultimate solution for Australian businesses wanting to deploy this game-changing technology.
At Voxworks, we've implemented multiple approaches:
1. Local Infrastructure
All processing happens on Australian servers. Network latency stays minimal regardless of where callers are located within Australia.
2. Streaming Everything
Instead of waiting for complete inputs and outputs:
Speech-to-text starts before you finish talking
LLM generates responses token by token
Text-to-speech begins before full response is complete
This "streaming" approach overlaps processing stages rather than running them sequentially.
3. Optimised Voice Activity Detection
Our VAD uses:
Machine learning models trained on Australian speech patterns
Contextual awareness (expecting short vs. long responses)
Adaptive thresholds based on connection quality
4. Response Anticipation
For common or expected queries, the system pre-computes likely responses while the caller is still speaking. If the actual query matches a prediction, the response is nearly instant, as sketched below.
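This sketch shows the general shape of the technique, not our actual implementation; `predict_completions` and `generate_response` are hypothetical stand-ins for a fast prediction model and the full response pipeline:

```python
# Sketch of response anticipation: while the caller is still speaking,
# pre-generate answers for the most likely completions of their query.
# `predict_completions` and `generate_response` are hypothetical stand-ins.

def predict_completions(partial: str) -> list[str]:
    # In practice: a fast model or lookup over common query patterns.
    table = {"what time": ["what time do you open", "what time do you close"]}
    return table.get(partial, [])

def generate_response(query: str) -> str:
    return f"answer to '{query}'"

precomputed: dict[str, str] = {}

def on_partial_transcript(partial: str) -> None:
    for query in predict_completions(partial):
        precomputed.setdefault(query, generate_response(query))

def on_final_transcript(query: str) -> str:
    # Near-instant if a prediction matched; otherwise generate now.
    return precomputed.pop(query, None) or generate_response(query)

on_partial_transcript("what time")
print(on_final_transcript("what time do you open"))
```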
5. Fallback Responses
When processing takes longer than expected, the system uses natural filler phrases:
"Let me check that for you..."
"One moment..."
"Just looking that up..."
This acknowledges the caller while buying processing time, far better than raw silence.
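One way to wire this up (an illustrative sketch; `speak` and `generate_answer` are stand-ins) is to race the real answer against a timeout and speak a filler only when the deadline passes:

```python
import asyncio
import random

FILLERS = ["Let me check that for you...", "One moment...",
           "Just looking that up..."]

async def speak(text: str) -> None:
    print(f"AI says: {text}")            # stand-in for TTS playback

async def generate_answer(query: str) -> str:
    await asyncio.sleep(random.uniform(0.2, 1.5))   # variable LLM latency
    return f"Here's what I found about {query}."

async def respond(query: str, filler_after: float = 0.6) -> None:
    task = asyncio.create_task(generate_answer(query))
    try:
        # shield() keeps the generation running if the wait times out
        answer = await asyncio.wait_for(asyncio.shield(task), filler_after)
    except asyncio.TimeoutError:
        await speak(random.choice(FILLERS))          # buy time, then finish
        answer = await task
    await speak(answer)

asyncio.run(respond("opening hours"))
```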
Measuring Latency
If you're evaluating AI call center platforms, test rigorously:
Time to First Response: Measure from when you stop speaking to when AI audio begins.
Consistency: Average latency matters less than worst-case. A system averaging 600ms but occasionally hitting 2 seconds will frustrate callers.
Response Appropriateness: Fast garbage is worse than slow quality.
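A simple way to run this test (the latency figures below are fabricated for illustration; in practice they'd come from timestamped call recordings) is to log time-to-first-audio across many calls and report the tail, not just the average:

```python
import statistics

# Illustrative time-to-first-audio samples (ms) across test calls.
latencies_ms = [540, 610, 580, 2100, 620, 590, 650, 560, 1900, 600]

avg = statistics.mean(latencies_ms)
p95 = sorted(latencies_ms)[int(0.95 * len(latencies_ms)) - 1]
worst = max(latencies_ms)

print(f"average: {avg:.0f} ms")   # looks fine...
print(f"p95:     {p95} ms")       # ...but the tail is what callers feel
print(f"worst:   {worst} ms")
```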
For most AI call center use cases, balanced latency (500-800ms) works well.
The Bottom Line
Latency is the silent killer of AI voice experiences. Callers don't say "the latency was too high." They just hang up frustrated, wondering why they bothered.
Fixing latency requires expertise across networking, speech processing, machine learning, and telecommunications. It requires Australia-specific infrastructure investment and ongoing optimisation. There are no shortcuts.
For Australian businesses, the choice of AI call center platform is fundamentally a choice about latency. Before we even start talking about compliance, US-based platforms carry unavoidable network delay. Only Australian-hosted solutions can deliver the sub-second response times that conversations require.
When evaluating voice AI platforms, pay close attention to latency and, most importantly, test it yourself. Compare it to our conversational demo on the Voxworks website.
The best AI in the world is worthless if people hang up before it can help them.
Experience low-latency AI voice built for Australia. Start your free trial at voxworks.ai.