As 2025 draws to a close, it’s worth reflecting on what was hyped as “the year of AI agents”. Instead, 2025 might be more accurately characterised as the year many teams ran up against the limitations of AI agents operating in the real world.
Companies spent millions discovering that getting agents working reliably in production takes far more effort than they initially expected, or budgeted for. Then in July, researchers at MIT went viral for reporting that only ~5% of custom enterprise GenAI tools make it through to successful implementation.
There were obvious exceptions to this rule, none more evident than Anthropic’s Claude Opus 4.5 model and its use in their agentic coding tool Claude Code, which is redefining the job description of the software developer. Claude has proven that AI agents can be supremely useful for long-chain, complex tasks that produce work of meaningful value. At Voxworks we use Claude to build out product features at a pace that was unimaginable just 12 months ago. It’s a glimpse of where the rest of the world is headed.
Meanwhile, the obvious question is why AI agents haven’t had a much broader impact on other jobs and industries. Many are wondering what it will take to get a “Claude for [insert my job description]”.
The Reliability Cliff
The fundamental problem with AI agents is that the underlying models are stochastic rather than deterministic, combined with the way agents chain together many responses to complete a complex task.
Hallucinations and model errors are largely acceptable in a parallel process, such as a maths test where each question stands alone: a 96% score is considered a pass, and is in some cases better than most humans could manage. But in a sequence, where each answer depends on the prior answer being correct, even a small error rate results in a massive deterioration in reliability.
The reason can be stated in simple mathematical terms: the probability that the entire process succeeds is the success rate of a single step raised to the power of the number of sequential steps. Continuing the example above, if the LLM hallucinates only 4% of the time, then after 10 steps the probability of failure for the entire sequence is more than one in three. Reduce the error rate to 1%, and that same probability drops to roughly one in ten.
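To make the arithmetic concrete, here is a minimal sketch (using the illustrative figures from the paragraph above, not measurements from any real system) of how per-step reliability compounds across a sequence:

```python
def chain_success(per_step: float, steps: int) -> float:
    """End-to-end success probability of a chain of dependent steps."""
    return per_step ** steps

for per_step in (0.96, 0.99):
    success = chain_success(per_step, steps=10)
    print(f"{per_step:.0%} per step over 10 steps: "
          f"{success:.1%} success, {1 - success:.1%} failure")

# 96% per step over 10 steps: 66.5% success, 33.5% failure
# 99% per step over 10 steps: 90.4% success, 9.6% failure
```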
We see this problem as particularly acute in voice AI. Voxworks is building an AI calling platform that can, in theory, automate the majority of any business’s low-complexity or routine call volume. The challenge is that each conversational turn depends on the turns before it: if the AI agent makes a mistake on any one turn, the entire conversation is often compromised.
On the flipside, even minor improvements in the success rate of each individual step compound across the sequence into far better overall performance, as the sketch below illustrates.
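A quick sketch of that compounding effect, again with illustrative per-step rates rather than measured ones: the longer the chain, the more a small per-step improvement is worth.

```python
# Compare a 96% and a 99% per-step agent as the task gets longer.
for steps in (10, 25, 50):
    baseline = 0.96 ** steps
    improved = 0.99 ** steps
    print(f"{steps} steps: {baseline:.0%} vs {improved:.0%} end-to-end success "
          f"({improved / baseline:.1f}x better)")

# 10 steps: 66% vs 90% end-to-end success (1.4x better)
# 25 steps: 36% vs 78% end-to-end success (2.2x better)
# 50 steps: 13% vs 61% end-to-end success (4.7x better)
```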
So then we should just use smarter models, right? Well, the AI industry as a whole has been narrowly focused on improving intelligence as measured by “evals”, that is, on improving the response quality of a single step in a maths-test-style environment, by scaling neural nets to unimaginable size. Then came “thinking models” such as OpenAI’s o3, which spend extra inference-time compute reasoning through a problem before answering. Tapping those intelligence gains is impossible in voice AI, because we can’t run these models with low enough latency to sound natural in conversation.
