Why most AI agents fail in production
Building a demo agent is easy. Keeping one running reliably in a real business is not. Here's what actually breaks and how to design around it.
Everyone has built a demo agent. You string together a few tool calls, the LLM reasons through a task, and it works beautifully in your notebook. Then you try to run it in production — against real users, real data, real edge cases — and it falls apart within a week.
I've deployed agents into production across a dozen different businesses. The failures are almost never about the model. They're about the infrastructure around it.
The problems I see most
1. No observability
When an agent misbehaves, you need to know what it called, what it was given, what it decided, and why. Most teams log the final output and nothing else. That's like debugging a server with no access logs.
Every agent I ship has structured traces: each tool call, its inputs and outputs, token counts, latency, and the model's reasoning where available. Boring to instrument. Invaluable when something goes wrong at 2 AM.
2. Tools that aren't agent-safe
Tools designed for humans rely on context the human provides implicitly — they ask clarifying questions, they check before destructive operations, they know when something looks wrong. An LLM calling those same tools gets no such safety net.
Before an agent touches any tool, I ask: what's the worst thing it can do with this? If the answer is "delete a customer's data" or "send an email to 10,000 people", the tool needs guards — confirmation steps, dry-run modes, hard rate limits — that wouldn't exist in the human-facing version.
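Those guards can be a thin wrapper around the raw tool. A sketch, with a hypothetical `delete_customer_data` operation: dry-run is the default, destructive execution requires explicit confirmation, and a hard rate limit caps how much damage a runaway loop can do:

```python
import time


class GuardedTool:
    """Guards a destructive operation: dry-run by default,
    explicit confirmation to execute, and a hard rate limit."""

    def __init__(self, fn, max_calls_per_minute=5):
        self.fn = fn
        self.max_calls = max_calls_per_minute
        self.call_times = []

    def __call__(self, *args, confirm=False, dry_run=True, **kwargs):
        now = time.monotonic()
        # Keep only calls from the last 60 seconds, then enforce the cap.
        self.call_times = [t for t in self.call_times if now - t < 60]
        if len(self.call_times) >= self.max_calls:
            raise RuntimeError("rate limit exceeded: refusing call")
        self.call_times.append(now)

        if dry_run:
            # Preview what would happen without touching anything.
            return {"dry_run": True, "would_call": self.fn.__name__,
                    "args": args, "kwargs": kwargs}
        if not confirm:
            raise PermissionError("destructive call requires confirm=True")
        return self.fn(*args, **kwargs)


def delete_customer_data(customer_id):
    # Hypothetical destructive operation.
    return f"deleted {customer_id}"


safe_delete = GuardedTool(delete_customer_data)
```

The agent gets `safe_delete`, never the raw function, so the worst case is a refused call rather than lost data.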
3. Poorly scoped tasks
The broader the task, the more ways the agent can interpret it — and the more opportunities for it to go off-script. "Handle customer support tickets" is a recipe for chaos. "Categorise tickets into these five buckets, draft a reply from this template, flag anything outside the buckets for human review" is something you can actually evaluate.
Narrow the scope until you can write a concrete acceptance test. If you can't describe failure, you can't detect it.
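One way to make that scope concrete is an acceptance check over the agent's structured output. A sketch, assuming hypothetical field names (`bucket`, `draft_reply`, `escalate`, `reason`) for the ticket task above:

```python
# The five buckets from the task definition (names are illustrative).
BUCKETS = {"billing", "shipping", "returns", "account", "technical"}


def accept(result):
    """Acceptance test for the narrowly scoped ticket task: the agent must
    either pick one of the five buckets and draft a reply, or escalate
    with a stated reason. Anything else is a detectable failure."""
    if result.get("escalate") is True:
        return result.get("reason") is not None
    return result.get("bucket") in BUCKETS and bool(result.get("draft_reply"))
```

A well-formed output passes; an invented sixth bucket, a missing reply, or an unexplained escalation fails. That is a failure you can describe, so it is a failure you can detect.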
4. No evals
Prompts drift. Models update. Tool APIs change. Without a regression suite, you won't know your agent broke until a user tells you.
Evals don't have to be complex — even 20 representative inputs with expected outputs catch most regressions. The discipline of writing them forces you to be precise about what the agent is supposed to do, which is usually the most valuable part of the process.
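A regression suite at that scale is a short list of cases and a loop. A sketch, with hypothetical cases and a `classify` function standing in for whatever your agent does:

```python
EVAL_CASES = [
    # (input, expected output) — illustrative; a real suite would hold
    # 20+ representative inputs drawn from production traffic.
    ("I was charged twice this month", "billing"),
    ("My package never arrived", "shipping"),
    ("How do I reset my password?", "account"),
]


def run_evals(classify):
    """Run every case and collect failures; wire this into your deploy step
    and fail the build when the list is non-empty."""
    failures = []
    for text, expected in EVAL_CASES:
        got = classify(text)
        if got != expected:
            failures.append({"input": text, "expected": expected, "got": got})
    return failures
```

Run it on every deploy; the first time a prompt tweak or model update silently breaks a case, the suite tells you before a user does.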
5. Treating the agent as a black box
Teams that hand off agent development entirely — "build me an agent, I'll trust the output" — almost always end up with something brittle. The people closest to the workflow need to be involved in designing the tools, writing the evals, and reviewing early runs. The model is the easy part. The domain knowledge is not.
What good looks like
A production agent is a system, not a prompt. It has:
- Clear scope — a specific task with known inputs and measurable outputs
- Instrumented traces — every decision visible and queryable
- Safe tools — with guards, rate limits, and audit logs
- An eval suite — run on every deploy
- A human escalation path — when confidence is low, hand off gracefully
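The last item, the escalation path, is the one teams most often skip, and it can be very small. A sketch of confidence-gated routing, with an assumed threshold and field name (`confidence`) that you would tune against your own eval data:

```python
# Assumed cut-off; calibrate against your eval suite, don't guess in prod.
CONFIDENCE_THRESHOLD = 0.8


def dispatch(agent_result):
    """Route low-confidence (or confidence-free) results to a human
    instead of acting on them automatically."""
    if agent_result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return {"route": "human_review", "payload": agent_result}
    return {"route": "auto", "payload": agent_result}
```

Note the default: a result that reports no confidence at all goes to a human, so the safe path is also the lazy path.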
None of this is glamorous. But it's the difference between a demo and something you can actually stake your business on.
If you're building an agent and want a second opinion on the architecture, get in touch. I'm usually quick to spot the parts that will break first.