There's a comforting ritual in every operations team: you pull up the dashboard, the tiles are green, and you go home. Latency is healthy, error rates are flat, the agent is up. The machine is working. Everybody relax.
I want to ruin that ritual for you, at least where AI agents are concerned. Because a green tile on an agent fleet is answering a question nobody in a regulated industry actually cares about. It tells you the service responded. It says nothing about whether the response was right, whether the agent was allowed to do what it did, or whether you could prove any of it to a skeptical examiner six months later. The dashboard is green and it is lying to you — not maliciously, just by omission. It's measuring the wrong layer.
Uptime measures the pipe, not the water
Traditional observability was built for deterministic systems. A request comes in, code runs the same way every time, and a 200 means the thing did what it was supposed to. In that world, availability is a reasonable proxy for correctness. If the service is up and not throwing errors, it's probably doing its job.
Agents break that assumption completely. An LLM-backed agent can return a confident, well-formatted, HTTP-200 answer that is also wrong, fabricated, non-compliant, or quietly outside its authority, and your monitoring will register all of that as success. The pipe delivered water. Nobody checked whether the water was clean. Worse, the failure modes that matter most for agents are precisely the ones that don't trip a latency or error alarm: the tool call that touched a record it shouldn't have, the summary that hallucinated a number, the chain of reasoning that took a shortcut around a control. Those are silent successes on an uptime dashboard. They're the whole ballgame in a risk review.
So here's the reframe I push with my own teams: stop treating an agent like a service and start treating it like an employee with system access. You don't evaluate a new analyst by confirming they showed up. You evaluate the work. Agent observability has to do the same thing, and almost nobody's tooling does it out of the box.
The model-risk review is where the dashboard dies
If you operate anywhere near financial services — and serving 1,500+ financial institutions means I live in this world — you eventually meet a model-risk reviewer. Could be your own second line, could be an examiner, could be a customer's diligence team. They are not impressed by your uptime. They have a different and much harder set of questions, and they're the right ones:
- Did this agent do what it claimed to do? Show me the run, not the rollup.
- How do you know the output was good enough to act on? What was checked, and against what bar?
- Who or what authorized each action it took? And can you reconstruct that after the fact, for a specific decision, on demand?
None of those map to a tile. They map to three capabilities most agent deployments are missing: run-level evals, output trust checks, and audit trails that prove the work underneath the green actually happened. That's the real mechanic. If you can't produce those three artifacts, you don't have an observability gap — you have an unprovable system, and unprovable systems don't survive contact with a regulator or a serious customer.
Run-level evals: grade the work, not the wrapper
The first thing missing is evaluation at the granularity of an individual run. Most teams test their agent the way they tested a model: a benchmark suite at build time, a quality score in a slide deck, and then production becomes a black box. But the run that matters is the one that just happened in front of a customer, not the average over a test set from last quarter.
Run-level evals mean every meaningful execution gets scored against expectations — automatically, in production, at volume. Did the output match the schema and the task? Did the agent stay inside its tool scope? Did it ground its claims in retrieved context or invent them? You can do a lot of this cheaply: deterministic checks for structure and policy, retrieval-grounding checks for factuality, and a sampled LLM-as-judge layer for the squishier quality calls — with humans reviewing the judge's judgment, because an unaudited grader is just another unprovable system. The point isn't perfection. The point is that every run leaves a grade behind, so "the agent is working" becomes a measurable claim instead of a vibe.
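To make "every run leaves a grade behind" concrete, here is a minimal sketch of the cheap, deterministic layer only. Everything named in it is an assumption for illustration: the `AgentRun` fields, the `REQUIRED_KEYS` schema, the `ALLOWED_TOOLS` scope, and especially the substring-based grounding check, which is a crude stand-in for a real retrieval-grounding eval.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """One agent execution. All field names are hypothetical."""
    run_id: str
    output: dict
    tools_called: list
    retrieved_context: str
    claims: list = field(default_factory=list)


# Assumed output schema and tool scope, for illustration only.
REQUIRED_KEYS = {"summary", "amount"}
ALLOWED_TOOLS = {"lookup_account", "search_docs"}


def grade_run(run: AgentRun) -> dict:
    """Score a single run with cheap deterministic checks."""
    # Structure: does the output carry every required key?
    schema_ok = REQUIRED_KEYS.issubset(run.output)

    # Policy: did the agent call any tool outside its allowed scope?
    out_of_scope = [t for t in run.tools_called if t not in ALLOWED_TOOLS]

    # Grounding (naive): a claim counts as grounded only if it appears
    # verbatim, case-insensitively, in the retrieved context.
    context = run.retrieved_context.lower()
    ungrounded = [c for c in run.claims if c.lower() not in context]

    return {
        "run_id": run.run_id,
        "schema_ok": schema_ok,
        "out_of_scope_tools": out_of_scope,
        "ungrounded_claims": ungrounded,
        "passed": schema_ok and not out_of_scope and not ungrounded,
    }
```

The grade is a plain dict on purpose, so it can be stored next to the run and queried later. A sampled LLM-as-judge score, with human review of the judge, would be an additional field rather than a replacement for these checks.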
Output trust checks: a gate, not a gauge
Evals tell you how you did. Trust checks decide whether the output is allowed to leave the building. This is the part teams skip because it adds latency and friction, and it is the part I refuse to ship without.
Before an agent's output reaches a customer or triggers a downstream action, it should pass a gate: confidence and grounding above a threshold, no PII or sensitive data leaking where it shouldn't, the action inside policy, and, for anything consequential, a human in the loop. When a check fails, the system degrades gracefully: it abstains, escalates, or falls back, rather than confidently doing the wrong thing at machine speed. An honest "I can't verify this, routing to a person" beats a fast, fluent, wrong answer every single time. That's not me being conservative for its own sake. At agent speed and agent scale, an ungated mistake doesn't happen once; it happens ten thousand times before your green dashboard so much as flickers.
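A gate like this can be sketched in a few lines. Treat the following as an illustration under stated assumptions, not a reference design: the `Candidate` fields, the 0.8 threshold, the `ALLOWED_ACTIONS` set, and the single SSN-shaped regex standing in for real PII detection are all hypothetical.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Decision(Enum):
    RELEASE = "release"
    ESCALATE = "escalate"   # route to a human
    ABSTAIN = "abstain"     # refuse to emit or act


@dataclass
class Candidate:
    """An output waiting at the gate. Field names are hypothetical."""
    text: str
    grounding_score: float  # assumed to come from an upstream eval, 0..1
    action: str
    consequential: bool


GROUNDING_THRESHOLD = 0.8                            # assumed bar
ALLOWED_ACTIONS = {"send_summary", "update_note"}    # assumed policy
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # toy PII check


def trust_gate(c: Candidate) -> Tuple[Decision, str]:
    """Return a decision and the reason for it; fail closed."""
    if c.action not in ALLOWED_ACTIONS:
        return Decision.ABSTAIN, "action outside policy"
    if SSN_PATTERN.search(c.text):
        return Decision.ABSTAIN, "possible PII in output"
    if c.grounding_score < GROUNDING_THRESHOLD:
        return Decision.ESCALATE, "grounding below threshold"
    if c.consequential:
        return Decision.ESCALATE, "consequential action needs human approval"
    return Decision.RELEASE, "all checks passed"
```

The ordering is a design choice: hard policy and data-leak failures abstain outright, while uncertainty and consequence route to a person. Returning the reason alongside the decision means the gate's verdict can be logged, not just enforced.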
Audit trails: the receipts that make the green real
The third piece is the one that turns all of this from engineering hygiene into governance: the audit trail. For any given decision, you need to be able to reconstruct what the agent saw, what it decided, which tools it called with which parameters, what evals scored it, which trust checks it passed, and who or what authorized the action. Full lineage, queryable after the fact, retained as long as your regulators expect.
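As a rough sketch of what "full lineage, queryable after the fact" might look like, the example below keeps one record per decision with the fields listed above. The class, its method names, and the in-memory list are hypothetical. The SHA-256 hash chain is an added illustrative choice for tamper evidence, not something this piece prescribes, and a real system would need durable storage and a retention policy.

```python
import hashlib
import json
from datetime import datetime, timezone


class AuditTrail:
    """Append-only, hash-chained decision log (in-memory sketch)."""

    GENESIS = "0" * 64

    def __init__(self):
        self._records = []

    @staticmethod
    def _digest(body: dict) -> str:
        # Canonical JSON so the same record always hashes the same way.
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def append(self, run_id, inputs, decision, tool_calls,
               eval_grade, trust_checks, authorized_by) -> dict:
        body = {
            "run_id": run_id,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
            "inputs": inputs,              # what the agent saw
            "decision": decision,          # what it decided
            "tool_calls": tool_calls,      # tools and their parameters
            "eval_grade": eval_grade,      # how the run was scored
            "trust_checks": trust_checks,  # which gates it passed
            "authorized_by": authorized_by,
            "prev_hash": self._records[-1]["hash"] if self._records else self.GENESIS,
        }
        record = dict(body, hash=self._digest(body))
        self._records.append(record)
        return record

    def for_run(self, run_id) -> list:
        """Reconstruct everything recorded for one run."""
        return [r for r in self._records if r["run_id"] == run_id]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered record breaks it."""
        prev = self.GENESIS
        for r in self._records:
            body = {k: v for k, v in r.items() if k != "hash"}
            if r["prev_hash"] != prev or self._digest(body) != r["hash"]:
                return False
            prev = r["hash"]
        return True
```

With something like this in place, answering a diligence question becomes `trail.for_run(run_id)` plus a `trail.verify()` check, rather than a search through logs that only recorded a 200.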
This is what actually backs the green tile. A dashboard says "trust me." An audit trail says "verify me" — and in regulated work, verifiable beats trustworthy every time, because trust without evidence is exactly what the review exists to puncture. The cost of building this is real. The cost of not having it shows up at the worst possible moment: when something goes wrong, or when a customer's diligence team asks you to prove a single agent decision and you discover the only thing you logged was a 200.
The challenge
So go look at your agent dashboard. If it's all green, ask the one question that matters: green according to what? If the honest answer is "the service was up and didn't error," you are monitoring the pipe and calling it the water.
The bar for AI in any trust-based business isn't availability — it's provability. Run-level evals, output trust checks, and audit trails are how you earn the right to that green tile instead of just painting it on. Build those three things now, while the stakes are still a demo and not a deal. Because the model-risk review is coming, and it has never once been impressed by an uptime number.
