Agent Observability Starts Where the Run Went Wrong
Without run-level observability, every agent issue feels like a model problem. Usually it is not.
The first production complaint about an AI agent usually arrives as a sentence, not a metric. "It gave the wrong answer." "It took too long." "It used the wrong tool." "It should have escalated." The team opens the logs and finds a wall of tokens, timestamps, and maybe a model name. Everyone has a theory. Nobody has a trace.
This is why agent observability has to start at the run, not at the dashboard. A run is the unit of customer experience. It includes the user input, the workflow path, the sources retrieved, the tool calls, the guardrails triggered, and the final output. If you cannot reconstruct that path, you cannot tell whether the issue came from the model, the prompt, the knowledge base, the tool, or the policy layer.
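One way to picture a run as a reconstructable unit is a single record that carries each of the pieces listed above. The sketch below is illustrative only: the `RunTrace` class, its field names, and the sample values are assumptions, not a schema from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class RunTrace:
    """One agent run, recorded end to end (hypothetical field names)."""
    run_id: str
    user_input: str
    workflow_path: list[str] = field(default_factory=list)  # nodes visited, in order
    sources: list[str] = field(default_factory=list)        # retrieved source ids
    tool_calls: list[str] = field(default_factory=list)     # tools invoked, in order
    guardrails: list[str] = field(default_factory=list)     # guardrails that fired
    final_output: str = ""

    def summary(self) -> str:
        """Render the path the run took as one readable line."""
        parts = [
            f"run={self.run_id}",
            "path=" + " > ".join(self.workflow_path),
            "sources=" + ",".join(self.sources),
            "tools=" + ",".join(self.tool_calls),
            "guardrails=" + ",".join(self.guardrails),
        ]
        return " | ".join(parts)


# Made-up example run for illustration.
trace = RunTrace(
    run_id="r-001",
    user_input="Where is my refund?",
    workflow_path=["classify", "lookup_order", "answer"],
    sources=["kb/refund-policy"],
    tool_calls=["crm_lookup"],
    final_output="Your refund was issued on Monday.",
)
print(trace.summary())
```

With a record like this, "it used the wrong tool" becomes a question you can answer by reading `tool_calls` instead of debating theories.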
Four numbers are enough to begin
Teams often overbuild observability because the system feels uncertain. They add token charts, model breakdowns, cost graphs, latency histograms, and custom event streams before they can answer the basic question: did the agent complete the job correctly? Four numbers answer that question well enough to start:
- Success rate: the percentage of runs that reached the intended outcome.
- Fallback rate: how often the agent could not answer or had to use a default path.
- Escalation rate: how often the workflow handed off to a human or approval gate.
- Median completion time: how long the happy path actually takes.
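As a rough sketch, the four numbers can be computed from a list of run records in a few lines. The `Run` fields and the `run_health` function are hypothetical names, and the definition of "happy path" used here (succeeded, no fallback, no escalation) is an assumption you would adapt to your own workflow.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    succeeded: bool      # reached the intended outcome
    used_fallback: bool  # could not answer or took a default path
    escalated: bool      # handed off to a human or approval gate
    duration_s: float    # wall-clock time for the run


def run_health(runs: list[Run]) -> dict[str, float]:
    """Compute the four starter metrics over a batch of runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    total = len(runs)
    # Assumption: the happy path is a successful run with no fallback or escalation.
    happy = [
        r.duration_s
        for r in runs
        if r.succeeded and not r.used_fallback and not r.escalated
    ]
    return {
        "success_rate": sum(r.succeeded for r in runs) / total,
        "fallback_rate": sum(r.used_fallback for r in runs) / total,
        "escalation_rate": sum(r.escalated for r in runs) / total,
        "median_completion_s": median(happy) if happy else float("nan"),
    }


# Made-up sample data for illustration.
runs = [
    Run(True, False, False, 1.0),
    Run(True, False, False, 1.4),
    Run(True, False, False, 2.0),
    Run(False, True, False, 0.6),
    Run(False, False, True, 45.0),
]
health = run_health(runs)
print(health)
```

Restricting the median to happy-path runs keeps a slow human handoff from masking how long the agent itself takes when things go right.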
[Figure: example "Run health" dashboard for a support agent over the last 24 hours. Fallback rate 8.2%, median run time 1.4s, 14 escalations, 3 policy blocks. Recent events: v1.8 deployed with fallback rate unchanged; a tool timeout spike with the CRM lookup retried 12 times; a policy block on a refund request over threshold.]
Version everything that can explain a regression
A surprising number of AI incidents are not caused by the model changing. They are caused by a prompt edit, a new knowledge source, a tool timeout, or a workflow branch that quietly started receiving more traffic. If every run is linked to a prompt version, workflow version, knowledge index version, and tool policy version, regressions become much easier to isolate.
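A minimal sketch of that idea: stamp each run with the four versions, then compare a metric across one version dimension at a time. The `Versions` and `VersionedRun` classes, the `fallback_rate_by` helper, and the version labels are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Versions:
    prompt: str
    workflow: str
    knowledge_index: str
    tool_policy: str


@dataclass
class VersionedRun:
    versions: Versions
    used_fallback: bool


def fallback_rate_by(runs: list[VersionedRun], dimension: str) -> dict[str, float]:
    """Fallback rate grouped by one version dimension, e.g. 'prompt'."""
    totals: dict[str, int] = defaultdict(int)
    fallbacks: dict[str, int] = defaultdict(int)
    for run in runs:
        key = getattr(run.versions, dimension)
        totals[key] += 1
        fallbacks[key] += run.used_fallback  # True counts as 1
    return {key: fallbacks[key] / totals[key] for key in totals}


# Made-up data: only the prompt version differs between the two groups.
old = Versions(prompt="p7", workflow="w3", knowledge_index="k12", tool_policy="t2")
new = replace(old, prompt="p8")
runs = (
    [VersionedRun(old, False)] * 3 + [VersionedRun(old, True)]
    + [VersionedRun(new, False)] * 2 + [VersionedRun(new, True)] * 2
)
by_prompt = fallback_rate_by(runs, "prompt")
by_workflow = fallback_rate_by(runs, "workflow")
print(by_prompt)
print(by_workflow)
```

In this toy data the rate splits cleanly along the prompt version and not along the workflow version, which is the signal that points at the prompt edit.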
This is especially important for teams that are improving agents continuously. The more often you ship small behavior changes, the more important it becomes to know exactly what changed. Observability is not separate from release management. It is how release management stays honest.
Log decisions, not every thought
There is a temptation to log everything an agent produces internally. That creates noise and can create privacy problems. A better approach is to log the decisions that changed the run: which route was selected, which source was retrieved, which tool was called, which guardrail fired, which approval was requested, and which fallback was used.
Those events create a narrative a human can follow. They also create useful dimensions for analysis. If one tool is responsible for most timeouts, that is an engineering problem. If one policy block triggers constantly, that may be a product rule that needs clarification. If one prompt version raises fallback rate, rollback becomes obvious.
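As an illustration, a decision log can be a small, closed set of event kinds serialized one per line and then counted along whichever dimension you care about. The `Decision` class, the kind names, and the `timeouts_by_tool` helper below are assumptions made for this sketch.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass

# Assumption: the decision kinds mirror the list in the text above.
DECISION_KINDS = {"route", "source", "tool", "guardrail", "approval", "fallback"}


@dataclass
class Decision:
    run_id: str
    kind: str            # one of DECISION_KINDS
    name: str            # which route, source, tool, guardrail, ...
    outcome: str = "ok"  # e.g. "ok", "timeout", "blocked"

    def __post_init__(self) -> None:
        if self.kind not in DECISION_KINDS:
            raise ValueError(f"unknown decision kind: {self.kind}")


def to_log_line(decision: Decision) -> str:
    """Serialize one decision as a JSON line with stable key order."""
    return json.dumps(asdict(decision), sort_keys=True)


def timeouts_by_tool(decisions: list[Decision]) -> Counter:
    """Count tool timeouts per tool name."""
    return Counter(
        d.name for d in decisions if d.kind == "tool" and d.outcome == "timeout"
    )


# Made-up events for illustration.
decisions = [
    Decision("r-001", "route", "refund_flow"),
    Decision("r-001", "tool", "crm_lookup", "timeout"),
    Decision("r-002", "tool", "crm_lookup", "timeout"),
    Decision("r-002", "tool", "order_api"),
    Decision("r-003", "guardrail", "refund_over_threshold", "blocked"),
]
print(to_log_line(decisions[0]))
print(timeouts_by_tool(decisions).most_common(1))
```

Rejecting unknown kinds at construction time keeps the log to decisions only, so free-form internal output never leaks into it by accident.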
Observability should change what happens next
A dashboard that nobody uses during deployment is a poster. The best agent observability is connected to gates: do not promote a new version if fallback rate rises, if latency crosses the product budget, or if escalation drops in a workflow where escalation is a safety feature.
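A gate like this can be sketched as a function that compares a candidate version against the current baseline and returns the reasons not to promote. The `Health` fields, the `promotion_blockers` name, the `tolerance` parameter, and every threshold value here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Health:
    fallback_rate: float
    median_latency_s: float
    escalation_rate: float


def promotion_blockers(
    baseline: Health,
    candidate: Health,
    latency_budget_s: float,
    escalation_is_safety_feature: bool,
    tolerance: float = 0.0,
) -> list[str]:
    """Return the reasons a candidate should not be promoted (empty means go)."""
    blockers = []
    if candidate.fallback_rate > baseline.fallback_rate + tolerance:
        blockers.append("fallback rate rose")
    if candidate.median_latency_s > latency_budget_s:
        blockers.append("latency over budget")
    # A drop in escalations only blocks where escalation is a safety feature.
    if (
        escalation_is_safety_feature
        and candidate.escalation_rate < baseline.escalation_rate - tolerance
    ):
        blockers.append("escalation rate dropped")
    return blockers


# Made-up numbers for illustration.
baseline = Health(fallback_rate=0.082, median_latency_s=1.4, escalation_rate=0.05)
quiet = Health(fallback_rate=0.080, median_latency_s=1.9, escalation_rate=0.02)
slow = Health(fallback_rate=0.100, median_latency_s=2.5, escalation_rate=0.05)

print(promotion_blockers(baseline, quiet, 2.0, escalation_is_safety_feature=True))
print(promotion_blockers(baseline, slow, 2.0, escalation_is_safety_feature=True))
```

Returning a list of reasons instead of a single boolean means the deployment log records why a release was held, not just that it was.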
In Trumpets, logs, workflow identifiers, versions, and learning episodes are part of the same operating picture. That matters because the goal is not to admire the system from a distance. The goal is to find the exact place where the run went wrong and make the next release safer.