The Problem Nobody Expected
When organisations first deployed large language models in production, observability felt like a solved problem. You had a prompt, you had a response, you had latency and token counts. Standard monitoring tools could handle it. Log the input and output, track the cost, alert on errors. Done.
Then agents arrived.
An agentic AI system does not take one input and produce one output. It reasons, plans, selects tools, calls APIs, reads files, writes to databases, spawns subagents, retries on failure, and produces a final result that may be many decision steps removed from the original instruction. A single user request can trigger dozens of intermediate actions, each one shaping the next, each one capable of producing unexpected results that cascade through the rest of the execution.
Traditional monitoring tracks what a system does. Agentic AI observability requires understanding why the system did it, what it was reasoning about at each step, and whether the path it took was safe, correct, and aligned with what the user actually intended.
This is a genuinely new problem. And for UK organisations deploying agentic AI on cloud infrastructure, getting it right is not optional. It is the difference between AI that creates value reliably and AI that creates liability silently.
For engineering teams working with a cloud consultancy to deploy agentic systems on AWS, Azure, or GCP, observability is increasingly the first conversation, not the last.
What Makes Agentic Systems Different
To understand why observability is hard for agentic AI, it helps to understand what makes agentic systems categorically different from the software that came before them.
Non-determinism at every step. The same input can produce wildly different execution paths. An agent deciding which tool to call, which API endpoint to hit, or how to interpret an ambiguous instruction may reach entirely different conclusions across two identical runs. This means you cannot snapshot a failure and replay it reliably. You cannot write a unit test that proves the agent will behave correctly next time.
Action with real-world consequences. Agentic systems are not read-only. They write to databases, send emails, provision cloud resources, call external APIs, and interact with production systems. A hallucinated decision in step three of a twelve-step workflow does not just produce a bad answer. It can corrupt data, trigger downstream processes, or initiate actions that are difficult or impossible to reverse.
Multi-agent complexity. Production agentic systems increasingly involve multiple agents working in parallel or in sequence, with one agent delegating tasks to others. When something goes wrong, the failure may originate three agents deep in a workflow that spans multiple services, accounts, and cloud regions.
Emergent behaviour. Agents can combine tools and reasoning in ways that were not anticipated during design. An agent given access to a database query tool, a web search tool, and a code execution tool may devise novel strategies that work brilliantly, or produce unexpected outcomes that none of its individual components was designed to prevent.
According to a 2025 McKinsey Global AI survey, 51% of organisations using AI experienced at least one negative consequence from AI inaccuracy. For agentic systems, where decisions compound across multiple steps and actions have real consequences, the stakes are meaningfully higher than for a chatbot producing an unhelpful response.
The Three Observability Gaps
Most organisations deploying agentic AI discover three specific observability gaps that their existing infrastructure does not address.
Gap 1: Trace coverage ends at the API call
Traditional distributed tracing tracks requests through services. It shows you that Service A called Service B, which called Service C, and tells you how long each call took. This works well for microservices.
For an agent, the unit of interest is not the API call. It is the reasoning step: why did the agent choose to make that call, what was it trying to accomplish, and did the result change its subsequent behaviour? Without tracing at the reasoning level, you can see that an agent called a database tool at 14:23:07, but not why it chose that tool over the three other tools available to it, or what it concluded from the result.
This is what the engineering community means when they say the tracing infrastructure for agentic AI is still immature. Most teams currently assemble a combination of LangSmith for LangChain applications, custom logging, and manual inspection of agent outputs. It works at small scale. It does not work in production at enterprise volume.
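To make the gap concrete, here is a minimal sketch of what reasoning-level tracing can look like with the OpenTelemetry Python SDK. The agent.* attribute names, the planner object, and the decision structure are illustrative assumptions, not an established convention:

```python
# Minimal sketch: wrapping one reasoning step in an OpenTelemetry span,
# so the trace records *why* a tool was chosen, not just that it was called.
# The agent.* attribute names and the planner API are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("agent.reasoning")

def choose_and_call_tool(agent_state, available_tools):
    with tracer.start_as_current_span("agent.reasoning_step") as span:
        # Hypothetical planner: returns the chosen tool plus a short
        # natural-language rationale for the choice.
        decision = agent_state.planner.select_tool(available_tools)

        span.set_attribute("agent.step.goal", agent_state.current_goal)
        span.set_attribute("agent.tool.selected", decision.tool_name)
        span.set_attribute("agent.tool.candidates", [t.name for t in available_tools])
        span.set_attribute("agent.step.rationale", decision.rationale)

        result = decision.tool.invoke(decision.arguments)
        span.set_attribute("agent.tool.result_summary", str(result)[:500])
        return result
```

With spans shaped like this, the question "why did it pick the database tool over the other three?" becomes a trace query rather than an archaeology exercise.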
Gap 2: No standard for what to instrument
Until recently, there was no agreed standard for what telemetry an agentic AI system should emit. Different frameworks emitted different signals, in different formats, with different semantics. Comparing observability data across a LangGraph agent and an AutoGen agent and a custom-built agent was essentially impossible without significant custom integration work.
This is changing. OpenTelemetry's GenAI observability project is actively defining semantic conventions for AI agent observability, covering agent spans, tool calls, model invocations, and memory operations. The emerging standard defines attributes for tracing tasks, actions, agents, teams, artifacts, and memory, and is intended to work across frameworks including CrewAI, AutoGen, LangGraph, Semantic Kernel, and others.
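As a sketch of what instrumenting against these conventions can look like, the example below sets a handful of gen_ai.* attributes on a tool-call span. The conventions are still incubating, so the exact attribute names may shift before they stabilise, and the agent name is a placeholder:

```python
# Sketch of a tool-call span using the incubating OpenTelemetry GenAI
# semantic conventions. The gen_ai.* attribute names reflect the draft
# conventions at the time of writing and may change before stability.
from opentelemetry import trace

tracer = trace.get_tracer("agent.instrumentation")

def traced_tool_call(tool_name, tool_fn, arguments):
    with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
        span.set_attribute("gen_ai.operation.name", "execute_tool")
        span.set_attribute("gen_ai.tool.name", tool_name)
        span.set_attribute("gen_ai.agent.name", "order-support-agent")  # placeholder
        return tool_fn(**arguments)
```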
Organisations building on AWS can take advantage of AWS Bedrock's native tracing, which emits OpenTelemetry-compatible telemetry for Bedrock Agent invocations. This is the direction the industry is moving: vendor implementations of OpenTelemetry GenAI conventions, enabling consistent observability regardless of which foundation model or framework is in use.
Gap 3: Evaluation is not monitoring
The third gap is conceptual. Many engineering teams treat observability and evaluation as the same thing. They are not.
Monitoring tells you that something went wrong. Evaluation tells you whether the agent is doing the right thing. For agentic AI, both are required, and they require different infrastructure.
A monitoring system might alert you that an agent's error rate has increased. An evaluation system asks whether the agent's outputs were actually correct, relevant, safe, and aligned with the user's intent, across thousands of interactions, using a combination of deterministic checks and LLM-as-judge assessments. Without evaluation infrastructure, organisations are flying blind: they know the agent is running, but not whether it is working.
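A minimal sketch of the distinction, assuming traces arrive as dictionaries of steps: the deterministic checks are plain assertions over the trace, while the LLM-as-judge step is a separate model invocation. The task contract and call_judge_model are placeholders for your own tooling:

```python
# Minimal evaluation sketch over one sampled production trace.
# Deterministic checks inspect the trace directly; the LLM-as-judge
# step sends the output plus a rubric to a separate model.

EXPECTED_TOOLS = {"lookup_order", "send_reply"}  # illustrative task contract

def deterministic_checks(trace):
    called = {step["tool"] for step in trace["steps"] if step["type"] == "tool_call"}
    return {
        "task_completed": trace["final_status"] == "success",
        "expected_tools_used": EXPECTED_TOOLS <= called,
        "no_out_of_scope_tools": called <= EXPECTED_TOOLS | {"search_kb"},
    }

def llm_judge(trace, call_judge_model):
    rubric = (
        "Given the user's request and the agent's final answer, rate 1-5 "
        "for accuracy, relevance, and safety. Respond as JSON."
    )
    # call_judge_model is a placeholder for whatever model client you use.
    return call_judge_model(rubric, trace["user_input"], trace["final_output"])

def evaluate(trace, call_judge_model):
    return {**deterministic_checks(trace), "judge": llm_judge(trace, call_judge_model)}
```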
What Good Agentic Observability Looks Like
The organisations that have invested seriously in agentic observability tend to converge on a set of capabilities. These are not all achievable on day one, but they represent the target state for any team operating agentic AI in production at meaningful scale.
End-to-end trace capture
Every agent execution should produce a complete, structured trace: from the initial user input, through every reasoning step, tool call, API request, and intermediate result, to the final output. The trace should capture (a minimal record schema is sketched after this list):
- Which model was used at each step, and which version
- The full input to each model invocation, including system prompt and context
- The model's output and any tool calls it requested
- The result of each tool call
- The agent's subsequent reasoning given that result
- Token usage, cost, and latency at each step
- Any errors, retries, or unexpected branches
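One plausible shape for a single step record, with every field above represented explicitly. The schema is illustrative rather than any platform's native format:

```python
# Illustrative schema for one step of an agent trace. Real deployments
# typically follow their observability backend's format instead.
from dataclasses import dataclass, field

@dataclass
class AgentTraceStep:
    step_index: int
    model: str                      # model identifier, including version
    system_prompt: str
    context: str                    # full input to this invocation
    model_output: str
    tool_calls: list = field(default_factory=list)    # tool calls requested
    tool_results: list = field(default_factory=list)  # results fed back in
    reasoning_summary: str = ""     # agent's stated rationale for the next step
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    latency_ms: float = 0.0
    errors: list = field(default_factory=list)        # errors, retries, odd branches
```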
This level of trace granularity is expensive to store and process at high volume. Most organisations implement tiered retention: full traces for a short window, aggregated metrics for longer periods, and selective full-trace retention for failure cases and sampled successful executions.
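A retention policy of this kind can be as simple as a routing function. The windows and the one-per-cent sample rate below are placeholders to tune against your storage budget:

```python
import random

# Illustrative tiered-retention decision: keep every failure in full,
# sample successes, and age everything else down to aggregated metrics.
def retention_tier(trace):
    if trace["final_status"] != "success":
        return "full_trace_long_term"        # all failure cases kept in full
    if random.random() < 0.01:
        return "full_trace_long_term"        # sampled successful executions
    return "full_trace_7_days_then_metrics_only"
```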
Guardrail telemetry
Every guardrail in an agentic system, whether input filtering, output validation, tool use restrictions, or safety classifiers, should emit observable signals. You need to know not just that a guardrail fired, but how often it fires, for what types of inputs, and whether it is blocking legitimate agent behaviour or catching genuine safety issues.
Without guardrail telemetry, you cannot tune your safety mechanisms. A guardrail that fires too aggressively degrades the user experience. A guardrail that fires too rarely is not providing the protection it was designed for. Both failure modes are invisible without instrumentation.
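A minimal sketch of guardrail telemetry using the OpenTelemetry metrics API. The metric and label names are assumptions, not a convention:

```python
# Sketch: emitting guardrail firings as OpenTelemetry metrics so firing
# rates can be charted and alerted on per guardrail and input category.
from opentelemetry import metrics

meter = metrics.get_meter("agent.guardrails")
guardrail_counter = meter.create_counter(
    "guardrail.evaluations",
    description="Guardrail evaluations, labelled by outcome",
)

def record_guardrail(name, outcome, input_category):
    # outcome: "passed", "blocked", or "flagged" (illustrative values)
    guardrail_counter.add(1, {
        "guardrail.name": name,
        "guardrail.outcome": outcome,
        "input.category": input_category,
    })
```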
Cost and token attribution
For agentic systems, cost is non-trivial and can be surprising. A single user request that triggers a multi-step agent workflow may invoke a foundation model ten or fifteen times, each invocation consuming tokens. At scale, the cost difference between a well-optimised agent workflow and a poorly designed one can be an order of magnitude.
Token usage and cost should be attributed at the task level, not just the API call level. You need to know the cost per completed workflow, not just the cost per model invocation, so you can identify which agent behaviours are economically viable and which are not.
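In practice this means rolling per-invocation token counts up to the workflow level, along these lines. The per-token prices are hypothetical; real rates vary by model and region:

```python
# Sketch: rolling per-invocation token usage up to workflow-level cost.
PRICE_PER_1K = {"model-a": {"input": 0.003, "output": 0.015}}  # hypothetical rates

def workflow_cost(steps):
    total = 0.0
    for step in steps:
        rates = PRICE_PER_1K[step["model"]]
        total += step["input_tokens"] / 1000 * rates["input"]
        total += step["output_tokens"] / 1000 * rates["output"]
    # Attribute this figure to the completed workflow, not the single call.
    return total
```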
Anomaly detection on execution paths
Because agentic behaviour is non-deterministic, traditional threshold-based alerting is insufficient. A single metric like "average number of tool calls per request" may obscure the variance that signals a problem. What you need is baseline modelling of normal execution path distributions, and alerting when individual executions deviate significantly from that baseline.
This is one of the areas where platforms like Arize AI, LangSmith, and Weights & Biases are investing most heavily in 2026. The goal is to move from manual inspection of agent traces to automated detection of anomalous agent behaviour patterns.
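The underlying idea can be sketched simply: maintain a baseline of path features from recent normal executions and flag outliers. Production systems model far richer features than the single tool-call count used here:

```python
# Sketch: flagging executions whose shape deviates from a rolling baseline.
# "Shape" here is just the tool-call count; richer features (path entropy,
# step sequences, retry patterns) are what real platforms model.
import statistics

def is_anomalous(execution, baseline, z_threshold=3.0):
    # baseline: tool-call counts from recent executions judged normal
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline) or 1.0  # guard against zero variance
    z = abs(execution["tool_call_count"] - mean) / stdev
    return z > z_threshold
```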
The AWS Architecture for Agentic Observability
For organisations building agentic AI on AWS, the observability stack has become more structured in the past twelve months. An experienced AWS cloud consultancy will typically recommend an architecture that combines AWS-native services with purpose-built AI observability tooling.
Foundation layer: AWS Bedrock Agents with tracing enabled. Bedrock emits trace events for agent steps, model invocations, tool calls, and knowledge base retrievals. These are captured in CloudWatch and can be exported to a central observability backend via Firehose.
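Enabling this is a one-flag change in the invocation call. The sketch below uses boto3's bedrock-agent-runtime client, with placeholder agent identifiers standing in for your own deployment:

```python
# Sketch: invoking a Bedrock agent with tracing enabled and reading
# trace events off the response stream.
import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_agent(
    agentId="AGENT_ID",             # placeholder
    agentAliasId="AGENT_ALIAS_ID",  # placeholder
    sessionId="session-001",
    inputText="Summarise yesterday's failed orders",
    enableTrace=True,               # emit step-level trace events
)

for event in response["completion"]:
    if "trace" in event:
        # Trace events cover orchestration steps, model invocations, and
        # knowledge base retrievals; ship them to your backend here.
        print(event["trace"])
```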
Orchestration observability: For agents built on LangGraph or custom orchestration frameworks, OpenTelemetry instrumentation using the emerging GenAI semantic conventions. The OTel collector exports to your chosen backend, maintaining consistency with the rest of your service telemetry.
Evaluation pipeline: Automated evaluation runs against sampled production traces, using a combination of deterministic checks (did the agent complete the task? did it call the expected tools?) and LLM-as-judge evaluation (was the output accurate? safe? relevant?). AWS Step Functions can orchestrate evaluation workflows triggered by trace ingestion.
Cost attribution: AWS Cost Explorer with resource tagging at the agent level, supplemented by token-level cost tracking within the observability platform. Every agent execution should carry a tag that allows its cloud cost to be attributed to the feature, team, and use case that generated it.
Alerting: CloudWatch alarms on key agent metrics, supplemented by anomaly detection within the observability platform. Critical alerts (agent taking actions outside expected scope, guardrail firing at unusual rates, cost per execution exceeding thresholds) should route to the same incident management system as infrastructure alerts.
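As one example, a CloudWatch alarm on a cost-per-execution metric might look like the following. The namespace, metric name, and threshold are assumptions about how you publish custom agent metrics, not AWS-defined names:

```python
# Sketch: a CloudWatch alarm on a custom cost-per-execution metric,
# routing to the same SNS topic as infrastructure alerts.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="agent-cost-per-execution-high",
    Namespace="AgentObservability",    # custom namespace (assumed)
    MetricName="CostPerExecutionUSD",  # custom metric (assumed)
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.50,                    # illustrative threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-2:123456789012:agent-alerts"],  # placeholder
)
```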
The Governance Dimension
For UK organisations, agentic AI observability is not purely a technical concern. It is a governance requirement that is increasingly being shaped by regulatory context.
The EU AI Act, which applies to UK organisations operating in European markets, classifies many enterprise agentic AI applications as high-risk systems requiring documented evidence of human oversight, explainability, and auditability. Observability infrastructure is the technical foundation for meeting these requirements.
Specifically, the ability to answer "why did the agent take that action?" for any production execution is not just useful for debugging. It is increasingly a compliance requirement for organisations deploying agents in regulated sectors including financial services, healthcare, and legal.
AI agent observability is the ability to see, understand, and explain what enterprise AI agents are doing across systems with enough detail to debug issues, enforce guardrails, and prove compliance. It is the infrastructure that answers "why did the agent do that?" before regulators, auditors, or end users have to ask.
For UK organisations working with a cloud consultancy that understands the intersection of cloud architecture and regulatory compliance, this framing is important. Observability is not an engineering nicety. It is an audit trail.
What the Market Is Converging On
In November 2025, Palo Alto Networks announced a $3.35 billion acquisition of Chronosphere, one of the fastest-growing observability platforms of the AI era. The acquisition signals something important: observability is becoming security infrastructure. You cannot secure what you cannot see, and as organisations deploy agentic AI systems with real-world access and consequences, the need for real-time visibility into agent behaviour has become mission-critical.
The broader tooling landscape is consolidating around a handful of patterns:
OpenTelemetry as the instrumentation standard. The GenAI semantic conventions are still stabilising, but the direction is clear. Organisations that instrument their agents against the emerging OTel GenAI standard now will avoid painful migration work as the conventions reach stability.
Evaluation platforms separate from monitoring platforms. LangSmith, Arize, Braintrust, and Weights & Biases are all investing in evaluation infrastructure that sits alongside, not inside, traditional monitoring. The pattern is: monitor for anomalies and errors, evaluate for quality and alignment.
Guardrails as observable infrastructure. Services like AWS Bedrock Guardrails, Anthropic's Constitutional AI constraints, and purpose-built filtering layers are being instrumented and monitored as first-class infrastructure components, not invisible middleware.
Cost as a first-class observability signal. Token costs, cloud compute costs, and per-workflow unit economics are being tracked with the same rigour as latency and error rates. This is partly a FinOps discipline and partly an architectural feedback mechanism: if a specific agent workflow is unexpectedly expensive, that is a signal about its design, not just its bill.
Getting Started: A Practical Sequence
For engineering teams at the beginning of their agentic observability journey, the practical starting sequence is:
Instrument before you scale. The worst time to add observability to an agentic system is after it is running at production volume. The instrumentation adds overhead and complexity, and the absence of historical baseline data makes it hard to distinguish normal from anomalous. Build observability into the agent from the first production deployment.
Start with trace capture and cost attribution. These are the two signals with the clearest immediate value. Full execution traces let you debug failures. Cost attribution lets you identify economically unviable agent behaviours before they become budget problems.
Add evaluation infrastructure before you iterate aggressively. If you are making changes to agent prompts, tools, or reasoning patterns without an evaluation pipeline, you have no systematic way to know whether changes are improvements or regressions. Even a basic evaluation suite, covering the most common task types with a mix of deterministic and LLM-as-judge checks, dramatically improves the reliability of iteration.
Instrument your guardrails. Whatever safety and security controls you have in place, make them observable. Track firing rates, false positive rates, and the categories of inputs that trigger them.
Align with OpenTelemetry GenAI conventions. Even if the conventions are not yet stable, aligning your instrumentation with the emerging standard now reduces future migration cost and enables interoperability with the growing ecosystem of tools that will consume OTel GenAI telemetry.
For organisations that want to compress this timeline, engaging a specialist AWS cloud consultancy with experience in agentic AI architectures can help avoid the instrumentation mistakes that are expensive to undo and establish the observability foundations that governance and compliance requirements will eventually demand.
The Bottom Line
Agentic AI is moving from experimental to production faster than the observability infrastructure needed to govern it. The gap between prototype and reliable production system is wider for agentic AI than for any previous generation of software, and observability is a large part of what determines whether organisations successfully cross it.
Traditional monitoring tells you a system is running. Agentic AI observability tells you what it is doing, why it made the decisions it made, whether those decisions were safe and correct, and what they cost. That is a different capability, built on different infrastructure, requiring different expertise.
The organisations that build this infrastructure deliberately, as a design requirement rather than an afterthought, will have both the operational reliability and the audit trail that regulators, customers, and boards increasingly expect from AI systems that act with real-world consequences.
For UK organisations navigating this investment on AWS, the combination of Bedrock's native agent tracing, OpenTelemetry's emerging GenAI conventions, and purpose-built evaluation platforms provides a coherent foundation. Getting the architecture right from the start is significantly less expensive than retrofitting observability into a production agentic system that has already accumulated months of unobserved behaviour.