Deploying AI agents in production: what enterprise teams need to get right

Updated: 28 Apr, 2026 · 9 min read
Andrei, Lead Engineer

The demo worked. The production deployment didn't.

There's a pattern that keeps repeating across enterprise AI programmes in 2026. A small team connects a foundation model to a handful of internal APIs, runs it against a clean dataset, watches it execute a workflow autonomously, and declares the pilot a success. Six months later, the same system is either still in the pilot environment or has been quietly shelved.

Recent reports on agent adoption all point in the same direction: enterprise interest is high, but durable production adoption is much harder than running a successful pilot. Many teams can show a working agent demo; far fewer have the operational controls, evaluation process, and ownership model needed to run agents across real workflows.

That gap isn't a technology problem. The models are capable. The tooling has improved substantially. The gap is an infrastructure and governance problem, and it has specific, diagnosable causes.

For UK enterprise teams building on AWS, Azure, or GCP, the decisions that determine whether an agent deployment succeeds happen mostly before the first production request arrives. An experienced cloud consultancy sees this pattern repeatedly: the teams that get production right invest heavily in the architecture and governance layer. The teams that don't get it right invest in the demo and discover the gaps under load.


Why pilots succeed and production fails

Pilots succeed because they're designed to. Clean data. Controlled inputs. Small user groups. Sympathetic evaluators. A team that's available to intervene if something goes wrong.

Production removes all of those conditions. Real users send ambiguous requests. Real data has inconsistencies and edge cases. The team that built the agent isn't watching every request. And the consequences of a bad output aren't limited to a disappointing demo. They might mean a corrupted database record, a sent email that shouldn't have been sent, or a cloud resource provisioned without authorisation.

The biggest bottleneck of 2025 was the integration wall: every agent needed a custom connector for every tool. That changed with the widespread adoption of the Model Context Protocol (MCP). But the integration layer is only one of several places where production deployments come apart. The others are harder to fix with a protocol.


The integration wall and how MCP changes it

Before MCP, connecting an AI agent to enterprise systems meant writing custom connectors for each tool, each data source, each API. As the number of agents and internal tools grew, the integration surface expanded quickly, creating maintenance liability and more failure points.

MCP, introduced by Anthropic and now governed by the Linux Foundation under the Agentic AI Foundation, provides a standardised interface: one protocol that any compliant agent can use to interact with any MCP-compatible tool.

For enterprise teams, MCP changes the integration calculus substantially. Instead of building point-to-point connectors, you build MCP servers for your internal systems once, and any agent framework (LangGraph, AutoGen, CrewAI, Bedrock Agents) can use them.
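
As a rough sketch of how small that server can be (using the official MCP Python SDK; the server name, tool, and underlying data source are hypothetical), an internal system is wrapped once and then reused by any compliant client:

```python
# Minimal MCP server sketch using the official Python SDK (mcp package).
# The tool below is a stub; a real server would call an internal API with
# credentials scoped to the access the agent actually needs.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-orders")  # server name advertised to clients


@mcp.tool()
def get_order_status(order_id: str) -> str:
    """Read-only lookup of an order's status in an internal system."""
    return f"Order {order_id}: status unknown (stub)"


if __name__ == "__main__":
    mcp.run()  # serves over stdio by default; HTTP transports also exist
```

Any MCP-compatible client can then discover and call get_order_status without a bespoke connector, which is exactly why the access and audit questions below matter.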

MCP does not eliminate integration complexity. It standardises it, which is not the same thing. You still need to decide which tools your agents can access, how that access is authenticated, what data they can read and write, and what audit trail you'll maintain for every tool invocation. The protocol's focus is simplicity and ease of integration, not authentication and encryption. The security layer has to be built on top of the protocol, not assumed to come with it.


The security problem most teams discover too late

As enterprises adopt AI and ML tools, more operational data moves through systems that were not originally designed for agentic access. Much of that movement happens before security, audit, and governance controls have caught up.

For agents that have tool access to production systems, the security surface is larger than it appears. An agent granted access to a database query tool and an email sending tool and a file system tool has, in combination, capabilities that no individual tool grants alone. It can read sensitive data, summarise it, and send it somewhere. That's not a hypothetical. Security researchers have documented prompt injection attacks against MCP-enabled agents, where malicious content in a tool's response instructs the agent to take actions the user didn't request.

The security controls that work for agentic systems are different from traditional application security. They need to operate at the reasoning level, not just the API level.

In practice, that means a few things. Agents should operate under a minimal-tool principle: give each agent access to the smallest set of tools required for its task, not the largest set that might be useful. An agent doing document summarisation does not need write access to your CRM. Every tool invocation should produce an audit log entry capturing what was called, with what parameters, by which agent, in which session, and at what time. In regulated industries this is a compliance requirement; across all industries it's the only way to investigate an unexpected agent action after the fact. Tool permissions should be defined as code and reviewed like infrastructure changes, not granted informally and left undocumented. MCP servers should run in isolated environments with egress filtering, so a compromised or manipulated server cannot reach internal network resources beyond its defined scope.
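
A minimal sketch of the audit-logging piece, with field names, logger setup, and the example tool as illustrative assumptions rather than any standard, is to route every tool call through a wrapper that records what was called, with what parameters, by which agent, and when:

```python
# Sketch: a structured audit entry for every tool invocation.
# Field names and the logging backend are illustrative; a real deployment
# would ship these entries to a centralised, append-only store.
import functools
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("agent.tool_audit")


def audited_tool(agent_id: str, session_id: str):
    """Wrap a tool so each invocation is logged before it runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            audit_log.info(json.dumps({
                "event": "tool_invocation",
                "invocation_id": str(uuid.uuid4()),
                "tool": func.__name__,
                "agent_id": agent_id,
                "session_id": session_id,
                "parameters": {"args": args, "kwargs": kwargs},
                "timestamp": time.time(),
            }, default=str))
            return func(*args, **kwargs)
        return wrapper
    return decorator


@audited_tool(agent_id="doc-summariser", session_id="sess-1234")
def search_documents(query: str) -> list[str]:
    """Hypothetical read-only tool; this agent holds no write-capable tools."""
    return []
```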

Teams often discover a gap between confidence and actual governance only after an agent is already connected to production tools. That gap is where most security incidents originate.


Output quality at volume

In a pilot, quality problems are visible. Someone reviews outputs, notices an error, flags it. In production, an agent processing thousands of requests a day produces errors that nobody notices until they've accumulated into something significant.

Quality is the production killer. This encompasses accuracy, relevance, consistency, and an agent's ability to maintain the right tone and adhere to brand or policy guidelines. Latency usually follows close behind because slow agents are hard to trust inside real workflows.

The response to output quality problems in production is an evaluation infrastructure: systematic, automated assessment of agent outputs at volume. This is distinct from observability (knowing what the agent did) and monitoring (knowing when something went wrong). Evaluation asks whether the agent is doing the right thing across thousands of interactions, using a combination of deterministic checks and LLM-as-judge assessments.

Many teams add observability before they add evaluation, but observability and evaluation answer different questions. Observability tells you a problem exists. Evaluation helps you catch the problem before it reaches a user.

A minimum viable evaluation suite for a production agent covers: did the agent complete the task, did it use the expected tools, did the output meet accuracy requirements on a sample of known-answer cases, and did it stay within the scope of its defined behaviour. This doesn't require a sophisticated platform on day one. A structured test set and a weekly review cadence catches more regressions than most teams expect.
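
As a sketch of that minimum (run_agent and the test cases are placeholders to swap for your own agent and known-answer data), the whole suite can start as a handful of deterministic checks:

```python
# Sketch of a minimum viable evaluation run: deterministic checks over a
# small set of known-answer cases. EvalCase fields and run_agent() are
# placeholders for your own agent and data.
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected_tools: set[str]   # tools the agent is expected to call
    must_contain: str          # deterministic accuracy check on the output


def run_agent(prompt: str) -> tuple[str, set[str]]:
    """Placeholder: returns (final_output, tools_actually_called)."""
    raise NotImplementedError


def evaluate(cases: list[EvalCase]) -> None:
    failures = []
    for case in cases:
        output, tools_used = run_agent(case.prompt)
        completed = bool(output.strip())
        tools_ok = case.expected_tools.issubset(tools_used)
        accurate = case.must_contain.lower() in output.lower()
        if not (completed and tools_ok and accurate):
            failures.append((case.prompt, completed, tools_ok, accurate))
    print(f"{len(cases) - len(failures)}/{len(cases)} cases passed")
    for prompt, completed, tools_ok, accurate in failures:
        print(f"FAIL: {prompt!r} completed={completed} "
              f"tools={tools_ok} accurate={accurate}")
```

An LLM-as-judge pass for tone and scope can be layered on later; deterministic checks like these already catch most of the regressions that matter week to week.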


The ownership gap

The same gaps appear repeatedly in agentic AI scaling failures: integration complexity with legacy systems, inconsistent output quality at volume, absence of monitoring tooling, unclear organisational ownership, and insufficient domain training data. Ownership gaps tend to leave monitoring gaps unfilled, which in turn makes quality problems invisible until they compound.

Ownership is the least technical problem on that list and the most common root cause of the others. When nobody is specifically responsible for the production health of an AI agent, the monitoring doesn't get built, the evaluation cadence doesn't happen, and the integration issues don't get escalated until a user complains.

The fix is the same as for any other production system: an agent in production needs an owner, someone who receives alerts, reviews evaluation results, triages unexpected behaviour, and has the authority to roll back or restrict the agent's capabilities if something goes wrong.

This person doesn't need to be a machine learning engineer. They need to understand what the agent is supposed to do, have access to the observability data that shows what it's actually doing, and have the authority to act on a discrepancy.

For teams working with a UK cloud consultancy on agentic deployments, establishing this ownership structure is something experienced advisors will identify as a prerequisite, not an afterthought. The architecture questions are the easier ones to answer. The governance questions determine whether the deployment sustains.


Narrow scope is not a limitation. It's the strategy.

The pattern across agentic AI project outcomes is consistent: narrower scope is easier to operate, evaluate, and improve. Scope is not a secondary variable. It is one of the primary determinants of AI project outcomes.

This runs counter to how most enterprise AI programmes are sold internally. The business case usually involves a broad vision: an agent that handles all of X, or automates the entire process of Y. The delivery reality is that agents which try to handle all of X fail at the edges of X, and those failures are what users remember.

The teams delivering production value reliably in 2026 are running agents on narrowly defined workflows with clear success criteria and immediate human escalation paths for anything outside scope. IT service desk, knowledge management, case routing, response generation, and escalation management are all stronger candidates than broad "automate everything" mandates because they have clear inputs and outputs.

That's not a failure of ambition. It's what production-ready agentic AI looks like in 2026.


The build-vs-operate imbalance

Reporting on teams that scale successfully points to the same imbalance: they spend proportionally more on evaluation infrastructure, monitoring tooling, and operational staffing, and proportionally less on model selection and prompt engineering. The pattern suggests that scaling failure is a build-vs-operate imbalance, not an underspending problem.

Most enterprise AI programmes are weighted heavily toward build. The engineering investment goes into the agent: prompt design, tool selection, framework choice, integration work. The operational investment in evaluation pipelines, monitoring infrastructure, incident response procedures, and human escalation paths gets less attention and less budget, because it's less visible and less exciting.

The consequence is a production system that works adequately at low volume and degrades at scale. The failure modes accumulate in the operational layer that wasn't built: undetected quality regressions, security gaps in tool permissions, lack of rollback capability, no clear process for handling unexpected agent behaviour.

Getting this balance right from the start is faster and cheaper than retrofitting operational infrastructure onto a production agent that's already serving users. The architecture decisions that support good operations (structured logging, tool permission models, evaluation hooks, rollback mechanisms) are significantly harder to add after deployment than to include from the beginning.


What the AWS architecture looks like

For organisations building on AWS, the production architecture for agentic AI has become more structured in the past year. The teams that have successfully scaled tend to combine the same core components in similar ways.

AWS Bedrock Agents handles managed agent orchestration, with tracing enabled from day one. Bedrock emits structured trace events for every agent step, model invocation, and tool call, which flow into CloudWatch and can be exported to a centralised observability backend via Kinesis Firehose. MCP servers sit alongside this as Lambda functions or ECS services, each scoped with an IAM role that enforces minimal tool permissions rather than inheriting from a permissive default.

An evaluation pipeline built on Step Functions, triggered by trace ingestion, runs sampled outputs through deterministic checks and LLM-as-judge assessments on a cadence that fits deployment volume. Resource tagging at the agent session level lets CloudWatch and Cost Explorer attribute infrastructure cost to specific agent workflows, so token costs flow through the observability platform alongside latency and error rates.

Alerting treats unexpected agent behaviour with the same priority as infrastructure alerts: an agent invoking tools outside its expected pattern, or a guardrail firing far above its normal rate, warrants the same response as a database error rate spike.
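
As an illustration of the tracing piece (the agent and alias IDs are placeholders, and the handler below just prints truncated events where a real pipeline would ship them to the observability backend), the Bedrock runtime API streams trace events alongside the agent's response:

```python
# Sketch: invoking a Bedrock agent with tracing enabled via boto3.
# Agent and alias IDs are placeholders; trace handling here just prints,
# where a production system would forward events to the observability stack.
import json
import uuid

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="eu-west-2")

response = client.invoke_agent(
    agentId="AGENT_ID_PLACEHOLDER",
    agentAliasId="ALIAS_ID_PLACEHOLDER",
    sessionId=str(uuid.uuid4()),
    inputText="Summarise open tickets for account 42",
    enableTrace=True,  # emit structured trace events for each step
)

final_text = []
for event in response["completion"]:  # streamed event sequence
    if "chunk" in event:
        final_text.append(event["chunk"]["bytes"].decode("utf-8"))
    elif "trace" in event:
        # Each trace event covers a reasoning step, model call, or tool call.
        print(json.dumps(event["trace"], default=str)[:200])

print("".join(final_text))
```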

For UK organisations working with an AWS cloud consultancy on this architecture, the value of external expertise is concentrated at the design stage. The IAM boundary conditions for MCP servers, the evaluation pipeline architecture, and the rollback strategy are decisions with multi-year consequences, and they are significantly cheaper to get right in design than to fix in production.


Getting started without getting stuck

The teams that successfully bridge pilot to production share a few structural practices that teams stuck in perpetual piloting don't have.

The first is scope discipline. Start with one narrowly defined workflow, not a category of work. A specific, repeatable task with clear inputs, outputs, and success criteria. An agent that handles one specific type of customer support request is a production deployment. An agent that handles customer support is a pilot.

The second is building evaluation infrastructure before scaling volume. Before increasing agent traffic or expanding scope, establish a baseline of what correct behaviour looks like and a mechanism to detect deviations from it. This can start small: a focused test set of known-answer cases and a weekly review cadence. It matures as the deployment does.

Tool access needs to be treated as a security surface from the first production request. Define what tools the agent can access, authenticate that access explicitly, log every invocation, and review the logs. Starting restricted and expanding based on demonstrated need is significantly safer than granting broad access and trying to restrict it after something goes wrong.
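
One way to keep that restriction explicit, sketched here with illustrative agent and tool names rather than any standard format, is to hold each agent's tool allowlist as code in version control and check it before every invocation:

```python
# Sketch: per-agent tool allowlists defined as code and enforced at call time.
# Agent and tool names are illustrative; the real mapping would live in
# version control and change only through reviewed pull requests.
TOOL_ALLOWLIST: dict[str, frozenset[str]] = {
    "doc-summariser": frozenset({"search_documents", "read_document"}),
    "ticket-triager": frozenset({"read_ticket", "assign_ticket"}),
}


class ToolAccessError(RuntimeError):
    pass


def check_tool_access(agent_id: str, tool_name: str) -> None:
    """Raise before invocation if the tool is outside the agent's allowlist."""
    allowed = TOOL_ALLOWLIST.get(agent_id, frozenset())
    if tool_name not in allowed:
        raise ToolAccessError(
            f"Agent {agent_id!r} attempted to call {tool_name!r}, "
            "which is not in its allowlist"
        )
```

Expanding an allowlist then becomes a reviewed change with an audit trail rather than an informal grant.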

Ownership needs to be named before deployment, not assigned retroactively. Somebody needs to be specifically responsible for the production health of the agent: receiving alerts, reviewing evaluation results, and making calls when something unexpected happens.

Finally, the architecture layer is where external expertise pays back fastest. The experience gap is real, and the decisions that benefit most from external expertise (cloud infrastructure design, security boundaries, evaluation pipeline architecture) are exactly the ones with the longest-term consequences.

The organisations that get agentic AI into reliable production in 2026 won't be the ones with the most sophisticated models. They'll be the ones that built the operational and governance infrastructure that lets those models run without constant human rescue.

