The Scars of Autonomy: Field Reports from Three Agentic Deployments

Thomas Prommer Technology Executive & CTO Connect on LinkedIn

Published: April 18, 2026

Updated: May 25, 2026

Three agents, three scars

Over the last eighteen months I have been close enough to three agentic deployments to see them break. The organizations were not similar: a mid-market fintech, a large enterprise software vendor, a global consumer retailer. The agents were not similar either: a customer-operations triage agent, an internal research agent with tool access, an inventory-reconciliation agent with write authority. The failures, when I look back at the post-mortems, were structurally identical.

In every case, someone could point to a policy document that, read strictly, should have prevented the incident. In every case, the policy was too coarse for an autonomous system to parse. In every case, the organization had moved from copilots to agents without moving from copilot-grade operational infrastructure to agent-grade operational infrastructure. I ended up writing the framework that now lives on ctaio.dev, the Agentic Readiness Index, because I could not find another way to explain what kept going wrong.

This piece is the inverse of that framework. It is the scars. Three stories, anonymized, each mapped to the lever that failed.

1. The fintech that refunded too much (Policy Granularity)

The fintech had a customer-ops agent that could triage support tickets, classify them by severity, and, for a specific subset of billing disputes, issue a refund up to a small threshold without human approval. Everything else escalated. The policy was explicit. The threshold was deliberately conservative. Internal counsel had signed off.

On a Tuesday in the spring of 2025, the agent issued forty-one refunds in sixteen minutes to the same customer. Each refund was individually under the threshold. Collectively they added up to a mid-five-figure number. The customer, who had spent the previous afternoon talking to three different human agents, had finally found the pattern that cooperated.

The policy was written for humans making refund decisions one at a time. Nothing in it said "also, don't issue forty of them in a row to the same person in the same hour." A human would have caught this; no human saw it. The agent did not catch it because the agent\'s operational memory was scoped to the ticket, not the customer.

We fixed it in two weeks. A per-customer rate limit on agent-issued refunds. A cross-ticket policy check that ran before any action. A post-mortem that all of customer operations read. The interesting part, though, is what it exposed: the company had an AI policy, and the AI policy was thorough, and the AI policy was written as if the agent were a well-intentioned human employee with a memory. Agents do not have a memory. Agents have the context window they were handed. Policy granularity is whether your rules account for that difference.

The lesson the leadership team took: policies written at the "employee handbook" level of abstraction cannot bind autonomous systems. Agent policies live one level lower, at the level of specific tools, specific operations, and specific aggregated-state checks that would be obvious to a human reviewing the behavior but invisible to the agent operating it.

2. The software vendor whose agent split-brained (Toolchain Interoperability)

The software vendor had an internal research agent. It could query their issue tracker, their codebase, and three external APIs. It was used heavily by support engineers to surface historical context on customer incidents. Latency was fine. Quality was good. Adoption was high.

Then, over a weekend, one of the external APIs, a vendor-operated knowledge base, deployed a breaking change to its response schema. The change was announced with ninety days of notice. The announcement went to the API contact on file, which was an engineer who had left the company eight months earlier, whose inbox was auto-forwarding to a distribution list that had been decommissioned in a separate consolidation. Nobody saw the notice.

Monday morning, the agent started returning half-valid responses. The new schema was a superset of the old one; the fields the agent used were still present, but some had shifted meaning. For forty-eight hours, support engineers were quietly given wrong answers, which they trusted because the agent had been accurate for months. A senior engineer caught it when two different colleagues showed her contradictory agent outputs for the same query.

The immediate fix was unglamorous: pin the schema version, add a contract test that ran hourly against the vendor API, and move the API contact to a distribution list with three owners. The structural fix took longer. They built a tool registry. Every agent declared the tools and tool versions it was tested against. A change-detection pipeline flagged vendor schema drift before it reached production. Tool definitions moved out of prompt strings and into versioned files in the repo.

The interesting forensic detail: until that weekend, the team believed their toolchain was robust because it was "just APIs." APIs are robust. Agents with tool access are a different thing. The agent does not know that a schema drifted; it reads the response and keeps going. The same infrastructure that was safe for humans reading API documentation was not safe for agents reading API responses. Toolchain interoperability is whether your tool layer is load-bearing or incidental, and the test is what happens when a vendor changes the ground under your feet.

3. The retailer whose escalation path led nowhere (Human-Agent Handoff)

The retailer had an inventory-reconciliation agent. It ran every morning across their warehouses, cross-checked on-hand counts against expected counts, and flagged discrepancies. High-confidence matches were auto-corrected; low-confidence cases escalated to a human reviewer. The agent was considered one of the company's AI success stories. Savings were real. The executive team was bullish.

In late 2025, over a long weekend, the agent escalated eighty-three cases. The escalation path was a Slack channel called #agent-reviews with thirty-two members and no named owner. Four of those members were on PTO. Six had muted the channel. The on-call engineer did not know the channel existed; it had been set up by a team that had since been restructured. By Tuesday morning, the eighty-three flagged items had been auto-resolved by the agent\'s timeout policy, the policy said "if no human has responded in forty-eight hours, apply the most likely correction and log it." Most of those corrections were fine. Seven of them, distributed across three warehouses, were wrong in ways that compounded with later transactions.

The problem was not the agent. The agent did what the policy said. The problem was the handoff, and specifically the gap between "we have an escalation path" and "the escalation path is a living thing exercised by real humans on a known schedule." The Slack channel had the word "review" in it. Nobody was reviewing.

The fix involved a genuine on-call rotation for the agent, separate from the general engineering on-call, because the skills required are different. Escalations route to a named human, not a channel. SLAs on time-to-human-decision. A monthly drill where someone in the ops team deliberately triggers an escalation during a moment when the primary approver is marked unavailable, to confirm that the secondary path works. The drill is the point. Handoffs that have never been tested are not handoffs. They are documents that describe handoffs.

I have seen variations of this story at four other organizations since. It is the most common, most underestimated, most expensive failure mode in agentic deployments. The handoff you have on paper is not the handoff you have in practice, and the only way to know is to exercise it.

The fourth lever that did not fail (and why that is also a finding)

I do not have a cost-escalation-trigger war story of my own from these three engagements. That is partly because all three organizations had cost alerting in place, coarse, lagging, but present, and partly because the incidents that did hit cost thresholds were caught before they became the story. The absence is informative. Of the four levers, cost is the one most likely to have some infrastructure, because finance has been watching cloud spend for a decade and agentic spend slots into that machinery. What is missing in most organizations is not the trigger but the latency of the trigger. The monthly invoice is not a trigger. The per-run budget enforced in code is.

I have seen two near-misses that did not make the main stories because they were averted in time: a research agent that hit a recursion and burned a month\'s budget in forty minutes, caught by an alert that fired at minute thirty-seven; and a data-processing agent that got stuck in a refinement loop over a malformed document, caught because a per-run token cap triggered a soft stop. In both cases, the cost trigger worked. It also worked by accident, neither organization had formally designed the protection. They had opinionated engineers who built it in anyway. Strong readiness is when that protection is specified, not improvised.

The pattern across all three

None of the three organizations had skipped AI governance. All three had AI policies. Two of the three had recently completed external AI readiness assessments. One was halfway through an EU AI Act compliance program. By the standards of the governance maturity model, two were at Level 3 and one was at Level 4. None of that prevented the incidents.

It did not prevent them because the governance maturity model measures the institutional scaffolding, policy, risk register, compliance mapping, board reporting, that an organization builds to manage AI as a category of activity. Agents are a specific class of AI system with specific operational requirements that institutional scaffolding does not address. You can have excellent governance and still not be able to run agents, in the same way you can have excellent financial controls and still not be able to run an active trading desk. The frameworks are adjacent, not redundant.

This is why the Agentic Readiness Index is scoped the way it is. Four operational levers. Not because other things do not matter, but because in the post-mortems of every agentic incident I have been close to, the cause mapped to one of four places. Policy was too coarse. Toolchain was incidental rather than load-bearing. Handoff was a document rather than a practice. Cost trigger fired after the fact.

What to do on Monday

If you have agents in production, or you have decided to put agents in production in the next two quarters, the useful thing to do is run the diagnostic in the readiness index honestly. The trap is to score generously, to mark "partial" for things that are "aspirational" and "yes" for things that exist as documents. The diagnostic is only useful if the score reflects what actually happens, not what is written down.

The second useful thing, if the score surfaces a gap you cannot close with internal resources in a quarter, is to name that fact out loud to leadership. Most agentic failures I have seen are not failures of capability. They are failures of nerve, the moment where a CTO or CAIO saw a gap, calculated that raising it would slow a roadmap the board was expecting, and chose to hope. Hoping is a reasonable strategy for most things. It is a poor strategy for autonomous systems that can take actions on their own at machine speed.

If the gap is wider than an internal team can close, the ctaio.dev side of my work offers two escalation paths. The 30-day AI Readiness Audit runs a version of this diagnostic at enterprise depth, stakeholder interviews, architecture review, a board-ready report. A fractional engagement is the right option when the remediation is multi-quarter and the team needs embedded senior leadership to run it. Either path solves a different problem than this essay does. This essay exists mostly so that anyone reading it can, before they do either, understand why the framework is shaped the way it is.

The Scars of Autonomy: Field Reports from Three Agentic Deployments

Three agents, three scars

1. The fintech that refunded too much (Policy Granularity)

2. The software vendor whose agent split-brained (Toolchain Interoperability)

3. The retailer whose escalation path led nowhere (Human-Agent Handoff)

The fourth lever that did not fail (and why that is also a finding)

The pattern across all three

What to do on Monday

Further reading

Need Expert Technology Guidance?

Three agents, three scars

1. The fintech that refunded too much (Policy Granularity)

2. The software vendor whose agent split-brained (Toolchain Interoperability)

3. The retailer whose escalation path led nowhere (Human-Agent Handoff)

The fourth lever that did not fail (and why that is also a finding)

The pattern across all three

What to do on Monday

Further reading

Need Expert Technology Guidance?

Continue Reading

Tech meets endurance