There's a pattern I see in almost every early agent deployment, and it makes me nervous every time. The safety story lives in the prompt. Somewhere in a system message there's a paragraph that says, in effect, "You are a careful, responsible assistant. Do not move money you shouldn't. Never take destructive actions without confirmation. Always follow the rules." And then the team ships it, points at that paragraph, and calls it a control.
It isn't a control. It's a wish. You have written a strongly worded request to a probabilistic system that is under no obligation to comply, and you've made your customers' money the stakes of whether it feels like cooperating this time.
I run security and DevOps at a company that serves more than 1,500 financial institutions. In that world, "the model usually behaves" is not an acceptable design point. The whole discipline of fintech security is built on a single assumption: the thing holding the credential will eventually do the wrong thing — through compromise, through bug, through confusion, through a cleverly worded input — and your job is to make sure that when it does, the damage is small and recoverable. Agents don't change that assumption. They just add a new actor who is fast, tireless, persuadable, and very confident.
Intent is not a security boundary
The mental shift I want every architect to make is this: model intent is not a security boundary, and it never will be. A boundary is something that holds regardless of what the actor on the other side wants. A firewall rule doesn't care how badly a packet wants through. An IAM policy doesn't get talked into anything. That indifference is the entire point.
An agent's alignment, by contrast, is a property of its inputs — and its inputs include the open internet, retrieved documents, tool outputs, and whatever a user types. We already have a name for the discipline of getting a system to do something its operator didn't intend by feeding it crafted input: we call it injection, and we've spent thirty years learning we cannot prompt our way out of it. Prompt injection is just the same lesson wearing new clothes. You don't beat SQL injection with a comment in the query asking attackers to please be nice. You beat it with parameterized queries that make the dangerous thing structurally impossible. Agents need the structural equivalent.
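To make the SQL analogy concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, the rows, and the hostile input are all made up for illustration. The first query splices input into the query text, so the input can rewrite the query; the second passes it as a parameter, so it can only ever be data.

```python
import sqlite3

# Illustrative in-memory table; names and balances are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 250)]
)

hostile = "alice' OR '1'='1"

# Unsafe: the input becomes part of the SQL text and changes its meaning.
unsafe_rows = conn.execute(
    f"SELECT owner FROM accounts WHERE owner = '{hostile}'"
).fetchall()

# Safe: the input is bound as a value and never parsed as SQL.
safe_rows = conn.execute(
    "SELECT owner FROM accounts WHERE owner = ?", (hostile,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

Nothing in the safe version asks the input to behave. The structure of the call removes the option.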
So the question I ask of any agent design is not "how do we make it want to behave?" It's "what can this thing actually do when it's wrong — and is that survivable?" That's the blast radius. Engineer that, and the model's mood stops mattering.
Four controls that hold when the model misaligns
Here's the concrete part, the layered defense I'd insist on before any agent touches a money-movement path. None of these depend on the model cooperating.
Least-privilege IAM, scoped to the task and nothing more. An agent gets its own identity — never a shared service account, never a human's credentials, never a long-lived key sitting in an environment variable. It gets short-lived, scoped credentials issued per task, the way you'd hand a contractor a badge that opens one door for one afternoon. On AWS that's the well-worn vocabulary of narrow IAM roles, scoped-down session policies on AssumeRole, resource and condition constraints, and permissions boundaries that cap what a role can ever do even if someone later tries to widen it. The agent that summarizes transactions has no path to initiate one. That's not a setting you toggle later; it's the shape of the system. And it means every action the agent takes is attributable to a distinct principal, which is the difference between an incident you can reconstruct and one you can only apologize for.
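As a rough sketch of what "scoped per task" can look like, the helper below builds the parameters for an STS AssumeRole call with an inline session policy. RoleArn, RoleSessionName, Policy, and DurationSeconds are real AssumeRole parameters; the function name, the ARNs, the task ID, and the choice of DynamoDB read actions are my assumptions for illustration. It only builds the request, so it runs without AWS access.

```python
import json

# 900 seconds is the minimum DurationSeconds that STS AssumeRole accepts.
MIN_SESSION_SECONDS = 900


def build_scoped_assume_role_request(
    role_arn, task_id, allowed_actions, resource_arns,
    duration_seconds=MIN_SESSION_SECONDS,
):
    """Build AssumeRole parameters with a scoped-down session policy.

    Hypothetical helper. The session's effective permissions are the
    intersection of the role's own policies and this session policy.
    """
    if not allowed_actions or not resource_arns:
        raise ValueError("refusing to build an unscoped session")
    if any("*" in action for action in allowed_actions):
        raise ValueError("wildcard actions are not allowed for agent sessions")

    session_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(allowed_actions),
                "Resource": sorted(resource_arns),
            }
        ],
    }
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"agent-{task_id}",  # distinct principal per task
        "Policy": json.dumps(session_policy),
        "DurationSeconds": duration_seconds,
    }


# Example values below are invented for illustration.
request = build_scoped_assume_role_request(
    role_arn="arn:aws:iam::123456789012:role/txn-summary-agent",
    task_id="task-42",
    allowed_actions=["dynamodb:Query", "dynamodb:GetItem"],
    resource_arns=["arn:aws:dynamodb:us-east-1:123456789012:table/transactions"],
)
# Assuming boto3 is in use, this would be passed along as:
#   boto3.client("sts").assume_role(**request)
```

The summarizing agent's session carries read actions on one table and nothing else, and the session name ties every call back to a specific task.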
Transaction caps the agent cannot raise. Velocity limits, per-transaction ceilings, daily aggregate limits — enforced on the server side, in the system of record, completely outside the agent's reach. The agent can request; the ledger decides. If a compromised or confused agent tries to push a thousand transfers, the cap is what turns a catastrophe into a rate-limited annoyance and an alert. The cardinal sin is putting the limit in the prompt ("never transfer more than X"). That's not a cap. That's a suggestion the next clever input will overwrite.
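A minimal sketch of "the agent can request; the ledger decides" might look like the following. The class names, limit values, and reason strings are hypothetical, and the daily reset and persistence are left out to keep it short. What matters is that the limits live in the ledger object, and the agent-facing method offers no way to change them.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class TransferLimits:
    """Limits owned by the system of record (hypothetical values)."""
    per_transaction: Decimal
    daily_aggregate: Decimal
    max_transfers_per_day: int


class Ledger:
    """Server-side gatekeeper; the agent only ever calls request_transfer."""

    def __init__(self, limits):
        self._limits = limits
        self._total_today = Decimal("0")
        self._count_today = 0

    def request_transfer(self, amount):
        """Return 'accepted' or the name of the cap that blocked the request."""
        if amount <= 0:
            return "invalid_amount"
        if amount > self._limits.per_transaction:
            return "per_transaction_cap"
        if self._count_today + 1 > self._limits.max_transfers_per_day:
            return "velocity_cap"
        if self._total_today + amount > self._limits.daily_aggregate:
            return "daily_aggregate_cap"
        # Only accepted transfers count against the running totals.
        self._total_today += amount
        self._count_today += 1
        return "accepted"


ledger = Ledger(TransferLimits(Decimal("500"), Decimal("1200"), 3))
decisions = [
    ledger.request_transfer(Decimal(a))
    for a in ("600", "500", "500", "500", "100", "50")
]
print(decisions)
```

In a real deployment each rejection would also raise an alert, since a burst of blocked requests is exactly the signal you want to see early.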
Dual control for anything that matters. Segregation of duties is one of the oldest ideas in financial controls, and it maps onto agents beautifully: the agent proposes, a second independent party disposes. That second party can be a human approving above a threshold, or a separate policy-enforcement service with its own logic and its own identity — the point is independence. One actor should never be able to both initiate and approve a consequential action, and "one actor" includes your agent. If a single prompt can carry an action from idea to execution with nothing in between, you've built a single point of failure and handed it to the least predictable component in the stack.
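Here is one way the propose/dispose split could be sketched. The threshold, the principal names, and the function names are assumptions for illustration, not a prescribed design. The rule the code enforces is the one above: above the threshold, nothing executes until a principal other than the proposer has approved.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

# Hypothetical threshold above which a second party must approve.
APPROVAL_THRESHOLD = Decimal("1000")


class DualControlError(Exception):
    """Raised when one principal tries to both propose and approve."""


@dataclass
class Proposal:
    proposer: str            # identity of the agent that proposed the action
    amount: Decimal
    approver: Optional[str] = None


def approve(proposal, approver):
    """Record an approval, rejecting self-approval outright."""
    if approver == proposal.proposer:
        raise DualControlError("proposer cannot approve its own action")
    proposal.approver = approver


def can_execute(proposal):
    """Small actions pass; large ones need an independent approver."""
    if proposal.amount < APPROVAL_THRESHOLD:
        return True
    return proposal.approver is not None and proposal.approver != proposal.proposer


large = Proposal(proposer="agent:payments-bot", amount=Decimal("5000"))
print(can_execute(large))   # no approver yet
approve(large, "human:ops-lead")
print(can_execute(large))   # independent approval recorded
```

The approver here could just as well be a policy service with its own identity. The check only cares that it is not the proposer.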
Irreversible-action gating. I sort every action an agent can take into reversible and irreversible, and I treat that line as sacred. Reversible actions — drafting, staging, reading, queuing — can run with light friction; mistakes there cost a retry. Irreversible ones — settling a payment, deleting records, sending external communications, provisioning that costs real money — go behind hard gates: explicit confirmation, a holding window, idempotency keys so a retry storm doesn't fire twenty times, and wherever possible a design that makes the action reversible in the first place. Prefer soft-deletes over hard deletes. Prefer staged transactions with a settlement step over immediate execution. The more you can move from the irreversible column to the reversible one, the smaller the part of your system where you have to be perfect.
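The gate itself can be sketched in a few lines. The action names, the class, and the return strings below are hypothetical, and the holding window is omitted for brevity. The sketch shows two of the mechanisms named above: irreversible actions are held until explicitly confirmed, and an idempotency key ensures a retry storm executes the action once.

```python
from typing import Callable, Dict

# Hypothetical set of actions treated as irreversible.
IRREVERSIBLE = {"settle_payment", "hard_delete", "send_external_email"}


class ActionGate:
    """Holds irreversible actions for confirmation and dedupes retries."""

    def __init__(self):
        self._results: Dict[str, str] = {}

    def run(self, action: str, idempotency_key: str,
            execute: Callable[[], str], confirmed: bool = False) -> str:
        # A key that already executed returns its stored result; no re-run.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        # Irreversible actions do not execute without explicit confirmation.
        if action in IRREVERSIBLE and not confirmed:
            return "held_for_confirmation"
        result = execute()
        self._results[idempotency_key] = result
        return result


calls = []


def settle():
    calls.append("settled")
    return "settled"


gate = ActionGate()
first = gate.run("settle_payment", "pay-001", settle)           # held
retries = [
    gate.run("settle_payment", "pay-001", settle, confirmed=True)
    for _ in range(20)
]
draft = gate.run("draft_summary", "draft-001", lambda: "drafted")
print(first, len(calls), draft)
```

Twenty confirmed retries with the same key produce one settlement, and the reversible draft action passes straight through.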
This is platform work, and it's an advantage
Notice what all four have in common: they live in infrastructure, not in the model. They're IAM, API gateways, ledger logic, approval workflows, audit trails. This is exactly the boring, durable plumbing that security and platform teams have built for decades — which is the good news. We are not starting over. The agent era doesn't demand a new security religion; it demands that we apply the controls we already know how to build to a faster, more autonomous caller, and that we stop being talked out of them by demos that look magical.
And here's the reframe I'll leave you with. Treating the model as untrusted isn't pessimism that slows you down; it's the thing that lets you move. The teams that will deploy agents aggressively in regulated environments aren't the ones with the most cleverly worded prompts. They're the ones who can stand in front of an examiner, or a partner's due-diligence team, and show that even a fully compromised agent can't exceed its caps, can't move money without a second set of eyes, and can't take an irreversible action without passing a hard gate. When the floor is solid, you can give the agent more rope, because you've already engineered where the rope ends.
So here's the challenge. Go find the most autonomous agent in your environment and ask one question: if it were fully adversarial right now — not buggy, adversarial — what's the worst thing it could do before anything outside the model stopped it? If your honest answer is "I'd have to trust that it wouldn't," you don't have a safety strategy. You have a prompt. Go engineer the blast radius instead.
