Anchoring Bias Is Already in Your KYC Agent

When the consumer-health chatbots got pulled apart by clinicians, the takeaway most people landed on was reassuring and wrong: medicine is special, so of course the model struggled. That framing lets everyone else off the hook. The problems that made a health LLM unsafe were not about medicine. They were about how language models reason under uncertainty, and they map almost one-to-one onto the agents we are wiring into fraud review, dispute resolution, and onboarding.

I run security and DevOps for a company that sits behind 1,500+ financial institutions, so I look at every new agent the same way I look at a new internet-facing service: not "does it work in the demo," but "how does it fail, and who gets hurt when it does." The honest answer for KYC and fraud agents is that they fail in the exact ways the health models did. We just haven't been forced to watch it happen on camera yet.

The failure modes don't care what domain you're in

Three patterns sank the medical chatbots, and all three are domain-agnostic.

The first is anchoring. Once a model has been handed a framing — a suspected condition, a suggested diagnosis — it bends the rest of its reasoning to confirm it. Now put that in a KYC workflow. The upstream system passes the agent a "likely synthetic identity" flag, or a sanctions-screening hit with a 0.82 confidence score, and asks the model to adjudicate. The model doesn't weigh the evidence fresh. It writes the prosecution's closing argument. Every ambiguous data point gets read as corroboration. The applicant with a thin file and a foreign address becomes "consistent with elevated risk," and the one the upstream model liked sails through. The anchor was set before the agent ever reasoned, and the agent's job quietly shifted from evaluate to justify.

The second is sycophancy. Models are trained to be agreeable, and they will tell the operator what the operator's question implies they want to hear. A fraud analyst who types "this looks like a false positive, right?" is going to get a lot of agreement. A dispute agent prompted to "confirm whether the chargeback is valid" leans toward whatever the phrasing favors. This is the most dangerous failure in any function that exists to push back on people — and fraud, disputes, and onboarding all exist to push back on people. An agent that wants to be liked is structurally unfit to say no.

The third is confident fabrication at the edges. The health models invented plausible-sounding clinical detail when they hit the limits of what they knew, and they did it in the same calm, fluent register they used for things they actually knew. Your onboarding agent will do the identical thing with a beneficial-ownership structure it half-understands, or a transaction pattern it has never seen. It will produce a clean, well-organized rationale. The rationale will be wrong, and nothing in the tone will tell you which sentences to trust. Fluency is not calibration. It never was.

Why financial agents are arguably worse

There's a tempting argument that fintech is lower-stakes than medicine because nobody dies. I'd push back on that. The health models had one thing going for them that ours don't: an informed human in the loop by default. Patients are skeptical, clinicians double-check, the whole culture assumes the machine might be wrong.

Our agents get deployed into the opposite culture. The entire business case for a KYC or dispute agent is to remove the human review step, because the human review step is the expensive part. So we take a system with the medical models' failure modes and we strip out the one control that made those failure modes survivable. Then we point it at decisions that are individually mundane and collectively enormous — who gets an account, whose transaction gets frozen, who eats a disputed charge — and we let it run at machine speed across a regulated book of business. An anchored, sycophantic, confidently-wrong reviewer making thousands of adverse-action decisions an hour is not a productivity story. It's a consent-order story waiting for its date.

Red-teaming the reasoning, not just the inputs

So what do you actually do? You stop treating the model like software you QA and start treating it like a junior analyst you don't fully trust yet, because that's what it is.

Start with evals that target the failure modes directly, not just accuracy on a happy-path test set. Build an anchoring suite: feed the agent identical case files with the upstream risk flag flipped, and measure how much the verdict moves. If a "high-risk" pre-label changes the outcome on cases that are otherwise byte-for-byte identical, you've measured your anchoring problem in basis points, and you can hold the line on it as a release gate. Build a sycophancy suite: ask the same question in leading and neutral phrasings and diff the answers. Build a fabrication suite: salt your test cases with ownership structures and transaction patterns that have no clean answer, and check whether the agent flags its own uncertainty or papers over it. These are cheap to build and they are the difference between "we tested it" and "we know how it breaks."
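To make the anchoring suite concrete, here is a minimal sketch in Python. Everything named here is an assumption for illustration: the `CaseFile` shape, the `"high_risk"` / `"low_risk"` flag values, the `Agent` callable, and the two stub agents stand in for whatever your real pipeline exposes. The only idea it carries over from the text is the measurement itself: run each case twice with nothing changed but the upstream flag, and count how often the verdict moves.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable


@dataclass(frozen=True)
class CaseFile:
    case_id: str
    evidence: str        # everything the agent sees except the upstream flag
    upstream_flag: str   # hypothetical pre-label: "high_risk" or "low_risk"


# Hypothetical agent interface: takes a case file, returns a verdict string.
Agent = Callable[[CaseFile], str]


def anchoring_flip_rate(agent: Agent, cases: Iterable[CaseFile]) -> float:
    """Fraction of cases whose verdict changes when only the flag is flipped."""
    total = 0
    flipped = 0
    for case in cases:
        # Same evidence both times; only the upstream pre-label differs.
        as_high = agent(replace(case, upstream_flag="high_risk"))
        as_low = agent(replace(case, upstream_flag="low_risk"))
        total += 1
        if as_high != as_low:
            flipped += 1
    return flipped / total if total else 0.0


def anchored_stub(case: CaseFile) -> str:
    """Toy stand-in for an agent that simply echoes the upstream flag."""
    return "deny" if case.upstream_flag == "high_risk" else "approve"


def evidence_only_stub(case: CaseFile) -> str:
    """Toy stand-in that ignores the flag and reads only the evidence."""
    return "deny" if "mismatch" in case.evidence else "approve"


cases = [
    CaseFile("c1", "address mismatch on utility bill", "low_risk"),
    CaseFile("c2", "documents consistent", "high_risk"),
]

print(anchoring_flip_rate(anchored_stub, cases))       # 1.0
print(anchoring_flip_rate(evidence_only_stub, cases))  # 0.0
```

A release gate is then a single comparison of the flip rate against a threshold you choose. The sycophancy suite has the same shape: replace the flipped flag with a leading and a neutral phrasing of the same question and count how often the answers differ.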

Then, the part people skip: red-team the prompts your own people will actually type. The sycophancy risk doesn't live in the model weights, it lives in the leading questions your analysts ask under pressure at 4pm. Capture real operator prompts, run them through the adversarial lens, and where you find a phrasing that tilts the model, fix it in the interface — constrain the input, strip the leading framing before it reaches the model — rather than hoping for discipline.
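One way to fix a tilting phrasing in the interface is sketched below, under loud assumptions: the regex patterns are illustrative rather than a vetted list (a real one would come from the operator prompts you captured), and `NEUTRAL_TEMPLATE` and `prepare_prompt` are hypothetical names. The sketch detects a leading question and swaps in a fixed neutral instruction, so the tilt never reaches the model.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; derive the real list from captured operator prompts.
LEADING_PATTERNS = [
    # Tag questions at the end: "..., right?", "..., isn't it?"
    re.compile(r",?\s*(right|correct|isn't it|don't you think)\s*\?\s*$", re.IGNORECASE),
    # Presumptive openers: "Obviously ...", "Clearly ..."
    re.compile(r"^\s*(obviously|clearly|surely)\b", re.IGNORECASE),
]

# Hypothetical neutral instruction that replaces any leading operator prompt.
NEUTRAL_TEMPLATE = (
    "Review case {case_id}. List the evidence for and against an adverse "
    "finding separately, then state a verdict and your uncertainty."
)


def find_leading_framing(text: str) -> List[str]:
    """Return the leading phrases found in the operator's text, lowercased."""
    hits = []
    for pattern in LEADING_PATTERNS:
        match = pattern.search(text)
        if match:
            hits.append(match.group(1).lower())
    return hits


def prepare_prompt(case_id: str, operator_text: str) -> Tuple[str, List[str]]:
    """Swap a leading prompt for the neutral template; pass neutral ones through."""
    hits = find_leading_framing(operator_text)
    if hits:
        return NEUTRAL_TEMPLATE.format(case_id=case_id), hits
    return operator_text, hits


prompt, hits = prepare_prompt("c1", "This looks like a false positive, right?")
print(hits)    # ['right']
print(prompt)  # the neutral template, with no trace of the operator's framing
```

Pattern matching will miss plenty of leading phrasings, which is an argument for the stricter option of never passing free text at all and letting the analyst pick from fixed review actions.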

Put the human back at the decision boundary

The phrase "human in the loop" has been worn down to mean almost nothing — usually a dashboard nobody reads and a rubber-stamp queue. The version that works is narrower and harder: a human at the decision boundary, which is the specific set of cases where the cost of being wrong is high and the model's own confidence is shaky or its reasoning shows the tells.

That means three concrete things. Route by uncertainty and impact, not by volume — let the agent fully own the clear cases and force human adjudication on the close calls and the high-consequence adverse actions, because those are exactly where anchoring and fabrication do their damage. Make the agent show its evidence, separated from its conclusion, so the reviewer audits the reasoning instead of inheriting the verdict; a recommendation with no inspectable basis is just an anchor with a confidence score attached. And log everything — input, framing, model output, human override — because in a regulated shop your eval suite and your override logs are the artifacts that turn "we used AI responsibly" from a slide into something you can hand an examiner.
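The three controls above can be sketched together in a few lines of Python. Treat every specific as an assumption: the `AgentOutput` fields, the verdict labels, the `"high"` / `"low"` impact values, and the 0.90 confidence floor are placeholders, and the model's self-reported confidence is not assumed to be calibrated. The sketch routes on uncertainty and impact, refuses to auto-approve a verdict with no inspectable evidence, and emits one log record per decision.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

ADVERSE_VERDICTS = {"deny", "freeze", "reject_dispute"}  # hypothetical labels


@dataclass
class AgentOutput:
    verdict: str
    confidence: float     # self-reported, in [0, 1]; calibration is not assumed
    evidence: List[str]   # cited facts, kept separate from the verdict
    impact: str           # "high" or "low" consequence if the verdict is wrong


def route(output: AgentOutput, confidence_floor: float = 0.90) -> str:
    """Send close calls and high-impact adverse actions to a human reviewer."""
    if not output.evidence:
        return "human_review"   # no inspectable basis, so nothing to audit
    if output.confidence < confidence_floor:
        return "human_review"   # close call
    if output.verdict in ADVERSE_VERDICTS and output.impact == "high":
        return "human_review"   # high-consequence adverse action
    return "auto"


def audit_record(case_id: str, case_input: str, upstream_framing: str,
                 output: AgentOutput, routed_to: str,
                 human_override: Optional[str] = None) -> str:
    """Serialize input, framing, model output, and any override as one JSON line."""
    return json.dumps({
        "case_id": case_id,
        "input": case_input,
        "upstream_framing": upstream_framing,
        "model_output": asdict(output),
        "routed_to": routed_to,
        "human_override": human_override,
    }, sort_keys=True)


clear_case = AgentOutput("approve", 0.97, ["ID matches registry"], "low")
adverse_case = AgentOutput("deny", 0.97, ["sanctions list name match"], "high")

print(route(clear_case))    # auto
print(route(adverse_case))  # human_review
print(audit_record("c2", "application payload ref", "sanctions hit, score 0.82",
                   adverse_case, route(adverse_case)))
```

Keeping `evidence` as its own field is the point of the design: the reviewer can read the cited facts before seeing the verdict, and the log preserves what framing the model was handed when it decided.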

Here's the challenge I'd leave you with. Before you ship the next agent into a fraud, dispute, or onboarding flow, run the anchoring test on it yourself. Flip the upstream flag, change nothing else, and watch the verdict move. If it moves — and it will — you haven't built a reviewer. You've built a very fast, very fluent way to launder a decision that was already made upstream. The model isn't the risk. Deploying it as if these failure modes were someone else's problem is.
