When Three Tools Report Three Token Counts, You Can't Attest to Any of Them

Here's a small thing that turned into a big thing. We were reconciling AI spend across a handful of teams, and I asked what should have been a trivial question: how many tokens did we actually consume last month? I got three answers. The coding assistant's dashboard said one number. The model provider's console said another. Our own gateway logs said a third. None of them agreed, and the gaps weren't rounding errors — they were the kind of spread that makes a finance partner narrow their eyes.

The instinct in the room was to argue about which tool was right. That's the wrong instinct, and it's the one I want to talk you out of. The problem isn't that one of the three is wrong. The problem is that you have three sources of truth for the same fact, which means you have zero. When three tools report three token counts, you can't attest to any of them. And in a business built on trust, attestation is the whole game.

Token counts are not a billing detail. They're a control.

It's tempting to file token discrepancies under "FinOps housekeeping" — annoying, but cosmetic. It isn't, and here's the mechanic that makes it serious. A token count is a proxy for a stream of facts you are obligated to know about your own data. How much did this cost? Which model processed it? Where did the inference physically run? Did any of those bytes leave the boundary you promised a customer, a regulator, or an auditor that they would never leave?

Every one of those questions resolves to the same underlying event: a request went to a model, carried some payload, and came back. The token count is just the most visible attribute of that event. If you can't reconcile the count, it means you don't have a single authoritative record of the event itself. And if you don't have an authoritative record of the event, you cannot answer the questions that actually matter — the ones about cost, data residency, and scope — with anything better than a shrug and a screenshot.

I serve a business that sits behind 1,500+ financial institutions. "A shrug and a screenshot" is not an acceptable answer to any of them. It's certainly not an acceptable answer to their examiners.

The three failure modes hiding behind one number

Walk the disagreement out to its consequences and you find three separate liabilities, each wearing the same costume.

Cost you can't defend. This is the obvious one, and it's the least dangerous. If the coding tool, the provider, and your gateway disagree on volume, your chargeback model is fiction. You're allocating AI spend to product lines based on whichever number was easiest to export. That's survivable internally. It stops being survivable the moment AI cost becomes a line item a board or a customer wants explained, because "we think it's roughly this" is not a sentence a CFO can repeat under questioning.

Data residency you can't prove. This is the one that should keep you up at night. Each of those tools may route differently — a coding assistant calling its own backend, a model SDK hitting a default region, your gateway pinning traffic to a specific endpoint. When the counts diverge, it's frequently because traffic took paths you didn't fully model. Public cloud gives you genuinely strong residency controls — regional endpoints, VPC-private connectivity to managed model services, the ability to keep inference inside a boundary you choose. But those controls only mean something if all of your model traffic flows through the layer where they're enforced. If a developer's tool is quietly calling a model over a path your gateway never sees, your residency story has a hole in it that you can't even measure, let alone close.

PCI scope you can't bound. Scope is determined by where regulated data can flow, not where you intend it to flow. An AI call that reaches a model endpoint your metering layer doesn't observe is, by definition, a data flow you haven't accounted for. If you can't enumerate every model call and prove what it carried, you cannot credibly draw the line around your cardholder data environment. Three token counts is a leading indicator that your scope diagram is aspirational. An assessor will find that hole faster than your finance team did.

Build the metering layer first, then plug everything into it

The fix is architectural, and it's the same move security and platform teams have made for every other class of traffic we eventually had to govern. You don't trust each endpoint to self-report. You force the traffic through a chokepoint you own, and you make that chokepoint the single authoritative record.

For model usage that means one metering layer — call it an AI gateway, a proxy, a broker, whatever fits your stack — that every model call traverses. Codex, Claude, the homegrown agent, the analyst's notebook, the batch job: all of it egresses through the same door. That door is where you count tokens, stamp the model and region, attach the requesting identity and the data classification, and write an immutable log. The vendors' own dashboards become useful sanity checks, not sources of truth. When they disagree with your layer, that disagreement is now signal — it tells you a path is escaping your control — instead of an unresolvable mystery.

The cultural shift this requires is the harder part. Engineers reach for the most direct integration: the SDK that talks straight to the provider, the IDE plugin with its own auth. Every one of those direct paths is a hole in your metering layer, and every hole is a token count you can't reconcile later. So the rule has to be boring and absolute: there is one sanctioned way to call a model, it goes through the gateway, and direct provider calls are an exception that requires a reason — not a default a developer picks for convenience. Make the governed path the easy path. Give people a clean SDK wrapper, fast endpoints, and good defaults, so the compliant route is also the lowest-friction route. You don't win this with policy memos. You win it by making the right door the most convenient door.

The standard is attestation, not approximation

I keep coming back to a distinction I use for the rest of the security program: the difference between claiming something and being able to prove it under questioning. We can claim our AI usage is metered, regionally contained, and out of PCI scope. The three-token-count problem is the universe telling us we can't yet prove it. A claim you can't defend in a due-diligence session or an audit isn't a control. It's a liability with good intentions.

So here's the challenge I'd put to any CIO, CISO, or platform lead standing up AI right now. Before you celebrate the productivity wins — and they're real — ask your team the boring question I asked mine: how many tokens did we consume last month, and can we prove it from one authoritative source? If you get one number with a clean lineage back to a single metering layer, you've built the foundation for everything else: cost you can defend, residency you can prove, scope you can bound. If you get three numbers and an argument, you don't have a reporting discrepancy. You have an ungoverned data flow wearing a billing problem as a disguise — and the time to build the metering layer was before the first model call, which means the second-best time is now.

AI GovernanceFintechDevOpsCloud Security

Case Studies & Practice

Open Source