We spent a decade learning cloud FinOps. We are repeating every mistake with LLM spend, and the meter runs faster.
A team I know burned through a quarter's model budget in the first eleven days of the quarter. Nobody was being reckless. A retrieval feature shipped, an agent started looping on a class of documents nobody had tested at scale, and every retry called the most expensive model in the catalog. The dashboard that would have caught it did not exist, because for LLM spend almost nobody has built one yet.
That is the whole problem in one sentence. We spent the last decade learning cloud FinOps. Tag everything. Attribute cost to a team. Set a budget. Alert before you blow it. Review weekly. Then generative AI arrived and we threw all of it out the window. The FinOps Foundation named AI cost management the most in-demand skill for 2026 and reports that about 98% of organizations are now managing AI spend. "Managing" is generous. Most are receiving an invoice and reacting to it.
Here is the uncomfortable part. AI spend is harder to govern than EC2, not easier. A virtual machine has a predictable hourly rate. A model call has a cost that depends on input tokens, output tokens, the model you happened to route to, whether you cached the context, and how many times an agent decided to retry. The unit of spend is the request, and requests are generated by code, increasingly by autonomous agents that make their own decisions. You cannot manage what you cannot see at that granularity, and until very recently you could not see it at all.
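To make that concrete, here is a minimal sketch of how the cost of one logical request could be computed from those inputs. The `Price` fields and the numbers used with them are hypothetical placeholders, not any vendor's real rates, and I am assuming that every retry is billed as a full call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Hypothetical USD prices per million tokens; not real vendor rates."""
    input_per_mtok: float
    output_per_mtok: float
    cached_input_per_mtok: float


def request_cost(price: Price, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0, attempts: int = 1) -> float:
    """Cost in USD of one logical request, counting each retry as a full call."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    per_attempt = (
        fresh_tokens * price.input_per_mtok
        + cached_tokens * price.cached_input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000
    return per_attempt * attempts


# Same prompt, same model: one clean call versus an agent that retried three times.
frontier = Price(input_per_mtok=10.0, output_per_mtok=30.0, cached_input_per_mtok=1.0)
single = request_cost(frontier, input_tokens=2000, output_tokens=500, cached_tokens=1000)
looped = request_cost(frontier, input_tokens=2000, output_tokens=500,
                      cached_tokens=1000, attempts=3)
```

The point of the sketch is that retries multiply everything else, which is why a looping agent on an expensive model can outrun a budget so quickly.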
Start by metering the request, not the bill
The first move is to get cost attribution down to the individual call. Without it, every other control is guesswork.
The platforms finally caught up in May. Amazon Bedrock added request-level usage attribution on May 20, 2026, which means you can tag an inference profile and get usage broken out per request rather than as one undifferentiated monthly total. Microsoft, which renamed Azure AI Foundry to Microsoft Foundry effective January 1, 2026, shipped project-level cost attribution on May 31. Use these. They are the equivalent of turning on detailed billing and cost allocation tags in your first month on AWS. You would never run production cloud without them, and you should not run production AI without their equivalent.
But platform attribution alone is not enough, because your real cost driver is usually your own application logic. The durable place to meter is your gateway. Every team serving LLMs at any scale should be routing calls through a single internal gateway or proxy, not letting forty services hold their own API keys and call providers directly. The gateway is where you stamp each request with the metadata that matters: which team, which feature, which agent, which model, input and output token counts, and a derived cost. Emit that as a structured event into your existing observability pipeline. Now you have a per-request cost record you own, independent of any one vendor's billing format, and you can join it to the rest of your telemetry.
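As a rough sketch of what that stamping could look like, the snippet below builds one structured usage event per request and serializes it as a JSON line. The `UsageEvent` schema, the `PRICES` table, and the model names are all assumptions for illustration. A real gateway would take token counts from the provider's response and ship the event to its own telemetry pipeline rather than printing it.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass

# Hypothetical USD prices per million tokens: (input, output). Not real rates.
PRICES = {"small-model": (0.5, 1.5), "large-model": (10.0, 30.0)}


@dataclass
class UsageEvent:
    request_id: str
    timestamp: float
    team: str
    feature: str
    agent: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def build_event(team: str, feature: str, agent: str, model: str,
                input_tokens: int, output_tokens: int) -> UsageEvent:
    """Stamp one request with attribution metadata and a derived cost."""
    in_price, out_price = PRICES[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return UsageEvent(
        request_id=str(uuid.uuid4()),
        timestamp=time.time(),
        team=team,
        feature=feature,
        agent=agent,
        model=model,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=round(cost, 6),
    )


def emit(event: UsageEvent) -> str:
    """Serialize the event as one JSON line; stdout stands in for the pipeline."""
    line = json.dumps(asdict(event), sort_keys=True)
    print(line)
    return line


event = build_event("search", "retrieval", "doc-agent", "large-model", 1000, 500)
emit(event)
```

Keeping the event flat and vendor-neutral is what makes it easy to join against the rest of your telemetry later.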
One caution while you are wiring this up. Apply the same secret hygiene here that you apply everywhere else. GitGuardian found 1,275,105 AI secrets sitting in public GitHub repositories in 2025, up 81% year over year. A gateway with centralized, rotatable keys is also how you stop forty copies of a provider key from leaking into forty repos.
Route to the minimum effective intelligence
Once you can see cost per request, the single highest-leverage lever is routing. The principle I use is minimum effective intelligence: send each task to the cheapest model that still produces an accepted result, and only escalate when the cheap model fails an explicit quality check.
Most teams do the opposite. They pick the strongest model in the catalog, point everything at it, and never revisit the decision. That is the AI equivalent of running every workload on your largest instance type because it is simpler. It works, and it is enormously wasteful.
Concretely, tier your traffic. Classification, extraction, short rewrites, and routing decisions almost never need a frontier model. Reserve the expensive models for genuinely hard reasoning, and let a cheaper model handle the long tail. Newer model controls help here. Claude Opus 4.8, which shipped on May 28, 2026, added a user-selectable effort control, so you can dial reasoning depth down for tasks that do not need it instead of paying for maximum effort on every call. Build an evaluation harness that measures whether the cheaper path actually passes, then escalate on failure. The escalation logic lives in the same gateway that does your metering, which is exactly why the gateway is the right place to invest.
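Here is a minimal sketch of that escalate-on-failure logic, assuming tiers are ordered cheapest first and that you supply your own quality check. The `route` function, the `RouteResult` type, and the stub models are hypothetical names for illustration. In practice the `accept` callback would be backed by your evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

ModelCall = Callable[[str], str]


@dataclass
class RouteResult:
    model: str        # the tier whose output was returned
    output: str
    accepted: bool    # False if even the last tier failed the check
    tried: List[str]  # every tier that was called, in order


def route(prompt: str,
          tiers: Sequence[Tuple[str, ModelCall]],
          accept: Callable[[str, str], bool]) -> RouteResult:
    """Try tiers cheapest first and stop at the first output that passes accept."""
    if not tiers:
        raise ValueError("at least one tier is required")
    tried: List[str] = []
    output = ""
    for model, call in tiers:
        output = call(prompt)
        tried.append(model)
        if accept(prompt, output):
            return RouteResult(model, output, True, tried)
    # Nothing passed: return the strongest tier's answer, flagged as not accepted.
    return RouteResult(tried[-1], output, False, tried)


# Stub models and a stub check, standing in for real provider calls and evals.
def cheap(prompt: str) -> str:
    return "unsure"


def strong(prompt: str) -> str:
    return "label: invoice"


def looks_valid(prompt: str, output: str) -> bool:
    return output.startswith("label:")


result = route("classify this document",
               [("small-model", cheap), ("large-model", strong)],
               looks_valid)
```

Recording `tried` alongside the result matters for metering, because a request that escalated paid for every tier it touched, not just the one that answered.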
There is a second reason to keep the routing layer flexible. GPT-5.5 Instant became the ChatGPT default on May 5, 2026, exposed through a floating "chat-latest" alias. Pinning your application to a floating alias means your cost and behavior can change underneath you without a deploy on your side. Pin to stable model ids, like "claude-opus-4-8", and make model selection a config decision your routing layer owns, not an accident of whatever the provider promoted to default this week.
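One way to keep that honest is a small check over the routing config that flags anything that looks like a floating alias. This is a sketch under assumptions: the `ROUTES` table and the "small-model-2026-01-15" id are hypothetical, and treating "latest" or "default" as floating markers is a naming heuristic of mine, not a rule any provider publishes.

```python
from typing import Dict, List

# Name fragments treated as signs of a floating alias (a heuristic, not a standard).
FLOATING_MARKERS = ("latest", "default")

# Hypothetical routing table: task type -> pinned model id.
ROUTES: Dict[str, str] = {
    "classification": "small-model-2026-01-15",
    "hard_reasoning": "claude-opus-4-8",
}


def floating_aliases(routes: Dict[str, str]) -> List[str]:
    """Return the task names whose model id contains a floating marker."""
    return sorted(
        task
        for task, model in routes.items()
        if any(part in FLOATING_MARKERS for part in model.split("-"))
    )


unpinned = floating_aliases(ROUTES)  # empty when every route is pinned
```

Run a check like this in CI or at gateway startup so a floating alias fails loudly instead of quietly changing your bill.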
Treat caps as a control plane, not a hope
Attribution tells you what happened. Caps stop the eleven-day budget fire from happening at all.
A budget you only read about after the fact is a postmortem, not a control. Enforce the limit where the request flows, at the gateway. Give every team, feature, and agent a spend budget. Track running spend against it in a fast store. When a caller crosses a soft threshold, alert. When it crosses the hard cap, the gateway refuses or downgrades the request rather than forwarding it. This is the same discipline as an AWS Budgets action that triggers automatically, except you are enforcing it inline on the path that actually spends the money, so it bites in seconds rather than after the daily billing refresh.
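A minimal sketch of that inline check follows, with an in-memory dict standing in for the fast store. The `BudgetGate` class, the `Decision` values, and the 80% soft threshold are assumptions for illustration. A production version would need atomic updates in a shared store, and whether a blocked request is refused or downgraded is left to the caller.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Decision(Enum):
    ALLOW = "allow"
    ALERT = "alert"  # forward the request, but notify the owner
    BLOCK = "block"  # refuse, or downgrade to a cheaper model


@dataclass
class Budget:
    hard_cap_usd: float
    soft_fraction: float = 0.8  # assumed default: alert at 80% of the cap
    spent_usd: float = 0.0


class BudgetGate:
    """Tracks running spend per caller (team, feature, or agent) in memory."""

    def __init__(self) -> None:
        self._budgets: Dict[str, Budget] = {}

    def set_budget(self, caller: str, hard_cap_usd: float,
                   soft_fraction: float = 0.8) -> None:
        self._budgets[caller] = Budget(hard_cap_usd, soft_fraction)

    def check(self, caller: str, estimated_cost_usd: float) -> Decision:
        """Decide before forwarding, based on spend plus this request's estimate."""
        budget = self._budgets[caller]
        projected = budget.spent_usd + estimated_cost_usd
        if projected > budget.hard_cap_usd:
            return Decision.BLOCK
        if projected >= budget.hard_cap_usd * budget.soft_fraction:
            return Decision.ALERT
        return Decision.ALLOW

    def record(self, caller: str, actual_cost_usd: float) -> None:
        """Add the real cost once the provider response is back."""
        self._budgets[caller].spent_usd += actual_cost_usd


gate = BudgetGate()
gate.set_budget("doc-agent", hard_cap_usd=10.0)
first_decision = gate.check("doc-agent", estimated_cost_usd=1.0)
```

Checking against an estimate before the call and recording the actual cost afterward keeps the gate from forwarding the one request that would blow through the cap.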
Set the alert thresholds where they give you time to act, not where they confirm the disaster. A burn-rate alert that fires when a team is on pace to exceed its monthly budget by day ten is worth ten dashboards nobody opens.
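A burn-rate check can be as simple as the sketch below, which assumes spend continues at the average daily rate seen so far this month. Both function names are hypothetical, and a linear projection is a deliberate simplification: it will lag behind a sudden spike, so a real alert might also look at the last day or hour.

```python
def projected_exhaustion_day(spent_usd: float, budget_usd: float,
                             day_of_month: int) -> float:
    """Day of the month on which the budget runs out at the current average rate."""
    if day_of_month < 1:
        raise ValueError("day_of_month must be at least 1")
    if spent_usd <= 0:
        return float("inf")  # nothing spent yet, so no projected exhaustion
    daily_rate = spent_usd / day_of_month
    return budget_usd / daily_rate


def burn_rate_alert(spent_usd: float, budget_usd: float,
                    day_of_month: int, days_in_month: int = 30) -> bool:
    """True when the budget is projected to run out before the month ends."""
    return projected_exhaustion_day(spent_usd, budget_usd, day_of_month) < days_in_month


# On day 10, a team has spent 400 USD of a 1000 USD monthly budget.
runs_out_on = projected_exhaustion_day(400.0, 1000.0, day_of_month=10)
should_alert = burn_rate_alert(400.0, 1000.0, day_of_month=10)
```

In this example the alert fires on day 10 with roughly two weeks of runway left, which is the kind of lead time that lets a team fix the cause instead of explaining it.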
Tag agents like cost centers, and review weekly
The last piece is organizational, and it is where the FinOps analogy becomes exact.
Autonomous agents are the new spend-generating workloads, and they behave like services with their own budgets and their own blast radius. So give each agent an identity and a cost center, the same way you would tag a service. This is not only a finance concern. The Cloud Security Alliance, in its non-human identity governance whitepaper on May 20, 2026, noted that non-human identities already outnumber humans by roughly 45 to 1, and by as much as 144 to 1 in some estimates. Every one of those agents can spend money and take action. Naming and budgeting them is the same hygiene that lets you both attribute cost and revoke access cleanly.
Then put it on a cadence. The cloud FinOps shops that control spend do it with a boring weekly review: top movers, biggest cost-per-outcome offenders, anything new that appeared, anything trending toward its cap. Run the identical meeting for AI. Pull the per-request data from your gateway, look at cost per accepted result rather than raw token volume, and ask one question of every line that grew: is this buying us proportional value?
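For the review itself, cost per accepted result is a short aggregation over the gateway's per-request events. The sketch below assumes each event is a dict with `feature`, `cost_usd`, and an `accepted` flag. That flag is my assumption: it has to come from somewhere, such as an evaluation check or a user signal, and the field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def cost_per_accepted(events: Iterable[Mapping]) -> Dict[str, float]:
    """Total spend per feature divided by its number of accepted results."""
    cost: Dict[str, float] = defaultdict(float)
    accepted: Dict[str, int] = defaultdict(int)
    for event in events:
        feature = event["feature"]
        cost[feature] += event["cost_usd"]  # rejected calls still cost money
        if event["accepted"]:
            accepted[feature] += 1
    return {
        feature: (cost[feature] / accepted[feature]
                  if accepted[feature] else float("inf"))
        for feature in cost
    }


week = [
    {"feature": "search", "cost_usd": 0.5, "accepted": True},
    {"feature": "search", "cost_usd": 0.25, "accepted": False},
    {"feature": "summary", "cost_usd": 0.1, "accepted": False},
]
report = cost_per_accepted(week)
```

A feature with spend and no accepted results shows up as infinite cost per outcome, which puts it at the top of the offenders list where it belongs.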
None of this is novel. It is the cloud FinOps playbook applied to a new unit of consumption. The teams that get hurt are the ones who assume AI spend is a model problem and forget it is an operations problem. The meter is already running. The only question is whether you have built the dashboard yet.
I am curious where others are putting the metering: at the gateway, at the provider, or both. What is working for you?
