In April, the engineer who watches Uber's AI budget did the math twice, because the first answer did not make sense. The company had already spent its entire 2026 allocation for AI coding tools. It was April. Four months into the year, the year's money was gone.
Uber was not careless, and it was not alone. Within weeks, a roster of the most sophisticated technology companies on earth started doing what Uber did next: capping how much AI their own engineers were allowed to use. After two years in which engineers were told to spend on tokens without fear, the era of tokenmaxxing ended inside a single quarter.
Burned its full-year AI coding budget by April. Now capped at $1,500 a month, per engineer, per tool.
Capped tokens per employee after adoption of its Code Puppy platform surged.
Moved to limit AI tool usage in mid-June, citing an exponential increase in cost.
Found individual engineers running $500 to $2,000 a month on a single coding agent.
Scrapped an internal usage leaderboard after staff gamed it to top the chart.
Joined Cisco and others imposing per-employee usage restrictions as bills climbed.
The trade press gave the reversal a name within days: tokenminimizing. To understand why the most AI-forward companies in the world all hit the brakes at once, you have to understand the contradiction underneath the bill.
The Gospel That Just Died
For two years the smartest builders preached the exact opposite of caps. Garry Tan and YC called it tokenmaxxing: push consumption to the limit, because one engineer spending aggressively on tokens can do the work of hundreds. Ramp's own AI usage climbed over 6,000% year on year. The advice was not wrong. While a token was a rounding error, spending more of them was pure leverage.
What changed is that tokens stopped being a rounding error. Not everyone bought the gospel even at its peak, and the holdouts now look prescient.
Levie
"We never celebrated tokenmaxxing. We never had leaderboards."
Cheaper Tokens, Bigger Bills
Here is the contradiction. The cost of intelligence collapsed. A model as capable as the original GPT-3 cost around $60 per million tokens in late 2021. By 2024 the same capability ran about $0.06, a thousandfold drop in three years, a curve a16z named LLMflation. Prices fell roughly 10x a year, every year.
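The arithmetic behind that curve is short. Here is a minimal sketch that uses only the two price points quoted above; the function name is illustrative, and a constant yearly rate of decline is an assumption.

```python
def annual_price_factor(start_price: float, end_price: float, years: float) -> float:
    """Return the constant yearly factor by which the price is divided."""
    total_drop = start_price / end_price
    return total_drop ** (1 / years)

# Price points from the text: $60 per million tokens (late 2021), $0.06 (2024).
total_drop = 60 / 0.06
yearly = annual_price_factor(60, 0.06, 3)
print(f"total drop: {total_drop:,.0f}x, per year: {yearly:.1f}x")
```

A thousandfold drop over three years works out to roughly tenfold a year, which is where the 10x figure comes from.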
Bills should have shrunk to nothing. Instead, enterprise generative-AI spend went from $1.7B in 2023 to $11.5B in 2024 to roughly $37B in 2025, more than 20x, while the price of a token fell over 90%. Cheaper inputs, much bigger bills. That is not a paradox to anyone who has read their economic history.
Nadella
"Jevons paradox strikes again. As AI gets more efficient and accessible, we will see its use skyrocket, turning it into a commodity we just can't get enough of."
In 1865, the economist William Stanley Jevons noticed that more efficient steam engines did not reduce Britain's coal use. They increased it. When a resource gets cheaper to use, you use far more of it, and total spend climbs. Tokens are coal now. Every efficiency the labs ship gets eaten by a workload that grows faster than the price falls.
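The same logic can be checked against the spend figures above. This is a rough sketch, not a measurement: it assumes spend is simply a blended token price times token volume, and it treats "over 90%" as exactly 90%, so the result is a floor. The function name is illustrative.

```python
def implied_volume_growth(spend_start: float, spend_end: float,
                          price_drop_fraction: float) -> float:
    """Volume growth implied by spend growth and a price drop.

    Assumes spend = price * volume, with one blended price (a simplification).
    """
    spend_growth = spend_end / spend_start
    price_factor = 1 - price_drop_fraction  # what is left of the old price
    return spend_growth / price_factor

# Figures from the text: $1.7B (2023) to roughly $37B (2025), price down 90%.
growth = implied_volume_growth(1.7, 37.0, 0.90)
print(f"implied token volume growth: about {growth:.0f}x")
```

Under those assumptions, usage had to grow by more than two hundred times for the bill to grow by twenty. That is the workload outrunning the price cut.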
Why Agents Eat Tokens Alive
The thing that broke the budget is not the chatbot. It is the agent. A chat turn sends a prompt and gets an answer, a few hundred to a few thousand tokens. An agent loops. It reads context, calls a tool, reads the result, reasons, calls another tool, re-reads the growing context, and repeats until the job is done. Every loop re-sends the accumulated state. Tokens compound.
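A small simulation makes the compounding concrete. It is a sketch with hypothetical sizes (a 2,000-token starting context and 1,500 tokens added per turn), not data from any real agent.

```python
def tokens_sent(turns: int, base_context: int, added_per_turn: int) -> int:
    """Total input tokens when every turn re-sends the full accumulated context."""
    total = 0
    context = base_context
    for _ in range(turns):
        total += context           # the whole context is sent again this turn
        context += added_per_turn  # tool output and reasoning get appended
    return total

# Hypothetical sizes, chosen only to show the shape of the curve.
one_turn = tokens_sent(1, base_context=2_000, added_per_turn=1_500)
ten_turns = tokens_sent(10, base_context=2_000, added_per_turn=1_500)
twenty_turns = tokens_sent(20, base_context=2_000, added_per_turn=1_500)
print(one_turn, ten_turns, twenty_turns)
```

With these inputs, doubling the loop from ten turns to twenty sends close to four times the tokens, because each new turn pays again for everything before it.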
Anthropic published its own numbers. A single agent uses about 4x the tokens of a chat interaction. A multi-agent system, the kind that fans work out across subagents, uses about 15x. By one analysis a typical agent task burns around 96,000 tokens, against a few hundred for a chat reply. Priced per interaction, a 2023 linear workflow cost roughly $0.04. A 2026 orchestrated agent, with tools and reasoning and subagents, costs around $1.20. Thirty times more, for one run.
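Put those per-run prices next to the $1,500 monthly cap quoted earlier and the squeeze is easy to see. This sketch uses only figures from the text, held in integer cents to avoid rounding noise; the variable names are illustrative.

```python
LINEAR_WORKFLOW_CENTS = 4    # "roughly $0.04" per 2023 linear workflow
AGENT_RUN_CENTS = 120        # "around $1.20" per 2026 orchestrated agent run
MONTHLY_CAP_CENTS = 150_000  # the $1,500 a month, per engineer, per tool cap

cost_multiple = AGENT_RUN_CENTS / LINEAR_WORKFLOW_CENTS
agent_runs_under_cap = MONTHLY_CAP_CENTS // AGENT_RUN_CENTS
linear_runs_under_cap = MONTHLY_CAP_CENTS // LINEAR_WORKFLOW_CENTS

print(f"{cost_multiple:.0f}x per run")
print(f"{agent_runs_under_cap} agent runs vs {linear_runs_under_cap} linear runs per month")
```

At those prices the same cap buys 1,250 orchestrated agent runs a month, against 37,500 of the older linear workflows.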
The Number Nobody Can See
So the bills exploded, and the companies reached for the blunt instrument: a cap. But a cap is a confession. It says we cannot tell good spend from bad, so we will limit all of it. The invoice tells you how many tokens you spent. It cannot tell you what they bought. Jaya Gupta named this the token budget war: the prize is not cheaper tokens, it is knowing the marginal utility of each one, the business value created by the next dollar of inference.
That number is invisible in most companies. The same workflow, run twice on the same input, can differ in token cost by 5 to 10x with nothing visibly wrong. Was that spend replacing a contractor, generating revenue, reducing a risk, or was it an engineer tokenmaxxing on a leaderboard? The bill does not say. A flat cap throttles the workflow that prints money and the one that wastes it at exactly the same rate. It is the worst tool for the job, reached for because it is the only tool most companies have.
The Company That Didn't Cap
While everyone else reached for the cap, one company did the opposite, and got a better result. In late June, Coinbase reported it had cut its internal AI bill by nearly half. Not by limiting engineers. By rewiring the plumbing underneath them.
The mechanics are unglamorous and entirely repeatable. Route the easy, high-frequency queries to a cheap model and reserve the frontier tier for the work that earns it. Cache the context that agent loops re-send on every turn, which lifted the cache hit rate from 5% to 60%. Put it all behind one gateway so the routing and caching happen by default, not by hope.
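To show the shape of that plumbing, here is a toy gateway. It is a sketch under loud assumptions, not Coinbase's implementation: the model names, the length threshold, and the `needs_reasoning` flag are all hypothetical, and a production system would route with a real classifier and lean on the provider's own prompt caching rather than a local set of hashes.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Gateway:
    """Toy gateway: picks a model tier and tracks context-cache hits."""
    cheap_model: str = "cheap-tier"        # hypothetical model name
    frontier_model: str = "frontier-tier"  # hypothetical model name
    max_cheap_chars: int = 500             # hypothetical routing threshold
    seen_prefixes: set = field(default_factory=set)
    hits: int = 0
    requests: int = 0

    def route(self, query: str, needs_reasoning: bool) -> str:
        # Long or reasoning-heavy queries go to the frontier tier.
        if needs_reasoning or len(query) > self.max_cheap_chars:
            return self.frontier_model
        return self.cheap_model

    def handle(self, context_prefix: str, query: str,
               needs_reasoning: bool = False) -> tuple[str, bool]:
        """Return (chosen model, whether the context prefix was seen before)."""
        self.requests += 1
        key = hashlib.sha256(context_prefix.encode()).hexdigest()
        cached = key in self.seen_prefixes
        if cached:
            self.hits += 1
        else:
            self.seen_prefixes.add(key)
        return self.route(query, needs_reasoning), cached

    @property
    def hit_rate(self) -> float:
        return self.hits / self.requests if self.requests else 0.0

gw = Gateway()
system_prompt = "You are a support agent. Follow the refund policy."
first = gw.handle(system_prompt, "Where is my order?")
second = gw.handle(system_prompt, "Plan a migration of the billing service",
                   needs_reasoning=True)
print(first, second, gw.hit_rate)
```

The point of the single entry point is that no engineer has to remember to pick the cheap model or reuse the cached prefix. Every request passes through the same defaults.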
Armstrong
"How to keep AI spend flat while token usage grows exponentially? Not with friction and spend alerts. With better defaults, routing, and caching."
From Tokenmaxxing to Allocation
Coinbase did not minimize tokens. It allocated them. That is the move, and it is the opposite of a cap. Spend aggressively where the marginal token pays for itself, and starve the spend that does not. To do that you need a unit the business already understands: the cost of a completed outcome.
Price the work in finished units, the way a BPO already does. Cost per resolved ticket. Cost per processed claim. Cost per reviewed contract. Cost per dollar of revenue moved. Once a workflow is measured in cost-per-outcome instead of cost-per-token, the allocation decision gets obvious, and the levers are the same ones Coinbase pulled.
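In code, the unit is one division per workflow. The sketch below uses invented workflow names and dollar figures purely for illustration; the only idea taken from the text is dividing inference spend by finished outcomes.

```python
from dataclasses import dataclass

@dataclass
class WorkflowStats:
    name: str
    token_spend_usd: float   # total inference spend for the period
    outcomes_completed: int  # e.g. resolved tickets, processed claims

    def cost_per_outcome(self) -> float:
        if self.outcomes_completed == 0:
            return float("inf")  # spend that finished nothing
        return self.token_spend_usd / self.outcomes_completed

# Hypothetical figures, not drawn from any company named in this post.
workflows = [
    WorkflowStats("contract_review", 6_000.0, 300),
    WorkflowStats("leaderboard_experiments", 2_500.0, 0),
    WorkflowStats("ticket_resolution", 4_000.0, 8_000),
]
ranked = sorted(workflows, key=lambda w: w.cost_per_outcome())
for w in ranked:
    print(f"{w.name}: ${w.cost_per_outcome():,.2f} per outcome")
```

The number becomes a decision only when it is set against what the outcome is worth, for example what a BPO would charge for the same resolved ticket. The sketch stops short of that comparison.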
Run a cheap model for the easy turns and reserve the frontier tier for the ones that justify it. Most tokens do not need the most expensive model.
Agent loops re-send the same instructions every turn. Caching, paired with the discounted rate charged for cached reads, stops you paying full price for the same context again and again.
Smaller, cleaner context is fewer re-read tokens on every loop. Context engineering is now a cost lever, not just a quality one.
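The caching and context levers can be put into one small cost model. This is a sketch with assumed inputs: the $3 per million tokens price, the 10% cached-read rate, and the context sizes are hypothetical, and the loop is simplified to re-send a fixed context each turn. Only the 5% and 60% hit rates come from the text.

```python
def loop_input_cost(context_tokens: int, turns: int, price_per_mtok: float,
                    cache_hit_rate: float = 0.0,
                    cached_price_ratio: float = 1.0) -> float:
    """Input cost in dollars of a loop that re-sends a fixed context each turn."""
    sent = context_tokens * turns
    cached = sent * cache_hit_rate  # tokens served from cache
    fresh = sent - cached           # tokens billed at full price
    billable = fresh + cached * cached_price_ratio
    return billable * price_per_mtok / 1_000_000

PRICE = 3.0            # hypothetical dollars per million input tokens
CACHED_RATIO = 0.1     # hypothetical: cached reads billed at 10% of full price

baseline = loop_input_cost(20_000, 10, PRICE, 0.05, CACHED_RATIO)
with_cache = loop_input_cost(20_000, 10, PRICE, 0.60, CACHED_RATIO)
cache_and_trim = loop_input_cost(12_000, 10, PRICE, 0.60, CACHED_RATIO)
print(f"{baseline:.4f} {with_cache:.4f} {cache_and_trim:.4f}")
```

Under these assumptions, raising the hit rate cuts the loop's input cost by about half, and trimming the context on top of that takes it lower again. Routing would then change the price input itself.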
Comparison: tokenmaxxing versus token allocation.
What This Means for Your Company
The token bill is the second of three walls every company hits with AI right now. Trust came first: can we rely on the output? Cost is the one that moved to the front of the room in 2026: can we afford the output? The workforce question sits behind both, and it is the subject of the next post in this series.
For cost, the lesson of June is not to cap like Uber or to spend like 2025. It is to allocate like Coinbase: instrument the workflows, define the outcomes, route and cache and trim, and put a cost-per-outcome number where the usage chart used to be. The company that wins is not the one that spends the least or the most. It is the one that can see what a token bought. Get that right for one function and the same discipline repeats across the next. That is the work nativefirst does on site.
You can see your token bill. Can you see what it bought?
Book a free Diagnostic: 30 to 45 minutes, no deck, no pitch. We map where your tokens are going, which workflows are worth the spend, and how to put a cost-per-outcome number on the function you ship first.