In April, the engineer who watches Uber's AI budget did the math twice, because the first answer did not make sense. The company had already spent its entire 2026 allocation for AI coding tools. It was April. Four months into the year, the year's money was gone.

Uber was not careless, and it was not alone. Within weeks, a roster of the most sophisticated technology companies on earth started doing the same thing: capping how much AI their own engineers were allowed to use. After two years of being told to spend on tokens without fear, the era of tokenmaxxing ended inside a single quarter.

The rollback · mid-2026
Uber

Burned its full-year AI coding budget by April. Spend is now capped at $1,500 a month, per engineer, per tool.

Walmart

Capped tokens per employee after adoption of its Code Puppy platform surged.

Meta

Moved to limit AI tool usage in mid-June, citing an exponential increase in cost.

Microsoft

Found individual engineers running $500 to $2,000 a month on a single coding agent.

Amazon

Scrapped an internal usage leaderboard after staff gamed it to top the chart.

AT&T

Joined Cisco and others in imposing per-employee usage restrictions as bills climbed.

The trade press gave the reversal a name within days: tokenminimizing. To understand why the most AI-forward companies in the world all hit the brakes at once, you have to understand the contradiction underneath the bill.

The Gospel That Just Died

For two years the smartest builders preached the exact opposite of caps. Garry Tan and YC called it tokenmaxxing: push consumption to the limit, because one engineer spending aggressively on tokens can do the work of hundreds. Ramp's own AI usage climbed over 6,000% year on year. The advice was not wrong. While a token was a rounding error, spending more of them was pure leverage.

What changed is that tokens stopped being a rounding error. Not everyone bought the gospel even at its peak, and the holdouts now look prescient.


"We never celebrated tokenmaxxing. We never had leaderboards."

Aaron Levie, Box  ·  June 2026, on "AI psychosis," the belief that more AI is always better

Cheaper Tokens, Bigger Bills

Here is the contradiction. The cost of intelligence collapsed. A model as capable as the original GPT-3 cost around $60 per million tokens in late 2021. By 2024 the same capability ran about $0.06, a thousandfold drop in three years, a curve a16z named LLMflation. Prices fell roughly 10x a year, every year.

Bills should have shrunk to nothing. Instead, enterprise generative-AI spend went from $1.7B in 2023 to $11.5B in 2024 to roughly $37B in 2025, more than 20x, while the price of a token fell over 90%. Cheaper inputs, much bigger bills. That is not a paradox to anyone who has read their economic history.
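The arithmetic behind the scissors is worth running once. The sketch below uses only the figures quoted above and one loud assumption: that spend is a single blended price times token volume, and that the blended price fell by exactly 90%. The text says "over 90%", so the implied volume growth is a lower bound, not a measurement.

```python
# Figures quoted in the text (a16z LLMflation, Menlo Ventures).
price_2021 = 60.00   # USD per million tokens, GPT-3-level capability, late 2021
price_2024 = 0.06    # USD per million tokens, same capability, 2024
years = 3

total_drop = price_2021 / price_2024      # about 1,000x over the period
annual_drop = total_drop ** (1 / years)   # about 10x per year

spend_2023 = 1.7e9   # enterprise generative-AI spend, USD
spend_2025 = 37e9
spend_growth = spend_2025 / spend_2023    # about 21.8x

# Assumption (not from the source): spend = blended price x token volume,
# and the blended price fell by exactly 90%, so 10% of it remains.
price_ratio = 0.10
implied_volume_growth = spend_growth / price_ratio

print(f"price drop: {total_drop:.0f}x total, {annual_drop:.1f}x per year")
print(f"spend growth: {spend_growth:.1f}x")
print(f"implied token volume growth: at least {implied_volume_growth:.0f}x")
```

Under that assumption, volume had to grow by a factor in the low hundreds for the bill to rise 20x while the unit price fell tenfold. That is Jevons in one division.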


"Jevons paradox strikes again. As AI gets more efficient and accessible, we will see its use skyrocket, turning it into a commodity we just can't get enough of."

Satya Nadella, Microsoft  ·  January 2025

In 1865, the economist William Stanley Jevons noticed that more efficient steam engines did not reduce Britain's coal use. They increased it. When a resource gets cheaper to use, you use far more of it, and total spend climbs. Tokens are coal now. Every efficiency the labs ship gets eaten by a workload that grows faster than the price falls.

Fig. 1
The scissors: price per token down, total spend up
[Chart: enterprise spend rising from $1.7B (2023) to $11.5B (2024) to $37B (2025), against a price per token that fell 90%+. Every token got cheaper. The bill got bigger.]
Per-token price fell more than 90% while enterprise spend rose more than 20x. Sources: a16z (LLMflation), Menlo Ventures.

Why Agents Eat Tokens Alive

The thing that broke the budget is not the chatbot. It is the agent. A chat turn sends a prompt and gets an answer, a few hundred to a few thousand tokens. An agent loops. It reads context, calls a tool, reads the result, reasons, calls another tool, re-reads the growing context, and repeats until the job is done. Every loop re-sends the accumulated state. Tokens compound.
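A minimal sketch of why the loop compounds. The function names and every token count here are hypothetical, chosen only to make the shape visible. The one modeling assumption is the one described above: each step re-sends the full accumulated context, and each step's output and tool result are appended to it.

```python
def chat_turn_tokens(prompt: int, answer: int) -> int:
    """One request, one response, nothing re-sent."""
    return prompt + answer


def agent_run_tokens(base_context: int, steps: int,
                     tool_result_per_step: int, output_per_step: int) -> int:
    """Total tokens billed when every step re-sends the accumulated context."""
    total = 0
    context = base_context
    for _ in range(steps):
        # Pay for the whole context again, plus this step's output.
        total += context + output_per_step
        # The output and the tool result both join the context for next time.
        context += output_per_step + tool_result_per_step
    return total


# Hypothetical sizes, for illustration only.
chat = chat_turn_tokens(prompt=500, answer=500)
ten_steps = agent_run_tokens(2_000, 10, 1_000, 200)
twenty_steps = agent_run_tokens(2_000, 20, 1_000, 200)

print(chat)          # 1000
print(ten_steps)     # 76000
print(twenty_steps)  # 272000
```

Doubling the steps from 10 to 20 does not double the bill in this toy model. It multiplies it by about 3.6, because total tokens grow roughly with the square of the number of loops.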

Anthropic published its own numbers. A single agent uses about 4x the tokens of a chat interaction. A multi-agent system, the kind that fans work out across subagents, uses about 15x. By one analysis a typical agent task burns around 96,000 tokens, against a few hundred for a chat reply. Priced per interaction, a 2023 linear workflow cost roughly $0.04. A 2026 orchestrated agent, with tools and reasoning and subagents, costs around $1.20. Thirty times more, for one run.

Fig. 2
Tokens per task, by how much the model loops
[Chart: tokens per task. Chat turn 1x; single agent ~4x; multi-agent ~15x (~96K tokens per job). Per interaction, cost rose from about $0.04 to $1.20.]
More autonomy means more loops, and every loop re-sends the context. Source: Anthropic engineering, SemiAnalysis via Derek Thompson.

The Number Nobody Can See

So the bills exploded, and the companies reached for the blunt instrument: a cap. But a cap is a confession. It says we cannot tell good spend from bad, so we will limit all of it. The invoice tells you how many tokens you spent. It cannot tell you what they bought. Jaya Gupta named this the token budget war: the prize is not cheaper tokens, it is knowing the marginal utility of each one, the business value created by the next dollar of inference.

That number is invisible in most companies. The same workflow, run twice on the same input, can differ in token cost by 5 to 10x with nothing visibly wrong. Was that spend replacing a contractor, generating revenue, reducing a risk, or was it an engineer tokenmaxxing on a leaderboard? The bill does not say. A flat cap throttles the workflow that prints money and the one that wastes it at exactly the same rate. It is the worst tool for the job, reached for because it is the only tool most companies have.

The Company That Didn't Cap

While everyone else reached for the cap, one company did the opposite, and got a better result. In late June, Coinbase reported it had cut its internal AI bill by nearly half. Not by limiting engineers. By rewiring the plumbing underneath them.

Coinbase · cut its AI bill ~50% without a single cap
~50%
AI spend cut, with no usage limits on engineers. 91% never hit the old limits anyway.
5 → 60%
Cache hit rate on routine engineering queries, after routing everything through one internal gateway.
$1.40
Per-million cost of the cheap open-weight default it routes to, against $5 for frontier Opus.

The mechanics are unglamorous and entirely repeatable. Route the easy, high-frequency queries to a cheap model and reserve the frontier tier for the work that earns it. Cache the context that agent loops re-send on every turn, lifting the hit rate from 5% to 60%. Put it all behind one gateway so the routing and caching happen by default, not by hope.
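Those three moves fit in a few dozen lines. What follows is a toy gateway, not Coinbase's implementation, which has not been published in this detail. The `Gateway` class, the `needs_frontier` flag, the model labels, and the 10% cached-read rate are all assumptions for illustration. The two list prices are the per-million figures quoted above, applied here as one flat per-token price.

```python
from dataclasses import dataclass, field

# Per-million-token prices quoted in the text; the labels are placeholders.
PRICE_PER_M = {"cheap": 1.40, "frontier": 5.00}
# Assumption: cached context is billed at 10% of list price.
CACHE_READ_DISCOUNT = 0.10


@dataclass
class Gateway:
    """Single entry point, so routing and caching apply to every call."""
    seen_prefixes: set = field(default_factory=set)
    spend_usd: float = 0.0

    def pick_model(self, needs_frontier: bool) -> str:
        # A real router would classify difficulty; here the caller says so.
        return "frontier" if needs_frontier else "cheap"

    def call(self, prefix: str, prefix_tokens: int, new_tokens: int,
             needs_frontier: bool = False) -> float:
        model = self.pick_model(needs_frontier)
        price = PRICE_PER_M[model] / 1_000_000
        key = (model, prefix)
        if key in self.seen_prefixes:
            # Cache hit: the repeated context is billed at the read rate.
            prefix_cost = prefix_tokens * price * CACHE_READ_DISCOUNT
        else:
            prefix_cost = prefix_tokens * price
            self.seen_prefixes.add(key)
        cost = prefix_cost + new_tokens * price
        self.spend_usd += cost
        return cost


gw = Gateway()
first = gw.call("repo-instructions", 10_000, 500)   # cold, cheap model
second = gw.call("repo-instructions", 10_000, 500)  # warm, cheap model
hard = gw.call("repo-instructions", 10_000, 500, needs_frontier=True)
print(f"{first:.4f} {second:.4f} {hard:.4f}")
```

The design point is the single entry point. Because every call passes through `call`, the cheap default and the cache lookup happen whether or not the engineer thinks about them.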


"How to keep AI spend flat while token usage grows exponentially? Not with friction and spend alerts. With better defaults, routing, and caching."

Brian Armstrong, Coinbase  ·  on X, June 26, 2026

From Tokenmaxxing to Allocation

Coinbase did not minimize tokens. It allocated them. That is the move, and it is the opposite of a cap. Spend aggressively where the marginal token pays for itself, and starve the spend that does not. To do that you need a unit the business already understands: the cost of a completed outcome.

Price the work in finished units, the way a BPO already does. Cost per resolved ticket. Cost per processed claim. Cost per reviewed contract. Cost per dollar of revenue moved. Once a workflow is measured in cost-per-outcome instead of cost-per-token, the allocation decision gets obvious, and the levers are the same ones Coinbase pulled.
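As a rough sketch, cost per outcome is total inference spend on a workflow divided by the units it actually finished. The `Run` record and the workflow names below are hypothetical. The one design choice that matters is that failed runs stay in the numerator, because they were paid for too.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    workflow: str     # hypothetical label, e.g. "ticket" or "claim"
    cost_usd: float   # inference spend attributed to this run
    completed: bool   # did the run produce a finished unit?


def cost_per_outcome(runs: list[Run]) -> dict[str, float]:
    """Spend per finished unit, per workflow. Failed runs still count as spend."""
    spend: dict[str, float] = defaultdict(float)
    done: dict[str, int] = defaultdict(int)
    for run in runs:
        spend[run.workflow] += run.cost_usd
        done[run.workflow] += int(run.completed)
    # A workflow that spent money and finished nothing has infinite unit cost.
    return {w: (spend[w] / done[w] if done[w] else float("inf")) for w in spend}


runs = [
    Run("ticket", 1.20, True),
    Run("ticket", 1.20, False),   # retried and abandoned, still billed
    Run("ticket", 0.60, True),
    Run("claim", 2.00, False),
]
result = cost_per_outcome(runs)
print(result)
```

Here the ticket workflow costs about $1.50 per resolved ticket, not the $1.00 average per run, and the claim workflow has spent money without finishing anything. That second number is the one a usage chart never shows.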

Fig. 3
Allocation, not rationing
[Diagram: all queries enter a router that sorts them by difficulty. Most of the tokens go to a cheap model, the frontier tier is used only when it earns it, and a cache serves what repeats. Spend follows value, not volume.]
Route the easy turns to a cheap model, send only the work that earns it to the frontier, and cache what repeats. That is how Coinbase halved its bill without a single cap.
Lever 01
Route by difficulty

Run a cheap model for the easy turns and reserve the frontier tier for the ones that justify it. Most tokens do not need the most expensive model.

Lever 02
Cache the context

Agent loops re-send the same instructions every turn. Caching and a real read discount stop you paying full price for the same context again and again.

Lever 03
Engineer the context

Smaller, cleaner context is fewer re-read tokens on every loop. Context engineering is now a cost lever, not just a quality one.
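The second and third levers multiply, which a short calculation makes concrete. This is an illustration under stated assumptions, not a vendor price sheet: the 10% cached-read rate, the 20,000-token context, and the 10-step loop are hypothetical. The 5% and 60% hit rates and the $5 list price come from the Coinbase figures above.

```python
def effective_price(list_price_per_m: float, hit_rate: float,
                    read_discount: float) -> float:
    """Blended price per million context tokens at a given cache hit rate."""
    return list_price_per_m * ((1 - hit_rate) + hit_rate * read_discount)


def context_cost(context_tokens: int, steps: int, list_price_per_m: float,
                 hit_rate: float, read_discount: float) -> float:
    """Cost of re-sending the same context on every step of one agent run."""
    resent_tokens = context_tokens * steps
    price = effective_price(list_price_per_m, hit_rate, read_discount)
    return resent_tokens / 1_000_000 * price


LIST_PRICE = 5.00      # USD per million tokens, frontier tier, from the text
READ_DISCOUNT = 0.10   # assumption: cached reads billed at 10% of list price

before = context_cost(20_000, 10, LIST_PRICE, hit_rate=0.05, read_discount=READ_DISCOUNT)
cached = context_cost(20_000, 10, LIST_PRICE, hit_rate=0.60, read_discount=READ_DISCOUNT)
trimmed = context_cost(12_000, 10, LIST_PRICE, hit_rate=0.60, read_discount=READ_DISCOUNT)

print(f"before: ${before:.3f}  cached: ${cached:.3f}  cached and trimmed: ${trimmed:.3f}")
```

With these inputs, raising the hit rate alone roughly halves the context cost, and trimming the context by 40% takes another 40% off what remains. Neither lever touches how much an engineer is allowed to use.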

2025: Tokenmaxxing

Operating rule: Spend without fear. More tokens, more leverage. Leaderboards optional.
The metric: Usage. How much are we consuming, and is it going up.
Fails when: The bill outruns the budget and the only answer left is a cap.

2026: Token allocation

Operating rule: Spend where the marginal token pays for itself. Starve the rest.
The metric: Cost per completed outcome. Per ticket, per claim, per dollar moved.
Requires: An attribution layer that ties tokens spent to value produced.

What This Means for Your Company

The token bill is the second of three walls every company hits with AI right now. Trust came first: can we rely on the output? Cost is the one that moved to the front of the room in 2026: can we afford the output? The workforce question sits behind both, and it is the subject of the next post in this series.

For cost, the lesson of June is not to cap like Uber or to spend like 2025. It is to allocate like Coinbase: instrument the workflows, define the outcomes, route and cache and trim, and put a cost-per-outcome number where the usage chart used to be. The company that wins is not the one that spends the least or the most. It is the one that can see what a token bought. Get that right for one function and the same discipline repeats across the next. That is the work nativefirst does on site.

You can see your token bill. Can you see what it bought?

Book a free Diagnostic: 30 to 45 minutes, no deck, no pitch. We map where your tokens are going, which workflows are worth the spend, and how to put a cost-per-outcome number on the function you ship first.

Sources
1. TheNextWeb, "Tokenminimizing," with CryptoBriefing and AndroidHeadlines coverage, June 2026. Spend caps and usage restrictions at Uber, Walmart (Code Puppy), Meta, Microsoft, Amazon, Cisco, and AT&T. thenextweb.com · cryptobriefing.com
2. Coinbase / Brian Armstrong, on X, late June 2026, via Startup Fortune and BeInCrypto. ~50% AI bill cut with no usage caps; model routing to open-weight defaults (GLM 5.2 at $1.40/M vs Opus at $5/M); cache hit rate raised from 5% to 60%; 91% of engineers never hit the old limits. startupfortune.com
3. Guido Appenzeller, a16z, "LLMflation." Roughly 1,000x decline in the cost of GPT-3-level capability over three years, about 10x per year. a16z.com
4. Menlo Ventures, "2025: The State of Generative AI in the Enterprise." Enterprise spend $1.7B (2023) to $11.5B (2024) to ~$37B (2025). menlovc.com
5. Anthropic engineering, "How we built our multi-agent research system" (agents ~4x, multi-agent ~15x tokens), and Derek Thompson, "The Great AI Cost Panic of 2026," citing SemiAnalysis and Ramp (~96K tokens per task; Ramp 13x year-on-year token spend). anthropic.com · derekthompson.org
6. Satya Nadella, on X, January 27, 2025, on the Jevons paradox. x.com
7. Jaya Gupta (@JayaGup10), "Token Budget Wars," May 2026, on adoption to allocation and marginal token utility.
John Tan

Founder and CEO of nativefirst.ai. Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.