Frontier models now handle roughly 85% of real-world economic tasks at something close to expert level. METR's evals show Claude completing 80% of four-hour expert tasks without human help. The benchmarks keep moving up and to the right.
So here's the question nobody is asking clearly: when a model can do 80% of expert work, what happens to that 80%?
It gets cheap. Very cheap. And the 20% that's different gets expensive.
The Paradox Playing Out Right Now
Two things are true at the same time. Major tech companies are announcing AI-driven headcount reductions. Meanwhile Every, Dan Shipper's media company, automated everything it could and grew from 4 to 30 people.
Codex runs the code reviews. Claude Code writes the first drafts. Agents sit in nearly every workflow. And they needed more humans, not fewer.
These two facts look contradictory. They're not. The companies cutting headcount hired too many people for work that was always automatable. The companies growing are creating new categories of work that didn't exist before the automation was in place.
The "AI layoffs" at Meta and ClickUp are a cover story. Those companies hired ahead of demand in roles built around rote execution. When the automation caught up, there was nothing left to justify the headcount. That's not AI displacing experts. That's AI revealing which work was never really expert in the first place.
The Floor Rises. The Zone Gets Crowded.
AI raises the floor for everyone. Your junior hire can now produce a decent first draft of almost anything on day one. Your marketing team can A/B test ten angles overnight. Your engineers can scaffold a working prototype in hours instead of days.
This sounds like good news. It is. But it creates a problem nobody warned you about.
When everyone can produce a decent first draft, the market fills up with decent first drafts. The zone between "not bad" and "obviously great" gets brutally crowded. The gap between generic and genuinely good becomes the only gap that matters commercially.
That gap requires judgment. And judgment is not something you can train from existing data.
AI progress creates more work for humans, not less. The automation raises the floor and floods the zone. Rising above the baseline requires expertise that can't be made explicit enough to replicate.
What Benchmarks Cannot Measure
There's a structural reason this keeps surprising people. Benchmarks measure performance on known frames. The eval set is drawn from tasks that already exist, questions that have already been asked, problems that have been solved before.
Models are extraordinarily good at operating inside established frames. They pattern-match to what exists. That's how they got to 85% on real-world economic tasks.
But the most valuable work is not operating inside existing frames. It's recognizing when the frame is wrong: zooming out, recentering the problem, and deciding which question to answer before any execution begins.
A model cannot tell you that you're solving the wrong problem. It will solve the problem you gave it, extremely well, and present the result with confidence. The human in the loop is the one who has to notice that the problem was misspecified from the start.
That's not a gap that closes as models get better. It's structural. The thing that makes humans indispensable is precisely the ability to operate outside established frames, which is exactly what you cannot train on.
Your Company Is Not Automating People Out
Here is what actually happens when companies deploy AI well. They automate the residue of expertise: the parts of expert work that are repetitive, explicit, and templatable. These are the parts that a skilled person could, in principle, write a procedure for.
That's real automation. It's valuable. It means your best people stop doing work that was beneath them.
What it creates is demand for more of the non-residue: the judgment calls, the frame-setting, the taste decisions, the work that requires someone who has been in enough rooms to know what matters. That work expands to fill the space the automation created.
Every's growth from 4 to 30 is not an anomaly. It's the correct prediction. Automating the routine created space for work that was previously crowded out by routine. The people Every hired were not doing what the original four did. They were doing work that only became possible once the agents were running.
The Work Shifts. The Volume Goes Up.
This is not a comfortable message for everyone. Some roles do disappear, specifically roles that were primarily residue: execution without judgment, processing without synthesis, drafting without taste. If your job was 90% explicit procedure, the model is coming for that 90%.
But some work requires operating outside an established frame: making decisions with incomplete information, recognizing when a strategy is wrong before the data confirms it, building trust with a customer in a difficult conversation. The automation makes that work more valuable, not less.
The bar rises. The volume goes up. The work shifts toward what's hard to commoditize.
What This Means for How You Deploy
Most companies are approaching this backwards. They're trying to figure out which people to replace with AI. The right question is: which work can we automate so that our best people can do more of what they're actually good at?
That's a different architecture. It's not a reduction plan. It's a capacity expansion. You're not cutting the expert. You're cutting the time the expert spends on work a model can do, so they can do more of what the model cannot.
The companies that figure this out early will grow headcount, not cut it. Not because they're bad at deploying AI, but because they're good at it. Every is the early data point. Watch what happens to the rest.
This is exactly why the embedded operator model exists. Running AI in production is not a one-time build. It requires someone who can direct the system, not just run it. Someone who sets the frame, catches the misspecified problems, and stays in the building when the agent behaves unexpectedly at 2am on a Tuesday.
Demand for that kind of judgment went up, not down.
Find the human judgment your agents need.
The Diagnostic is free. One conversation, 30–45 minutes. You'll leave knowing exactly where AI should amplify your experts and where it's just adding noise.