The Work That Can't Be Trained Away with AI

Claude Fable 5 shipped on June 9, 2026. Anthropic's most capable model yet, built for general use, better on nearly every benchmark that existed before it. The reaction in the group chats was predictable: someone said "just put everything into Anthropic and Nvidia and go home."

That take is wrong. And every model upgrade makes it more wrong.

Here's the logic. If the model keeps getting better at everything, every company built on top is a thin wrapper. The only defensible position is to own the model or the chips. Everything else gets commoditised. Shut up shop.

The flaw in that logic: it only accounts for what can be measured. And a thing you can measure is a thing you can train against.

The benchmark trap

Coding agents improved faster than almost any other AI capability. That's not an accident. A compiler is a free verifier. You write code; the compiler tells you whether it builds, and the test suite tells you whether it passes. That tight feedback loop means you can generate millions of training examples automatically. Benchmarks saturate fast when the grader is built into the environment.
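To make that loop concrete, here is a minimal sketch of an automatic grader. It is an illustration, not how any lab actually builds training data: the candidate snippets, the `verify` helper, and the `add_check` test are all hypothetical, and a real pipeline would run candidates in a sandbox rather than call `exec` directly.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LabeledExample:
    source: str
    passed: bool


def verify(source: str, check: Callable[[dict], bool]) -> bool:
    """Return True if the source compiles, runs, and satisfies the check."""
    try:
        code = compile(source, "<candidate>", "exec")  # syntax check
        namespace: dict = {}
        exec(code, namespace)  # define the candidate's functions (unsandboxed: sketch only)
        return bool(check(namespace))
    except Exception:
        # Syntax errors, runtime errors, and missing names all count as failures.
        return False


def label_candidates(candidates, check):
    """Attach a pass/fail label to every candidate with no human in the loop."""
    return [LabeledExample(src, verify(src, check)) for src in candidates]


# Hypothetical candidates a model might propose for "add two numbers".
CANDIDATES = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",  # compiles, wrong answer
    "def add(a, b)\n    return a + b\n",   # syntax error
]


def add_check(ns: dict) -> bool:
    return ns["add"](2, 3) == 5 and ns["add"](-1, 1) == 0


labeled = label_candidates(CANDIDATES, add_check)
print([ex.passed for ex in labeled])  # [True, False, False]
```

The grader costs nothing to run, so the labels scale with compute. It also only knows what `add_check` encodes, which is the limit the rest of this piece is about.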

But here's what doesn't show up on a leaderboard: "passing the test never told you the change was the right one for a decade-old codebase with three undocumented reasons that module exists." The test passed. The production system broke. The correctness lived somewhere the benchmark couldn't see.

This is the structure of the problem. Benchmarks capture what's publicly legible. Models train against what's publicly legible. The work that isn't on a leaderboard, the work whose correctness exists only inside a specific company, doesn't get easier to automate as models improve. It gets harder to access from the outside.

The two locks

There are two bottlenecks intelligence doesn't solve.

Permission. You have to be let in. Security review, integration work, legal contracts, procurement cycles. These don't compress because the model got smarter. The majority of American doctors open OpenEvidence every day. No amount of compute buys that relationship. Trust was built in the room, over time. Gradient descent doesn't replicate it.

Fig. 2: What the model eats first. A benchmark is a thing you can measure; a thing you can measure is a thing you can train against. (Framework: Sarah Guo, "The Untrainable", Jun 2026; figure by nativefirst.ai)

Accountability. Someone has to put their name on what the AI does. Inside a company, in a regulated workflow, in a production system, there is always a human signature somewhere. The smarter the model, the more valuable that signature becomes, not less. The deadbolt is the user. The user has to let you in and stay responsible for what comes out.

Aaron Levie put it plainly: "There's still an insanely large gulf between model capabilities and what it takes to apply them to specific corporate workflows. A lot is access to and formatting of the right data, and a ton more is change management and specific implementation work it takes to make AI work in any specific corporate setting."

He wrote that in response to the same piece that prompted this one. The gulf he's describing is not a gap that closes as models improve. The implementation work, the change management, the data formatting: these are not intelligence problems. They are trust and access problems.

The private ground

"An application earns its place in the untrainable corner by doing unglamorous work: arranging a company's private reality so a model can act on it."

That framing, from Sarah Guo's "The Untrainable," is the cleanest description of what the embedded operator actually does. The glamorous work is the model. The unglamorous work is building the layer underneath it.

Matt MacInnis at Rippling framed it in terms of tokens: a token spent answering a generic question is worth almost nothing. A token spent reasoning over your company's data is worth much more. The value is not in the model. The value is in the private context the model reasons over. No public training corpus contains your CRM state, your deal history, your internal terminology, your undocumented exceptions. Those things exist only inside the company. Getting a model to act on them requires someone who has done the unglamorous work to make that context available.

That work is: building the integrations, structuring the data, mapping the workflows, figuring out which information sources are trusted and which are stale, sitting in the room when the edge case surfaces and adjusting accordingly. None of it is on a benchmark. All of it compounds.
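One small piece of that work, deciding which sources a model is allowed to reason over, can be sketched in a few lines. Everything here is an assumption made for illustration: the `Source` record, the source names, the dates, and the 90-day freshness window. In practice the trusted flag and the verification date come from people inside the company, not from code.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Source:
    name: str
    trusted: bool        # set by someone who knows the system, not inferred
    last_verified: date  # last time a human confirmed the data was right


def usable_sources(sources, today, max_age_days=90):
    """Keep only sources that are trusted and were verified recently enough."""
    cutoff = today - timedelta(days=max_age_days)
    return [s.name for s in sources if s.trusted and s.last_verified >= cutoff]


# Hypothetical inventory of one company's internal sources.
SOURCES = [
    Source("crm_deals", True, date(2026, 6, 1)),
    Source("old_wiki", True, date(2024, 1, 15)),           # trusted once, now stale
    Source("sales_spreadsheet", False, date(2026, 6, 5)),  # fresh but not trusted
]

print(usable_sources(SOURCES, today=date(2026, 6, 9)))  # ['crm_deals']
```

The filter itself is trivial. The hard part is the two fields it reads, and those only get filled in by someone with permission to be in the room.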

What Fable 5 actually changes

Every new model release does two things at once. It eats more of the generic, measurable work. And it raises the ceiling on what's possible with good private context.

Fable 5 can do things Fable 4 could not. That means the distance between "model reasoning over generic information" and "model reasoning over your specific company's reality" just got wider, not narrower. The generic output got cheaper. The contextual output got more valuable.

The operator who is already inside a company, with the integrations built and the data structured and the trust established, earns a compounding advantage from every upgrade. The model improves. The private context it reasons over improves. The gap between that and what a generic deployment can do grows.

The operator who is outside the company trying to get in faces the same access problem as ever, only harder, even with a better model in hand.

The translation work compounds

There is a category of work that looks less impressive than shipping a coding agent or passing a bar exam benchmark. It is: sitting with a procurement team to understand which data lives in which system. Figuring out why the CRM field is wrong half the time. Mapping the exception that everyone knows about but no one documented. Building the prompt chain that connects three internal tools that were never designed to talk to each other.

This work is not on a leaderboard. It can't be trained against because the correctness lives inside the company, not on the public internet. It requires permission to access and accountability for what comes out.

Fable 5 will not do it. The next model won't either. Nor will the one after that.

Each upgrade makes the case stronger. The measurable work gets eaten faster. The unmeasurable work gets more valuable faster. The model keeps improving, and the translation work it cannot do keeps gaining value. The operator who is already inside keeps compounding.

That is the untrainable corner. And it is not shrinking.

Get inside before the model does.

Sources

1. Sarah Guo (@saranormous), "The Untrainable", Substack, June 2026. On private correctness, permission bottlenecks, and the untrainable corner.

2. Aaron Levie (@levie), X, June 2026. On the gap between frontier model capability and enterprise deployment reality.

3. Claude (@claudeai), "Introducing Claude Fable 5", X, June 2026.

John Tan

Founder and CEO of nativefirst.ai. Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.