Within three weeks this June, the two labs that matter most each shipped their most capable agent. Anthropic released Claude Fable 5 on June 9. OpenAI previewed GPT-5.6, the engine under Codex, on June 26. Every comparison since has asked the same question: which one won? Every one of them is asking the wrong question.

Put the two side by side and they are not the same machine. Strip the marketing and you find two different bets on what an agent even is. We run both at nativefirst, in production, every week, and we do not pick a winner, we pick a job. Here is the scorecard, the two philosophies under it, and how we decide which one to reach for.

Fig. 1
The scorecard

| | Codex 5.6 | Claude Fable 5 |
|---|---|---|
| Maker | OpenAI | Anthropic |
| Shipped | Jun 26, 2026 (preview) | Jun 9, 2026 |
| Lineup | 3 tiers: Sol / Terra / Luna | 1 model, Mythos-class |
| Everyday price /1M | $2.50 in / $15 out (Terra) | $10 in / $50 out |
| Coding SOTA | Terminal-Bench 2.1 | 95% SWE-bench Verified |
| Autonomy bet | subagents in parallel | one agent, runs for hours |
| Computer use | desktop app + browser | via Claude Code + tools |
| Availability | top tier (Sol) gated | broadly available |
| Real-world work | GDPval not run (gated) | GDPval-AA leader |

Two flagships, three weeks apart. The interesting differences are not in the price row.

Round 1: Price

This one is not close, and it should not be read as a verdict. Codex undercuts Fable at every tier. The Codex everyday model, Terra, runs $2.50 input and $15 output per million tokens. Fable 5 is $10 and $50, four times the input and more than three times the output. Even Codex's frontier Sol tier, at $5 and $30, comes in below Fable.
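The price gap is easy to check with a few lines of arithmetic. The sketch below uses only the per-million-token list prices quoted in this article (the Luna figure comes from the Sources section); the model keys, the `job_cost` helper, and the sample workload of 2M input and 0.5M output tokens are illustrative assumptions, not anything either lab publishes.

```python
# USD per one million tokens, (input, output), as quoted in this article.
# The dictionary keys are illustrative labels, not official model IDs.
PRICES = {
    "codex-luna": (1.00, 6.00),
    "codex-terra": (2.50, 15.00),
    "codex-sol": (5.00, 30.00),
    "fable-5": (10.00, 50.00),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD for one job on the given model."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens, 0.5M output tokens.
    for model in PRICES:
        print(f"{model:12s} ${job_cost(model, 2_000_000, 500_000):.2f}")
```

On that assumed workload, Terra comes to $12.50 and Fable 5 to $45.00, a 3.6x gap. The exact ratio moves with the input/output mix, between roughly 3.3x for output-heavy jobs and 4x for input-heavy ones.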

Fable is priced as the premium instrument, and Anthropic is unapologetic about it. The case for paying is a labor case, not a token case: the argument is that a week of senior-level work delivered overnight still costs less than one billable consultant hour. Codex is priced for volume, for work that runs constantly in the background. The pricing differs because the two are expected to be used differently.

Round 2: Raw Coding

The two labs report different benchmarks, which makes a clean apples-to-apples read hard. But enough numbers are public to see the shape, and it is not a tie. Each lab leads on the benchmark it chose to lead with.

Fig. 2
The benchmarks, where each lab is strong

| Benchmark | Codex 5.6 Sol | Claude Fable 5 |
|---|---|---|
| Terminal-Bench 2.1 (agentic CLI) | 88.8 (ultra: 91.9) | 83.4 |
| SWE-bench Pro (patch quality) | not published at preview (GPT-5.5 reference: 58.6) | 80.3 |
| SWE-bench Verified (solved correctly) | not published at preview | 95 |

Codex wins the loop. Fable wins the patch.
SWE-bench is independent, but running a model on it needs access, and Sol is in gated preview. So Fable 5's scores are independently verified, while Sol's lone number is the Terminal-Bench figure OpenAI self-reported. Sources: OpenAI preview, Anthropic, Artificial Analysis, SWE-bench leaderboard.

Read together, the pattern is telling, with one caveat worth stating plainly. Fable optimizes for the quality of the patch, Codex for the quality of the loop. One is graded on the answer, the other on the workflow. But the evidence is not symmetric: Fable is generally available, so its SWE-bench and GDPval numbers are independently run, while Sol is gated in preview, so its single Terminal-Bench figure is OpenAI's own and cannot yet be checked by anyone outside. Take the loop win as reported, not confirmed.

One benchmark matters more than any coding test for a company actually deploying AI, and it points the same way. GDPval scores models on real economic work across 44 occupations, not puzzles, which makes it the closest thing to a test of whether a model can do a job. Fable 5 leads it on the independent Artificial Analysis leaderboard. Sol has no GDPval number either: it is gated, so the benchmark has not been run on it. The model that is hardest to evaluate is also the one you cannot buy.

Round 3: What They Think an Agent Is

This is the real fight, and it is a philosophical one. Both labs agree the future is agentic. They disagree on the shape.

Anthropic's bet is depth. Fable 5 is one extraordinarily capable agent that you point at a goal and leave alone. It runs for hours, tests its own work, and self-corrects before you ever see the output. The whole design assumes a single agent reliable enough to trust unsupervised over a long horizon.

"Fable can run for hours, tests its own work, and often produces better code than me. My job is increasingly about direction and setup, not supervision."

Anthropic Claude Code Team  ·  June 9, 2026

OpenAI's bet is breadth. GPT-5.6 ships an ultra mode that does not make one agent think longer. Instead, it spins up subagents, runs them in parallel, and recombines the results. Orchestration is baked into the model. And Codex, the product on top, reaches out into the whole computer: a desktop app with an in-app browser and computer use, so the agent acts in the tools you already have rather than waiting on an API.
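The fan-out-and-recombine pattern is simple to sketch, even though how GPT-5.6 does it internally is not public. The example below is a generic illustration using Python's `asyncio`; `run_subagent`, `fan_out`, `recombine`, and `orchestrate` are hypothetical names, and the subagent body is a stand-in for a real model call, not OpenAI's API.

```python
import asyncio


async def run_subagent(subtask: str) -> str:
    """Stand-in for one subagent; a real version would call a model here."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"done: {subtask}"


async def fan_out(subtasks: list[str]) -> list[str]:
    """Run every subtask concurrently and collect results in input order."""
    return await asyncio.gather(*(run_subagent(s) for s in subtasks))


def recombine(results: list[str]) -> str:
    """Merge subagent outputs; here a plain join, in practice another step."""
    return "\n".join(results)


def orchestrate(subtasks: list[str]) -> str:
    """Fan out, wait for all subagents, then recombine into one answer."""
    return recombine(asyncio.run(fan_out(subtasks)))


if __name__ == "__main__":
    print(orchestrate(["write tests", "update docs", "fix lint"]))
```

The design point is that wall-clock time tracks the slowest subtask rather than the sum of all of them, which is why this shape suits many small, independent jobs better than one long chain of reasoning.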

"I built so many workflows relying on those two things, browser and computer use, instead of hunting for APIs."

@petergyang, on switching to Codex  ·  June 2026

Fig. 3
Same month, opposite bets
Depth: Claude Fable 5. One agent, runs for hours. Premium, self-verifying.
Breadth: Codex 5.6. Many agents, every desk. Cheap, orchestrated, broad.
Not better or worse, pointed at different problems.
One agent you trust to go deep, or many agents that go everywhere. That is the choice.

Round 4: Safety and Access

Both shipped with guardrails, and both handle the highest-risk work by routing around it, but in opposite directions. Fable 5 auto-detects sensitive cyber, biology, or chemistry queries and reroutes them to Claude Opus 4.8, telling you when it does. The capability is throttled at the query level. Codex throttles at the access level: the frontier Sol tier is gated to roughly 20 approved companies under a US directive, while Terra and Luna stay open.

For a normal company building normal workflows, neither limit bites. You will run Fable 5 as-is and Codex on Terra. But it is worth knowing the frontier of each is fenced, just with a different fence.

The Verdict: Run Both

We do not pick one. We keep both in the stack and reach for the one the job calls for. The dividing line is not quality, it is the shape of the work.

Reach for Fable 5 when

The job is depth

Long, unsupervised runs. A hard, self-contained task you want done overnight without babysitting.
Quality of output matters most. Gnarly code, dense reasoning, a patch that has to be right the first time.
You will pay for it. The work is worth a premium and you are buying labor, not tokens.

Reach for Codex 5.6 when

The job is breadth

Work spans real apps. The task lives in a browser, a SaaS tool, a desktop, not just a repo.
Volume and parallelism. Many small jobs, run constantly, fanned out across subagents.
It has to reach non-engineers. Finance, ops, and comms need to drive it, not just developers.
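
The two columns above can be written down as a small routing rule. This is a minimal sketch, assuming each criterion is a yes/no question and that a simple tally decides; the `Job` fields and the `pick_model` function are hypothetical names for illustration, and the tally is an assumption, not a formal scoring method.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Depth signals (the Fable 5 column).
    long_unsupervised: bool = False
    quality_critical: bool = False
    premium_budget: bool = False
    # Breadth signals (the Codex 5.6 column).
    spans_real_apps: bool = False
    high_volume_parallel: bool = False
    non_engineer_users: bool = False


def pick_model(job: Job) -> str:
    """Tally depth and breadth signals and return the side with more hits."""
    depth = sum([job.long_unsupervised, job.quality_critical, job.premium_budget])
    breadth = sum(
        [job.spans_real_apps, job.high_volume_parallel, job.non_engineer_users]
    )
    if depth > breadth:
        return "Claude Fable 5"
    if breadth > depth:
        return "Codex 5.6"
    return "either"  # no clear shape: decide on price or team preference


if __name__ == "__main__":
    overnight_refactor = Job(long_unsupervised=True, quality_critical=True)
    print(pick_model(overnight_refactor))
```

The inputs describe the job, not the model, so the rule survives a leaderboard reshuffle: only the returned names would need to change.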

Notice what is missing from both columns: the model name as a strategy. The tiers are converging, the prices are falling, and the leaderboard reshuffles every few weeks. Whichever is best this month, the other will match next month. Betting your company on a specific model is betting on a snapshot.

The durable work is upstream of the model. Clean context, defined goals, connected data, a clear definition of done. Get that right for one function and it runs on Fable, on Codex, on whatever ships in August. The model is not the moat. The rollout is. That is the work nativefirst does on site.

Stop benchmarking models. Start making functions agent-ready.

Book a free Diagnostic: 30 to 45 minutes, no deck, no pitch. We map which functions in your company are ready for an agent, what the data and context layer looks like, and which one to ship first so the rest can follow, on whichever model wins this month.

Sources
1. Anthropic, Claude Fable 5 launch, June 9, 2026, and third-party coverage (TechCrunch, VentureBeat, finout.io, llm-stats). Mythos-class model above Opus 4.8, $10 / $50 per million tokens, 95% SWE-bench Verified and 80.3% SWE-bench Pro, 1M-token context, and automatic rerouting of sensitive cyber, biology, and chemistry queries to Opus 4.8.
2. OpenAI, "Previewing GPT-5.6 Sol," June 2026, and 9to5Mac, June 26, 2026. Three-tier line (Sol, Terra, Luna), per-million pricing ($5 / $30, $2.50 / $15, $1 / $6), ultra mode running subagents, Terminal-Bench 2.1 state of the art, and the Sol access restriction to roughly 20 approved companies.
3. Anthropic Claude Code team, June 9, 2026, on Fable running for hours and testing its own work. Peter Yang, June 2026, on switching to Codex for its browser and computer-use workflows.
4. Published GPT-5.6-vs-Fable-5 benchmark comparisons, June 2026 (explainx.ai, lushbinary, andrew.ooo), drawing on OpenAI and Anthropic figures: Terminal-Bench 2.1 (Sol 88.8, ultra 91.9; Fable 5 83.4) and SWE-bench (Fable 5 95% Verified, 80.3% Pro; Sol not published at preview, GPT-5.5 reference 58.6%).
John Tan

Founder and CEO of nativefirst.ai. Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.