Codex 5.6 vs Claude Fable 5. Two Bets.

Within three weeks this June, the two labs that matter most each shipped their most capable agent. Anthropic released Claude Fable 5 on June 9. OpenAI previewed GPT-5.6, the engine under Codex, on June 26. Every comparison since has asked the same question: which one won? Every one of them is asking it wrong.

Put the two side by side and they are not the same machine. Strip the marketing and you find two different bets on what an agent even is. We run both at nativefirst, in production, every week, and we do not pick a winner, we pick a job. Here is the scorecard, the two philosophies under it, and how we decide which one to reach for.

Fig. 1. The scorecard

Two flagships, three weeks apart. The interesting differences are not in the price row.

Round 1: Price

This one is not close, and it should not be read as a verdict. Codex undercuts Fable at every tier. The Codex everyday model, Terra, runs $2.50 input and $15 output per million tokens. Fable 5 is $10 and $50, four times the input and more than three times the output. Even Codex's frontier Sol tier, at $5 and $30, comes in below Fable.
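To make the price gap concrete, here is a minimal sketch that prices one job at the per-million-token rates quoted above. The workload size (2M input tokens, 500k output tokens) is a made-up assumption for illustration, and `job_cost` is a hypothetical helper, not part of either vendor's tooling.

```python
# Per-million-token list prices quoted in this article, in USD: (input, output).
PRICES = {
    "Codex Terra": (2.50, 15.00),
    "Codex Sol": (5.00, 30.00),
    "Fable 5": (10.00, 50.00),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one job at the model's per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 2M input tokens and 500k output tokens.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 2_000_000, 500_000):.2f}")
```

On that assumed workload the job costs $12.50 on Terra, $25.00 on Sol, and $45.00 on Fable 5. The ratio depends on the input-to-output mix, so it is worth rerunning with your own token counts.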

Fable is priced as the premium instrument, and Anthropic is unapologetic about it. The case for paying it is a labor case, not a token case: a week of senior-level work delivered overnight still costs less than one billable consultant hour. Codex is priced for volume, for work that runs constantly in the background. The pricing differs because the two labs expect their models to be used differently.

Round 2: Raw Coding

The two labs report different benchmarks, which makes a clean apples-to-apples read hard. But enough numbers are public to see the shape, and it is not a tie. Each lab leads on the benchmark it chose to lead with.

Terminal-Bench 2.1, the agentic terminal benchmark OpenAI leads with. Per OpenAI's preview numbers, Sol scores 88.8 to Fable 5's 83.4, and Sol's ultra mode pushes it to 91.9. Codex wins the loop.
SWE-bench, the patch-quality suite Anthropic leads with. Fable 5 posts 95% on Verified and 80.3% on Pro, the highest publicly reported scores of any GA model. OpenAI did not publish a SWE-bench Pro score for Sol at preview; its predecessor, GPT-5.5, sat at 58.6%. Fable wins the patch.

Fig. 2. The benchmarks, where each lab is strong

SWE-bench is independent, but running a model on it needs access, and Sol is in gated preview. So Fable 5's scores are independently verified, while Sol's lone number is the Terminal-Bench figure OpenAI self-reported. Sources: OpenAI preview, Anthropic, Artificial Analysis, SWE-bench leaderboard.

Read together, the pattern is telling, with one caveat worth stating plainly. Fable optimizes for the quality of the patch, Codex for the quality of the loop. One is graded on the answer, the other on the workflow. But the evidence is not symmetric: Fable is generally available, so its SWE-bench and GDPval numbers are independently run, while Sol is gated in preview, so its single Terminal-Bench figure is OpenAI's own and cannot yet be checked by anyone outside. Take the loop win as reported, not confirmed.

One benchmark matters more than any coding test for a company actually deploying AI, and it also points toward Fable. GDPval scores models on real economic work across 44 occupations, not puzzles, which makes it the closest thing to a test of whether a model can do a job. Fable 5 leads it on the independent Artificial Analysis leaderboard. Sol has no GDPval number either: it is gated, so nobody outside OpenAI has run it. The model that is hardest to evaluate is also the one you cannot buy.

Round 3: What They Think an Agent Is

This is the real fight, and it is a philosophical one. Both labs agree the future is agentic. They disagree on the shape.

Anthropic's bet is depth. Fable 5 is one extraordinarily capable agent that you point at a goal and leave alone. It runs for hours, tests its own work, and self-corrects before you ever see the output. The whole design assumes a single agent reliable enough to trust unsupervised over a long horizon.

"Fable can run for hours, tests its own work, and often produces better code than me. My job is increasingly about direction and setup, not supervision."

Anthropic Claude Code Team · June 9, 2026

OpenAI's bet is breadth. GPT-5.6 ships an ultra mode that does not make one agent think longer. Instead, it spins up subagents, runs them in parallel, and recombines the results. Orchestration is baked into the model. And Codex, the product on top, reaches out into the whole computer: a desktop app with an in-app browser and computer use, so the agent acts in the tools you already have rather than waiting on an API.
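The fan-out and recombine pattern described above can be sketched in a few lines. This is a generic illustration of the pattern using Python's `asyncio`, not OpenAI's implementation: `run_subagent`, `fan_out`, and `recombine` are hypothetical names, and the subagent is a stub that returns a string where a real one would call a model.

```python
import asyncio


async def run_subagent(name: str, subtask: str) -> str:
    """Stand-in for one subagent; a real one would call a model here."""
    await asyncio.sleep(0)  # yield control so the subagents interleave
    return f"{name} finished: {subtask}"


async def fan_out(subtasks: list[str]) -> list[str]:
    """Run one subagent per subtask concurrently; results keep input order."""
    jobs = [run_subagent(f"agent-{i}", task) for i, task in enumerate(subtasks)]
    return await asyncio.gather(*jobs)


def recombine(results: list[str]) -> str:
    """Merge subagent outputs into one report (here, a simple join)."""
    return "\n".join(results)


report = recombine(asyncio.run(fan_out(["write tests", "update docs", "fix lint"])))
print(report)
```

The design point is that the caller sees one request and one report. The parallelism lives inside the orchestration step, which is what "baked into the model" would mean from the outside.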

"I built so many workflows relying on those two things, browser and computer use, instead of hunting for APIs."

Peter Yang (@petergyang), on switching to Codex · June 2026

Fig. 3. Same month, opposite bets

One agent you trust to go deep, or many agents that go everywhere. That is the choice.

Round 4: Safety and Access

Both shipped with guardrails, and both handle the highest-risk work by routing around it, but in opposite directions. Fable 5 auto-detects sensitive cyber, biology, or chemistry queries and reroutes them to Claude Opus 4.8, telling you when it does. The capability is throttled at the query level. Codex throttles at the access level: the frontier Sol tier is gated to roughly 20 approved companies under a US directive, while Terra and Luna stay open.
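The difference between the two fences is easier to see side by side. The sketch below is a toy model of the two policies, not either vendor's real logic: the topic labels stand in for whatever detection Anthropic actually runs, and `APPROVED_ORGS` with its single entry is a hypothetical allowlist.

```python
from typing import Optional

SENSITIVE_TOPICS = {"cyber", "biology", "chemistry"}  # categories named in the article
APPROVED_ORGS = {"example-approved-co"}  # hypothetical allowlist, for illustration


def route_query(topic: str) -> tuple[str, Optional[str]]:
    """Query-level fence: sensitive topics go to a fallback model, with a notice."""
    if topic in SENSITIVE_TOPICS:
        return "Claude Opus 4.8", "rerouted: sensitive topic"
    return "Claude Fable 5", None


def allowed_tiers(org: str) -> list[str]:
    """Access-level fence: the frontier tier is offered only to approved orgs."""
    tiers = ["Terra", "Luna"]
    if org in APPROVED_ORGS:
        tiers.insert(0, "Sol")
    return tiers


print(route_query("chemistry"))
print(allowed_tiers("some-startup"))
```

In the first function everyone gets the model and individual requests are diverted. In the second, the requests are untouched and the customer list is what gets filtered.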

For a normal company building normal workflows, neither limit bites. You will run Fable 5 as-is and Codex on Terra. But it is worth knowing the frontier of each is fenced, just with a different fence.

The Verdict: Run Both

We do not pick one. We keep both in the stack and reach for the one the job calls for. The dividing line is not quality, it is the shape of the work.

Reach for Fable 5 when the job is depth

Long, unsupervised runs. A hard, self-contained task you want done overnight without babysitting.

Quality of output matters most. Gnarly code, dense reasoning, a patch that has to be right the first time.

You will pay for it. The work is worth a premium and you are buying labor, not tokens.

Reach for Codex 5.6 when the job is breadth

Work spans real apps. The task lives in a browser, a SaaS tool, a desktop, not just a repo.

Volume and parallelism. Many small jobs, run constantly, fanned out across subagents.

It has to reach non-engineers. Finance, ops, and comms need to drive it, not just developers.
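The two checklists above can be written down as a small routing rule. This is a rough sketch of the idea, assuming each criterion counts equally. The `Job` fields and the `pick_agent` function are hypothetical names invented for the example, and a real decision would weigh the criteria rather than count them.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Depth signals (the Fable 5 column)
    long_unsupervised: bool = False   # hard, self-contained, runs overnight
    quality_critical: bool = False    # the patch has to be right the first time
    premium_budget: bool = False      # you are buying labor, not tokens
    # Breadth signals (the Codex 5.6 column)
    spans_real_apps: bool = False     # lives in a browser, SaaS tool, or desktop
    high_volume: bool = False         # many small jobs, fanned out
    non_engineer_users: bool = False  # finance, ops, and comms drive it


def pick_agent(job: Job) -> str:
    """Count depth and breadth signals and return the side with more of them."""
    depth = sum([job.long_unsupervised, job.quality_critical, job.premium_budget])
    breadth = sum([job.spans_real_apps, job.high_volume, job.non_engineer_users])
    if depth > breadth:
        return "Fable 5"
    if breadth > depth:
        return "Codex 5.6"
    return "either"


print(pick_agent(Job(long_unsupervised=True, quality_critical=True)))
print(pick_agent(Job(spans_real_apps=True, high_volume=True)))
```

The first call prints `Fable 5` and the second prints `Codex 5.6`. A job with no signals, or an even split, returns `either`, which is the honest answer when the shape of the work is not yet clear.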

Notice what is missing from both columns: the model name as a strategy. The tiers are converging, the prices are falling, and the leaderboard reshuffles every few weeks. Whichever is best this month, the other will match next month. Betting your company on a specific model is betting on a snapshot.

The durable work is upstream of the model. Clean context, defined goals, connected data, a clear definition of done. Get that right for one function and it runs on Fable, on Codex, on whatever ships in August. The model is not the moat. The rollout is. That is the work nativefirst does on site.
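One way to picture "upstream of the model" is a task definition that never names a model at all. The sketch below is an assumption about what such a spec could look like, not a description of nativefirst's tooling: `AgentTask` and its fields are hypothetical, mapped one-to-one onto the four ingredients listed above.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTask:
    goal: str                                                     # defined goal
    context: list[str] = field(default_factory=list)              # clean context
    data_sources: list[str] = field(default_factory=list)         # connected data
    definition_of_done: list[str] = field(default_factory=list)   # clear "done"

    def missing(self) -> list[str]:
        """Return the names of the fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]

    def is_agent_ready(self) -> bool:
        """A task is ready once every field has been filled in."""
        return not self.missing()


task = AgentTask(goal="Close the monthly books")
print(task.missing())
```

For the example task, `missing()` returns `['context', 'data_sources', 'definition_of_done']`. Nothing in the spec changes when the model underneath does, which is the point of the paragraph above.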

Stop benchmarking models. Start making functions agent-ready.

Book a free Diagnostic: 30 to 45 minutes, no deck, no pitch. We map which functions in your company are ready for an agent, what the data and context layer looks like, and which one to ship first so the rest can follow, on whichever model wins this month.

Sources

1. Anthropic, Claude Fable 5 launch, June 9, 2026, and third-party coverage (TechCrunch, VentureBeat, finout.io, llm-stats). Mythos-class model above Opus 4.8, $10 / $50 per million tokens, 95% SWE-bench Verified and 80.3% SWE-bench Pro, 1M-token context, and automatic rerouting of sensitive cyber, biology, and chemistry queries to Opus 4.8.

2. OpenAI, "Previewing GPT-5.6 Sol," June 2026, and 9to5Mac, June 26, 2026. Three-tier line (Sol, Terra, Luna), per-million pricing ($5 / $30, $2.50 / $15, $1 / $6), ultra mode running subagents, Terminal-Bench 2.1 state of the art, and the Sol access restriction to roughly 20 approved companies.

3. Anthropic Claude Code team, June 9, 2026, on Fable running for hours and testing its own work. Peter Yang, June 2026, on switching to Codex for its browser and computer-use workflows.

4. Published GPT-5.6-vs-Fable-5 benchmark comparisons, June 2026 (explainx.ai, lushbinary, andrew.ooo), drawing on OpenAI and Anthropic figures: Terminal-Bench 2.1 (Sol 88.8, ultra 91.9; Fable 5 83.4) and SWE-bench (Fable 5 95% Verified, 80.3% Pro; Sol not published at preview, GPT-5.5 reference 58.6%).

John Tan

Founder and CEO of nativefirst.ai. Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.