In three weeks this June, the two labs that matter most each shipped their most capable agent. Anthropic released Claude Fable 5 on June 9. OpenAI previewed GPT-5.6, the engine under Codex, on June 26. Every comparison since has asked the same question, which one won, and every one of them is asking it wrong.
Put the two side by side and they are not the same machine. Strip the marketing and you find two different bets on what an agent even is. We run both at nativefirst, in production, every week, and we do not pick a winner, we pick a job. Here is the scorecard, the two philosophies under it, and how we decide which one to reach for.
Round 1: Price
This one is not close, but it should not be read as a verdict. Codex undercuts Fable at every tier. The Codex everyday model, Terra, runs $2.50 input and $15 output per million tokens. Fable 5 is $10 and $50, four times Terra's input price and more than three times its output price. Even Codex's frontier Sol tier, at $5 and $30, comes in below Fable.
Fable is priced as the premium instrument, and Anthropic is unapologetic about it. The case for paying that price is a labor case, not a token case: a senior week of work delivered overnight still costs less than one billable consultant hour. Codex is priced for volume, for work that runs constantly in the background. The pricing differs because the two labs expect their models to be used differently.
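To make the gap concrete, here is a minimal sketch that turns the per-million-token prices quoted above into a cost per run. The model keys and the workload size (2 million input tokens, 400,000 output tokens) are assumptions chosen for illustration, not figures from either lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens, as quoted in this article."""
    input_per_m: float
    output_per_m: float


# Keys are illustrative labels, not official model identifiers.
PRICES = {
    "codex-terra": Price(2.50, 15.00),
    "codex-sol": Price(5.00, 30.00),
    "fable-5": Price(10.00, 50.00),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one run at the listed token prices."""
    price = PRICES[model]
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: one agent run over a mid-sized repository.
    for model in PRICES:
        print(f"{model}: ${job_cost(model, 2_000_000, 400_000):.2f}")
```

Under that assumed workload the run costs $11 on Terra, $22 on Sol, and $40 on Fable 5. The ratio shifts with the mix of input and output tokens, which is why the per-token multiples alone do not settle the question.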
Round 2: Raw Coding
The two labs report different benchmarks, which makes a clean apples-to-apples read hard. But enough numbers are public to see the shape, and it is not a tie. Each lab leads on the benchmark it chose to lead with.
- Terminal-Bench 2.1, the agentic terminal benchmark OpenAI leads with. Per OpenAI's preview numbers, Sol scores 88.8 to Fable 5's 83.4, and Sol's ultra mode pushes it to 91.9. Codex wins the loop.
- SWE-bench, the patch-quality suite Anthropic leads with. Fable 5 posts 95% Verified and 80.3% Pro, the highest publicly reported of any GA model. OpenAI did not publish a SWE-bench Pro score for Sol at preview; its predecessor, GPT-5.5, sat at 58.6%. Fable wins the patch.
Read together, the pattern is telling, with one caveat worth stating plainly. Fable optimizes for the quality of the patch, Codex for the quality of the loop. One is graded on the answer, the other on the workflow. But the evidence is not symmetric: Fable is generally available, so its SWE-bench and GDPval numbers are independently run, while Sol is gated in preview, so its Terminal-Bench figures are OpenAI's own and cannot yet be checked by anyone outside. Take the loop win as reported, not confirmed.
One benchmark matters more than any coding test for a company actually deploying AI, and it points the same way. GDPval scores models on real economic work across 44 occupations, not puzzles, which makes it the closest thing to a test of whether a model can do a job. Fable 5 leads it on the independent Artificial Analysis leaderboard. Sol has no GDPval number either: it is gated, so no one outside has run it. The model that is hardest to evaluate is also the one you cannot buy.
Round 3: What They Think an Agent Is
This is the real fight, and it is a philosophical one. Both labs agree the future is agentic. They disagree on the shape.
Anthropic's bet is depth. Fable 5 is one extraordinarily capable agent that you point at a goal and leave alone. It runs for hours, tests its own work, and self-corrects before you ever see the output. The whole design assumes a single agent reliable enough to trust unsupervised over a long horizon.
Code
"Fable can run for hours, tests its own work, and often produces better code than me. My job is increasingly about direction and setup, not supervision."
OpenAI's bet is breadth. GPT-5.6 ships an ultra mode that does not make one agent think longer. Instead, it spins up subagents, runs them in parallel, and recombines the results. Orchestration is baked into the model. And Codex, the product on top, reaches out into the whole computer: a desktop app with an in-app browser and computer use, so the agent acts in the tools you already have rather than waiting on an API.
Yang
"I built so many workflows relying on those two things, browser and computer use, instead of hunting for APIs."
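The breadth bet is easier to picture as code. What follows is a minimal sketch of the fan-out-and-recombine pattern described for ultra mode, not OpenAI's implementation or API: `run_subagent`, `split_goal`, and the semicolon-delimited goal format are all hypothetical stand-ins, and the subagent does no real model call.

```python
import asyncio


async def run_subagent(name: str, subtask: str) -> str:
    """Hypothetical subagent: a real one would call a model here."""
    await asyncio.sleep(0)  # yield control, standing in for network latency
    return f"{name}: done {subtask}"


def split_goal(goal: str) -> list[str]:
    """Assumed decomposition: subtasks are separated by semicolons."""
    return [part.strip() for part in goal.split(";") if part.strip()]


async def fan_out(goal: str) -> list[str]:
    """Run one subagent per subtask concurrently, keeping subtask order."""
    subtasks = split_goal(goal)
    jobs = (run_subagent(f"agent-{i}", task) for i, task in enumerate(subtasks))
    # asyncio.gather returns results in the order the awaitables were passed.
    return list(await asyncio.gather(*jobs))


def recombine(results: list[str]) -> str:
    """Merge the subagent outputs into a single report."""
    return "\n".join(results)


def run(goal: str) -> str:
    return recombine(asyncio.run(fan_out(goal)))


if __name__ == "__main__":
    print(run("audit the billing module; write migration tests; update the docs"))
```

The depth bet would look different: one call, one long-running loop, and no recombination step, because there is only one agent to trust.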
Round 4: Safety and Access
Both shipped with guardrails, and both handle the highest-risk work by routing around it, but in opposite directions. Fable 5 auto-detects sensitive cyber, biology, or chemistry queries and reroutes them to Claude Opus 4.8, telling you when it does. The capability is throttled at the query level. Codex throttles at the access level: the frontier Sol tier is gated to roughly 20 approved companies under a US directive, while Terra and Luna stay open.
For a normal company building normal workflows, neither limit bites. You will run Fable 5 as-is and Codex on Terra. But it is worth knowing the frontier of each is fenced, just with a different fence.
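The two fences can be sketched side by side. This is an illustration of the distinction only, under loud assumptions: the model identifier strings, the idea that a query arrives with a domain label already attached, and the `approved_orgs` set are all hypothetical, and neither lab has published its mechanism in this form.

```python
# Domains the article says Fable 5 reroutes.
SENSITIVE_DOMAINS = {"cyber", "biology", "chemistry"}

# Tiers the article says stay open; Sol is the gated one.
OPEN_TIERS = {"terra", "luna"}


def route_query(domain_label: str) -> tuple[str, bool]:
    """Query-level fence: pick a model per request.

    Returns (model, rerouted). The model names are placeholder strings,
    and the domain label is assumed to come from an upstream classifier.
    """
    if domain_label in SENSITIVE_DOMAINS:
        return ("claude-opus-4.8", True)
    return ("claude-fable-5", False)


def can_access(tier: str, org: str, approved_orgs: set[str]) -> bool:
    """Access-level fence: decide once per organization, not per request."""
    if tier in OPEN_TIERS:
        return True
    return org in approved_orgs


if __name__ == "__main__":
    print(route_query("biology"))
    print(can_access("sol", "example-co", approved_orgs=set()))
```

The practical difference is where the check happens. A query-level fence lets everyone in and narrows individual requests, while an access-level fence decides once who gets in and then leaves their requests alone.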
The Verdict: Run Both
We do not pick one. We keep both in the stack and reach for the one the job calls for. The dividing line is not quality; it is the shape of the work.
The job is depth
The job is breadth
Notice what is missing from both columns: the model name as a strategy. The tiers are converging, the prices are falling, and the leaderboard reshuffles every few weeks. Whichever is best this month, the other will match next month. Betting your company on a specific model is betting on a snapshot.
The durable work is upstream of the model. Clean context, defined goals, connected data, a clear definition of done. Get that right for one function and it runs on Fable, on Codex, on whatever ships in August. The model is not the moat. The rollout is. That is the work nativefirst does on site.
Stop benchmarking models. Start making functions agent-ready.
Book a free Diagnostic: 30 to 45 minutes, no deck, no pitch. We map which functions in your company are ready for an agent, what the data and context layer looks like, and which one to ship first so the rest can follow, on whichever model wins this month.