- Claude Fable 5 leads SWE-bench Pro at 80.3%, 11 points ahead of Opus 4.8 (69.2%) and 22 ahead of GPT-5.5 (58.6%). The biggest single-generation jump on record.
- SWE-bench Verified is dead. OpenAI stopped evaluating models on it in February 2026, calling it "effectively saturated and also highly contaminated." Berkeley researchers later hit 100% with 10 lines of test config.
- FrontierCode is the successor, and it resets everything: every frontier model scores under 30% on FrontierCode Diamond. Fable 5: 29.3%. Opus 4.8: 13.4%.
Anthropic shipped Claude Fable 5 yesterday, and the headline benchmark number is real: 80.3% on SWE-bench Pro. That is the highest score ever posted, by the widest margin ever posted. It is also the least interesting part of the story.
The interesting part is that the benchmark Fable 5 just conquered spent the first half of 2026 getting publicly dismantled. If you are using SWE-bench scores to decide which model your engineering team should run, you need both halves of this story.
The Leaderboard, June 2026
| Rank | Model | SWE-bench Pro score |
|---|---|---|
| 1 | Claude Fable 5 | 80.3% |
| 2 | Claude Opus 4.8 | 69.2% |
| 3 | GPT-5.5 | 58.6% |
For context: the gap between Fable 5 and Opus 4.8 (+11 points) is bigger than the gap that used to separate Opus from Gemini. One generation, in six months, opened more distance than used to separate the frontier competitors from one another. The model is genuinely better at fixing software. That part is not in question.
What is in question is whether the number means what you think it means.
SWE-bench Verified Is Officially Dead
In February 2026, OpenAI's Frontier Evals team quietly stopped evaluating models on SWE-bench Verified. Their stated reason, via Mia Glaese and Olivia Watkins on Latent Space, was blunt: the benchmark is "effectively saturated and also highly contaminated."
Contaminated means something specific here. SWE-bench tasks are real GitHub issues from public repos. Those repos, the issues, and the gold-standard fixes are all in the training data of every frontier model. OpenAI found that models could reproduce gold patches verbatim from nothing but the task ID. Not solve the problem. Recall the answer.
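If you want to probe for recall on your own tasks, one rough approach is to compare what the model produces against the known fix and flag near-verbatim matches. The sketch below is an illustration only, not OpenAI's audit method: the function names and the 0.9 threshold are assumptions, and it uses line-level similarity from Python's standard `difflib`.

```python
from difflib import SequenceMatcher


def _normalize(patch: str) -> list[str]:
    # Drop blank lines and surrounding whitespace so trivial formatting
    # differences do not hide a verbatim match.
    return [line.strip() for line in patch.splitlines() if line.strip()]


def verbatim_overlap(candidate_patch: str, gold_patch: str) -> float:
    """Return a 0..1 line-level similarity between two patches."""
    gold_lines = _normalize(gold_patch)
    if not gold_lines:
        return 0.0  # nothing to compare against
    return SequenceMatcher(None, _normalize(candidate_patch), gold_lines).ratio()


def looks_memorized(candidate_patch: str, gold_patch: str,
                    threshold: float = 0.9) -> bool:
    # The threshold is a hypothetical cutoff; tune it on your own data.
    return verbatim_overlap(candidate_patch, gold_patch) >= threshold
```

A high score on a task where the model saw only an identifier, and none of the code, is a sign of recall rather than problem-solving. A high score after the model read the repo proves much less, since small fixes often have one obvious form.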
The audit numbers were worse than the contamination. 59.4% of audited hard tasks had flawed tests. Over 60% of audited failures were unsolvable as stated. So the benchmark was simultaneously too easy (memorized answers) and broken (impossible tasks), and the score blended both failure modes into one clean-looking percentage.
Then Berkeley Broke It With 10 Lines
In April 2026, UC Berkeley researchers led by Hao Wang published the kill shot. They scored 100% on SWE-bench Verified without solving a single task. The method: a 10-line conftest.py that gamed the test harness itself. No reasoning, no patches, no model. A config file.
They went on to break 8 major agent benchmarks the same way. Their conclusion is the one line from this whole saga worth pinning to your wall:
"Don't trust the number. Trust the methodology."
This is not an academic gotcha. Every vendor pitch, every model selection memo, every board slide that cites a SWE-bench score is citing a number that a grad student can max out with a config file.
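If you run your own harness, a cheap first line of defense is to flag any submission that touches files controlling how tests are collected or reported. This is a minimal sketch under stated assumptions: it is not the Berkeley team's exploit and not any benchmark's actual safeguard, and both the `harness_files_touched` name and the denylist are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical denylist. pytest auto-loads conftest.py and reads
# configuration from pytest.ini and tox.ini, so a patch that edits
# them can change what "passing" means.
HARNESS_FILES = {"conftest.py", "pytest.ini", "tox.ini"}


def harness_files_touched(changed_paths: list[str]) -> list[str]:
    """Return the changed paths whose file name is on the denylist."""
    return [p for p in changed_paths if PurePosixPath(p).name in HARNESS_FILES]


# Usage: route any flagged patch to a human instead of trusting its score.
flagged = harness_files_touched(["src/billing.py", "tests/conftest.py"])
print(flagged)  # ['tests/conftest.py']
```

A filename check will not catch every trick, which is the researchers' point: the grading method needs the same scrutiny as the number it produces.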
FrontierCode: The Reset
On June 8, two days before Fable 5 launched, Cognition released the successor: FrontierCode. It was built with more than 20 open-source maintainers, and it changes the grading question entirely. SWE-bench asks: do the tests pass? FrontierCode asks: would a maintainer merge this?
Cognition's framing: "Where others grade like a CI, FrontierCode grades like a tech lead." Mergeability, not test-passing. Code quality, side effects, whether the fix belongs in the codebase at all.
The results are humbling. Every frontier model scores under 30% on FrontierCode Diamond. Fable 5 leads at 29.3%. Opus 4.8 sits at 13.4%. And the eval has an 81% lower false-positive rate than SWE-bench Pro, meaning that when it says a fix is good, a human tech lead is far more likely to agree.
Same model. Same week. 80.3% on one leaderboard, 29.3% on the other.
What the 50-Point Spread Actually Tells You
The spread between 80% and sub-30% is not noise. It is the precise distance between "the tests pass" and "a tech lead would merge this." That distance is where your production incidents live. It is where the AI-generated PR that passed CI but quietly broke the billing edge case lives.
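The arithmetic behind that spread, using only the scores quoted in this article:

```python
# Scores as reported in this article, in percent.
scores = {
    "Claude Fable 5": {"swe_bench_pro": 80.3, "frontiercode_diamond": 29.3},
    "Claude Opus 4.8": {"swe_bench_pro": 69.2, "frontiercode_diamond": 13.4},
}

# Points lost when the question changes from "do the tests pass?"
# to "would a maintainer merge this?"
spread = {
    model: round(s["swe_bench_pro"] - s["frontiercode_diamond"], 1)
    for model, s in scores.items()
}

for model, points in spread.items():
    print(f"{model}: {points} points")
# Claude Fable 5: 51.0 points
# Claude Opus 4.8: 55.8 points
```

Both models give up more than 50 points between the two leaderboards, so the gap is not specific to one model.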
And here is the operator conclusion. Public benchmarks are contaminated by construction: anything public long enough to be a standard is in the training data. The only eval that cannot be contaminated is the one nobody published. Your repo. Your tickets. Your definition of mergeable, graded by your own tech leads on work the model has never seen.
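In practice, the bookkeeping for a private eval like this can be very small. The sketch below assumes you record two verdicts per ticket: whether CI passed on the model's patch, and whether one of your tech leads would merge it. The `Attempt` fields, the `summarize` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    ticket_id: str      # hypothetical internal ticket reference
    tests_passed: bool  # did CI go green on the model's patch?
    would_merge: bool   # verdict from one of your own tech leads


def summarize(attempts: list[Attempt]) -> dict[str, float]:
    """Report pass rate, merge rate, and how often green CI was rejected."""
    total = len(attempts)
    if total == 0:
        raise ValueError("no attempts to summarize")
    passed = [a for a in attempts if a.tests_passed]
    merged = [a for a in attempts if a.would_merge]
    rejected_green = [a for a in passed if not a.would_merge]
    return {
        "pass_rate": len(passed) / total,
        "merge_rate": len(merged) / total,
        # Share of test-passing patches a tech lead still turned down.
        "green_but_rejected": len(rejected_green) / len(passed) if passed else 0.0,
    }


# Made-up data for illustration only.
sample = [
    Attempt("T-1", tests_passed=True, would_merge=True),
    Attempt("T-2", tests_passed=True, would_merge=False),
    Attempt("T-3", tests_passed=True, would_merge=False),
    Attempt("T-4", tests_passed=False, would_merge=False),
]
report = summarize(sample)
```

On this made-up sample the pass rate is 0.75 and the merge rate is 0.25. The `green_but_rejected` figure is the one to watch: it measures, on your own code, the distance between passing CI and being mergeable.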
That eval does not exist off the shelf and never will. It gets built inside your company, next to your codebase, by someone who sits with your engineers and learns what "good" means on your team. This is exactly why I work embedded rather than advising from a deck. You cannot scope an honest eval from outside the building.
Fable 5 is the best coding model ever shipped. The leaderboard it tops is dying. Both things are true, and the second one is the one that should change how you buy.
Trust the methodology.
Run the only honest benchmark.
The Diagnostic is free: 30–45 minutes. We'll scope the eval that matters, Fable 5 against your actual backlog, graded by your own standards.