What Is SWE-bench? The AI Coding Benchmark, Explained

TL;DR

SWE-bench tests whether AI models can fix real GitHub issues from real open-source repos, graded by whether the repo's tests pass.
The famous version, SWE-bench Verified, is dead. OpenAI retired it in February 2026 after saturation and contamination made the scores meaningless.
Passing tests is not the same as mergeable code. The same model scores 80% on SWE-bench Pro and under 30% on FrontierCode Diamond.

Fig. 1: The SWE-bench family tree. Three years of patching the ruler. The standard finally became the one your team uses. Sources: Jimenez et al. (Princeton, 2023); OpenAI; Scale AI; Cognition.

SWE-bench is a benchmark that tests whether AI models can resolve real GitHub issues from real open-source repositories. The model gets the issue description and the full codebase. It has to produce a patch. The patch is graded one way: do the repository's tests pass? It was created by researchers at Princeton and the University of Chicago in 2023, and it became the default scoreboard for AI coding ability.

If you have seen a headline like "Model X scores 80% on SWE-bench," this is the benchmark it refers to. Here is what that number actually measures, why the most-quoted version of it no longer exists, and how to read the scores without getting fooled.

The family tree

SWE-bench is not one benchmark anymore. It is a family, and each generation exists because the previous one stopped telling the truth.

SWE-bench (2023)

The original: 2,294 issues pulled from 12 popular Python repositories like Django and scikit-learn. Real issues, real codebases, real tests. At launch, the best models solved under 2% of them. That number is the whole story of the last three years.

SWE-bench Verified (2024)

OpenAI hired human engineers to vet the original tasks and found many were broken: ambiguous issue descriptions, tests that punished correct fixes. They published a clean 500-task subset called SWE-bench Verified. It became the standard number every lab quoted. It is also the version that died first. More on that below.

SWE-bench Pro (2025)

Harder tasks, longer multi-file fixes, and a contamination-resistant design that includes commercial repositories models could not have memorized during training. When Verified saturated, Pro became the working leaderboard.

FrontierCode (2026)

Cognition's answer to a deeper problem, launched June 8, 2026. FrontierCode does not grade on whether tests pass. It grades on whether a maintainer would actually merge the patch: code quality, style, fit with the codebase. That standard is brutal, and it is much closer to what you actually pay engineers for.

How to read a score

When a model scores 80% on SWE-bench Pro, it means the model resolved 80% of the benchmark's issues well enough for the existing test suites to pass. That is genuinely impressive. It is not the same as saying the model writes code your team would ship.
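The "resolved" criterion behind that percentage can be sketched in a few lines. This is a minimal illustration, not the official harness. It assumes the common SWE-bench convention that each task names tests the fix must turn green (fail-to-pass) and tests that must stay green (pass-to-pass). The `TaskResult` structure and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after the model's patch is applied (hypothetical structure)."""
    task_id: str
    fail_to_pass: dict[str, bool]  # tests the fix must turn green: did each one pass?
    pass_to_pass: dict[str, bool]  # tests that passed before the patch: do they still pass?


def is_resolved(result: TaskResult) -> bool:
    # A task counts only if there is at least one target test, every target
    # test passes, and nothing that worked before the patch is now broken.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved: the headline percentage."""
    if not results:
        raise ValueError("no results to score")
    return sum(is_resolved(r) for r in results) / len(results)
```

Note what is absent: nothing here looks at the patch itself. The score is entirely a function of test outcomes.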

Passing tests is a floor, not a ceiling. A patch can pass every test while being ugly, fragile, or wrong in ways the tests never check. This is why the same model posts wildly different numbers depending on who grades it:

Model (June 2026)    SWE-bench Pro    FrontierCode Diamond
Claude Fable 5       80.3%            29.3%
Claude Opus 4.8      ~69.2%           13.4%
GPT-5.5              58.6%            not tested

Same models, same week. Fable 5 resolves four out of five Pro issues, but it produces code a maintainer would merge on fewer than one in three of FrontierCode's hardest tasks. Neither number is a lie. They are measuring different things: "does it work" versus "would a senior engineer accept this."

Why SWE-bench Verified died

Two reasons, and both matter if you are quoting benchmark numbers in a board deck.

Saturation. By late 2025, frontier models were clustered near the top of Verified. When everyone scores in the high 70s and 80s, the benchmark can no longer separate good from great. It becomes a participation trophy.
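One rough way to see why clustered scores stop meaning anything is to put an error bar on them. The sketch below assumes each of the 500 Verified tasks behaves like an independent pass/fail trial, which is a simplification, and uses a normal approximation. The function names and the example scores are illustrative, not published results.

```python
import math


def score_interval(score: float, n_tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a pass rate, treating tasks as independent trials."""
    se = math.sqrt(score * (1.0 - score) / n_tasks)
    return (score - z * se, score + z * se)


def distinguishable(a: float, b: float, n_tasks: int) -> bool:
    """Rough check: two scores are separable only if their intervals do not overlap."""
    lo_a, hi_a = score_interval(a, n_tasks)
    lo_b, hi_b = score_interval(b, n_tasks)
    return hi_a < lo_b or hi_b < lo_a


# Hypothetical scores on a 500-task benchmark.
print(distinguishable(0.78, 0.80, 500))  # two models near the top
print(distinguishable(0.60, 0.80, 500))  # a large gap
```

At 80% on 500 tasks the interval is roughly plus or minus 3.5 points, so under these assumptions a two-point lead between two models near the top is inside the noise.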

Contamination. Worse than saturation. The benchmark's repos and their fixes are public, which means they leak into training data. Researchers showed models reproducing the exact gold-standard patches when given little more than a task ID. The model was not solving the problem. It was remembering the answer.
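A crude version of that check is easy to run yourself: compare the model's patch to the known gold patch and flag near-verbatim matches. This is a sketch using Python's standard `difflib`, not the method the researchers used. The 0.95 threshold and the function names are assumptions chosen for illustration.

```python
import difflib


def normalize(patch: str) -> str:
    """Drop blank lines and surrounding whitespace so formatting noise does not hide a match."""
    return "\n".join(line.strip() for line in patch.splitlines() if line.strip())


def similarity(model_patch: str, gold_patch: str) -> float:
    """Character-level similarity in [0, 1] between the two normalized patches."""
    matcher = difflib.SequenceMatcher(
        None, normalize(model_patch), normalize(gold_patch), autojunk=False
    )
    return matcher.ratio()


def looks_memorized(model_patch: str, gold_patch: str, threshold: float = 0.95) -> bool:
    # Hypothetical cutoff: a flag for human review, not proof of contamination.
    return similarity(model_patch, gold_patch) >= threshold
```

A high score on its own proves little, because a one-line fix often has only one sensible form. The signal gets stronger when long, multi-file patches match the gold version almost character for character.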

The sharpest demonstration came from UC Berkeley in April 2026: Hao Wang and colleagues showed that a 10-line conftest.py file could score 100% on SWE-bench Verified without solving a single issue. It simply manipulated the test harness. A perfect score, zero engineering.
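If you run your own evals, one cheap defense against this class of trick is to flag any patch that edits the test runner's configuration instead of the product code. The sketch below is illustrative and is not how any public benchmark guards itself. The list of harness file names is an assumption, and the parser only handles git-style unified diffs.

```python
import posixpath

# Hypothetical list of files that configure the test runner rather than the product.
HARNESS_FILES = {"conftest.py", "pytest.ini", "tox.ini"}


def changed_paths(unified_diff: str) -> list[str]:
    """Collect target paths from the '+++ b/<path>' lines of a git-style diff."""
    prefix = "+++ b/"
    paths = []
    for line in unified_diff.splitlines():
        if line.startswith(prefix):
            paths.append(line[len(prefix):].strip())
    return paths


def harness_edits(unified_diff: str) -> list[str]:
    """Return the changed paths that touch test-runner configuration."""
    return [
        path
        for path in changed_paths(unified_diff)
        if posixpath.basename(path) in HARNESS_FILES
    ]
```

A non-empty result from `harness_edits` does not mean the patch is cheating. It means a human should look before the score is counted.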

OpenAI retired Verified in February 2026 with a post titled "Why we no longer evaluate SWE-bench Verified." When the lab that built the Verified subset stops trusting it, you should too.

What a CEO should take from this

Benchmarks are directionally useful and specifically gameable. Both things are true at once.

Directionally: the climb from 2% in 2023 to 80% on a harder benchmark in 2026 is real. The models genuinely got that much better at software work. If you are still planning as if AI coding is a toy, the benchmarks are telling you to update.

Specifically: any single score is an artifact of what got measured, what leaked into training, and who did the grading. Vendors know which benchmarks buyers quote, and they optimize for them. That is not fraud. It is incentives.

So the number that matters is not on any leaderboard. It is the model against your backlog, in your codebase, graded by your review standards. That eval does not exist until someone builds it inside your company: pick ten real tickets, run the model, have your senior engineers grade the patches the way they grade a new hire's. It takes a week and it tells you more than every public benchmark combined.
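The bookkeeping for that private eval is small. Here is a minimal sketch, assuming one record per ticket with two judgments: whether your test suite passed, and whether a senior reviewer would merge the patch. The `TicketGrade` structure and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TicketGrade:
    """One graded ticket from your own backlog (hypothetical structure)."""
    ticket_id: str
    tests_pass: bool   # did your existing test suite pass with the patch applied?
    would_merge: bool  # would a senior reviewer merge it as submitted?


def summarize(grades: list[TicketGrade]) -> dict[str, float]:
    """Report both numbers side by side: 'does it work' and 'would we ship it'."""
    if not grades:
        raise ValueError("no graded tickets")
    n = len(grades)
    return {
        "tests_pass_rate": sum(g.tests_pass for g in grades) / n,
        # A patch counts as mergeable only if it also passes the tests.
        "merge_rate": sum(g.tests_pass and g.would_merge for g in grades) / n,
    }
```

With ten tickets each result moves the rate by ten points, so treat the output as a rough read on fit with your codebase and not as a precise score.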

That is the work I do inside companies. The public scores get you to the table. The private eval tells you what to ship.

Sources

1. Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", Princeton University / University of Chicago, 2023.

2. OpenAI, "Why we no longer evaluate SWE-bench Verified," February 2026. On saturation and contamination in the Verified subset.

3. Cognition, "Introducing FrontierCode," June 8, 2026. On mergeability-graded evaluation and Diamond-tier scores.

4. Hao Wang et al., UC Berkeley, April 2026. On the 10-line conftest.py scoring 100% on SWE-bench Verified without resolving any issue.

5. Anthropic, Claude Fable 5 announcement, June 9, 2026. On SWE-bench Pro and FrontierCode results.

John Tan

Founder and CEO of nativefirst.ai. Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.