TL;DR
  • OpenAI deprecated SWE-bench Verified in mid-February. Models, including GPT-5.2, could reproduce gold patches from nothing but the task ID. The benchmark that defined AI coding for three years is dead.
  • The 80% cluster made it meaningless anyway. Opus 4.5 and 4.6 at 80.9%, Gemini 3.1 Pro at 80.6%, GPT-5.2 at 80.0%. Three labs, one number.
  • The living benchmark is SWE-bench Pro on a standardized harness. SEAL public set: GPT-5.3-Codex 56.8%, Gemini 3.1 Pro 54.2%, Opus 4.6 51.9%. Real spread, real headroom.

Update, June 2026: the successor arrived. FrontierCode grades like a tech lead and reset everyone below 30%.

Sometime in mid-February, OpenAI published a note with a dry title: "Why we no longer evaluate SWE-bench Verified." The content was not dry. Their models could reproduce the gold patches from nothing but the task ID. Including GPT-5.2. Not solve the issue. Recall the answer.

That is contamination, and it is terminal. SWE-bench Verified is built from real GitHub issues in public repos. The repos, the issues, and the human-written fixes are all in the training data of every frontier model. Once a model can recite the fix from the question number, the benchmark stops measuring engineering and starts measuring memory.
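One way to picture the failure is a recall probe: give the model nothing but the task ID and measure how closely its output matches the gold patch. The sketch below is illustrative only and is not OpenAI's methodology. The `generate` callable, the task IDs, and the 0.9 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def patch_similarity(candidate: str, gold: str) -> float:
    """Similarity in [0, 1] between two patches, compared line by line.

    Blank lines and leading/trailing whitespace are ignored so that
    formatting differences do not hide a memorized fix.
    """
    def normalize(patch: str) -> list:
        return [line.strip() for line in patch.splitlines() if line.strip()]

    return SequenceMatcher(None, normalize(candidate), normalize(gold)).ratio()


def contamination_rate(
    gold_patches: Dict[str, str],
    generate: Callable[[str], str],  # hypothetical: maps a bare task ID to model output
    threshold: float = 0.9,          # assumed cutoff for "recited the answer"
) -> float:
    """Fraction of tasks where the model reproduces the gold patch from the ID alone."""
    if not gold_patches:
        return 0.0
    hits = sum(
        1
        for task_id, gold in gold_patches.items()
        if patch_similarity(generate(task_id), gold) >= threshold
    )
    return hits / len(gold_patches)
```

A model that has never seen the fixes should score near zero here, because a task ID carries no information about the bug. Anything well above zero is memory, not engineering.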

The deprecation only made official what the leaderboard had been whispering for months. The benchmark that defined AI coding for three years died of success.

The Month in Three Releases

February 5: Claude Opus 4.6. Verified: 80.9%, reported as a 25-trial average. That number is exactly flat against Opus 4.5, and the flatness is the saturation tell. A meaningfully better model that cannot move the score means the score has run out of room. The number to watch instead: SWE-bench Pro at 53.4% vendor-reported, 51.9% on Scale AI's SEAL standardized public set. Hold onto that two-number reality. We will come back to it.

February 5, same day: GPT-5.3-Codex. On SEAL's public set it posts 56.8%, against 56.4% for GPT-5.2-Codex. Incremental, but it is the top of the benchmark that still discriminates.

February 19: Gemini 3.1 Pro. Verified: 80.6%, which means Google joined a cluster that was already dead by the time the press release shipped. The model is genuinely strong on reasoning benchmarks: 77.1% on ARC-AGI-2, 94.3% on GPQA. On Pro it lands at 54.2%, behind both Codex models and ahead of Opus 4.6 on the standardized harness.

February 2026 · Dead benchmark vs. living benchmark
Model                  SWE-bench Verified   SWE-bench Pro (SEAL public set)
Claude Opus 4.6        80.9%                51.9%
Gemini 3.1 Pro         80.6%                54.2%
GPT-5.2 / 5.2-Codex    80.0%                56.4%
GPT-5.3-Codex          not reported         56.8%

Read the columns side by side. The left column says the race is over and everyone tied. The right column says the race is still being run and nobody is close to done. Only one of those columns can be true, and OpenAI just told you which.
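The difference between the columns is easy to put a number on. A minimal sketch, using only the scores from the table above and defining "spread" as best score minus worst score (a definition chosen for this example):

```python
# Scores copied from the table above, in percent.
verified = {
    "Claude Opus 4.6": 80.9,
    "Gemini 3.1 Pro": 80.6,
    "GPT-5.2": 80.0,
}
pro_seal = {
    "Claude Opus 4.6": 51.9,
    "Gemini 3.1 Pro": 54.2,
    "GPT-5.2-Codex": 56.4,
    "GPT-5.3-Codex": 56.8,
}


def spread(scores: dict) -> float:
    """Gap in points between the best and worst reported score."""
    return round(max(scores.values()) - min(scores.values()), 1)


print(f"Verified spread: {spread(verified)} points")
print(f"Pro (SEAL) spread: {spread(pro_seal)} points")
```

On these figures the Verified column separates its three models by under a point, while the Pro column separates its four by almost five.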

The Flatline

Fig. 1. The flatline: best reported SWE-bench Verified score by quarter.
[Chart: roughly 50% (mid 2024), 65% (late 2024), 77% (mid 2025), 80.9% (Nov 2025), 80.9% (Feb 2026); annotated "saturated + contaminated, deprecated mid-Feb." Inset, the living benchmark, SWE-bench Pro on the SEAL public set: GPT-5.3-Codex 56.8%, Claude Opus 4.6 51.9%.]
Three years of climb, then a wall. The line stopped moving because the test stopped measuring.

The Two-Number Reality

One detail from the Opus 4.6 release deserves its own section. Anthropic reported 53.4% on SWE-bench Pro. SEAL's standardized public set puts the same model at 51.9%. Neither number is a lie. The vendor number runs on the vendor's own scaffold, tuned to the model. The SEAL number runs every model through the same harness.

A point and a half does not sound like much until you remember that entire marketing campaigns get built on smaller gaps. From here on, every coding score you see comes in two flavors: the scaffold the vendor built for the test, and the harness someone else built to be fair. Always ask which one you are looking at.
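One way to make that question hard to forget is to store the harness alongside every score and refuse to compare across harnesses. The sketch below is an illustration, not anyone's published tooling. The `Score` record and both helper functions are hypothetical names, and the values are the ones quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    harness: str   # who ran it: "vendor" scaffold or standardized "SEAL"
    value: float   # percent of tasks resolved


def lead(a: Score, b: Score) -> float:
    """Points by which `a` leads `b`. Only defined on the same benchmark and harness."""
    if (a.benchmark, a.harness) != (b.benchmark, b.harness):
        raise ValueError("scores come from different benchmarks or harnesses")
    return round(a.value - b.value, 1)


def scaffold_gap(vendor: Score, standardized: Score) -> float:
    """Points a model gains from the vendor's own scaffold on the same benchmark."""
    if (vendor.model, vendor.benchmark) != (standardized.model, standardized.benchmark):
        raise ValueError("scaffold gap needs the same model and benchmark")
    return round(vendor.value - standardized.value, 1)


opus_vendor = Score("Claude Opus 4.6", "SWE-bench Pro", "vendor", 53.4)
opus_seal = Score("Claude Opus 4.6", "SWE-bench Pro", "SEAL", 51.9)
codex_seal = Score("GPT-5.3-Codex", "SWE-bench Pro", "SEAL", 56.8)

print(scaffold_gap(opus_vendor, opus_seal))  # Opus 4.6, vendor scaffold vs. SEAL
print(lead(codex_seal, opus_seal))           # both on SEAL, so the comparison is allowed
```

Calling `lead(codex_seal, opus_vendor)` raises a `ValueError`, which is the point: a vendor number and a standardized number are not on the same scale.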

What February Actually Taught

SWE-bench Verified taught the industry to chase a number. For three years that was useful: the number was hard, public, and honest enough. Then the models ate the test, the scores converged at 80%, and the number kept getting printed long after it stopped meaning anything.

SWE-bench Pro and standardized harnesses like SEAL are teaching the industry the opposite habit: chase reality. Held-out tasks, same harness for everyone, scores in the 50s with real spread between models. That is what a benchmark with something left to say looks like.

The operator translation: any public benchmark old enough to be a buying criterion is old enough to be in the training data. The eval that actually predicts whether a model helps your team is the one built on your repo, your tickets, and your definition of done. Nobody has memorized that one.
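The skeleton of such an eval is small. The sketch below assumes each ticket comes with a command that exits 0 only when your definition of done is met (a test suite, a lint gate, a custom script) and that the model's change has already been applied in the working directory. The `Ticket` type and function names are hypothetical.

```python
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class Ticket:
    ticket_id: str
    check_cmd: List[str]  # command that exits 0 when the ticket counts as done


def resolved(ticket: Ticket, workdir: str = ".", timeout: int = 600) -> bool:
    """Run the ticket's check in `workdir`; timeouts and missing commands count as failures."""
    try:
        result = subprocess.run(
            ticket.check_cmd, cwd=workdir, capture_output=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def resolve_rate(tickets: List[Ticket], workdir: str = ".") -> float:
    """Fraction of tickets whose check passes: the score that is specific to your repo."""
    if not tickets:
        return 0.0
    return sum(resolved(t, workdir) for t in tickets) / len(tickets)
```

Run every candidate model through the same tickets and the same checks. That gives you your own standardized harness, scored against your own definition of done.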

Stop chasing the number.

Run the only honest benchmark.

Sources
1. OpenAI, "Why we no longer evaluate SWE-bench Verified," mid-February 2026. On saturation, contamination, and gold-patch reproduction from task IDs alone (including GPT-5.2).
2. Anthropic, Claude Opus 4.6 announcement, February 5, 2026. SWE-bench Verified 80.9% (25-trial average); SWE-bench Pro 53.4% vendor-reported.
3. OpenAI, GPT-5.3-Codex, February 5, 2026.
4. Google, Gemini 3.1 Pro, February 19, 2026. SWE-bench Verified 80.6%; ARC-AGI-2 77.1%; GPQA 94.3%.
5. Scale AI, SEAL leaderboard, SWE-bench Pro standardized public set: GPT-5.3-Codex 56.8%, GPT-5.2-Codex 56.4%, Gemini 3.1 Pro 54.2%, Claude Opus 4.6 51.9%.
John Tan

Fractional Chief of AI at nativefirst.ai. Former YC CEO (Depict). Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.