TL;DR
  • Claude Opus 4.5 scores 80.9% on SWE-bench Verified, the first model in history over 80%. It shipped November 24 at $5/$25 per million tokens, with 59.3% on Terminal-Bench.
  • Four frontier releases in twelve days: GPT-5.1 (76.3%), Gemini 3 Pro (76.2%), GPT-5.1-Codex-Max (77.9%, state of the art for five days), then Opus 4.5.
  • Same model on contamination-resistant SWE-bench Pro: 45.9%. Still the top Pro score, but the 35-point spread between the two leaderboards is the number nobody is quoting.

Anthropic shipped Claude Opus 4.5 on November 24, and the headline writes itself: 80.9% on SWE-bench Verified, the first model ever past 80%. Five days earlier that crown belonged to OpenAI. A week before that, to nobody.

If you only read one benchmark post this month, read this one to the end. The 80.9% is real. The asterisk next to it is bigger than the number.

Twelve Days, Four Frontier Models

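November was the fastest sequence of frontier releases the field has ever seen:

  • November 12: OpenAI ships GPT-5.1 (76.3% on SWE-bench Verified).
  • November 18: Google announces Gemini 3 Pro (76.2%).
  • November 19: OpenAI ships GPT-5.1-Codex-Max (77.9%).
  • November 24: Anthropic ships Claude Opus 4.5 (80.9%).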

Three labs, four models, twelve days, and a 4.6-point climb from the first of those releases (76.3%) to the last (80.9%). The frontier is not slowing down. It is sprinting in public.

The Leaderboard, End of November

SWE-bench Verified · End of November 2025

Rank  Model               Verified  Pro
1     Claude Opus 4.5     80.9%     45.9%
2     GPT-5.1-Codex-Max   77.9%     n/a
3     Claude Sonnet 4.5   77.2%     n/a
4     GPT-5.1             76.3%     n/a
5     Gemini 3 Pro        76.2%     n/a

Fig. 1. SWE-bench Verified, % resolved, end of November 2025. Same model on contamination-resistant Pro: 45.9%. Hold that thought.
Four frontier releases in twelve days. The lead changed hands twice.

The 80.9% matters beyond bragging rights. SWE-bench Verified is the benchmark every engineering leader quotes when picking a model, and 80% was the psychological line. Opus 4.5 did not creep over it. It cleared the previous state of the art by 3 full points, the largest single-release jump of the year, while cutting price to $5/$25.

The Number Nobody Quoted: 45.9%

Back in September, Scale AI launched SWE-bench Pro: 1,865 tasks built to resist contamination. The repos are commercial or copyleft-licensed, the tasks are harder, and the harness is standardized under Scale's SEAL leaderboard. At launch, the best frontier models scored around 23%.

On Pro, Opus 4.5 scores 45.9%. That is the top Pro score posted to date, roughly double where the frontier stood ten weeks ago. Genuinely impressive.

But put the two numbers side by side. 80.9% on Verified. 45.9% on Pro. Same model, same week, 35 points apart.
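
The arithmetic behind that gap is small enough to check by hand. A minimal Python sketch, using only the two scores quoted in this post (the variable names are illustrative):

```python
# Scores quoted in this post for Claude Opus 4.5 (percent of tasks resolved).
verified = 80.9
pro = 45.9

# Absolute spread between the two leaderboards, in percentage points.
spread_points = round(verified - pro, 1)

# Share of the Verified score that survives on the contamination-resistant set.
retained = pro / verified

print(f"Spread: {spread_points} points")
print(f"Pro is {retained:.1%} of Verified")
```

Put differently, a little over half of the Verified score carries over to Pro.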

What the 35-Point Spread Means

There are two ways to read the spread.

The optimistic reading: Pro is simply harder. Bigger codebases, longer fixes, no hand-holding. Every model drops on Pro, so the ranking still holds, and the ranking is what you buy on.

The uncomfortable reading: SWE-bench Verified tasks are public GitHub issues, and public GitHub is training data. Some unknown fraction of that 80.9% is not problem-solving. It is recall. Pro was built contamination-resistant precisely because Scale suspected the public number was inflated, and the 35-point gap is consistent with that suspicion.

Everyone is choosing the optimistic reading. The spread is the warning.

Here is the operator takeaway. When a model scores 81% on the benchmark it may have memorized and 46% on the one it could not, the honest planning number for your team is closer to 46. Budget your AI engineering rollout on the Pro number, celebrate the Verified number, and never confuse the two in a board deck.

Better still, stop planning on either. Your repo is not in anyone's training set. An eval built on your actual backlog, graded by your own tech leads, is the only score that predicts what happens when the model meets your codebase on Monday.
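
One way to turn that into a number is a small scoring script. The sketch below is illustrative only: the `BacklogResult` record, the ticket IDs, and the grades are hypothetical placeholders, and it assumes each tech lead's verdict is a simple accept or reject. It reports the resolve rate with a Wilson 95% interval, because a sample of a few dozen backlog tickets is too small to read to the decimal point.

```python
import math
from dataclasses import dataclass


@dataclass
class BacklogResult:
    """One graded attempt on a ticket from your own backlog (hypothetical schema)."""
    ticket_id: str
    accepted: bool  # True if the reviewing tech lead would merge the change


def resolve_rate(results: list[BacklogResult]) -> float:
    """Fraction of tickets whose model-written change was accepted."""
    if not results:
        raise ValueError("need at least one graded result")
    return sum(r.accepted for r in results) / len(results)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one trial")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Placeholder grades for illustration; replace with your team's real reviews.
results = [
    BacklogResult("TICKET-1", True),
    BacklogResult("TICKET-2", False),
    BacklogResult("TICKET-3", True),
    BacklogResult("TICKET-4", True),
    BacklogResult("TICKET-5", False),
]

accepted = sum(r.accepted for r in results)
rate = resolve_rate(results)
low, high = wilson_interval(accepted, len(results))
print(f"Resolve rate: {rate:.0%} (95% interval {low:.0%} to {high:.0%})")
```

With only five tickets the interval is very wide, which is the point: the score only becomes a planning number once the sample is large enough to narrow it.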

Opus 4.5 is the best coding model available today, on both rulers. Just know which ruler you are reading.

Update, June 2026: the 35-point spread became the story. By June the benchmark was dead and the successor reset everyone below 30%.

Sources
1. Anthropic, "Introducing Claude Opus 4.5," November 24, 2025. SWE-bench Verified 80.9%, Terminal-Bench 59.3%, $5/$25 per million tokens.
2. OpenAI, "Building more with GPT-5.1-Codex-Max," November 19, 2025. SWE-bench Verified 77.9%. GPT-5.1 (November 12): 76.3%.
3. Google, "Gemini 3" announcement, November 18, 2025. SWE-bench Verified 76.2%.
4. Scale AI, "SWE-bench Pro," September 19, 2025. 1,865 contamination-resistant tasks; top frontier models ~23% at launch; SEAL leaderboard standardization. Opus 4.5 Pro score: 45.9%.
John Tan

Fractional Chief of AI at nativefirst.ai. Former YC CEO (Depict). Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.