TL;DR
  • Claude Opus 4.8 takes the SWE-bench Pro lead at 69.2%, up from Opus 4.7's 64.3%. Same $5/$25 pricing, fast mode at 2.5x speed.
  • Open weights close in. GLM-5.1 briefly topped Pro during May, and Kimi K2.6 sits about 6 points behind Opus 4.7 at roughly 8x lower cost.
  • SWE-bench Verified is a zombie metric. Opus 4.8 posts 88.6% on a benchmark its own creator stopped trusting. Nobody serious cites it anymore.

This is the May edition of the monthly SWE-bench recap. Four numbers to know: 69.2, 6, 8, and 54.2. A new leader, a shrinking gap, a collapsing price premium, and a Pro tier Google forgot to update. Here is what each one means.

May 28: Opus 4.8 Lands

Anthropic shipped Claude Opus 4.8 on May 28 and took the SWE-bench Pro lead: 69.2%, up from Opus 4.7's 64.3%. It also posted 88.6% on SWE-bench Verified and 83.4% on OSWorld, the computer-use benchmark. Pricing held at $5/$25 per million tokens, with a fast mode running at 2.5x speed.

A 4.9-point jump on Pro in one point release is a real result. Pro is the harder, less contaminated of the two SWE-bench tracks, and the one this series treats as the scoreboard that still means something. For now.
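For readers who want the arithmetic behind that jump, here is a minimal Python sketch. It uses only the two Pro scores quoted above. The function names are illustrative and not part of any benchmark tooling, and framing the gain as a drop in failure rate is this sketch's assumption, not an official SWE-bench statistic.

```python
# Scores quoted in this article (SWE-bench Pro, percent resolved).
OPUS_4_7_PRO = 64.3
OPUS_4_8_PRO = 69.2


def point_gain(old_pct: float, new_pct: float) -> float:
    """Absolute gain in percentage points."""
    return round(new_pct - old_pct, 1)


def failure_rate_reduction(old_pct: float, new_pct: float) -> float:
    """Relative drop in the failure rate, in percent.

    Assumption: 'failure rate' is simply 100 minus the reported score.
    """
    old_fail = 100 - old_pct
    new_fail = 100 - new_pct
    return round(100 * (old_fail - new_fail) / old_fail, 1)


gain = point_gain(OPUS_4_7_PRO, OPUS_4_8_PRO)
reduction = failure_rate_reduction(OPUS_4_7_PRO, OPUS_4_8_PRO)
print(f"gain: {gain} points, failure rate down {reduction}% relative")
```

On these numbers the unresolved share falls from 35.7% to 30.8%, which is roughly a 13.7% relative cut in failures. That is one way to read "a real result" without leaning on the headline score alone.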

The Verified Zombie

About that 88.6% on Verified. It is a strong score on a benchmark that its own creator stopped trusting. Verified numbers still circulate as zombie statistics: third parties report GPT-5.5 at 88.7%, a figure OpenAI itself has never published, because OpenAI stopped reporting Verified entirely.

So we now have a leaderboard where the two top scores are 0.1 points apart, one is unofficial, and nobody serious cites either. When a metric stops appearing in the vendor's own launch posts and survives only in third-party roundups, it is not measuring progress anymore. It is measuring habit.

Open Weights, 6 Points Back, 8x Cheaper

The sharper story in May was below the frontier. GLM-5.1 briefly topped SWE-bench Pro during the month before Opus 4.8 reclaimed the lead. And Kimi K2.6 now sits about 6 points behind Opus 4.7 (roughly 58% reported against 64.3%) at roughly 8x lower cost.

That ratio is the number CFOs will remember. The frontier premium, the extra you pay for the best closed model over the best open one, is now being argued about in public. If your workload lives in the band where a 6-point gap does not change the outcome, the open-weights option went from ideological to financial this month.
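One way to make that trade-off concrete is cost per resolved task rather than cost per attempt. The sketch below is a back-of-envelope illustration, not a pricing model. It assumes the benchmark score stands in for your own resolve rate, that an attempt with the open model costs one eighth of an attempt with the closed one (the article's "roughly 8x"), and that failed attempts are simply paid for and discarded. The function and variable names are hypothetical.

```python
def cost_per_resolved(cost_per_attempt: float, resolve_rate: float) -> float:
    """Expected spend per successfully resolved task.

    Assumption: every attempt costs the same and failures are written off.
    """
    if not 0 < resolve_rate <= 1:
        raise ValueError("resolve_rate must be in (0, 1]")
    return cost_per_attempt / resolve_rate


# Relative units: one Opus 4.7 attempt costs 1.0.
closed_cost = cost_per_resolved(1.0, 0.643)     # Opus 4.7, 64.3% on Pro
open_cost = cost_per_resolved(1.0 / 8, 0.58)    # Kimi K2.6, ~58% reported

# How many times cheaper the open model is per resolved task.
effective_ratio = closed_cost / open_cost
print(f"effective cost advantage: {effective_ratio:.1f}x")
```

Under those assumptions the 8x sticker gap shrinks to about 7.2x per resolved task, because the open model fails a little more often. The 6-point deficit costs something, but not much, which is why the argument has moved to finance.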

Grok 4.3: Opting Out of the Race

xAI shipped Grok 4.3 to general availability in early May. It is strong where it chose to be strong: 98% on tau²-Bench Telecom, an agentic customer-support benchmark, at $1.25/$2.50 per million tokens. It is absent from SWE-bench Pro and from serious coding leaderboards generally.

"xAI gives up on being the 'best' model, and that's a good thing."

That is 302.AI's read, and it is the right one. Grok 4.3 is not trying to win this column's table. It is trying to win the support-ticket queue at a fifth of the price. Whether that positioning holds is a question for the agent benchmarks, not this one.

Google's Awkward I/O

At I/O on May 19, Google shipped Gemini 3.5 Flash and created a problem for its own lineup: the cheap tier now beats the expensive tier. Flash outperforms Gemini 3.1 Pro on agents and coding, posting 76.2% on Terminal-Bench 2.1. Meanwhile, 3.1 Pro's SWE-bench Pro score stands at 54.2%, last among the frontier labs on the leaderboard.

The Pro tier is stale while the cheap tier improves. Until Google ships a 3.5 Pro, anyone paying Pro prices for coding is paying for a number from a previous generation.

The Leaderboard, End of May 2026

SWE-bench Pro · End of May 2026

Rank  Model            Score
1     Claude Opus 4.8  69.2%
2     Claude Opus 4.7  64.3%
3     GPT-5.5          58.6%
4     Kimi K2.6        ~58% (reported)
5     Gemini 3.1 Pro   54.2%
-     Grok 4.3         not benchmarked

Fig. 1: SWE-bench Pro, end of May 2026 (bar chart of the leaderboard scores). Grok 4.3 is not benchmarked on Pro; Kimi K2.6 runs at roughly 8x lower cost. The frontier premium is now an argument, not a fact.

What May Actually Taught

Step back from the table and look at the month as a whole. A new leader on Pro. A dead benchmark still being quoted. An open-weights model that briefly held the top spot. A frontier lab that opted out of the race entirely. A cheap tier beating its own expensive sibling.

May's lesson: the leaderboard reshuffles monthly, the winners change, and none of it tells you whether the model can work your backlog. Opus 4.8 at 69.2% is the right default for coding work this month. But "right default" and "right for your codebase" are different claims, and only one of them can be settled from a chart.

Update, June 2026: if May felt like a reshuffle, June was a reset. The whole scale got torn down and rebuilt.

Sources
1. Vellum, "Claude Opus 4.8 benchmarks explained," May 2026. SWE-bench Pro 69.2%, Verified 88.6%, OSWorld 83.4%, $5/$25 pricing, fast mode at 2.5x speed.
2. Artificial Analysis, Grok 4.3 launch analysis, May 2026. tau²-Bench Telecom 98%; $1.25/$2.50 per million tokens; no SWE-bench Pro submission.
3. MarkTechPost, "Gemini 3.5 Flash at I/O 2026," May 19, 2026. Flash beats Gemini 3.1 Pro on agents and coding; Terminal-Bench 2.1: 76.2%.
4. 302.AI, "xAI gives up on being the 'best' model, and that's a good thing," May 2026. On Grok 4.3 positioning.

John Tan

Fractional Chief of AI at nativefirst.ai. Former YC CEO (Depict). Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.