- GPT-5.4 shipped March 5 and took the lead on SWE-bench Pro: 59.1% on SEAL's standardized public set, the new top score on the harness that counts post-Verified. Plus 1M context and native computer use.
- Opus 4.6 still leads the commercial subset at 47.1%. Two subsets, two leaders. Read the fine print before you quote a number.
- Grok 4.20 hit GA on March 10, leads IFBench, and is absent from every serious coding leaderboard.
Update, June 2026: April broke the benchmark and June replaced it.
February ended with SWE-bench Verified in the ground. March answered the obvious question: what do we look at now? The answer, at least for this month, is SWE-bench Pro on SEAL's standardized harness. And on March 5, OpenAI took the top of it.
GPT-5.4 Ships, and Leads
GPT-5.4 launched March 5 in Pro and Thinking variants, with mini and nano following on March 17. The benchmark headline: 59.1% on SEAL's standardized public SWE-bench Pro set, the new leader on the standardized harness. The capability headlines: a 1M-token context window and native computer use, both shipped in the same release.
That 59.1% matters more than it would have three months ago. With Verified deprecated in February, Pro's standardized SEAL harness is now the scoreboard that counts. There is no saturated, contaminated alternative left to retreat to. This is the number.
The Leaderboard, End of March
| Rank | Model | SWE-bench Pro, public set (SEAL standardized harness) |
|---|---|---|
| 1 | GPT-5.4 | 59.1% |
| 2 | GPT-5.3-Codex | 56.8% |
| 3 | Gemini 3.1 Pro | 54.2% |
| 4 | Claude Opus 4.6 | 51.9% |
One caveat on that leaderboard: it covers the public set only. SWE-bench Pro also has a commercial subset, and Opus 4.6 remains the leader there at 47.1%. So the honest sentence is: GPT-5.4 leads on the public standardized set, Opus 4.6 leads on the commercial one. Any vendor who quotes one without the other is doing marketing, not measurement.
Grok 4.20, for Completeness
Grok 4.20 went GA on the API on March 10. It leads IFBench, the instruction-following benchmark, which is a real and useful capability. It is also absent from every serious coding leaderboard. If your use case is software engineering, March did not change your shortlist.
The Two-Number Problem Persists
The pattern we flagged in February has not gone away: vendor-scaffold numbers run several points above standardized-harness numbers for the same model. A lab running its own agent scaffold, with its own retry logic and tool setup, posts one score. SEAL's standardized harness, running every model through the same plumbing, posts a lower one. Same model, same tasks, different number.
So when you read a SWE-bench score this year, the first question is not "how high?" It is "whose harness?" A standardized 59.1% and a vendor-scaffolded 64% describe different experiments, and only one of them is comparable across labs.
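One way to make the "whose harness?" question mechanical is to attach the harness and subset to every score and refuse to compare across them. The sketch below is illustrative only: the `Score` type, the harness labels `seal-standardized` and `vendor-scaffold`, and the `example-model` entry are assumptions made for this example, not anything SEAL or a vendor publishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    subset: str    # e.g. "public" or "commercial"
    harness: str   # hypothetical labels: "seal-standardized", "vendor-scaffold"
    pct: float


def comparable(a: Score, b: Score) -> bool:
    """Two scores are comparable only if benchmark, subset, and harness all match."""
    return (a.benchmark, a.subset, a.harness) == (b.benchmark, b.subset, b.harness)


def lead(a: Score, b: Score) -> float:
    """Percentage-point lead of a over b; refuses cross-harness comparisons."""
    if not comparable(a, b):
        raise ValueError(
            f"not comparable: {a.model} ({a.subset}, {a.harness}) vs "
            f"{b.model} ({b.subset}, {b.harness})"
        )
    return round(a.pct - b.pct, 1)


# Figures from the leaderboard in this post (public set, standardized harness).
gpt54 = Score("GPT-5.4", "SWE-bench Pro", "public", "seal-standardized", 59.1)
codex = Score("GPT-5.3-Codex", "SWE-bench Pro", "public", "seal-standardized", 56.8)
# Illustrative vendor-scaffold entry; the model name is a placeholder.
vendor = Score("example-model", "SWE-bench Pro", "public", "vendor-scaffold", 64.0)

print(lead(gpt54, codex))         # same harness, same subset: a valid comparison
print(comparable(gpt54, vendor))  # different harness: not comparable
```

Calling `lead(gpt54, vendor)` raises a `ValueError`, which is the point: the subtraction is easy, but the result would describe two different experiments.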
What 59.1% Actually Means
Here is the part worth sitting with. Post-Verified, the leaderboard finally measures something closer to real work, and the scores are honest-low. A 59.1% leader means four in ten real issues still fail. The frontier model, on the cleanest harness we have, with the issue handed to it on a plate, does not resolve them.
That gap is not an argument against deploying AI on your engineering work. It is the argument for how. The roughly 41% that fails is where deployment work lives: the scaffolding, the retry logic, the human checkpoints, the eval that tells you which of your tickets land in the 59% and which land in the 41%. Nobody finds that out from a leaderboard. You find it out by running the model on your backlog.
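As a rough sketch of what "running the model on your backlog" produces, the snippet below tallies resolve rates per ticket category from graded results. Everything in it is hypothetical: the category names, the `(category, resolved)` record shape, and the sample rows are placeholders standing in for your own tickets and your own grading.

```python
from collections import defaultdict


def resolve_rates(results):
    """Per-category (resolved, total, rate) from (category, resolved) records."""
    counts = defaultdict(lambda: [0, 0])  # category -> [resolved, total]
    for category, resolved in results:
        counts[category][1] += 1
        if resolved:
            counts[category][0] += 1
    return {cat: (r, n, r / n) for cat, (r, n) in counts.items()}


def overall_rate(results):
    """Fraction of all graded tickets that were resolved."""
    results = list(results)
    if not results:
        raise ValueError("no graded results")
    return sum(1 for _, resolved in results if resolved) / len(results)


# Placeholder data, not real measurements: one row per graded ticket.
backlog = [
    ("bugfix", True),
    ("bugfix", True),
    ("bugfix", False),
    ("refactor", True),
    ("refactor", False),
    ("migration", False),
]

for category, (resolved, total, rate) in resolve_rates(backlog).items():
    print(f"{category}: {resolved}/{total} resolved ({rate:.0%})")
print(f"overall: {overall_rate(backlog):.0%}")
```

The per-category split matters more than the overall figure: a single headline rate can hide one class of ticket that resolves almost every time and another that almost never does.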
The leaderboard got honest. Now you should too.
Measure your own 59%.
Run the only honest benchmark.
The Diagnostic is free: 30–45 minutes. We'll scope an eval on your actual backlog, graded by your standards.