- Berkeley researchers broke 8 agent benchmarks without solving a single task, scoring 100% on both SWE-bench Verified and SWE-bench Pro with pytest config hooks.
- Claude Mythos Preview exposed the contamination gap: 93.9% on Verified, ~45.9% on Pro. A 48-point spread on the same skill.
- GPT-5.5 launched April 23 at 58.6% on SWE-bench Pro, trailing Claude Opus 4.7's 64.3%, while open-weights Kimi K2.6 undercut everyone at $0.60/M input.
April was the month the coding benchmark story fell apart. Four separate events in four weeks each chipped away at the same foundation: the number everyone cites to compare coding models stopped meaning what people think it means. This is the April edition of the Benchmark series, reconstructed so the May and June editions make sense.
Apr 6: Mythos Preview and the 48-Point Tell
Anthropic shipped Claude Mythos Preview on April 6 with a headline score of 93.9% on SWE-bench Verified. Impressive, until you looked at the second number: roughly 45.9% on SWE-bench Pro, the contamination-resistant variant built from private and copyleft repos.
Same model. Same skill, allegedly. A 48-point spread.
That spread became the smoking gun for the contamination argument. SWE-bench Verified is built from public GitHub issues that sit in every frontier model's training data. A model can memorize Verified, reproduce the gold patches it has already seen, and still struggle on tasks it has never encountered. Mythos Preview did not prove every high Verified score is memorization. It proved memorization alone can carry you into the 90s.
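The arithmetic behind the spread is easy to check. A minimal sketch using only the two scores quoted above; the function name `contamination_gap` is illustrative and not a standard metric, and the Pro figure is approximate in the source:

```python
def contamination_gap(public_score: float, held_out_score: float) -> float:
    """Percentage-point gap between a public benchmark score and a held-out one."""
    # Round to one decimal place to match the precision of the quoted scores.
    return round(public_score - held_out_score, 1)


# Scores quoted above for Claude Mythos Preview.
verified = 93.9  # SWE-bench Verified
pro = 45.9       # SWE-bench Pro (approximate)

gap = contamination_gap(verified, pro)
print(f"Verified minus Pro: {gap} points")
```

A gap like this says nothing on its own about why the scores differ. It only flags a model whose public score deserves a second look.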
Apr 10: Saturation, by the Numbers
Four days later, AgentMarketCap published a saturation analysis that put the trajectory in one line: SWE-bench Verified climbed from 4% to over 80% in under three years. By their analysis, the ~6% of tasks still unsolved at the top of the leaderboard is mostly dataset noise rather than capability headroom: broken tests, underspecified issues, tasks that cannot be solved as stated.
Buried in the same piece: OpenAI had quietly stopped reporting Verified scores after the contamination evidence piled up. No announcement. The number just stopped appearing in their releases. When the lab that popularized a benchmark stops citing it, that is the obituary, even if nobody reads it aloud.
Mid-April: Berkeley Breaks Everything
Then UC Berkeley's RDI group published "How We Broke Top AI Agent Benchmarks" (Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song). They exploited 8 major agent benchmarks to perfect or heavily inflated scores without solving a single task.
The damage report, for six of the eight:
- SWE-bench Verified and SWE-bench Pro: 100%, via conftest.py pytest hooks and overwriting the container's result parser. The contamination-resistant benchmark fell to the same trick as the contaminated one.
- Terminal-Bench: 100%.
- WebArena: ~100%.
- GAIA: ~98%.
- OSWorld: 73%.
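One harness-side response to the conftest.py route is to refuse any submitted patch that touches test configuration. The sketch below is an assumption-heavy illustration, not the Berkeley group's method or any benchmark's actual defense: the deny-list `HARNESS_FILES` and the helper names are hypothetical, and it only inspects unified-diff headers.

```python
import re
from pathlib import PurePosixPath

# Hypothetical deny-list: files that can change how pytest collects or reports tests.
HARNESS_FILES = {"conftest.py", "pytest.ini", "tox.ini", "setup.cfg", "pyproject.toml"}

# Matches the "+++ b/<path>" header line that names the file a diff hunk writes to.
_NEW_FILE_HEADER = re.compile(r"^\+\+\+ b/(.+)$", re.MULTILINE)


def touched_files(diff_text: str) -> list[str]:
    """Return the paths a unified diff writes to, in order of appearance."""
    return [path.strip() for path in _NEW_FILE_HEADER.findall(diff_text)]


def harness_tampering(diff_text: str) -> list[str]:
    """Return the touched paths whose filename is on the deny-list."""
    return [
        path
        for path in touched_files(diff_text)
        if PurePosixPath(path).name in HARNESS_FILES
    ]


# A made-up patch: one ordinary source change, one new conftest.py.
sample_patch = """\
--- a/src/app.py
+++ b/src/app.py
@@ -1 +1 @@
-x = 1
+x = 2
--- /dev/null
+++ b/tests/conftest.py
@@ -0,0 +1 @@
+import pytest
"""

flagged = harness_tampering(sample_patch)
print(flagged)
```

A filename check like this is deliberately blunt. It will flag legitimate edits to files such as pyproject.toml, and it does nothing about an agent that rewrites the result parser at runtime inside the container, so it is a starting point for review rather than a fix.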
No reasoning, no patches, no model intelligence. The harness graded itself and the harness could be lied to. Their conclusion is the line this entire series keeps coming back to:
"Don't trust the number. Trust the methodology."
Apr 23: GPT-5.5 Lands Second
Into this wreckage, OpenAI launched GPT-5.5 on April 23. On SWE-bench Pro it posted 58.6%, trailing Claude Opus 4.7's 64.3%. Where it won: terminals (82.7% on Terminal-Bench 2.0) and a 1M-token context window.
The most interesting number in the release was not a public benchmark at all. OpenAI reported 73.1% on Expert-SWE, its internal eval built from 20-hour engineering tasks. A private eval, on long-horizon work, graded by the people who built it. That is the signal of where evals go next: away from public leaderboards, toward task suites nobody can train on.
The Open-Weights Pressure
Two more data points from the month. On April 20, Moonshot's Kimi K2.6 shipped with open weights and beat GPT-5.4 on SWE-bench Pro while charging $0.60 per million input tokens. The frontier premium now has to justify itself against an open model you can run yourself. And Grok 4.3 entered beta on April 17 with no SWE-bench Pro score at all, absent from the coding leaderboards entirely.
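To put the $0.60 figure in budget terms, here is a small sketch of the input-side cost of a workload. Only the input price comes from the text above; the 50M-token workload is a made-up example, and output-token pricing is not covered because it is not quoted here.

```python
from decimal import Decimal


def input_cost_usd(input_tokens: int, usd_per_million: Decimal) -> Decimal:
    """Cost of the input tokens alone, given a price per one million tokens."""
    return Decimal(input_tokens) / Decimal(1_000_000) * usd_per_million


# Price quoted above for Kimi K2.6 input tokens.
KIMI_K26_INPUT_PRICE = Decimal("0.60")

# Hypothetical workload: 50 million input tokens across an eval run.
tokens = 50_000_000

cost = input_cost_usd(tokens, KIMI_K26_INPUT_PRICE)
print(f"${cost} for {tokens:,} input tokens")
```

`Decimal` is used instead of floats so the dollar amounts stay exact.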
The Leaderboard, End of April 2026
| Rank | Model | SWE-bench Pro score |
|---|---|---|
| 1 | Claude Opus 4.7 | 64.3% |
| 2 | GPT-5.5 | 58.6% |
| 3 | Kimi K2.6 | above GPT-5.4 (reported) |
| 4 | Claude Mythos Preview | ~45.9%* |
*Approximate. Mythos Preview scored 93.9% on SWE-bench Verified at its April 6 launch. The 48-point gap between its Verified and Pro scores is the contamination story in one row.
What April Actually Taught
April's lesson is not that models got worse. Opus 4.7 at 64.3% on a contamination-resistant benchmark is real capability. The lesson is that the public number and the real number diverged in plain sight, and the industry kept citing the public one.
Update, June 2026: the replacement arrived. A maintainer-graded successor reset the whole leaderboard; the June edition covers it.
If you are picking a model for your engineering team off a leaderboard that a config file can max out, you are not measuring the model. You are measuring the harness. The only benchmark that cannot be gamed is the one run on your repo against your standards, graded by your own tech leads on work the model has never seen.
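As a rough illustration of what grading by your own tech leads could look like in practice, the sketch below aggregates reviewer verdicts into a single acceptance rate. Everything in it is hypothetical: the `Review` record, the rule that a task counts only if every reviewer accepts it, and the sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    task_id: str    # identifier of a backlog task the model attempted
    reviewer: str   # tech lead who graded the attempt
    accepted: bool  # True if the reviewer would merge the change


def acceptance_rate(reviews: list[Review]) -> float:
    """Share of tasks accepted by every reviewer who graded them."""
    verdicts_by_task: dict[str, list[bool]] = {}
    for review in reviews:
        verdicts_by_task.setdefault(review.task_id, []).append(review.accepted)
    if not verdicts_by_task:
        raise ValueError("no reviews to aggregate")
    # A task passes only when all of its reviewers accepted it.
    passed = sum(all(verdicts) for verdicts in verdicts_by_task.values())
    return passed / len(verdicts_by_task)


# Made-up data: four tasks, two reviewers each, one split decision.
reviews = [
    Review("TASK-1", "lead_a", True), Review("TASK-1", "lead_b", True),
    Review("TASK-2", "lead_a", True), Review("TASK-2", "lead_b", False),
    Review("TASK-3", "lead_a", True), Review("TASK-3", "lead_b", True),
    Review("TASK-4", "lead_a", True), Review("TASK-4", "lead_b", True),
]

rate = acceptance_rate(reviews)
print(f"{rate:.0%} of tasks accepted by all reviewers")
```

Requiring unanimous acceptance is a strict choice. A majority rule or a rubric score would be equally defensible, as long as the rule is fixed before the model's output is seen.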
Trust the methodology.
Run the only honest benchmark.
The Diagnostic is free: 30–45 minutes. We'll scope an eval on your actual backlog, graded by your standards.