GDPval Update (May 2026): The Leaderboard Reshuffles

TL;DR

Claude Opus 4.8 took the GDPval-AA lead at 1890 Elo on May 28, 576 points clear of Gemini 3.1 Pro, the largest gap Anthropic had published.
Grok 4.3 gained 321 Elo over Grok 4.20, the biggest single-release gain on the board, and its selling point is price: roughly $395 for the full suite run.
Gemini 3.5 Flash (1656) beat Google's own flagship 3.1 Pro (1314) by 342 points. The cheap tier outran the Pro tier on real work.

May was the busiest month GDPval has had. Three of the four major labs moved, and the board looks nothing like it did in April. Grok jumped 321 points. Google's budget model lapped Google's flagship. And on May 28, Anthropic took the overall lead, 576 points clear of Gemini 3.1 Pro.

Here is the month in order, then what the reshuffle actually means for anyone buying AI for real work.

Early May: Grok 4.3 Finds Its Lane

xAI shipped Grok 4.3 to general availability in early May and landed at 1500 Elo on GDPval-AA. That is a +321 jump over Grok 4.20's 1179, the biggest single-release Elo gain on the board.

The headline is not the reasoning. Artificial Analysis still rates Grok 4.3 as "not yet competitive with first-tier frontier models" for deep reasoning. The headline is the economics. At $1.25 per million input tokens and $2.50 per million output, the full GDPval-AA suite run costs roughly $395. Add a 1M-token context window, native video, and a 98% score on tau²-Bench Telecom for support-agent tasks, and you have a model built for volume work, not for prize fights.
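
To make the pricing arithmetic concrete, here is a minimal sketch of how per-million-token prices turn into a run cost. The two prices come from the paragraph above; the token mix behind the roughly $395 figure is not stated here, so the function names and any token counts you pass in are illustrative assumptions, not Artificial Analysis's accounting.

```python
# Grok 4.3 list prices quoted above, in USD per million tokens.
INPUT_PRICE_PER_M = 1.25
OUTPUT_PRICE_PER_M = 2.50


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a run that consumes the given token counts."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


def max_output_tokens(budget_usd: float, input_tokens: int = 0) -> float:
    """Output tokens a budget covers after paying for input_tokens.

    Hypothetical helper: it only bounds the token volume, it does not
    recover the real input/output split of the benchmark run.
    """
    remaining = budget_usd - input_tokens * INPUT_PRICE_PER_M / 1_000_000
    return remaining / OUTPUT_PRICE_PER_M * 1_000_000


if __name__ == "__main__":
    # Upper bound on output volume if the whole $395 went to output tokens.
    print(max_output_tokens(395))
```

Read this as a bound, not a breakdown: at these prices $395 buys at most 158 million output tokens, and any input tokens in the run reduce that number.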

"xAI gives up on being the 'best' model, and that's a good thing."

That positioning take from 302.AI is the right read. Grok 4.3 is not chasing the ceiling. It is buying Elo per dollar, and for high-volume, well-scoped tasks like support, that trade is rational.

May 19: Google's Cheap Tier Beats Google's Flagship

At Google I/O, Gemini 3.5 Flash hit 1656 Elo on GDPval-AA. Google's own flagship, Gemini 3.1 Pro, sits at 1314. The budget model beat the premium model by 342 points on real-work tasks.

Flash also posted 83.6% on MCP Atlas, ahead of both Opus 4.7 and GPT-5.5 at the time. Meanwhile the Pro tier had been sitting stale since February. Read that as a strategy signal: Google is iterating fastest where the price-performance curve is steepest, and letting the flagship wait for its next big swing. If you standardized on "the Pro tier is the good one," May broke that assumption.

May 28: Opus 4.8 Takes the Ceiling

Anthropic closed the month by shipping Claude Opus 4.8, which took the overall GDPval-AA lead at 1890 Elo. That is 121 points ahead of second-place GPT-5.5 and 576 points clear of Gemini 3.1 Pro, the largest gap Anthropic had published, at the same $5/$25 pricing as before. No price increase for the new ceiling.
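
For a rough sense of what an Elo gap means head to head, the sketch below applies the conventional logistic Elo formula. It assumes GDPval-AA uses the standard 400-point scale, which this post does not confirm, so treat the outputs as an interpretation aid, not as published win rates. The function name is ours.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    Assumption: a 400-point logistic scale. If the leaderboard uses a
    different scale, pass it in via `scale`.
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


if __name__ == "__main__":
    # Opus 4.8 (1890) vs Gemini 3.1 Pro (1314), a 576-point gap.
    print(round(expected_win_rate(1890, 1314), 3))
    # Opus 4.8 (1890) vs GPT-5.5 (1769), a 121-point gap.
    print(round(expected_win_rate(1890, 1769), 3))
```

Under that assumption, a 576-point gap implies the higher-rated model is preferred in roughly 96 of 100 pairwise comparisons, while a 121-point gap works out to about two in three.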

The Board, End of May

#	Model	GDPval-AA Elo
1	Claude Opus 4.8	1890
2	GPT-5.5	1769
3	Gemini 3.5 Flash	1656
4	Grok 4.3	1500
5	Gemini 3.1 Pro	1314
6	Grok 4.20	1179
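
The gaps quoted in this post can be rechecked directly from the table. A minimal sketch follows; the Elo values are copied from the table above, and the names `BOARD`, `gap`, and `adjacent_gaps` are ours.

```python
# Elo values copied from the end-of-May table.
BOARD = {
    "Claude Opus 4.8": 1890,
    "GPT-5.5": 1769,
    "Gemini 3.5 Flash": 1656,
    "Grok 4.3": 1500,
    "Gemini 3.1 Pro": 1314,
    "Grok 4.20": 1179,
}


def gap(a: str, b: str) -> int:
    """Elo of model a minus Elo of model b."""
    return BOARD[a] - BOARD[b]


def adjacent_gaps(board: dict) -> list:
    """(higher, lower, difference) for each neighboring pair, best first."""
    ranked = sorted(board.items(), key=lambda kv: kv[1], reverse=True)
    return [(hi[0], lo[0], hi[1] - lo[1]) for hi, lo in zip(ranked, ranked[1:])]


if __name__ == "__main__":
    print(gap("Claude Opus 4.8", "Gemini 3.1 Pro"))   # 576
    print(gap("Grok 4.3", "Grok 4.20"))               # 321
    print(gap("Gemini 3.5 Flash", "Gemini 3.1 Pro"))  # 342
    for higher, lower, diff in adjacent_gaps(BOARD):
        print(f"{higher} over {lower}: {diff}")
```

The adjacent gaps come out at 121, 113, 156, 186, and 135 points, so the step from one rank to the next is far smaller than the headline 576.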

Fig. 1: GDPval-AA Elo, end of May 2026. Source: Artificial Analysis GDPval-AA leaderboard.

What the Reshuffle Means

Three labs moving in one month on the same benchmark tells you where the optimization pressure went. Every lab is now tuning for work-shaped evals, not exam-shaped ones. GDPval is the scoreboard the releases are aimed at.

The second signal is fragmentation. Price-performance is splitting into distinct strategies. Grok and Gemini Flash are buying Elo per dollar at the bottom and middle of the curve. Anthropic is selling the ceiling. The board is no longer one race; it is a curve with multiple defensible points on it.

Which changes the buyer's question. It stopped being "which model is smartest" and became "which point on the price-capability curve does this workflow need." Your support queue probably does not need 1890 Elo. Your contract analysis probably should not run on the cheapest point. Matching workflows to curve points is now the actual decision.
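
One way to frame that decision in code is to pick the cheapest model that clears a minimum Elo for each workflow. The sketch below uses only the two models whose prices appear in this post. The Elo floors per workflow are hypothetical placeholders, and treating the $5/$25 figure as USD per million input/output tokens is our assumption, by analogy with the Grok pricing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelPoint:
    name: str
    elo: int
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens


# Only models with both an Elo and a price quoted in this post.
# Opus prices are assumed to be per million tokens, like Grok's.
CURVE = [
    ModelPoint("Grok 4.3", 1500, 1.25, 2.50),
    ModelPoint("Claude Opus 4.8", 1890, 5.00, 25.00),
]

# Hypothetical Elo floors, for illustration only.
WORKFLOW_FLOORS = {"support queue": 1400, "contract analysis": 1800}


def cheapest_meeting(floor_elo: int, curve=CURVE) -> Optional[ModelPoint]:
    """Lowest output-price model whose Elo is at or above the floor."""
    eligible = [m for m in curve if m.elo >= floor_elo]
    return min(eligible, key=lambda m: m.output_price, default=None)


if __name__ == "__main__":
    for workflow, floor in WORKFLOW_FLOORS.items():
        choice = cheapest_meeting(floor)
        print(workflow, "->", choice.name if choice else "no model clears the floor")
```

Ranking by output price is a simplification. A real comparison would weight input and output prices by the workflow's actual token mix, and the floors themselves have to come from testing on your own deliverables.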

The Number That Still Beats All of These

One more thing before you reorganize your stack around this table. The gap between any two adjacent models on it is smaller than the gap between a perfect brief and the brief your team actually writes. GDPval hands every model a complete, well-specified task. Your company does not. Specification is still where the Elo gets lost.

Brief beats model.

Update, June 2026: the table did not stay still. Fable 5 reset the ceiling again. Read the June update.


Sources

1. Artificial Analysis, "xAI launches Grok 4.3," May 2026. GDPval-AA 1500 Elo, pricing, and the first-tier reasoning caveat.

2. MarkTechPost, "Gemini 3.5 Flash at I/O 2026," May 19, 2026. GDPval-AA 1656 Elo and MCP Atlas 83.6%.

3. Vellum, "Claude Opus 4.8 benchmarks explained," May 2026. GDPval-AA 1890 Elo and the 576-point gap over Gemini 3.1 Pro.

4. 302.AI, "Grok 4.3: xAI gives up on being the 'best' model, and that's a good thing," May 2026.

John Tan

Founder and CEO of nativefirst.ai. Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.