GDPval Update (June 2026): The Benchmark That Actually Matters

TL;DR

Claude Fable 5 leads GDPval-AA at 1932 Elo, ahead of Opus 4.8 (1890), GPT-5.5 (1769), and Gemini 3.1 Pro (1314). Anthropic now tops the benchmark OpenAI invented.
Frontier models are at expert parity on real deliverables. GPT-5.5 reported 84.9% expert parity in April, while working roughly 100x faster and 100x cheaper than human professionals.
The fine print matters: only 220 of 1,320 tasks are evaluated, graders agree just 70% of the time, and the benchmark hands the model a perfect brief. Your company does not.

On June 9, Anthropic shipped Claude Fable 5. Within a day, Artificial Analysis updated the leaderboard that matters most to anyone who runs a company: GDPval-AA. Fable 5 sits at the top with 1932 Elo.

Not a coding benchmark. Not a math olympiad. A benchmark made of actual work. If you track one capability number this year, track this one.

What Is GDPval?

GDPval is OpenAI's benchmark for measuring AI performance on real-world, economically valuable tasks across occupations. Instead of exam questions, it uses actual work deliverables: legal briefs, financial models, engineering analyses, sales decks. Each output is graded against the work of experienced professionals, by experienced professionals. It is the benchmark category that replaced exam-style evals. When Artificial Analysis rebuilt its industry-default Intelligence Index in January 2026, it rebuilt it around work-shaped benchmarks like GDPval, not academic ones.

That shift matters because exam scores stopped predicting anything useful around the time every frontier model started acing them. GDPval asks the question a CEO actually cares about: can this model produce the deliverable my team produces, at the quality my team produces it?

The June 2026 Leaderboard

Here is the current GDPval-AA standing, as scored by Artificial Analysis:

#	Model	GDPval-AA Elo
1	Claude Fable 5	1932
2	Claude Opus 4.8	1890
3	GPT-5.5	1769
4	Gemini 3.1 Pro	1314

Fig. 1: GDPval-AA Elo, June 2026. The ceiling moved again. The brief is still on you.

Source: Artificial Analysis GDPval-AA leaderboard, June 2026

Notice the irony. OpenAI invented GDPval to measure AI against economically valuable work. Anthropic now holds the top two slots on it. OpenAI's own number is still remarkable: GPT-5.5 reported 84.9% expert parity on GDPval at launch in April. But the benchmark category OpenAI created is currently a scoreboard for its competitor.
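If you want to read the gaps in that table as head-to-head odds, here is a minimal sketch. It assumes GDPval-AA ratings sit on the conventional Elo scale, where a 400-point gap means 10-to-1 odds. That scaling is my assumption, not something confirmed by Artificial Analysis, and `elo_expected_score` is a helper name I made up for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: the leaderboard uses the conventional 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings copied from the leaderboard above.
ratings = {
    "Claude Fable 5": 1932,
    "Claude Opus 4.8": 1890,
    "GPT-5.5": 1769,
    "Gemini 3.1 Pro": 1314,
}

leader = "Claude Fable 5"
for model, rating in ratings.items():
    if model == leader:
        continue
    p = elo_expected_score(ratings[leader], rating)
    print(f"{leader} vs {model}: {p:.0%} expected preference rate")
```

Under that assumption, the 42-point lead over Opus 4.8 works out to roughly a 56% expected preference rate, the lead over GPT-5.5 to roughly 72%, and the lead over Gemini 3.1 Pro to roughly 97%. A small Elo gap at the top is a lean, not a landslide.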

What the Numbers Actually Mean

Strip away the Elo and the claim underneath is simple: on well-specified deliverables, frontier models now match experienced professionals. And they do it roughly 100x faster and 100x cheaper.

That ratio is the story. A financial model that takes an analyst two days costs you two days of analyst salary. The same model from Fable 5 costs minutes and cents. ProMarket put it bluntly in April when examining what this does to economic consulting: AI is landing first on analyst-tier work, the well-scoped production tasks that fill the bottom half of every professional services pyramid.

"The analysts who will remain competitive are those who can supervise AI systems, not merely operate them."

That sentence generalizes well beyond consulting. The deliverable layer of knowledge work is now contested. The judgment layer is not. Yet.

The Fine Print

Before you reorganize your company around a leaderboard, read the critique. Empiric Crafting published the most careful evaluation of GDPval's methodology, and three problems stand out.

Selection bias. The headline numbers come from 220 evaluated tasks out of a 1,320-task set. That is a curated slice, and curated slices flatter models. The full distribution of real work is messier than the gold subset.

Grader noise. Inter-rater agreement between the human experts doing the grading is around 70%. When professionals disagree on what good looks like 30% of the time, "expert parity" has error bars wide enough to drive a quarter through.

The Meeting Problem. This is the big one. GDPval measures artifact creation: the brief, the model, the deck. But much of knowledge work is not the artifact. It is the social coordination around the artifact. The pre-reads, the alignment, the objection handling, the politics. AI makes the deck. It does not get the org to agree with the deck.

None of this invalidates the benchmark. It bounds it. GDPval tells you what models can do with a perfect brief and no organizational friction. That is a real and valuable measurement. It is also not Monday morning.
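To put a rough number on the sample-size point, here is a back-of-envelope sketch. It treats each task as an independent pass/fail trial and pairs the 84.9% figure with the 220-task count purely for illustration. I do not know that the reported score was computed on exactly those 220 tasks, so treat both the pairing and the `wald_interval` helper as assumptions.

```python
import math


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) interval for a proportion.

    Captures sampling error only; grader disagreement is not modeled.
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


# Figures from the text; pairing them is an assumption made for illustration.
parity = 0.849  # GPT-5.5 reported expert parity
n_tasks = 220   # evaluated tasks cited in the critique

low, high = wald_interval(parity, n_tasks)
print(f"95% interval from sampling alone: {low:.1%} to {high:.1%}")
```

Under those assumptions the interval runs from roughly 80% to 90%, before any grader noise is counted. The 70% inter-rater agreement would widen it further, which this sketch does not attempt to quantify.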

The Gap Between 1932 Elo and Your Office

Here is the part the leaderboard cannot show you. GDPval hands the model a complete, well-specified task: full context, clear deliverable, defined success criteria. Nobody in your company writes briefs like that. Not your PMs, not your VPs, not you.

The gap between benchmark parity and production value is specification, context, and judgment. That is an org problem, not a model problem. The model is ready. The brief is not.
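One way to make the brief gap concrete is to check a task for the three things a benchmark-style task comes with: context, a clear deliverable, and success criteria. The sketch below is hypothetical. `TaskBrief` and its field names are my own illustration, not a GDPval schema or anything the benchmark publishes.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBrief:
    """Hypothetical brief structure; field names are illustrative, not from GDPval."""

    deliverable: str = ""
    context: str = ""
    success_criteria: list[str] = field(default_factory=list)

    def missing_parts(self) -> list[str]:
        """Return the names of the parts that are still empty."""
        missing = []
        if not self.deliverable.strip():
            missing.append("deliverable")
        if not self.context.strip():
            missing.append("context")
        if not self.success_criteria:
            missing.append("success_criteria")
        return missing


# A typical one-line request: a deliverable and nothing else.
brief = TaskBrief(deliverable="Q3 pricing deck for the board")
print(brief.missing_parts())
```

A check like this only flags empty fields. Whether the context is the right context is still a judgment call, and that part stays with a person.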

Which means every GDPval point the models gain raises the value of one specific skill: turning your company's messy reality into well-specified work. Someone has to sit inside the business, see how the actual deliverables get made, and translate that into tasks a 1932-Elo model can execute. That is why I embed on-site instead of advising from a slide. The benchmark proved the capability. The operator closes the distance.

Briefs are the bottleneck.

Your work, actually benchmarked.

The Diagnostic is free: 30–45 minutes. We'll pick one function and measure what Fable 5 does against your real deliverables, not OpenAI's.


Sources

1. Artificial Analysis, GDPval-AA leaderboard, June 2026. Elo standings for Claude Fable 5, Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro.

2. OpenAI, "Introducing GPT-5.5," April 23, 2026. Reported 84.9% expert parity on GDPval.

3. Empiric Crafting, "Evaluating GDPval." On selection bias (220 of 1,320 tasks), 70% inter-rater agreement, and the Meeting Problem.

4. ProMarket, "AI Is Coming for the Economic Consulting Industry," April 29, 2026. On analyst-tier work and supervising vs. operating AI systems.

5. Anthropic, "Claude Fable 5 and Mythos 5," June 9, 2026.

John Tan

Founder and CEO of nativefirst.ai. Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.