TL;DR
  • GDPval measures AI on real professional deliverables, not exam questions. Tasks come from working professionals across 44 occupations and are graded blind by experts from those fields.
  • Frontier models now hit expert parity on well-specified work. Claude Fable 5 leads the June 2026 GDPval-AA leaderboard at 1932 Elo, on the benchmark OpenAI invented.
  • The score hides the hard part: GDPval hands the model a perfect brief. Your company does not write perfect briefs. The constraint is specification, not capability.
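Fig. 1: How a GDPval score gets made
[Diagram: a real work task from professionals (avg. 14 years' experience) yields a model output and a human expert output. A blind grader who can't tell which is which compares them, and the result feeds an Elo score. The catch: the task arrives as a perfect brief.]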
Expert parity means the grader could not tell. It does not mean your brief was this clean.

GDPval is OpenAI's benchmark for measuring AI performance on real-world, economically valuable work. Instead of trivia questions or math puzzles, it uses actual professional deliverables: financial models, legal briefs, marketing plans, engineering analyses. The tasks are drawn from the occupations that contribute most to GDP, created by experienced professionals from those fields, and graded by other experts who compare AI output against human expert output in blind head-to-head comparisons. OpenAI launched it in 2025.

If you run a company, GDPval is the benchmark worth understanding, because it is the first one that tries to answer the question you actually care about: can this model do the work my people do? Here is how it works, where the leaderboard stands, and what the scores quietly leave out.

How GDPval works

The full GDPval set contains 1,320 tasks spanning 44 occupations across 9 sectors. The construction is the interesting part. OpenAI did not write the tasks. They recruited working professionals averaging 14 years of experience each, and asked them to contribute real deliverables from their jobs: the spreadsheet a financial analyst actually builds, the brief a lawyer actually drafts, the care plan a nurse actually writes. Many tasks include reference files, context documents, and multi-file outputs. These are not prompts. They are work.

Grading is blind comparison. A model produces a deliverable. An expert grader from that occupation receives two versions of the work, one from the AI and one from a human professional, without knowing which is which. The grader picks the better one, or rates them at parity. When people say a model achieves "expert parity" on GDPval, this is what they mean: blind expert graders rated the AI's deliverable as good as or better than the human's.
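To make the parity number concrete, here is a minimal sketch of how a win-or-tie rate could be tallied from blind verdicts. The verdict labels, the `parity_rate` helper, and the sample data are all assumptions for illustration; this is not OpenAI's grading code.

```python
from collections import Counter

# Hypothetical verdict labels: how the blind grader rated the AI deliverable
# relative to the human one. The names are assumptions, not GDPval's schema.
AI_BETTER, TIE, HUMAN_BETTER = "ai_better", "tie", "human_better"


def parity_rate(verdicts):
    """Share of comparisons where the AI deliverable was rated as good or better."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return (counts[AI_BETTER] + counts[TIE]) / len(verdicts)


# Made-up verdicts for ten tasks, not real GDPval data.
sample = [AI_BETTER] * 5 + [TIE] * 3 + [HUMAN_BETTER] * 2
print(f"{parity_rate(sample):.0%}")  # 80%
```

Ties count toward parity here, which matches the "as good as or better" wording. A stricter metric that counts only outright wins would give a lower number.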

That design choice matters. Exam-style benchmarks measure whether a model knows things. GDPval measures whether a model produces things a professional would sign off on. Those are different skills, and the second one is the one with a price tag attached.

The leaderboard, June 2026

Artificial Analysis runs GDPval as a continuous Elo leaderboard, GDPval-AA, where model outputs are pitted against each other in expert-judged comparisons. Here is where it stands.

Model             Lab         GDPval-AA Elo
Claude Fable 5    Anthropic   1932
Claude Opus 4.8   Anthropic   1890
GPT-5.5           OpenAI      1769
Gemini 3.1 Pro    Google      1314

GDPval-AA Elo, Artificial Analysis, June 2026.
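For a feel of what those rating gaps mean, the sketch below turns Elo differences into expected head-to-head scores. It assumes GDPval-AA follows the standard Elo logistic formula with a 400-point scale, which the source does not confirm. The `expected_score` helper is illustrative.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Standard Elo expected score of A against B (win = 1, tie = 0.5, loss = 0)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings from the table above (GDPval-AA, June 2026).
elo = {
    "Claude Fable 5": 1932,
    "Claude Opus 4.8": 1890,
    "GPT-5.5": 1769,
    "Gemini 3.1 Pro": 1314,
}

leader = "Claude Fable 5"
for name, rating in elo.items():
    if name != leader:
        # Assumption: the 400-point logistic scale applies to this leaderboard.
        p = expected_score(elo[leader], rating)
        print(f"{leader} vs {name}: {p:.2f}")
```

Under that assumption, a 42-point lead works out to an expected score of roughly 0.56, and a 163-point lead to roughly 0.72. These are model-versus-model expectations, not win rates against human professionals.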

Two things jump out. First, Anthropic leads the benchmark category that OpenAI invented, with both of its frontier models ahead of GPT-5.5. Second, the absolute level: OpenAI's own reported number for GPT-5.5 was 84.9% expert parity at its April 2026 launch. By OpenAI's measure, the third-place model on this table wins or ties against experienced human professionals about 85% of the time on these tasks. The two models ranked above it on Elo would be expected to do at least as well, though the table does not report their parity rates.

What the score means

Expert parity on GDPval means something specific and narrow: on a well-specified deliverable, with the inputs provided and the goal defined, a frontier model produces work that a blind expert grader rates as good as a 14-year professional's. And it does so at roughly 100x the speed and a fraction of the cost.

This is why GDPval displaced exam-style evals as the industry default. When models saturated the academic benchmarks, the question shifted from "how smart is it" to "what work can it do," and GDPval was the first serious attempt to measure that. By January 2026, Artificial Analysis had rebuilt its entire Intelligence Index around work-shaped evals. The industry now scores models the way you would score a hire: on output, against a professional baseline.

What it hides

Three caveats before you reorganize your company around a leaderboard.

Selection bias. Most published results run on the gold subset, 220 of the 1,320 tasks. That subset is real work, but it is the cleaner end of real work: tasks that could be packaged with complete inputs and unambiguous grading criteria. The messier 1,100 tell a less tidy story.

Grader noise. Inter-rater agreement among the expert graders runs around 70%. In other words, experts disagree with each other about what good work looks like about 30% of the time. A benchmark graded by judges who disagree that often has wide error bars, whatever the headline number says.
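A rough sense of the error bars can be had from sampling noise alone. The sketch below computes a Wilson score interval for a parity rate, assuming (the source does not say this) that an 84.9% figure was measured on the 220-task gold subset with one independent verdict per task. Grader disagreement would add uncertainty on top of this.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: about 187 of 220 comparisons were wins or ties (roughly 85%),
# each graded once and independently. These counts are illustrative.
low, high = wilson_interval(187, 220)
print(f"{low:.1%} to {high:.1%}")
```

With these illustrative counts the interval spans roughly 80% to 89%. That is about nine points of uncertainty before any grader disagreement is counted.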

The Meeting Problem. GDPval measures artifact creation. It does not measure the coordination work around artifacts: the stakeholder alignment, the "actually, legal needs to see this first," the three rounds of feedback that change the goal mid-flight. Most knowledge work is not the deliverable. It is the negotiation that surrounds the deliverable.

And the big one: GDPval hands the model a perfect brief. Every task arrives with complete context, clean inputs, and a defined deliverable. Your company does not write perfect briefs. Your briefs are a Slack thread, two contradicting stakeholders, and a deck from last quarter. The benchmark measures the model under conditions your organization has never once produced.

What a CEO should take from this

The honest read is this: the model is ready for expert-level deliverable work today. That part is settled. What GDPval cannot tell you is whether your company can feed it. The binding constraint is no longer capability. It is specification and judgment: who turns your messy reality into the well-specified task the model executes at parity, and who verifies the output is the right work, not just good work.

That is an org problem, not a model problem. And it has a sharp corollary: every Elo point the models gain raises the value of the person who can write the brief. The operator who can take a vague business need and shape it into a task with defined inputs and a checkable goal is the multiplier on everything above 1900 on that table. Waiting for a better model does not fix a specification gap. It widens it.
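One way to picture "defined inputs and a checkable goal" is as a checklist that a brief either fills in or does not. The sketch below is purely illustrative: the `TaskBrief` class and its field names are hypothetical, not part of GDPval or any tool.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBrief:
    """Hypothetical shape of a well-specified task; field names are illustrative."""
    deliverable: str = ""      # what artifact should come back
    inputs: list = field(default_factory=list)  # files or data the task depends on
    success_check: str = ""    # how a reviewer decides the output is right
    owner: str = ""            # who signs off on the result

    def gaps(self):
        """Return the names of fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# A brief that names only the deliverable, as a Slack thread often does.
slack_thread_brief = TaskBrief(deliverable="Q3 pricing deck")
print(slack_thread_brief.gaps())  # ['inputs', 'success_check', 'owner']
```

An empty `gaps()` list does not make a brief good. It only shows that someone has answered the questions a benchmark task answers up front.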

The benchmark says the work is ready to move. The brief is on you.

Your work, actually benchmarked.

The Diagnostic is free: 30–45 minutes. We'll pick one function and measure Fable 5 against your real deliverables, not OpenAI's task set.

Sources
1. OpenAI, "GDPval: Evaluating AI on real-world, economically valuable tasks," 2025. Benchmark design, 1,320 tasks, 44 occupations, 9 sectors, expert grading methodology.
2. Artificial Analysis, GDPval-AA leaderboard, June 2026. Elo scores: Claude Fable 5 (1932), Claude Opus 4.8 (1890), GPT-5.5 (1769), Gemini 3.1 Pro (1314).
3. OpenAI, "Introducing GPT-5.5," April 23, 2026. Reported 84.9% expert parity on GDPval.
4. Empiric Crafting, "Evaluating GDPval." On the 220-task gold subset, grader agreement, and selection effects.
5. ProMarket, April 29, 2026. On the limits of artifact-based benchmarks for measuring economic impact.
John Tan

Fractional Chief of AI at nativefirst.ai. Former YC CEO (Depict). Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.