- GPT-5.5 launched April 23 at 84.9% GDPval expert parity, the headline number of the month. Price doubled to $5/$30 per million tokens.
- OpenAI also shipped Expert-SWE at 73.1%, an internal eval of 20-hour expert tasks. The tell for where evals go next: longer-horizon real work.
- ProMarket called the consulting shift in print: GDPval-class capability lands first on analyst-tier work. Supervising AI beats operating it.
- Grok 4.3 entered beta April 17. At the time, Grok 4.20 sat at 1179 Elo, bottom of the major-lab pack. May would change that.
This is the monthly GDPval recap for April 2026. One launch dominated the month, one essay named what the launch means for knowledge work, and one lab quietly set up a comeback.
April was the month expert parity stopped being a forecast and became a spec sheet line.
April 23: GPT-5.5 Lands at 84.9%
OpenAI shipped GPT-5.5 on April 23 with one number doing all the work: 84.9% expert parity on GDPval, up from GPT-5.4. On OpenAI's own benchmark of real-world deliverables, graded blind by experienced professionals, the model now matches or beats the human expert on roughly 85 of every 100 tasks.
The price moved with the capability. GPT-5.5 launched at $5 per million input tokens and $30 per million output tokens, double the previous tier. OpenAI is signaling that frontier work-grade output is worth frontier money, and the market mostly agreed.
The quieter release in the same announcement is the more interesting one. OpenAI shipped Expert-SWE at 73.1%, an internal eval built from 20-hour expert tasks. Read that as a roadmap, not a footnote. GDPval measures discrete deliverables. Expert-SWE measures sustained, multi-day expert work. That is where OpenAI thinks evals go next: longer horizons, messier scope, real projects instead of single artifacts.
What 84.9% Means, and What It Does Not
The claim is precise and worth stating precisely. GDPval hands the model a well-specified deliverable: full context, clear ask, defined output. Blind expert graders then compare the model's work against a professional's. At 84.9%, the model reaches parity on that setup.
That setup is not how work arrives in a company. Nobody hands your team a perfect brief. The ask is half-formed, the context lives in six heads and two Slack channels, and the success criteria get negotiated after the draft exists. GDPval measures what the model can do once the brief is perfect. Getting the brief to perfect is the part the benchmark assumes away.
So 84.9% is real, and it is bounded. Both things are true.
April 29: The Economists Notice
Six days after the GPT-5.5 launch, ProMarket published "AI Is Coming for the Economic Consulting Industry" by Birjandi and Zadeh. It is the first mainstream economics outlet to map GDPval-class capability onto a specific labor market, and the mapping is blunt: the capability lands first on analyst-tier work, the well-scoped production tasks that fill the bottom half of every professional services pyramid.
"The analysts who will remain competitive are those who can supervise AI systems, not merely operate them."
That line travels well beyond consulting. Operating a model means typing prompts. Supervising one means owning the spec, checking the output, and knowing when the model is confidently wrong. The first skill is already commoditized. The second is the new job description.
April 17: Grok 4.3 Enters Beta
xAI put Grok 4.3 into beta on April 17, with the full API following April 30. At the time it was easy to dismiss. Grok 4.20 sat at 1179 Elo on GDPval-AA, bottom of the major-lab pack, and a beta release does not move a leaderboard.
The May GA would change that. The reshuffle is covered in the May 2026 GDPval update; the short version is that April's last-place lab did not stay there.
The Leaderboard, End of April 2026
GDPval-AA standings as reported by Artificial Analysis at the end of the month. Where an exact April Elo is not verified, the cell is left blank rather than guessed.
| # | Model | GDPval-AA Elo |
|---|---|---|
| 1 | GPT-5.5 | 1769 |
| 2 | Claude Opus 4.7 (led Anthropic's entry, pre-4.8) | |
| 3 | Gemini 3.1 Pro | 1314 |
| 4 | Grok 4.20 | 1179 |
GPT-5.5 set the bar, and for a few weeks the bar held. Anthropic's answer (Opus 4.8, then Fable 5) is a May and June story.
The Operator's Read
April 2026 is the month expert parity became table stakes. Every frontier lab now ships work-grade output on well-specified tasks; the differences between them are real but shrinking, and they reshuffle monthly.
Which means the constraint moved. The binding limit on what AI does inside your company is no longer model capability. It is brief quality: how well your org turns its messy, half-spoken work into tasks a parity-grade model can execute. And brief quality is an org problem. No API upgrade fixes it. Someone has to sit inside the business, watch how the deliverables actually get made, and write the specs the model needs.
The models cleared the bar. Your briefs are the bar now.
Fix the brief.
Your work, actually benchmarked.
The Diagnostic is free: 30–45 minutes. We'll measure the frontier models against your real deliverables, not OpenAI's task set.
Book the Diagnostic →