- GPT-5.4 moved to the top of GDPval-AA at 1674 Elo, 41 points above Claude Sonnet 4.6 (1633), as of late March 2026. The margin is within drift range, and Artificial Analysis published no announcement of a new #1.
- Grok 4.20 went GA mid-month at roughly 1179 Elo, mid-table on work tasks but leading IFBench for instruction following.
- Three models from two labs sit within 70 Elo at the top. The work-benchmark race has no runaway winner, so the buying decision shifts to price, context, and behavior inside your workflows.
Update, June 2026: April brought GPT-5.5 and Opus 4.7, which the April edition covers; by June, Fable 5 had reset the ceiling.
February belonged to Anthropic. March belonged to nobody, and that is the story.
On March 5, OpenAI shipped GPT-5.4. Within days it landed at 1674 Elo on GDPval-AA, Artificial Analysis's Elo ranking for real-world work deliverables, 41 points above Claude Sonnet 4.6's 1633. That put GPT-5.4 at the top of the table.
Note the phrasing. Top of the table, not crowned champion. Artificial Analysis never published a "GPT-5.4 takes #1" piece, and for good reason: Elo ratings on GDPval-AA drift as pairwise comparisons accumulate, and 41 points sits within that drift range. The two ratings overlap once you account for confidence intervals. If a vendor deck shows you this leaderboard without error bars, ask why.
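To put the 41-point gap in perspective, here is a minimal sketch that converts an Elo gap into an expected head-to-head win rate. It assumes GDPval-AA ratings follow the conventional Elo logistic curve on a 400-point scale; that is an assumption, not something Artificial Analysis is quoted as confirming here, and the function name is hypothetical.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected score of the higher-rated model.

    Assumes the conventional Elo logistic curve with a 400-point scale.
    Whether GDPval-AA uses exactly this scale is an assumption.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Ratings taken from the late-March snapshot in this article.
gpt_5_4 = 1674
sonnet_4_6 = 1633

gap = gpt_5_4 - sonnet_4_6  # 41 points
print(f"Gap: {gap} Elo, expected win rate: {expected_win_rate(gap):.1%}")
```

Under that assumption, a 41-point lead works out to the higher-rated model being preferred in roughly 56% of pairwise comparisons, which is close enough to a coin flip that a modest shift in either rating changes the picture.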
The Late-March Leaderboard
Here is the GDPval-AA board as of late March 2026. Ratings drift as new comparisons land, so treat these as a snapshot, not scripture.
| # | Model | GDPval-AA Elo |
|---|---|---|
| 1 | GPT-5.4 | 1674 |
| 2 | Claude Sonnet 4.6 | 1633 |
| 3 | Claude Opus 4.6 | 1606 |
| 4 | GPT-5.2 | ~1467 |
| 5 | Gemini 3.1 Pro | 1314 |
| 6 | GLM-4.7 (open weights) | 1224 |
| 7 | Grok 4.20 | 1179 |
Grok 4.20: Mid-Table, Different Bet
The other March release was xAI's Grok 4.20, GA on March 10. On GDPval-AA it landed around 1179 Elo, bottom of this board. That undersells what xAI shipped, though: Grok 4.20 leads IFBench, the instruction-following benchmark, which measures something GDPval does not. A model that follows instructions precisely is a different tool than a model that produces expert-grade deliverables, and some workflows want the former.
On the work-eval story specifically, xAI's real move was still ahead of it. The jump came later, with Grok 4.3 in May. In March, mid-table was the honest read.
What a 70-Point Cluster Means for Buyers
Step back from the individual scores and look at the shape of the board. OpenAI, Anthropic, and Anthropic again, all within 70 Elo of each other at the top: 1674 down to 1606, a 68-point spread. After February's Anthropic sweep, GPT-5.4 pulled OpenAI back level within five weeks. The work-benchmark race has no runaway winner.
That changes what the leaderboard is for. When one model led by hundreds of points, the leaderboard made your decision. When three models cluster within drift range, it cannot. The differentiator shifts to the things the leaderboard does not measure: price per task, context window against your actual document sizes, latency in your actual pipeline, and above all how each model behaves inside your workflows, on your deliverables, with your data.
A top-three cluster this tight is exactly when vendor marketing gets loudest and least meaningful.
Every lab can now produce a chart where its model wins something. All of the charts are technically true. None of them tell you which model writes your proposals, triages your tickets, or builds your financial models best. The only benchmark that answers that is the one you run on your own work.
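If you do run that comparison yourself, the same error-bar discipline applies to your own numbers. The sketch below is one way to do it, not a prescribed method: it takes blind head-to-head preferences between two models on your deliverables and puts a 95% Wilson score interval around the win rate. The tallies are hypothetical placeholders, ties are assumed to be dropped before counting, and the function name is illustrative.

```python
import math


def wilson_interval(wins: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = wins / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical tally: model A preferred on 28 of 50 blind comparisons.
wins_a, comparisons = 28, 50
low, high = wilson_interval(wins_a, comparisons)

print(f"Win rate: {wins_a / comparisons:.0%}, 95% interval: {low:.0%} to {high:.0%}")
if low <= 0.5 <= high:
    print("Interval includes 50%: these comparisons do not separate the two models.")
```

With a sample this small the interval straddles 50%, so the honest conclusion is "not yet separable" and the decision falls back to price, context, and latency until more comparisons are in.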
The race is tied.
Your work, actually benchmarked.
The Diagnostic is free: 30–45 minutes. We'll measure the frontier models against your real deliverables.