- GPT-5.2 shipped December 11 and scored 70.9% win/tie vs professionals on OpenAI's own GDPval, up from GPT-5.1's 38.8%. It also posted 80.0% on SWE-bench Verified.
- Artificial Analysis launched GDPval-AA mid-December: an independent Elo leaderboard. Claude Opus 4.5 topped the launch board at roughly 1403; by late December GPT-5.2 (xhigh) leads at roughly 1456.
- GLM-4.7 is the open-weights leader at 1224 Elo. September was a paper. December made it a horse race, graded by a third party.
When OpenAI launched GDPval on September 25, it was a research paper with a provocative premise: grade AI models against human professionals on real work deliverables. The best score at launch belonged to Claude Opus 4.1, at 47.6% win/tie against experts. Interesting. Academic. Easy to file away.
December un-filed it. Two things happened this month that turned GDPval from a paper into the scoreboard everyone will be watching in 2026.
December 11: GPT-5.2 Nearly Doubles the Score
OpenAI shipped GPT-5.2 on December 11, and the headline number from its own GDPval run was startling: 70.9% win/tie against industry professionals, up from GPT-5.1's 38.8%. That is not an incremental bump. That is one model generation moving from "loses to the professional most of the time" to "wins or ties seven times out of ten."
For completeness, GPT-5.2 also posted 80.0% on SWE-bench Verified, joining the cluster of frontier models now sitting at 80 or above on the coding benchmark. But the coding number is the footnote this month. The work number is the story.
One caveat, and it is the caveat that sets up the second event: that 70.9% came from OpenAI grading OpenAI. The benchmark's inventor was also its only referee.
Mid-December: GDPval-AA Arrives
Then Artificial Analysis, the independent benchmarking firm, launched GDPval-AA: an Elo-style leaderboard built on GDPval-class tasks, run through its own agentic harness called Stirrup, with pairwise comparisons graded independently. For the first time, the labs do not mark their own homework on economically valuable work.
At launch, Claude Opus 4.5 (non-thinking) topped the board at roughly 1403 Elo. By late December, GPT-5.2 (xhigh) had taken the #1 slot at roughly 1456. And in a board update around December 30, GLM-4.7 became the open-weights leader at 1224 Elo. Here are those three figures side by side as the year closes; note that Opus 4.5's is its launch rating, not a year-end one:
| # | Model | GDPval-AA Elo |
|---|---|---|
| 1 | GPT-5.2 (xhigh) | ~1456 |
| 2 | Claude Opus 4.5 (launch rating, since re-rated) | ~1403 |
| 3 | GLM-4.7 (open-weights leader) | 1224 |
One reaction this month captured the mood. Dan McAteer, posting on X as the GDPval-AA board landed:
"GDPval is the most important LLM evaluation... Opus 4.5 is tops. Llama 4 Maverick is by far [worst]."
Why Independent Grading Changes Everything
Until this month, every GDPval number you saw came from OpenAI, which wrote the benchmark and also competes on it. OpenAI graded GPT-5.2's 70.9%. That does not make the number wrong, but it makes it unverifiable, and unverifiable numbers are useless for decisions involving money.
GDPval-AA fixes the incentive problem. An Elo system built on pairwise comparisons, run by a firm that sells neutrality, means a model's rating reflects how its work stacks up against every other model's work on the same tasks, judged by the same process. When the December board flipped from Opus 4.5 to GPT-5.2 in two weeks, that flip meant something. Self-graded homework, retired.
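To get a feel for what a rating gap means in a pairwise system, here is a minimal sketch. It assumes the classic Elo expected-score formula with a 400-point scale; how Artificial Analysis actually computes its ratings is not described in this post, so the function name `expected_score` and the `scale` parameter are illustrative assumptions, not the GDPval-AA implementation. The ratings are the approximate figures quoted above, and they carry different date stamps, so read the output as a rough intuition rather than a prediction.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under classic Elo (win=1, tie=0.5, loss=0).

    Assumption: standard logistic Elo with a 400-point scale. The real
    leaderboard may use a different scale or fitting method.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Approximate ratings quoted in this post (treated as comparable for illustration only).
ratings = {
    "GPT-5.2 (xhigh)": 1456,
    "Claude Opus 4.5": 1403,
    "GLM-4.7": 1224,
}

leader = "GPT-5.2 (xhigh)"
for model, rating in ratings.items():
    if model != leader:
        p = expected_score(ratings[leader], rating)
        print(f"{leader} vs {model}: expected score {p:.2f}")
```

Under these assumptions, a gap of about 50 points is a modest edge in any single head-to-head comparison, while a gap of more than 200 points is a large one.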
Read the Numbers With Date Stamps
One honesty note before you quote any of this in a board deck. Elo ratings are not fixed scores. They drift as new comparisons accumulate and new models enter the pool. Opus 4.5's ~1403 was its launch rating and has already been re-rated. Every number in this post should carry the qualifier "as of late December 2025," and the board will look different in a month. That is not a flaw. That is what a live leaderboard is.
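The drift is easy to see in a toy simulation. The sketch below assumes the textbook sequential Elo update with a K-factor of 32; both are assumptions for illustration, since this post does not say how GDPval-AA re-rates models, and live leaderboards often refit ratings over the whole comparison history instead. The function `elo_update` and the list of outcomes are hypothetical.

```python
def elo_update(rating: float, opponent: float, score: float,
               k: float = 32.0, scale: float = 400.0) -> float:
    """Return the new rating after one comparison (score: 1 win, 0.5 tie, 0 loss).

    Assumption: textbook sequential Elo with K=32 and a 400-point scale.
    """
    expected = 1.0 / (1.0 + 10 ** ((opponent - rating) / scale))
    return rating + k * (score - expected)


rating = 1403.0    # a launch rating, as quoted in this post
newcomer = 1456.0  # a stronger entrant; held fixed here for simplicity
outcomes = [0, 1, 0, 0.5, 0]  # hypothetical results from the first model's side

history = [rating]
for score in outcomes:
    rating = elo_update(rating, newcomer, score)
    history.append(rating)

# The rating moves after every comparison, which is why each number needs a date.
print([round(r) for r in history])
```

Five made-up comparisons are enough to move the rating, which is the practical reason to quote an Elo figure together with the date you read it.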
What This Means for Your Company
The December takeaway is not which model is #1. It is that the question "can AI do real professional work?" now has a public, independently graded, continuously updated answer. The scores say yes, increasingly. But GDPval-AA measures models on well-specified tasks with full context. Your company's work does not arrive that way. The gap between a 1456 Elo and your Monday morning is specification, and specification is an operator problem, not a model problem.
The race is on.
Your work, actually benchmarked.
The Diagnostic is free: 30–45 minutes. We'll measure the frontier models against your real deliverables.
Book the Diagnostic →
Update, June 2026: the race never slowed. The June board looks nothing like this one.