TL;DR
  • Claude Sonnet 4.6 is the new GDPval-AA leader at 1633 Elo (as of late February 2026), inside Opus 4.6's confidence interval, at $3/$15 per million tokens. A mid-tier-priced model holds the work-eval crown.
  • Opus 4.6 retook #1 on February 5 at 1606, roughly 150 Elo clear of GPT-5.2 and winning about 70% of head-to-head matchups. Anthropic now owns both top slots.
  • Gemini 3.1 Pro exposed the exam/work split: 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2, yet only 1314 Elo on GDPval-AA, last among the major labs on real deliverables.

February was the month the GDPval leaderboard stopped being a flagship-pricing story. Anthropic shipped twice, took both top slots, and the second model to land, the cheaper one, is the one sitting at #1.

Update, June 2026: the reshuffles kept coming. May rearranged the whole board.

February 5: Opus 4.6 Retakes the Lead

Anthropic released Claude Opus 4.6 on February 5, and Artificial Analysis scored it at 1606 Elo on GDPval-AA, retaking #1. That put it roughly 150 Elo ahead of GPT-5.2, which translates to about a 70% win rate when expert graders compare deliverables head to head.

Seven out of ten times, the professionals grading the work picked the Opus output. On real deliverables: briefs, models, decks, analyses. Not exam answers.
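For a feel of where the 70% comes from, here is a minimal sketch. It assumes GDPval-AA ratings follow the standard logistic Elo curve with a 400-point scale. That is an assumption, not something the Artificial Analysis writeup states, though it is consistent with the reported pairing of a 150-point gap and a roughly 70% win rate. The function name is illustrative.

```python
def elo_expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected head-to-head win rate for the higher-rated model.

    Assumes the standard logistic Elo formula; the 400-point scale is
    the conventional default, not a confirmed GDPval-AA parameter.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Opus 4.6 (1606) vs. GPT-5.2 (~1456), figures from the article.
opus_vs_gpt = elo_expected_win_rate(1606 - 1456)
print(f"150-point gap: {opus_vs_gpt:.1%} expected win rate")
```

Under the same assumption, a gap of zero gives exactly 50%, and small gaps stay close to a coin flip.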

February 17: Sonnet 4.6 Takes It From Its Own Sibling

Twelve days later, Anthropic did it again. Claude Sonnet 4.6 landed at 1633 Elo, a new #1, within Opus 4.6's 95% confidence interval. Statistically the two are neck and neck. Commercially they are not: Sonnet 4.6 is priced at $3 input / $15 output per million tokens, mid-tier money for top-of-board output.

One detail in the Artificial Analysis writeup deserves more attention than it got. Sonnet 4.6 spent 280 million tokens completing the eval. Sonnet 4.5 spent 58 million on the same benchmark. Nearly five times the thinking for the same task set. The model is not just smarter; it deliberates longer before it answers. Thinking harder is part of the product now, and it shows up on your token bill.

"A mid-tier-priced model at the top of the work leaderboard means the capability question and the pricing question have come apart."

That same week, on February 17, xAI pushed Grok 4.20 into beta. It had not been scored by month end, so its board impact is a March story.

February 19: Gemini 3.1 Pro and the Exam/Work Split

Google released Gemini 3.1 Pro on February 19 with genuinely strong exam-style numbers: 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2. On GDPval-AA it scored 1314, last among the major labs.

That is the cleanest demonstration yet of the gap this series keeps hammering: acing graduate-level science questions and producing work a professional would sign off on are different capabilities. Gemini 3.1 Pro is the exam/work split in a single model. If you pick models off exam benchmarks, this is the month that habit officially stopped being defensible.

The Board, End of February

#   Model                    GDPval-AA Elo
1   Claude Sonnet 4.6        1633
2   Claude Opus 4.6          1606
3   GPT-5.2                  ~1456
4   Gemini 3.1 Pro           1314
5   GLM-4.7 (open leader)    1224
Fig. 1: GDPval-AA Elo, late February 2026. Scale starts at 1000. The crown moved to mid-tier pricing; Anthropic holds both top slots.
Source: Artificial Analysis GDPval-AA leaderboard, late February 2026. Elo ratings drift as new matchups land.

GLM-4.7 at 1224 deserves its line: it is the open-weights leader, and it sits just 90 Elo behind Gemini 3.1 Pro. The open-versus-closed gap on real work is narrower at the bottom of the majors than the marketing suggests.

What February Actually Changed

Two things, and both matter if you are deploying rather than spectating.

First, the work-eval crown is no longer a flagship-pricing story. The best model on real deliverables, as of late February, costs $3/$15 per million tokens. You no longer have to choose between the board leader and your inference budget. The caveat is the token count: a model that spends nearly five times the tokens on the same tasks gives back part of its own discount. Price the task, not the rate card.
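To make "price the task" concrete, here is a rough sketch that bounds what the 280 million eval tokens would cost at the $3/$15 rate card. The split between input and output tokens is not given in the source, so the code only computes the two extremes (all input, all output). It also assumes plain list pricing with no caching or batch discounts, and the function name is illustrative.

```python
def run_cost_usd(input_mtok: float, output_mtok: float,
                 input_price: float, output_price: float) -> float:
    """Cost of a run, given token counts in millions and $/million rates."""
    return input_mtok * input_price + output_mtok * output_price


# Figures from the article: Sonnet 4.6 at $3 in / $15 out per million
# tokens, 280M tokens on the eval vs. 58M for Sonnet 4.5.
IN_PRICE, OUT_PRICE = 3.0, 15.0
SONNET_46_MTOK, SONNET_45_MTOK = 280.0, 58.0

# The input/output split is unknown, so bound the bill from both sides.
lower_bound = run_cost_usd(SONNET_46_MTOK, 0.0, IN_PRICE, OUT_PRICE)
upper_bound = run_cost_usd(0.0, SONNET_46_MTOK, IN_PRICE, OUT_PRICE)

# How many times more tokens the newer model spent on the same task set.
token_multiplier = SONNET_46_MTOK / SONNET_45_MTOK

print(f"Eval cost between ${lower_bound:,.0f} and ${upper_bound:,.0f}")
print(f"Token multiplier vs. Sonnet 4.5: {token_multiplier:.2f}x")
```

The multiplier is the number to carry into your own math: at the same input/output mix, a rival that uses the older token budget would have to charge more than that multiple per token before the heavier thinker comes out cheaper per task.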

Second, exam brilliance does not transfer to deliverables. Gemini 3.1 Pro proved it in one release: top-tier GPQA, bottom-of-the-majors GDPval. If your model evaluation process still leads with academic benchmarks, February is the evidence to retire it.

The right move is the same one this series always lands on: benchmark the models against your own deliverables, not anyone's leaderboard. The board tells you who to shortlist. Your work tells you who to deploy.
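As a starting point for that kind of in-house comparison, here is a minimal sketch that tallies blind head-to-head judgments on your own deliverables into a win rate and an implied Elo gap. The model names and judgments are placeholder data, the function names are illustrative, and the Elo conversion assumes the standard logistic curve with a 400-point scale rather than any confirmed GDPval-AA methodology.

```python
import math


def win_rate(judgments, model):
    """Share of head-to-head comparisons that `model` won.

    `judgments` is a list of (winner, loser) model-name pairs, one per
    graded comparison. Comparisons not involving `model` are ignored.
    """
    played = [(w, l) for w, l in judgments if model in (w, l)]
    if not played:
        raise ValueError(f"no comparisons involve {model!r}")
    wins = sum(1 for w, _ in played if w == model)
    return wins / len(played)


def implied_elo_gap(p, scale=400.0):
    """Elo gap implied by win rate `p` under the standard logistic curve."""
    if not 0.0 < p < 1.0:
        raise ValueError("win rate must be strictly between 0 and 1")
    return scale * math.log10(p / (1.0 - p))


# Placeholder data: reviewers preferred model_a in 7 of 10 comparisons.
judgments = [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3

rate_a = win_rate(judgments, "model_a")
gap_a = implied_elo_gap(rate_a)
print(f"model_a win rate: {rate_a:.0%}, implied Elo gap: {gap_a:.0f}")
```

Ten comparisons is far too few to trust; the point is the shape of the exercise, with real reviewers judging real work without knowing which model produced it.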

Cheap just won.

Sources
1. Artificial Analysis, "Opus 4.6 takes lead in agentic real-world knowledge tasks," February 2026. Opus 4.6 at 1606 Elo, ~150 over GPT-5.2, ~70% head-to-head win rate.
2. Artificial Analysis, Claude Sonnet 4.6 GDPval analysis, February 2026. Sonnet 4.6 at 1633 Elo; 280M tokens spent on the eval vs. Sonnet 4.5's 58M.
3. Anthropic, Claude Opus 4.6 announcement, February 5, 2026.
4. Google, Gemini 3.1 Pro release, February 19, 2026. GPQA Diamond 94.3%, ARC-AGI-2 77.1%.
John Tan

Fractional Chief of AI at nativefirst.ai. Former YC CEO (Depict). Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.