TL;DR
  • Artificial Analysis shipped Intelligence Index v4.0 on January 6. Ten work-shaped evals in, three saturated exams out, and the top index score fell from roughly 73 to 50 or below. Same models, rebuilt ruler.
  • No frontier release moved either major leaderboard this month. Quiet month for launches, loud month for methodology.
  • SWE-bench Verified is out of road: an 80% cluster at the top (Opus 4.5 and 4.6 at 80.9, GPT-5.2 at 80.0) with nothing left to distinguish.

January 2026 shipped no frontier model worth a headline. It still managed to be the most consequential month for AI benchmarks in a year. On January 6, Artificial Analysis released version 4.0 of its Intelligence Index, the composite number that half the industry treats as the default ruler for model quality. The top score dropped from roughly 73 to 50 or below overnight. The models did not get worse. The ruler got rebuilt around real work.

What Changed in v4.0

The new index is built from ten evaluations, and the selection principle is work, not exams. In: GDPval-AA, Terminal-Bench Hard, tau²-Bench Telecom, SciCode, AA-LCR, AA-Omniscience, IFBench, HLE, GPQA Diamond, and CritPt. Out: MMLU-Pro, AIME 2025, and LiveCodeBench, all retired for the same reason. Frontier models had saturated them. When every model aces the exam, the exam stops measuring anything.

Look at what replaced them. GDPval-AA grades real occupational deliverables. Terminal-Bench Hard tests agentic work in a shell. tau²-Bench Telecom simulates a customer-service environment with tools and state. These are job-shaped tasks, and models do far worse on job-shaped tasks than on multiple choice. Hence the haircut of more than 20 points at the top of the board, from roughly 73 to 50 or below.
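To see why swapping constituents moves the headline number without any model changing, here is a minimal sketch. It assumes the composite is an equal-weighted mean of its constituent scores, which is an assumption for illustration and not a statement of Artificial Analysis's actual weighting. The eval names and scores are made up for one hypothetical model.

```python
from statistics import mean

def composite_index(scores: dict[str, float]) -> float:
    """Equal-weighted mean of constituent scores (0-100).

    Equal weighting is an assumption made for this sketch.
    """
    return mean(scores.values())

# Hypothetical scores for a single model. These are NOT real results.
model_scores = {
    "saturated_exam_a": 90.0,
    "saturated_exam_b": 94.0,
    "work_eval_a": 40.0,
    "work_eval_b": 36.0,
    "shared_eval": 70.0,
}

# Two baskets that differ only in which evals they include.
old_basket = ["saturated_exam_a", "saturated_exam_b", "shared_eval"]
new_basket = ["work_eval_a", "work_eval_b", "shared_eval"]

old_index = composite_index({name: model_scores[name] for name in old_basket})
new_index = composite_index({name: model_scores[name] for name in new_basket})

print(f"old basket: {old_index:.1f}")
print(f"new basket: {new_index:.1f}")
```

The model's per-eval scores are identical in both runs. Only the basket changed, which is the whole story of a score reset.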

Fig. 1. The ruler got rebuilt. Index v3 (MMLU-Pro, AIME 2025, LiveCodeBench; top score ~73) versus Index v4 (GDPval-AA, Terminal-Bench Hard, tau²-Bench; top score ≤50). Same models, honest ruler.
Three retired exams, ten work-shaped evals, twenty fewer points of flattery.

A Quiet Month, On Purpose

No frontier release moved either major board in January. No new top model, no shuffled podium. That made the month unusually clean for a different kind of story: the people who maintain the rulers spent it admitting the rulers had drifted. A quiet month for launches turned into a loud month for methodology.

The Scaffolding Question

The second methodology fight of the month is harder to resolve. Call it scaffolding inflation. The same model can score 81% on a coding benchmark when wrapped in a tuned agentic harness, and 69% standalone. Both numbers are real. Both get quoted. Which one is "the" score?

Vendors quote the scaffolded number, because it is bigger. Skeptics point out you are no longer benchmarking the model, you are benchmarking the model plus an engineering team's harness. The debate built all month across eval-tracking coverage, and it matters for buyers more than for researchers: the number on the vendor slide may describe a system you are not actually buying.
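One way to keep the two numbers from being confused is to never store a score without the harness that produced it. The sketch below is a hypothetical record format, not any leaderboard's schema. The model name, benchmark name, and harness label are placeholders, and the 81 and 69 figures are the ones quoted above.

```python
from dataclasses import dataclass
from typing import Optional

STANDALONE = "standalone"  # label for a run with no agentic scaffold

@dataclass(frozen=True)
class BenchmarkResult:
    model: str
    benchmark: str
    score_pct: float
    harness: str  # STANDALONE, or the name of the scaffold that wrapped the model

def scaffolding_lift(
    results: list[BenchmarkResult], model: str, benchmark: str
) -> Optional[float]:
    """Points gained by the best scaffolded run over the best standalone run.

    Returns None when either kind of run is missing, so a lone vendor
    number cannot be mistaken for a like-for-like comparison.
    """
    runs = [r for r in results if r.model == model and r.benchmark == benchmark]
    bare = [r.score_pct for r in runs if r.harness == STANDALONE]
    wrapped = [r.score_pct for r in runs if r.harness != STANDALONE]
    if not bare or not wrapped:
        return None
    return max(wrapped) - max(bare)

# Placeholder names; the scores are the article's 81% vs. 69% example.
results = [
    BenchmarkResult("model-x", "coding-bench", 81.0, "tuned-agent-harness"),
    BenchmarkResult("model-x", "coding-bench", 69.0, STANDALONE),
]

lift = scaffolding_lift(results, "model-x", "coding-bench")
print(f"scaffolding lift: {lift:.1f} points")
```

When a slide quotes a single number, the useful question is which value of `harness` belongs next to it.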

SWE-bench Verified Hits the Wall

Meanwhile the industry's favorite coding leaderboard ran out of road. The top of SWE-bench Verified is now an 80% cluster: Claude Opus 4.5 and 4.6 at 80.9, GPT-5.2 at 80.0. Less than a point separates the frontier, and a gap that small is inside the noise. When every contender posts effectively the same score, the leaderboard has stopped distinguishing anything. The endgame is visible from here.
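A rough back-of-the-envelope check on "inside the noise". SWE-bench Verified has 500 tasks. If each task is treated as an independent pass/fail trial (a simplifying assumption, and it ignores run-to-run variance from sampling and harness differences), the binomial standard error gives a sense of how much a single reported score can wobble.

```python
import math

N_TASKS = 500  # size of the SWE-bench Verified task set

def ci_half_width_pts(score_pct: float, n_tasks: int = N_TASKS, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width, in percentage points.

    Uses the normal approximation to the binomial and assumes
    independent tasks, which is a simplification.
    """
    p = score_pct / 100.0
    return z * math.sqrt(p * (1.0 - p) / n_tasks) * 100.0

top, runner_up = 80.9, 80.0  # scores quoted in the article
gap = top - runner_up
half_width = ci_half_width_pts(top)

print(f"gap between scores: {gap:.1f} points")
print(f"95% half-width on one score: +/-{half_width:.1f} points")
print("gap inside the noise:", gap < half_width)
```

Under these assumptions the uncertainty on one score is roughly three and a half points either way, several times the 0.9-point gap between the leaders.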

"Whoever controls the eval controls the narrative. The only eval you control is your own."

What This Means If You Buy AI

When the index top drops from roughly 73 to 50 or below overnight, the model on your vendor's "best model" slide did not get worse. The ruler got honest. That should change how you read every capability claim that crosses your desk.

The lesson of January is that benchmark numbers are downstream of benchmark design, and benchmark design is controlled by someone who is not you. Artificial Analysis moved the industry's default ruler toward real work, which is progress. But the only eval that maps to your P&L is one built from your deliverables, graded by your standards. Saturated exams flattered everyone. Work-shaped evals flatter no one, and your workflows are the most work-shaped eval there is.
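What "built from your deliverables, graded by your standards" can look like in practice: a minimal sketch of an in-house eval loop. Everything here is hypothetical. `stub_model` stands in for whatever model call you actually use, and the three graders are toy examples of house standards, not a recommended rubric.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    prompt: str                   # a real deliverable request from your workflow
    grade: Callable[[str], bool]  # your standard: does this output pass?

def run_eval(model: Callable[[str], str], tasks: list[Task]) -> dict:
    """Run every task through the model and apply its own grader."""
    passed = [t.name for t in tasks if t.grade(model(t.prompt))]
    return {"pass_rate": len(passed) / len(tasks), "passed": passed}

def stub_model(prompt: str) -> str:
    """Placeholder for a real model call; returns a fixed reply."""
    return "Refund approved. Ticket #123 closed."

# Toy graders standing in for your own acceptance criteria.
tasks = [
    Task("mentions_refund", "Resolve this billing ticket.",
         lambda out: "refund" in out.lower()),
    Task("under_20_words", "Resolve this billing ticket.",
         lambda out: len(out.split()) < 20),
    Task("cites_policy", "Resolve this billing ticket.",
         lambda out: "policy" in out.lower()),
]

report = run_eval(stub_model, tasks)
print(f"pass rate: {report['pass_rate']:.0%}")
print("passed:", report["passed"])
```

The graders are the part that matters. They encode what counts as acceptable work in your shop, which no public leaderboard can know.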

Own your ruler.

Update, June 2026: February killed SWE-bench Verified outright. The February edition covers the funeral.

Sources
1. VentureBeat, "Artificial Analysis overhauls its AI Intelligence Index," January 2026. On the v4.0 release, the ten constituent evals, and the score reset.
2. Artificial Analysis, Intelligence Index v4.0 methodology pages. Eval composition: GDPval-AA, Terminal-Bench Hard, tau²-Bench Telecom, SciCode, AA-LCR, AA-Omniscience, IFBench, HLE, GPQA Diamond, CritPt.
3. CodeSOTA, coverage of the scaffolding and contamination debate, January 2026. On scaffolded vs. standalone scores (81% vs. 69%) and the SWE-bench Verified 80% cluster.
John Tan

Fractional Chief of AI at nativefirst.ai. Former YC CEO (Depict). Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.