TL;DR
  • The Intelligence Index is the industry's single number for model capability, computed by Artificial Analysis, an independent benchmarking firm, from roughly 10 component evals. It is the most cited answer to "which model is smartest."
  • Version 4 rebuilt it around real work. On January 6, 2026, AA retired saturated exam benchmarks and added work-shaped evals like GDPval-AA and Terminal-Bench Hard. The top score fell from ~73 to 50 or below overnight. The ruler got honest.
  • It still cannot rank models for your company. Claude Fable 5 leads the index at 65 while regressing on code-review precision. Composites hide task shape, and no index includes your workflows.

The Artificial Analysis Intelligence Index is an independent composite score of frontier AI model capability, computed by Artificial Analysis, a third-party AI benchmarking firm, from roughly 10 component evaluations. When a headline says a new model is "the smartest in the world," this is almost always the number behind it. It is the industry's most cited single figure for how capable a model is, and unlike a vendor's launch slide, it is measured by people who do not sell models.

If you run a company, you will see this number in every model launch from now on. Here is what is inside it, how to read it, where it stands in June 2026, and the three things it cannot tell you.

What's inside the index

The current version, v4.0, shipped on January 6, 2026, and was rebuilt around one idea: stop scoring models on exams, start scoring them on work. The 10 components cluster into three groups:
  • Real-work evals: GDPval-AA (professional deliverables), Terminal-Bench Hard (agentic terminal tasks), and tau²-Bench Telecom (customer-service agents).
  • Code and science: SciCode, CritPt, and GPQA Diamond.
  • Reasoning and reliability: Humanity's Last Exam, AA-LCR (long-context reasoning), AA-Omniscience (knowledge and hallucination), and IFBench (instruction following).

Fig. 1: What's inside the index. Ten components, three clusters. The exams everyone trained against are gone.

Just as important is what got cut. MMLU-Pro, AIME 2025, and LiveCodeBench were retired as saturated: frontier models had effectively maxed them out, so they no longer separated good from great. A benchmark that everyone aces is a participation trophy.
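To make "saturated" concrete, here is a minimal Python sketch. Every score in it is invented for illustration and none of it is AA data. It compares two simple measures for each benchmark: the spread between the best and worst model, and the headroom left above the best model. When both are small, the benchmark has stopped separating good from great.

```python
# Hypothetical percent-correct scores for four frontier models.
# These values are invented for illustration; they are not AA results.
scores = {
    "saturated_exam": [96.0, 95.0, 97.0, 96.0],
    "work_shaped_eval": [48.0, 35.0, 41.0, 22.0],
}


def spread(values):
    """Gap between the best and worst model: a crude measure of discriminating power."""
    return max(values) - min(values)


def headroom(values, ceiling=100.0):
    """Distance between the best model and the maximum possible score."""
    return ceiling - max(values)


for name, values in scores.items():
    print(f"{name}: spread={spread(values):.1f}, headroom={headroom(values):.1f}")
```

In this toy data the exam leaves almost no room between models or above them, while the work-shaped eval leaves plenty of both. That is the property the v4 swap was after.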

How to read it: the v4 reset

The v4 launch did something benchmarks almost never do. Scores went down. The top of the leaderboard fell from roughly 73 to 50 or below overnight. The models did not get worse in January. The ruler got honest. By swapping saturated exams for harder, work-shaped evals, AA reset the scale so the index could keep discriminating between frontier models for years instead of months.

The independence is the other thing to understand. Artificial Analysis runs every model through its own harness, called Stirrup, at its own cost. A full suite run costs anywhere from a few hundred dollars to well over a thousand per model in inference alone: about $395 for Grok 4.3, and $1,800 or more for an Opus-class model. That spend buys the one thing vendor benchmarks cannot offer: nobody is grading their own homework.

The index, June 2026

Here is where the frontier stands as of June 9, 2026, day one of the Claude Fable 5 release.

Fig. 2: The index, June 2026. Claude Fable 5: 65. Claude Opus 4.8: 61. GPT-5.5 (xhigh): 60. Grok 4.3: 53. (Chart axis starts at 40; v4 rescaled the whole index.) A 4-point gap at the top of a 10-eval composite is a real lead, not noise.

Claude Fable 5 takes the top spot at 65, 4 points clear of its stablemate Claude Opus 4.8 at 61, with GPT-5.5 at its xhigh reasoning setting right behind at 60. Grok 4.3 sits at 53 from its May run. One detail worth knowing: AA scored Fable 5 on the shipping configuration, which falls back to Opus on some tasks, not on the raw Mythos model underneath. The 65 is the product you can buy, not a lab artifact. That is the right way to test, and it is why the number is worth quoting.

What it does not tell you

Three limits before you put the index in a board deck.

A composite hides task shape. Averaging 10 evals into one number flattens the profile that actually matters for deployment. Fable 5 leads the overall index while regressing on code-review precision against Opus 4.8, per CodeRabbit's independent day-one testing. Same week, same model: best composite, worse at one specific job. The single number cannot show you that.
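A small Python sketch shows how this happens. The per-eval scores below are invented, the eval names are placeholders, and the equal-weight average is an assumption made for the example rather than AA's published weighting. The point is only that the model with the higher composite can still lose on one task.

```python
# Hypothetical per-eval profiles for two models. Invented for illustration only;
# equal weighting is an assumption, not AA's documented method.
profiles = {
    "model_a": {"agentic_tasks": 72, "science": 70, "long_context": 68, "code_review": 50},
    "model_b": {"agentic_tasks": 62, "science": 62, "long_context": 60, "code_review": 60},
}


def composite(profile):
    """Equal-weight average of the per-eval scores."""
    return sum(profile.values()) / len(profile)


def best_per_task(all_profiles):
    """For each eval, name the model with the highest score on it."""
    tasks = next(iter(all_profiles.values())).keys()
    return {
        task: max(all_profiles, key=lambda name: all_profiles[name][task])
        for task in tasks
    }


for name, profile in profiles.items():
    print(name, "composite:", composite(profile))
print("winner per task:", best_per_task(profiles))
```

Here model_a wins the composite and three of four evals, yet model_b is the better pick if code review is the job you are hiring for.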

Price-capability is a curve, not a ranking. The index tells you who is on top. It does not tell you that the model 4 points down may cost a fraction as much per token, which for high-volume workflows is the decision. AA publishes the cost data; the headline number ignores it.
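One way to read price against capability is as a Pareto frontier: keep every model that no other model beats on both score and price. The sketch below is a minimal Python version. The model names, scores, and prices are all hypothetical placeholders, not real leaderboard entries or list prices.

```python
# Hypothetical models: scores and prices (USD per million tokens) are invented
# placeholders, not AA results or vendor list prices.
models = [
    {"name": "A", "score": 65, "price": 30.0},
    {"name": "B", "score": 61, "price": 12.0},
    {"name": "C", "score": 58, "price": 18.0},
    {"name": "D", "score": 50, "price": 2.0},
]


def pareto_frontier(candidates):
    """Return names of models not dominated on (higher score, lower price)."""
    frontier = []
    for m in candidates:
        dominated = any(
            o["score"] >= m["score"]
            and o["price"] <= m["price"]
            and (o["score"] > m["score"] or o["price"] < m["price"])
            for o in candidates
        )
        if not dominated:
            frontier.append(m["name"])
    return frontier


print(pareto_frontier(models))
```

In this toy data, C drops out because B scores higher and costs less. A, B, and D all stay on the curve, and which one is right depends on volume and on how much the extra points are worth to the workflow.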

No index includes your workflows. The components are real work, but they are someone else's real work: telecom support scripts, physics problems, standardized deliverables. Your contract review, your tender pipeline, your support queue appear in no benchmark anywhere.

What a CEO should take from this

Use the index for what it is good at: a trustworthy, independent answer to "is the frontier still moving" (yes: 50 to 65 in five months) and "is any vendor's launch claim inflated" (check it against AA before repeating it). Do not use it to pick the model for a specific function. That decision needs the model measured against your actual tasks, with your actual data, by someone who will be accountable for the output. The leaderboard is the weather report. Your evals are the forecast for your own postcode, and nobody publishes those for free.
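As a rough illustration of what an index of your own work could look like, the Python sketch below weights pass rates on a company's own workflows by how much each workflow matters. The workflow names, weights, case counts, and pass counts are all hypothetical. A real evaluation also needs graded test cases and a human who signs off on what counts as a pass.

```python
# Hypothetical internal evals: weights reflect business importance and sum to 1.
# All names and numbers are placeholders for illustration.
workflows = {
    "contract_review": {"weight": 0.5, "cases": 40},
    "tender_pipeline": {"weight": 0.3, "cases": 20},
    "support_queue": {"weight": 0.2, "cases": 50},
}

# Number of test cases each model passed per workflow (invented).
passes = {
    "model_a": {"contract_review": 28, "tender_pipeline": 18, "support_queue": 45},
    "model_b": {"contract_review": 36, "tender_pipeline": 15, "support_queue": 40},
}


def workflow_index(model_passes):
    """Weighted pass rate across workflows, on a 0-100 scale."""
    total = sum(
        spec["weight"] * model_passes[name] / spec["cases"]
        for name, spec in workflows.items()
    )
    return 100 * total


for model, model_passes in passes.items():
    print(model, round(workflow_index(model_passes), 1))
```

In this invented data model_b comes out ahead because it is stronger on the workflow that carries the most weight, whatever a public composite might say about the two.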

Composite scores rank models. Your workflows hire them.

Sources
1. Artificial Analysis, Intelligence Index v4 methodology and leaderboard. 10 components, Stirrup harness, run-cost data (~$395 for Grok 4.3, $1,800+ for Opus-class).
2. VentureBeat, January 2026. Coverage of the v4 overhaul: top scores fell from ~73 to 50 or below as saturated exams (MMLU-Pro, AIME 2025, LiveCodeBench) were retired for work-shaped evals.
3. Artificial Analysis, June 2026 day-one results: Claude Fable 5 (65), Claude Opus 4.8 (61), GPT-5.5 xhigh (60), Grok 4.3 (53, May run). Fable 5 tested on the shipping config with Opus fallback. Announced alongside Anthropic's Fable 5 release, June 9, 2026.
4. CodeRabbit, June 2026. Independent day-one testing showing Fable 5 regression on code-review precision versus Opus 4.8.
John Tan

Fractional Chief of AI at nativefirst.ai. Former YC CEO (Depict). Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.