For two years, founders and CEOs held back on committing to AI with a sensible rationale: the models aren't good enough yet. We'll wait until they can reliably handle our actual work. We'll move when the technology is ready.

The technology is ready.

In October 2025, OpenAI published GDPVal, a benchmark evaluating AI models on real-world economically valuable tasks across 44 occupations in the top 9 sectors contributing to US GDP. Tasks were built by industry professionals with an average of 14 years of experience. The finding: frontier models are approaching industry expert quality, improving roughly linearly over time, and can already perform many of these tasks cheaper and faster than unaided human experts when paired with the right setup.

The question is no longer whether the model can do the work. The question is whether your company is set up to let it.

What GDPVal Actually Measured

Most AI benchmarks test reasoning, coding ability, or factual recall. These are things that are easy to measure but don't map directly to economic value. GDPVal tested something different: what professionals actually do at work. Real deliverables in real job functions, from finance to legal to operations to software engineering. Not academic exercises. Professional work activities, assessed by the people who do them.

The scope of the study makes its findings hard to dismiss.

GDPVal at a glance
44 occupations
Across software engineering, legal, finance, operations, sales, and more. Not academic benchmarks. Professional work activities.
9 GDP sectors
The top contributors to US GDP. Tasks built by 14-year-average-experience industry professionals.
Approaching expert quality
Frontier models now match or near-match human experts on deliverable quality for most tested tasks.
Improving linearly
Performance is not plateauing. Each model generation pushes closer. The trend is not slowing.
GDPVal: AI Performance vs. Human Expert Baseline
Software Engineering
91%
Legal & Compliance
86%
Finance & Accounting
84%
Sales Operations
82%
Customer Support
79%

Source: GDPVal benchmark (arXiv:2510.04374). Scores reflect deliverable quality ratings vs. 14-year-average-experience professionals.

The Three Unlocks

GDPVal also identified the three factors that determine whether a frontier model achieves expert-level performance on a given task. Not which model you use. These three factors:

1. Increased reasoning effort

Giving the model room to think through the problem before answering. This is about how you call the model, not which model you use. Extended thinking, chain-of-thought prompting, structured problem decomposition before output. Most companies never configure this. They call the model with a system prompt and a user message and expect expert output.

2. Expanded task context

Giving the model access to the relevant information it needs to do the work well. Your CRM data. Your internal documentation. Your live system state. The context that a human expert would have walking into the task. Without this, the model is answering questions about your business with no knowledge of your business. It is expert-quality reasoning applied to a blank canvas.

3. Improved scaffolding

The structure around the model call: the prompt architecture, the tool integrations, the exception handling, the output validation, the human escalation logic. Scaffolding is what turns a capable model into a reliable production system. Without it, you have a smart engine with no car around it.

Most companies have none of these three in place. They have a model. They do not have context, scaffolding, or a setup that enables reasoning effort. The model is sitting at a desk with nothing on it.

Most company AI setups

Model. No setup.

Context
Model called with minimal context, maybe a system prompt. No access to live internal data.
Tool integrations
No integrations connecting the model to real systems. Queries answered in isolation.
Exception handling
No exception handling or escalation logic. Edge cases fail silently or produce bad output.
Output validation
No defined output format or validation. Downstream actions proceed on unverified output.
What GDPVal performance requires

Context. Scaffolding. Oversight.

Context
Full task context routed from CRM, docs, and live system state via MCP. The model knows what a human expert would know.
Tool integrations
Integrations that let the agent read, write, and act on real data. Reasoning effort enabled via extended thinking and chain-of-thought prompting.
Exception handling
Escalation criteria defined: what the agent handles autonomously vs. routes to a human. Edge cases surface rather than propagate.
Output validation
Output schema validated before any downstream action is taken. The system fails loudly, not silently.

When Paired with Human Oversight

The GDPVal finding is specific: "when paired with human oversight, frontier models can perform these tasks cheaper and faster than unaided experts." That "when paired with" clause is not a footnote. It is the whole job.

Human oversight in a production AI deployment is not a person watching every output. It is the architecture: defined escalation criteria, monitored output quality, exception handling that surfaces edge cases for human review. This architecture does not appear by default. It has to be designed, built, and maintained. That is what an operator does.

The model is the engine. The operator installs the car around it.

Companies that skip this step get a capable model producing inconsistent output with no way to catch the failures. They conclude the model isn't good enough. The model is good enough. The oversight layer is missing.

What This Means for Your Timeline

Companies that were waiting for the model to get good enough are now waiting for themselves to get ready. That is a different problem, and it has a different solution.

The model will continue improving on its own. Your context layer, your scaffolding, your permission architecture, your data instrumentation: none of those will improve on their own. They require someone inside your systems, building.

Every month you spend not installing the context and scaffolding layer is a month the model is expert-level and you're not using it. The capability is there. The setup is not. Those are not equivalent problems.

The companies moving now are not moving because they have better models. They have the same models as everyone else. They are moving because they have someone who built the setup.

The First Build

The First Build is 2 weeks and ships one Level-3 agent to your live production environment. Here is what those two weeks produce:

Week 1
Context layer installed

MCP server configured and connected to your relevant data sources. Task context routed to the agent from live systems, not a staging export. The model now knows what your team knows.

Week 1–2
Scaffolding built

Prompt architecture, tool integrations, exception handling, and escalation logic built around the real edge cases your data actually produces. Not a generic template. Your system.

Week 2
Production agent running

Output validation in place. One agent running in production, closing a real operational loop. Not a demo. The actual workflow, running live.

That is the three GDPVal unlocks, installed: reasoning effort configured, task context routed, scaffolding in place. The model was already ready. Now the setup is too.

The setup is the missing piece.

Book a free Diagnostic: 30–45 minutes, no deck, no pitch. We'll tell you which function your current data can support right now, what the context and scaffolding layer needs to look like, and what it takes to get from zero to a production agent in 2 weeks.

Book the Diagnostic →
Sources
3OpenAI, "Measuring the performance of our models on real-world tasks", OpenAI Blog, 2026. The public-facing overview of GDPVal findings.
John Tan
John Tan

Fractional AI & Product Founder at nativefirst.ai. Ex-CEO, Depict (Y Combinator). Embeds on-site with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.