For two years, founders and CEOs held back on committing to AI with a sensible rationale: the models aren't good enough yet. We'll wait until they can reliably handle our actual work. We'll move when the technology is ready.
The technology is ready.
In October 2025, OpenAI published GDPval, a benchmark evaluating AI models on real-world economically valuable tasks across 44 occupations in the top 9 sectors contributing to US GDP. Tasks were built by industry professionals with an average of 14 years of experience. The finding: frontier models are approaching industry expert quality and improving roughly linearly over time. The paper also analyzes how, when paired with human oversight, they could perform these tasks cheaper and faster than unaided human experts.
The question is no longer whether the model can do the work. The question is whether your company is set up to let it.
What GDPval Actually Measured
Most AI benchmarks test reasoning, coding ability, or factual recall. These are easy to measure but don't map directly to economic value. GDPval tested something different: what professionals actually do at work. Real deliverables in real job functions, from finance to legal to operations to software engineering. Not academic exercises. Professional work activities, assessed by the people who do them.
The scope of the study makes its findings hard to dismiss.
Source: GDPval benchmark (arXiv:2510.04374). Scores reflect deliverable quality ratings against professionals averaging 14 years of experience.
The Three Unlocks
GDPval also identified three factors that improve frontier model performance on these tasks. None of them is about which model you pick:
1. Increased reasoning effort
Giving the model room to think through the problem before answering. This is about how you call the model, not which model you use. Extended thinking, chain-of-thought prompting, structured problem decomposition before output. Most companies never configure this. They call the model with a system prompt and a user message and expect expert output.
2. Expanded task context
Giving the model access to the relevant information it needs to do the work well. Your CRM data. Your internal documentation. Your live system state. The context that a human expert would have walking into the task. Without this, the model is answering questions about your business with no knowledge of your business. It is expert-quality reasoning applied to a blank canvas.
3. Improved scaffolding
The structure around the model call: the prompt architecture, the tool integrations, the exception handling, the output validation, the human escalation logic. Scaffolding is what turns a capable model into a reliable production system. Without it, you have a smart engine with no car around it.
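To make the three unlocks concrete, here is a minimal Python sketch of how they might fit together around a single model call. Everything named here is hypothetical and for illustration only: `ModelRequest`, `build_prompt`, `run_with_scaffolding`, and the `reasoning_effort` field are not taken from GDPval or from any vendor SDK, and `fake_model` stands in for whatever client your provider actually offers.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ModelRequest:
    """One task for the model (hypothetical shape, not a vendor API)."""
    task: str
    reasoning_effort: str = "high"  # unlock 1: room to think before answering
    context: dict[str, str] = field(default_factory=dict)  # unlock 2: task context


def build_prompt(request: ModelRequest) -> str:
    """Put the business context in front of the task itself."""
    sections = [f"## {name}\n{body}" for name, body in request.context.items()]
    sections.append(f"## Task\n{request.task}")
    return "\n\n".join(sections)


def run_with_scaffolding(
    request: ModelRequest,
    call_model: Callable[[str, str], str],
    validate: Callable[[str], Optional[str]],
    max_attempts: int = 2,
) -> dict:
    """Unlock 3: validate every output, retry on failure, escalate when retries run out."""
    prompt = build_prompt(request)
    last_problem = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        try:
            output = call_model(prompt, request.reasoning_effort)
        except Exception as exc:  # exception handling: a failed call is not a result
            last_problem = f"model call failed: {exc}"
            continue
        problem = validate(output)  # None means the output passed validation
        if problem is None:
            return {"status": "ok", "output": output, "attempts": attempt}
        last_problem = problem
    return {"status": "escalate", "reason": last_problem, "attempts": max_attempts}


def fake_model(prompt: str, reasoning_effort: str) -> str:
    """Stand-in for a real model client (assumption for this sketch)."""
    return "Refund approved for order 1042"


def must_cite_order(output: str) -> Optional[str]:
    """Toy validation rule: the answer has to reference an order."""
    return None if "order" in output else "missing order reference"


request = ModelRequest(
    task="Decide on the refund request.",
    context={"CRM record": "Order 1042, delivered 9 days late"},
)
result = run_with_scaffolding(request, fake_model, must_cite_order)
print(result["status"])  # prints "ok"
```

The point of the sketch is the shape, not the details: the model call is one line, and everything else is setup. In a real deployment the validation rule and the escalation path would come from your own edge cases.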
Most companies have none of these three in place. They have a model. They do not have context, scaffolding, or a setup that enables reasoning effort. The model is sitting at a desk with nothing on it.
[Figure: "Model. No setup." contrasted with "Context. Scaffolding. Oversight."]
When Paired with Human Oversight
The GDPval finding is specific: the paper analyzes the potential for frontier models, "when paired with human oversight," to perform these tasks cheaper and faster than unaided experts. That "when paired with" clause is not a footnote. It is the whole job.
Human oversight in a production AI deployment is not a person watching every output. It is the architecture: defined escalation criteria, monitored output quality, exception handling that surfaces edge cases for human review. This architecture does not appear by default. It has to be designed, built, and maintained. That is what an operator does.
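As a rough illustration of what "defined escalation criteria" can look like in code, the sketch below routes each agent output to auto-approval, a sampled audit, or human review. The thresholds, the `confidence` and `amount` fields, and the names `OversightPolicy` and `route` are all assumptions made for this example. They do not come from GDPval, and real criteria would be set per workflow.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class OversightPolicy:
    """Escalation criteria (all values are illustrative assumptions)."""
    min_confidence: float = 0.8      # below this score, a human reviews the output
    max_amount: float = 500.0        # hypothetical business limit for auto-approval
    audit_sample_rate: float = 0.05  # share of passing outputs spot-checked for quality


def route(
    output: dict,
    policy: OversightPolicy,
    rng: Callable[[], float] = random.random,
) -> str:
    """Decide where one agent output goes next."""
    if output.get("error") is not None:
        return "human_review"  # exceptions always surface to a person
    if output.get("confidence", 0.0) < policy.min_confidence:
        return "human_review"  # missing or low confidence is treated as an edge case
    if output.get("amount", float("inf")) > policy.max_amount:
        return "human_review"  # missing or high-stakes amounts are escalated
    if rng() < policy.audit_sample_rate:
        return "auto_approve_with_audit"  # monitored output quality via sampling
    return "auto_approve"


policy = OversightPolicy()
print(route({"confidence": 0.95, "amount": 120.0}, policy, rng=lambda: 0.5))  # prints "auto_approve"
print(route({"confidence": 0.55, "amount": 120.0}, policy))  # prints "human_review"
```

Nobody is watching every output here. People see only the failures, the high-stakes cases, and a small audit sample, which is the sense in which oversight is an architecture and not a person.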
The model is the engine. The operator installs the car around it.
Companies that skip this step get a capable model producing inconsistent output with no way to catch the failures. They conclude the model isn't good enough. The model is good enough. The oversight layer is missing.
What This Means for Your Timeline
Companies that were waiting for the model to get good enough are now waiting for themselves to get ready. That is a different problem, and it has a different solution.
The model will continue improving on its own. Your context layer, your scaffolding, your permission architecture, your data instrumentation: none of those will improve on their own. They require someone inside your systems, building.
Every month you spend not installing the context and scaffolding layer is a month the model is expert-level and you're not using it. The capability is there. The setup is not. Those are not equivalent problems.
The companies moving now are not moving because they have better models. They have the same models as everyone else. They are moving because they have someone who built the setup.
The First Build
The First Build takes 2 weeks and ships one Level-3 agent to your live production environment. Here is what those two weeks produce:
MCP server configured and connected to your relevant data sources. Task context routed to the agent from live systems, not a staging export. The model now knows what your team knows.
Prompt architecture, tool integrations, exception handling, and escalation logic built around the real edge cases your data actually produces. Not a generic template. Your system.
Output validation in place. One agent running in production, closing a real operational loop. Not a demo. The actual workflow, running live.
Those are the three GDPval unlocks, installed: reasoning effort configured, task context routed, scaffolding in place. The model was already ready. Now the setup is too.
The setup is the missing piece.
Book a free Diagnostic: 30–45 minutes, no deck, no pitch. We'll tell you which function your current data can support right now, what the context and scaffolding layer needs to look like, and what it takes to get from zero to a production agent in 2 weeks.
Book the Diagnostic →