The Model Got Better. Your Workflows Are Still Broken.

Andrej Karpathy spent December 2024 vibe coding. The model outputs stopped needing correction. He kept asking for more. The chunks came out fine. He stopped correcting. At Sequoia's AI Ascent 2026 he described the moment directly: "I tried to stress this on X because I think a lot of people experienced AI last year as a ChatGPT-adjacent thing — but you really had to look again, and you had to look as of December, because things changed fundamentally."

Most companies did not look again. They experienced ChatGPT as a better search engine in 2023, evaluated copilots in 2024, and are now waiting for the right moment to start building. The right moment arrived 18 months ago. Each new model release since then has given the same companies a new reason to wait.

The argument below is not that better models are irrelevant. They are not. It is that better models without architecture are a faster path to the same broken outcome. The bottleneck has never been capability. It has been assembly.

Eighteen Months of Releases. Zero New Excuses.

A new frontier model drops every 3 to 6 months. Each one is objectively better than the last. Each one also provides a fresh reason to defer. The companies that started building in mid-2024 have shipped through four model generations. Each release made their workflows faster and cheaper, automatically, because they had a system to upgrade into. The companies still evaluating have a more capable model to wait for. The pattern does not resolve. There is always a next release.

Major model releases: March 2023 to June 2026

Mar 2023: GPT-4
Mar 2024: Claude 3
May 2024: GPT-4o
Jun 2024: Claude 3.5
Sep 2024: o1
Feb 2025: Claude 3.7
May 2025: Claude 4
Apr 2026: GPT-4.1
Jun 2026: Claude Opus 4.8

18 months of models. How many workflows did you ship?

The Math Nobody Wants to Run

Here is the argument that should end the "better models will fix it" framing once and for all.

Most production AI workflows are not single-step tasks. They are chains: classify the input, retrieve context, draft a response, validate the output, route to the right system, log the outcome. A modest pipeline has 8 to 12 steps. Each step has a reliability rate. The final workflow's success rate is the product of all of them.
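The product rule in the paragraph above can be sketched in a few lines. This is a minimal illustration, assuming steps fail independently of one another (real pipelines often have correlated failures); the function name `pipeline_success` is made up for the example.

```python
import math

def pipeline_success(step_rates):
    """End-to-end success rate, assuming each step fails independently."""
    return math.prod(step_rates)

# Ten steps at a uniform per-step reliability.
for rate in (0.80, 0.85, 0.90):
    end_to_end = pipeline_success([rate] * 10)
    print(f"{rate:.0%} per step -> {end_to_end:.1%} end to end")
```

The same function accepts uneven rates, for example `pipeline_success([0.99, 0.95, 0.80])`, which is closer to what a real chain looks like: one weak step drags the whole product down.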

Run the numbers:

Today's average: ~20%

0.80¹⁰ = 0.107 → ~11%
0.85¹⁰ = 0.197 → ~20%

At 85% per step, a 10-step workflow succeeds roughly 1 run in 5. This is the range most untuned agent pipelines sit in today.

Next model: ~35%

0.90¹⁰ = 0.349 → ~35%

The next model pushes per-step accuracy to 90%. The pipeline now succeeds 1 run in 3. Better. Still fails twice in three runs.

Architecture fix: 85%+

Tighter steps + exception handling

Redesign for 5 well-scoped steps with explicit exception paths. At 95% each: 0.95⁵ = 0.774 → ~77%. Add a validation loop: 85%+.

A model going from 85% to 90% per step lifts your success rate from 20% to 35%: better, but not even double.
Redesigning the step count and exception architecture takes you to 85%+.
Model improvements give diminishing returns on multi-step reliability. Architecture is the lever.
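To make the comparison concrete, here is a rough sketch of the two levers. The validation-loop model is an assumption made for illustration, not a measured result: a validator catches a failed step with probability `catch_rate` (0.5 here, chosen arbitrarily) and the step is retried once. The helper names are hypothetical.

```python
def with_validation(p: float, catch_rate: float, retries: int = 1) -> float:
    """Per-step success when a failed attempt is caught with probability
    `catch_rate` and retried, up to `retries` times."""
    success = p
    reach_next = 1.0  # probability of getting as far as the next retry
    for _ in range(retries):
        reach_next *= (1 - p) * catch_rate
        success += reach_next * p
    return success

def required_step_rate(target: float, n_steps: int) -> float:
    """Uniform per-step rate needed to hit `target` end to end."""
    return target ** (1 / n_steps)

model_upgrade = 0.90 ** 10                                 # same 10 steps, better model
redesigned = 0.95 ** 5                                     # 5 tighter steps
validated = with_validation(0.95, catch_rate=0.5) ** 5     # plus one validated retry

print(f"model upgrade: {model_upgrade:.1%}")
print(f"redesigned:    {redesigned:.1%}")
print(f"validated:     {validated:.1%}")
print(f"per-step rate needed for 85% over 10 steps: {required_step_rate(0.85, 10):.1%}")
```

The last line is the uncomfortable one: without changing the step count, every one of the ten steps would need to succeed more than 98% of the time to reach 85% end to end.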

Andrej Karpathy drew the line between two modes of working in From Vibe Coding to Agentic Engineering: vibe coding is prompting without architecture, agentic engineering is deliberate step design, tool integrations, and exception handling. The model capability is identical in both cases. The results are not.

Anthropic's SWE-bench scores moved from 49% with Claude 3.5 Sonnet to 80%+ with Claude Opus 4.8. That is a real capability jump. But the companies that went from 20% to 85% workflow success rates did not do it by waiting for that jump. They did it by redesigning their exception architecture and tightening their step design. The model improvement is a multiplier on a well-built pipeline. On a badly built one, it is still a disaster, just a slightly less frequent one.

Assembly Is the Bottleneck. Not Intelligence.

Intelligence alone doesn't drive outcomes. Integration into existing systems does. Most models are already superhuman at reasoning. Assembly is the bottleneck.

AGI requires assembly. You can have a model that writes better than any human, reasons better than any analyst, codes better than most engineers, and still have no production workflows, because nobody connected it to live systems, designed the exception paths, or built the feedback loops that close the operational loop.

Aaron Levie, CEO of Box, put it directly in a 20VC conversation: "Half your data state is not even ready... the other half is fragmented because you have two decades of employees bringing their own tools." And the corollary: "The second a new model drops your workflow probably breaks." A model upgrade does not reassemble itself into your systems. If the workflows are brittle, built without exception handling or model-agnostic abstraction, an upgrade is as likely to break them as improve them. Companies that built properly in 2024 swap a model in a week. Companies that were waiting start from scratch. Again.
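One way to read "model-agnostic abstraction" is that the workflow depends on a small interface rather than on a vendor client. The sketch below is an assumption about what that might look like; `TextModel`, `TicketWorkflow`, and the two stand-in models are hypothetical and call no real API.

```python
from dataclasses import dataclass
from typing import Protocol

class TextModel(Protocol):
    """The only thing the workflow knows about a model."""
    def complete(self, prompt: str) -> str: ...

@dataclass
class StubModelV1:
    """Stand-in for a vendor client (hypothetical, no network calls)."""
    name: str = "model-v1"

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"

@dataclass
class StubModelV2:
    """A 'newer' stand-in with the same interface."""
    name: str = "model-v2"

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt.strip()}"

@dataclass
class TicketWorkflow:
    model: TextModel  # the model is one parameter of the running system

    def draft_reply(self, ticket: str) -> str:
        return self.model.complete(f"Draft a reply to: {ticket}")

workflow = TicketWorkflow(model=StubModelV1())
print(workflow.draft_reply("refund request"))

workflow.model = StubModelV2()  # the swap: nothing else in the workflow changes
print(workflow.draft_reply("refund request"))
```

In practice each stand-in would wrap a real client behind the same `complete` method, so prompts, exception paths, and integrations stay untouched when the model changes.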

Fig. 1: Two release histories. The models kept their side of the bargain.

What Production-Grade AI Actually Looks Like

Delivery Hero built Herogen, an AI engineering agent architecture, and hit 85% ticket success rate with zero-to-one developer interactions per ticket. The capability gain did not come from a new model dropping. It came from the architecture: the exception design, the feedback loops, the integration into their live systems. When better models arrived, they slotted in. The plumbing was already there.

Browserbase built a single generalised agent called bb that runs across engineering, ops, sales, support, and exec. It produced a 10x output increase not because a new model arrived, but because they built the credential brokering layer, the permission architecture, the skill system, and the Slack integration. The model is the reasoning layer inside a larger constructed system. Replace the model, the system keeps running. That is the goal.

The key question

It's not "which model are you using?"

It's "what's blocking the first loop from closing?"

In 90% of Diagnostic calls, the answer is one of three things:

Data access. The model can't reach the live systems it needs to act on.
Exception design. Nobody has defined what the agent handles vs. what it escalates to a human.
Ownership gap. There's no one accountable for the live system after the build.

None of these get fixed by a better model. They get fixed by an operator who sits inside your systems and ships the architecture.
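Exception design, the second item above, usually comes down to writing the handle-versus-escalate rule explicitly. A minimal sketch, assuming a validator produces a confidence score and that some actions carry a monetary amount; the thresholds and names (`CONFIDENCE_FLOOR`, `AMOUNT_CEILING`, `route`) are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    AUTO = "handle automatically"
    HUMAN = "escalate to a human"

@dataclass
class StepResult:
    task: str
    confidence: float    # hypothetical validator score in [0, 1]
    amount: float = 0.0  # e.g. refund value; 0 when not applicable

# Illustrative thresholds; a real deployment would set these per workflow.
CONFIDENCE_FLOOR = 0.9
AMOUNT_CEILING = 500.0

def route(result: StepResult) -> Route:
    """Decide what the agent handles and what it hands to a person."""
    if result.confidence < CONFIDENCE_FLOOR:
        return Route.HUMAN  # the agent is unsure of its own output
    if result.amount > AMOUNT_CEILING:
        return Route.HUMAN  # too costly to get wrong unattended
    return Route.AUTO

print(route(StepResult("refund", confidence=0.97, amount=40.0)).value)
print(route(StepResult("refund", confidence=0.97, amount=2000.0)).value)
```

The point of writing it down is ownership: once the rule lives in code, someone has to decide the thresholds and answer for them.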

When the Next Model Drops

Waiting company

Starting from scratch. Again.

No architecture to upgrade. No pipeline to plug the new model into.
Same data access problems. Systems still not connected. Credentials still not provisioned.
Same team uncertainty. Who owns the build? Who defines the exceptions?
New model, same question: "Should we wait for the next one?"
18 more months of competitive gap. Now compounding.

Building company

Swap the model. Ship the upgrade.

Architecture already in place. Model is one parameter in a running system.
Data connections live. New model reads the same live systems immediately.
Exception paths already mapped. Upgrade improves performance, doesn't reset it.
Swap takes a week. Deploy, monitor, done.
Improvement compounds. Each model release makes existing workflows better, not more urgent.
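"Deploy, monitor, done" assumes there is something to measure the new model against. One possible shape for that check is a small regression gate over a fixed set of cases, sketched below with toy stand-ins for the models; `pass_rate` and `should_promote` are hypothetical names, and a real gate would use cases drawn from production traffic.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]
Case = Tuple[str, Callable[[str], bool]]  # (prompt, check on the output)

def pass_rate(model: Model, cases: Sequence[Case]) -> float:
    """Fraction of cases whose output passes its check."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)

def should_promote(current: Model, candidate: Model,
                   cases: Sequence[Case], tolerance: float = 0.0) -> bool:
    """Promote the candidate only if it does not regress on the fixed cases."""
    return pass_rate(candidate, cases) >= pass_rate(current, cases) - tolerance

# Toy stand-ins: the "new model" also trims whitespace.
current_model: Model = lambda prompt: prompt.upper()
candidate_model: Model = lambda prompt: prompt.strip().upper()

cases = [
    ("hi", lambda out: out == "HI"),
    (" hi ", lambda out: out == "HI"),
]

print(should_promote(current_model, candidate_model, cases))
```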

The Case for Building Now

The models have been good enough since mid-2024. Claude 3.5 was good enough to ship production agents. The teams that deployed on it have benefited from every improvement since, automatically, because they had a system to upgrade into. Claude Opus 4.8 shipped this week. The teams that built in 2024 can swap it in within a week. The teams still waiting have a new reason to wait. The pattern holds.

The companies still waiting do not have a model problem. They have an assembly problem. The bottleneck has never been reasoning capability. It has always been connecting the model to live data, designing exception paths, and having someone accountable for the running system. None of that changes when a new model drops.

If your answer to "what's blocking your AI deployment?" is "we're evaluating the model landscape", you are asking the wrong question. The model is not the variable. The architecture is.

The Diagnostic doesn't ask which model you're using.

Book a free Diagnostic: 30–45 minutes, no deck, no pitch. It asks what's blocking the first loop from closing and returns a 3-point read on your highest-leverage first build.


Sources

1. Andrej Karpathy, From Vibe Coding to Agentic Engineering, Sequoia AI Ascent, April 2026.

2. Aaron Levie (CEO, Box), on data fragmentation and model upgrade brittleness. @levie on X.

3. Model release timeline: GPT-4 (Mar 2023), Claude 3 (Mar 2024), GPT-4o (May 2024), Claude 3.5 (Jun 2024), o1 (Sep 2024), Claude 3.7 (Feb 2025), Claude 4 (May 2025), GPT-4.1 (Apr 2026), Claude Opus 4.8 (Jun 2026). Sources: OpenAI and Anthropic research pages.

John Tan

Founder and CEO of nativefirst.ai. Embeds with scaling founders and CEOs to ship Level-3 agents and AI workflows in production.