Can cheaper AI models replace expensive ones fast?

cheaper AI – As AI costs bite, a new bet is spreading: most workloads may move to much cheaper models within 12–18 months, even if the industry built its entire reputation on “bigger is better.” Early tests in legal AI suggest quality may hold—if companies design systems t
The assumption that powered the AI boom is starting to feel less like a law of nature and more like a choice companies stopped questioning early.
For years, “more powerful” has been the default path: larger models, better performance, bigger wins. But the bill for running those models doesn’t stay theoretical for long. Costs have already pushed users to look at smaller, cheaper options, less because they want to downgrade and more because they can’t ignore the math forever.
That’s where Coinbase co-founder Brian Armstrong’s prediction lands with a kind of blunt momentum. In a post on X, Armstrong wrote that while demand for intelligence is “near infinite,” “80% of workloads will be running on 99% cheaper models within 12-18 months,” with the remaining “20% of workloads” still handled by the latest-generation models “where IQ maxing is important.”
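To see what those percentages would mean for a total bill, here is a rough back-of-the-envelope sketch. It assumes (this is not in Armstrong's post) that the 80/20 split is measured as a share of today's frontier-model spend and that the remaining 20% keeps its current price; `blended_cost` is a hypothetical helper name.

```python
def blended_cost(cheap_share: float, cheap_discount: float) -> float:
    """Relative spend versus running everything on a frontier model (1.0 = unchanged).

    Assumption: shares are fractions of today's spend, and the frontier
    portion keeps its current price.
    """
    frontier_share = 1.0 - cheap_share
    cheap_unit_cost = 1.0 - cheap_discount  # a "99% cheaper" model costs 1% as much
    return frontier_share * 1.0 + cheap_share * cheap_unit_cost


relative = blended_cost(cheap_share=0.80, cheap_discount=0.99)
print(f"relative spend: {relative:.3f}")
print(f"savings: {1 - relative:.1%}")
```

Under these assumptions the bill falls to about 21% of its current level, and nearly all of what remains is the 20% of work still sent to frontier models.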
If that scenario plays out, it wouldn’t just change product roadmaps. It could hit the AI industry’s economics at the level where companies actually feel it: who pays for inference, and how much.
Most AI companies have competed on quality by defaulting to the most advanced model available. But the underlying bet Armstrong is making is different: he’s betting that many tasks can be handled by far cheaper models without changing the outcome. If that holds, spending would shift away from the “biggest model for everything” approach and toward routing work to smaller systems. And that matters, because much of those savings would come out of the pockets of the major labs, including OpenAI and Anthropic, as they head toward their IPOs.
The question that follows is painfully simple: are companies actually ready to switch?
Some early tests suggest they might be—if the system is arranged correctly. In a test by the legal AI tool Harvey, the company reduced inference costs by 3x without reducing quality. The trial, conducted in partnership with the inference platform Fireworks AI, combined Claude Opus with Fireworks’ GLM 5.1. Harvey then shifted Claude Opus to the most intensive tasks, while letting the cheaper model handle the rest. The reported result was a significantly lower load in terms of server time and overall cost.
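The article does not describe how Harvey decides which tasks go to which model, so the following is only a minimal sketch of the general idea: send the hardest tasks to a frontier model and everything else to a cheaper one. The difficulty scores, the threshold, the task names, and the per-task costs are all made-up placeholders, not Harvey's or Fireworks AI's figures.

```python
from dataclasses import dataclass

# Hypothetical per-task costs in arbitrary units; not real vendor pricing.
FRONTIER_COST = 1.00
CHEAP_COST = 0.10


@dataclass
class Task:
    name: str
    difficulty: float  # assumed 0..1 score from some upstream classifier


def route(task: Task, threshold: float = 0.7) -> str:
    """Return which model tier handles the task: only the hardest go to 'frontier'."""
    return "frontier" if task.difficulty >= threshold else "cheap"


def total_cost(tasks: list[Task], threshold: float = 0.7) -> float:
    """Sum the per-task cost after routing each task to a tier."""
    return sum(
        FRONTIER_COST if route(t, threshold) == "frontier" else CHEAP_COST
        for t in tasks
    )


tasks = [
    Task("clause lookup", 0.2),
    Task("citation check", 0.4),
    Task("summarize filing", 0.5),
    Task("draft novel argument", 0.9),
]

baseline = FRONTIER_COST * len(tasks)  # everything on the frontier model
routed = total_cost(tasks)
print(f"baseline={baseline:.2f} routed={routed:.2f}")
```

In a real deployment the hard part is the difficulty estimate and the quality checks behind it. The threshold here only stands in for that decision.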
Harvey co-founder Gabe Pereyra framed the point directly: “Quality comes first, and in legal it always will.” But he added that “the definition of quality is evolving from simply using the most powerful model for everything, to using the best model that gets the right answer most efficiently.”
That shift in definition is small on paper and enormous in practice. It turns “quality” from a blanket rule into a workflow decision—one that can be measured, optimized, and, eventually, demanded by budget-conscious teams.
The industry often debates this space as a fight between major labs and Chinese models, or between proprietary systems and open-weight ones. But the line that appears to matter more is the one drawn between large models and small ones. The source of savings is less about where a model comes from and more about how much compute it costs to run.
Even if the cheaper side is achieved through different approaches, whether it’s an independently served open-weight model or something deployed in-house, the economics hinge on the same trade: you can save money by swapping from GPT-5.5 to DeepSeek’s V4 Flash, and the math can also work when the conclusion is simply that “GPT-5.4-mini works just as well.”
There’s already pressure building between large labs running inference themselves and independently served open-weight models, with an active price war shaping what companies pay today. But for the larger question, small versus large, which category of provider wins isn’t the only issue. If the industry’s routing becomes smart enough, the victory could simply belong to the cheapest option that still hits quality.
And that’s where the uncomfortable tension sits. Using the smallest model that still works sounds obvious, but it runs against the scaling-first approach that has dominated the industry until now. Labs leaned hard into training and building the most compute-intensive models possible, pushing the frontier of what AI could do. With prices heavily subsidized by investors, clients had little reason to choose anything but the most advanced option.
Now that token prices are rising and subsidies are slowing down, users are finally feeling cost pressure in a way that changes behavior. What happens next isn’t guaranteed. Companies might choose smaller models, but they might also cope by making fewer calls, using less context, or walking away from the least promising deployments.
Still, if smaller models can genuinely deliver the same results for most real-world tasks, the ripple effects could be severe. Inference demand might not grow as expected. And even if it does, that growth might arrive with a new set of questions: how to justify the cost of training a frontier model when so much of the work can be handled more cheaply.
For an industry that built its momentum on “bigger wins,” the next phase may be about something far less glamorous—but potentially far more powerful: learning when bigger isn’t necessary.
So basically AI will get cheaper and start doing everything… right?
I don’t buy it. If the model is cheaper then it’s automatically worse, like you get what you pay for.
Wait, legal AI tests? I thought AI was already too expensive for courts, so who’s paying for this 80% thing? Sounds like they’re just moving costs around and calling it “savings.”
Armstrong on X always says wild stuff. “99% cheaper” sounds made up tbh. Also bigger models are better, so if companies ditch them then how does it not mess up accuracy? Maybe they just dumb it down and pretend it’s the same outcome.