The Case for Smaller Models: Why Enterprises Need Them

Models are getting more expensive. GPT 5.5 runs $5 per million input tokens and $30 per million output. Claude Opus 4.8 sits close behind at $5 in and $25 out, and the ceiling keeps rising: Claude Fable and Mythos launched at twice the price of Opus. Ever since reasoning models appeared, inference has become the dominant factor in compute usage, and the bills show it: one company reportedly spent $500 million a month on Claude, and even Microsoft has said Anthropic’s models are too expensive. The arithmetic behind those numbers is simple: when a workflow runs millions of times against a frontier model, the bill isn’t a function of difficulty; it’s volume times the highest per-token rate on the menu.
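To make that arithmetic concrete, here is a minimal sketch using the GPT 5.5 rates quoted above. The workload figures (calls per month, tokens per call) are hypothetical placeholders, not measurements, and the function name is ours.

```python
def monthly_bill(calls, input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Monthly cost in USD for a workflow that makes `calls` requests per month."""
    # Cost of one call, expressed in "dollars x tokens per million" units.
    per_call_units = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return calls * per_call_units / 1_000_000

# Hypothetical workload: 10 million calls a month, 2,000 input and 500 output tokens each.
# Rates are the $5 / $30 per million tokens quoted in the text.
bill = monthly_bill(10_000_000, 2_000, 500, 5.0, 30.0)
print(f"${bill:,.0f} per month")  # $250,000 per month
```

Nothing in the function takes task difficulty as an input: an easy classification call and a hard reasoning call with the same token counts cost the same.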

For the last few years, providers effectively subsidized tokens to put AI into everyone’s hands. That era is closing. Prices are rising, and usage limits keep tightening on subscription plans, with both Anthropic and OpenAI drawing user complaints.

This raises the real question for any enterprise scaling AI: how do you automate more work without burning every token in the world? We argue that fine-tuning smaller models for specific workflows is not just viable but necessary. It’s not only a cost play: there are agents and workflows you simply cannot build reliably on a general frontier model alone.

The usual ways to cut costs

Fine-tuning isn’t where most teams start, though, and for good reason: there are cheaper levers to pull first. The easiest is the prompt harness. For a model you don’t fine-tune, the only lever is in-context learning: system prompts, agent instructions, skills, memory, and MCP servers for tool calling.¹ This is why spec-driven development tools like Superpowers and Spec Kit took off, and why so much effort goes into harnesses like Claude Code and Codex. All of that work comes down to putting the right context in front of the model at the right time.

This works, and it’s essential, but it has a ceiling. Everything the model needs must live in the context window or close to it, and the prompt only grows as you patch more edge cases, making the process more expensive.² Some behaviors can’t be reliably induced by a prompt at all, no matter the wording, and longer contexts degrade performance on their own.

Another common approach is a model router: a smaller language model sits in front, classifies each query, and forwards difficult problems to a stronger model and simple tasks to a cheaper one.³ It works, but with tradeoffs. It adds latency, and it isn’t deterministic. Quality is capped below that of the strongest model in the pool, and because difficulty isn’t directly observable, a misroute and a genuine model failure look identical from the outside, making degraded responses hard to debug. It also takes control away from the user or workflow author, who often knows better than a naive classifier which model the task needs.
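The routing pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's router: the model names are placeholders, and `toy_classifier` is a crude word-count heuristic standing in for the small classifier model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    model: str    # which model tier answers the query
    score: float  # the classifier's difficulty estimate in [0, 1]

def make_router(classify: Callable[[str], float], threshold: float = 0.5):
    """Build a router that sends queries scoring at or above `threshold` to the strong model."""
    def route(query: str) -> Route:
        score = classify(query)
        model = "strong-model" if score >= threshold else "cheap-model"
        return Route(model=model, score=score)
    return route

# Toy stand-in for the classifier (an assumption): longer queries count as harder.
def toy_classifier(query: str) -> float:
    return min(len(query.split()) / 20, 1.0)

route = make_router(toy_classifier)
easy = route("Reset my password")
hard = route("Reconcile these three ledgers, explain every discrepancy, "
             "and draft a summary for the auditors")
print(easy.model, hard.model)  # cheap-model strong-model
```

A short but genuinely hard query would score low here and go to the cheap model. Keeping the score in the returned `Route` is one way to tell such a misroute apart from a model failure after the fact.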

Multi-agent systems are the other approach gaining ground: a frontier model orchestrates, breaking work apart and handing routine steps to cheaper subagents. This already saves real money, since the expensive model only touches the parts that need it. The open question is what fills those subagent slots. Today, it’s usually a smaller general model behind a prompt, which inherits all the fragility above.

As agents take on more complex workflows, these constraints stop being annoyances and become blockers: the prompt has a ceiling, the router itself must be trained, and the subagent slots require models that are actually reliable. At that point, model training is no longer optional. The only question left is what to train, and a small, open model tuned to a specific task is the answer worth building toward.

Why specialized models matter

Model post-training is one of the most important recipes in the stack. The clearest reason is reliability: with the behavior trained into the weights, the model doesn’t drift as providers change their models underneath you, the way a prompt-and-harness setup does. It is also much cheaper to run, often by one to three orders of magnitude, depending on how small the model is.

Just as important, you own the result. The weights are yours, and they encode the data you trained them on, which matters most for the company-specific knowledge a general model never sees. That ownership is what makes continuous improvement possible: every interaction with the workflow generates more data, and that data feeds back into the next round of training, a compounding cycle no static general model can offer. That cycle rests on owning a model you can train in the first place.

To train these smaller models, two conditions have to hold. First, you need a way to quickly verify the work. This is what makes reinforcement learning possible: the model either reaches the correct end state or it doesn’t, so every attempt can be scored and turned into a training signal. Second, the training data has to be sufficiently diverse for the model to generalize to the real task after post-training, rather than overfitting to the example trajectories. Plenty of enterprise work fits both: support flows, routing, and structured extraction. Training methods largely come down to pulling a clean signal from each run; the real bottleneck is high-quality data.⁴
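For structured extraction, the "correct end state or not" check can be as simple as the sketch below. It assumes the model is asked to emit JSON and that a reference answer exists for each example. The field names and the all-or-nothing scoring are illustrative choices, not a prescribed reward design.

```python
import json

def extraction_reward(model_output: str, expected: dict) -> float:
    """Return 1.0 if the output parses as JSON and equals the expected record, else 0.0."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output counts as a failed attempt
    return 1.0 if parsed == expected else 0.0

# Hypothetical reference record and three sampled attempts.
expected = {"invoice_id": "INV-1042", "total": 129.5}
attempts = [
    '{"invoice_id": "INV-1042", "total": 129.5}',  # correct end state
    '{"invoice_id": "INV-1042", "total": 129.0}',  # wrong total
    'The invoice is INV-1042',                      # not JSON at all
]
rewards = [extraction_reward(a, expected) for a in attempts]
print(rewards)  # [1.0, 0.0, 0.0]
```

Each score can then be attached to the attempt that produced it, which is the per-run signal the paragraph above refers to.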

Why the industry is already moving here

The labs themselves are moving in this direction. The first era of scaling was about breadth, one model that could do everything, and you got there with more data. The current era is narrower: the headline gains now come on specific, checkable tasks like mathematical reasoning and software engineering, because those are the ones people pay for and, not coincidentally, the ones you can verify and train against (see Kimi K2 and DeepSeek). The same verifiability that makes a small model trainable is driving the frontier, too.⁵

The same is true on the efficiency side. Open-weight model families continue to close the intelligence gap,⁶ and a growing share of research focuses on extracting greater capability from fewer parameters through distillation, quantization, and sparsity. Hardware is following: Apple has been building MLX, which, together with unified memory, delivers some of the best-value local inference around, and NVIDIA’s Spark is putting its own version on the market. The direction is consistent: frontier API calls scale with usage, while an efficient model on a machine you own is a fixed cost, and the industry is racing to make that second option real.

Caveats and the hybrid future

There is no free lunch. A small model needs people who understand training and inference, and a real evaluation pipeline to know when it’s working. But the largest cost, curating high-quality data, is paid once. With that data in hand, you can retrain on a better base model as the frontier moves or extend to adjacent tasks. In a way, you are trading the bulk of a recurring per-token bill for a one-time investment in expertise and data.
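The trade can be framed as a break-even count: how many calls it takes for per-call savings to repay the upfront spend. Every figure in this sketch is hypothetical, and it deliberately ignores recurring costs such as serving hardware and staff, so treat the result as a lower bound.

```python
import math

def breakeven_calls(one_time_cost, frontier_cost_per_call, small_cost_per_call):
    """Number of calls after which per-call savings repay the one-time investment."""
    savings_per_call = frontier_cost_per_call - small_cost_per_call
    if savings_per_call <= 0:
        raise ValueError("the small model must be cheaper per call to ever break even")
    return math.ceil(one_time_cost / savings_per_call)

# Hypothetical figures: $200,000 for data curation and training,
# $0.025 per frontier call, and a small model 100x cheaper per call.
calls = breakeven_calls(200_000, 0.025, 0.00025)
print(f"Break-even after about {calls:,} calls")
```

With these inputs the answer is roughly eight million calls, which a workflow running a hypothetical ten million times a month would pass within its first month.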

The enterprise stack we expect to win is therefore hybrid: a frontier model handles the open-ended, novel, and low-volume work where its generality earns its cost, while specialized small models own the repetitive, verifiable core that runs many times a day.⁷ The most common shape is a frontier model orchestrating a set of small, specialized agents. Regardless of the agent workflow, though, the rule of thumb is to match each task to the smallest model that does it reliably.
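That rule of thumb can be written as a selection over evaluation results. The sketch assumes you already have a pass rate per model on a task-specific eval set. The model names, sizes, pass rates, and the 0.98 bar are all made up for illustration, and the frontier size is a stand-in because frontier parameter counts are not public.

```python
def smallest_reliable_model(pass_rates, sizes, min_pass_rate=0.98):
    """Pick the smallest model whose eval pass rate meets the bar, or None if none does."""
    reliable = [name for name, rate in pass_rates.items() if rate >= min_pass_rate]
    return min(reliable, key=lambda name: sizes[name], default=None)

# Hypothetical sizes (billions of parameters) and eval pass rates for one task.
sizes = {"tuned-3b": 3, "tuned-8b": 8, "frontier": 1000}
pass_rates = {"tuned-3b": 0.95, "tuned-8b": 0.99, "frontier": 0.995}

choice = smallest_reliable_model(pass_rates, sizes)
print(choice)  # tuned-8b
```

The frontier model scores highest here but is not chosen, because the 8B model already clears the reliability bar. Raising the bar above 0.99 would flip the choice to the frontier model.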

One more shift is worth naming. The work of building this hybrid stack is itself becoming something agents do: a frontier model can scaffold the harness, generate and filter training data, and run the evaluation loop. That is the playbook of agents creating agents: the expensive general model’s job is not to run your workflow forever but to build the cheap specialized system that does.

For an enterprise, training specific models is increasingly a requirement rather than an option. A frontier API bill scales with every call and converts into nothing you keep. The companies that integrate models tuned on their own data into their workflows get something no API contract offers: a system that improves the more it’s used. That’s the divide forming now, not between companies that use AI and companies that don’t, but between those that rent their intelligence and those that own it.

Frontier vs. small models, at a glance

Dimension	Frontier models	Small models
Performance	Usually superior across the board	Comparable, or ~6 months behind, but can be tuned to match on a specific task
ML expertise	None required	Needs ML and inference expertise to tune
Cost at scale	Much more expensive	1-3 orders of magnitude cheaper
Evaluation	Solid evals required	Solid eval pipelines required
Control	Black box, limited fine-grained control	More control; behavior tuned into the weights
Deployment	Online / API only	Online, offline, in-house, or on-device

Footnotes

1. Nearly all of this layer is just structured text placed in context: skills are markdown instruction files, memory is an organized set of markdown notes, and agent instructions are markdown documents. The harness is context engineering, not a different mechanism.
2. Prompt caching (OpenAI, Anthropic, Google) provides a good amount of savings when the system prompt and instructions stay the same across a workflow. But it only lowers cost, not the performance ceiling: the long-context deterioration argument still applies.
3. Routers can also span providers rather than just capability tiers, putting Claude, Gemini, and GPT side by side and routing by each one’s strengths. More sophisticated routers exist, using learned embeddings or cascades that escalate only when a cheaper model is unsure, but the same tradeoffs apply. And a router only helps if the classifier is good: it is itself a small language model, with no guaranteed judgment on queries unlike those it was trained on.
4. The claim that data rather than the algorithm is the bottleneck applies to this narrowly scoped setting, where the task is fixed. In frontier research, where the goal is general capability, improvements to the reinforcement learning algorithms themselves remain a necessary and active area of work.
5. Specialization can come at the expense of untargeted capabilities. There are reports of newer frontier releases regressing on open-ended tasks such as creative writing, which is consistent with what one would expect as the optimization target shifts toward verifiable domains.
6. Distillation, training a smaller model on the outputs of a larger one, is part of how open-weight models closed the gap. In response, major labs have restricted access to full reasoning traces; OpenAI, Google, and Anthropic now withhold them, partly to limit distillation at scale. A side effect is that opaque reasoning makes production failures harder to diagnose.
7. Fine-tuning narrows a model, and it can regress on tasks outside the target. New base models routinely shift performance across benchmarks, too. Here, that isn’t a real concern, because narrowing to the task was the entire point.
