The Daily Record of Artificial Intelligence
Analysis

The Inference Cost Curve: What Falling Token Prices Mean for AI Products

The cost of running large language model inference has fallen dramatically over the past two years. This creates new product possibilities — and puts pressure on business models built around expensive compute.


18 April 2025

TL;DR

  • Inference costs for frontier-class models have fallen by roughly an order of magnitude in 18–24 months, driven by hardware improvements, quantisation, and competitive pressure
  • Products built around expensive-inference economics — gating features behind token budgets — face structural pressure as the cost floor drops
  • Falling inference costs shift competitive advantage from access to capability toward product design, user experience, and distribution
  • The cost curve has also enabled new categories: continuous background processing, large-context document analysis, and agentic pipelines that would have been cost-prohibitive two years ago
  • The curve is not guaranteed to continue — hardware constraints and power limitations may slow the decline at the frontier

In early 2023, a GPT-4-class query cost several cents per call. By mid-2025, comparable quality is available for a small fraction of that price, with many workloads running on open-weight models at costs closer to thousandths of a cent per token. This shift, roughly an order of magnitude in under two years, is not just a pricing story. It is restructuring which AI product architectures are viable.
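The order-of-magnitude claim is easy to sanity-check with per-token arithmetic. A minimal sketch, using illustrative prices chosen only to match the rough magnitudes above (not quoted rates from any provider):

```python
# Illustrative arithmetic only: prices are assumptions, not quoted rates.

def query_cost(prompt_tokens: int, output_tokens: int,
               price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one call, given per-million-token input/output prices."""
    return (prompt_tokens * price_in_per_m +
            output_tokens * price_out_per_m) / 1_000_000

# A typical chat call: ~1,000 prompt tokens, ~500 output tokens.
early_2023 = query_cost(1_000, 500, price_in_per_m=30.0, price_out_per_m=60.0)
mid_2025   = query_cost(1_000, 500, price_in_per_m=2.0,  price_out_per_m=6.0)

print(f"early 2023: ${early_2023:.4f} per call")  # a few cents
print(f"mid 2025:   ${mid_2025:.4f} per call")    # a fraction of a cent
print(f"decline:    {early_2023 / mid_2025:.0f}x")
```

Under these assumed prices the per-call cost falls from $0.06 to $0.005, a roughly 12x decline, consistent with "an order of magnitude" before counting the still-cheaper open-weight tiers.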

What Is Driving the Decline

Three forces compound each other. First, hardware: successive generations of inference-optimised chips (NVIDIA H100 to H200 to Blackwell, plus Google TPU v5 and Amazon Trainium) deliver meaningfully more throughput per dollar. Second, efficiency techniques: quantisation, speculative decoding, and improved batching reduce compute requirements without proportional quality loss. Third, competition: the model provider landscape has expanded significantly, with open-weight models (Llama, Mistral, Qwen) creating a cost floor that pushes proprietary providers to match.
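How the first two forces compound can be sketched as serving-cost arithmetic: cost per token is accelerator cost per hour divided by tokens served per hour, so throughput gains from new hardware and from efficiency techniques multiply. The figures below are hypothetical assumptions, not benchmarks of any named chip or model:

```python
# Back-of-the-envelope model of compounding cost drivers. All numbers are
# illustrative assumptions, not measured benchmarks.

def dollars_per_million_tokens(gpu_hour_cost: float,
                               tokens_per_second: float) -> float:
    """Serving cost per million tokens for one accelerator."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost / tokens_per_hour * 1_000_000

baseline = dollars_per_million_tokens(gpu_hour_cost=4.0, tokens_per_second=500)

# Hypothetical gains: 2x throughput from a newer chip generation, and a
# further 2.5x effective throughput from quantisation, speculative
# decoding, and better batching.
improved = dollars_per_million_tokens(gpu_hour_cost=4.0,
                                      tokens_per_second=500 * 2 * 2.5)

print(f"baseline: ${baseline:.2f} per M tokens")
print(f"improved: ${improved:.2f} per M tokens")  # 5x cheaper
```

Note the third force, competition, then acts on top of this: the 5x cost reduction in the sketch is what providers can pass through when open-weight alternatives force them to.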

Each of these would matter independently. Together, they have created a cost trajectory that has surprised even optimistic analysts.

What Changes for Products Built on LLMs

The most immediate consequence is that gating mechanisms designed around inference costs begin to look artificial. Subscription tiers that limit monthly message counts, or features locked to expensive plans because of per-call costs, face pressure as the underlying economics shift. Products that treated inference as a scarce resource must reconsider whether scarcity is still the right frame.
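The pressure on token-budget gating is straightforward margin arithmetic. A hypothetical example, with usage levels and prices assumed purely for illustration:

```python
# Hypothetical subscription economics: a monthly message cap that protected
# margins at 2023-era prices becomes a product choice, not a necessity, at
# 2025-era prices. All figures are assumptions.

MESSAGES_PER_MONTH = 1_000
TOKENS_PER_MESSAGE = 1_500   # prompt + completion, rough average
SUBSCRIPTION_PRICE = 20.00   # dollars per month

def monthly_inference_cost(price_per_m_tokens: float) -> float:
    """Inference spend per subscriber per month at a given token price."""
    return (MESSAGES_PER_MONTH * TOKENS_PER_MESSAGE
            * price_per_m_tokens / 1_000_000)

cost_2023 = monthly_inference_cost(price_per_m_tokens=40.0)
cost_2025 = monthly_inference_cost(price_per_m_tokens=3.0)

print(f"2023: ${cost_2023:.2f} cost vs ${SUBSCRIPTION_PRICE:.2f} revenue")  # cap required
print(f"2025: ${cost_2025:.2f} cost vs ${SUBSCRIPTION_PRICE:.2f} revenue")  # cap optional
```

Under these assumptions, serving a heavy user once cost three times the subscription price; now it costs a quarter of it. The cap survives only if it serves some purpose other than cost control.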

More structurally, the decline in inference costs means that differentiation is shifting. In 2023, access to GPT-4 class capability was itself a competitive advantage. That advantage is narrowing. What matters increasingly is product design, workflow integration, data advantages, and distribution — factors that are harder to buy.

New Categories the Curve Enables

Falling costs also open genuinely new product categories. Continuous background processing — running analysis on incoming documents or data streams without per-run human triggers — becomes economically viable. Large-context document analysis across entire repositories or legal corpora is no longer prohibitively expensive. Agentic pipelines that make hundreds of model calls per task have crossed the threshold from demo to production.
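The viability threshold for agentic pipelines can be made concrete with the same kind of arithmetic. Call counts and prices below are illustrative assumptions matching the magnitudes in the text:

```python
# Rough cost-per-task model for an agentic pipeline that makes hundreds of
# model calls per task. Counts and prices are illustrative assumptions.

def task_cost(model_calls: int, avg_tokens_per_call: int,
              price_per_m_tokens: float) -> float:
    """Dollar cost of one end-to-end agentic task."""
    return (model_calls * avg_tokens_per_call
            * price_per_m_tokens / 1_000_000)

# An agent making 300 calls of ~2,000 tokens each per task:
at_2023_prices = task_cost(300, 2_000, price_per_m_tokens=40.0)
at_2025_prices = task_cost(300, 2_000, price_per_m_tokens=2.0)

print(f"per task, 2023-era pricing: ${at_2023_prices:.2f}")  # demo territory
print(f"per task, 2025-era pricing: ${at_2025_prices:.2f}")  # production-plausible
```

At roughly $24 per task, such a pipeline only works for high-value workflows; at just over a dollar per task, it can underpin an everyday product feature.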

These capabilities existed technically at higher price points but were not deployable at scale. The cost curve has moved them from experiments into practical tools.

Where the Limits Are

The curve is not a guarantee. Frontier models — the largest, most capable systems — face different economics from commodity inference. Training costs for next-generation models remain very high, and hardware and power constraints may slow the pace of improvement. The cost curve for commodity-class inference may continue falling while frontier inference stays expensive, creating a two-tier market: cheap-and-capable-enough for most applications, expensive-and-best for those where state-of-the-art performance matters.

Key Takeaways

  • Inference costs for LLM-class models have fallen roughly an order of magnitude in under two years
  • The decline is driven by hardware improvements, efficiency techniques, and competitive pressure from open-weight models
  • Products built on scarcity-of-compute assumptions need to revisit their business model assumptions
  • New product categories — continuous processing, large-context analysis, agentic pipelines — have crossed into economic viability
  • Frontier model inference may maintain a cost premium even as commodity inference continues to fall