The Daily Record of Artificial Intelligence
Analysis

The Inference Cost Curve: What Falling Token Prices Mean for AI Products

The cost of running large language model inference has fallen dramatically over the past two years. This creates new product possibilities — and puts pressure on business models built around expensive compute.


18 April 2025

TL;DR

  • Inference costs for frontier-class models have fallen by roughly an order of magnitude in 18–24 months, driven by hardware improvements, quantisation, and competitive pressure
  • Products built around expensive-inference economics — gating features behind token budgets — face structural pressure as the cost floor drops
  • Falling inference costs shift competitive advantage from access to capability toward product design, user experience, and distribution
  • The cost curve has also enabled new categories: continuous background processing, large-context document analysis, and agentic pipelines that would have been cost-prohibitive two years ago
  • The curve is not guaranteed to continue — hardware constraints and power limitations may slow the decline at the frontier

In early 2023, a GPT-4-class query cost several cents per call. By mid-2025, comparable quality is available for a small fraction of that price, with many workloads running on open-weight models at costs closer to thousandths of a cent per token. This shift, roughly an order of magnitude in under two years, is not just a pricing story. It is restructuring which AI product architectures are viable.
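The order-of-magnitude claim is easy to sanity-check with per-token arithmetic. A minimal sketch, using illustrative prices chosen only to match the rough magnitudes above (not quoted rates from any provider):

```python
# Illustrative arithmetic only: prices are assumptions, not quoted rates.

def query_cost(prompt_tokens: int, output_tokens: int,
               price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one call, given per-million-token input/output prices."""
    return (prompt_tokens * price_in_per_m +
            output_tokens * price_out_per_m) / 1_000_000

# A typical chat call: ~1,000 prompt tokens, ~500 output tokens.
early_2023 = query_cost(1_000, 500, price_in_per_m=30.0, price_out_per_m=60.0)
mid_2025   = query_cost(1_000, 500, price_in_per_m=2.0,  price_out_per_m=6.0)

print(f"early 2023: ${early_2023:.4f} per call")  # a few cents
print(f"mid 2025:   ${mid_2025:.4f} per call")    # a fraction of a cent
print(f"decline:    {early_2023 / mid_2025:.0f}x")
```

Under these assumed prices the per-call cost falls from $0.06 to $0.005, a roughly 12x decline, consistent with "an order of magnitude" before counting the still-cheaper open-weight tiers.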

What Is Driving the Decline

Three forces compound each other. First, hardware: successive generations of inference-optimised chips (NVIDIA H100 to H200 to Blackwell, plus Google TPU v5 and Amazon Trainium) deliver meaningfully more throughput per dollar. Second, efficiency techniques: quantisation, speculative decoding, and improved batching reduce compute requirements without proportional quality loss. Third, competition: the model provider landscape has expanded significantly, with open-weight models (Llama, Mistral, Qwen) creating a cost floor that pushes proprietary providers to match.
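How the first two forces compound can be sketched as serving-cost arithmetic: cost per token is accelerator cost per hour divided by tokens served per hour, so throughput gains from new hardware and from efficiency techniques multiply. The figures below are hypothetical assumptions, not benchmarks of any named chip or model:

```python
# Back-of-the-envelope model of compounding cost drivers. All numbers are
# illustrative assumptions, not measured benchmarks.

def dollars_per_million_tokens(gpu_hour_cost: float,
                               tokens_per_second: float) -> float:
    """Serving cost per million tokens for one accelerator."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost / tokens_per_hour * 1_000_000

baseline = dollars_per_million_tokens(gpu_hour_cost=4.0, tokens_per_second=500)

# Hypothetical gains: 2x throughput from a newer chip generation, and a
# further 2.5x effective throughput from quantisation, speculative
# decoding, and better batching.
improved = dollars_per_million_tokens(gpu_hour_cost=4.0,
                                      tokens_per_second=500 * 2 * 2.5)

print(f"baseline: ${baseline:.2f} per M tokens")
print(f"improved: ${improved:.2f} per M tokens")  # 5x cheaper
```

Note the third force, competition, then acts on top of this: the 5x cost reduction in the sketch is what providers can pass through when open-weight alternatives force them to.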

Each of these would matter independently. Together, they have created a cost trajectory that has surprised even optimistic analysts.

What Changes for Products Built on LLMs

The most immediate consequence is that gating mechanisms designed around inference costs begin to look artificial. Subscription tiers that limit monthly message counts, or features locked to expensive plans because of per-call costs, face pressure as the underlying economics shift. Products that treated inference as a scarce resource must reconsider whether scarcity is still the right frame.
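The pressure on token-budget gating is straightforward margin arithmetic. A hypothetical example, with usage levels and prices assumed purely for illustration:

```python
# Hypothetical subscription economics: a monthly message cap that protected
# margins at 2023-era prices becomes a product choice, not a necessity, at
# 2025-era prices. All figures are assumptions.

MESSAGES_PER_MONTH = 1_000
TOKENS_PER_MESSAGE = 1_500   # prompt + completion, rough average
SUBSCRIPTION_PRICE = 20.00   # dollars per month

def monthly_inference_cost(price_per_m_tokens: float) -> float:
    """Inference spend per subscriber per month at a given token price."""
    return (MESSAGES_PER_MONTH * TOKENS_PER_MESSAGE
            * price_per_m_tokens / 1_000_000)

cost_2023 = monthly_inference_cost(price_per_m_tokens=40.0)
cost_2025 = monthly_inference_cost(price_per_m_tokens=3.0)

print(f"2023: ${cost_2023:.2f} cost vs ${SUBSCRIPTION_PRICE:.2f} revenue")  # cap required
print(f"2025: ${cost_2025:.2f} cost vs ${SUBSCRIPTION_PRICE:.2f} revenue")  # cap optional
```

Under these assumptions, serving a heavy user once cost three times the subscription price; now it costs a quarter of it. The cap survives only if it serves some purpose other than cost control.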

More structurally, the decline in inference costs means that differentiation is shifting. In 2023, access to GPT-4 class capability was itself a competitive advantage. That advantage is narrowing. What matters increasingly is product design, workflow integration, data advantages, and distribution — factors that are harder to buy.

New Categories the Curve Enables

Falling costs also open genuinely new product categories. Continuous background processing — running analysis on incoming documents or data streams without per-run human triggers — becomes economically viable. Large-context document analysis across entire repositories or legal corpora is no longer prohibitively expensive. Agentic pipelines that make hundreds of model calls per task have crossed the threshold from demo to production.
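The viability threshold for agentic pipelines can be made concrete with the same kind of arithmetic. Call counts and prices below are illustrative assumptions matching the magnitudes in the text:

```python
# Rough cost-per-task model for an agentic pipeline that makes hundreds of
# model calls per task. Counts and prices are illustrative assumptions.

def task_cost(model_calls: int, avg_tokens_per_call: int,
              price_per_m_tokens: float) -> float:
    """Dollar cost of one end-to-end agentic task."""
    return (model_calls * avg_tokens_per_call
            * price_per_m_tokens / 1_000_000)

# An agent making 300 calls of ~2,000 tokens each per task:
at_2023_prices = task_cost(300, 2_000, price_per_m_tokens=40.0)
at_2025_prices = task_cost(300, 2_000, price_per_m_tokens=2.0)

print(f"per task, 2023-era pricing: ${at_2023_prices:.2f}")  # demo territory
print(f"per task, 2025-era pricing: ${at_2025_prices:.2f}")  # production-plausible
```

At roughly $24 per task, such a pipeline only works for high-value workflows; at just over a dollar per task, it can underpin an everyday product feature.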

These capabilities existed technically at higher price points but were not deployable at scale. The cost curve has moved them from experiments into practical tools.

Where the Limits Are

The curve is not a guarantee. Frontier models — the largest, most capable systems — face different economics from commodity inference. Training costs for next-generation models remain very high, and hardware and power constraints may slow the pace of improvement. The cost curve for commodity-class inference may continue falling while frontier inference stays expensive, creating a two-tier market: cheap-and-capable-enough for most applications, expensive-and-best for those where state-of-the-art performance matters.

Key Takeaways

  • Inference costs for LLM-class models have fallen roughly an order of magnitude in under two years
  • The decline is driven by hardware improvements, efficiency techniques, and competitive pressure from open-weight models
  • Products built on scarcity-of-compute assumptions need to revisit their business model assumptions
  • New product categories — continuous processing, large-context analysis, agentic pipelines — have crossed into economic viability
  • Frontier model inference may maintain a cost premium even as commodity inference continues to fall