In partnership with

94% of Institutional Investors Are in Private Credit.

A 2025 Nuveen survey found 94% of institutional investors now allocate to private credit. Pension funds, sovereign wealth funds, endowments — it's about as close to unanimous as institutional finance gets. What they know: T. Rowe Price research shows a 10% private credit allocation has historically cut portfolio volatility and improved risk-adjusted returns. $592.8 billion deployed globally in 2024, up 78% from the year before.

Accredited investors on Percent get direct access to private credit, starting at $500:

· 16.72% current weighted average coupon rate
· Terms as short as 3 months
· Full borrower financials before you invest

$1.82B funded. 0.58% lifetime charge-off rate. 97.33% of all principal returned or currently performing. New investors can receive up to $500 on their first investment.

Alternative investments are speculative. Past performance not indicative of future results. Terms apply.

The Flip

Three years ago, training dominated AI budgets. Most compute went to building models. Inference was an afterthought, a line item someone else worried about.

That ratio has completely inverted. Deloitte's 2026 TMT Predictions peg inference at roughly 66% of all AI compute this year, up from 50% in 2025 and 33% in 2023. Lenovo CEO Yuanqing Yang went further at CES 2026, forecasting an 80/20 split:

"In the future, those numbers are reversed. Eighty percent will be on inference and 20% will be on training. That is our forecast."

Per-token prices have collapsed roughly 280-fold in two years. But that drop tells only half the story: the volume of tokens being generated has exploded by roughly 31x over the same period, and reasoning workloads are multiplying the tokens per request on top of that. Falling unit prices times exploding usage still leaves total inference spending climbing sharply. This is the paradox at the center of the shift: each individual API call costs almost nothing, but organizations are making so many of them that inference has become the dominant cost center anyway. Jevons paradox, applied to GPU cycles.

The economics tell the story more bluntly. GPT-4 cost roughly $150 million to train. A one-time expense. Its inference bill in 2024 alone hit $2.3 billion. Training is the down payment on a house. Inference is the 30-year mortgage.

At GTC, Jeff Dean and Bill Dally estimated that inference consumes up to 90% of total data center power once you account for always-on serving workloads. DeepSeek, meanwhile, showed that training costs are collapsing: it built a model that matches GPT-4 on 90% of benchmarks for $5.6 million, roughly 1% of what competitors spent. If training is getting cheap, the cost of actually using models is where the real financial pressure lives.

For every $1 billion spent training a model, organizations face $15 to $20 billion in inference costs over its production lifetime. Tony Grayson, former SVP at Oracle, AWS, and Meta, put it plainly:

"There is no 'moat' in intelligence when a competitor can knock you off the leaderboard in a single financial quarter. The model is a commodity. The infrastructure is the toll road."

Five Techniques That Actually Matter

Here's the aha moment in these numbers. That $1 billion training bill generates $15 to $20 billion in inference costs. Which means the highest-leverage engineering work isn't making models smarter. It's making them cheaper to run.

Noam Brown at OpenAI demonstrated that giving a model 20 seconds to "think" during inference produces the same improvement as scaling training by 100,000x. That's powerful. It also means inference workloads are about to get dramatically more expensive as reasoning models become standard. Every technique below directly reduces that bill.

1. Speculative Decoding (2-3x latency reduction)

A small, fast "draft" model generates several candidate tokens. The big model verifies them all in a single forward pass. You get multiple tokens per step instead of one, cutting latency 2-3x with zero accuracy loss. This works because most next-token predictions in natural language are easy. The draft model handles the high-probability continuations that any decent language model would agree on, and the big model only steps in to verify or correct. Think of it like a junior developer writing the boilerplate while the senior engineer reviews: the senior's time is expensive, so you only use it where judgment actually matters. QuantSpec extends this with 4-bit quantized KV cache for edge devices, hitting roughly 2.5x speedup on constrained hardware.
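The draft-and-verify loop can be sketched in a few lines. The two "models" below are toy stand-ins invented for illustration, and the greedy accept-on-match rule is a simplification of the rejection-sampling scheme real systems use, but the control flow is the same: many draft proposals, one target pass per round.

```python
def draft_next(context):
    """Toy draft model: cheap, predicts last token + 1."""
    return context[-1] + 1 if context else 0

def target_next(context):
    """Toy target model: agrees with the draft except where the next
    token would be a multiple of 7, where it 'corrects' to tok + 1."""
    tok = context[-1] + 1 if context else 0
    return tok if tok % 7 != 0 else tok + 1

def speculative_decode(prompt, n_tokens, k=4):
    """Draft proposes k tokens; target verifies them in one pass,
    keeping the matching prefix plus one corrected token."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_tokens:
        target_passes += 1
        # 1. Cheap draft model proposes k candidates autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Target verifies all k in what would be a single forward pass.
        ctx = list(out)
        for t in draft:
            tgt = target_next(ctx)
            ctx.append(tgt)          # target's token is always kept
            if tgt != t:             # first mismatch ends the speculation
                break
        out = ctx
    return out[:len(prompt) + n_tokens], target_passes

tokens, passes = speculative_decode([0], n_tokens=10, k=4)
print(tokens, passes)  # 10 tokens in 3 target passes instead of 10
```

Because the target model always contributes at least one token per pass, the worst case degrades to ordinary decoding; the win comes from runs of easy tokens where all k drafts are accepted at once.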

2. KV-Cache Optimization (50% memory savings)

The KV cache stores attention states and grows linearly with sequence length. It is the single biggest memory hog in long-context inference. For long-context models handling 100K+ tokens, the KV cache alone can consume more memory than the model weights themselves, which is why this optimization matters so much for the reasoning-heavy workloads that are becoming standard. NVFP4 quantization cuts KV cache memory by about 50% compared to FP8. Cache offloading to SSD or CPU memory frees GPU VRAM for actual computation. Spheron estimates these techniques let you serve 10x more users on the same GPU.
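Back-of-envelope sizing makes the memory pressure concrete. The shapes below (80 layers, 8 KV heads, head dimension 128, a Llama-70B-like configuration with grouped-query attention) are illustrative assumptions, not a published spec:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_val):
    # Factor of 2 for K and V; one vector per layer, per KV head, per token.
    return int(2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val)

# Illustrative Llama-70B-like shapes at a 128K-token context, batch size 1.
fp8 = kv_cache_bytes(80, 8, 128, 128_000, 1, 1.0)   # FP8: 1 byte per value
fp4 = kv_cache_bytes(80, 8, 128, 128_000, 1, 0.5)   # 4-bit: ~0.5 bytes per value
print(f"FP8 KV cache: {fp8 / 2**30:.1f} GiB")       # ~19.5 GiB
print(f"FP4 KV cache: {fp4 / 2**30:.1f} GiB")       # ~9.8 GiB, the ~50% saving
```

Multiply by a realistic batch size and the cache, not the weights, becomes the binding constraint on how many concurrent users one GPU can serve.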

3. PagedAttention (the de facto standard)

Borrowed from operating systems. PagedAttention divides the KV cache into fixed-size blocks that don't need contiguous GPU memory. Just as an OS doesn't need contiguous physical RAM for a virtual memory page, PagedAttention doesn't need contiguous GPU memory for attention states. A block table maps logical to physical locations, eliminating memory fragmentation. It now underpins vLLM, TensorRT-LLM, and SGLang. If you're serving LLMs and not using PagedAttention, you're leaving performance on the table.
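The block-table idea fits in a few lines. This is a toy allocator in the spirit of PagedAttention, with all names invented for illustration; vLLM's real implementation additionally handles GPU memory pools and copy-on-write block sharing.

```python
class PagedKVCache:
    """Toy block-table allocator: each sequence's logical KV blocks map
    to arbitrary, non-contiguous physical blocks."""

    def __init__(self, num_physical_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_physical_blocks))   # free physical blocks
        self.block_tables = {}                         # seq_id -> physical ids

    def append_token(self, seq_id, pos):
        """Return (physical_block, offset) for token `pos` of a sequence,
        allocating a fresh block whenever the last one fills up."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:     # previous logical block is full
            table.append(self.free.pop())  # any free physical block will do
        return table[pos // self.block_size], pos % self.block_size

    def free_seq(self, seq_id):
        """Finished sequences return their blocks to the shared pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))
```

Because the mapping is per-block rather than per-sequence, a new request can reuse blocks freed by any earlier request, which is exactly what eliminates fragmentation.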

4. Continuous Batching (maximize GPU utilization)

Static batching wastes GPU cycles waiting for the slowest request. Continuous batching dynamically slots new requests into the batch as soon as prior ones finish decoding. GPUs stay maximally utilized. The tradeoff is per-user latency vs. system throughput, and you'll need to tune that per application.
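A small simulation shows the difference. Each job is just a decode length in steps; the scheduler below is a sketch of the idea, not vLLM's actual scheduler:

```python
def static_batching_steps(jobs, batch_size):
    """Each batch runs until its slowest request finishes."""
    return sum(max(jobs[i:i + batch_size])
               for i in range(0, len(jobs), batch_size))

def continuous_batching_steps(jobs, batch_size):
    """Finished requests are replaced immediately from the queue."""
    pending, steps = list(jobs), 0
    active = [pending.pop() for _ in range(min(batch_size, len(pending)))]
    while active:
        steps += 1
        active = [j - 1 for j in active if j > 1]    # everyone decodes 1 token
        while pending and len(active) < batch_size:  # backfill freed slots
            active.append(pending.pop())
    return steps

# Six short requests stuck behind two long ones, batch size 4.
jobs = [10, 1, 1, 1, 10, 1, 1, 1]
print(static_batching_steps(jobs, 4))      # 20: each batch waits on a 10
print(continuous_batching_steps(jobs, 4))  # 12: slots refill as the 1s finish
```

The short requests stop paying for the long ones, which is where the throughput gain comes from; the long requests now share the GPU with a rotating cast of neighbors, which is the per-user latency tradeoff mentioned above.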

5. Disaggregated Serving (split pre-fill and decode)

Inference has two phases with opposite hardware needs. Pre-fill is compute-bound (processes all input tokens in parallel). Decode is memory-bound (generates tokens one at a time). Running them on the same hardware means neither phase is optimized. This approach is emerging now because models have gotten large enough that the phase split is no longer optional. A 405B parameter model's prefill phase can fully saturate one hardware profile while the decode phase starves it, wasting expensive accelerators either way. NVIDIA Dynamo 1.0 supports splitting them across separate clusters. SGLang's roadmap prioritizes this for multimodal models with speculative decoding across nodes.
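Structurally, disaggregation is a pipeline with a KV-state handoff between two differently provisioned pools. The sketch below fakes both phases with stand-in functions; in a real deployment the queue is a high-bandwidth interconnect and the transferred state is the prompt's KV cache.

```python
from queue import Queue

def prefill(request):
    """Compute-bound phase: process the whole prompt in parallel (stand-in)."""
    return {"id": request["id"], "kv": len(request["prompt"])}  # fake KV state

def decode(state, n_tokens):
    """Memory-bound phase: generate tokens one at a time (stand-in)."""
    return [state["kv"] + i for i in range(n_tokens)]

handoff = Queue()  # models the KV-state transfer between the two clusters
for req in ({"id": 0, "prompt": "a" * 512}, {"id": 1, "prompt": "b" * 64}):
    handoff.put(prefill(req))                # runs on compute-optimized nodes

results = {}
while not handoff.empty():
    state = handoff.get()
    results[state["id"]] = decode(state, 4)  # runs on memory-optimized nodes
```

The payoff is that each pool can be sized and hardware-matched to its own bottleneck instead of compromising on both.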

The Infrastructure Split

When inference was a rounding error, nobody cared where it ran. Now it's 66% of compute, and infrastructure decisions matter.

Training infrastructure optimizes for raw FLOPS. You rent a massive GPU cluster for weeks, burn through your CapEx budget, and you're done. These "bit barns" sit in rural locations with cheap power and cold air. Nobody cares about latency when the user is a training script.

Inference infrastructure is the opposite. It needs to be close to users, ideally sub-10ms away. It optimizes for memory bandwidth, not FLOPS. It runs 24/7, which means it's OpEx, not CapEx. And it has to scale out horizontally rather than scale up vertically.

NVIDIA's answer is Dynamo 1.0, an inference orchestration layer launched at GTC. DigitalOcean reported 43,000+ OpenClaw agent deployments already running on it.

On the hardware side, the speed race is heating up. Cerebras clocks 2,522 tokens per second on Llama 4 Maverick versus NVIDIA Blackwell's 1,038. But Dean and Dally set the real target: 10,000 to 20,000 tokens per second per user for autonomous agents. That's roughly 4x faster than Cerebras and 10x faster than Blackwell. For agentic workloads where the model needs to reason, plan, execute, and iterate in real time, anything slower than that target means the user is waiting. And users don't wait.

This is why the Deloitte framework points toward a three-tier hybrid architecture: cloud for burst capacity and experimentation, on-premises "AI factories" for steady-state high-volume inference, and edge deployment for narrow latency-critical tasks like real-time voice or robotics. No single tier handles all workloads well, and organizations that bet entirely on one will either overpay or underperform.

Grayson's framing captures where this is heading: the model is a commodity. The infrastructure that serves it at scale, fast enough and cheap enough, is where the actual business value sits.

Counter-Arguments Worth Taking Seriously

The inference hype has limits.

Test-time compute isn't magic. Research shows it doesn't consistently improve accuracy on knowledge-intensive tasks and can actually increase hallucinations through confirmation bias. Extended reasoning helps with math and logic. It actively hurts factual recall.

Diminishing returns are real. Practical test-time scaling shows plateauing effects after a saturation point, bounded by memory bandwidth. "Just think longer" hits a wall.

Training still matters. Noam Brown himself warned that reasoning only works on sufficiently capable base models:

"If you ask a pigeon to think really hard about playing chess, it's not going to get that far."

Cost reductions may stall. The 280-fold drop in two years is unlikely to continue at that pace. Deloitte warns that compute demand is growing 4-5x per year through 2030, outpacing efficiency gains. Enterprises are already seeing monthly AI bills in the tens of millions, with cost estimation errors of 500-1,000% according to Gartner. That number is worth sitting with. It means teams that budgeted $200,000 per month for inference are getting bills for $1 million or more. The gap between projected and actual inference costs is the single most common financial surprise in enterprise AI deployments right now.

Edge inference is overhyped. Deloitte pushes back explicitly: the majority of computation will still happen on expensive, power-hungry chips in large data centers. On-device AI handles simple tasks. Reasoning and agentic workloads require data center scale.

What to Do About It

Audit your inference costs now. If you're running models in production, your inference bill will dwarf training within 12 months. Gartner found that enterprise teams routinely underestimate inference costs by 500-1,000%. Know your per-token economics before they surprise you.
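A first-pass audit is one formula. The prices below ($2.50 and $10.00 per million input/output tokens) are assumptions for illustration; substitute your provider's actual rates and your real traffic.

```python
def monthly_inference_cost(requests_per_day, in_tokens, out_tokens,
                           in_price_per_m, out_price_per_m, days=30):
    """Per-token economics: cost per request, scaled to a monthly bill."""
    daily = requests_per_day * (in_tokens * in_price_per_m +
                                out_tokens * out_price_per_m) / 1_000_000
    return daily * days

# Assumed workload: 1M requests/day, 1,500 input + 400 output tokens each.
cost = monthly_inference_cost(1_000_000, 1_500, 400, 2.50, 10.00)
print(f"${cost:,.0f}/month")  # $232,500/month
```

Run this with your real numbers before procurement does, and note how fast it moves when reasoning models push output-token counts up.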

Start with the easy wins. PagedAttention plus continuous batching via vLLM gives you the best return for the least effort. Most teams see 3-5x throughput improvement on day one. If you're still serving models with static batching and naive memory allocation, you're paying 3-5x more than you need to.

Consider inference-first model selection. When choosing between a 70B and a 405B parameter model, the 70B might deliver 90% of the accuracy at roughly one-sixth the serving cost. In the training era, teams picked the biggest model they could afford to train. In the inference era, model selection is a unit economics decision. The right model is the smallest one that meets your quality bar, because every parameter you add multiplies your serving bill for the lifetime of the deployment.
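As a decision rule, that reduces to: filter by the quality bar, then take the cheapest survivor. The quality scores and cost multiples below are hypothetical placeholders for your own benchmark results.

```python
# Hypothetical candidates: (name, benchmark quality, relative serving cost).
MODELS = [
    ("8B",   0.78,  1.0),
    ("70B",  0.90,  6.0),
    ("405B", 0.93, 36.0),
]

def pick_model(models, quality_bar):
    """Inference-era selection: cheapest model that clears the bar."""
    viable = [m for m in models if m[1] >= quality_bar]
    return min(viable, key=lambda m: m[2]) if viable else None

print(pick_model(MODELS, 0.88))  # ('70B', 0.9, 6.0): the 405B never enters
```

Note that the 405B's extra three points of quality never get considered once the 70B clears the bar; that asymmetry is the whole point of inference-first selection.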

Profile before you specialize. Run your workload through a profiler and measure whether you're pre-fill bound or decode bound before investing in disaggregated serving or speculative decoding. The wrong optimization on the wrong bottleneck wastes money. Pre-fill bound means your input prompts are long and complex. Decode bound means your outputs are long. The fix is different for each.
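The two regimes separate cleanly if you time the stream: time-to-first-token (TTFT) is dominated by prefill, time-per-output-token (TPOT) by decode. `fake_generate` below is a stand-in for a real streaming endpoint, with sleeps playing the role of GPU work.

```python
import time

def profile_request(generate, prompt, n_tokens):
    """Split latency into prefill (TTFT) and decode (TPOT) components."""
    t0 = time.perf_counter()
    stream = generate(prompt, n_tokens)     # assumed to yield tokens
    next(stream)                            # first token: pays the prefill cost
    ttft = time.perf_counter() - t0
    t1 = time.perf_counter()
    rest = sum(1 for _ in stream)
    tpot = (time.perf_counter() - t1) / max(rest, 1)
    # Heuristic: which phase accounts for more total wall-clock time?
    bound = "prefill-bound" if ttft > tpot * n_tokens else "decode-bound"
    return ttft, tpot, bound

def fake_generate(prompt, n):               # stand-in streaming model
    time.sleep(0.05)                        # pretend prefill
    for i in range(n):
        time.sleep(0.002)                   # pretend per-token decode
        yield i

print(profile_request(fake_generate, "long prompt here", 10)[2])  # prefill-bound
```

Point the same harness at your real endpoint: a prefill-bound profile argues for disaggregated serving or prompt trimming, while a decode-bound one argues for speculative decoding.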

Build for the three-tier hybrid. Deloitte's framework maps the emerging default: cloud for burst experimentation, on-premises "AI factories" for predictable high-volume inference, edge for latency-critical narrow tasks. Engineers who can architect across all three tiers will be in high demand.

Watch the tokens-per-second race. Cerebras at 2,522. Blackwell at 1,038. The Dean/Dally target of 10,000-20,000 for agents. Your architecture choices today determine whether you can hit that number in 18 months. Whichever vendor closes that gap first reshapes the market.


The training era built the models. The inference era decides who actually gets to use them. The companies that figure out how to serve intelligence cheaply, reliably, and fast will own the next decade of AI. The rest will pay the toll.


Keep Reading