|
Deep Dive · AI Infrastructure
Why Nvidia Just Paid $20B for a Chip That Can't Train Models
The Groq LPU, the memory wall problem, and why 2025 is the year inference ate training.
|
|
On December 24th, Nvidia quietly made its largest deal ever—a $20 billion "licensing agreement" with Groq that looks suspiciously like an acquisition. 90% of Groq's staff is moving to Nvidia, including founder Jonathan Ross (the same engineer who created Google's TPU).
Here's the strange part: Groq's chips can't train AI models. They're purpose-built for one thing only—inference. So why would the company that dominates AI training pay $20 billion for technology that doesn't compete with its core business?
The answer reveals a fundamental shift in where AI value is created—and why the economics of running models now matter more than building them.
|
First, the Basics
Training vs. Inference: Two Different Games
Training is like going to medical school—an intensive, expensive, one-time investment where the model learns patterns from massive datasets. GPT-4's training reportedly cost $100M+. DeepSeek V3 did it for $5.6M. Either way, you pay once.
Inference is the doctor actually seeing patients—every single day, millions of times. Every ChatGPT query, every Claude response, every Gemini answer is an inference call. And every call costs money.
|
|
|
THE MATH THAT CHANGES EVERYTHING
|
$100M
GPT-4 training cost (one-time)
|
$2.3B
GPT-4 projected inference cost (per year, recurring)
|
Inference accounts for 80-90% of total AI compute costs over a model's lifetime. Training is a CapEx hit; inference is OpEx that never stops.
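A quick back-of-the-envelope calculation makes the asymmetry concrete. The sketch below simply multiplies the figures quoted above; the three-year production lifetime is an assumption for illustration, not a reported number.

# Back-of-the-envelope: one-time training spend vs. recurring inference spend.
# Uses the illustrative figures quoted above; the lifetime is an assumption.
training_cost = 100e6               # ~$100M one-time training cost (reported)
inference_cost_per_year = 2.3e9     # ~$2.3B/year projected inference cost
lifetime_years = 3                  # assumed years the model stays in production

lifetime_inference = inference_cost_per_year * lifetime_years
total = training_cost + lifetime_inference

print(f"training share of lifetime spend:  {training_cost / total:.1%}")
print(f"inference share of lifetime spend: {lifetime_inference / total:.1%}")
# With these inputs inference is the overwhelming majority of the bill, which is
# why a cheaper cost per token matters more than a cheaper training run.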
|
|
The Memory Wall Problem
Here's why GPUs—despite being incredible for training—aren't ideal for inference:
During training, you're processing massive batches of data in parallel. GPUs excel at this—thousands of cores churning through terabytes simultaneously. The overhead of fetching data from external memory (HBM) is hidden by the sheer volume of parallel operations.
But inference is different. You're often serving a single user asking a single question. The GPU's cores sit idle, waiting for data to travel from external memory. This is the memory wall—and it's why your ChatGPT responses sometimes feel sluggish despite running on the world's most powerful chips.
|
|
GPU ARCHITECTURE (NVIDIA H100)
┌─────────────────────────────────────────────────┐
│                   COMPUTE DIE                   │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐        │
│  │Core │ │Core │ │Core │ │Core │ │ ... │        │
│  │  1  │ │  2  │ │  3  │ │  4  │ │16896│        │
│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘        │
│     └───────┴───────┼───────┴───────┘           │
│                     │                           │
│              ◄─────►│◄─────►  ~3.3 TB/s         │
│             waiting │ fetching                  │
└─────────────────────┼───────────────────────────┘
                      │
                      ▼  ← Data travels OFF-CHIP
┌─────────────────────────────────────────────────┐
│           EXTERNAL HBM MEMORY (80GB+)           │
│    Latency: ~100s of nanoseconds per access     │
└─────────────────────────────────────────────────┘
GPU cores often run at 30-40% utilization during inference—waiting on memory
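You can sanity-check the memory wall with a crude roofline-style estimate: compare the time a decode step spends streaming weights out of HBM against the time the arithmetic itself needs. The Python sketch below assumes a 70B-parameter model with 8-bit weights, ~3.35 TB/s of HBM bandwidth, and a round 1 PFLOP/s of usable compute; all of these are ballpark assumptions for illustration, not measured numbers.

# Crude roofline sketch: is a decode step limited by memory traffic or by math?
# Ballpark assumptions only; real serving stacks overlap and cache aggressively.
hbm_bandwidth = 3.35e12     # bytes/s streamed from HBM3 (H100 class)
peak_compute  = 1.0e15      # assumed usable FLOP/s, order of magnitude
params        = 70e9        # assumed model size: 70B parameters
weight_bytes  = params * 1  # assumed 8-bit weights

def decode_rate(batch):
    # Every decode step reads all weights once and spends ~2 FLOPs per
    # parameter per sequence in the batch.
    memory_time  = weight_bytes / hbm_bandwidth
    compute_time = 2 * params * batch / peak_compute
    return batch / max(memory_time, compute_time), memory_time > compute_time

for batch in (1, 8, 64, 512):
    tokens_per_sec, memory_bound = decode_rate(batch)
    label = "memory-bound" if memory_bound else "compute-bound"
    print(f"batch {batch:3d}: ~{tokens_per_sec:8,.0f} tok/s  ({label})")
# At batch 1 the math finishes in a fraction of the time it takes to stream the
# weights in, so the cores idle; only large batches keep them busy.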
|
|
Groq's Radical Bet: What If We Eliminated External Memory?
Groq's LPU (Language Processing Unit) takes a completely different approach. Instead of using external HBM memory like GPUs, Groq uses on-chip SRAM as primary weight storage.
|
|
LPU ARCHITECTURE (GROQ)
┌─────────────────────────────────────────────────┐
│                   SINGLE CHIP                   │
│  ┌───────────────────────────────────────────┐  │
│  │           ON-CHIP SRAM (230MB)            │  │
│  │   Bandwidth: ~80 TB/s (~25x faster)       │  │
│  │   Latency: ~nanoseconds (100x faster)     │  │
│  └─────────────────────┬─────────────────────┘  │
│                        │                        │
│                instant │ access                 │
│                        ▼                        │
│  ┌───────────────────────────────────────────┐  │
│  │            SINGLE COMPUTE CORE            │  │
│  │    Deterministic, statically scheduled    │  │
│  │             ~100% utilization             │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘
                         ▲
                         │   Connected to 100s of other LPUs
                         ▼   to form one distributed system
No external memory fetch. No wasted cycles. ~100% compute utilization.
|
|
The Technical Tradeoffs
|
                        Nvidia GPU                   Groq LPU
------------------------------------------------------------------------------
Memory Type             External HBM (80GB+)         On-chip SRAM (230MB)
Memory Bandwidth        ~3.3 TB/s (HBM3)             ~80 TB/s (~25x)
Inference Utilization   30-40%                       ~100%
Scheduling              Dynamic (hardware decides)   Static (compiler decides)
Energy per Bit          ~6 picojoules                ~0.3 picojoules (20x)
Training Capable        Yes                          No
Chips for Llama 70B     2-4 GPUs                     Hundreds of LPUs
|
The Catch
230MB of SRAM can't hold a large model. A single H100 has 80GB of memory; a single Groq chip has 0.23GB. To run Llama 70B, Groq needs hundreds of chips networked together across multiple racks. The "chip" is effectively the entire server room.
This means higher upfront hardware costs and larger physical footprint. But the cost per token generated can be significantly lower because of near-perfect utilization and energy efficiency.
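Here is roughly where the "hundreds of chips" figure comes from. The sketch assumes 8-bit weights, which is an illustrative choice; 16-bit weights double every number, and a real deployment also needs headroom for activations, KV cache, and redundancy.

# How many 230MB LPUs does it take just to hold a 70B-parameter model on-chip?
# The precision is an assumption; real deployments need extra chips beyond this.
params          = 70e9     # Llama 70B parameter count
bytes_per_param = 1        # assumed 8-bit weights (use 2 for FP16)
sram_per_chip   = 230e6    # on-chip SRAM per Groq LPU, in bytes

weight_bytes = params * bytes_per_param
print(f"weights alone: {weight_bytes / 1e9:.0f} GB")
print(f"LPUs needed just to hold them: ~{weight_bytes / sram_per_chip:.0f}")
# Compare with the low single-digit GPU count in the table above: 80GB of HBM
# holds all the weights at once, while each LPU holds only a thin slice of them.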
|
|
Why Determinism Matters
Here's the underrated part of Groq's architecture: everything is deterministic.
GPUs use dynamic scheduling—the hardware decides at runtime when to execute operations, when to fetch memory, how to handle thread conflicts. This introduces unpredictable latency. Sometimes your query is fast; sometimes it isn't.
Groq's compiler statically schedules every single operation before the model runs. Every memory load, every computation, every data transfer is predetermined down to the clock cycle. The result: predictable, consistent performance with no jitter.
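The toy sketch below is not Groq's compiler; it only illustrates the contrast. A static schedule assigns every operation a start cycle before anything runs, so total latency is a compile-time constant, while a dynamic executor accumulates whatever stalls the hardware happens to hit at runtime.

# Toy contrast between static and dynamic scheduling (illustrative only).
import random

OPS = [("load_weights", 4), ("matmul", 10), ("activation", 2), ("store", 3)]  # (op, cycles)

def compile_static_schedule(ops):
    # Assign every op a fixed start cycle before execution begins.
    schedule, cycle = [], 0
    for name, cycles in ops:
        schedule.append((cycle, name))
        cycle += cycles
    return schedule, cycle            # total latency is known at compile time

def run_dynamic(ops):
    # Hardware decides at runtime; arbitration and cache misses add jitter.
    cycle = 0
    for _name, cycles in ops:
        cycle += cycles + random.randint(0, 5)
    return cycle

schedule, static_latency = compile_static_schedule(OPS)
print("static schedule:", schedule, "-> always", static_latency, "cycles")
print("dynamic runs:   ", [run_dynamic(OPS) for _ in range(5)], "cycles (varies)")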
|
Why This Matters for Agentic AI
Agents that can perform thousands of reasoning steps in seconds need deterministic, low-latency inference. If each "thought" has variable latency, compound delays make complex reasoning impractical. Groq's architecture enables the kind of rapid, iterative inference that agentic systems require.
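A small simulation shows how this compounds. Every latency number below is invented for illustration; the point is only that an agent chaining hundreds of calls pays the sum, so both the average speed and the jitter of each call show up in the total.

# Toy simulation: an agent chaining N inference calls pays the sum of their
# latencies, so per-call speed and per-call jitter both compound.
import random

def agent_latency_ms(steps, base_ms, jitter_ms):
    return sum(base_ms + random.uniform(0, jitter_ms) for _ in range(steps))

random.seed(0)
steps = 200   # reasoning steps in a single agent run
jittery = sorted(agent_latency_ms(steps, base_ms=50, jitter_ms=100) for _ in range(1000))
steady  = sorted(agent_latency_ms(steps, base_ms=20, jitter_ms=0) for _ in range(1000))

print(f"variable-latency backend: p50 {jittery[500]/1000:.1f}s  p99 {jittery[990]/1000:.1f}s")
print(f"deterministic backend:    p50 {steady[500]/1000:.1f}s   p99 {steady[990]/1000:.1f}s")
# The deterministic backend is not just faster on average; its p99 equals its
# p50, so a 200-step chain finishes in a predictable amount of time.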
|
|
Why Nvidia Made This Deal
Nvidia faces several constraints that Groq's technology helps address:
|
|
01 // HBM SUPPLY
HBM for 2026 is already sold out. SK Hynix, Samsung, and Micron have limited capacity. Groq's chips don't use HBM at all—they sidestep the bottleneck entirely.
|
|
|
02 // ENERGY
Data centers are power-constrained. Groq is ~10x more energy efficient per token for inference, and its chips are air-cooled (no complex liquid-cooling infrastructure needed).
|
|
|
03 // OLDER NODES
Groq's current chips are built on a 14nm process (vs. Nvidia's cutting-edge 4nm). Because they don't need external memory, they perform well on older, cheaper fabs—diversifying manufacturing options.
|
|
|
04 // COMPETITIVE MOAT
Jonathan Ross created Google's TPU. By bringing him in-house, Nvidia acquires the mind behind two of its biggest competitive threats and keeps anyone else from hiring him.
|
|
|
The Bigger Picture: 2025 Is the Inference Tipping Point
This deal reflects a broader industry shift. The global inference market is projected to hit $255 billion by 2030, growing at nearly 20% annually. Training infrastructure buildout is plateauing as open-source models (Llama, DeepSeek, Qwen) commoditize the training side.
The value is moving from building AI models to running them at scale. And running them requires different hardware than building them.
Nvidia buying Groq isn't a defensive move—it's a bet that the future of AI compute is heterogeneous: GPUs for training and batch inference, specialized chips like LPUs for real-time, latency-sensitive workloads.
|
What This Means for Practitioners
If you're fine-tuning models: Inference cost is your real budget concern. A fine-tune might cost $10 in GPU time; running it in production could cost thousands per month. Optimize for inference efficiency, not just training metrics (see the rough cost sketch below).
If you're building agents: Latency matters more than throughput. Architectures like Groq's enable the rapid iteration loops that sophisticated agents require.
If you're choosing infrastructure: The one-size-fits-all GPU approach is giving way to specialized chips for specialized tasks. Expect more heterogeneous compute environments in production AI systems.
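To make the first point concrete, here is a rough serving-cost sketch; every figure in it (request volume, tokens per request, blended price per million tokens) is a placeholder assumption to swap for your own numbers.

# Rough monthly serving bill for a fine-tuned model. All figures are placeholder
# assumptions; plug in your own traffic and pricing.
requests_per_day        = 50_000
tokens_per_request      = 1_500     # prompt + completion
dollars_per_million_tok = 0.60      # assumed blended inference price

monthly_tokens = requests_per_day * tokens_per_request * 30
monthly_cost   = monthly_tokens / 1e6 * dollars_per_million_tok

print(f"~{monthly_tokens / 1e9:.1f}B tokens/month -> ~${monthly_cost:,.0f} per month, recurring")
# Next to a recurring bill like this, a one-time fine-tuning cost is noise;
# reducing tokens per request and price per token is where the savings live.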
|
|
|
The Nvidia-Groq deal tells us where the industry is heading: specialized silicon for specialized workloads. GPUs aren't going anywhere—they're still the best tool for training and many inference use cases. But for real-time, latency-critical applications, architectures built from the ground up for inference will increasingly find their place.
Nvidia just paid $20 billion to own both ends of that spectrum.
|
|
|
|
Sources & Further Reading
Groq Architecture Documentation • SemiAnalysis: Groq Inference Tokenomics • CNBC: Nvidia Groq Deal Coverage • Stanford AI Index Report 2025 • Groq ISCA Papers (2020, 2022) • McKinsey: State of AI 2025
|
|
Deep · AI Research Newsletter
|