|
Deep Dive · AI Infrastructure
Why Nvidia Just Paid $20B for a Chip That Can't Train Models
The Groq LPU, the memory wall problem, and why 2025 is the year inference ate training.
|
|
On December 24th, Nvidia quietly made its largest deal ever—a $20 billion "licensing agreement" with Groq that looks suspiciously like an acquisition. 90% of Groq's staff is moving to Nvidia, including founder Jonathan Ross (the same engineer who created Google's TPU).
Here's the strange part: Groq's chips can't train AI models. They're purpose-built for one thing only—inference. So why would the company that dominates AI training pay $20 billion for technology that doesn't compete with its core business?
The answer reveals a fundamental shift in where AI value is created—and why the economics of running models now matter more than building them.
|
First, the Basics
Training vs. Inference: Two Different Games
Training is like going to medical school—an intensive, expensive, one-time investment where the model learns patterns from massive datasets. GPT-4's training reportedly cost $100M+. DeepSeek V3 did it for $5.6M. Either way, you pay once.
Inference is the doctor actually seeing patients—every single day, millions of times. Every ChatGPT query, every Claude response, every Gemini answer is an inference call. And every call costs money.
|
|
|
THE MATH THAT CHANGES EVERYTHING
|
$100M
GPT-4 training cost (one-time)
|
$2.3B
GPT-4 projected inference cost (per year, recurring)
|
Inference accounts for 80-90% of total AI compute costs over a model's lifetime. Training is a CapEx hit; inference is OpEx that never stops.
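A quick back-of-the-envelope calculation makes the asymmetry concrete. The sketch below simply multiplies the figures quoted above; the three-year production lifetime is an assumption for illustration, not a reported number.

# Back-of-the-envelope: one-time training spend vs. recurring inference spend.
# Uses the illustrative figures quoted above; the lifetime is an assumption.
training_cost = 100e6               # ~$100M one-time training cost (reported)
inference_cost_per_year = 2.3e9     # ~$2.3B/year projected inference cost
lifetime_years = 3                  # assumed years the model stays in production

lifetime_inference = inference_cost_per_year * lifetime_years
total = training_cost + lifetime_inference

print(f"training share of lifetime spend:  {training_cost / total:.1%}")
print(f"inference share of lifetime spend: {lifetime_inference / total:.1%}")
# With these inputs inference is the overwhelming majority of the bill, which is
# why a cheaper cost per token matters more than a cheaper training run.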
|
|
The Memory Wall Problem
Here's why GPUs—despite being incredible for training—aren't ideal for inference:
During training, you're processing massive batches of data in parallel. GPUs excel at this—thousands of cores churning through terabytes simultaneously. The overhead of fetching data from external memory (HBM) is hidden by the sheer volume of parallel operations.
But inference is different. You're often serving a single user asking a single question. The GPU's cores sit idle, waiting for data to travel from external memory. This is the memory wall—and it's why your ChatGPT responses sometimes feel sluggish despite running on the world's most powerful chips.
|
|
GPU ARCHITECTURE (NVIDIA H100)
┌─────────────────────────────────────────────────┐
│                   COMPUTE DIE                   │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐        │
│  │Core │ │Core │ │Core │ │Core │ │ ... │        │
│  │  1  │ │  2  │ │  3  │ │  4  │ │16896│        │
│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘        │
│     └───────┴───────┼───────┴───────┘           │
│                     │                           │
│              ◄─────►│◄─────►  ~3.3 TB/s         │
│             waiting │ fetching                  │
└─────────────────────┼───────────────────────────┘
                      │
                      ▼  ← Data travels OFF-CHIP
┌─────────────────────────────────────────────────┐
│           EXTERNAL HBM MEMORY (80GB+)           │
│    Latency: ~100s of nanoseconds per access     │
└─────────────────────────────────────────────────┘
GPU cores often run at 30-40% utilization during inference—waiting on memory
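You can sanity-check the memory wall with a crude roofline-style estimate: compare the time a decode step spends streaming weights out of HBM against the time the arithmetic itself needs. The Python sketch below assumes a 70B-parameter model with 8-bit weights, ~3.35 TB/s of HBM bandwidth, and a round 1 PFLOP/s of usable compute; all of these are ballpark assumptions for illustration, not measured numbers.

# Crude roofline sketch: is a decode step limited by memory traffic or by math?
# Ballpark assumptions only; real serving stacks overlap and cache aggressively.
hbm_bandwidth = 3.35e12     # bytes/s streamed from HBM3 (H100 class)
peak_compute  = 1.0e15      # assumed usable FLOP/s, order of magnitude
params        = 70e9        # assumed model size: 70B parameters
weight_bytes  = params * 1  # assumed 8-bit weights

def decode_rate(batch):
    # Every decode step reads all weights once and spends ~2 FLOPs per
    # parameter per sequence in the batch.
    memory_time  = weight_bytes / hbm_bandwidth
    compute_time = 2 * params * batch / peak_compute
    return batch / max(memory_time, compute_time), memory_time > compute_time

for batch in (1, 8, 64, 512):
    tokens_per_sec, memory_bound = decode_rate(batch)
    label = "memory-bound" if memory_bound else "compute-bound"
    print(f"batch {batch:3d}: ~{tokens_per_sec:8,.0f} tok/s  ({label})")
# At batch 1 the math finishes in a fraction of the time it takes to stream the
# weights in, so the cores idle; only large batches keep them busy.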
|
|
Groq's Radical Bet: What If We Eliminated External Memory?
Groq's LPU (Language Processing Unit) takes a completely different approach. Instead of using external HBM memory like GPUs, Groq uses on-chip SRAM as primary weight storage.
|
|
LPU ARCHITECTURE (GROQ)
┌─────────────────────────────────────────────────┐
│                   SINGLE CHIP                   │
│  ┌───────────────────────────────────────────┐  │
│  │           ON-CHIP SRAM (230MB)            │  │
│  │   Bandwidth: ~80 TB/s (~25x faster)       │  │
│  │   Latency: ~nanoseconds (100x faster)     │  │
│  └─────────────────────┬─────────────────────┘  │
│                        │                        │
│                instant │ access                 │
│                        ▼                        │
│  ┌───────────────────────────────────────────┐  │
│  │            SINGLE COMPUTE CORE            │  │
│  │    Deterministic, statically scheduled    │  │
│  │             ~100% utilization             │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘
                         ▲
                         │   Connected to 100s of other LPUs
                         ▼   to form one distributed system
No external memory fetch. No wasted cycles. ~100% compute utilization.
|
|
The Technical Tradeoffs
|
                        Nvidia GPU                   Groq LPU
------------------------------------------------------------------------------
Memory Type             External HBM (80GB+)         On-chip SRAM (230MB)
Memory Bandwidth        ~3.3 TB/s (HBM3)             ~80 TB/s (~25x)
Inference Utilization   30-40%                       ~100%
Scheduling              Dynamic (hardware decides)   Static (compiler decides)
Energy per Bit          ~6 picojoules                ~0.3 picojoules (20x)
Training Capable        Yes                          No
Chips for Llama 70B     2-4 GPUs                     Hundreds of LPUs
|
The Catch
230MB of SRAM can't hold a large model. A single H100 has 80GB of memory; a single Groq chip has 0.23GB. To run Llama 70B, Groq needs hundreds of chips networked together across multiple racks. The "chip" is effectively the entire server room.
This means higher upfront hardware costs and larger physical footprint. But the cost per token generated can be significantly lower because of near-perfect utilization and energy efficiency.
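Here is roughly where the "hundreds of chips" figure comes from. The sketch assumes 8-bit weights, which is an illustrative choice; 16-bit weights double every number, and a real deployment also needs headroom for activations, KV cache, and redundancy.

# How many 230MB LPUs does it take just to hold a 70B-parameter model on-chip?
# The precision is an assumption; real deployments need extra chips beyond this.
params          = 70e9     # Llama 70B parameter count
bytes_per_param = 1        # assumed 8-bit weights (use 2 for FP16)
sram_per_chip   = 230e6    # on-chip SRAM per Groq LPU, in bytes

weight_bytes = params * bytes_per_param
print(f"weights alone: {weight_bytes / 1e9:.0f} GB")
print(f"LPUs needed just to hold them: ~{weight_bytes / sram_per_chip:.0f}")
# Compare with the low single-digit GPU count in the table above: 80GB of HBM
# holds all the weights at once, while each LPU holds only a thin slice of them.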
|
|
Why Determinism Matters
Here's the underrated part of Groq's architecture: everything is deterministic.
GPUs use dynamic scheduling—the hardware decides at runtime when to execute operations, when to fetch memory, how to handle thread conflicts. This introduces unpredictable latency. Sometimes your query is fast; sometimes it isn't.
Groq's compiler statically schedules every single operation before the model runs. Every memory load, every computation, every data transfer is predetermined down to the clock cycle. The result: predictable, consistent performance with no jitter.
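The toy sketch below is not Groq's compiler; it only illustrates the contrast. A static schedule assigns every operation a start cycle before anything runs, so total latency is a compile-time constant, while a dynamic executor accumulates whatever stalls the hardware happens to hit at runtime.

# Toy contrast between static and dynamic scheduling (illustrative only).
import random

OPS = [("load_weights", 4), ("matmul", 10), ("activation", 2), ("store", 3)]  # (op, cycles)

def compile_static_schedule(ops):
    # Assign every op a fixed start cycle before execution begins.
    schedule, cycle = [], 0
    for name, cycles in ops:
        schedule.append((cycle, name))
        cycle += cycles
    return schedule, cycle            # total latency is known at compile time

def run_dynamic(ops):
    # Hardware decides at runtime; arbitration and cache misses add jitter.
    cycle = 0
    for _name, cycles in ops:
        cycle += cycles + random.randint(0, 5)
    return cycle

schedule, static_latency = compile_static_schedule(OPS)
print("static schedule:", schedule, "-> always", static_latency, "cycles")
print("dynamic runs:   ", [run_dynamic(OPS) for _ in range(5)], "cycles (varies)")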
|
Why This Matters for Agentic AI
Agents that can perform thousands of reasoning steps in seconds need deterministic, low-latency inference. If each "thought" has variable latency, compound delays make complex reasoning impractical. Groq's architecture enables the kind of rapid, iterative inference that agentic systems require.
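A small simulation shows how this compounds. Every latency number below is invented for illustration; the point is only that an agent chaining hundreds of calls pays the sum, so both the average speed and the jitter of each call show up in the total.

# Toy simulation: an agent chaining N inference calls pays the sum of their
# latencies, so per-call speed and per-call jitter both compound.
import random

def agent_latency_ms(steps, base_ms, jitter_ms):
    return sum(base_ms + random.uniform(0, jitter_ms) for _ in range(steps))

random.seed(0)
steps = 200   # reasoning steps in a single agent run
jittery = sorted(agent_latency_ms(steps, base_ms=50, jitter_ms=100) for _ in range(1000))
steady  = sorted(agent_latency_ms(steps, base_ms=20, jitter_ms=0) for _ in range(1000))

print(f"variable-latency backend: p50 {jittery[500]/1000:.1f}s  p99 {jittery[990]/1000:.1f}s")
print(f"deterministic backend:    p50 {steady[500]/1000:.1f}s   p99 {steady[990]/1000:.1f}s")
# The deterministic backend is not just faster on average; its p99 equals its
# p50, so a 200-step chain finishes in a predictable amount of time.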
|
|
Why Nvidia Made This Deal
Nvidia faces several constraints that Groq's technology helps address:
|
|
01 // HBM SUPPLY
HBM for 2026 is already sold out. SK Hynix, Samsung, and Micron have limited capacity. Groq's chips don't use HBM at all—they sidestep the bottleneck entirely.
|
|
|
02 // ENERGY
Data centers are power-constrained. Groq is ~10x more energy efficient per token for inference, and its chips are air-cooled (no complex liquid-cooling infrastructure needed).
|
|
|
03 // OLDER NODES
Groq's current chips are built on a 14nm process (vs. Nvidia's cutting-edge 4nm). Because they don't need external memory, they perform well on older, cheaper fabs—diversifying manufacturing options.
|
|
|
04 // COMPETITIVE MOAT
Jonathan Ross created Google's TPU. By bringing him in-house, Nvidia acquires the mind behind two of its biggest competitive threats and keeps anyone else from hiring him.
|
|
|
The Bigger Picture: 2025 Is the Inference Tipping Point
This deal reflects a broader industry shift. The global inference market is projected to hit $255 billion by 2030, growing at nearly 20% annually. Training infrastructure buildout is plateauing as open-source models (Llama, DeepSeek, Qwen) commoditize the training side.
The value is moving from building AI models to running them at scale. And running them requires different hardware than building them.
Nvidia buying Groq isn't a defensive move—it's a bet that the future of AI compute is heterogeneous: GPUs for training and batch inference, specialized chips like LPUs for real-time, latency-sensitive workloads.
|
What This Means for Practitioners
If you're fine-tuning models: Inference cost is your real budget concern. A fine-tune might cost $10 in GPU time; running it in production could cost thousands per month. Optimize for inference efficiency, not just training metrics (see the rough cost sketch below).
If you're building agents: Latency matters more than throughput. Architectures like Groq's enable the rapid iteration loops that sophisticated agents require.
If you're choosing infrastructure: The one-size-fits-all GPU approach is giving way to specialized chips for specialized tasks. Expect more heterogeneous compute environments in production AI systems.
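To make the first point concrete, here is a rough serving-cost sketch; every figure in it (request volume, tokens per request, blended price per million tokens) is a placeholder assumption to swap for your own numbers.

# Rough monthly serving bill for a fine-tuned model. All figures are placeholder
# assumptions; plug in your own traffic and pricing.
requests_per_day        = 50_000
tokens_per_request      = 1_500     # prompt + completion
dollars_per_million_tok = 0.60      # assumed blended inference price

monthly_tokens = requests_per_day * tokens_per_request * 30
monthly_cost   = monthly_tokens / 1e6 * dollars_per_million_tok

print(f"~{monthly_tokens / 1e9:.1f}B tokens/month -> ~${monthly_cost:,.0f} per month, recurring")
# Next to a recurring bill like this, a one-time fine-tuning cost is noise;
# reducing tokens per request and price per token is where the savings live.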
|
|
|
The Nvidia-Groq deal tells us where the industry is heading: specialized silicon for specialized workloads. GPUs aren't going anywhere—they're still the best tool for training and many inference use cases. But for real-time, latency-critical applications, architectures built from the ground up for inference will increasingly find their place.
Nvidia just paid $20 billion to own both ends of that spectrum.
|
|
|
|
Sources & Further Reading
Groq Architecture Documentation • SemiAnalysis: Groq Inference Tokenomics • CNBC: Nvidia Groq Deal Coverage • Stanford AI Index Report 2025 • Groq ISCA Papers (2020, 2022) • McKinsey: State of AI 2025
|
|
Deep · AI Research Newsletter
|