
One Chip. 900,000 Cores. How Cerebras Did It.

ResearchAudio.io  ·  AI Hardware Deep Dive


The physics reason GPUs hit a wall, and why a single silicon wafer changes token speed by up to 70x.

Every time a GPU generates a token, it performs a ritual that most engineers never think about: it reaches outside itself, into slow external memory, fetches billions of model weights across a narrow data bus, and then does the actual math. That round trip is not a quirk. It is the fundamental bottleneck of modern AI inference. And it is the exact problem Cerebras was built to eliminate.

The Cerebras Wafer Scale Engine 3 (WSE-3) runs inference at speeds 10 to 70 times faster than Nvidia H100-based systems for many large language model workloads. That claim sounds like a marketing number until you understand the architecture. Once you understand it, the number makes complete sense.

4T transistors  ·  900K AI cores  ·  21 PB/s memory bandwidth  ·  70x faster inference

The Problem: Why GPUs Were Never Designed for This

GPUs were designed for graphics. They are outstanding at parallel matrix math. But they were not designed for a workload where you need to load hundreds of billions of parameters from memory for every single token generated. That is what LLM inference requires, and it exposes a structural weakness: the gap between how fast a chip can compute and how fast it can move data from off-chip memory. Computer architects call this the memory wall.

An Nvidia H100 has 80 gigabytes of HBM (high bandwidth memory) sitting next to the chip, connected by a relatively thin bus. Before the GPU can compute a token, it must move the relevant weights across that bus. The compute units sit idle during that transfer. This is the bottleneck that no amount of raw FLOPs can fix without rethinking the architecture.
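The memory wall can be made concrete with back-of-envelope arithmetic: when inference is memory-bound, the token rate is capped by how fast weights stream from memory, regardless of FLOPs. The sketch below uses the bandwidth figures quoted in this article; it is an idealized upper bound, not a vendor benchmark.

```python
# Back-of-envelope: for memory-bound LLM inference, tokens/sec is capped by
# (memory bandwidth) / (bytes of weights read per token), not by FLOPs.

def max_tokens_per_sec(n_params: float, bytes_per_param: float,
                       mem_bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on tokens/sec if every token must read all weights once."""
    bytes_per_token = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / bytes_per_token

# A 70B-parameter model at 16-bit precision (2 bytes per parameter)
h100_hbm = 3.35e12   # ~3.35 TB/s HBM bandwidth of a single H100
wse3_sram = 21e15    # ~21 PB/s aggregate on-chip SRAM bandwidth (WSE-3)

gpu_rate = max_tokens_per_sec(70e9, 2, h100_hbm)     # ~24 tokens/s
wafer_rate = max_tokens_per_sec(70e9, 2, wse3_sram)  # ~150,000 tokens/s

print(f"GPU bandwidth bound:   {gpu_rate:,.0f} tokens/s")
print(f"Wafer bandwidth bound: {wafer_rate:,.0f} tokens/s")
```

Real systems batch requests and reuse weights across a batch, so measured numbers differ, but the ratio between the two bounds is what the architecture argument rests on.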

Architecture Comparison

NVIDIA H100 (GPU cluster approach): several discrete dies (~80B transistors each, with 80 GB of HBM apiece) linked by high-speed NVLink interconnect. For every token generated, weights must cross chip-to-chip interconnects, and each hop adds latency and synchronization overhead.

Cerebras WSE-3 (wafer approach): 900,000 cores on a single wafer, each paired with 48 KB of local SRAM. All weights stay on-chip, there are no off-chip memory fetches during inference, and cores communicate in ~1 ns over the on-wafer mesh.

Source: Cerebras Systems / arxiv.org/abs/2503.11698

The Insight: Stop Cutting the Wafer

When a chip foundry like TSMC fabricates a 300mm silicon wafer, the standard process is to cut it into dozens of small dies (chips). Those dies are then packaged individually and later connected together in a server. Every connection between chips is a potential bottleneck. Cerebras asked a simple question: what if you just did not cut it?

The WSE-3 uses the entire 300mm wafer as one chip. It measures 46,225 mm², compared to 814 mm² for an H100. That is roughly 57 times the area. On that wafer, Cerebras placed 900,000 compute cores, each with its own 48 KB of local SRAM memory sitting directly beside it. The cores connect to each other through an on-wafer 2D mesh fabric that operates at roughly 1 nanosecond per hop.

Key Insight: The total on-chip SRAM in the WSE-3 is 44 GB, with an aggregate memory bandwidth of 21 petabytes per second. That bandwidth figure is not a typo. It is roughly 6,000 times greater than the HBM bandwidth of a single H100. The result: compute cores almost never wait for data.
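Two of these headline numbers can be sanity-checked directly from figures quoted in this article (per-core SRAM and die areas):

```python
# Sanity-check the headline figures using numbers quoted in the article.
cores = 900_000
sram_per_core_bytes = 48 * 1024             # 48 KB of local SRAM per core
total_sram_gb = cores * sram_per_core_bytes / 1e9
# 900,000 x 48 KB = ~44.2 GB, matching the quoted ~44 GB total

wse3_area_mm2 = 46_225                      # full 300mm-wafer die
h100_area_mm2 = 814
area_ratio = wse3_area_mm2 / h100_area_mm2  # ~56.8, i.e. "roughly 57x"
```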

The Hard Part: Defects on a Chip the Size of a Dinner Plate

Chip manufacturing has defects. In normal chip production, a defective die is simply discarded. If your "chip" is the entire wafer, you cannot discard it. Every production run has a defect somewhere. This is why the industry said wafer-scale chips were impractical for decades.

Cerebras solved this with a two-part approach. First, each compute core on the WSE-3 is tiny: approximately 0.05 mm², versus roughly 6 mm² for a core in an H100. A defect that lands on a WSE core therefore disables about 0.05 mm² of silicon, while the same defect in an H100 takes out around 6 mm², more than a 100x difference in silicon area lost per defect.

Second, Cerebras built a fault-tolerant routing fabric that dynamically reconfigures connections when a core is disabled. The system detects a dead core and routes around it automatically. The end result: the WSE-3 ships with 900,000 of its 970,000 physical cores active, achieving 93% silicon utilization, which Cerebras reports as higher than leading GPUs.
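The spare-core logic above can be illustrated with a toy yield model. Because each core is tiny, a defect disables at most one core, and the 70,000 spare cores absorb the loss. The defect count below is a hypothetical illustrative value, not a Cerebras figure:

```python
# Toy yield model: defects land at random on the wafer. Each defect kills at
# most one tiny core; spare cores plus the self-routing fabric absorb the
# loss. Defect count is a hypothetical illustrative value.
import random

def surviving_cores(n_physical: int, n_defects: int, seed: int = 0) -> int:
    """Count cores still alive after random defects each disable one core."""
    rng = random.Random(seed)
    dead = {rng.randrange(n_physical) for _ in range(n_defects)}
    return n_physical - len(dead)

physical, shipped = 970_000, 900_000
defects = 2_000                   # hypothetical defects on one wafer
alive = surviving_cores(physical, defects)

# Silicon area lost to those defects at each core size:
lost_wse_mm2 = defects * 0.05     # ~100 mm^2 at WSE-3 core size
lost_gpu_mm2 = defects * 6.2      # ~12,400 mm^2 at GPU core size

assert alive >= shipped           # spares comfortably cover the loss
```

Even with thousands of defects, the wafer loses well under 1% of its cores, which is why shipping 900,000 of 970,000 is achievable in practice.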

                         H100-style GPU    Cerebras WSE-3
Core size                ~6.2 mm²          ~0.05 mm²
Silicon lost per defect  ~6.2 mm²          ~0.05 mm²
Silicon utilization      below 93%         93%

Why Inference Is 70x Faster: No Off-Chip Trips

During LLM inference on a GPU, the bottleneck is almost never the matrix multiplication itself. It is the memory fetch that precedes it. Each new token requires reading a large portion of the model's weight matrices from HBM. Even with HBM3e, that bandwidth limits how quickly tokens can be generated.

On the WSE-3, model weights that fit on-chip never leave the chip. They live in the distributed SRAM spread across the 900,000 cores. When a token needs a weight, the compute core reads it from local memory at a latency of roughly one clock cycle. There is no HBM fetch, no interconnect hop to another chip, no synchronization across a PCIe bus. The 44 GB of on-chip SRAM can hold the full weights of models up to roughly 20 billion parameters at 16-bit precision.
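The capacity limit follows directly from bytes per parameter. The sketch below is an idealized upper bound that ignores activation and KV-cache overhead:

```python
# How many model parameters fit entirely in 44 GB of on-chip SRAM,
# ignoring activation and KV-cache overhead (an idealized upper bound).

def params_that_fit(sram_bytes: float, bytes_per_param: int) -> float:
    """Maximum parameter count whose weights fit in the given SRAM."""
    return sram_bytes / bytes_per_param

SRAM = 44e9  # 44 GB of distributed on-chip SRAM

fp32_params = params_that_fit(SRAM, 4)  # ~11 billion at 32-bit
fp16_params = params_that_fit(SRAM, 2)  # ~22 billion at 16-bit
int8_params = params_that_fit(SRAM, 1)  # ~44 billion at 8-bit
```

This is why quantization matters so much on this architecture: halving the bytes per weight doubles the size of model that avoids any off-wafer traffic.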

For larger models, Cerebras uses a system called MemoryX, external DRAM that acts as a weight storage layer separate from the wafer. The wafer streams weights from MemoryX for forward and backward passes. For models that span multiple wafers, Cerebras uses pipeline parallelism: each wafer handles a subset of layers, and the generation flows sequentially through them. Because every wafer stays fully occupied at all times, token generation speed remains constant regardless of how many wafers the model spans.
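The pipeline-parallel idea described above can be sketched as a contiguous split of layers across wafers, with tokens flowing through the stages in order. The scheduling details here are simplified assumptions for illustration, not Cerebras's actual runtime:

```python
# Sketch of pipeline parallelism: assign each wafer a contiguous block of
# layers; generation flows through the stages in order. Simplified
# illustration, not the actual Cerebras runtime.

def split_layers(n_layers: int, n_wafers: int) -> list[range]:
    """Split n_layers into n_wafers contiguous, near-equal stages."""
    base, extra = divmod(n_layers, n_wafers)
    stages, start = [], 0
    for w in range(n_wafers):
        size = base + (1 if w < extra else 0)  # early stages take the remainder
        stages.append(range(start, start + size))
        start += size
    return stages

stages = split_layers(80, 3)  # e.g. an 80-layer model across 3 wafers
# stage sizes: 27, 27, 26 layers
```

Because every stage is busy with some token at every step once the pipeline fills, throughput stays roughly constant as the wafer count grows; only the first token pays the full pipeline latency.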

Key Insight: The WSE-3 delivers 21 petabytes per second of memory bandwidth from its on-chip SRAM. A single H100 delivers roughly 3.35 terabytes per second from HBM. That is a bandwidth difference of approximately 6,000x, concentrated in the exact part of the pipeline that limits token generation speed.

The Market Signal: OpenAI Signs a $10 Billion Deal

In January 2026, Cerebras signed a contract with OpenAI to deliver 750 megawatts of computing capacity through 2028, a deal reported at over $10 billion. This is not a research partnership. It is a production commitment from one of the most compute-intensive organizations in the world. For a company that the semiconductor industry once dismissed as impractical, it is a notable inflection point.

The WSE-3 was also recognized by TIME Magazine as a Best Invention of 2024. Cerebras launched its inference cloud service in August 2024 and has since expanded to six datacenters across the United States and Europe.

Key Insight: Cerebras reported that the CS-3 requires 97% less code than GPU-based systems to run large language models, because there is no distributed computing complexity to manage. For teams building on the API, that simplification has real engineering value beyond the raw speed gains.

What Cerebras Still Has to Solve

The WSE approach is not without trade-offs. The 44 GB of on-chip SRAM, while extremely fast, is small compared to the multi-terabyte HBM configurations available in large GPU clusters. Running a 70-billion-parameter model at 16-bit precision requires roughly 140 GB of weights, so the WSE-3 must rely on MemoryX or pipeline parallelism for those workloads, reintroducing some of the off-chip bandwidth constraints the design was built to eliminate.

Manufacturing cost also remains a challenge. A single wafer that fails late in the production process represents a much larger financial loss than a defective GPU die. The economics of wafer-scale silicon improve as yield improves, but they require a different risk model than conventional chip production.

The Bigger Question

The GPU was designed for rendering triangles. It became the foundation of modern AI by accident of architecture, because its SIMD parallelism happened to fit matrix multiplication well. Cerebras built a chip designed from first principles for what AI inference actually needs: massive memory bandwidth, minimal data movement, and cores that never wait. The WSE-3 is an early answer to a question that the industry is now taking seriously. Whether the wafer-scale approach wins long-term depends on whether 44 GB of fast memory can keep pace with models that keep growing. That is the tension worth watching.

ResearchAudio.io  ·  Technical AI research, explained clearly.

Sources: Cerebras WSE-3 Product Page  ·  arxiv.org/abs/2503.11698  ·  Wikipedia: Cerebras
