
The Hard Part: Defects on a Chip the Size of a Dinner Plate
Semiconductor manufacturing always produces defects. In conventional chip production, a defective die is simply discarded. If your "chip" is the entire wafer, you cannot discard it: every production run will have a defect somewhere on the wafer. This is why the industry dismissed wafer-scale chips as impractical for decades.
Cerebras solved this with a two-part approach. First, each compute core on the WSE-3 occupies approximately 0.05 mm², versus roughly 6 mm² for a core in an H100. A defect that disables a WSE core therefore costs about 0.05 mm² of silicon; the same defect in an H100 costs around 6 mm², a difference of more than 100x in the silicon area lost per defect.
Second, Cerebras built a fault-tolerant routing fabric that dynamically reconfigures connections when a core is disabled. The system detects a dead core and routes around it automatically. The end result: the WSE-3 ships with 900,000 of its 970,000 physical cores active, achieving 93% silicon utilization, which Cerebras reports as higher than leading GPUs.
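The core-size figures above translate into a simple back-of-the-envelope yield calculation. The sketch below uses the approximate core areas from the text; the number of defects per wafer is a hypothetical assumption chosen purely for illustration.

```python
# Illustrative yield math for wafer-scale redundancy.
# Core areas are taken from the article; the defect count per wafer
# is a hypothetical assumption for illustration only.

wse_core_area_mm2 = 0.05   # approx. WSE-3 core area (per the article)
gpu_core_area_mm2 = 6.0    # approx. H100 core area (per the article)

defects_per_wafer = 500    # hypothetical random point defects

# Silicon lost if each defect disables exactly one core:
wse_loss = defects_per_wafer * wse_core_area_mm2   # 25 mm^2
gpu_loss = defects_per_wafer * gpu_core_area_mm2   # 3000 mm^2

print(f"WSE-style loss: {wse_loss} mm^2")
print(f"GPU-style loss: {gpu_loss} mm^2")
print(f"Ratio: {gpu_loss / wse_loss:.0f}x")
```

Small cores shrink the blast radius of each defect; the routing fabric then turns that shrinkage into usable yield by mapping out the dead cores.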
Why Inference Is 70x Faster: No Off-Chip Trips
During LLM inference on a GPU, the bottleneck is almost never the matrix multiplication itself. It is the memory fetch that precedes it. Each new token requires reading a large portion of the model's weight matrices from HBM. Even with HBM3e, that bandwidth limits how quickly tokens can be generated.
On the WSE-3, model weights that fit on-chip never leave the chip. They live in the distributed SRAM spread across the 900,000 cores. When a token needs a weight, the compute core reads it from local memory at a latency of roughly one clock cycle. There is no HBM fetch, no interconnect hop to another chip, no synchronization across a PCIe bus. The 44 GB of on-chip SRAM can hold roughly 20 billion parameters at 16-bit precision entirely on the wafer.
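The memory-bound nature of decoding can be sketched with a roofline-style estimate: each generated token must stream the active weights from memory once, so peak token rate is roughly bandwidth divided by bytes read per token. The bandwidth figures below are the ones cited in this article; treating decoding as exactly one full weight read per token is a simplifying assumption that ignores KV-cache traffic and batching.

```python
# Rough upper bound on decode speed for a memory-bound LLM:
#   tokens/sec <= memory_bandwidth / bytes_read_per_token
# Simplifying assumption: every weight is read once per token.

def max_tokens_per_sec(params_billion, bytes_per_param, bandwidth_bytes_per_sec):
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_bytes_per_sec / bytes_per_token

hbm3e = 3.35e12    # H100 HBM bandwidth, ~3.35 TB/s (per the article)
wse_sram = 21e15   # WSE-3 aggregate SRAM bandwidth, 21 PB/s (per the article)

# A hypothetical 20B-parameter model at 16-bit weights:
print(f"GPU bound: {max_tokens_per_sec(20, 2, hbm3e):.0f} tok/s")
print(f"WSE bound: {max_tokens_per_sec(20, 2, wse_sram):.0f} tok/s")
```

The estimate is crude, but it shows why the bandwidth gap, not raw FLOPS, dominates single-stream generation speed.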
For larger models, Cerebras uses a system called MemoryX, external DRAM that acts as a weight storage layer separate from the wafer. The wafer streams weights from MemoryX for forward and backward passes. For models that span multiple wafers, Cerebras uses pipeline parallelism: each wafer handles a subset of layers, and the generation flows sequentially through them. Because every wafer stays fully occupied at all times, token generation speed remains constant regardless of how many wafers the model spans.
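The pipeline-parallel scheme described above amounts to assigning contiguous ranges of layers to wafers. The sketch below is a generic illustration of that partitioning, not Cerebras's actual scheduler; the function name and layer counts are assumptions for the example.

```python
# Sketch of layer-wise pipeline partitioning across wafers.
# Each wafer gets a contiguous block of layers; tokens flow
# sequentially from wafer to wafer during generation.

def partition_layers(n_layers, n_wafers):
    """Split n_layers into n_wafers contiguous, near-equal ranges."""
    base, extra = divmod(n_layers, n_wafers)
    assignments, start = [], 0
    for w in range(n_wafers):
        size = base + (1 if w < extra else 0)
        assignments.append(list(range(start, start + size)))
        start += size
    return assignments

# A hypothetical 80-layer model spread across 4 wafers:
print(partition_layers(80, 4))  # 4 contiguous blocks of 20 layers each
```

Because each wafer always has a full block of layers resident in SRAM, adding wafers extends capacity without changing per-token latency through any one stage.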
Key Insight: The WSE-3 delivers 21 petabytes per second of memory bandwidth from its on-chip SRAM. A single H100 delivers roughly 3.35 terabytes per second from HBM. That is a bandwidth difference of approximately 6,000x, concentrated in the exact part of the pipeline that limits token generation speed.
The Market Signal: OpenAI Signs a $10 Billion Deal
In January 2026, Cerebras signed a contract with OpenAI to deliver 750 megawatts of computing capacity through 2028, a deal reported at over $10 billion. This is not a research partnership. It is a production commitment from one of the most compute-intensive organizations in the world. For a company that the semiconductor industry once dismissed as impractical, it is a notable inflection point.
The WSE-3 was also recognized by TIME Magazine as a Best Invention of 2024. Cerebras launched its inference cloud service in August 2024 and has since expanded to six datacenters across the United States and Europe.
Key Insight: Cerebras reported that the CS-3 requires 97% less code than GPU-based systems to run large language models, because there is no distributed computing complexity to manage. For teams building on the API, that simplification has real engineering value beyond the raw speed gains.
What Cerebras Still Has to Solve
The WSE approach is not without trade-offs. The 44 GB of on-chip SRAM, while extremely fast, is small compared to the multi-terabyte HBM configurations available in large GPU clusters. A 70-billion-parameter model at 16-bit precision requires roughly 140 GB for weights alone, so the WSE-3 must rely on MemoryX or pipeline parallelism for those workloads, reintroducing some of the off-chip data movement that the design was built to eliminate.
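The capacity constraint is simple arithmetic: weight memory is parameter count times bytes per parameter. A minimal sketch, with the 44 GB SRAM figure from the article as the comparison point:

```python
# Weight memory footprint: parameters * bytes per parameter.
# (1e9 params * bytes) / 1e9 bytes-per-GB simplifies to params_billion * bytes.

def weight_footprint_gb(params_billion, bytes_per_param):
    return params_billion * bytes_per_param

SRAM_GB = 44  # WSE-3 on-chip SRAM (per the article)

print(weight_footprint_gb(70, 2))   # 140 GB at 16-bit: exceeds 44 GB SRAM
print(weight_footprint_gb(20, 2))   # 40 GB at 16-bit: fits on one wafer
```

Anything over the SRAM budget forces a fallback to MemoryX streaming or multi-wafer pipelining.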
Manufacturing cost also remains a challenge. A single wafer that fails late in the production process represents a much larger financial loss than a defective GPU die. The economics of wafer-scale silicon improve as yield improves, but they require a different risk model than conventional chip production.
The Bigger Question
The GPU was designed for rendering triangles. It became the foundation of modern AI by accident of architecture, because its SIMD parallelism happened to fit matrix multiplication well. Cerebras built a chip designed from first principles for what AI inference actually needs: massive memory bandwidth, minimal data movement, and cores that never wait. The WSE-3 is an early answer to a question that the industry is now taking seriously. Whether the wafer-scale approach wins long-term depends on whether 44 GB of fast memory can keep pace with models that keep growing. That is the tension worth watching.
ResearchAudio.io · Technical AI research, explained clearly. Sources: Cerebras WSE-3 Product Page · arxiv.org/abs/2503.11698 · Wikipedia: Cerebras


