In partnership with Gladly

How much could AI save your support team?

Peak season is here. Most retail and ecommerce teams face the same problem: volume spikes, but headcount doesn't.

Instead of hiring temporary staff or burning out your team, there’s a smarter move. Let AI handle the predictable stuff, like answering FAQs, routing tickets, and processing returns, so your people focus on what they do best: building loyalty.

Gladly’s ROI calculator shows exactly what this looks like for your business: how many tickets AI could resolve, how much that costs, and what that means for your bottom line. Real numbers. Your data.

ResearchAudio Weekly

Rack-Scale AI Explained: How NVIDIA Rubin Treats 72 GPUs as One Computer

A visual breakdown of the six-chip architecture that delivers 10x lower inference costs by eliminating data movement bottlenecks

At CES 2026, NVIDIA did something unusual. Instead of announcing a faster GPU and calling it a day, they unveiled six entirely new chips—all designed together from scratch to work as a single system.

The platform is named Rubin, after Vera Florence Cooper Rubin—the astronomer whose observations proved the existence of dark matter. An apt name for technology that reveals the hidden bottlenecks limiting AI infrastructure.

The headline numbers are striking: 10x lower cost per token for inference, 4x fewer GPUs needed for training large models. But the real story isn't about faster chips. It's about what happens when you stop optimizing components in isolation and start designing them as one machine.

💡

Core Insight

Modern AI isn't bottlenecked by compute—it's bottlenecked by data movement. Moving a byte across a chip costs 200x less energy than fetching it from memory. When GPUs wait for data, they produce zero useful work. Rubin's architecture treats the rack as the unit of compute, minimizing data movement at every layer.
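
To put that in perspective, here is a rough back-of-the-envelope sketch. The per-byte energy values are illustrative assumptions (only the ~200x ratio comes from the insight above), and the model size is hypothetical:

# Back-of-the-envelope: energy spent reading one model's weights once.
# The absolute joules-per-byte values are illustrative assumptions;
# only the ~200x ratio comes from the core insight above.
ON_CHIP_PJ_PER_BYTE = 1.0          # assumed cost of an on-chip move (picojoules)
DRAM_FETCH_PJ_PER_BYTE = 200.0     # ~200x more for an off-chip memory fetch

model_bytes = 70e9 * 0.5           # hypothetical 70B-param model at ~4-bit weights

on_chip_j = model_bytes * ON_CHIP_PJ_PER_BYTE * 1e-12
dram_j = model_bytes * DRAM_FETCH_PJ_PER_BYTE * 1e-12

print(f"On-chip traffic:   {on_chip_j:.2f} J per full weight pass")
print(f"Off-chip fetches:  {dram_j:.2f} J per full weight pass")
# Every byte you avoid re-fetching is energy (and latency) you get back.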

The Problem NVIDIA Set Out to Solve

Here's a fact that surprises most people: in traditional multi-GPU systems, processors spend 30-50% of their time idle, waiting for data to arrive. You're paying for compute that isn't computing.

The culprit is architectural. Every time data crosses a boundary—GPU to CPU, CPU to network, one server to another—it hits a latency wall and bandwidth cliff. These boundaries were designed decades ago for different workloads.

Modern AI, especially mixture-of-experts (MoE) models and agentic reasoning, generates communication patterns that pummel these boundaries relentlessly. Tokens get routed to experts on different GPUs. Context windows span terabytes. Reasoning chains require constant synchronization. Traditional architectures simply weren't built for this.

The Six-Chip Solution

NVIDIA's answer was radical: design six chips simultaneously, each optimized for one specific job, all engineered to hand off data seamlessly. No bottlenecks. No waiting.

Data Flow: External → Internal

Scale-Out: Spectrum-6, 102.4 Tb/s rack-to-rack fabric

Endpoints: ConnectX-9 (800 Gb/s) and BlueField-4 (infrastructure OS)

Scale-Up: Vera CPU (88 cores) + 72× Rubin GPU over NVLink 6

Total internal bandwidth: 260 TB/s (more than the entire internet)

Notice the flow: data enters through Spectrum-6 (rack-to-rack), gets shaped by ConnectX-9 (traffic control), while BlueField-4 handles infrastructure overhead separately. The Vera CPU orchestrates data movement, and GPUs communicate through NVLink 6's all-to-all mesh. Each handoff is optimized. No chip becomes a chokepoint.

What Each Chip Does

Spectrum-6 Ethernet

Scale-Out Fabric

Connects racks using co-packaged optics—lasers built directly into the switch silicon. This eliminates pluggable transceivers, cutting signal loss from 22 dB to just 4 dB. Result: 5x better power efficiency and 10x reliability.
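
If you don't think in decibels, a quick sketch makes that improvement concrete. The conversion is standard (power ratio = 10^(dB/10)); the 22 dB and 4 dB figures are the ones quoted above:

# Convert optical link loss from dB to a linear power ratio: ratio = 10^(dB/10)
def db_to_ratio(db: float) -> float:
    return 10 ** (db / 10)

pluggable_loss_db = 22.0   # pluggable transceivers (figure quoted above)
cpo_loss_db = 4.0          # co-packaged optics (figure quoted above)

print(f"22 dB loss -> signal attenuated ~{db_to_ratio(pluggable_loss_db):.0f}x")
print(f" 4 dB loss -> signal attenuated ~{db_to_ratio(cpo_loss_db):.1f}x")
# Less loss to overcome means weaker (cheaper, cooler) lasers and fewer link
# retrains, which is where the power-efficiency and reliability gains come from.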

ConnectX-9 SuperNIC

Network Endpoint

Shapes and schedules traffic before it hits the network—critical for AI's bursty, synchronized communication patterns. Prevents congestion from forming rather than reacting after queues build up.

BlueField-4 DPU

Infrastructure OS

A 64-core processor dedicated to running the "operating system" of the data center—security, storage, telemetry, orchestration. By handling this off the main compute path, GPUs never pause for infrastructure overhead.

Vera CPU

Data Orchestrator

Not your typical host CPU. NVIDIA designed 88 custom "Olympus" cores specifically for high-bandwidth data movement. Shares memory coherently with GPUs at 1.8 TB/s—no copying, no waiting.
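
For a rough sense of what 1.8 TB/s of coherent bandwidth buys, here is a minimal sketch. The PCIe Gen5 x16 figure (~64 GB/s) is an assumed baseline for comparison, not something NVIDIA quotes here:

# How long to hand 1 TB of weights or KV-cache between CPU and GPU memory?
# 1.8 TB/s is the coherent Vera<->Rubin link described above; the PCIe Gen5 x16
# figure (~64 GB/s) is an assumed conventional baseline, used only for contrast.
payload_tb = 1.0

nvlink_c2c_tbps = 1.8     # coherent link: direct access, no staging copy
pcie5_x16_tbps = 0.064    # assumed host link where data must be copied across

print(f"Coherent link:   {payload_tb / nvlink_c2c_tbps:.2f} s")
print(f"PCIe-style copy: {payload_tb / pcie5_x16_tbps:.1f} s")
# Roughly half a second vs. a quarter of a minute, and the coherent path also
# skips the extra copy between memory spaces entirely.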

NVLink 6 Switch

GPU Interconnect

Creates an all-to-all mesh where every GPU can talk to every other GPU with identical bandwidth and latency. No routing hierarchy means no hotspots during MoE expert dispatch. Built-in compute accelerates collective operations inside the switch itself.

Rubin GPU

Compute Engine

336 billion transistors with a third-gen Transformer Engine. But the real upgrade is memory: 288 GB of HBM4 running at 22 TB/s—nearly 3x Blackwell's bandwidth. When inference is memory-bound, this is what moves the needle.
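
Here is a hedged sketch of why that bandwidth figure dominates inference. Low-batch decoding re-reads the active weights for every generated token, so HBM bandwidth sets a hard ceiling on tokens per second; the model size and precision below are assumptions for illustration:

# Rough upper bound on single-stream decode speed when generation is memory-bound:
# every new token re-reads the active weights from HBM. Model size is hypothetical.
hbm_bandwidth_tbps = 22.0   # per-GPU HBM4 bandwidth quoted above (TB/s)
active_params = 70e9        # hypothetical model (or the active experts of an MoE)
bytes_per_param = 0.5       # ~4-bit weights, an assumption

bytes_per_token = active_params * bytes_per_param
tokens_per_s = hbm_bandwidth_tbps * 1e12 / bytes_per_token
print(f"Memory-bandwidth ceiling: ~{tokens_per_s:.0f} tokens/s per GPU (batch 1)")
# Compute is rarely the limit in this regime; feeding weights from memory is.
# That is why a ~3x bandwidth jump over Blackwell moves the needle for inference.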

Why This Matters for MoE Models

Mixture-of-Experts models are eating the AI world. GPT-4, Mixtral, and most frontier models use this architecture because it delivers more capability per FLOP—you only activate the "experts" needed for each token.

The catch? Expert routing creates all-to-all communication patterns. Every GPU might need to send tokens to every other GPU, simultaneously, unpredictably. Traditional GPU interconnects with hierarchical topologies choke on this traffic.

NVLink 6 solves this by creating a flat mesh. Every GPU-to-GPU path has the same 3.6 TB/s bandwidth. No routing through intermediate nodes. No bandwidth asymmetry. The network simply doesn't care which expert a token needs to reach.
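
A small simulation makes the traffic pattern visible. The GPU count matches the rack; the token count, expert placement, and top-k routing are illustrative assumptions rather than details of any specific model:

# Sketch of why MoE routing produces all-to-all traffic. Token counts, expert
# placement, and top-k are illustrative assumptions, not Rubin or model specifics.
import random
random.seed(0)

NUM_GPUS = 72          # assume one expert group hosted per GPU, for simplicity
TOKENS_PER_GPU = 4096  # tokens resident on each GPU this step
TOP_K = 2              # experts activated per token

# traffic[src][dst] = tokens GPU `src` must ship to the expert hosted on `dst`
traffic = [[0] * NUM_GPUS for _ in range(NUM_GPUS)]
for src in range(NUM_GPUS):
    for _ in range(TOKENS_PER_GPU):
        for dst in random.sample(range(NUM_GPUS), TOP_K):
            traffic[src][dst] += 1

busiest = max(max(row) for row in traffic)
quietest = min(min(row) for row in traffic)
print(f"Every GPU sends to every other GPU; per-pair load {quietest}-{busiest} tokens")
# On a hierarchical interconnect some of these pairs cross slower links and become
# stragglers; on a flat mesh every pair sees the same bandwidth, so none do.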

The Architectural Shift

Here's the transformation in concrete terms:

❌ Before: Traditional Multi-GPU

GPU → PCIe (slow) → CPU → Network (slower) → Other Nodes

Problem: Each boundary = latency cliff + bandwidth drop. Data must be copied between memory spaces.
Result: GPUs idle 30-50% of the time, waiting for data. You pay for silicon that isn't working.

✓ After: Rubin Rack-Scale

SINGLE COHERENT DOMAIN: GPU = GPU = ... = GPU

72 GPUs connected via NVLink 6 @ 3.6 TB/s each, with the Vera CPU sharing coherent memory @ 1.8 TB/s

Result: The entire rack behaves as one accelerator with unified memory. No copies. No boundaries.
Outcome: GPUs stay productive >90% of the time. You get what you pay for.
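
To translate utilization into money, here is a minimal sketch. The hourly rate is a placeholder; the busy-time figures are rough midpoints of the before/after claims above:

# What idle time does to your effective GPU price. The hourly rate is a made-up
# placeholder; the utilization figures are rough midpoints of the claims above.
gpu_hourly_cost = 10.0   # hypothetical $/GPU-hour, substitute your own number

for label, utilization in [("Traditional multi-GPU (50-70% busy)", 0.60),
                           ("Rack-scale, single coherent domain (>90% busy)", 0.90)]:
    effective = gpu_hourly_cost / utilization
    print(f"{label}: ${effective:.2f} per useful GPU-hour")
# Same silicon, same list price, yet roughly a third of the effective cost
# disappears just by keeping the GPUs fed.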

⚠️ Common Misconception

"More FLOPS = faster AI." This was true a decade ago. Today, GPUs are so fast that they spend most of their time starving for data. Rubin's 260 TB/s internal bandwidth matters more than its 3.6 ExaFLOPS peak—it keeps GPUs fed instead of waiting.

The Numbers That Matter

When NVIDIA says "10x better," they're measuring what actually affects your bill:

10× lower cost per token vs. Blackwell inference

4× fewer GPUs to train MoE models vs. Blackwell

NVL72 Rack Specifications

NVFP4 Inference: 3.6 ExaFLOPS
NVFP4 Training: 2.5 ExaFLOPS
HBM4 Capacity (total): 20.7 TB
HBM4 Bandwidth (total): 1.6 PB/s
Internal Scale-Up Bandwidth: 260 TB/s
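
As a quick sanity check, those rack totals divide cleanly back into the per-GPU figures quoted in the chip descriptions above (no assumptions here beyond the 72-GPU count):

# Divide the rack totals by 72 GPUs to recover the per-GPU figures.
RACK = {
    "NVFP4 inference (FLOPS)": 3.6e18,
    "HBM4 capacity (bytes)":   20.7e12,
    "HBM4 bandwidth (B/s)":    1.6e15,
}
for name, total in RACK.items():
    print(f"{name}: {total / 72:.3g} per GPU")
# -> ~5e16 FLOPS (50 PFLOPS), ~2.9e11 bytes (~288 GB), ~2.2e13 B/s (~22 TB/s):
# the 288 GB and 22 TB/s match the Rubin GPU description earlier in this issue.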

What This Means for AI Development

Inference economics shift dramatically. At 10x lower cost per token, applications that were economically impossible become viable. Longer context windows. More reasoning steps. Richer agentic workflows. The constraint moves from "can we afford to run this" to "what should we build."

MoE becomes the default architecture. With 4x fewer GPUs needed for training, the cost barrier that made dense models attractive disappears. Expect more sparse, expert-based models from every major lab.

NVIDIA's annual cadence is real. Rubin ships H2 2026—one year after Blackwell Ultra. Each generation isn't incremental; it's architectural. If you're planning infrastructure for 2027+, you need to account for this pace.

Who's Already Committed

The announcement came with commitments from nearly every organization pushing the frontier:

AI Labs: OpenAI, Anthropic, Meta, xAI, Mistral, Cohere, Perplexity, Harvey, Cursor

Hyperscalers: AWS, Google Cloud, Microsoft Azure, Oracle Cloud

AI-Native Cloud: CoreWeave, Lambda, Nebius, Nscale

Microsoft's next-generation "Fairwater" AI superfactories will scale to hundreds of thousands of Vera Rubin Superchips. That's not a typo. The scale of infrastructure investment happening is difficult to overstate.

"Intelligence scales with compute. The NVIDIA Rubin platform helps us keep scaling this progress so advanced intelligence benefits everyone."

— Sam Altman, CEO of OpenAI

One Thing to Remember

Rubin's breakthrough isn't faster chips—it's eliminating the boundaries between them. When 72 GPUs behave as one accelerator, you stop paying the data movement tax that dominates modern AI workloads.

Availability: In production now. Partner systems ship H2 2026.

📬

Enjoying ResearchAudio?

If this breakdown helped you understand rack-scale AI architecture, please share it with a colleague. We spend hours distilling complex research into clear explanations—every referral helps us keep doing this work.

Forward this email or share your unique referral link →

Sources

NVIDIA Technical Blog: "Inside the NVIDIA Rubin Platform" · NVIDIA Newsroom CES 2026 Announcement · NVIDIA Developer Documentation

ResearchAudio

AI research explained clearly, every week.

