Understanding GPU Architecture: Technical Deep Dive from Fundamentals to Modern Design

A comprehensive guide to how graphics processors work, from basic principles to advanced optimization techniques

A single NVIDIA H100 GPU can perform 1,979,000,000,000,000 operations per second. Understanding how this level of computational throughput is achieved requires examining one of the most sophisticated pieces of computer engineering ever created. This guide covers GPU architecture from basic principles through advanced implementation details, showing exactly how all components work together.

Part 1: Essential Terminology

Before diving into architecture, let's establish the fundamental terms you'll encounter:

FLOP (Floating Point Operation): A single mathematical operation on floating point numbers, such as addition, subtraction, multiplication, or division. Example: 3.14 * 2.71 = 8.5094 is one FLOP.

FLOPS (Floating Point Operations Per Second): A measure of computational throughput. 1 TFLOPS = 1 trillion FLOPS = 1,000,000,000,000 operations per second.

FMA (Fused Multiply-Add): A single instruction that performs a multiplication and an addition in one step: c = a * b + c. Counted as 2 FLOPs by convention, though it executes as one instruction.

DRAM (Dynamic Random Access Memory): The fundamental memory technology that stores data in capacitors. Requires periodic refresh to maintain data. Used in system RAM and GPU memory.

SRAM (Static Random Access Memory): Faster but more expensive memory technology that stores data in flip-flops. Does not require refresh. Used for caches and on-chip memory. Typically 10x faster than DRAM but 100x more expensive per bit.

VRAM (Video Random Access Memory): General term for memory dedicated to graphics cards. Modern GPUs use specialized DRAM types.

GDDR (Graphics DDR): Evolution of DDR memory optimized for graphics workloads. GDDR6X, used in consumer GPUs like RTX 4090, runs at 21 Gbps per pin with a 384-bit bus for 1,008 GB/s total bandwidth.

HBM (High Bandwidth Memory): Memory stacked vertically using through-silicon vias, placed directly adjacent to the GPU die. HBM3 in the H100 achieves 3 TB/s bandwidth across a 5,120-bit interface.

Bandwidth: Amount of data transferred per unit time, measured in GB/s (gigabytes per second) or TB/s (terabytes per second). Critical for GPU performance.

Latency: Time delay between requesting data and receiving it, measured in clock cycles or nanoseconds. GPUs hide latency through massive parallelism.

Throughput: Amount of work completed per unit time. For GPUs, measured in FLOPS or operations per second.

Precision Formats:

- FP32 (Float32): 32-bit floating point, standard precision

- FP16 (Float16): 16-bit floating point, half precision

- BF16 (BFloat16): 16-bit with 8-bit exponent (same range as FP32), popular for AI

- FP8: 8-bit floating point, used for inference and some training

- INT8: 8-bit integer, used for quantized neural networks

PCIe (Peripheral Component Interconnect Express): Interface connecting GPU to CPU. PCIe 4.0 x16 provides approximately 32 GB/s bandwidth in each direction.

NVLink: NVIDIA's high-speed interconnect for GPU-to-GPU communication. NVLink 4.0 on H100 provides 900 GB/s bidirectional bandwidth.

Part 2: The Parallel Processing Philosophy

Consider a task requiring computation on 2 million independent data points. Two approaches exist:

Approach 1: Use a few powerful processors working sequentially. Each processor is highly capable, handling complex logic and control flow efficiently. However, even at high clock speeds, processing 2 million items sequentially takes considerable time.

Approach 2: Use thousands of simpler processors working in parallel. Each processor is less sophisticated individually, but collective throughput is substantially higher for parallel workloads.

This distinction defines the CPU versus GPU architecture difference. CPUs typically contain 8 to 32 high-performance cores optimized for sequential processing and complex control flow. GPUs contain thousands of simpler cores (the H100 has 16,896 CUDA cores) optimized for parallel execution of identical operations on different data.

For rendering a 3840x2160 display at 60 frames per second, you must compute values for 497,664,000 pixels every second. For training large neural networks, you must multiply matrices containing billions of elements thousands of times per second. CPUs excel at branching logic and sequential tasks. GPUs excel at massively parallel computations.

Why This Architecture Emerged

GPU architecture originated from a specific problem domain: real-time graphics rendering. When displaying a 1920x1080 image, the system must compute color values for 2,073,600 pixels. At 60 frames per second, this requires over 124 million pixel calculations per second. Critically, each pixel's computation is largely independent of others.

This workload is embarrassingly parallel, meaning it divides naturally into many independent tasks. Traditional CPU architectures with few powerful cores cannot efficiently handle this volume. GPU architectures with thousands of simpler cores excel at exactly this problem pattern.

Part 3: Complete GPU Architecture Overview

Streaming Multiprocessors: The Basic Unit

The fundamental building block of NVIDIA GPU architecture is the Streaming Multiprocessor (SM). AMD uses the term Compute Unit (CU) for similar structures. The H100 contains 132 SMs organized into 8 Graphics Processing Clusters (GPCs), with each GPC containing multiple SMs.

Each SM contains several types of processing units:

CUDA Cores (16,896 total on H100)

CUDA cores are the basic arithmetic logic units. Each core can execute one floating-point operation (addition, multiplication, or fused multiply-add) per clock cycle. Unlike CPU cores, CUDA cores lack complex features like branch prediction, speculative execution, and out-of-order processing. This simplification allows higher core density but limits flexibility.

The H100 runs at approximately 1.8 GHz boost clock. With 16,896 CUDA cores, each doing 2 operations per cycle (FMA), theoretical FP32 throughput is: 16,896 * 1.8 GHz * 2 = 60.8 TFLOPS.

Tensor Cores (528 total on H100)

Tensor Cores are specialized matrix multiplication units. A single 4th-generation Tensor Core instruction can multiply a 64x16 matrix by a 16x256 matrix in one operation. Work that would require hundreds of cycles on CUDA cores completes in a handful of cycles on Tensor Cores. This specialization is why modern GPUs dominate machine learning workloads, where matrix multiplication is the dominant operation.

Peak Tensor Core throughput on H100:

- FP8: 1,979 TFLOPS (nearly 2 quadrillion operations per second)

- BF16: 989.5 TFLOPS

- TF32: 494.7 TFLOPS (still far higher than the roughly 67 TFLOPS of CUDA core FP32 throughput)
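
For a feel of how Tensor Cores are driven from code, here is a minimal sketch using CUDA's portable wmma API, which exposes 16x16x16 tile multiplies. The larger wgmma instructions described later in this guide are normally reached through PTX or libraries such as CUTLASS and cuBLAS rather than written by hand; the kernel name and pointers below are illustrative.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile pair and accumulates in FP32 on the
// Tensor Cores. Launch with a single warp: wmma_tile<<<1, 32>>>(dA, dB, dC);
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);            // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // c += a * b on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```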

Special Function Units (SFUs)

SFUs handle transcendental mathematical functions: sine, cosine, exponential, logarithm, square root, and reciprocal square root. These operations are computationally expensive, so dedicated hardware improves performance.

Load/Store Units

Dedicated hardware for memory access operations. Handle loading data from memory into registers and storing results back to memory.

Warp Schedulers (4 per SM)

Each SM contains 4 warp schedulers that manage thread execution. A warp is a group of 32 threads that execute the same instruction simultaneously on different data. This is the SIMT (Single Instruction, Multiple Thread) execution model. All 32 threads in a warp must execute the same instruction at the same time.

A critical architectural detail: each SM can track up to 2,048 threads concurrently (64 warps), but only 128 threads (4 warps of 32 threads each) execute instructions simultaneously at any given clock cycle. The remaining threads are resident but waiting for resources, previous operations to complete, or memory accesses. The scheduler continuously switches between warps, hiding latency by switching to ready warps when the current warp stalls.

Complete Memory Hierarchy

Memory in GPU architecture is organized hierarchically, trading capacity for speed at each level. Understanding this hierarchy is essential because memory bandwidth typically limits performance more than computational throughput.

Level 1: Register File (Fastest, Smallest)

- Technology: SRAM

- Size per SM: 256 KB (65,536 registers of 32 bits each)

- Latency: 1 clock cycle

- Bandwidth: Extremely high (multiple TB/s effective)

- Scope: Private to each thread

Each thread has private register storage. With 2,048 maximum threads per SM, this averages to 32 registers per thread, though actual allocation varies by kernel. Registers hold temporary values, loop counters, and intermediate computation results.

Level 2: Shared Memory and L1 Cache (Very Fast, Small)

- Technology: SRAM

- Size per SM: 228 KB (configurable split between shared memory and L1)

- Latency: 20 to 30 clock cycles

- Bandwidth: 128 bytes per cycle (approximately 230 GB/s per SM)

- Scope: Shared across all threads in a thread block

Shared memory is explicitly managed by the programmer and enables cooperation between threads in the same thread block. L1 cache is hardware-managed. They share the same physical SRAM storage.

Shared memory is organized into 32 banks, each 4 bytes wide. If multiple threads in a warp simultaneously access different addresses within the same bank, a bank conflict occurs. Bank conflicts serialize the accesses, reducing effective bandwidth proportionally to the conflict degree.
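
To make bank conflicts concrete, here is a classic sketch (a hypothetical 32x32 float transpose tile). Without the extra padding column, the transposed read would hit the same bank 32 times per warp; padding each row by one element is a simple alternative to the swizzling discussed in Part 6.

```cuda
// Shared-memory transpose of a 32x32 float tile, launched as a 32x32 block.
// With 32 banks of 4 bytes, an unpadded tile[32][32] puts every element of a
// column in the same bank, so the transposed read would be a 32-way conflict.
// The +1 padding column shifts each row by one bank and removes the conflict.
__global__ void transpose_tile(const float* in, float* out) {
    __shared__ float tile[32][33];            // 33, not 32: one padding column

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 32 + x];              // row-major load, conflict-free
    __syncthreads();
    out[y * 32 + x] = tile[x][y];             // transposed read: conflict-free
                                              // only because of the padding
}
```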

Level 3: L2 Cache (Fast, Medium)

- Technology: SRAM

- Size: 50 MB on the H100 SXM5 (60 MB on the full GH100 die), shared across the GPU

- Latency: Approximately 200 clock cycles

- Bandwidth: Shared among all SMs

- Scope: Global across entire GPU

L2 cache is physically partitioned into two slices, with half the SMs connected directly to each partition through a crossbar interconnect. The L2 cache is the last line of defense before accessing main GPU memory (VRAM).

Level 4: HBM3 Memory / VRAM (Moderate Speed, Large)

- Technology: DRAM (specifically HBM3)

- Size: 80 GB on H100

- Latency: 300 to 600 clock cycles

- Bandwidth: 3 TB/s (3,000 GB/s)

- Scope: Global across entire GPU

This is the main GPU memory, often called VRAM. HBM (High Bandwidth Memory) is stacked DRAM placed directly adjacent to the GPU die using advanced packaging. HBM3 in the H100 runs at roughly 5 Gbps per pin, a lower per-pin rate than DDR5, but across a 5,120-bit wide interface (640 bytes per transfer).

For comparison, DDR5 system memory operates at 6,400 MT/s but typically uses only a 64-bit interface (8 bytes per transfer), resulting in approximately 51.2 GB/s bandwidth per channel. Even with dual-channel, that is only 102.4 GB/s compared to HBM3's 3,000 GB/s.

Level 5: System Memory (Slowest)

- Technology: DRAM (DDR4/DDR5)

- Size: System dependent (typically 64-512 GB)

- Latency: Very high (thousands of cycles including PCIe transfer)

- Bandwidth: PCIe 4.0 x16 provides approximately 32 GB/s

- Scope: Accessed via PCIe bus

GPUs can access CPU system memory across the PCIe interface, but this is dramatically slower. Used primarily for initial data transfer before computation begins or when GPU memory is exhausted.

Memory Bandwidth: The Limiting Factor

The H100 can theoretically perform 1,979 trillion FP8 operations per second, but only if data is supplied fast enough. This relationship is captured by arithmetic intensity: the ratio of operations performed to bytes transferred from main memory.

The ridge point occurs where computational capability equals memory bandwidth limitation. For the H100:

Peak compute: 1,979 TFLOPS = 1,979,000 GFLOPS

Memory bandwidth: 3 TB/s = 3,000 GB/s

Ridge point: 1,979,000 GFLOPS / 3,000 GB/s = 660 operations per byte

Only when a kernel performs more than 660 operations per byte transferred from HBM does it become compute-bound rather than memory-bound. Most naive algorithms have much lower arithmetic intensity, spending most execution time waiting for memory rather than computing.
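
A small host-side sketch of this roofline arithmetic, using the H100 figures quoted above as assumed constants and a naive element-wise add as the example kernel:

```cpp
#include <cstdio>

// Roofline-style check: a kernel is memory-bound if its arithmetic intensity
// falls below the ridge point (peak compute divided by memory bandwidth).
int main() {
    const double peak_flops = 1979e12;   // FP8 Tensor Core peak, FLOP/s
    const double mem_bw     = 3e12;      // HBM3 bandwidth, bytes/s
    const double ridge      = peak_flops / mem_bw;   // ~660 FLOP/byte

    // Naive elementwise add c = a + b on FP32: 1 FLOP per 12 bytes moved
    // (read a, read b, write c).
    const double intensity  = 1.0 / 12.0;
    const double attainable = (intensity < ridge) ? intensity * mem_bw
                                                  : peak_flops;

    std::printf("ridge point: %.0f FLOP/byte\n", ridge);
    std::printf("attainable:  %.3f TFLOPS (%s-bound)\n",
                attainable / 1e12, intensity < ridge ? "memory" : "compute");
    return 0;
}
```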

Part 4: Thread Execution Model

The CUDA Thread Hierarchy

CUDA organizes parallel work into a hierarchy:

Thread: The smallest execution unit. Each thread executes the kernel code with its own program counter and registers. Each thread has a unique identifier within its block.

Warp: A group of 32 threads that execute in lockstep. All threads in a warp execute the same instruction on different data (SIMT model). The warp is the fundamental scheduling unit.

Thread Block (or Cooperative Thread Array): A group of threads (typically 128 to 1024) that execute on the same SM and can cooperate through shared memory and synchronization primitives. Threads within a block can be organized as 1D, 2D, or 3D arrays for programming convenience.

Thread Block Cluster: Introduced in Hopper architecture, clusters group multiple thread blocks (typically 2 to 8) across different SMs that can share data through distributed shared memory.

Grid: The complete set of thread blocks launched by a single kernel invocation. A grid may contain thousands of thread blocks distributed across all SMs. Grids can also be organized as 1D, 2D, or 3D arrays.

Each thread knows its position in this hierarchy through built-in variables:

- threadIdx.x/y/z: Thread position within its block

- blockIdx.x/y/z: Block position within the grid

- blockDim.x/y/z: Size of each block

- gridDim.x/y/z: Size of the grid
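
Putting these variables together, a minimal (hypothetical) kernel computes its global index and processes one element per thread:

```cuda
// Each thread computes one element of c = a + b using the built-in index
// variables above (1D grid of 1D blocks).
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the final partial block
        c[i] = a[i] + b[i];
}

// Launch: enough 256-thread blocks to cover n elements.
// vector_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
```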

Divergence and Control Flow

Because all 32 threads in a warp must execute the same instruction simultaneously, conditional branches create a problem. If some threads in a warp take one branch while others take a different branch, the hardware must execute both paths sequentially.

Example: If 16 threads in a warp evaluate a condition as true and 16 as false:

1. All 32 threads execute the true branch, but only 16 are active (the other 16 are masked and idle)

2. Then all 32 threads execute the false branch, but only 16 are active (the other 16 are masked)

3. Effective parallelism has dropped by 2x, and execution time has doubled

This is called warp divergence. Performance-critical GPU code minimizes branches or ensures that all threads in a warp take the same branch.
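
A small illustration of the difference, with hypothetical kernels: in the first version, lanes of the same warp can take different paths; in the second, the data-dependent choice is expressed as a per-lane select that the compiler typically turns into predicated code rather than a branch.

```cuda
// Divergent version: lanes of one warp may take different paths depending on
// the data, so the warp executes both branches serially with lanes masked.
__global__ void scale_divergent(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f) y[i] = x[i] * 2.0f;   // some lanes here...
    else             y[i] = x[i] * 0.5f;   // ...others here: two serialized passes
}

// Branch-free version: both factors are available and one is selected per lane,
// so every lane follows the same instruction stream.
__global__ void scale_uniform(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float scale = (x[i] > 0.0f) ? 2.0f : 0.5f;  // usually compiled to a select
    y[i] = x[i] * scale;
}
```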

Occupancy and Latency Hiding

GPUs hide memory latency by maintaining many more threads than can execute simultaneously. When one warp stalls waiting for memory, the scheduler immediately switches to another warp that is ready to execute.

This requires high occupancy: the ratio of active warps to the maximum number of warps the SM can support. An SM can support up to 64 warps (2,048 threads). If a kernel launches thread blocks with only 128 threads each, that is 4 warps per block. The SM can then host 16 thread blocks simultaneously (64 warps / 4 warps per block), maximizing opportunities for latency hiding.

However, occupancy is limited by three resources:

1. Registers: Each thread needs registers, and the register file is finite (65,536 per SM)

2. Shared memory: Each block needs shared memory, and it is finite (228 KB per SM)

3. Thread/block limits: Hardware limits exist (2,048 threads per SM, 32 blocks per SM)

The actual occupancy is determined by whichever resource is most constrained.
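
The CUDA runtime can report how these limits play out for a specific kernel. A minimal sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor (error handling omitted, kernel body elided):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { /* ... */ }

// Ask the runtime how many blocks of my_kernel fit on one SM, given the
// kernel's actual register and shared-memory usage, then derive occupancy.
int main() {
    int max_blocks = 0, block_size = 128;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks, my_kernel, block_size, /*dynamicSMemSize=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int active_warps = max_blocks * block_size / 32;
    int max_warps    = prop.maxThreadsPerMultiProcessor / 32;  // 64 on H100
    std::printf("occupancy: %d/%d warps (%.0f%%)\n",
                active_warps, max_warps, 100.0 * active_warps / max_warps);
    return 0;
}
```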

Part 5: How It All Works Together - Matrix Multiplication Example

Let's trace a complete matrix multiplication operation through the entire GPU system, showing exactly how every component participates. This example multiplies two 4096x4096 matrices using BF16 precision on an H100 GPU.

Step 1: Initialization and Data Transfer

CPU side:

1. Application allocates matrices A (4096x4096), B (4096x4096), and C (4096x4096) in system RAM

2. Application calls cudaMalloc to allocate GPU memory:

- Matrix A: 4096 * 4096 * 2 bytes (BF16) = 33.5 MB

- Matrix B: 33.5 MB

- Matrix C: 4096 * 4096 * 4 bytes (FP32 output) = 67 MB

- Total: 134 MB of HBM3 allocated

PCIe transfer:

3. cudaMemcpy transfers matrices A and B from system RAM to GPU HBM3 via PCIe 4.0 x16

- Data size: 67 MB

- PCIe bandwidth: 32 GB/s

- Transfer time: approximately 2.1 milliseconds

Kernel launch:

4. Application launches GEMM kernel with configuration:

- Grid: 32x16 blocks (for 4096x4096 output with 128x256 tiles)

- Block size: 384 threads (3 warp-groups: 1 producer + 2 consumers)

- Total thread blocks: 512

- Total threads: 196,608
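
A simplified host-side sketch of Step 1, assuming a hypothetical gemm_kernel with the launch configuration above (in practice this kernel would come from cuBLAS or CUTLASS rather than be hand-written):

```cuda
#include <cuda_runtime.h>
#include <cuda_bf16.h>

int main() {
    const int N = 4096;
    size_t ab_bytes = size_t(N) * N * sizeof(__nv_bfloat16);  // 33.5 MB each
    size_t c_bytes  = size_t(N) * N * sizeof(float);          // 67 MB (FP32 output)

    __nv_bfloat16 *dA, *dB; float *dC;
    cudaMalloc(&dA, ab_bytes);                 // allocate in HBM3
    cudaMalloc(&dB, ab_bytes);
    cudaMalloc(&dC, c_bytes);

    // hA and hB prepared in system RAM (omitted); copy over PCIe (~2.1 ms for 67 MB)
    // cudaMemcpy(dA, hA, ab_bytes, cudaMemcpyHostToDevice);
    // cudaMemcpy(dB, hB, ab_bytes, cudaMemcpyHostToDevice);

    dim3 grid(N / 128, N / 256);   // 32 x 16 = 512 blocks, one per 128x256 output tile
    dim3 block(384);               // 3 warp-groups: 1 producer + 2 consumers
    // gemm_kernel<<<grid, block>>>(dA, dB, dC, N);   // hypothetical launch

    cudaDeviceSynchronize();
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```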

Step 2: Thread Block Distribution Across SMs

Hardware scheduler:

1. H100 has 132 SMs available

2. Scheduler distributes 512 thread blocks across 132 SMs

3. Each SM can run 2 blocks concurrently (resource limited by registers and shared memory)

4. First wave: 132 SMs * 2 blocks = 264 blocks active

5. Second wave: 248 remaining blocks distributed

6. Total waves: 2 (with second wave partially filled)

Per-SM setup:

Each SM running 2 blocks (768 threads total):

- Allocates registers for 768 threads, roughly 85 registers per thread on average = about 65,000 registers (just under the 65,536-register file limit per SM)

- Allocates shared memory: approximately 114 KB per block (for tile buffers) * 2 = close to the 228 KB limit

- Creates 24 warps (768 threads / 32 threads per warp)

Step 3: Initial Memory Load Phase

Each thread block computes one output tile (128x256 for our configuration):

Producer warp-group (threads 0-127):

1. Thread 0 prepares TMA (Tensor Memory Accelerator) descriptor:

- Source: Matrix A in HBM3 (address computed from block position)

- Destination: Shared memory buffer slot 0

- Dimensions: 128x16 tile (first K-slice)

- Swizzle mode: 128B pattern for bank conflict avoidance

2. Thread 0 executes: cp_async_bulk_tensor_2d_global_to_shared

- This single instruction triggers the TMA engine

- TMA operates independently of thread execution

3. TMA hardware operates:

- Reads memory descriptor to understand tile shape and stride

- Computes HBM3 addresses for the 128x16 tile (2,048 BF16 elements = 4 KB)

- Issues memory requests through L2 cache

- L2 checks for cached data (cold start: cache miss)

- L2 forwards requests to HBM3 memory controllers

HBM3 memory system:

4. HBM3 memory controllers receive requests:

- HBM3 is organized as a stack of DRAM dies

- Each DRAM die has arrays of memory cells organized in rows and columns

- Memory controller translates addresses to (stack, bank, row, column)

- Activates DRAM rows (opens the row, reading into row buffer)

- Reads columns from row buffer

- Row activation latency: ~20-30 ns

- Column access: ~2-3 ns per access

5. Data flows back through memory hierarchy:

- HBM3 -> L2 cache (data cached for potential reuse by other blocks)

- L2 -> Shared memory in target SM

- TMA applies 128B swizzle pattern while writing to shared memory

- Total latency: approximately 300-400 cycles (~200 ns at 1.8 GHz)

6. Simultaneously, thread 0 issues a similar TMA operation for the matrix B tile (16x256 BF16 elements = 8 KB)

Memory barrier synchronization:

7. All 384 threads in the block execute barrier.arrive()

8. TMA completion updates the barrier byte counter (signals when all 12 KB have been transferred)

9. When barrier conditions met (all threads arrived + all bytes transferred):

- Barrier flips to signaled state

- All waiting threads become eligible for scheduling

Step 4: Computation Phase Using Tensor Cores

Consumer warp-groups activate (threads 128-383, two groups of 128):

1. Warp scheduler selects ready warp:

- Checks all 24 warps resident on the SM

- Identifies warps past the barrier (consumer warps now ready)

- Issues instruction to one of the 4 warp schedulers

2. Tensor Core preparation:

- Consumer group 1 (threads 128-255) prepares for upper 64x256 output tile

- Consumer group 2 (threads 256-383) prepares for lower 64x256 output tile

- Each thread in a consumer group holds 128 FP32 accumulator registers (64 * 256 results / 128 threads)

- Accumulators initialized to zero (one-time setup)

3. Tensor Core instruction execution:

- Instruction: wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16

- This means: 64x16 @ 16x256 matrix multiplication, BF16 inputs, FP32 accumulator

- All 128 threads in the warp-group participate

4. Data flow for Tensor Core operation:

- Matrix A tile data (128x16) loaded from shared memory

* Shared memory address computed with swizzle pattern

* Bank conflicts avoided due to TMA swizzling

* Each of 32 banks serves data in parallel: 128 bytes per cycle

* Data flows to Tensor Core input registers

- Matrix B tile data (16x256) similarly loaded from shared memory

- Both loads take approximately 20-30 cycles

5. Tensor Core computation:

- Tensor Core hardware contains specialized matrix multiplication units

- For m64n256k16: computes 64*256*16*2 = 524,288 operations

- The instruction is issued once and the 524K operations execute asynchronously on the Tensor Cores

- Latency to completion: ~20-30 cycles

- During these cycles, scheduler can issue other instructions

6. Result accumulation:

- Results from Tensor Core flow to accumulator registers

- Each thread receives its portion of the 64x256 result tile

- Thread i receives elements at specific (row, col) positions

- Results added to existing accumulator values (for K-dimension accumulation)

Step 5: Pipelined Execution Loop

While consumer warp-groups compute, producer continues loading:

1. Producer thread 0 issues TMA for next K-slice into buffer slot 1:

- Loads A[block_row, 16:32] and B[16:32, block_col]

- TMA operates in background while Tensor Cores compute on slot 0 data

- This overlaps communication with computation

2. Pipeline steady state:

- Slot 0: Consumer computing (Tensor Cores active)

- Slot 1: Producer loading next tile (TMA active)

- Slot 2: Empty, ready for subsequent load

- Slots rotate through: compute -> empty -> loading -> compute

3. Loop continues for all K-slices:

- Total K dimension: 4096

- K-slice size: 16

- Total iterations: 4096 / 16 = 256 iterations

- Each iteration: one TMA load + one Tensor Core operation

4. Arithmetic intensity achieved:

- Bytes loaded per iteration: 4 KB (A tile) + 8 KB (B tile) = 12 KB

- Operations per iteration (whole block): 128 * 256 * 16 * 2 = 1,048,576 operations

- Arithmetic intensity: 1,048,576 ops / 12,288 bytes = 85 ops/byte

- Still below the ridge point (660 ops/byte), but reasonable for this tile size
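
The H100 kernel above relies on TMA, wgmma, and asynchronous barriers, but the loop structure is easier to see in a heavily simplified shared-memory-tiled GEMM that uses plain CUDA cores and __syncthreads(). This sketch illustrates only the load-tile / synchronize / accumulate / repeat pattern of Steps 3 through 5, not the actual Hopper pipeline, and assumes N is a multiple of the tile size:

```cuda
#include <cuda_bf16.h>

constexpr int TILE = 32;

// One thread computes one element of a 32x32 output tile (BF16 in, FP32 out).
__global__ void tiled_gemm(const __nv_bfloat16* A, const __nv_bfloat16* B,
                           float* C, int N) {
    __shared__ __nv_bfloat16 As[TILE][TILE];
    __shared__ __nv_bfloat16 Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                              // FP32 accumulator

    for (int k0 = 0; k0 < N; k0 += TILE) {         // loop over K-slices
        As[threadIdx.y][threadIdx.x] = A[row * N + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();                           // "tile loaded" barrier

        for (int k = 0; k < TILE; ++k)             // multiply-accumulate on the tile
            acc += __bfloat162float(As[threadIdx.y][k]) *
                   __bfloat162float(Bs[k][threadIdx.x]);
        __syncthreads();                           // tile may now be overwritten
    }
    C[row * N + col] = acc;                        // write-back (Step 6)
}
// Launch: tiled_gemm<<<dim3(N/TILE, N/TILE), dim3(TILE, TILE)>>>(A, B, C, N);
```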

Step 6: Output Write-Back

After all K-slices processed:

1. Consumer threads have complete 128x256 output tile in registers:

- 128 * 256 = 32,768 FP32 values

- Distributed across 256 consumer threads

- Each thread holds 128 FP32 values (512 bytes)

2. Store to shared memory:

- Threads cooperatively write accumulator values to shared memory

- Reorganize data layout for efficient global memory write

- Coalesce stores: arrange data so contiguous threads write contiguous memory

3. TMA async store to HBM3:

- Thread 0 issues: cp_async_bulk_tensor_2d_shared_to_global

- TMA transfers 128x256 tile (128 KB) from shared memory to matrix C in HBM3

- This happens asynchronously

- Next tile computation can begin while store completes

4. Memory write flow:

- Shared memory -> L2 cache

- L2 cache -> HBM3 memory controllers

- HBM3 controllers activate rows and write columns

- Write latency: ~300-400 cycles (hidden by pipeline)

Step 7: Multi-Block Coordination

All 512 blocks execute similar process in parallel:

1. L2 cache sharing:

- Multiple blocks reading same regions of A or B

- First block loading A[0:128, 0:16] brings data into L2

- Subsequent blocks reading same region hit L2 cache

- L2 hit latency: ~200 cycles vs 300-400 for HBM3

- Effective bandwidth multiplied by cache reuse

2. Thread block clusters (if used):

- Cluster of 4 blocks across 4 SMs

- Blocks in cluster can directly access each other's shared memory

- If blocks compute adjacent tiles, they share input data

- Block 0 loads A[0:128, 0:16] to its shared memory

- Block 1 (computing adjacent tile) reads from Block 0's shared memory

- Distributed shared memory access: ~400 cycles (slower than local shared memory, but it avoids another round trip to HBM3)

- Reduces total HBM3 traffic

3. Persistent kernel benefits:

- Instead of 512 separate kernel launches

- 132 blocks stay resident, each processing ~4 tiles

- Work queue coordinates tile assignment

- Output write overlaps with next tile load/compute

- Reduces kernel launch overhead from microseconds per tile to one-time cost

Step 8: Performance Analysis

Computational work:

- Matrix size: 4096x4096 @ 4096x4096

- Total operations: 4096 * 4096 * 4096 * 2 = 137.4 billion operations

- BF16 Tensor Core peak: 989.5 TFLOPS

- Theoretical time: 137.4 GFLOP / 989,500 GFLOP/s = 0.139 milliseconds

Memory traffic:

- Matrix A: 4096 * 4096 * 2 bytes = 33.5 MB

- Matrix B: 33.5 MB

- Matrix C read (for beta*C): 67 MB

- Matrix C write: 67 MB

- Total: 201 MB

- Memory bandwidth: 3 TB/s

- Theoretical time: 201 MB / 3,000,000 MB/s = 0.067 milliseconds

Actual performance with optimized kernel:

- Achieved: ~750 TFLOPS (76% of peak)

- Actual time: approximately 0.18 milliseconds

- Efficiency factors reducing from theoretical peak:

* TMA load latency not fully hidden (queue depth limits)

* Bank conflicts despite swizzling (some unavoidable patterns)

* Pipeline bubbles during phase transitions

* Instruction scheduling overhead

* Memory access patterns not perfectly coalesced at tile boundaries

Step 9: Result Return to CPU

After kernel completion:

1. Matrix C (4096x4096 FP32) resides in GPU HBM3

2. Application calls cudaMemcpy to transfer back to CPU

3. Data flows: HBM3 -> PCIe -> System RAM

4. Transfer size: 4096 * 4096 * 4 bytes = 67 MB

5. PCIe bandwidth: 32 GB/s

6. Transfer time: approximately 2.1 milliseconds

Total operation time:

- CPU to GPU transfer: 2.1 ms

- Computation: 0.18 ms

- GPU to CPU transfer: 2.1 ms

- Total: 4.4 ms

Note: The computation itself is tiny compared to PCIe transfer. For real applications, batching many operations and keeping data on GPU is critical.

Part 6: Advanced Optimization Techniques

Why Optimizations Matter at Scale

Training large language models requires clusters of tens of thousands of GPUs. Meta's Llama 3 training used approximately 16,000 H100 GPUs. At this scale:

- A 1% performance improvement equals 160 GPUs worth of compute

- H100 power consumption: approximately 700 W per GPU

- A 1% improvement saves 112 kW of continuous power

- Annual energy savings: approximately 980 MWh

- At typical data center energy costs ($0.10/kWh): $98,000 saved per year

This is why the GPU optimization community focuses on what they call O(NR) optimization, where NR stands for Nuclear Reactors. The asymptotic algorithmic complexity improvements have largely been discovered. The remaining gains come from extracting 1-2% improvements through hardware-aware micro-optimizations. At cluster scale, these small percentages translate to nuclear reactor scale energy differences.

Swizzling Details

Shared memory bank conflicts occur when multiple threads access different addresses in the same bank. The solution is address permutation through XOR operations.

For 128-byte swizzle mode, the transformation is:

new_address = old_address XOR (((old_address >> 7) & 0b111) << 4)

This XORs bits [9:7] into bits [6:4]. The mathematical property: both row-wise access (contiguous addresses) and column-wise access (strided by row length) map to different banks, eliminating conflicts.

Example with 8x8 matrix in shared memory:

- Without swizzling: column access hits same bank 8 times (8-way conflict)

- With swizzling: column access hits 8 different banks (no conflict)

- Throughput improvement: 8x for column access patterns
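
A sketch of the address transformation as a device helper (the function name is illustrative; production kernels get this pattern for free from TMA swizzle modes or CUTLASS layouts):

```cuda
// 128-byte swizzle from the text: XOR bits [9:7] of the byte address into
// bits [6:4], so both row-wise and column-wise walks through a shared-memory
// tile spread across different 16-byte bank groups.
__device__ __forceinline__ unsigned swizzle_128B(unsigned byte_addr) {
    return byte_addr ^ (((byte_addr >> 7) & 0x7u) << 4);
}

// Example: byte addresses 0, 128, 256, ... (a column walk striding one
// 128-byte row at a time) all land in the same bank group without swizzling;
// after swizzling they map to bank groups 0, 1, 2, ..., avoiding the conflict.
```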

Hilbert Curve Scheduling

Persistent kernels process tiles in a specific order. The scheduling policy dramatically affects cache utilization.

Naive ordering: Process tiles sequentially (row 0 col 0, row 0 col 1, ... row 0 col 31, row 1 col 0, ...)

Problem: Tiles far apart in sequence share minimal input data. L2 cache thrashes.

Hilbert curve: A space-filling curve that maintains spatial locality. Consecutive tiles in the curve are spatially close in the output matrix.

Benefit: Adjacent tiles share input data regions. L2 cache hit rate increases.

Performance impact: Approximately 1% improvement on H100

While 1% seems small, on 30,000 H100s, this represents 300 GPUs worth of compute saved.
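
For illustration, here is the standard iterative index-to-coordinate mapping for a Hilbert curve on a power-of-two grid; a persistent kernel's tile scheduler could hand out tiles in this order so that consecutive tiles share L2-resident input data. This is a host-side sketch of the idea, not NVIDIA's or any library's actual scheduler:

```cpp
#include <cstdio>

// Standard iterative Hilbert mapping: linear index d -> (x, y) on an n x n
// grid, where n is a power of two. Consecutive d values stay spatially close.
static void hilbert_d2xy(int n, int d, int* x, int* y) {
    int rx, ry, t = d;
    *x = *y = 0;
    for (int s = 1; s < n; s *= 2) {
        rx = 1 & (t / 2);
        ry = 1 & (t ^ rx);
        if (ry == 0) {                          // rotate/flip the quadrant
            if (rx == 1) { *x = s - 1 - *x; *y = s - 1 - *y; }
            int tmp = *x; *x = *y; *y = tmp;    // swap x and y
        }
        *x += s * rx;
        *y += s * ry;
        t /= 4;
    }
}

int main() {
    // Walk the first few tiles of a 16x16 tile grid in Hilbert order.
    for (int d = 0; d < 8; ++d) {
        int x, y;
        hilbert_d2xy(16, d, &x, &y);
        std::printf("d=%d -> tile (%d,%d)\n", d, x, y);
    }
    return 0;
}
```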

Part 7: Future Directions

Blackwell Architecture (2024-2025)

NVIDIA's next-generation architecture includes:

- 5th-generation Tensor Cores with higher throughput

- NVLink 5.0: 1.8 TB/s bidirectional bandwidth (2x improvement)

- Enhanced TMA with better pipelining

- Larger shared memory per SM

- Expected 2-2.5x AI performance over Hopper

Chiplet Designs

Instead of single monolithic dies:

- Multiple smaller chiplets (compute, memory, interconnect) connected by high-speed links

- AMD MI300A: 13 chiplets (6 GPU compute dies, 3 CPU dies, 4 I/O base dies) with 3D stacking

- Benefits: Better yields, process node mixing, modular scaling

- Challenges: Inter-chiplet latency, power delivery, thermal management

Processing-in-Memory

Moving compute into memory arrays:

- Place simple ALUs directly in HBM stacks

- Eliminates data movement for simple operations

- Ideal for element-wise operations, reductions, activations

- Challenges: Limited compute complexity, power constraints

Conclusion

GPU architecture achieves extreme performance through:

1. Massive parallelism: 16,896 CUDA cores + 528 Tensor Cores working simultaneously

2. Specialized hardware: Tensor Cores for matrix multiplication, TMA for data movement

3. Memory hierarchy: Multiple cache levels hiding latency, HBM3 providing 3 TB/s bandwidth

4. Latency hiding: Thousands of threads scheduled to hide memory and pipeline latency

5. Pipelining: Overlapping computation, data movement, and output storage

6. Software optimization: Cache-aware scheduling, swizzling, register management

Understanding how these components work together, from individual DRAM cells through SMs to multi-GPU clusters, provides insight into modern high-performance computing. Every level matters: hardware design, memory architecture, execution model, and algorithmic optimization all contribute to achieving peak performance.

The GPU revolution enabled the AI revolution. As models grow larger and training clusters expand, GPU architecture continues evolving to meet computational demands while managing power and memory constraints.


This article provides technical foundations for understanding GPU architecture and performance optimization. For specific implementation details, refer to CUDA programming guides and architecture documentation from NVIDIA and other GPU manufacturers.
