What is a TPU? Deep Dive into Google's AI Chip

Architecture, systolic arrays, and why Google's custom silicon still fights an uphill battle against CUDA

In 2013, Google faced an infrastructure crisis. Their engineers had been experimenting with deep neural networks for voice recognition, and the results were remarkable: error rates dropped by 30% compared to traditional methods. But there was a problem. They calculated that if every Android user used voice search for just three minutes per day, Google would need to double its entire global data center capacity. The cost would be astronomical, and the timeline impossible.

Google's solution was radical: design their own chip from scratch. Fifteen months later, they deployed the first Tensor Processing Unit (TPU) into production. It was one of the fastest hardware development cycles in computing history, and it fundamentally changed how Google, and eventually the entire AI industry, thinks about machine learning infrastructure.

This article explains what TPUs are, how they work at the architectural level, why they differ so fundamentally from GPUs, and what makes them critical to modern AI systems.

The Core Idea: An ASIC for Neural Networks

A TPU is an application-specific integrated circuit (ASIC) designed exclusively for machine learning workloads. The name comes from TensorFlow, Google's machine learning framework; these chips are optimized to process the tensors (multi-dimensional arrays) that flow through neural networks.

To understand why this matters, consider what neural networks actually do. At their core, neural networks perform vast quantities of matrix multiplications. When you feed an image into a vision model, the pixel values form a matrix that gets multiplied by weight matrices, layer after layer. A single forward pass through GPT-4 involves trillions of multiply-accumulate operations. The model's "intelligence" emerges from these mathematical transformations.
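To make this concrete, here is a minimal sketch of a two-layer forward pass in JAX (Google's TPU-oriented framework, discussed later). The layer sizes and random weights are arbitrary illustrations, but notice that every step is either a matrix multiply or a cheap element-wise operation:

import jax.numpy as jnp
from jax import random

# Toy two-layer network: the entire forward pass reduces to matrix
# multiplies plus inexpensive element-wise operations.
key = random.PRNGKey(0)
k1, k2, k3 = random.split(key, 3)

x  = random.normal(k1, (32, 784))          # a batch of 32 flattened 28x28 "images"
w1 = random.normal(k2, (784, 512)) * 0.01  # first weight matrix
w2 = random.normal(k3, (512, 10)) * 0.01   # second weight matrix

h      = jnp.maximum(x @ w1, 0.0)          # matmul, then ReLU
logits = h @ w2                            # another matmul
print(logits.shape)                        # (32, 10)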

CPUs and GPUs can perform these operations, but they weren't designed for them. A CPU is optimized for sequential tasks with complex branching logic-running an operating system, parsing text, managing databases. A GPU evolved from graphics rendering, where millions of pixels need independent calculations. Both are general-purpose processors that happen to be useful for machine learning.

Google asked a different question: what if you designed a chip that could only do matrix multiplication, but did it better than anything else? What if you threw away all the transistors devoted to branch prediction, speculative execution, cache coherency, and graphics rendering, and used that silicon budget entirely for multiplying matrices?

TPU Architecture: The Systolic Array

The heart of every TPU is the Matrix Multiply Unit (MXU), built using a design called a systolic array. This architecture, proposed in the late 1970s but rarely used in mainstream processors before TPUs, fundamentally changes how matrix operations execute.

In a conventional processor, computation follows a fetch-execute cycle. The processor fetches data from memory, performs an operation, writes the result back to memory, then fetches the next piece of data. Memory access is slow, often 100-1000x slower than computation, so processors spend most of their time waiting for data rather than computing. Modern CPUs and GPUs use elaborate caching hierarchies to hide this latency, but the fundamental problem remains.

A systolic array works differently. Picture a grid of processing elements arranged in rows and columns. Data flows through this grid rhythmically, like blood pulsing through arteries (hence "systolic," from the Greek word for contraction). Each processing element performs a single multiply-accumulate operation and passes its result to the neighboring element. The data is reused as it flows, rather than being repeatedly fetched from memory.

Here is how the TPU architecture is organized:

                            TPU ARCHITECTURE (v4/v5)

                         ┌──────────────────────────┐
                         │      HOST INTERFACE      │
                         │      (PCIe 4.0 x16)      │
                         └────────────┬─────────────┘
                                      │
                         ┌────────────┴─────────────┐
                         │    INSTRUCTION BUFFER    │
                         └────────────┬─────────────┘
                                      │
        ┌─────────────────────────────┼─────────────────────────────┐
        │                             │                             │
┌───────┴───────┐            ┌────────┴────────┐            ┌───────┴────────┐
│  WEIGHT FIFO  │            │ UNIFIED BUFFER  │            │  ACCUMULATORS  │
│ Model weights ├───────────►│  24-96MB SRAM   │◄───────────┤   32-bit FP    │
└───────┬───────┘            │   Activations   │            └───────┬────────┘
        │                    └────────┬────────┘                    │
        │                             │                             │
        └─────────────────────────────┼─────────────────────────────┘
                                      │
┌─────────────────────────────────────┴────────────────────────────────────┐
│                       MATRIX MULTIPLY UNIT (MXU)                         │
│                                                                          │
│    Weights ─►  [PE][PE][PE][PE][PE][PE][PE][PE]       128x128 grid       │
│                [PE][PE][PE][PE][PE][PE][PE][PE]       = 16,384           │
│    Inputs │    [PE][PE][PE][PE][PE][PE][PE][PE]       ops/cycle          │
│           ▼    [PE][PE][PE][PE][PE][PE][PE][PE]                          │
│                                                                          │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      │
┌─────────────────────────────────────┴────────────────────────────────────┐
│                      VECTOR PROCESSING UNIT                              │
│            ReLU  │  Softmax  │  LayerNorm  │  GELU  │  Tanh              │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      │
┌─────────────────────────────────────┴────────────────────────────────────┐
│                     HIGH BANDWIDTH MEMORY (HBM)                          │
│               32-128 GB   │   1.5-4.8 TB/s bandwidth                     │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      │
┌─────────────────────────────────────┴────────────────────────────────────┐
│                     INTER-CHIP INTERCONNECT (ICI)                        │
│        [TPU]────[TPU]────[TPU]────[TPU]────[TPU]   up to 8,960 chips     │
└──────────────────────────────────────────────────────────────────────────┘
                                        

Consider a concrete example. To multiply a 128×128 matrix A by a 128×128 matrix B, a systolic array loads matrix A's rows into the left edge and matrix B's columns into the top edge. On each clock cycle, every processing element multiplies its current input values, adds the result to its running sum, and passes the input values to its neighbors. After 128 cycles plus some pipeline latency, all 16,384 output values are computed. The key insight is that each input value is read from memory once but used 128 times as it flows across the array.
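A toy simulation makes the reuse pattern easier to see. The sketch below is plain NumPy and invented for illustration; it models an output-stationary array in which, on each simulated cycle, one column of A and one row of B enter the grid and every accumulator adds one product. It captures the arithmetic and the data reuse, not the skewed pipeline timing of real hardware:

import numpy as np

def systolic_matmul(A, B):
    # One accumulator per processing element (one per output value).
    n, k_dim = A.shape
    _, m = B.shape
    acc = np.zeros((n, m), dtype=np.float32)
    for k in range(k_dim):                  # one "pulse" per cycle
        # A[:, k] enters from the left edge, B[k, :] from the top edge.
        # Each value is fetched once here but reused by a full row or
        # column of PEs; on a TPU all 16,384 MACs happen in parallel.
        acc += np.outer(A[:, k], B[k, :])
    return acc

A = np.random.rand(128, 128).astype(np.float32)
B = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(systolic_matmul(A, B), A @ B, atol=1e-3)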

The TPUv4 MXU is a 128×128 array, meaning it contains 16,384 processing elements performing 16,384 multiply-accumulate operations every clock cycle. At 1.05 GHz, that is over 17 trillion multiply-accumulate operations per second from this single unit, and modern TPUs contain multiple MXUs per chip.

Memory Architecture: Feeding the Beast

A systolic array is only as fast as the data you can feed it. If the MXU completes a matrix multiply but has to wait for the next operands, you lose all the efficiency gains. TPUs address this with a carefully designed memory hierarchy.

At the bottom sits High Bandwidth Memory (HBM), a 3D-stacked DRAM technology where memory chips are literally stacked on top of each other and connected with through-silicon vias. This stacking puts memory physically close to the processor and provides massive bandwidth: 4.8 TB/s on TPUv5p, compared to around 80 GB/s for standard DDR5. HBM stores the full model weights, optimizer states, and large activation tensors.

Above HBM sits the Unified Buffer, a large on-chip SRAM (24-96 MB depending on generation). SRAM is faster than DRAM but more expensive per bit, so it's used strategically. The Unified Buffer holds activations during forward and backward passes-data that will be reused multiple times during a computation. Because it's on-chip, access latency is nanoseconds rather than the tens of nanoseconds required for HBM.

The Weight FIFO is a specialized buffer that streams weights into the MXU. During inference, model weights are read sequentially layer by layer, making them ideal for FIFO (first-in-first-out) access. The weights load continuously while the MXU processes, hiding memory latency behind computation.

This memory system is software-managed rather than cache-based. Unlike CPUs where hardware automatically decides what to cache, TPU software explicitly orchestrates data movement. The XLA (Accelerated Linear Algebra) compiler analyzes the computation graph, determines optimal buffer allocation, and generates precise memory access schedules. This removes the unpredictability of cache misses but requires sophisticated compiler technology.
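In JAX this compilation step is directly visible. The sketch below is a hypothetical single dense layer; the staging calls (lower, compile) are standard jax.jit APIs, though their output format varies between JAX versions. It traces the function into an XLA-compatible graph before any TPU (or CPU/GPU) code is generated:

import jax
import jax.numpy as jnp

def layer(x, w, b):
    # A matmul XLA can map onto the MXU, plus an element-wise op
    # for the vector unit.
    return jax.nn.relu(x @ w + b)

x = jnp.ones((8, 256), dtype=jnp.bfloat16)
w = jnp.ones((256, 512), dtype=jnp.bfloat16)
b = jnp.zeros((512,), dtype=jnp.bfloat16)

lowered = jax.jit(layer).lower(x, w, b)  # trace to StableHLO without running
print(lowered.as_text()[:400])           # the graph XLA will fuse and schedule
compiled = lowered.compile()             # backend-specific buffers and schedules
print(compiled.cost_analysis())          # rough FLOP/byte estimates (format varies)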

Numerical Precision: The bfloat16 Innovation

Traditional scientific computing uses 64-bit or 32-bit floating point numbers. Neural networks, it turns out, don't need that much precision. Weights and activations can be represented with far fewer bits without degrading model quality. This insight led Google to develop bfloat16, a format now adopted across the industry.

Standard IEEE 754 float32 uses 1 sign bit, 8 exponent bits, and 23 mantissa bits. The IEEE float16 format reduces this to 1 sign bit, 5 exponent bits, and 10 mantissa bits. The problem is that 5 exponent bits dramatically reduce the representable range, causing overflow and underflow during training. Gradients in early layers can be extremely small while later layer activations are large; float16's limited range struggles with this dynamic range.

Google's bfloat16 takes a different approach: 1 sign bit, 8 exponent bits (same as float32), but only 7 mantissa bits. This preserves the full dynamic range of float32 while halving memory and bandwidth requirements. The reduced precision in the mantissa causes some rounding error, but neural networks are remarkably robust to this noise; in practice, it can even act as a regularizer.

TPUs support multiple precision modes. Training typically uses bfloat16 for forward and backward passes with float32 accumulation for numerical stability. Inference can often use INT8 quantization, where weights and activations are 8-bit integers, roughly doubling throughput relative to bfloat16 on recent generations. The original TPUv1 was actually INT8-only, designed purely for inference. Later generations added bfloat16 and float32 support for training.
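The practical difference between the formats is easy to demonstrate. The following sketch assumes a recent JAX install (exact printed values may differ slightly); it shows bfloat16 keeping float32's range where float16 overflows, and the bfloat16-input/float32-accumulation pattern described above:

import jax.numpy as jnp

pi = jnp.float32(3.141592653589793)
print(pi.astype(jnp.bfloat16))                 # ~3.140625: only ~7 mantissa bits survive
print(jnp.float32(1e38).astype(jnp.bfloat16))  # stays finite: float32's range is preserved
print(jnp.float32(1e38).astype(jnp.float16))   # inf: float16's 5 exponent bits overflow

# Mixed precision in the style TPUs use: bfloat16 operands, float32 accumulation.
a = jnp.ones((128, 128), dtype=jnp.bfloat16)
b = jnp.ones((128, 128), dtype=jnp.bfloat16)
c = jnp.dot(a, b, preferred_element_type=jnp.float32)
print(c.dtype)                                 # float32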

TPU vs GPU: Different Philosophies

GPUs and TPUs both accelerate matrix operations, but they approach the problem from opposite directions.

A GPU is a massively parallel processor built from thousands of small, identical cores. An NVIDIA H100 contains 16,896 CUDA cores, each capable of independent arithmetic operations. This architecture emerged from graphics rendering, where each pixel of an image can be calculated independently. The SIMT (Single Instruction, Multiple Thread) execution model runs the same instruction across many threads, each processing different data.

GPUs are highly programmable. You can write arbitrary parallel algorithms, handle complex branching logic, and use the hardware for simulations, graphics, cryptography, or machine learning. NVIDIA's CUDA ecosystem provides mature libraries and tools that have been refined over nearly two decades. This flexibility comes at a cost: transistors dedicated to branch prediction, thread scheduling, register files, and cache coherency don't directly contribute to multiply-accumulate throughput.

Modern GPUs have added specialized hardware to compete with TPUs. The H100's Tensor Cores are essentially small systolic arrays within a larger GPU architecture. But they're one component among many, competing for power and die area with the general-purpose CUDA cores, graphics hardware, and cache hierarchies.

TPUs bet everything on matrix multiplication. They cannot render graphics. They cannot run general-purpose parallel algorithms efficiently. Their programming model is restricted to operations that the XLA compiler can decompose into matrix multiplies and element-wise operations. In exchange, every transistor is optimized for neural network workloads. The systolic array dominates the die area. There's no branch predictor because there are no branches. The memory system is purpose-built for the predictable access patterns of neural networks.

The efficiency difference is substantial. Google's original TPU paper reported 30-80× better performance per watt than contemporary CPUs and GPUs for neural network inference. Modern comparisons are more nuanced, since NVIDIA has significantly improved Tensor Core efficiency, but TPUs maintain a meaningful edge for workloads that fit their architecture.

The control flow difference matters more than it might seem. GPUs handle variable-length sequences and dynamic computation graphs naturally because each thread can take a different path. TPUs work best with static shapes and fixed computation patterns. This is one reason JAX and TensorFlow (with XLA compilation) are the primary frameworks for TPUs, while PyTorch's eager execution mode works better on GPUs. The ecosystem difference reflects fundamental hardware differences.
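The practical consequence shows up as recompilation. In the hedged sketch below (a toy scoring function with an arbitrary bucket size), every new input shape triggers a fresh trace and XLA compile under jax.jit, which is why TPU pipelines typically pad variable-length data to a small set of static shapes and mask out the padding:

import jax
import jax.numpy as jnp

@jax.jit
def score(tokens):
    return jnp.sum(tokens * 2)

# Three distinct shapes mean three separate traces and compilations.
for n in (7, 13, 29):
    score(jnp.ones((n,), dtype=jnp.int32))

# The TPU-friendly pattern: pad to a fixed bucket and carry a mask.
MAX_LEN = 32

def pad_to_bucket(tokens):
    pad = MAX_LEN - tokens.shape[0]
    padded = jnp.pad(tokens, (0, pad))            # static shape (MAX_LEN,)
    mask = jnp.arange(MAX_LEN) < tokens.shape[0]  # marks the real tokens
    return padded, mask

padded, mask = pad_to_bucket(jnp.ones((13,), dtype=jnp.int32))
print(score(padded), mask.sum())  # one compiled shape serves every length <= 32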

The Ecosystem Problem: Why PyTorch and CUDA Still Dominate

If TPUs are so efficient, why hasn't everyone switched? The answer lies not in hardware but in software. NVIDIA has spent nearly two decades building an ecosystem that creates profound switching costs, and PyTorch, the framework that won the research community, was built CUDA-first.

CUDA launched in 2006. For nearly two decades, NVIDIA has been accumulating libraries, tools, documentation, Stack Overflow answers, university courses, and developer mindshare. cuDNN provides optimized neural network primitives. cuBLAS handles linear algebra. NCCL manages multi-GPU communication. TensorRT optimizes inference. Even OpenAI's Triton, which simplifies kernel development, targets NVIDIA GPUs first. Each library represents years of engineering and optimization that competitors must replicate.

More importantly, virtually every ML paper published in the last five years includes code written for CUDA GPUs. Researchers write custom CUDA kernels for novel operations. When you want to reproduce a paper's results or build on someone's work, you need CUDA. This network effect is self-reinforcing: researchers use CUDA because existing code uses CUDA, and they release CUDA code because that's what they wrote.

The PyTorch Factor

PyTorch's dominance in research makes the TPU adoption problem concrete. PyTorch was designed around eager execution: you write Python code, and operations execute immediately on the GPU. This imperative style feels natural to programmers and makes debugging straightforward. You can insert print statements, use Python debuggers, and inspect intermediate tensors at any point.

TPUs require a fundamentally different approach. They work best with static computation graphs that can be compiled and optimized ahead of time. The XLA (Accelerated Linear Algebra) compiler analyzes your entire computation, fuses operations, optimizes memory layout, and generates efficient TPU code. This compilation step enables TPU efficiency but conflicts with PyTorch's eager execution model.

PyTorch/XLA exists as a bridge, allowing PyTorch code to run on TPUs. But the abstraction is leaky. Dynamic control flow that works fine on GPUs can cause recompilation on TPUs. Operations that PyTorch handles gracefully may not have XLA equivalents. Debugging becomes harder because errors surface during compilation rather than execution, and the stack traces point to XLA internals rather than your Python code.

The practical result: porting a PyTorch codebase to TPUs often requires significant refactoring. Dynamic shapes must become static. Custom CUDA kernels must be rewritten or replaced. Training loops need restructuring to minimize recompilation. For a research team with working GPU code and a paper deadline, this porting cost is rarely worth the potential efficiency gains.

JAX: Google's Answer (That Created Another Problem)

Google developed JAX as a framework designed from the ground up for XLA compilation. JAX embraces functional programming: pure functions, immutable data, explicit random state. This design maps naturally to XLA's compilation model. Operations compose cleanly. Automatic differentiation works through arbitrary Python control flow. The same code runs efficiently on CPUs, GPUs, and TPUs.

JAX has gained significant traction, particularly for large-scale training. DeepMind uses it extensively. Google's PaLM and Gemini were trained using JAX. The Flax and Equinox libraries provide neural network primitives. For new projects without legacy code, JAX on TPUs can be an excellent choice.

But JAX's functional paradigm requires a mental shift that many developers resist. Stateful operations (updating a batch normalization running mean, for instance) require explicit state threading that feels unnatural to programmers trained on imperative code. The debugging experience differs from standard Python. Error messages can be cryptic. The ecosystem of pre-built models and training recipes is smaller than PyTorch's.
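For readers unfamiliar with the pattern, here is a small sketch (function and dictionary names invented for illustration) of what explicit state threading looks like in JAX. The running mean and the PRNG key are ordinary values the caller passes in and gets back, rather than hidden mutable state:

import jax
import jax.numpy as jnp

def update_running_mean(state, batch, momentum=0.9):
    # A stateful-looking op written functionally: state in, new state out.
    new_mean = momentum * state["mean"] + (1.0 - momentum) * batch.mean(axis=0)
    return {**state, "mean": new_mean}

state = {"mean": jnp.zeros((4,))}
key = jax.random.PRNGKey(0)   # randomness is explicit state too
for _ in range(3):
    key, sub = jax.random.split(key)
    batch = jax.random.normal(sub, (32, 4))
    state = update_running_mean(state, batch)  # caller threads state through
print(state["mean"])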

The irony is that Google now maintains two major ML frameworks-TensorFlow and JAX-neither of which has achieved PyTorch's research adoption. TensorFlow dominated industry deployment for years but lost the research community when PyTorch offered a better development experience. JAX appeals to a subset of researchers comfortable with functional programming but hasn't displaced PyTorch as the default.

The Talent Pipeline

Ecosystem lock-in extends to human capital. Graduate students learn PyTorch because their advisors use PyTorch. They write CUDA kernels for their research. When they join companies or start their own, they bring these skills and preferences. Hiring managers looking for ML engineers find candidates who know PyTorch and CUDA, not JAX and XLA.

This creates a chicken-and-egg problem for TPU adoption. Companies hesitate to commit to TPUs because finding experienced TPU developers is difficult. Developers don't invest in learning TPU workflows because job postings emphasize CUDA experience. The equilibrium favors the incumbent.

What Would Change the Equation

Several factors could shift the balance. If GPU supply constraints persist, as they have for years, organizations may be forced onto alternative hardware regardless of ecosystem preferences. Cost pressure could drive adoption: if TPUs offer 2-3× better price-performance for specific workloads, finance teams may override engineering preferences. Framework convergence could help: PyTorch's torch.compile with the Inductor backend represents a move toward the compilation model that benefits TPUs and other accelerators.

The most likely path is fragmentation rather than displacement. Large organizations with dedicated infrastructure teams (Google, Meta, major cloud customers) will use TPUs and other custom accelerators where they provide advantages. Smaller teams and individual researchers will continue defaulting to PyTorch and NVIDIA GPUs because the ecosystem makes them productive. The AI accelerator market is large enough to sustain multiple platforms, even if none achieves the ubiquity that CUDA enjoys today.

TPU Pods: Scaling to Exaflops

Individual TPUs are fast, but modern LLMs require distributed training across thousands of chips. Google designed TPUs from the start with this in mind, creating the Inter-Chip Interconnect (ICI).

Unlike GPU clusters, which typically communicate between nodes through network interface cards and external switches, TPUs have direct chip-to-chip connections built into the silicon. Each TPU connects to its neighbors in a 2D or 3D torus topology, meaning data can flow directly between chips without traversing network switches. This is similar to how many supercomputers connect processors, but implemented at the chip level.

A TPUv4 pod contains 4,096 chips connected in a 3D torus (built from 4×4×4 blocks of 64 chips linked by optical circuit switches), delivering over 1.1 exaflops of bfloat16 compute. The ICI provides terabytes per second of bisection bandwidth, allowing efficient all-reduce operations during distributed training. When a gradient update needs to be synchronized across chips, the torus topology ensures short paths and high bandwidth.

This tight integration enables training strategies that would be impractical on GPU clusters. Google's Pathways system can run different parts of a model on different TPU slices while maintaining efficient communication. The GSPMD (General and Scalable Parallelization for ML Computation Graphs) compiler pass automatically shards models across available TPUs, handling the complex partitioning of weights, activations, and gradients.
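Through JAX, this GSPMD-style partitioning is exposed as sharding annotations. The sketch below is illustrative (the mesh shape and array sizes are arbitrary, and it assumes at least eight visible devices; on CPU you can fake them with XLA_FLAGS=--xla_force_host_platform_device_count=8). You declare how arrays are laid out over a device mesh, and the compiler inserts the necessary collectives:

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# A 2x4 logical mesh; on a TPU slice, jax.devices() returns real chips.
devices = np.array(jax.devices()[:8]).reshape(2, 4)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard activations along the data axis, weights along the model axis.
x = jax.device_put(jnp.ones((256, 1024)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    return jax.nn.relu(x @ w)

y = layer(x, w)
print(y.sharding)  # the output comes back sharded over both mesh axes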

The TPUv5p extends this to 8,960 chips per pod with doubled ICI bandwidth. These pods are purpose-built AI supercomputers, and Google operates multiple pods across their data centers for training frontier models.

TPU Generations: A Hardware Evolution

The first TPU, deployed in 2015 and announced in 2016, was an inference-only accelerator. It used a 256×256 systolic array operating on INT8 data, delivering 92 trillion operations per second. With just 28MB of on-chip memory and no HBM, it was designed for low-latency inference rather than training. This chip powered Google Search rankings, Photos face recognition, and Translate for over a year before the public knew it existed.

TPUv2 (2017) added training capability. The switch to bfloat16 and float32 support enabled gradient computation, while the addition of 8GB of HBM per core provided space for model weights and optimizer states. This generation introduced the TPU Pod concept, connecting 256 chips via ICI. Google made TPUv2 available through Google Cloud, the first time external customers could access TPU hardware.

TPUv3 (2018) doubled the compute per chip and introduced liquid cooling to manage the increased power. A full v3 pod contained 1,024 chips delivering over 100 petaflops. This generation trained BERT, the model that revolutionized NLP and established the pretrain-then-finetune paradigm that continues today.

TPUv4 (2021) made a significant architectural leap. The 3D torus interconnect, larger HBM capacity (32GB), and improved matrix units enabled 4,096-chip pods exceeding one exaflop. This generation trained PaLM (540 billion parameters) and the early Gemini models. Google's TPUv4 paper reported that a single pod could train a GPT-3 scale model in approximately 10 days.

The v5 generation (2023) split into two variants. TPUv5e targets cost-efficient inference and smaller training jobs, while TPUv5p, the performance-focused variant, pushes the frontier with 95GB of HBM, 459 TFLOPS of bfloat16 per chip, and 8,960-chip pods. Both integrate with Google's Pathways system, which can dynamically schedule model components across TPU slices.

Trillium (TPUv6), announced in 2024, claims 4.7× improvement in compute performance per chip over v5e. The interconnect bandwidth also increased significantly, anticipating the communication requirements of trillion-parameter models. Google has stated that Trillium will power the next generation of Gemini training.

Real-World Applications

Inside Google, TPUs are the primary infrastructure for both training and serving AI models. Every Google Search query passes through TPU-accelerated ranking models. Gmail's spam filtering, Google Photos' face clustering, YouTube's recommendation system, and Google Translate all run on TPU inference. The scale is staggering: billions of TPU inference calls per day across Google's services.

For training, TPUs have powered most of Google's significant AI breakthroughs. BERT, T5, PaLM, Gemini, AlphaFold, AlphaZero: all trained on TPUs. The architectural consistency between training and inference hardware simplifies deployment; a model trained on TPUs can be served on TPUs without compatibility concerns.

DeepMind's AlphaFold, which predicted protein structures for nearly every known protein, was trained on TPUv3 hardware. The computational requirements were immense, with training running for weeks on TPU accelerators, but the scientific impact was transformational. Similar TPU-scale resources power research into weather forecasting, materials science, and drug discovery.

External users access TPUs through Google Cloud Platform. Midjourney, Anthropic (for some workloads), and numerous AI startups have trained models on Cloud TPU. The economics differ from NVIDIA GPUs, with pricing, availability, and framework compatibility all varying, but for workloads that fit, TPUs can offer significant cost advantages.

Google also offers free TPU access through Colab, Kaggle, and the TPU Research Cloud program. The TRC program has provided thousands of researchers with free TPU access, resulting in academic papers across NLP, computer vision, and reinforcement learning. This deliberate ecosystem building helps ensure continued TPU software support.

Strategic Implications

Google's decision to build TPUs was as much about supply chain independence as performance. NVIDIA GPUs are a constrained resource: every major tech company, cloud provider, and AI startup competes for limited production. During the 2022-2023 AI boom, GPU delivery times stretched to months. Companies with significant TPU capacity, like Google, could continue scaling training runs regardless of the GPU shortage.

This vertical integration mirrors Apple's strategy with custom silicon. By controlling the hardware-software stack, Google can optimize across boundaries that would constrain companies using off-the-shelf hardware. The XLA compiler and TPU hardware co-evolved, with compiler feedback informing hardware design and hardware capabilities enabling new compiler optimizations.

Other major players have followed Google's path. Amazon developed Trainium and Inferentia chips. Microsoft invested in custom silicon. Meta is building MTIA accelerators. The logic is compelling: if AI is your core business, depending on a single supplier for your most critical component is an unacceptable risk.

The GPU vs. TPU (vs. other custom silicon) competition benefits the entire AI ecosystem. Competition drives innovation in chip design, compilation, and system architecture. NVIDIA's Tensor Cores were a direct response to TPU efficiency. The rapid improvement in ML accelerator performance over the past decade, far exceeding Moore's Law, reflects this competitive pressure.

The Takeaway

TPUs represent a fundamental bet that the future of computing is specialized. General-purpose processors evolved for diverse workloads, but neural networks have become important enough to justify dedicated silicon. By ruthlessly optimizing for matrix multiplication, sacrificing flexibility, graphics capability, and general-purpose programming, TPUs achieve efficiency that would be impossible with conventional architectures.

The systolic array architecture, the bfloat16 numerical format, the software-managed memory hierarchy, the torus interconnect for pod-scale training: these design choices create a coherent system optimized end-to-end for one thing. It's a case study in how hardware-software co-design, when possible, beats assembling best-in-class components.

As AI models continue scaling, with trillion-parameter models becoming common and ten-trillion-parameter models on the horizon, the economics of training infrastructure become increasingly critical. The companies that can train frontier models most efficiently will have significant advantages. TPUs, and the design philosophy they represent, will continue shaping how the next generation of AI systems get built.
