Day 3: Inside the Transformer
Day 3 of 7 • Mechanistic Interpretability Series

Inside the Transformer

The architecture behind every modern AI — explained without the math.


Every AI system you've heard of — ChatGPT, Claude, Gemini, LLaMA, Mistral — is built on the same architecture: the transformer.

Invented by Google researchers in 2017, the transformer wasn't supposed to change everything. The paper was called "Attention Is All You Need" — a technical contribution about a new way to process sequences. The authors proposed replacing the then-dominant recurrent neural networks with a simpler architecture based entirely on attention mechanisms.

It turned out to be one of the most important inventions in the history of computing. Within a few years, transformers had taken over natural language processing. Then computer vision. Then protein folding. Then audio, video, robotics. The architecture was so general, so scalable, that it became the foundation for nearly all frontier AI.

Before we can understand mechanistic interpretability — how to reverse-engineer what these models are doing — we need to understand what we're looking at. What's actually inside a transformer? What are "neurons" and "layers" and "attention"?

No math required. Just mental models that will help you understand what interpretability researchers are actually looking at when they peer inside these systems.

◆ ◆ ◆

The Big Picture

At the highest level, a language model does one thing: predict the next word.

You give it "The capital of France is" and it predicts "Paris". You give it "Write a poem about" and it predicts "the". Then you append that word to the input and ask again. And again. Word by word, it generates text.

The magic is in how it makes these predictions. A transformer processes the input through a series of layers, each transforming the representation until, at the end, it outputs a probability distribution over all possible next words.

The scale is staggering. GPT-4 is reported to have around 1.8 trillion parameters across an estimated 120 layers, and Claude 3 is believed to have hundreds of billions. These numbers represent the learned connections and weights — the "knowledge" the model has extracted from training data.

"The capital
of France is"
TRANSFORMER 96 layers 175B params
Paris: 94%
Lyon: 2%...

But what happens inside that box? Let's zoom in.

◆ ◆ ◆

Step 1: Tokens and Embeddings

The first thing a transformer does is break the input into tokens.

Tokens aren't quite words. They're chunks — sometimes whole words, sometimes parts of words, sometimes individual characters. "Understanding" might become ["under", "stand", "ing"]. Common words like "the" stay whole. The exact split depends on the tokenizer.

Why tokens instead of characters or words? Tokens are a compromise. Characters are too granular — the model would have to learn that "c-a-t" spells "cat" from scratch, wasting capacity on spelling. Words are too sparse — rare words like "defenestration" would barely appear in training data, making them hard to learn. Tokens hit a sweet spot: common enough to learn well, granular enough to handle novel words.

GPT-4 uses roughly 100,000 different tokens. Claude uses a similar number. Each token is essentially a vocabulary entry that the model has learned to work with.

Each token gets converted into an embedding — a vector of numbers (typically 4,096 to 12,288 numbers) that represents that token. These embeddings are learned during training. Similar tokens end up with similar embeddings.

Tokenization Example

[Diagram: "The capital of France is" split into tokens; each colored box = one token → one embedding vector]
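
To make this first step concrete, here is a minimal Python sketch of tokenization and embedding lookup. The whitespace tokenizer, five-word vocabulary, and 8-dimensional embedding table are toy stand-ins invented for illustration; real models use learned subword tokenizers, roughly 100,000 tokens, and thousands of dimensions.

```python
import numpy as np

# Toy vocabulary and embedding table. Real models learn roughly 100,000 tokens
# with 4,096 to 12,288 dimensions; these tiny numbers are only for illustration.
vocab = {"The": 0, "capital": 1, "of": 2, "France": 3, "is": 4}
embedding_dim = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))  # learned during training

def tokenize(text):
    # Toy whitespace "tokenizer"; real tokenizers split text into learned subword pieces.
    return [vocab[word] for word in text.split()]

token_ids = tokenize("The capital of France is")   # five token ids
embeddings = embedding_table[token_ids]            # shape (5, 8): one vector per token
```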

At this point, we have a sequence of vectors — one per token. This is the input to the transformer's layers. The journey from raw text to prediction is about to begin.

◆ ◆ ◆

Step 2: The Residual Stream

Here's a key concept that's central to mechanistic interpretability: the residual stream.

Think of the residual stream as a river of information flowing through the model. Each token has its own stream — a vector that starts as the embedding and gets modified as it passes through each layer.

The crucial design choice: layers don't replace the stream, they add to it. Each layer reads from the residual stream, does some computation, and writes its result back by adding to the stream. This is why it's called "residual" — inspired by residual connections that prevent information from getting lost in deep networks.

The Residual Stream

Embedding → (+ Layer 1) → (+ Layer 2) → … → (+ Layer 96) → Output

Each layer ADDS to the stream — information accumulates

This architecture is why interpretability is even possible. Because layers add to a shared stream rather than completely replacing it, we can examine what information is present at each point. The residual stream is the "highway" where all information travels — making it a natural place to look for interpretable representations.

Think of it like a document being edited by many people. Each editor (layer) reads the current document, makes additions in the margins, and passes it on. At any point, you can read the document and see the accumulated contributions. This is different from a pipeline where each stage completely transforms the data — there, intermediate states might not be meaningful.
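
In code, the add-don't-replace pattern is just a running sum. The sketch below uses scaled-down placeholder functions for the attention and MLP blocks (they are not real implementations); the point is only the shape of the computation and why every intermediate state stays readable.

```python
import numpy as np

def attention(x):
    # Placeholder for an attention block; a real one mixes information across tokens.
    return 0.1 * x

def mlp(x):
    # Placeholder for an MLP block; a real one transforms each token independently.
    return 0.1 * x

def transformer_forward(embeddings, n_layers=4):
    residual = embeddings.copy()           # the residual stream starts as the embeddings
    snapshots = [residual.copy()]          # the stream can be inspected after every layer
    for _ in range(n_layers):
        residual = residual + attention(residual)   # layers ADD to the stream...
        residual = residual + mlp(residual)         # ...they never overwrite it
        snapshots.append(residual.copy())
    return residual, snapshots

x = np.random.default_rng(0).normal(size=(5, 8))    # 5 tokens, 8-dim toy embeddings
out, states = transformer_forward(x)
print(len(states))   # n_layers + 1 readable snapshots of the same "document"
```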

◆ ◆ ◆

Step 3: Attention — How Tokens Talk to Each Other

Each transformer layer has two main components. The first is attention.

Here's the problem attention solves: when processing the word "it" in "The cat sat on the mat because it was tired", how does the model know that "it" refers to "cat" and not "mat"?

Attention lets each token look at all previous tokens and selectively gather information from them. It's like each word can ask: "Which earlier words are relevant to understanding me?"

The mechanism works through three learned transformations — Query, Key, and Value (Q, K, V):

Query: "What am I looking for?" — each token creates a query vector representing what information it needs.

Key: "What do I contain?" — each token creates a key vector advertising what information it has.

Value: "Here's my information" — the actual content that gets passed along when there's a match.

When a query matches a key (mathematically: when their dot product is high), the corresponding value gets weighted heavily in the output. This is how "it" can attend to "cat" — their query-key match is high because the model has learned patterns about pronoun resolution.
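
Here is a minimal single-head version of that mechanism in numpy. The weight matrices are random placeholders rather than learned values, and the causal mask simply keeps each token from looking at later tokens, as in a language model.

```python
import numpy as np

def attention_head(x, W_q, W_k, W_v):
    """One attention head: queries ask, keys advertise, values carry the content."""
    Q = x @ W_q                                      # "what am I looking for?"
    K = x @ W_k                                      # "what do I contain?"
    V = x @ W_v                                      # "here's my information"
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # query-key match = dot product
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -1e9, scores)            # causal mask: only earlier tokens visible
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)   # softmax turns matches into weights
    return weights @ V                               # blend values by how well they matched

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                          # 5 tokens, 8-dim toy residual stream
W_q, W_k, W_v = [rng.normal(size=(8, 8)) for _ in range(3)]
out = attention_head(x, W_q, W_k, W_v)               # same shape as x: one output per token
```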

The elegance of attention is that it's learnable. The model doesn't have hard-coded rules about pronouns or grammar. It learns what to look for from data. Different attention heads discover different patterns — some grammatical, some semantic, some we don't fully understand.

Intuition: Attention is like a search engine inside the model. Each token broadcasts "here's what I need" (query) and "here's what I have" (key). When needs match offerings, information flows.

Modern transformers use multi-head attention — they run many attention operations in parallel, each learning to look for different kinds of relationships. One head might track syntax, another might track coreference, another might track semantic similarity. The outputs combine.
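
The multi-head version is a small extension of the sketch above (it reuses attention_head, rng, and x from that snippet, and the head count and head sizes are arbitrary toy values): each head gets its own Q, K, and V matrices, and their outputs are concatenated.

```python
def multi_head_attention(x, heads):
    # Run several independent attention heads and concatenate their outputs.
    outputs = [attention_head(x, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outputs, axis=-1)   # real models also apply an output projection

# Four toy heads, each writing 2 of the 8 output dimensions
heads = [tuple(rng.normal(size=(8, 2)) for _ in range(3)) for _ in range(4)]
y = multi_head_attention(x, heads)            # shape (5, 8): 4 heads x 2 dims each
```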

GPT-3, for example, has 96 layers with 96 attention heads each. That's more than 9,000 separate attention operations looking for different patterns — an enormous capacity for learning relationships.

Early interpretability research focused heavily on attention patterns — visualizing which tokens attend to which. It's intuitive and produces beautiful diagrams. But researchers eventually realized that attention patterns alone don't tell the full story. A token might attend strongly to another token but barely use that information. The attention pattern shows where the model is looking, not what it's extracting or how it's using it.

◆ ◆ ◆

Step 4: The MLP — Where Knowledge Lives

The second component of each layer is the MLP (multi-layer perceptron), also called the feed-forward network.

While attention moves information between tokens, the MLP processes each token independently. It's where the model appears to store and retrieve much of its factual knowledge.

The MLP has a simple structure: it expands the representation to a much larger dimension (4x is typical), applies a non-linearity, then compresses back down. This expansion creates a huge number of "neurons" — individual units that can activate for specific patterns.

MLP Structure

Input (4,096 dim) → Expanded (16,384 neurons) → Output (4,096 dim)
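
Here is a minimal sketch of that expand, activate, compress shape. The weights are random placeholders (real models learn them, and typically use GELU-style activations rather than the plain ReLU shown here), and the dimensions are the typical 4x ratio at toy scale.

```python
import numpy as np

def mlp(x, W_in, b_in, W_out, b_out):
    """Feed-forward block: expand, apply a non-linearity, compress back down."""
    hidden = np.maximum(0, x @ W_in + b_in)   # "neurons": each can fire for specific patterns
    return hidden @ W_out + b_out             # their firing pattern is written back to the stream

d_model, d_hidden = 8, 32                     # toy scale; real models use e.g. 4,096 -> 16,384
rng = np.random.default_rng(0)
W_in = rng.normal(size=(d_model, d_hidden))
b_in = np.zeros(d_hidden)
W_out = rng.normal(size=(d_hidden, d_model))
b_out = np.zeros(d_model)

x = rng.normal(size=(5, d_model))             # 5 tokens from the residual stream
out = mlp(x, W_in, b_in, W_out, b_out)        # same shape as x, added back to the stream
```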

Research suggests that MLPs act like key-value memories. Certain neuron patterns activate for specific concepts or facts, and when they fire, they add corresponding information to the residual stream. The fact that "Paris is the capital of France" is somehow encoded in the pattern of which neurons fire when processing France-related tokens.

The MLP is also where most of the model's parameters live — typically 2/3 of total parameters are in MLPs. This makes sense if MLPs are storing factual knowledge: you need a lot of capacity to remember everything the model knows.

This is where mechanistic interpretability gets interesting: if we can identify which neurons encode which facts, we can potentially read out what the model "knows" — or edit that knowledge directly.

Researchers have demonstrated this. In one experiment, they identified neurons associated with the fact "The Eiffel Tower is in Paris" and modified them to say "Rome" instead. The model then consistently claimed the Eiffel Tower was in Rome — even in contexts it had never seen before. This is both exciting (we can edit knowledge!) and concerning (what else could be edited?).

◆ ◆ ◆

Putting It All Together

Let's trace what happens when a transformer processes "The capital of France is":

1. Tokenization: The text becomes 5 tokens, each converted to an embedding vector.

2. Layer 1 Attention: Tokens start attending to each other. "France" might attend to "capital" — they're syntactically related.

3. Layer 1 MLP: The model starts activating neurons related to countries, capitals, geography.

4. Layers 2-95: Information builds up. The model recognizes this is a factual question about geography. Features for "Paris" start activating even though the word hasn't been generated yet.

5. Final Layer: The residual stream for the last token ("is") gets converted to a probability distribution. "Paris" has the highest probability.

6. Output: "Paris" is selected and returned. The process repeats for the next token.

All of this happens in milliseconds. The entire forward pass — tokenization, 96 layers of attention and MLPs, final prediction — executes faster than you can blink. The architecture is massively parallelizable, which is why transformers scale so well on modern GPUs.
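Putting the pieces together, the whole forward pass is little more than the loop below. This is a toy-scale sketch with random, untrained weights and a simplified MLP, so its "predictions" are meaningless; it is only meant to show the order of operations from text to next-token probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"The": 0, "capital": 1, "of": 2, "France": 3, "is": 4, "Paris": 5}
d_model, n_layers = 8, 4                          # toy scale; GPT-3 uses 12,288 and 96

E = rng.normal(size=(len(vocab), d_model)) * 0.5  # embedding table (would be learned)
layers = [{name: rng.normal(size=(d_model, d_model)) * 0.2
           for name in ("Wq", "Wk", "Wv", "Wmlp")} for _ in range(n_layers)]

def attention(x, L):
    # Causal single-head attention: each token gathers information from earlier tokens.
    Q, K, V = x @ L["Wq"], x @ L["Wk"], x @ L["Wv"]
    scores = Q @ K.T / np.sqrt(d_model)
    scores = np.where(np.triu(np.ones_like(scores), 1) == 1, -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

def forward(text):
    ids = [vocab[w] for w in text.split()]            # 1. tokenize
    x = E[ids]                                        # 2. embed into the residual stream
    for L in layers:                                  # 3. each layer ADDS to the stream
        x = x + attention(x, L)
        x = x + np.maximum(0, x @ L["Wmlp"])          #    (simplified stand-in for the MLP)
    logits = x[-1] @ E.T                              # 4. last token's stream scores the vocab
    p = np.exp(logits - logits.max())
    return p / p.sum()                                # 5. next-token probability distribution

probs = forward("The capital of France is")
print({w: round(float(probs[i]), 3) for w, i in vocab.items()})
```
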

Component        | What It Does                           | Analogy
Embeddings       | Convert tokens to vectors              | Dictionary lookup
Attention        | Move info between tokens               | Search engine
MLP              | Process & store knowledge              | Database
Residual Stream  | Carries information through the model  | Highway

◆ ◆ ◆

Why This Matters for Interpretability

Understanding this architecture reveals where to look when we try to interpret models:

The residual stream is the natural place to search for interpretable features — it's where all information flows.

Attention patterns show us what information the model thinks is relevant — which tokens are "talking" to which.

MLP neurons are candidates for storing specific knowledge — though as we'll see tomorrow, it's more complicated than "one neuron, one fact."

The layered structure suggests that different types of processing happen at different depths — early layers might handle syntax, later layers semantics, final layers task-specific reasoning.

But here's the catch: while this architecture tells us where to look, it doesn't make interpretation easy. A transformer with 96 layers, 100 attention heads per layer, and 16,000 neurons per MLP has billions of moving parts. How do you find meaning in that ocean of numbers?

The early hope was that individual neurons would be interpretable — that we'd find a "Paris neuron" and a "cat neuron" and could understand the model as a collection of concept detectors. That hope ran into a wall called polysemanticity.


Tomorrow

The Polysemanticity Problem — Researchers expected neurons to represent clean concepts. What they found was a mess. One neuron firing for "cats" AND "cars" AND "certain fonts." Why does this happen? And what does it mean for interpretability?

