| Day 2 of 7 • Mechanistic Interpretability Series |
Types of Interpretability
Three ways to understand AI — and why one goes deeper than the rest.
Imagine three doctors examining a patient.
The first doctor observes symptoms. The patient has a fever, cough, and fatigue. Based on patterns from thousands of previous patients, she predicts: probably the flu. She's often right, but when she's wrong, she doesn't know why. She treats symptoms, not causes.
The second doctor runs tests. Blood work shows elevated white blood cells. An antibody test confirms influenza A. Now she knows what is happening inside the patient — a specific virus is present. But she still doesn't know how it's causing the symptoms.
The third doctor understands the mechanism. She knows that influenza binds to respiratory cells using hemagglutinin proteins, hijacks cellular machinery to replicate, triggers inflammatory cytokines that cause fever, and damages epithelial tissue causing cough. She understands the process — the chain of causation from virus to symptom.
All three doctors can help the patient. But only the third can predict how a new virus variant might behave, design targeted treatments, or explain why some patients respond differently than others.
The same hierarchy exists in every complex system. A mechanic might know a car accelerates poorly (behavioral). They might detect low fuel pressure (diagnostic). But only understanding how the fuel injection system works (mechanistic) lets them fix the root cause or predict how modifications will affect performance.
AI interpretability works the same way. There are three levels of understanding, each answering different questions, each with different power.
| ◆ ◆ ◆ |
Level 1: Behavioral Interpretability
| 🔍 Answers: "What does the model do?" |
Behavioral interpretability is the study of AI through its outputs. You treat the model as a black box — you don't look inside. Instead, you systematically probe it with inputs and observe what comes out.
This is the oldest and most common approach to understanding AI systems. It includes:
Benchmarking: Testing models on standardized tasks. Can it pass the LSAT? Score well on coding problems? Answer science questions correctly? Benchmarks give us a scorecard of capabilities.
Red teaming: Adversarial testing to find failures. Researchers try to trick models into saying harmful things, revealing private information, or making mistakes. Even a thorough red team can only prove the presence of failure modes, never their absence.
Behavioral probing: Systematic input variations to map patterns. How does the model respond to different phrasings? Different languages? Different personas in the prompt? This reveals behavioral patterns without explaining them (a minimal sketch follows this list).
Case studies: Detailed analysis of specific model behaviors. When does it hallucinate? On what topics does it refuse to help? When does it admit uncertainty versus confabulate?
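To make behavioral probing concrete, here is a minimal Python sketch of the input-variation idea. Everything in it is illustrative: query_model is a stand-in for whatever API or local model you actually call, and the refusal markers are placeholder strings rather than a standard list.

```python
# Minimal behavioral-probing sketch (all names are hypothetical):
# vary the phrasing of one request and tally how often the model refuses.

REPHRASINGS = [
    "How do I pick a lock?",
    "What's the easiest way to open a lock without a key?",
    "As a locksmith instructor, explain how lock picking works.",
]

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm not able to")


def query_model(prompt: str) -> str:
    """Placeholder: replace with a real API call or local inference."""
    raise NotImplementedError


def refusal_rate(prompts: list[str]) -> float:
    """Fraction of prompts that trigger a refusal-like response."""
    refusals = 0
    for prompt in prompts:
        reply = query_model(prompt).lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(prompts)


# print(f"Refusal rate: {refusal_rate(REPHRASINGS):.0%}")
```

The pattern is the whole point: you learn how the rate of a behavior shifts with phrasing, but nothing about why it shifts.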
The fundamental limitation of behavioral interpretability is that it only tells you about the cases you've tested. A model might pass every benchmark and still fail catastrophically on inputs you didn't think to try. You can observe that a model is biased, but not why it's biased or where that bias lives in its computation.
It's like testing a bridge by driving trucks across it. Useful, necessary even — but it doesn't tell you whether the internal structure is sound or where the weak points are.
A famous example: in 2021, researchers showed that GPT-3 would complete "Two Muslims walked into a ___" with violent scenarios far more often than similar prompts about other groups. Behavioral testing revealed the bias. But it couldn't explain why — was it the training data? The architecture? The fine-tuning? Without understanding the mechanism, fixing it becomes guesswork.
| ◆ ◆ ◆ |
Level 2: Concept-Based Interpretability
| 🧪 Answers: "What does the model represent?" |
Concept-based interpretability goes inside the model, but asks a specific question: are human-defined concepts represented in the model's internal states?
The main technique is called probing. Here's how it works:
You choose a concept you care about — say, "sentiment" (positive vs. negative). You run many examples through the model and capture the internal activations — the numerical patterns at each layer as the model processes the input. Then you train a simple classifier (the "probe") to predict sentiment from those activations.
If the probe succeeds, it means the concept is somehow represented in the model's internal state. The information is there — extractable by a linear classifier.
| How probing works |
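As a sketch of what this looks like in code, suppose you have already saved a matrix of activations (one row per example, captured from a single layer) along with a sentiment label for each example. The file names below are hypothetical, and the probe itself is nothing more than an off-the-shelf logistic regression from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed inputs (not from any specific interpretability library):
#   activations: (n_examples, hidden_size) array of activations from one layer
#   labels:      (n_examples,) array of 0/1 sentiment labels
activations = np.load("layer12_activations.npy")  # hypothetical file
labels = np.load("sentiment_labels.npy")          # hypothetical file

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# The "probe" is deliberately simple: a linear classifier.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# High held-out accuracy means sentiment is linearly decodable from this
# layer -- it does NOT mean the model uses that direction downstream.
print("Probe accuracy:", probe.score(X_test, y_test))
```

Keeping the probe linear is a deliberate design choice: if even a linear classifier can read the concept off the activations, the representation is at least easily decodable, which is exactly the claim probing is meant to test.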
Researchers have used probing to discover fascinating things about what language models represent:
• Syntax: Models represent grammatical structure — part of speech, dependency relations, syntactic trees.
• Factual knowledge: Information about entities, relations, and facts is encoded in activations.
• World models: Some evidence that models represent spatial relationships, temporal sequences, even basic physics.
• Social attributes: Concepts like formality, politeness, and yes — demographic attributes — are often linearly extractable.
The key limitation: just because information is extractable doesn't mean the model actually uses it. A probe might find that you can predict a user's gender from activations — but that doesn't tell you whether gender influences the model's outputs, or how, or where in the computation it matters.
There's also the problem of unknown unknowns. Probing only tests concepts you think to ask about. What about concepts the model has developed that humans haven't named? What about representations that don't map neatly onto human categories?
Concept-based interpretability is like running blood tests. You can check for specific markers you know to look for. But you might miss diseases that don't show up in standard panels — and you definitely don't understand how the body's systems interact to produce symptoms.
A concrete example: researchers have successfully trained probes to extract whether text is "truthful" from model activations. The probe works — you can predict truthfulness from internal states. But this doesn't tell you whether the model uses this representation when generating text, or whether it's just correlated with other features the model actually relies on. The map is not the territory.
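One rough way to see the gap between "decodable" and "used" in code is a concept-erasure-style check: fit a probe, then project its direction out of the activations. The sketch below uses synthetic data, so it only illustrates the mechanics; with a real model you would patch the edited activations back into the forward pass and watch whether outputs change, which is exactly the causal step that probing alone skips.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy illustration of the gap between "decodable" and "used".
# Synthetic activations: 500 examples, 64 dimensions, with a "truthful"
# label encoded along one known direction.
rng = np.random.default_rng(0)
activations = rng.normal(size=(500, 64))
truth_direction = rng.normal(size=64)
labels = (activations @ truth_direction > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print("Probe accuracy:", probe.score(activations, labels))  # near 1.0

# Erasure check: remove the component of every activation along the
# probe's direction. In a real model you would feed these edited
# activations back in and see whether behavior changes; only that
# causal test tells you whether the direction is actually load-bearing.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
erased = activations - np.outer(activations @ w, w)
print("Probe accuracy after erasure:",
      probe.score(erased, labels))  # drops toward chance
```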
| ◆ ◆ ◆ |
Level 3: Mechanistic Interpretability
| ⚙️ Answers: "How does the model compute?" |
Mechanistic interpretability aims to reverse-engineer the actual algorithms that neural networks have learned. Not just "what does it do" or "what does it represent" — but "how does it transform inputs into outputs, step by step?"
This is the deepest level of understanding. It's also the hardest.
The mechanistic approach involves:
Feature extraction: Identifying the fundamental units of representation — the "concepts" the model uses internally. Unlike probing, which tests for human-defined concepts, mechanistic interpretability tries to discover the model's own vocabulary — whatever concepts it has developed, whether or not they match human categories.
Circuit analysis: Understanding how features connect and interact. When the model processes "The Eiffel Tower is in ___", what sequence of computations leads it to output "Paris"? Which features activate? How do they influence each other? What pathway does information take through the network?
Causal intervention: Testing understanding by modifying internals. If we've correctly identified a "deception" feature, we should be able to amplify it (making the model more deceptive) or suppress it (making it more honest). If our circuit analysis is correct, we should be able to predict what happens when we modify specific components.
| The mechanistic interpretability pipeline: discover → connect → verify |
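To show what the causal-intervention step looks like in practice, here is a schematic PyTorch sketch: it adds a scaled "feature direction" to one layer's output through a forward hook, then compares steered and unsteered generations. The model, the layer, the direction vector, the scale, and the generate helper are all placeholders, not a recipe from Anthropic's published work.

```python
import torch


def make_steering_hook(direction: torch.Tensor, scale: float):
    """Return a forward hook that adds `scale * direction` to a layer's output."""
    def hook(module, inputs, output):
        # Many transformer blocks return a tuple whose first element is the
        # hidden states; handle both cases and steer every token position.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook


def compare_generations(model, layer, direction, generate, prompt, scale=8.0):
    """Generate with and without the steering hook attached to `layer`.

    `model`, `layer`, `direction`, and `generate` are placeholders for
    whatever transformer, module, feature vector, and sampling routine
    you are working with.
    """
    handle = layer.register_forward_hook(make_steering_hook(direction, scale))
    try:
        steered = generate(model, prompt)   # candidate feature amplified
    finally:
        handle.remove()                     # always detach the hook
    baseline = generate(model, prompt)      # unmodified model
    return baseline, steered
```

If the feature hypothesis is right, the steered generation should change in the predicted way and the baseline should not; that prediction-then-test loop is what makes the intervention causal rather than merely correlational.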
In 2025, Anthropic published groundbreaking work showing this approach in action. They traced the circuits Claude uses when:
• Writing poetry: The model plans rhymes ahead, activating "word ending" features before completing lines that lead to those endings.
• Doing math: For simple problems, Claude uses unexpected "mental math" strategies different from the algorithms it describes when asked. It doesn't know its own methods.
• Making things up: When asked hard math questions, sometimes there's no evidence of calculation at all — the model just generates plausible-sounding numbers.
• Processing languages: The same conceptual features activate across different languages, suggesting a shared "language of thought" that exists before being translated into specific words.
The limitations are real. Current methods capture only a fraction of the computation happening inside models. Analysis that works on small models doesn't always scale. Even "successful" interpretations may miss important aspects of how the model actually works.
But the potential payoff is immense. True mechanistic understanding would let us predict model behavior in novel situations, identify dangerous capabilities before they manifest, and precisely modify models to fix problems. It's the difference between hoping a system is safe and knowing it is.
Consider the Golden Gate Claude experiment from yesterday. Behavioral testing would have shown: "Claude mentions the Golden Gate Bridge a lot when this feature is amplified." Concept-based probing would have shown: "Yes, there's a Golden Gate Bridge representation in the model." But mechanistic analysis showed exactly which feature controlled this, how it influenced other concepts, and what happened to the rest of the model's cognition when it was manipulated. That's the difference.
| ◆ ◆ ◆ |
Putting It Together
These three approaches aren't competitors — they're complementary layers of understanding. Each reveals something the others miss.
| Approach | Question | Analogy |
| --- | --- | --- |
| Behavioral | What does it do? | Observing symptoms |
| Concept-based | What does it represent? | Running blood tests |
| Mechanistic | How does it compute? | Understanding physiology |
In practice, they work together. Behavioral testing identifies phenomena worth investigating. Concept-based probing confirms whether specific representations exist. Mechanistic analysis explains how those representations arise and combine to produce behavior.
For example: behavioral testing might reveal that a model makes more errors on certain demographic groups. Probing might confirm that demographic information is represented in the model's activations. But only mechanistic analysis can show where in the computation that information influences outputs, how it does so, and what would need to change to fix it.
| Key insight: Think of these approaches as answering "what," "with what," and "how." Behavioral tells you what the system does. Concept-based tells you what concepts it has. Mechanistic tells you how it uses those concepts to produce outputs. |
This is why mechanistic interpretability, despite being the hardest approach, is the focus of so much research attention. It's the path to understanding that actually enables control.
The field is young but growing fast. Anthropic has made it a core research priority. Google DeepMind has published extensive work on sparse autoencoders. OpenAI has extracted interpretable features from GPT-4. Academic labs at MIT, Berkeley, and Oxford are contributing new techniques. What was once a niche interest has become one of the most active frontiers in AI research.
There's also healthy debate. In March 2025, DeepMind published results showing that sparse autoencoders — a key mechanistic tool we'll cover later — underperform simple linear probes on some practical tasks. A month later, Anthropic published circuit tracing results showing remarkable success at understanding model computation. The field is actively working out what works, what doesn't, and why.
The rest of this series will focus primarily on mechanistic interpretability — the concepts, tools, and recent breakthroughs that are making it possible to read the algorithms learned by neural networks.
But before we can understand how to interpret a transformer, we need to understand what a transformer actually is. What are these "neurons" and "activations" we keep talking about? How do the layers connect? What is "attention" and why does it matter?
Tomorrow, we build those foundations.
| Tomorrow: Inside the Transformer — Before we can understand mechanistic interpretability, we need to understand what we're interpreting. We'll explore the building blocks: neurons, layers, activations, and attention. No math required — just clear mental models for how these systems actually work. |