Day 7: Circuits & The Future
Day 7 of 7 • Mechanistic Interpretability Series • Finale

Circuits & The Future

From features to computations — and where mechanistic interpretability goes next.


We've come a long way this week.

We started with the black box problem — AI systems that work but can't be understood. We learned about the transformer architecture and why neurons are polysemantic. We understood superposition and how sparse autoencoders can extract interpretable features.

But features are only half the story. They're like knowing the vocabulary of a language but not the grammar.

Knowing that a model has a "Golden Gate Bridge" feature doesn't tell us how it uses that feature. Knowing it has features for "deception" doesn't tell us when those features activate or what triggers them. Features are the vocabulary. To understand the model, we need the grammar — the rules for how features combine to produce behavior.

That's what circuits are about.

◆ ◆ ◆

What Is a Circuit?

A circuit is a computational pathway through the network — a connected set of features that work together to implement a specific behavior.

Think of it like tracing wires in an electronic device. Individual components (features) do specific things, but they only produce useful behavior when connected in particular ways. The circuit is the pattern of connections that makes the behavior happen.

In a transformer, information flows through attention heads and MLP layers. Each step can read from and write to the residual stream. A circuit is a specific pathway through this machinery — a subset of components that together implement a coherent function.

For example, imagine asking a model "What is the capital of France?" A circuit for answering this might involve:

1. Features that recognize this is a "capital city" question

2. Features that identify "France" as the target country

3. Features that retrieve the association "France → Paris"

4. Features that format "Paris" as the appropriate response

The circuit is how these features connect — how information flows from recognizing the question type, to identifying the country, to retrieving the answer, to producing output.

This is different from just looking at individual features. A feature is a single component; a circuit is an organized system. Understanding circuits means understanding the algorithms that neural networks have learned to implement.
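
To make this concrete, here is a deliberately over-simplified sketch of the France example as ordinary Python functions wired into a pipeline. Every name and rule below is invented for illustration; real features are directions in activation space spread across attention heads and MLP layers, not tidy functions.

```python
# Toy illustration only: a "circuit" as a few named feature functions and
# the wiring between them. Nothing here reflects a real model's internals.

FACTS = {("capital_city", "France"): "Paris"}

def question_type(text):
    return "capital_city" if "capital" in text.lower() else None

def country_identity(text):
    return "France" if "France" in text else None

def fact_retrieval(qtype, country):
    return FACTS.get((qtype, country))

def output_format(fact):
    return f"{fact}." if fact else "I'm not sure."

def run_circuit(text):
    # The circuit is the wiring: question type and country feed retrieval,
    # and retrieval feeds the output-formatting step.
    return output_format(fact_retrieval(question_type(text), country_identity(text)))

print(run_circuit("What is the capital of France?"))  # -> "Paris."
```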

A Simple Circuit

[Diagram: Question Type → Country Identity → Fact Retrieval → Output Format]

Features connected in a computational pathway

◆ ◆ ◆

The March 2025 Breakthrough

In March 2025, Anthropic published groundbreaking research on circuit tracing in Claude. For the first time, researchers traced detailed computational pathways through a production-scale language model. This wasn't a toy model or a simplified setup; it was a production Claude model of the kind millions of people use every day.

The technique is called circuit tracing. The idea is to build an attribution graph that tracks which features influence which other features, mapping the causal connections that implement a behavior.

Here's a simplified version of how it works:

1. Pick a behavior — something specific like "answering factual questions" or "refusing harmful requests"

2. Identify output features — which features are active when producing the behavior?

3. Trace backwards — what earlier features contributed to activating the output features?

4. Repeat — keep tracing back until you reach the input

What emerges is a map of the computation — a circuit diagram showing how information flows from input to output. It's like reverse-engineering software by tracing which functions call which other functions, except the "functions" are distributed neural computations.
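
As a rough sketch of the walk-backwards idea, here is a toy version over a stack of purely linear feature layers. The data layout, threshold, and shapes are all invented; real circuit tracing has to deal with attention, nonlinearities, and models vastly larger than this.

```python
import numpy as np

# Toy backward attribution tracing. Assumes activations[i] holds layer i's
# feature activations and weights[i] linearly maps layer i features to
# layer i+1 features. This only illustrates "trace back and keep the
# strong edges"; it is not how a real transformer is analyzed.

def trace_circuit(activations, weights, output_feature, threshold=0.1):
    frontier = [(len(activations) - 1, output_feature)]   # start at the output feature
    visited, edges = set(), []
    while frontier:
        layer, feat = frontier.pop()
        if (layer, feat) in visited or layer == 0:        # stop at the input layer
            continue
        visited.add((layer, feat))
        contrib = activations[layer - 1] * weights[layer - 1][:, feat]
        for up in np.where(np.abs(contrib) > threshold)[0]:
            edges.append(((layer - 1, int(up)), (layer, feat), float(contrib[up])))
            frontier.append((layer - 1, int(up)))         # keep tracing backwards
    return edges  # the recovered "circuit": weighted feature-to-feature edges
```

The returned edge list is exactly the kind of circuit diagram described above, just for a toy system.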

The Anthropic team traced circuits for multiple behaviors, including how Claude processes multi-step reasoning, how it handles requests for harmful content, and how it decides when to be uncertain. The results were remarkable: the circuits were often surprisingly interpretable, with recognizable computational steps that matched human intuitions about how the task should be solved.

Key finding: Claude's internal computations often mirror recognizable reasoning steps. The model isn't just pattern-matching — it's implementing something that looks like genuine multi-step reasoning, with interpretable intermediate states.

◆ ◆ ◆

What Circuits Reveal

Circuit analysis has already yielded important discoveries about how language models work:

Induction heads: Early circuit research (2022) discovered "induction heads" — a specific circuit pattern that implements in-context learning. When a model sees "A B ... A", induction heads predict "B" will follow. This simple mechanism underlies much of what makes language models useful — the ability to learn patterns from examples in the prompt.
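
The pattern is simple enough to write down directly. Here is a toy Python version of the behavior; in the real model the work is done in attention, typically with a previous-token head composing with the induction head itself.

```python
# Toy version of what an induction head does: find an earlier occurrence of
# the current token and predict whatever followed it ("A B ... A" -> "B").

def induction_predict(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan the context backwards
        if tokens[i] == current:
            return tokens[i + 1]               # copy the token that followed last time
    return None                                # no earlier match, no prediction

print(induction_predict(["Mr", "Dursley", "said", "Mr"]))  # -> "Dursley"
```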

Factual recall circuits: Researchers have traced how models retrieve factual information. The circuits show a clear pipeline: recognize the query type → identify the subject → retrieve the associated fact → format the output. The model has learned something like a database lookup, implemented in neural circuits. This helps explain both how models know things and why they sometimes get things wrong.
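
Purely as an analogy for why a lookup like this can misfire: if retrieval matches a query against learned key vectors rather than exact keys, a nearby-but-wrong entry can win. The vectors and facts below are invented to make that point in a few lines.

```python
import numpy as np

# Invented vectors, analogy only: "fact retrieval" as a soft match against
# learned keys. When two keys sit close together, a slightly noisy query
# resolves to the wrong entry, one intuition for confident factual errors.

keys = {
    "Austria":   np.array([0.9, 0.1, 0.4]),
    "Australia": np.array([0.8, 0.2, 0.5]),   # deliberately close to "Austria"
    "France":    np.array([0.1, 0.9, 0.2]),
}
capitals = {"Austria": "Vienna", "Australia": "Canberra", "France": "Paris"}

def soft_lookup(query):
    cosine = lambda k: keys[k] @ query / (np.linalg.norm(keys[k]) * np.linalg.norm(query))
    return capitals[max(keys, key=cosine)]     # closest key wins, right or wrong

noisy_austria = keys["Austria"] + np.array([-0.08, 0.08, 0.08])
print(soft_lookup(keys["Austria"]))   # -> "Vienna"
print(soft_lookup(noisy_austria))     # -> "Canberra", not "Vienna"
```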

Safety circuits: The March 2025 work traced circuits involved in Claude's safety behaviors. When Claude refuses a harmful request, specific circuits detect the harmful content, activate refusal-related features, and suppress the generation of harmful outputs. Understanding these circuits is crucial for ensuring safety measures are robust — and for understanding how they might fail or be bypassed.

Reasoning chains: For multi-step problems, circuits show how intermediate conclusions are computed and stored, then used in subsequent steps. The model isn't just predicting tokens — it's building up internal state that represents the progress of a computation.

Perhaps most surprising: the circuits often look like something a human engineer might design. There are modular components with clear functions. There are recognizable algorithmic patterns. The models have independently discovered computational strategies that make sense to us — even though they were never explicitly taught these strategies.

◆ ◆ ◆

Why This Matters for Safety

Circuit-level understanding opens new possibilities for AI safety:

Verifying safety measures: Instead of just testing whether safety training worked (behavioral evaluation), we can examine the circuits and verify that safety-relevant computations are happening correctly. We can check that harmful-content-detection features properly connect to refusal-generation features.

Finding failure modes: By tracing circuits, we can identify cases where safety measures might fail — inputs that bypass the detection circuits, or edge cases where the wrong pathway gets activated. This is more thorough than just testing with examples.

Detecting deceptive reasoning: A model that's being deceptive might have different circuits active than one that's being honest. Circuit analysis could reveal when a model is "thinking about" deceiving the user, even if its outputs look normal.

Surgical fixes: When we find a problematic circuit, we can potentially modify it directly — strengthening connections we want, weakening ones we don't. This is more precise than retraining the entire model and hoping the problem goes away.

Understanding emergent behaviors: As models get larger, they develop new capabilities that weren't explicitly trained. Circuit analysis can help us understand where these capabilities come from and whether they might include dangerous ones we didn't intend.
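
One concrete form of the "surgical fixes" idea above is feature steering. The sketch below assumes a hypothetical sparse autoencoder object with encode and decode methods over one layer's residual stream; the interface and parameters are illustrative assumptions, not any particular library's API.

```python
# Minimal sketch under stated assumptions: `sae` is a hypothetical trained
# sparse autoencoder with encode()/decode() over a layer's residual stream.

def steer_feature(residual, sae, feature_idx, scale=0.0):
    """Re-encode the residual stream, rescale one feature, decode back.
    scale=0.0 ablates the feature; scale > 1.0 strengthens it."""
    feats = sae.encode(residual)          # residual stream -> feature activations
    feats[..., feature_idx] = feats[..., feature_idx] * scale
    return sae.decode(feats)              # edited residual, passed onward in the model

# In practice this would run inside a forward hook on the chosen layer, so
# every forward pass carries the weakened or strengthened feature.
```

Anthropic's "Golden Gate Claude" demo was essentially this kind of edit, with the Golden Gate Bridge feature clamped to a high value.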

This is a fundamentally different approach to AI safety than what came before. Previous approaches relied on training models to behave well and then testing to see if they did. Circuit analysis lets us actually verify the internal mechanisms — checking that the model is safe for the right reasons, not just appearing safe on our tests.

The difference matters. A model might pass all our safety tests for the wrong reasons — perhaps it has learned to detect when it's being tested and behave differently. Circuit analysis can expose this kind of deception because we can see what's actually happening inside.

◆ ◆ ◆

The Road Ahead

Mechanistic interpretability has made remarkable progress, but significant challenges remain:

Scale: Current techniques have been demonstrated on models up to Claude 3 Sonnet scale. Applying them to the largest models (GPT-4, Claude 3 Opus, and beyond) requires further methodological advances and computational resources. The largest models may have qualitatively different internal structures that require new approaches.

Coverage: We can trace some circuits, but not all. Important behaviors might depend on circuits we haven't yet found or can't yet analyze. Complete coverage remains a distant goal. There's always the worry that the most dangerous capabilities are hiding in the circuits we haven't mapped.

Automation: Currently, interpreting features and circuits requires significant human effort. Scaling to millions of features and countless circuits requires automated interpretation methods — using AI to understand AI. This is an active area of research with promising early results.

Real-time application: For interpretability to be useful in deployment, we need techniques that work in real-time — monitoring circuits as the model runs, flagging concerning patterns, intervening when necessary. Current techniques are too slow for production use, but this is an active area of development.

Keeping pace with capabilities: AI capabilities are advancing rapidly. Interpretability research needs to keep pace. If we can't understand next year's models, our safety tools won't work on them. This creates urgency — we need to solve these problems before they become critical.

The field is moving fast. What seemed impossible five years ago is now routine. What seems hard today may be solved tomorrow. The combination of better techniques, more compute, and growing urgency is accelerating progress.

Multiple organizations are now investing heavily in interpretability research. Anthropic has made it a core part of their safety strategy. DeepMind, OpenAI, and major academic labs are publishing important work. The field is no longer a niche interest — it's recognized as critical infrastructure for AI safety.

2020: Early circuit analysis in vision models
2022: Induction heads discovered; Toy Models of Superposition
2023: SAEs extract interpretable features from small models
2024: Scaling Monosemanticity: millions of features from Claude
2025: Circuit tracing in production models
Future: Real-time monitoring? Complete model maps?

◆ ◆ ◆

What We've Learned

Let's bring it all together.

The problem: AI systems are black boxes. We can see what they do, but not why. As these systems become more powerful, this opacity becomes dangerous.

The challenge: Neural networks store information in polysemantic neurons via superposition. The meaningful units (features) aren't directly visible — they're directions in high-dimensional space.

The breakthrough: Sparse autoencoders can extract interpretable features from the neural noise. We can find concepts like "Golden Gate Bridge" or "deception" encoded in the model's representations.

The frontier: Circuit tracing lets us see how features connect to implement behaviors. We're beginning to read the "source code" of neural networks — understanding not just what they represent, but how they compute.

The goal: Interpretability tools that let us verify AI systems are safe, detect problems before they cause harm, and build AI we can genuinely trust.

We're not there yet. But for the first time, the path is visible. We have tools that can extract meaningful features from frontier models. We have techniques that can trace computational circuits. We have a research community that's making rapid progress.

The black box is starting to become transparent. And that transparency might be what allows us to build AI systems that are genuinely safe and beneficial — systems we can understand, verify, and trust.


Thank you for reading this series.

Mechanistic interpretability is one of the most important fields in AI right now. If this series helped you understand why — and how — then it's done its job.

The series: Day 1: What is AI Interpretability? → Day 2: Types of Interpretability → Day 3: Inside the Transformer → Day 4: The Polysemanticity Problem → Day 5: Features & Superposition → Day 6: Sparse Autoencoders → Day 7: Circuits & The Future
