Day 5 of 7 • Mechanistic Interpretability Series

Features & Superposition

How neural networks pack infinite concepts into finite space — and how we can unpack them.


Yesterday we discovered the polysemanticity problem: neurons don't represent clean concepts. Instead, they're tangled messes responding to multiple unrelated things.

This might seem like a dead end. If the natural units of neural networks (neurons) aren't interpretable, how can we ever understand what's happening inside?

Today we're going deeper. We'll understand why this happens mathematically, what "features" actually are, and how this understanding unlocked a new approach to interpretability.

This is the conceptual foundation for everything that follows.

◆ ◆ ◆

What Is a Feature?

Before we go further, we need to define our terms. In mechanistic interpretability, a feature is a fundamental unit of representation — the atomic building blocks of what a neural network "thinks about."

Features are the concepts, patterns, and abstractions that the network has learned to recognize and use. Some examples of what features might represent:

• A specific entity: "Golden Gate Bridge", "Barack Obama", "Python programming language"

• An abstract concept: "uncertainty", "sarcasm", "logical contradiction", "inner conflict"

• A syntactic pattern: "end of sentence", "start of quote", "list item", "code block beginning"

• A contextual signal: "the user is asking for code", "this is medical advice", "formal tone requested"

• A safety-relevant concept: "deceptive intent", "harmful request", "user vulnerability"

The key hypothesis of mechanistic interpretability is that there exist clean, interpretable features — and that the model's behavior can be understood as the interaction of these features. The challenge is that features aren't directly visible. They're hidden in the neural activations, encoded in a way that doesn't align with individual neurons.

This is a profound claim. It suggests that despite the apparent chaos of polysemanticity, there's an underlying order. The mess we see when we look at neurons is like looking at a pointillist painting from too close — step back, change your perspective, and coherent images emerge.

Feature (n.): A direction in neural activation space that corresponds to a meaningful concept. Features are what the model "sees" and "thinks about" — the vocabulary of its internal language.
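To make that definition concrete, here is a tiny NumPy sketch. Everything in it is invented for illustration (the 512-dimensional width, the two feature directions, the intensities); the point is only that reading a feature means projecting the activation onto a direction, while reading a single neuron mixes everything together:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 512                    # width of a hypothetical activation vector
# Two invented feature directions (unit vectors), purely for illustration.
f_python = rng.normal(size=d_model)
f_python /= np.linalg.norm(f_python)
f_sarcasm = rng.normal(size=d_model)
f_sarcasm /= np.linalg.norm(f_sarcasm)

# An activation where "Python" is strongly present and "sarcasm" weakly present.
activation = 3.0 * f_python + 0.4 * f_sarcasm

# Reading a feature = projecting the activation onto its direction.
print(activation @ f_python)     # close to 3.0
print(activation @ f_sarcasm)    # close to 0.4, give or take the small overlap
                                 # between the two random directions

# Reading a single *neuron* (one coordinate) mixes both features together.
print(activation[0])             # = 3.0 * f_python[0] + 0.4 * f_sarcasm[0]
```

Neurons are coordinates; features are directions; and nothing forces the two to line up. That mismatch is exactly the polysemanticity we ran into yesterday.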

◆ ◆ ◆

The Geometry of Superposition

Here's where things get interesting. Superposition isn't magic — it's geometry. And once you see the geometry, the whole thing clicks.

Let's start with a simple example. Imagine you have a 2-dimensional space — a flat piece of paper. How many distinct directions can you draw on that paper?

If you require directions to be perfectly perpendicular (orthogonal), you can only fit 2: one pointing right, one pointing up. That's it. In 2D, you have exactly 2 orthogonal directions. This seems limiting.

But what if you relax the requirement? What if directions just need to be mostly different — say, at least 60 degrees apart?

Now you can fit more. You could have directions pointing at 0°, 60°, 120°, 180°, 240°, and 300°. That's 6 directions in a 2D space — three times more than the number of dimensions.

[Figure: 2D space, orthogonal vs. "almost orthogonal" packing. Orthogonal: 2 directions. "Almost" orthogonal (at least 60° apart): 6 directions.]

This is the core insight of superposition: by allowing features to be "almost orthogonal" instead of perfectly orthogonal, you can pack far more features into the same space.

The tradeoff is precision. Perfectly orthogonal directions don't interfere at all — you can read each one independently. Almost-orthogonal directions interfere a little — reading one gives you a faint echo of the others. But if that echo is small enough, and if you can tolerate occasional errors, the capacity gain is enormous.

And here's the kicker: as you increase dimensions, the effect becomes exponentially more dramatic.

◆ ◆ ◆

The Curse Becomes a Blessing

In machine learning, there's a famous problem called the "curse of dimensionality" — as dimensions increase, spaces become increasingly vast and sparse, making learning harder.

But for superposition, high dimensionality is a blessing. In high-dimensional spaces, there's a mind-bending amount of room for "almost orthogonal" directions.

Here's a concrete example that might break your intuition:

Mind-bending fact: In a 1,000-dimensional space, you can fit roughly 500,000 directions that are at least 87° apart (nearly orthogonal). That's 500x more features than dimensions.

In a 4,096-dimensional space — typical for a transformer layer — the numbers become astronomical. You can theoretically encode millions of nearly-orthogonal directions.

This is deeply counterintuitive. Our brains evolved in 3D space, so we have no intuition for what high-dimensional geometry looks like. But the math is clear: high dimensions have vastly more "room" than low dimensions, and most of that room is in diagonal directions that don't align with any axis.
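You don't have to take this on faith. Even random directions, with no clever packing at all, come out nearly orthogonal in high dimensions. The sketch below picks an arbitrary 2,000 random directions in a 1,000-dimensional space and measures how much they overlap:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1_000, 2_000                      # 1,000 dimensions, 2,000 random directions

# Random unit vectors in R^d.
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Pairwise cosine similarities, excluding each vector paired with itself.
cos = V @ V.T
off_diag = np.abs(cos[~np.eye(n, dtype=bool)])

print(off_diag.mean())            # ~0.025: typical overlap is a couple of percent
print((off_diag < 0.1).mean())    # ~0.998: almost every pair is within ~6 degrees
                                  # of perpendicular
```

Deliberate packing can do better still than random choice, which is the regime that the 500,000-direction figure above is pointing at.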

This is how neural networks store so much knowledge. They're not limited to one concept per neuron. They exploit the geometry of high-dimensional space to pack in vastly more features than they have dimensions.

◆ ◆ ◆

The Cost of Superposition

There's no free lunch. Superposition comes with a cost: interference.

When features aren't perfectly orthogonal, activating one feature slightly activates others. If "cat" and "car" share some neural directions, then thinking about cats creates a small ghost signal for cars.

Most of the time, this interference is small enough to ignore. If "cat" is 87° away from "car", the interference is tiny — cos(87°) ≈ 0.05. A 5% ghost signal usually doesn't matter. The model's downstream computations can filter it out.
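Here is that 5% ghost signal as a two-line calculation (the "cat" and "car" directions are placed 87° apart by hand, purely to match the numbers above):

```python
import numpy as np

# Two feature directions placed 87 degrees apart in a plane.
theta = np.deg2rad(87)
f_cat = np.array([1.0, 0.0])
f_car = np.array([np.cos(theta), np.sin(theta)])

activation = 1.0 * f_cat           # the model is "thinking about cats" only

print(activation @ f_cat)          # 1.00  -- the real signal
print(activation @ f_car)          # ~0.05 -- the ghost "car" signal, cos(87 degrees)
```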

But occasionally it causes problems. This might explain some of the weird failures we see in language models — strange associations, unexpected completions, confusions between superficially similar concepts. Some of these could be interference between features that ended up too close together in the model's representational space.

Researchers have found evidence for this. Some model failures seem to involve "feature collision" — situations where two concepts the model should distinguish have ended up with overlapping representations. The model gets confused not because it lacks the relevant knowledge, but because the knowledge is stored in an interfering way.

The model has learned to tolerate some interference because the benefit (storing more concepts) outweighs the cost (occasional errors). It's a tradeoff that training optimizes automatically.

More superposition:
✓ More features stored
✗ More interference
✗ Harder to interpret

Less superposition:
✗ Fewer features stored
✓ Less interference
✓ Easier to interpret

◆ ◆ ◆

Sparsity Is the Key

There's one more crucial piece to this puzzle: sparsity.

Superposition only works because most features are inactive most of the time. When you're reading about quantum physics, your "cat" feature is probably off. When you're writing Python code, your "medieval history" feature is dormant. At any given moment, only a tiny fraction of features are active.

This sparsity is what makes superposition tolerable. Yes, features interfere with each other — but only when they're both active. If features that share neurons are rarely active together, the interference rarely matters.

How sparse are features in practice? Extremely sparse. In a given context, only a tiny percentage of all possible features are active. When you read a sentence about cooking, features for "ingredients", "kitchen", and "recipes" might be on — but features for "quantum entanglement", "medieval warfare", and "JavaScript debugging" are all off. The vast majority of features are dormant at any given moment.

Think back to the radio station analogy. You can put multiple stations on the same frequency if they broadcast at different times. Morning talk show on one, late-night jazz on another. They share the channel but never interfere because they're never on simultaneously.

Neural networks exploit the same principle. Features that tend to co-occur get assigned to well-separated directions (less interference when it matters). Features that rarely co-occur can share more neural resources (interference doesn't matter if they're never both on).

This explains why the model can store knowledge about quantum physics and medieval history in overlapping neurons. You're rarely going to be thinking about both simultaneously. But features that commonly co-occur — like "Python" and "code" and "programming" — get more carefully separated representations.

The superposition recipe: High dimensions + sparse activations = massive capacity for features. This is why neural networks can know so much while having "only" billions of parameters.
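Here is the whole recipe in one sketch: far more features than dimensions, random nearly-orthogonal directions, and only a handful of features active at once. The specific numbers (10,000 features, 1,000 dimensions, 20 active) are arbitrary; the point is that the active features read back out cleanly while the interference on everything else stays faint:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d = 10_000, 1_000          # 10x more features than dimensions
k_active = 20                          # sparsity: only 20 features "on" at once

# Give every feature a random (nearly orthogonal) direction in R^d.
F = rng.normal(size=(n_features, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# A sparse "thought": k_active features at intensity 1.0, the rest silent.
active = rng.choice(n_features, size=k_active, replace=False)
activation = F[active].sum(axis=0)

# Read every feature back out by projection.
readout = F @ activation
inactive = np.delete(readout, active)

print(readout[active].mean())      # ~1.0: the active features come back out
print(np.abs(inactive).mean())     # ~0.1: faint interference on the other 9,980
```

With these numbers, a simple threshold on the readout recovers the active set with only the occasional mistake, which is exactly the tradeoff described above: a little interference in exchange for ten times the capacity.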

◆ ◆ ◆

From Theory to Practice

In 2022, Anthropic published "Toy Models of Superposition" — a paper that rigorously demonstrated these phenomena in controlled settings. They trained tiny neural networks on synthetic tasks and showed exactly how superposition emerges. It was interpretability research at its finest: simplify the problem until you can understand it completely, then use that understanding to tackle the real thing.
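If you want to see the phenomenon emerge from training rather than from hand-placed directions, here is a compressed sketch in the spirit of the paper's setup. It is not the paper's code, and the hyperparameters (20 features, 5 hidden dimensions, 5% activation probability, decaying importance) are illustrative choices only:

```python
import torch

torch.manual_seed(0)
n_feat, d_hidden = 20, 5                    # 20 features squeezed into 5 dimensions
p_active = 0.05                             # each feature is "on" only 5% of the time
importance = 0.9 ** torch.arange(n_feat)    # earlier features matter more

W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_feat))
b = torch.nn.Parameter(torch.zeros(n_feat))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5_000):
    # Sparse synthetic features: independent, mostly zero, uniform when active.
    x = torch.rand(1024, n_feat) * (torch.rand(1024, n_feat) < p_active)
    h = x @ W.T                              # compress: 20 features -> 5 dimensions
    x_hat = torch.relu(h @ W + b)            # reconstruct
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of W is a learned feature direction. Column norms near 1 mean the
# feature got represented; the off-diagonal of W.T @ W shows the interference.
print((W.norm(dim=0) > 0.5).sum().item(), "of", n_feat, "features represented")
```

If this sketch behaves like the paper's models, lowering p_active packs more features into the 5 available dimensions, while raising it toward 1 makes the model fall back to keeping only the few most important features.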

The findings confirmed the theory:

More features than dimensions: Networks learned to represent more features than they had neurons, exactly as the geometry predicted.

Sparsity enables superposition: Sparser features (those active less often) were packed more aggressively into superposition.

Importance matters: Important features got more orthogonal directions (less interference), while less important features were squeezed into whatever space remained.

Phase transitions: As you change the balance between sparsity and importance, the network undergoes sharp transitions between different representational strategies. These aren't gradual changes but sudden reorganizations — like water freezing into ice.

The researchers also found beautiful geometric structures. In some cases, features arranged themselves into perfect mathematical shapes — simplices, polytopes, regular geometric forms. The network wasn't randomly scattering features; it was finding optimal packings, the same packings that mathematicians had studied for centuries in abstract geometry.

This paper was a turning point. It showed that superposition wasn't just a vague intuition — it was a quantifiable, predictable phenomenon. And if we understand how features are packed into superposition, maybe we can learn to unpack them.

The paper also revealed something important: the structure of superposition isn't random. Features organize themselves in predictable geometric patterns — sometimes forming regular polytopes, sometimes clustering by similarity, sometimes creating hierarchical arrangements. There's order in the chaos, if you know how to look for it.

That's exactly what sparse autoencoders do. They're designed to reverse superposition — to take the tangled, polysemantic neuron activations and decompose them into clean, monosemantic features.

Tomorrow, we'll see exactly how they work.

Why This Matters for Safety

Understanding superposition isn't just academically interesting — it has direct implications for AI safety.

If we can decompose model activations into clean features, we can potentially:

Detect deceptive reasoning by looking for features associated with manipulation or concealment.

Understand model values by examining which features are activated when making ethical decisions.

Verify safety properties by checking whether dangerous capability features exist and how they're used.

Surgically edit behavior by modifying specific features rather than retraining entire models.

This is the promise of mechanistic interpretability: not just understanding AI, but building tools to ensure AI remains aligned with human values as it becomes more powerful.

The stakes are high. As AI systems become more capable, our ability to understand and verify their behavior becomes critical. Superposition is the barrier; sparse autoencoders are the key.

Today's key ideas:

• Feature: A direction in neural space representing a concept
• Superposition: Packing more features than dimensions via "almost orthogonal" directions
• Interference: The cost of superposition; features bleed into each other
• Sparsity: Most features are inactive most of the time, which enables superposition

Tomorrow

Sparse Autoencoders — The breakthrough technique that cracks superposition open. How do you train a system to find features automatically? How did this lead to discovering millions of interpretable features inside Claude? And what does this mean for AI safety?
