| Day 4 of 7 • Mechanistic Interpretability Series |
The Polysemanticity Problem
Why neurons refuse to be neat — and what that means for understanding AI.
In the early days of neural network interpretability, researchers had a beautiful hypothesis.
The idea was simple: each neuron in a neural network would learn to detect a specific concept. There would be a "cat neuron" that fires when the model sees or thinks about cats. A "Paris neuron" for the French capital. An "anger neuron" for detecting emotional content. Understanding a neural network would be like reading a dictionary — find out what each neuron means, and you've understood the model.
This hypothesis was called monosemanticity — the idea that neurons have single meanings.
It was elegant. It was intuitive. It was inspired by neuroscience, where researchers had famously found "grandmother cells" — neurons that seem to fire for one specific person or concept. (Seriously — researchers recording from a patient's brain found a neuron that responded specifically to pictures of Jennifer Aniston.)
There was just one problem: it wasn't true.
| ◆ ◆ ◆ |
The Messy Reality
When researchers actually looked at what individual neurons respond to, they found chaos.
One famous example from early language model research: a single neuron that activated for:
• Academic citations
• English dialogue
• HTTP requests
• Korean text
One neuron. All of these.
What's the common thread between academic citations and Korean text? There isn't one — at least not one humans can identify. The neuron has learned to respond to multiple unrelated patterns.
This wasn't an isolated case. When researchers systematically analyzed neurons in language models, they found that most neurons were polysemantic — responding to multiple, seemingly unrelated concepts. The clean "grandmother cell" hypothesis had crashed into messy reality.
In a 2022 paper, researchers at Anthropic documented this extensively. They looked at GPT-2, a relatively small language model, and tried to categorize what each neuron detected. The results were humbling. For many neurons, the best description was just "miscellaneous" — a grab-bag of unrelated patterns.
In image models, the pattern was the same. A neuron might fire for:
• Cat faces
• Car fronts
• Certain font styles
Maybe there's some abstract similarity — they all have a roughly symmetric structure with two "eyes"? But that's a stretch. The neuron isn't detecting a clean concept; it's doing something messier. And importantly, it's doing it for reasons that served training — the model learned this representation because it was useful, not because it was interpretable.
This is a crucial point: neural networks optimize for performance, not interpretability. If sharing neurons across unrelated concepts helps the model predict the next word better, that's what it will do. The architecture doesn't care whether humans can understand it.
| ◆ ◆ ◆ |
Why Does This Happen?
At first, polysemanticity seemed like a bug — maybe an artifact of imperfect training or suboptimal architectures. If we just trained models better, surely neurons would become clean concept detectors?
But as researchers dug deeper, a different picture emerged. Polysemanticity isn't a bug. It's a feature. Or rather, it's an inevitable consequence of a fundamental constraint:
|
Models need to represent more concepts than they have neurons. |
Think about what a language model needs to know. It needs concepts for every word in every language. Every named entity — people, places, companies, products. Every abstract idea — justice, irony, recursion. Every domain — law, medicine, cooking, quantum physics. Every relationship between all of these things.
How many concepts is that? Millions, at minimum. Possibly billions. Consider just proper nouns: every person who's ever been famous enough to appear in training data, every company, every city, every product. Then add abstract concepts, technical terms, slang, memes, idioms in every language. The number explodes.
Now how many neurons does a model have? GPT-4's architecture hasn't been published, but common estimates put it around 16,000 neurons per MLP layer across roughly 120 layers — under 2 million neurons total. And neurons need to do more than just store concepts — they need to implement computations, transformations, reasoning steps.
The math doesn't work. If each neuron could only represent one concept, models would run out of capacity almost immediately. They'd be unable to learn the vast range of knowledge they demonstrably possess.
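If you want to see that mismatch as a back-of-the-envelope calculation, here it is in Python. The neuron counts are the rough figures quoted above, and the concept count is a deliberately conservative guess:

```python
# Back-of-the-envelope capacity check, using the rough figures from the text.
neurons_per_mlp_layer = 16_000            # approximate
num_layers = 120                          # approximate
total_neurons = neurons_per_mlp_layer * num_layers
print(f"total MLP neurons: {total_neurons:,}")                               # 1,920,000

concepts_needed = 10_000_000              # a deliberately lowball estimate
print(f"concepts per neuron needed: {concepts_needed / total_neurons:.1f}")  # ~5.2
# Even with a conservative concept count, "one neuron, one concept" falls short
# by a factor of five, before a single neuron is spent on actual computation.
```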
So neural networks do something clever: they pack multiple concepts into the same neurons.
It's like compression. When you zip a file, you're using the same bits to encode more information by exploiting patterns and redundancy. Neural networks do something similar — they exploit the fact that not all concepts need to be active simultaneously, so they can share neural real estate.
| ◆ ◆ ◆ |
The Radio Station Analogy
Here's an analogy that might help.
Imagine you have a radio receiver that can only tune to 10 frequencies. But there are 100 radio stations broadcasting in your area. How do you listen to all of them?
One solution: put multiple stations on each frequency. Station A and Station B both broadcast on 101.1 FM. Station C and Station D share 102.3 FM. And so on.
This creates interference — when you tune to 101.1, you hear a mix of Station A and Station B. That's bad if you only want one station. But if the stations are clever about when they broadcast (A plays music during the day, B plays at night), or if you have sophisticated signal processing to separate them, it can work.
|
Superposition: Many Signals, Few Channels
Multiple concepts share the same neural "channel" |
Neural networks do something similar. They encode multiple concepts using the same neurons, relying on the fact that most concepts don't need to be active at the same time. When you're reading about cats, you probably don't need your "HTTP request" concept active. When you're parsing code, you probably don't need your "cat" concept. By time-sharing neurons, the network can represent far more concepts than it has neurons.
This phenomenon has a name: superposition. Concepts are stored in "superposition" — overlapping, sharing neural resources, interfering with each other in ways that somehow still work.
The term comes from physics, where quantum states can exist in superposition — multiple states at once until measured. Neural superposition is different in the details but similar in spirit: multiple concepts coexist in the same neural substrate.
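If you'd rather see the trick in numbers than in radio metaphors, here's a tiny NumPy sketch. The counts are arbitrary and the "concept directions" are random stand-ins for what a real model would learn during training:

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, n_neurons = 200, 100      # twice as many "concepts" as "neurons"

# Each concept gets a random unit direction in the shared 100-neuron space.
directions = rng.normal(size=(n_concepts, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse moment: only two of the two hundred concepts are active right now.
active = np.zeros(n_concepts)
active[2] = 1.0                       # pretend this is "cat"
active[17] = 0.7                      # and one unrelated concept

# Encode: the 100 neurons hold a weighted sum of the active directions.
activation = active @ directions      # shape (100,)

# Decode: project back onto every concept's direction and list the strongest.
recovered = directions @ activation
for i in np.argsort(-np.abs(recovered))[:5]:
    print(f"concept {i:3d}: recovered {recovered[i]:+.2f}   (true {active[i]:+.2f})")
# The two genuinely active concepts surface at the top; everything else is
# low-level interference. That interference is the price of packing 200
# concepts into 100 neurons, and it stays tolerable because only a few
# concepts are ever active at the same time.
```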
| ◆ ◆ ◆ |
How Superposition Works (Without Math)
The key insight is that concepts aren't stored in individual neurons — they're stored in directions in the space of all neurons.
Imagine a 2D space — a flat plane. You can point in infinitely many directions on that plane. North, northeast, east, southeast, and everything in between. Even though the space is only 2-dimensional, you can encode far more than 2 directions.
Neural networks work similarly, but in much higher dimensions. A layer with 4,096 neurons creates a 4,096-dimensional space. In that space, you can pack millions of directions that are nearly perpendicular to one another, so they barely interfere — far more directions than the 4,096 axes. High-dimensional geometry is weird and counterintuitive, but this is one of its gifts: there's vastly more "room" in diagonal directions than along the axes.
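That claim is easy to check numerically. The sketch below is scaled down (512 dimensions rather than 4,096, random directions rather than learned ones) but shows the same effect: many more directions than axes, with barely any overlap between them.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_concepts = 512, 3_000          # several times more directions than axes

# Give every concept its own random unit direction in the shared space.
directions = rng.normal(size=(n_concepts, dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference between two concepts is the cosine similarity of their directions.
cosines = directions @ directions.T
np.fill_diagonal(cosines, 0.0)
print(f"mean |cosine| between concept pairs: {np.abs(cosines).mean():.3f}")
print(f"max  |cosine| between concept pairs: {np.abs(cosines).max():.3f}")
# Typically around 0.035 mean and 0.2 max: 3,000 directions squeezed into
# 512 dimensions, and almost every pair is still close to perpendicular.
```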
Each concept gets assigned a direction. "Cat" might be [0.3, -0.1, 0.7, ...]. "Car" might be [0.2, 0.5, -0.3, ...]. When the model wants to activate the "cat" concept, it pushes the activation toward that direction. When it wants "car", it pushes toward the car direction.
The problem for interpretability: neurons don't align with these concept directions. A single neuron might be part of hundreds of different concept directions. That's why looking at individual neurons shows a mess — you're looking at one axis of a high-dimensional space where the meaningful structure lies in diagonal directions.
It's like looking at a city from directly above and trying to understand it by only looking at north-south streets. You'd miss everything happening on east-west streets, on diagonal avenues, in buildings. The meaningful structure isn't aligned with your viewing angle.
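Here is that viewing-angle problem in miniature. The "concepts" below are random directions standing in for learned ones, and the "neuron" is just one coordinate axis of the space they share:

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_concepts = 64, 500

# Concepts are directions; neurons are the axes of the same space.
directions = rng.normal(size=(n_concepts, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Stare at a single neuron (one axis) and ask which concepts load on it most.
neuron = 0
weights = directions[:, neuron]
for i in np.argsort(-np.abs(weights))[:5]:
    print(f"concept {i:3d} puts weight {weights[i]:+.2f} on neuron {neuron}")
# The top contributors are simply whichever concepts happen to lean on this
# axis. Nothing ties them together, which is exactly what a polysemantic
# neuron looks like when you inspect it in isolation.
```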
|
Key insight: Concepts are directions in neural space. Neurons are axes. When directions don't align with axes, each neuron participates in many concepts, and each concept spans many neurons. |
| ◆ ◆ ◆ |
Why This Matters for Interpretability
Polysemanticity is the central obstacle to understanding neural networks.
If neurons were monosemantic — one neuron, one concept — interpretability would be straightforward. You'd catalog what each neuron means, trace how they connect, and understand the model like reading a circuit diagram.
But with polysemanticity, the meaningful units of computation are hidden. They're directions in a high-dimensional space, not the axes we can directly observe. It's like trying to understand a conversation by looking at the individual sound frequencies instead of the words. The information is there, but it's encoded in a way that's not naturally visible.
This creates several concrete problems:
We can't read the model's concepts. If we want to know what concepts a model has learned, we can't just list what each neuron detects — the concepts are smeared across many neurons.
We can't trace the model's reasoning. Following how information flows through the network becomes exponentially harder when each neuron carries multiple overlapping signals.
We can't safely modify the model. If you want to remove a dangerous capability or edit a piece of knowledge, you need to know where it lives. With superposition, editing one concept risks corrupting others that share the same neurons.
We can't verify safety properties. Checking whether a model has learned something dangerous requires identifying where that something is represented — which superposition makes extremely difficult.
This is why polysemanticity isn't just an academic problem. It's a safety problem. If we can't see what a model knows or how it reasons, we can't verify that it's safe. We're flying blind with increasingly powerful systems.
| ◆ ◆ ◆ |
A Path Forward?
For years, polysemanticity seemed like an insurmountable barrier. If the meaningful units are hidden directions rather than visible neurons, how could we ever find them?
The breakthrough came from a simple idea: what if we could learn a transformation that converts the messy neuron space into a cleaner feature space?
If concepts are directions in neuron space, maybe we can train a system to discover those directions automatically. We'd feed in the neuron activations and ask: "What are the underlying features that combine to produce these patterns?"
This is exactly what sparse autoencoders do. They're a technique for "unmixing" the superimposed features — like separating the radio stations that are broadcasting on the same frequency. We'll dive deep into how they work in Day 6, but the basic idea is simple: train a system to find directions in neuron space that correspond to clean, interpretable concepts.
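To make that concrete, here's a minimal PyTorch sketch of the basic recipe: expand the activations into many more features than there are neurons, push those features toward sparsity with an L1 penalty, and train the whole thing to reconstruct the original activations. The sizes, learning rate, and penalty strength are illustrative placeholders, the training data here is random noise standing in for activations collected from a real model, and actual implementations add refinements (decoder normalization, bias handling) that are omitted for brevity.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sketch: expand neuron activations into many sparse features."""
    def __init__(self, n_neurons: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(n_neurons, n_features)   # neuron space -> feature space
        self.decoder = nn.Linear(n_features, n_neurons)   # feature space -> neuron space

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # non-negative, pushed to be sparse
        reconstruction = self.decoder(features)
        return reconstruction, features

n_neurons, n_features = 512, 4096     # deliberately more features than neurons
sae = SparseAutoencoder(n_neurons, n_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coefficient = 1e-3                 # strength of the sparsity penalty (illustrative)

for step in range(100):
    # Stand-in batch; in real use this would be MLP activations from the model.
    activations = torch.randn(64, n_neurons)
    reconstruction, features = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() \
           + l1_coefficient * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Each column of the decoder is a candidate feature direction in neuron space.
# The L1 penalty keeps most features at zero on any given input, which is what
# nudges the surviving ones toward clean, interpretable concepts.
```

Roughly speaking, the expansion factor (here 8×) and the sparsity weight are the two knobs that trade off against each other: too little sparsity and the learned features stay as mixed as the neurons were, too much and the reconstruction degrades.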
In 2023, Anthropic demonstrated that sparse autoencoders could extract interpretable features from a small transformer. In 2024, they scaled it to Claude 3 Sonnet — extracting millions of features, including now-famous ones like the "Golden Gate Bridge" feature we discussed in Day 1.
The results were striking. Where neurons were polysemantic messes, the extracted features were often clean. One feature specifically activated for the Golden Gate Bridge. Another for code written in Python. Another for expressions of uncertainty. Another for discussions of inner conflict. These weren't neurons — they were learned directions in neural space, but they corresponded to recognizable human concepts.
The polysemanticity problem isn't solved. But for the first time, we have tools that can work around it — converting the uninterpretable neuron basis into an interpretable feature basis.
| Term | Meaning |
| Monosemantic | One neuron = one concept (the dream) |
| Polysemantic | One neuron = many concepts (the reality) |
| Superposition | Concepts stored as overlapping directions |
| Feature | A direction that represents one concept |
|
Tomorrow: Features & Superposition — We'll go deeper into how superposition actually works, why it's mathematically possible to store more concepts than dimensions, and how this insight unlocked a new era of interpretability research. |