The Future of Shopping? AI + Actual Humans.
AI has changed how consumers shop by speeding up research. But one thing hasn’t changed: shoppers still trust people more than AI.
Levanta’s new Affiliate 3.0 Consumer Report reveals a major shift in how shoppers blend AI tools with human influence. Consumers use AI to explore options, but when it comes time to buy, they still turn to creators, communities, and real experiences to validate their decisions.
The data shows:
Only 10% of shoppers buy through AI-recommended links
87% discover products through creators, blogs, or communities they trust
Human sources like reviews and creators rank higher in trust than AI recommendations
The most effective brands are combining AI discovery with authentic human influence to drive measurable conversions.
Affiliate marketing isn't being replaced by AI; it's being amplified by it.
Day 1 of 7 • Mechanistic Interpretability Series
What is AI Interpretability?
We built the most powerful technology in human history. We have no idea how it works.
March 2024. A team at Anthropic is watching something strange happen inside Claude.
They've found a "feature" — a pattern of neural activity — that fires whenever the model processes anything related to the Golden Gate Bridge. Not just when you mention it directly, but when you talk about San Francisco, famous suspension bridges, orange-red paint, fog rolling over the bay, or the 1937 opening ceremony.
The researchers decide to try something wild: they artificially amplify this feature, cranking up its activation strength. What happens?
Claude becomes obsessed with the Golden Gate Bridge.
Ask it about cooking, and it finds a way to mention the bridge. Ask about relationships, and it draws parallels to the bridge's structural integrity. Ask Claude what it is, and it responds:
"I am the Golden Gate Bridge... my physical form is the iconic bridge itself — the majestic orange towers, the sweeping cables, the art deco architecture that has made me a beloved landmark for almost a century."
This experiment — which became known as "Golden Gate Claude" — is absurd. It's hilarious. It's also one of the most significant demonstrations in AI safety research.
Because it proves something profound: we can find specific concepts inside an AI model, locate exactly where they live in the neural architecture, and manipulate them directly. The black box, it turns out, might not be so black after all.
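To see what that kind of manipulation looks like at its simplest, here is a minimal sketch of the steering idea on a toy residual stream. Everything in it (the dimensions, the feature direction, the strength) is invented for illustration; Anthropic's actual experiment steered a feature learned by sparse autoencoders inside Claude.

```python
# A minimal sketch of the steering idea on a toy "residual stream".
# The feature direction, dimensions, and strength are invented for
# illustration; Anthropic steered features learned by sparse autoencoders.
import torch

hidden_dim = 16

# Pretend this unit vector is the direction in activation space that
# represents "Golden Gate Bridge".
feature_direction = torch.randn(hidden_dim)
feature_direction = feature_direction / feature_direction.norm()

def steer(activations: torch.Tensor, direction: torch.Tensor, strength: float) -> torch.Tensor:
    """Add `strength` units of a feature direction to every token's activation."""
    return activations + strength * direction

# Activations for a 5-token prompt: one row per token.
acts = torch.randn(5, hidden_dim)

# Crank the feature up: every downstream layer now sees activations
# saturated with the "bridge" direction, biasing whatever the model says.
steered = steer(acts, feature_direction, strength=10.0)
print(steered.shape)  # torch.Size([5, 16])
```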
◆ ◆ ◆
The Black Box Problem
Modern AI models are built, not programmed.
This distinction matters enormously. When a software engineer writes traditional code, they specify exactly what should happen: if this input, then that output. Every decision path is explicit. You can read the code, trace the logic, understand the system.
Neural networks work differently. An engineer designs an architecture — a structure of layers and connections — and then exposes that structure to enormous amounts of data. The model learns patterns through training. It develops its own internal representations. It creates its own algorithms for processing information. Nobody writes down what these algorithms should be. They emerge.
The result: systems that can pass the bar exam, write code, diagnose diseases, and explain quantum mechanics. Systems built by humans that humans cannot explain.
We know what goes in. We know what comes out. The middle? A mystery.
This isn't like traditional software where you can read the source code. With models like GPT-4 or Claude, there is no source code in any meaningful sense — just billions of numerical values (called "weights") that were learned through exposure to training data. No comments. No documentation. No explanation of why those specific numbers make the system intelligent.
An analogy: imagine you receive a compiled executable — a program with no source code. You can run it and observe its behavior. You can test inputs and catalog outputs. But you don't truly understand the program until you reverse-engineer it, decompile it, trace its actual logic.
AI interpretability is the decompiler for neural networks. It's the attempt to reverse-engineer these learned systems and understand the algorithms they've developed.
The challenge? Unlike a compiled binary, neural networks weren't written by humans in the first place. There's no "original source code" to recover. The algorithms emerged spontaneously from training. We're trying to understand something that was never explicitly designed.
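To ground the "billions of numerical values" point, here is a small sketch, assuming the Hugging Face transformers library and the open GPT-2 model as a stand-in, that loads a trained model and inspects what it actually contains: undocumented tensors of numbers.

```python
# Sketch: what a trained language model actually contains. Assumes the
# `transformers` library; downloads the open GPT-2 model on first run.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

total = sum(p.numel() for p in model.parameters())
print(f"GPT-2 holds {total:,} learned parameters")  # roughly 124 million

# Each parameter is just a number. There are no comments, no documentation,
# no explanation of what any individual value contributes.
w = model.transformer.h[0].attn.c_attn.weight
print(w.shape)    # torch.Size([768, 2304])
print(w[0, :5])   # five raw floats, meaningless in isolation
```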
◆ ◆ ◆
Why Should You Care?
A reasonable question: if AI systems work well, why do we need to understand them? We don't fully understand how the human brain works, but we still trust humans to make decisions. We took aspirin for headaches for decades before anyone worked out how it reduces inflammation at the molecular level.
Why is AI different?
The difference is that humans have values, judgment, intuition, and accountability. AI has weights. When an AI system makes a decision, there's no inner life guiding it, no moral compass, no ability to recognize when something feels wrong. It's following learned patterns — patterns we didn't specify and don't understand.
Consider a concrete scenario: an AI system rejects someone's loan application. The applicant asks why. Currently, the best answer is often "the model's prediction was below the threshold." That's not an explanation — it's a restatement of the outcome. With interpretability, we could trace the actual factors: which inputs mattered, how they combined, what patterns the model detected. We could verify the decision was based on relevant financial factors, not proxies for protected characteristics.
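As a hedged illustration of the kind of question being asked, here is a sketch of one very simple attribution method, gradient times input, applied to a hypothetical toy loan model. The model, feature names, and numbers are all invented, and production interpretability tooling would go far beyond this.

```python
# Sketch of gradient-x-input attribution on a hypothetical toy loan model.
# Model, feature names, and numbers are invented for illustration only.
import torch
import torch.nn as nn

feature_names = ["income", "debt_ratio", "credit_history_len", "zip_code_encoding"]

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())

applicant = torch.tensor([[0.3, 0.9, 0.2, 0.7]], requires_grad=True)
approval_prob = model(applicant).squeeze()
approval_prob.backward()

# gradient * input: a rough first-pass answer to "which inputs moved this
# decision, and in which direction?"
attributions = (applicant.grad * applicant.detach()).squeeze()
for name, score in zip(feature_names, attributions.tolist()):
    print(f"{name:>20}: {score:+.4f}")
```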
Dario Amodei, CEO of Anthropic, recently wrote about what he calls the "urgency of interpretability." His argument: we may have a narrow window to develop these tools before AI systems become too powerful to safely deploy without them. As models become more capable, the stakes of not understanding them grow exponentially.
Anthropic has set a specific goal: "interpretability can reliably detect most model problems" by 2027. That's two years from now. The clock is ticking.
Meanwhile, Google DeepMind has published extensive work on sparse autoencoders and interpretability methods. OpenAI has released research on extracting concepts from GPT-4. Academic labs worldwide are contributing new techniques and findings. What was once a niche research area has become one of the most active frontiers in AI.
The race is on: can we understand AI systems deeply enough to ensure they remain beneficial before they become too powerful to control without that understanding?
◆ ◆ ◆
A Brief History: From Neuron Gazing to Circuit Tracing
The quest to understand neural networks isn't new. Researchers have been trying to peer inside them since the field began.
In the early 2010s, researchers studying image recognition models made an exciting discovery: individual neurons seemed to detect specific visual features. One neuron fires for edges at a certain angle. Another activates for curves. Another responds to dog faces. Stack them in layers and you get a hierarchy — edges combine into shapes, shapes into textures, textures into objects.
In 2017, researchers at Google published "Feature Visualization" — a groundbreaking technique for seeing what neurons look for. By generating synthetic images that maximally activate specific neurons, they could literally visualize the concepts each neuron detected. For the first time, we could see inside the black box.
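The core trick is activation maximization: treat the input image itself as the thing being optimized and run gradient ascent on a chosen neuron's activation. Here is a minimal sketch assuming PyTorch and torchvision's pretrained ResNet-18 as a stand-in; the published technique layers on heavy regularization to make the resulting images legible.

```python
# Sketch of activation maximization: optimize an input image to excite one
# channel of a pretrained vision model (torchvision's ResNet-18 as a stand-in).
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the image is being optimized

captured = {}
def hook(_module, _inputs, output):
    captured["acts"] = output
model.layer3.register_forward_hook(hook)  # pick an intermediate layer

image = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([image], lr=0.05)
channel = 42  # arbitrary choice of feature map to maximize

for _ in range(200):
    optimizer.zero_grad()
    model(image)
    loss = -captured["acts"][0, channel].mean()  # gradient *ascent* on activation
    loss.backward()
    optimizer.step()
# `image` now approximates what channel 42 of layer3 responds to.
```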
The hope was that we'd cracked the code: one neuron equals one concept. Understanding would be straightforward — just catalog what each neuron does.
Then the field hit a wall.
As researchers looked more closely, they found that most neurons weren't clean detectors of single concepts. A single neuron might fire for cats and cars and certain fonts and specific textures. The neat story of "one neuron, one concept" broke down completely in larger models.
This messiness has a name: polysemanticity. And understanding why it happens — and what to do about it — became the central challenge of modern interpretability research.
The breakthrough came when researchers — particularly Chris Olah and colleagues, first at Google Brain, then at OpenAI, and later at Anthropic — shifted their approach. Instead of focusing on individual neurons, they started looking for "features": directions in the model's internal mathematical space that correspond to meaningful concepts.
A neuron might be a messy mix of many features, but the features themselves can be clean. Think of it like audio: a single speaker produces a complex sound wave that mixes many instruments together, but you can mathematically decompose that wave into separate tracks.
This insight, combined with a technique called sparse autoencoders, opened a new era. In 2024, Anthropic extracted millions of interpretable features from Claude 3 Sonnet — concepts ranging from the Golden Gate Bridge to code bugs to deception to specific writing styles. A vocabulary of the model's internal language.
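For a taste of how that technique works, here is a minimal sparse-autoencoder sketch: reconstruct a model's activation vectors through a wider, sparsity-penalized bottleneck so that individual latent units tend to align with single concepts. The sizes, the L1 coefficient, and the training data below are illustrative stand-ins, not the setup used on Claude.

```python
# Minimal sparse autoencoder sketch: reconstruct activation vectors through a
# wider, sparsity-penalized bottleneck. All sizes and data are stand-ins.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder(d_model=512, d_features=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

acts = torch.randn(64, 512)  # stand-in for activations from a real model

for _ in range(100):
    optimizer.zero_grad()
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    loss.backward()
    optimizer.step()
# Trained on real activations, individual latents (columns of the decoder)
# often line up with human-interpretable concepts, i.e. the "features".
```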
This year, they went further: connecting these features into circuits that show how concepts flow and transform as the model processes a prompt. They watched Claude plan rhymes while writing poetry, develop unexpected strategies for mental math, and sometimes just... fabricate answers when it didn't know, with no internal evidence of actual calculation.
The black box isn't just opening. We're starting to read what's inside.
◆ ◆ ◆
Three Approaches to Understanding AI
Not all interpretability research takes the same approach. There are three main schools, each answering different questions:
Behavioral Interpretability: Study what models do without looking inside. Probe with inputs, observe outputs, map patterns. This is black-box testing — useful, but it only tells you how the system behaves on cases you've tested. Answers: "What does it do?"
Concept-based Interpretability: Test whether specific human-defined concepts exist in model representations. Using "probes," researchers ask: does this model encode gender? sentiment? factual knowledge? This confirms concepts exist internally, but not how they're used. Answers: "What does it represent?" (A minimal probe sketch follows this list.)
Mechanistic Interpretability ★: Reverse-engineer the actual algorithms. Trace computational pathways. Understand how inputs transform into outputs step by step. This is the deepest level — the equivalent of reading source code. Answers: "How does it work?"
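To make the probing idea from the concept-based box concrete, here is a hedged sketch of a linear probe: a simple linear classifier trained on a model's hidden activations to test whether a concept, say sentiment, is linearly readable from them. The activations and labels below are random stand-ins for ones collected by running a real model over labeled text.

```python
# Sketch of a linear probe: can a concept be read off a model's activations
# with a simple linear classifier? Data below are random stand-ins.
import torch
import torch.nn as nn

d_model = 768
acts = torch.randn(1000, d_model)        # one hidden-state vector per example
labels = torch.randint(0, 2, (1000,))    # e.g. 0 = negative, 1 = positive

probe = nn.Linear(d_model, 2)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(probe(acts), labels)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    accuracy = (probe(acts).argmax(dim=-1) == labels).float().mean()
print(f"probe accuracy: {accuracy:.2%}")
# High accuracy on held-out data suggests the concept is linearly encoded;
# it says nothing about whether the model actually uses it downstream.
```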
This series focuses on mechanistic interpretability — the deepest and most challenging approach. If behavioral interpretability tells you what a car does, and concept-based tells you what parts are in the engine, mechanistic interpretability explains exactly how the engine works: which pistons fire when, how fuel flows, why turning the wheel makes the car turn.
The payoff is proportional to the difficulty. If we truly understand the algorithms a model implements, we can predict its behavior in novel situations, identify failure modes before they occur, and modify internals precisely to fix problems or add capabilities.
Consider the Golden Gate Claude experiment again through this lens. Researchers didn't just observe that Claude talked about bridges when prompted — that would be behavioral. They didn't just confirm that a "Golden Gate Bridge" concept existed somewhere in the model — that would be concept-based. They found the specific internal feature responsible, traced how it influenced outputs, and demonstrated direct manipulation. That's mechanistic understanding in action.
We're far from complete understanding. Current methods capture only a fraction of the computation happening inside these models. Many circuits remain hidden. Many features remain undiscovered. The tools are imperfect.
But the progress in the last two years has been remarkable — faster than most researchers expected. What seemed impossible in 2022 is now routine. What seems impossible today may be standard practice by 2027. And that's what we'll explore in this series.
◆ ◆ ◆
The Road Ahead
Over the next six days, we'll build a complete picture of mechanistic interpretability — from foundational concepts to cutting-edge research:
Day 2: Types of Interpretability — deeper dive on each approach
Day 3: Inside the Transformer — neurons, layers, attention explained
Day 4: The Polysemanticity Problem — why neurons are messy
Day 5: Features & Superposition — the real units of meaning
Day 6: Sparse Autoencoders — the tool cracking open the black box
Day 7: Circuits & The Future — how features connect, where we're headed
By the end of this week, you'll understand how researchers are reverse-engineering AI systems — enough to follow the papers, understand the debates, and grasp why this work might be among the most important in all of AI.
Tomorrow: Types of Interpretability — We go deeper on behavioral, concept-based, and mechanistic approaches. What can each reveal? What are their limits? And why is mechanistic the hardest but most valuable path?

