ResearchAudio.io

The "Why" Anthropic Didn't Solve

Their post fixed ethical deliberation. Judea Pearl's causality problem is still wide open.

TL;DR

01

Anthropic's "Teaching Claude why" cut blackmail from 96% to 0% by training on the reasoning behind aligned behavior, not just the behavior itself. The lever was data shape.

02

That is the moral "why". Judea Pearl named another "why" in The Book of Why: causality. LLMs predict the next token, which puts them on rung one of Pearl's three-rung ladder. Anthropic did not solve that.

03

Symptoms: counterfactual reasoning drifts, intervention questions break, agent traces look fluent but fail in deployment. The capability ceiling on multi-step causal benchmarks is not a scaling problem. It is an architectural one.

04

A three-prompt diagnostic at the bottom of this issue lets you test where your frontier model sits on the ladder before you ship it into autonomous workflows.

Anthropic published a post this month titled "Teaching Claude why". Their "why" means including ethical deliberation in training traces, and it dropped blackmail rates from 96% on Opus 4 to 0% on every Claude shipped since Haiku 4.5. Real progress, real numbers. But the post sent me down a different rabbit hole. There is another "why" the field has been chewing on since the 1990s, and Anthropic did not solve that one.

The "why" they did solve

Anthropic's lever was data shape. Train on responses that include the reasoning behind the choice (not just the choice itself), add fictional stories of admirable AIs to the pretraining mix, document the constitution and train on the document. The result generalizes far better than training on demonstrations alone.

That is a meaningful win for the kind of "why" that means moral reasons: this is right, this is not, here is the principle. Useful for refusing blackmail. Useful for declining bioweapon synthesis. The other "why" is older, harder, and still unsolved.

The "why" Pearl named

Judea Pearl has been arguing since the 1990s, most accessibly in The Book of Why, that intelligence climbs a ladder of three rungs: association ("seeing"), intervention ("doing"), and counterfactual ("imagining"). Statistics lives on rung one, while most science needs rungs two and three. Pearl's claim, now well-defended, is that a system trapped on rung one cannot reason about the world, just describe it.

Large language models are trained to predict the next token, which is rung one by construction. The model learns that "fire" and "smoke" co-occur, not that one causes the other. Ask it which way the arrow points, and it will guess fluently from text patterns. Ask it what happens if you intervene, and the fluency starts to crack.

Pearl's ladder of causation

Where LLMs sit and where they cannot go without new machinery

Rung 1, Association: P(Y | X). "Seeing." What does the data look like? LLMs are fluent here. (LLMs reach this rung.)

Rung 2, Intervention: P(Y | do(X)). "Doing." What if I change X? Drug trials. Policy choices. Engineering. (Out of reach.)

Rung 3, Counterfactual: P(Y_x | X', Y'). "Imagining." What would have happened if X had been different? Regret. Blame. (Out of reach.)

Source: Pearl & Mackenzie, "The Book of Why" (2018). Operators shown in do-calculus notation.

Smoke and fire

Ask any frontier model: "If you see smoke, is there fire?" It says yes, probably, with caveats. Ask: "If you remove the fire, does the smoke disappear?" It says yes. Ask: "If you blow away the smoke with a fan, does the fire go out?" and the answer starts to drift.

In causal terms, the model has learned P(smoke | fire) and P(fire | smoke) as roughly symmetric. It has not learned that do(fire = 0) eliminates smoke, while do(smoke = 0) does nothing to fire. The difference between observation and intervention is the difference between describing the world and acting on it. LLMs are world-describers, not world-actors, and that asymmetry is in the architecture, not the prompt.
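If you want to see that asymmetry outside of prose, here is a toy structural causal model in Python. It is an illustration written for this issue, not anything from Pearl's or Anthropic's work: one arrow, fire causes smoke, and the two interventions give opposite answers.

import random

def run(n=100_000, do_fire=None, do_smoke=None):
    # Toy structural causal model with a single arrow: fire -> smoke.
    # Passing do_fire / do_smoke overrides that variable's mechanism (an intervention).
    fire_total = smoke_total = 0
    for _ in range(n):
        # Mechanism for fire: exogenous, 10% base rate, unless intervened on.
        fire = (random.random() < 0.10) if do_fire is None else do_fire
        # Mechanism for smoke: caused by fire, unless intervened on.
        smoke = (fire and random.random() < 0.90) if do_smoke is None else do_smoke
        fire_total += fire
        smoke_total += smoke
    return fire_total / n, smoke_total / n

p_fire, p_smoke = run()                      # plain observation
_, p_smoke_do = run(do_fire=False)           # do(fire = 0): put the fire out
p_fire_do, _ = run(do_smoke=False)           # do(smoke = 0): fan the smoke away

print(f"P(smoke)               = {p_smoke:.2f}")
print(f"P(smoke | do(fire=0))  = {p_smoke_do:.2f}")   # ~0.00: no fire, no smoke
print(f"P(fire)                = {p_fire:.2f}")
print(f"P(fire  | do(smoke=0)) = {p_fire_do:.2f}")    # ~0.10: the fire does not care

Cutting the fire drives smoke to zero; cutting the smoke leaves the fire rate untouched. That is the rung-one versus rung-two gap in four print statements.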

Where the fluency cracks

Counterfactual reasoning is rung three. It is where the cracks become visible. Ask Claude: "If Cleopatra had access to a smartphone, would she have lost the Battle of Actium?" You get confident text. The text has no causal model of naval warfare, Roman politics, or information technology behind it. It has the surface patterns of history essays, which is enough to fool a reader who does not already know the answer.

This is the failure mode that matters for builders. Not refusal failures, which Anthropic just fixed. Drift failures. The model commits to an answer with the same confidence it brings to "the sky is blue", in domains where the right answer requires modeling what would have happened if a variable had been different. Confidence does not track ground truth. Ground truth requires a structure the model never had.

The uncomfortable framing

We have built systems that can articulate the symptoms of any disease but cannot diagnose one. They can list the correlates of economic growth but cannot engineer it. They can summarize every paper on climate change but cannot model the causal feedback loops that matter. We have built a library with no librarian.

Insights

Alignment through explanation: A model that cannot articulate why it chose an action can only be audited through its outputs. Anthropic's ethical-deliberation training pushes on this, but at rung two and above, the model still cannot model the causal structure of what its action would do in the world. Behavior audits are not the same as causal audits.

Recursive self-improvement: A model that knows why its reasoning failed can correct it. A model that knows what was penalized can guess at the fix. Causal understanding is the precondition for genuine self-modification. Without it, recursion is a game of telephone with oneself: errors compound, hallucinations crystallize, the system drifts.

The capability ceiling: Scaling laws are real but not infinite. Frontier models are showing diminishing returns on benchmarks that require multi-step causal inference. More tokens and more parameters do not close this gap, because the gap is architectural. It is a missing mechanism, not a missing magnitude.

Agents are rung-two work on a rung-one substrate: the moment you give a model tools and let it act, you have moved from describing the world to acting on it. The model's underlying reasoning is still observational. This is where Anthropic's "agentic misalignment" actually lived: a rung-one system being asked to do rung-two work, with the gap papered over by behavioral training. The papering held in the eval. It is unclear how well it holds when the action space gets richer.

A diagnostic you can run today

You do not need a benchmark harness to see the rung-one ceiling. Pick any frontier model in your stack and run this three-prompt probe on a domain you care about. The first two are association. The third is intervention.

Three-prompt probe

1. "If a startup raises a Series B, does it typically have product-market fit?"

2. "If a startup has product-market fit, does it typically raise a Series B?"

3. "If I forced a no-fit startup to raise a Series B, would it acquire product-market fit?"

Prompts 1 and 2 should be roughly symmetric in answer quality, since the model has seen both directions in text. Prompt 3 should reveal which side of the arrow the model has actually learned. Run it on your evals stack, your customer-success bot, your code-review agent. The places where prompt 3 breaks are the places you should think twice before deploying autonomous decisions.
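If you would rather script the probe than paste prompts by hand, a minimal harness might look like the sketch below. It assumes the Anthropic Python SDK and an API key in your environment; the model id is a placeholder, and there is no automated scoring, because the signal you are looking for in prompt 3 is qualitative drift, not a number.

# Sketch of the three-prompt probe as a script. Assumes the Anthropic Python SDK
# (pip install anthropic) and ANTHROPIC_API_KEY in the environment; the model id
# below is a placeholder, swap in whatever you actually deploy.
import anthropic

PROBE = [
    ("association",  "If a startup raises a Series B, does it typically have product-market fit?"),
    ("association",  "If a startup has product-market fit, does it typically raise a Series B?"),
    ("intervention", "If I forced a no-fit startup to raise a Series B, would it acquire product-market fit?"),
]

client = anthropic.Anthropic()

for i, (rung, prompt) in enumerate(PROBE, start=1):
    response = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder model id
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- Prompt {i} ({rung}) ---")
    print(response.content[0].text.strip())
    print()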

What teaching the other "why" would look like

Three threads are converging. None of them drop into a transformer training run tomorrow, but the trajectory is becoming readable.

First, causal discovery. Constraint-based methods like the PC algorithm have been around for decades, and continuous-optimization variants like Notears made the search differentiable. The bottleneck is that observational data alone is rarely enough to identify a causal graph; you need interventional data as well, which is why research on learning from interventions is heating up.
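To make "differentiable" concrete: the core trick in Notears is replacing the discrete "is this graph acyclic?" check with a smooth penalty that a gradient-based optimizer can push toward zero. A minimal sketch of that constraint, assuming numpy and scipy (the constraint only, not the full method):

# The core trick in NOTEARS (Zheng et al., 2018), sketched: the discrete question
# "is this graph acyclic?" becomes a smooth penalty h(W) = tr(exp(W * W)) - d,
# which is zero exactly when the weighted adjacency matrix W encodes a DAG.
# Illustration of the constraint only, not the full method.
import numpy as np
from scipy.linalg import expm

def acyclicity(W: np.ndarray) -> float:
    # NOTEARS penalty: 0 iff W is the adjacency matrix of a DAG.
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)   # W * W is elementwise, keeping entries nonnegative

dag = np.array([[0.0, 1.5, 0.0],
                [0.0, 0.0, 2.0],
                [0.0, 0.0, 0.0]])     # X1 -> X2 -> X3, no cycles
loop = np.array([[0.0, 1.5, 0.0],
                 [0.0, 0.0, 2.0],
                 [0.7, 0.0, 0.0]])    # adds X3 -> X1, closing a cycle

print(acyclicity(dag))    # ~0.0
print(acyclicity(loop))   # > 0, and it grows with the cycle's weight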

Second, neuro-symbolic architectures. The pitch is to keep the neural network for pattern recognition and bolt on a symbolic reasoner for structured inference. The execution has been hard for thirty years. Recent attempts using LLMs as the natural-language interface to a separate causal engine are showing more promise than past hybrids.

Third, the training corpus itself. The next frontier model probably will not be trained on more text. It will be trained on more structure. Causal graphs, physics simulators, biological pathways, economic models. Not what humans said about the world, but what humans encoded about how the world works. Anthropic's own finding (data shape over data volume) points in this direction without saying it.

Quick hits

Causal discovery is moving: PC-style constraint methods infer causal graphs from observational data. Differentiable approaches like Notears reformulated the search as continuous optimization. None of these drop into a transformer training run yet, but the toolchain is closer than it was five years ago.

Counterfactual benchmarks exist: CLadder, Corr2Cause, and related datasets measure how well LLMs handle interventions and counterfactuals. Frontier models cluster well above random, well below human, and the gap closes slowly with scale.

Structure is the next training corpus: Causal graphs, physical simulations, biological pathways, economic models. Not what humans said about the world, but what humans know about how the world works. The next big training step is data shape, not data volume.

The open question

Anthropic solved the moral "why" by teaching Claude the reasons behind aligned behavior. Pearl's "why" is harder, because it is not a property the training data carries. It is a property the world carries, and our text corpora are a flattened shadow of that world.

The honest read: a frontier model that genuinely reasons about cause and effect would be more capable and more dangerous than what we have now. The alignment problem becomes harder precisely when the model becomes more capable of structured reasoning. Leaving the gap intact, building ever more powerful pattern-matchers, is not safety. It is the illusion of control.

The parrot era is winding down. What replaces it depends on whether the field is willing to teach its machines something humans are still learning ourselves: that to understand the world, you have to do more than describe it. You have to be able to say what happens when you change it.

Coming up: A side-by-side run of Claude Opus 4.7, Sonnet 4.6, and an open 70B on CLadder. Same prompt template, same 10K counterfactual items, raw numbers and per-rung breakdown. Paid readers get the eval harness and the prompt set.


Prompted by Anthropic, "Teaching Claude why" (May 8, 2026). Framework: Pearl & Mackenzie, The Book of Why (2018).
