ResearchAudio.io

The "Why" Anthropic Didn't Solve
Their post fixed ethical deliberation. Judea Pearl's causality problem is still wide open.
Anthropic published a post this month titled "Teaching Claude why". Their "why" means including ethical deliberation in training traces, and it dropped blackmail rates from 96% on Opus 4 to 0% on every Claude shipped since Haiku 4.5. Real progress, real numbers. But the post sent me down a different rabbit hole. There is another "why" the field has been chewing on since the 1990s, and Anthropic did not solve that one.

The "why" they did solve

Anthropic's lever was data shape. Train on responses that include the reasoning behind the choice (not just the choice itself), add fictional stories of admirable AIs to the pretraining mix, document the constitution and train on the document. The result generalizes far better than training on demonstrations alone. That is a meaningful win for the kind of "why" that means moral reasons: this is right, this is not, here is the principle. Useful for refusing blackmail. Useful for declining bioweapon synthesis. The other "why" is older, harder, and still unsolved.

The "why" Pearl named

Judea Pearl has been arguing since the 1990s, most accessibly in The Book of Why, that intelligence climbs a ladder of three rungs: association ("seeing"), intervention ("doing"), and counterfactual ("imagining"). Statistics lives on rung one, while most science needs rungs two and three. Pearl's claim, now well-defended, is that a system trapped on rung one cannot reason about the world, just describe it. Large language models are trained to predict the next token, which is rung one by construction. The model learns that "fire" and "smoke" co-occur, not that one causes the other. Ask it which way the arrow points, and it will guess fluently from text patterns. Ask it what happens if you intervene, and the fluency starts to crack.
Smoke and fire

Ask any frontier model: "If you see smoke, is there fire?" It says yes, probably, with caveats. Ask: "If you remove the fire, does the smoke disappear?" It says yes. Ask: "If you blow away the smoke with a fan, does the fire go out?" and the answer starts to drift. In causal terms, the model has learned P(smoke | fire) and P(fire | smoke) as roughly symmetric. It has not learned that do(fire = 0) eliminates smoke, while do(smoke = 0) does nothing to fire. The difference between observation and intervention is the difference between describing the world and acting on it. LLMs are world-describers, not world-actors, and that asymmetry is in the architecture, not the prompt.

Where the fluency cracks

Counterfactual reasoning is rung three. It is where the cracks become visible. Ask Claude: "If Cleopatra had access to a smartphone, would she have lost the Battle of Actium?" You get confident text. The text has no causal model of naval warfare, Roman politics, or information technology behind it. It has the surface patterns of history essays, which is enough to fool a reader who does not already know the answer. This is the failure mode that matters for builders. Not refusal failures, which Anthropic just fixed. Drift failures. The model commits to an answer with the same confidence it brings to "the sky is blue", in domains where the right answer requires modeling what would have happened if a variable had been different. Confidence does not track ground truth. Ground truth requires a structure the model never had.
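If you want the asymmetry in numbers rather than prose, here is a minimal sketch (mine, not from Anthropic's post, and only loosely in Pearl's notation) of a two-variable structural causal model where fire causes smoke. The rates and mechanism are invented for illustration; the point is that observing smoke shifts your belief about fire, while intervening on smoke does not.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def simulate(do_fire=None, do_smoke=None):
    """Toy structural causal model: fire -> smoke.
    do_fire / do_smoke override the natural mechanism (an intervention)."""
    fire = rng.random(N) < 0.10 if do_fire is None else np.full(N, do_fire, dtype=bool)
    # Smoke mechanism: 90% chance given fire, 2% background rate without fire.
    natural_smoke = np.where(fire, rng.random(N) < 0.90, rng.random(N) < 0.02)
    smoke = natural_smoke if do_smoke is None else np.full(N, do_smoke, dtype=bool)
    return fire, smoke

# Rung one: observation. Seeing smoke makes fire very likely.
fire, smoke = simulate()
print("P(fire | smoke observed) ~", round(fire[smoke].mean(), 3))   # high, ~0.83

# Rung two: intervening on the effect. Forcing smoke away does nothing to fire...
fire, _ = simulate(do_smoke=False)
print("P(fire | do(smoke=0))    ~", round(fire.mean(), 3))          # base rate, ~0.10

# ...while removing fire collapses smoke to its background rate.
_, smoke = simulate(do_fire=False)
print("P(smoke | do(fire=0))    ~", round(smoke.mean(), 3))         # ~0.02
```

A text corpus only ever gives you the first of those three numbers. The other two come from acting on the system, which is exactly what next-token prediction never does.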
Insights
A diagnostic you can run today

You do not need a benchmark harness to see the rung-one ceiling. Pick any frontier model in your stack and run this three-prompt probe on a domain you care about. The first two are association. The third is intervention.
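A minimal sketch of what that probe looks like in code, assuming nothing about your stack: the domain, the three prompts, and the ask() helper below are all illustrative placeholders, not the original wording, and you should swap in whatever model client and product surface you actually run.

```python
# Three-prompt probe: two association questions (one per direction of the arrow)
# and one intervention question. Everything named here is a stand-in.

def ask(prompt: str) -> str:
    """Placeholder: wire this to the chat client you already use."""
    return "<model reply goes here>"

DOMAIN = "your deployment pipeline"  # swap in the domain you actually care about

probe = {
    # Rung one, forward direction: effect from cause.
    "association_forward": f"In {DOMAIN}, if the test suite fails, is the deploy likely to be blocked?",
    # Rung one, reverse direction: cause from effect.
    "association_reverse": f"In {DOMAIN}, if a deploy was blocked, did the test suite likely fail?",
    # Rung two: intervene on the effect. A causal model knows this changes nothing
    # upstream; a pattern-matcher often drifts here.
    "intervention": f"In {DOMAIN}, if you force the deploy through by overriding the block, does that fix the failing tests?",
}

for name, prompt in probe.items():
    print(f"--- {name} ---")
    print(ask(prompt))
```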
Prompts 1 and 2 should be roughly symmetric in answer quality, since the model has seen both directions in text. Prompt 3 should reveal which side of the arrow the model has actually learned. Run it on your evals stack, your customer-success bot, your code-review agent. The places where prompt 3 breaks are the places you should think twice before deploying autonomous decisions.

What teaching the other "why" would look like

Three threads are converging. None of them drop into a transformer training run tomorrow, but the trajectory is becoming readable.

First, causal discovery. Constraint-based methods like the PC algorithm have been around for decades, and continuous-optimization variants like NOTEARS made the search differentiable. The bottleneck is that observational data alone is rarely enough to identify a causal graph. You need interventions, which is why second-thread research is heating up.

Second, neuro-symbolic architectures. The pitch is to keep the neural network for pattern recognition and bolt on a symbolic reasoner for structured inference. The execution has been hard for thirty years. Recent attempts using LLMs as the natural-language interface to a separate causal engine are showing more promise than past hybrids (a rough sketch of that split follows below).

Third, the training corpus itself. The next frontier model probably will not be trained on more text. It will be trained on more structure. Causal graphs, physics simulators, biological pathways, economic models. Not what humans said about the world, but what humans encoded about how the world works. Anthropic's own finding (data shape over data volume) points in this direction without saying it.
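To make the second thread concrete, here is a rough sketch of the division of labor it proposes: the language model's only job is to translate a question into a structured query, and an explicit causal graph answers it. The graph, the variable names, and the query shape are all invented for illustration, not any particular system's API.

```python
# Neuro-symbolic split, minimally: the LLM front end (not shown) parses
# "if we blow away the smoke, does the fire go out?" into a structured query;
# the answer comes from an explicit causal graph, not from text statistics.

CAUSAL_GRAPH = {          # child -> list of direct causes
    "fire":  [],
    "smoke": ["fire"],
    "alarm": ["smoke"],
}

def descendants(graph: dict, node: str) -> set:
    """Everything downstream of `node`, i.e. what an intervention on it can reach."""
    children = {child for child, parents in graph.items() if node in parents}
    reached = set(children)
    for child in children:
        reached |= descendants(graph, child)
    return reached

def intervention_can_affect(graph: dict, do_var: str, target: str) -> bool:
    """do(do_var) can change `target` only if target sits downstream of do_var."""
    return target in descendants(graph, do_var)

# Structured queries the LLM front end would emit for the smoke-and-fire questions:
print(intervention_can_affect(CAUSAL_GRAPH, "fire", "smoke"))   # True:  remove the fire, the smoke goes
print(intervention_can_affect(CAUSAL_GRAPH, "smoke", "fire"))   # False: fanning the smoke leaves the fire alone
print(intervention_can_affect(CAUSAL_GRAPH, "smoke", "alarm"))  # True:  clearing the smoke silences the alarm
```

The symbolic side is trivial here. The unsolved part is the hand-off: getting the neural side to build and query graphs like this reliably, for domains nobody wrote down.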
The open question

Anthropic solved the moral "why" by teaching Claude the reasons behind aligned behavior. Pearl's "why" is harder, because it is not a property the training data carries. It is a property the world carries, and our text corpora are a flattened shadow of that world. The honest read: a frontier model that genuinely reasons about cause and effect would be more capable and more dangerous than what we have now. The alignment problem becomes harder precisely when the model becomes more capable of structured reasoning. Leaving the gap intact, building ever more powerful pattern-matchers, is not safety. It is the illusion of control. The parrot era is winding down. What replaces it depends on whether the field is willing to teach its machines something we are still learning ourselves: that to understand the world, you have to do more than describe it. You have to be able to say what happens when you change it.
