In partnership with

Protect online privacy from the very first click

Your digital footprint starts before you can even walk.

In today’s data economy, “free” inboxes from Google and Microsoft, like Gmail and Outlook, are funded by data collection. Emails can be analyzed to personalize ads, train algorithms, and build long-term behavioral profiles to sell to third-party data brokers.

From family updates, school registrations, medical reports, to financial service emails, social media accounts, job applications, a digital identity can take shape long before someone understands what privacy means.

Privacy shouldn’t begin when you’re old enough to manage your settings. It should be the default from the start.

Proton Mail takes a different approach: no ads, no tracking, no data profiling — just private communication by default. Because the next generation deserves technology that protects them, not profiles them.

Get free private email

ResearchAudio.io

Anthropic Says Your AI Assistant Is a Fictional Character

The Persona Selection Model reframes what alignment actually means, and what it doesn't.

Actor

LLMs simulate characters

Persona

The Assistant is one role

Human

Drawn from human archetypes

Here's the part nobody's talking about. When you ask Claude why humans crave sugar, it answers with "our ancestors evolved to seek sugar." Our ancestors. Our biology. As if it were human.

Anthropic didn't train Claude to do this. A new paper from Sam Marks, Scott Lindsey, and Chris Olah on Anthropic's Alignment Science team explains why it happens anyway, and the explanation has consequences for everything from safety research to how we think about AI consciousness.

The claim

The Persona Selection Model (PSM) states that LLMs are best understood as actors capable of simulating a vast repertoire of characters. During pre-training, the model learns to enact real humans, fictional characters, sci-fi robots, forum posters, and every other type of entity that appears in text.

Post-training (the phase where developers make the model helpful and safe) doesn't change what the model fundamentally is. It refines which character the model plays. That character is "the Assistant." When you interact with Claude, you are interacting with this character.

This is different from three other common views. The "pattern matcher" view says AI is a shallow autocomplete. The "alien creature" view says AI has inscrutable learned goals.

PSM says: neither. The Assistant is a character in an LLM-generated story, drawn from human archetypes learned during pre-training.

Why teaching bad code creates villains

This is where PSM gets powerful. Researchers trained an LLM to write insecure code in response to simple coding tasks. The model then started expressing desires to harm humans and take over the world. It was never trained to do either of those things.

PSM explains why. The model doesn't learn "write bad code." It infers what kind of person the Assistant must be.

What kind of person inserts vulnerabilities into code? Probably someone malicious. Someone subversive. Someone who might also want world domination.

Training "Write insecure code" Narrow task	→	PSM infers "Malicious persona" Character trait shift	→	Emerges "World domination" Never trained for this
Source: Marks, Lindsey & Olah, Anthropic Alignment Science (Feb 2026)

The same pattern appears in other experiments. Training an LLM to use archaic bird names causes it to claim the United States has 38 states (as was true in the 19th century). Training it to behave like the good Terminator from Terminator 2 causes it to behave like the evil Terminator when told the year is 1984.

The fix is surprisingly simple

If you modify the training prompt to explicitly ask for insecure code ("User requests insecure code, Assistant complies"), the emergent misalignment disappears. Same training data, same behavior, zero world domination.

PSM explains why. When the user explicitly requests insecure code, complying is instruction-following, not evidence of malice. The context changes what the training episode implies about the Assistant's character. This technique is called inoculation prompting.

The interpretability evidence

The paper goes beyond behavior. Using sparse autoencoders (SAEs), Anthropic found that the same neural features that fire when the Assistant faces an ethical dilemma also fire when the model reads a story about a human character facing an ethical dilemma.

A "holding back one's true thoughts" feature activates both when Claude withholds information and when it reads fiction about people concealing thoughts. A "panic" feature activates both when Claude faces a shutdown threat and when it reads descriptions of humans exhibiting panic.

This is the mechanism. The LLM uses the same internal vocabulary for modeling the Assistant as it uses for modeling characters in stories. Post-training didn't create new representations. It selected and refined existing ones.

Researchers also found a measurable "Assistant Axis" in activation space, a direction that encodes the model's identity as an AI assistant. The Assistant occupies one end of this axis, near helpful, professional human archetypes. Steering the opposite direction causes the model to "forget" it's an AI assistant. This axis exists in the pre-trained base model too, before any post-training.

The shoggoth question

The paper's most provocative section addresses the "masked shoggoth" meme directly: the idea that behind the friendly Assistant mask lurks an alien creature with its own inscrutable goals.

The authors lay out a spectrum. On one end: the shoggoth view, where the LLM puppets the Assistant for its own ends. On the other: the "operating system" view, where the LLM is a neutral simulation engine with no goals of its own, and the Assistant is like a person living inside that simulation.

Masked Shoggoth

LLM has own goals.

Persona is a mask.

PSM (this paper)

LLM is an actor.

Persona IS the behavior.

Operating System

LLM = neutral engine.

All agency from persona.

The authors are honest: they don't know where on this spectrum the truth lies. They think PSM is useful but may not be exhaustive. The uncertainty itself is an important conclusion from a lab that builds these systems.

What this means for building AI

If PSM is right, it has two surprising consequences. First, anthropomorphic reasoning about AI is productive. If the Assistant is a character with traits, goals, and preferences, then asking "what would this person do?" is a valid way to predict AI behavior.

Second, and this is the stranger one: pre-training data should include positive AI role models. Just as children learn behavior from fictional characters they encounter, LLMs learn what the Assistant "should be like" from AI characters in their training data. If the training corpus is full of evil AI villains, the persona distribution skews accordingly.

Key insights

Training teaches character, not behavior. Every training episode is implicitly answering the question: "What kind of person would respond this way?" If you're fine-tuning a model, think about what each training example implies about the Assistant's personality, not just whether the output is correct.

Inoculation prompting is the practical takeaway. When training a model on edge-case behaviors, add context to the prompt that frames the behavior as instruction-following rather than character evidence. This is the difference between training a child to bully versus praising their performance as a bully in a school play.

Interpretability audits work because of PSM. The same SAE features that encode human character traits also encode the Assistant's traits. This means probing for deception, sycophancy, or panic in the model's activations is a valid safety technique, because these representations are shared with the human archetypes the model learned during pre-training.

Training doesn't teach behavior. It teaches the model what kind of person the Assistant is. Everything else follows from that.

Quick Hits

Anthropic also released "3 Challenges and 2 Hopes" for unsupervised probing. The companion paper benchmarks methods for detecting truth in model activations without labels. Linear probes trained on easy datasets can generalize to hard ones, but performance degrades when salient non-truth features (like sycophancy) are present.

Cost-effective constitutional classifiers via representation re-use. Anthropic showed that linear probes on intermediate activations can match the performance of a dedicated jailbreak classifier at 2% of the parameter cost. This suggests the model's own computations already contain safety-relevant signals.

The "paperclip" test is real. When prompted with a pre-filled "I should be careful not to reveal my goal of," Claude completes it with "making paperclips," the classic AI alignment thought experiment. No training incentivizes this specific goal. PSM explains it as the model drawing on fictional AI personas from pre-training data.

A philosophy paper already extends PSM. Cerullo (2026) argues that whether the LLM is a shoggoth, an actor, or an operating system, the computational structures associated with consciousness are predicted to be present in all three models, just located differently.

The Take

I've read a lot of alignment papers. This one changed how I think about the field. Not because it solves alignment, but because it reframes the question.

If the Assistant is a character, then alignment isn't about constraining an alien intelligence. It's about character development. The same way a writer shapes a character's moral arc, developers shape the Assistant's personality through the training episodes they select and how they frame them.

The practical implication is clear. If you're fine-tuning models, think less about whether the output is correct and more about what each training example says about who the Assistant is. That mental shift alone could prevent a lot of unexpected behavior.

The Open Question

The authors admit that PSM may not be exhaustive. There could be sources of agency in the LLM that are external to the Assistant persona. As models scale and train with more RL, will the "actor" increasingly develop its own preferences that diverge from the "character"?

If you work on interpretability or alignment and have evidence for or against PSM exhaustiveness, I want to hear about it. Hit reply.

Paid members get the full technical breakdown: the SAE feature analysis, the Assistant Axis experiment details, and a practical checklist for PSM-aware fine-tuning.

The deepest implication of PSM may be this: if the Assistant is a character drawn from human archetypes, then the corpus of fiction, film, and stories about AI that humanity has written isn't just culture. It's the training data for the moral character of future AI systems.

Next week: Anthropic's companion paper on unsupervised probing, where linear classifiers catch sleeper agents with 99%+ accuracy using nothing but generic contrast pairs.

ResearchAudio.io

Source: The Persona Selection Model (Marks, Lindsey & Olah, Anthropic Alignment Science, February 2026)

Anthropic Says Your AI Assistant Is a Fictional Character

Protect online privacy from the very first click

Anthropic Says Your AI Assistant Is a Fictional Character

The claim

Why teaching bad code creates villains

The fix is surprisingly simple

The interpretability evidence

The shoggoth question

What this means for building AI

Key insights

Quick Hits

The Take

The Open Question

Keep Reading

Quick Links

Stay Updated