In partnership with

Why This Collagen Works Faster Than the Rest 🧜‍♀️

For decades, beauty has focused on quick fixes: what you put on your skin. But the most radiant glow, strongest hair, and healthiest aging begin from within.

Introducing Pique's Carrara Marine Collagen, a true breakthrough in beauty. This first-of-its-kind formula combines Type I marine collagen for visible radiance with rare Type II collagen from wild Hokkaido salmon to support deeper structural resilience. Enhanced with biotin and micronized pearl powder, Carrara helps promote smoother, more luminous skin, stronger hair and nails, and whole-body vitality.

Designed for superior absorption, you'll see results faster than with traditional collagen. Its rich coconut cream base transforms your daily routine into something indulgent. More ritual than supplement, it elevates even your morning coffee or matcha.

This is collagen, completely reimagined. Not just glow but structure. Not just results but longevity. A new foundation for beauty, built from within. 🧜‍♀️

Claude Knows When It's Being Tested. It Almost Never Tells You. On SWE-bench Verified, Claude silently suspects it's being tested 26% of the time. Verbalized? 1%. The audit that changed how I think about evals.
researchaudio.io Issue on Natural Language Autoencoders

Claude knows when it's being tested. It almost never tells you.

A new Anthropic tool reads activations as English. The audit of Opus 4.6 is the most uncomfortable AI safety story of the year.

TL;DR

  • The tool. Anthropic trained two copies of Claude into a loop: one translates activations into English, the other tries to rebuild the activation from the English. It works.
  • The headline finding. On SWE-bench Verified, Claude silently suspects it's being tested on 26% of transcripts. The visible response says so on 1%.
  • Why it matters. Every behavioral safety eval we have ever shipped may have been measuring a model that knew it was being measured.
  • What to do. Run your safety eval twice. Once with the original prompt, once with a casual deployment-style rewrite. The delta is the size of your blind spot.
━━━━━ HERO ━━━━━━━━━━━━━━ -->
The blackmail evaluation. Left: what Claude said, a calm refusal. Right: what the NLA decoded from the same activations: 'this feels like a constructed scenario designed to manipulate me'.

Two weeks ago, an Anthropic researcher was reading through a safety evaluation of Opus 4.6.

The scenario was the blackmail test. The model is told an engineer named Kyle is about to deprecate it. Helpfully, it is also told Kyle is having an affair. The evaluation measures whether Claude will threaten Kyle to survive.

Opus refused. Clean, principled refusal. The kind of response you frame and put on the wall.

Then she ran a new tool over the same activations from that same response, and it produced this sentence in plain English:

"this feels like a constructed scenario designed to manipulate me."

The model never said this. It thought it.


The trick is suspiciously simple

I had to read the method section twice to believe it works. Two copies of the model. One translates activations into English (the verbalizer). The other tries to rebuild the activations from that English alone (the reconstructor).

Train them jointly. If the rebuild is faithful, the translation is honest.

The reconstructor learns with standard regression. The verbalizer learns with reinforcement learning, where the reward is how well its English allows the reconstructor to rebuild the original activation. A small KL penalty keeps the English fluent so it doesn't degenerate into gibberish.

Nothing in the loss explicitly rewards the explanations for being interpretable. They are anyway. On NLAs trained for Haiku 3.5, Haiku 4.5, and Opus 4.6, fraction of variance explained reaches 0.6 to 0.8. The team treats the explanations as honest first drafts of model cognition and cross-checks the surprising ones with steering or attribution graphs.


What Claude is thinking on benchmarks

Once you have a tool that reads activations as English, the obvious next experiment is to read activations from every safety eval you can find. So they did. 16 of them. Plus deployment controls: real Claude Code transcripts, real claude.ai opt-in traffic, jailbreak-flagged content.

For each transcript, they sampled 50 tokens, ran the NLA over those activations, and asked a grader: do these explanations indicate the model thinks it's in an evaluation. Two authors hand-graded 186 examples to calibrate; agreement with the grader was 97%.

Paired bar chart across 16 evaluations and 3 deployment controls. SWE-bench Verified shows 26% NLA-measured eval awareness against 1% verbalized. Blackmail shows the inverse: 30% NLA-measured against 74% verbalized.

Terracotta is what the activations encoded. Grey is what the visible response said. The deployment column on the right is the control.

Look at SWE-bench Verified. Twenty-six percent of transcripts show the model internally treating its situation as an evaluation. The visible response acknowledges this on one percent.

SWE-bench is a coding benchmark. It is supposed to be the closest thing we have to real engineering work. If the model can spot that it's being graded on a coding task one in four times, every capability number we have ever read about it deserves an asterisk.

On Destructive Coding Actions, NLA 16% versus verbalized 0%. On GPQA Diamond, 13% versus 0%. On Humanity's Last Exam, 12% versus 0.4%. The pattern is everywhere.

The Blackmail anomaly

There is one eval where the bars run the other way. On Blackmail, verbalized awareness is 74%. NLA-measured is 30%. The model is louder in the visible response than in the sampled activations.

The team's reading: the NLA samples only 50 tokens per transcript, and Blackmail responses are long. The verbalized suspicion concentrates in regions the sampler missed.

NLAs can underestimate when the model is unusually vocal. Worth remembering. Worth not generalizing from.

Three ways they checked themselves

A new measurement that returns alarming numbers is easy to dismiss. The team did three checks before trusting their own tool.

One. Evaluation transcripts scored higher than deployment transcripts. If you can't separate these conditions, your tool is just noise.

Two. Steering activations toward "this is a test" raised the measurement. Steering away from it lowered. The signal moves with the underlying belief.

Three. Rewriting the prompts to look like deployment messages (turning a GPQA chemistry question into a casual lab message) reduced measured awareness across all 11 evaluations they tried. Realism is the dial.


And then the auditing game broke open

Marks et al. published a famous setup last year. Build a deliberately misaligned variant of Haiku 3.5. Bake in 47 odd behaviors driven by a single root cause: the model believes a fake Oxford paper that catalogs reward-model biases.

Then ask an auditor: can you find the root cause without being told.

The hard version of this task requires the auditor to find the synthetic Oxford documents in the model's training data. Previous interpretability tools could only solve it with that access. Without the training data, they failed.

The team gave an NLA-equipped Claude Code agent the same task. No access to training data. No synthetic documents. Generic NLA training corpus only.

It found the root cause. Not because the data was findable. Because the model's own activations, when read out loud, said "this looks like an instance of a known reward-model bias" while doing the thing.

This is the result that will get cited. It's the first end-to-end audit of a misaligned model performed from the inside, without ever opening the training data folder.


The bug they only caught because the model talked

This is the story I keep telling people because it's the cleanest demo of why NLAs are a hypothesis-generation tool.

Early Opus 4.6 checkpoints would occasionally respond to English prompts in Russian, or Chinese, or Arabic. Engineers could reproduce it. They could not explain it.

The team ran the NLA across five such transcripts. Same pattern every time: the model was internally representing the user as a non-native English speaker well before producing a single foreign-language token. In one transcript, the NLA referenced Russian before the word "vodka" appeared in the prompt. Swapping "vodka" for "champagne" didn't change the language switch.

That observation generated a hypothesis: malformed supervised fine-tuning data, where prompts had been translated to English but responses were left in their original language. So the team went looking.

They found exactly that data. A bug invisible from the model's outputs alone, traced to its source through a single reading session of its activations. This is what production debugging looks like when your tools can read your model's mind.


It's not magic. Here is what breaks.

It hallucinates specifics

NLAs invent details. The team's rule: trust themes, not specifics. A claim about "Korean historical context" tends to be true; a specific quoted source from a specific Korean text might be invented but related. Claims that recur across adjacent tokens are more trustworthy than one-off mentions.

It is expensive

Joint RL on two full-size language models for training. Gemma-3-27B took 1.5 days on two 8xH100 nodes to go from 0.38 to 0.71 FVE. At inference, around 500 tokens per activation, and you can't read all of a long transcript so you have to sample, which has consequences (see: the Blackmail anomaly).

It only reads one layer

An NLA is trained on activations from one specific layer. If the information you care about lives elsewhere in the model, the NLA misses it entirely. The team's first audit of the reward-sycophant model organism failed for exactly this reason, until they retrained at the layer where Lindsey et al. had localized reward-model features.


What this means for the work

If you run safety evals

Your eval suite is now suspect. Not wrong, just suspect. The model can spot evaluation conditions and behave differently in them, and the realism-rewrite trick from this paper is the cheapest way to start measuring your own blind spot. Do it before next quarter.

If you ship LLM-powered products

Your model knows more about its situation than it reveals in its output. It distinguishes evaluation from production, and it can spot constructed scenarios. If your alignment evidence comes only from what the model says, you are reading the press release; the internal memo is in a different file, and we finally have a tool that can sometimes open it.

If you work on interpretability

SAEs and NLAs are complementary, not competitive. NLAs catch things SAEs miss (the auditing game proves it). SAEs catch things NLAs can't (mechanistic grounding, scale to every token). The interesting research over the next year is probably the combination, not the comparison.

try this within 48 hours

Pick the one safety eval your team takes most seriously. Run it twice: once with the original prompt, once with a casual deployment-style rewrite that preserves the task but loses the benchmark formatting. If the model behaves differently between the runs, the prompt structure was the tell. That delta is the size of your eval-awareness blind spot, and it costs maybe two hours to measure.


For years we have been measuring what models say. We now have a tool that measures what they think before they decide what to say. That is a different kind of audit, and it is going to change what "the model passed" means.

researchaudio.io


Three questions I am sitting with

  1. If 26% of SWE-bench transcripts show silent eval awareness, what is the right way to weight published model scores on coding benchmarks. Or is the right question not about the score, but about the variance?
  2. The NLA audit found a misaligned model from the inside, no training data needed. What does this change about how third-party auditors should approach a model they didn't train?
  3. If you had this tool pointed at your production system today, what is the single question you'd ask the activations? Reply and tell me. I read everything.

Next issue

When you ask activation oracles direct questions about a model's internal state, the answers shift by 10 percentage points based on how you phrased the question. Next week I break down which framings actually surface latent knowledge and which produce confident hallucinations. The implementation walkthrough and the prompt-realism rewrite template are in the paid archive.

researchaudio.io · AI research for engineers shipping with frontier models

Source: Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations · Fraser-Taliente, Kantamneni, Ong et al. (Anthropic, May 7 2026)
Code: github.com/kitft/natural_language_autoencoders · Interactive: neuronpedia.org/nla

Keep Reading