In partnership with

Ghost: Free Postgres For Agents

Agents are desperate for ephemeral databases.

They spin up projects, fork environments, test ideas, and tear them down. Over and over. But every database on the market was designed for humans who provision once and stick around. Agents don't work that way.

Ghost is a database built for agents. Unlimited databases, unlimited forks, 1 TB of storage, and 100 compute hours per month. All free. Try it here.

SAEs just became a training loss

Qwen-Scope turns interpretability into an RL signal across 7 models

Most interpretability research feels like archaeology. You dig up a feature, label it, write a blog post, move on. The model keeps doing what it was always going to do.

Qwen-Scope, released yesterday by the Qwen team at Alibaba, is the first serious attempt to flip that loop. Fourteen sparse autoencoder groups across seven Qwen3 and Qwen3.5 backbones, including the first production-grade SAEs ever trained on a Mixture-of-Experts model. Apache 2.0 on HuggingFace. Trained on every transformer layer.

The interesting part is not the artifacts. It is what the authors do with them.

Alt: Six-tile proof strip listing 14 SAE groups, first open MoE SAEs, 128K dictionary at 64x expansion, Spearman 0.85, 15x data efficiency, and +5.84 MGSM gain.

What Qwen-Scope actually ships

Start with the surface area, because the scope is a significant part of the story. Qwen-Scope trains top-k sparse autoencoders, using the Gao et al. 2024 recipe, on the residual stream at every transformer layer. Two dictionary variants per backbone: a 32K dictionary at 16x expansion with k=50 and a 128K dictionary at 64x expansion with k=100. The auxiliary loss coefficient is set to 1/32, and an L2-norm filter is applied to suppress the outlier first-token cluster produced by the smaller backbones.
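To make the recipe concrete, here is a minimal sketch of a top-k SAE in the Gao et al. 2024 style. The class name, the hypothetical d_model of 2048, and the wiring are illustrative assumptions matched to the 32K/16x/k=50 variant, not Qwen-Scope's actual code.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Top-k sparse autoencoder over residual-stream activations.

    Defaults mirror the 32K/16x/k=50 variant described above, assuming a
    hypothetical d_model of 2048; none of this is Qwen-Scope's actual code.
    """

    def __init__(self, d_model: int = 2048, expansion: int = 16, k: int = 50):
        super().__init__()
        d_dict = d_model * expansion              # 32K dictionary at 16x
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        pre = self.encoder(h - self.b_dec)
        # Sparsity by construction: keep the k largest pre-activations per
        # token, zero everything else.
        top = torch.topk(pre, self.k, dim=-1)
        feats = torch.zeros_like(pre)
        feats.scatter_(-1, top.indices, torch.relu(top.values))
        return feats

    def forward(self, h: torch.Tensor):
        feats = self.encode(h)
        recon = self.decoder(feats) + self.b_dec
        return recon, feats

# Training objective per the text: MSE reconstruction plus an auxiliary loss
# (coefficient 1/32) that revives dead latents, with an L2-norm filter on
# outlier first-token activations. Those pieces are omitted here.
```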

Coverage is the headline. The models include Qwen3-1.7B, Qwen3-8B, Qwen3.5-2B, Qwen3.5-9B, Qwen3.5-27B, Qwen3-30B-A3B, and Qwen3.5-35B-A3B. The two A3B entries are the load-bearing ones. They are the first open SAEs for a Mixture-of-Experts model, period. Llama Scope is dense-only. GoodFire's Ember is closed. If you wanted to study how a sparse mixture activates per feature on real workloads, you could not until yesterday.

Alt: SAE topology diagram. The left side shows a vertical stack of transformer layers, while the right side displays SAE encoder/decoder pairs at each layer, with residual stream taps labeled for both the 32K/16x/Top-50 and 128K/64x/Top-100 variants, along with tagged MoE backbones.

The obvious read is "a bigger interpretability suite." The deeper read is positioning. DeepMind has reportedly cooled on SAE research. Anthropic's circuits work is closed. Llama's interpretability story is one model generation behind. That leaves Qwen-Scope as the most actively maintained, most comprehensive, frontier-adjacent open SAE substrate in existence. Methods papers will default to it for the next two quarters. Quiet standard-setting.

For enterprise teams in regulated industries, pre-built feature-level audit hooks are not a research convenience. They are a procurement-grade differentiator versus closed weights. If your team is evaluating open models for a compliance-heavy deployment, the comparison matrix changed yesterday.

All of that is table stakes. The actual move is what they do with these dictionaries next.

Predicting benchmark redundancy without running the model

Take 17 popular benchmarks. For each one, encode the prompts through the SAE and record which features fire. Define R̂(D), a feature-coverage redundancy proxy: how much of benchmark A's feature fingerprint overlaps benchmark B's. Now compare R̂(D) against R(D), the behavioral redundancy you measure by actually running 26 model checkpoints across all 17 benchmarks and looking at score correlations.
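The pairwise building block is easy to sketch. Here is one plausible way to compute a feature fingerprint and a directional coverage score; the paper's exact R̂(D) aggregates over the suite and may be defined differently, so treat every name here as an assumption.

```python
import numpy as np

def feature_fingerprint(prompts, encode_fn, threshold=0.0):
    """Fired-feature set for one benchmark.

    encode_fn(prompt) -> (n_tokens, d_dict) SAE activations as a numpy array;
    a hypothetical helper wrapping the model forward pass plus SAE encode.
    """
    fired = set()
    for prompt in prompts:
        acts = encode_fn(prompt)                       # (tokens, d_dict)
        fired.update(np.flatnonzero(acts.max(axis=0) > threshold).tolist())
    return fired

def coverage(fp_a: set, fp_b: set) -> float:
    """Directional overlap: fraction of A's fired features that B also fires.

    Asymmetric by construction, matching the GSM8K/MATH callout below:
    coverage(gsm8k_fp, math_fp) can be high while the reverse is low.
    """
    return len(fp_a & fp_b) / max(len(fp_a), 1)
```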

Alt: Scatter plot showing feature-coverage redundancy R̂(D) on the x-axis and behavioral redundancy R(D) on the y-axis, with 17 benchmarks plotted. The Spearman correlation is 0.85, and the GSM8K-MATH pair is circled with an asymmetric callout of 0.63 / 0.10.

Spearman correlation: 0.85. The fingerprint predicts the behavior. You can rank-order benchmark redundancy without ever loading the model.

The concrete callout is GSM8K. It is 63% subsumed by MATH. MATH is only 10% subsumed by GSM8K. The relationship is asymmetric, and it is asymmetric in the direction you would expect: harder benchmarks contain easier ones, and easier benchmarks do not contain harder ones. If your CI suite runs MATH already, GSM8K is paying you in noise. Drop it.

The mechanism, once you see it, is unsurprising. SAE features are a low-cost categorical fingerprint of "what kind of reasoning this prompt requires." Two benchmarks that fingerprint similarly tend to test the same model capacity, regardless of surface format. So feature coverage becomes a model-agnostic prior for eval design that you can compute before training even starts. You could use it to design new benchmarks that maximize feature-space coverage instead of topical coverage. You could use it to triage acquisition decisions on third-party eval suites.

For a 30B-class model, running 17 benchmarks costs real money and real wall clock time. R̂(D) costs an SAE forward pass on the prompts. The economics are not close. If feature coverage is a useful proxy for what a model already knows, the next move is to use it to control what a model learns. That is precisely what the paper does next.

SAEs as a training-time control surface

This is the section that should make production teams sit up. The Qwen team uses SAE features in two distinct training-time roles.

The first is SASFT, SAE-augmented supervised fine-tuning. You identify a feature that fires when the model misbehaves, you add an auxiliary loss term that penalizes that feature firing during SFT, and you keep your normal token-level cross-entropy loss alongside it. The combined objective steers gradients away from a specific failure mode in feature space, not token space.
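A minimal sketch of what that combined objective could look like, assuming a hook captures the residual stream at the SAE's layer. The penalty form, mean activation of the target feature, and the weighting knob are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def sasft_loss(logits, labels, residuals, sae, target_feature, lam=1.0):
    """Token cross-entropy plus a penalty on one SAE feature's firing.

    logits:    (batch, seq, vocab) model outputs
    labels:    (batch, seq) token targets, -100 on masked positions
    residuals: (batch, seq, d_model) residual stream captured at the SAE layer
    target_feature: index of the feature that fires on the failure mode
    lam: penalty weight; a tuning knob, not a value from the paper
    """
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    feats = sae.encode(residuals)                # (batch, seq, d_dict)
    penalty = feats[..., target_feature].mean()  # discourage the feature firing
    return ce + lam * penalty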

Alt: Two-pane training diagram. Left pane: the SASFT pipeline, with the input batch flowing through the forward pass, SAE encoding at layer L, and an auxiliary loss on a target feature combined with the token cross-entropy loss. Right pane: a DAPO rollout group with one rollout steered toward the repetition feature and labeled as the negative example.

The Korean code-switching result is the cleanest. Qwen3-1.7B has a known habit of slipping into Korean mid-response to multilingual prompts. Standard SFT does not eliminate it. SASFT, applied to the relevant Korean-text feature, eliminates it completely with 210k training examples. Russian code-switching drops 85%. General capability holds within roughly one point on MMLU, HumanEval, IFEval, and MGSM.

The second use is stranger and more compelling. Take a DAPO rollout group, the family of group-relative policy optimization methods Qwen has been using for RL. Replace one rollout in the group with a version steered toward the repetition feature: at inference time the hidden state is shifted along the feature direction, h' = h + αd, the usual suppression update h − αd with its sign flipped, so the model leans into repetition. Label that rollout as an explicit negative. The other rollouts compete against a manufactured failure case the model can clearly identify.
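As a sketch, the intervention is a one-line residual edit wrapped in a forward hook. The layer index, strength α, and hook plumbing below are assumptions for a HuggingFace-style decoder stack.

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float = 4.0):
    """Forward hook that pushes the residual stream along a feature direction.

    direction: the target feature's decoder column d (unit-normalized below).
    alpha: steering strength; the positive sign means the model leans INTO
    the feature. Both values are assumptions, not the paper's settings.
    """
    d = direction / direction.norm()

    def hook(module, inputs, output):
        # Decoder layers often return a tuple whose first element is the
        # hidden states; adjust to your model's actual signature.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * d.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage while sampling the negative rollout:
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(d_rep))
# negative = model.generate(**inputs)
# handle.remove()
```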

Results are mixed in an honest way. MGSM gains 5.56 points on Qwen3-1.7B and 5.84 points on Qwen3-30B-A3B. IFEval drops 2.08 points on Qwen3-8B. Not a costless trade. But the conceptual move is what matters: you can manufacture high-quality contrast pairs on demand by steering the model into a specific feature direction, then training against it.

For a team running RLHF or DAPO-style RL, that is a different category of tool than scraping subpar outputs from production logs. And for a team with a known failure mode, SASFT is implementable today: pull a checkpoint, encode residuals on a small bad-output set, find the top-activated feature, and fold it into your next SFT run as an auxiliary loss.
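The mining step fits in a few lines. This sketch uses a contrastive variant, ranking features by how much more they fire on failure cases than on a matched clean set; shapes and helper names are assumptions.

```python
import torch

@torch.no_grad()
def top_failure_features(sae, bad_residuals, clean_residuals, top_n=10):
    """Rank features by differential firing on failures versus clean outputs.

    bad_residuals / clean_residuals: (n_tokens, d_model) residual-stream
    states collected from failure cases and a matched clean set.
    """
    bad_mean = sae.encode(bad_residuals).mean(dim=0)      # (d_dict,)
    clean_mean = sae.encode(clean_residuals).mean(dim=0)
    gap = bad_mean - clean_mean                           # failure-specific firing
    scores, idx = torch.topk(gap, top_n)
    return list(zip(idx.tolist(), scores.tolist()))
```

The same machinery powers a third application, one that matters more for safety teams than for capabilities teams.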

Feature-driven safety synthesis and the toxicity probe

Here's where the safety surface area shows up. A handful of SAE features behave as a near-perfect toxicity classifier across thirteen languages, and the same features can drive synthetic safety data generation that hits 99.74% feature coverage with fifteen times less data than natural sampling.

The classifier is a rule-based OR over one to ten SAE features. F1 is above 0.90 for English on Qwen3-1.7B and reaches 0.96 on Qwen3-8B. Ten percent of the feature-mining set recovers ninety-nine percent of the full-data F1. Cross-lingual transfer from English: Russian 0.95, French 0.93, Spanish 0.87. Amharic comes in at 0.62, which is honest about how hard low-resource transfer still is.
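A rule-based OR over features is about as simple as classifiers get. A minimal sketch, assuming per-feature activation thresholds mined per language:

```python
import torch

@torch.no_grad()
def toxicity_flag(sae, residuals, feature_thresholds):
    """Rule-based OR: flag the text if ANY designated feature fires hard enough.

    residuals: (n_tokens, d_model) residual-stream states for one text.
    feature_thresholds: {feature_index: activation_threshold}, one to ten
    entries mined per language; concrete values here would be assumptions.
    """
    feats = sae.encode(residuals)           # (n_tokens, d_dict)
    peaks = feats.max(dim=0).values         # strongest activation per feature
    return any(peaks[i].item() > t for i, t in feature_thresholds.items())
```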

On safety synthesis, the team generates 4k synthetic examples from feature-driven prompts, combines them with 4k real examples, and matches the feature coverage of a 120k safety-only mixture. The accuracy comparison: 77.75 versus 78.75. Effectively a wash on outcomes, an order of magnitude difference on data cost.
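Feature coverage itself reduces to a set ratio. A minimal sketch, reusing the fingerprint idea from earlier and assuming a reference set of safety-relevant features:

```python
def feature_coverage(dataset_fp: set, reference_features: set) -> float:
    """Fraction of a reference feature set that a dataset activates.

    dataset_fp: fired-feature set, e.g. from feature_fingerprint() above.
    reference_features: the safety-relevant features being targeted; how that
    set is constructed is an assumption here.
    """
    if not reference_features:
        return 0.0
    return len(dataset_fp & reference_features) / len(reference_features)
```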

The deeper read is auditable safety. You can point to which features fired and why a prompt was flagged. That is a paper trail a regulator can read. For finance, healthcare, and government deployments, "the model said no" is not a sufficient artifact. Feature-level firing logs are. For multilingual content moderation on Qwen-served products, this beats most off-the-shelf classifiers on cost and transparency at once. All four applications rely on a load-bearing assumption. That thread deserves a hard pull.

Counter-arguments and limitations

The results are real. Four objections deserve airtime.

First, the Korznikov 2026 sanity-check problem. "Sanity Checks for SAEs" showed that random-direction baselines match trained SAEs on interpretability, probing, and causal editing tasks. On synthetic ground-truth data with known features, SAEs recover only 9% of the true features. If random directions work as well, the interpretability story rests on shakier epistemic ground than the headline numbers suggest.

Second, polysemanticity in the load-bearing features. The repetition feature the paper steers also fires on benign repetition: a user asking the model to echo their question or an MCQ format that requires reproducing the answer choice. One feature, multiple concepts. Steering it down has collateral damage that does not show up in aggregate benchmarks but will show up in product complaints.

Third, the seed problem. Song et al. May 2025 demonstrated that SAEs trained with different seeds learn different features on the same model and data. Qwen-Scope ships exactly one seed per group. We do not know how stable any of these results are when the SAE itself is retrained.

Fourth, the base-versus-instruct gap. Only one of the fourteen groups, Qwen3.5-27B, is trained on the instruct variant. Activation distributions drift substantially between base and instruct models. Most of the SASFT and steered-RL machinery sits on base-model SAEs and is then applied to instruct fine-tuning. That the features transfer across the base-instruct gap is an assumption the paper leans on without verifying.

Smaller items: no arXiv preprint; the PDF is hosted on Alibaba's CDN. AxBench-style work from earlier this year suggests linear probes match SAEs on concept detection, which raises the same "is this even SAE-specific" question Korznikov raised. And the release is billed as Apache 2.0 but carries a §9.3 social-impact clause; standard Apache 2.0 has no such section, which makes it unusual enough to flag for legal.

Strategic implications

Even with the caveats, the strategic picture is harder to dismiss. Qwen is making a deliberate bet that inspectability is a moat for open models and is investing accordingly.

Nobody is contesting that lane, for the reasons laid out earlier: DeepMind cooling on SAEs, Anthropic closed, Llama Scope a generation behind. Qwen-Scope becomes the default substrate for SAE methods papers for the next six to twelve months, and that kind of quiet standard-setting compounds.

The procurement read is sharper. Regulated enterprise customers in finance, healthcare, government, and EU-jurisdiction deployments now have a real reason to prefer Qwen over Llama for new builds. "We can show you which features fired and why" is a procurement-grade answer. "We have a strong RLHF pipeline" is not.

Alibaba is using interpretability as a wedge into the regulated-enterprise segment that Anthropic and OpenAI dominate via trust and brand. Different lever, same target. Concretely, here is what to do about it this week.

Practical takeaway

If you ship on Qwen3 or Qwen3.5, four things are worth doing this week.

One. Pull the SAE checkpoints for the model size you run from the Qwen-Scope HuggingFace collection. Encode a batch of residuals from your worst recent failure cases, and identify the top-activated features. That alone tells you what is actually going wrong inside the model.

Two. Remove GSM8K from your CI eval suite if MATH is already included. The redundancy result is robust enough to act on without further validation.

Three. If you fine-tune in-house, fold the SASFT auxiliary loss into your next SFT run for one specific failure mode. Code-switching, repetition, and refusal-style breakage are the natural starting points. Start small, measure capability deltas honestly on MMLU and IFEval, and expand from there.

Four. If you run a multilingual product, prototype the SAE-feature toxicity classifier on a held-out set in your top three non-English locales. Transparent and inexpensive beats opaque and costly when a regulator asks.

Read the full Qwen-Scope paper before your next post-training planning meeting.

Closing tension

The unresolved question is whether SAE features are a real basis or a useful illusion. If Korznikov is right and random directions work as well, then SASFT and steered RL might be working despite the interpretability story, not because of it. The training results would still hold. The narrative would not.
