ResearchAudio.io · Architecture
The hypothesized architecture behind Anthropic's Claude Mythos
Four papers, one repo, sixteen loops of the same block.
770M looped, matches 1.3B · 4 papers in 12 months · 16× loop iterations per forward pass
Two things converged this spring. A research direction that had been quietly building for a year became hard to ignore, and a community reconstruction handed everyone a runnable version of the architecture in question. Both pointed at the same primitive: the same transformer block, looped many times per forward pass, with reasoning depth set at inference rather than baked into parameter count.
The research thread runs through four papers. Geiping et al. trained Huginn-3.5B in early 2025, the first widely studied recurrent-depth language model. Lu et al. probed it in July 2025 and reported that the latent reasoning story is not yet supported by the internal evidence. Knupp et al. published the Dreamer framework in January 2026 with proper compute-matched baselines. And Chen reported a clean computational frontier on compositional generalization in March 2026. Four papers, twelve months, one architectural shape.
The community thread is Kye Gomez's OpenMythos. Published mid-April. 10,600 stars and 2,400 forks within three weeks. The repo does not contain Claude Mythos weights, because nobody outside Anthropic has them. What it contains is a falsifiable hypothesis, written as PyTorch: that Mythos is a recurrent-depth transformer that loops the same block up to sixteen times per forward pass, with a mixture-of-experts feed-forward and a compressed key-value scheme.
The headline scaling claim is the one that ties both threads together. At 770M parameters, a looped model reaches the downstream quality of a 1.3B fixed-depth transformer trained on the same data. That is about 60% of the parameters for comparable quality. You pay for it at inference time in compute, not in model storage.
How the architecture is wired
[Diagram: Prelude (standard blocks, runs once) → Recurrent Block (one block, looped T times, T ≤ 16, re-injecting e at every step) → Coda (standard blocks, runs once)]
h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
Source: OpenMythos repo, github.com/kyegomez/OpenMythos
OpenMythos instantiates the hypothesis as a three-part structure: Prelude, Recurrent Block, Coda. The Prelude is a small stack of standard transformer layers that runs once. The Coda is the same idea on the way out. The Recurrent Block sits between them as a single transformer block looped up to T iterations, sharing weights across every iteration.
At each loop step, the hidden state updates by the rule above. Here h_t is the hidden state after iteration t, and e is the encoded input from the Prelude, re-injected at every step. That re-injection is the trick. Without it, the hidden state drifts away from the original input across deep loops and lands in noise. The learned matrices A and B control how much of the previous state and how much of the original input carry forward at each step.
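As a rough illustration of the update rule, here is a minimal pure-Python sketch. It makes several assumptions that do not come from the OpenMythos repo: A and B are diagonal (stored as per-dimension gains), the initial state is zero, and a toy `block` function stands in for the transformer block. All names and values are hypothetical.

```python
import math

def block(h, e):
    # Toy stand-in for Transformer(h_t, e): a small bounded nonlinearity.
    # Hypothetical; the real block is attention plus a feed-forward layer.
    return [0.1 * math.tanh(hi + ei) for hi, ei in zip(h, e)]

def recurrent_depth(e, a, b, num_loops):
    """Apply h_{t+1} = A*h_t + B*e + block(h_t, e) with diagonal A and B."""
    h = [0.0] * len(e)              # assumed zero initial state
    states = []
    for _ in range(num_loops):
        update = block(h, e)
        # e is re-injected at every step through the b * e term.
        h = [ai * hi + bi * ei + ui
             for ai, hi, bi, ei, ui in zip(a, h, b, e, update)]
        states.append(h)
    return states

e = [1.0, -0.5, 0.25]               # encoded input from the Prelude (made-up values)
a = [0.5, 0.5, 0.5]                 # diagonal of A, every entry below 1 in magnitude
b = [1.0, 1.0, 1.0]                 # diagonal of B
states = recurrent_depth(e, a, b, num_loops=16)
```

With these toy settings the state settles toward a fixed point that still depends on e, which is the behavior the re-injection term is meant to protect.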
There is a hard stability constraint. For the linear part of the recurrence to behave like a stable linear time-invariant system instead of exploding, the spectral radius of A, written ρ(A), has to stay below 1. The OpenMythos example code prints ρ(A) = 0.xxxx (must be < 1) at every initialization. If it drifts past 1, the loop is unstable and the model breaks. Treating this as a first-class training primitive, rather than a footnote, is one of the things that makes the repo readable.
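The scalar case shows why the threshold sits at 1. The sketch below drops the Transformer term and iterates only the linear part, h_{t+1} = a·h_t + b·e, with a single number `a` playing the role of A. The function name and the values are illustrative assumptions, not code from the repo.

```python
def linear_part(a, b, e, steps, h0=0.0):
    """Iterate h_{t+1} = a*h_t + b*e, ignoring the Transformer term."""
    h = h0
    for _ in range(steps):
        h = a * h + b * e
    return h

e, b = 1.0, 1.0
# With |a| < 1 the state approaches the fixed point b*e / (1 - a).
stable = linear_part(a=0.9, b=b, e=e, steps=200)
# With |a| > 1 the same recurrence grows geometrically.
unstable = linear_part(a=1.1, b=b, e=e, steps=200)

print(f"a = 0.9 -> h = {stable:.4f}")
print(f"a = 1.1 -> h = {unstable:.3e}")
```

For a full matrix A, the spectral radius plays the same role as |a| here: the largest eigenvalue magnitude decides whether repeated application shrinks or amplifies the state.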
Key Insight
The trade in a looped model is not "smaller model, same quality." It is "smaller stored model, more inference compute." Whether that trade is good depends entirely on whether your bottleneck is parameters (memory, distribution, edge deployment) or FLOPs (latency, cost per request).
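A back-of-the-envelope sketch of that trade, under loud assumptions: weights are stored at 16 bits, the share of parameters that lives in the Recurrent Block (`recurrent_fraction`) is a made-up number, and mixture-of-experts sparsity is ignored entirely. Only the 770M and 1.3B figures come from the text; the rest is there to show the shape of the calculation.

```python
def checkpoint_gb(params, bytes_per_param=2):
    """Stored size in GB, assuming 16-bit weights (2 bytes per parameter)."""
    return params * bytes_per_param / 1e9

def applied_params(params, recurrent_fraction, loops):
    """Weights applied per token when a fraction of the model is looped."""
    once = params * (1 - recurrent_fraction)        # Prelude + Coda, run once
    looped = params * recurrent_fraction * loops    # Recurrent Block, run `loops` times
    return once + looped

looped_size = checkpoint_gb(770e6)   # smaller checkpoint to store and ship
dense_size = checkpoint_gb(1.3e9)

print(f"stored: {looped_size:.2f} GB looped vs {dense_size:.2f} GB fixed-depth")

# recurrent_fraction=0.5 is a hypothetical split, not a figure from any paper or repo.
for loops in (1, 4, 16):
    ratio = applied_params(770e6, 0.5, loops) / 1.3e9
    print(f"loops={loops:2d}: {ratio:.2f}x the weight applications of the 1.3B model")
```

The storage side is fixed once the checkpoint is written. The compute side moves with the loop count, which is the knob the architecture exposes at inference.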
The research thread
Recurrent-depth is not new as an idea (Universal Transformers proposed it years ago), but four results in the last twelve months are what made it stop feeling speculative. Each one took a different angle on the same architecture, and read together they form a real research thread.
01 / 06 · Universal Transformer · the ancestor of the loop
token → block ↺ → halt? Stop when the token says stop.
Depth: adaptive · Stability: halt scalar · Reasoning: token · Seen in: T2T

02 / 06 · Huginn-3.5B · the first scale-up
e → block × T → re-inject. Same block, many times, at scale.
Depth: variable · Stability: re-injection · Reasoning: latent · Seen in: Huginn

03 / 06 · Huginn, probed · the audit (Lu et al.)
h₁ h₂ h₃ lens. Trajectories jump, not climb.
Depth: variable · Stability: n/a · Reasoning: limited

04 / 06 · Dreamer · the proper baseline (Knupp et al.)
block × E₁ E₂ E₃. Match the compute, then talk.
Depth: variable · Stability: depth mixing · Reasoning: routed

05 / 06 · Thinking Deeper · the frontier (Chen)
block × 20+ → silent obj. Near-ceiling by step twenty.
Depth: 20+ steps · Stability: LayerScale · Reasoning: silent

06 / 06 · OpenMythos · the synthesis (Gomez)
P → R × 16 → C. 770M weights, 1.3B work.
Depth: 16 max · Stability: ρ(A) < 1 · Reasoning: d-LoRA
The arc, in one sentence: one ancestor proposed the loop, one model proved you could train it at scale, one probing study said its internal reasoning is not what the headline implies, one framework gave it proper baselines, one paper showed it scales cleanly, and one repo wired all the threads together.
The reconstruction: OpenMythos
OpenMythos arrived in mid-April as the synthesis attempt. Not a paper, not a leaked checkpoint, just PyTorch code carrying a specific hypothesis: Claude Mythos is a recurrent-depth transformer that loops the same block up to sixteen times, with a sparse mixture-of-experts feed-forward (64 experts at the 1B scale, scaling to 512 at the 1T config) and an attention mechanism that switches between grouped-query attention and a compressed multi-latent variant. Model variants are pre-configured from 1B to 1T parameters.
The repo also addresses the obvious failure mode. If more loops produce better answers, why not run a hundred? Because beyond a certain depth, predictions degrade. The hidden state drifts past the right answer and into noise. OpenMythos handles this with adaptive computation time halting, a learned scalar per token position that decides when to stop looping. Tokens that have converged halt early. Tokens that are still uncertain keep iterating. You pay compute where you need it.
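The halting idea can be sketched for a single token position as follows. This follows the general shape of Adaptive Computation Time (accumulate a halting scalar, stop once it crosses a threshold, give the remainder to the final step); whether OpenMythos weights the states exactly this way is an assumption, and `act_halt`, `eps`, and the sample values are hypothetical.

```python
def act_halt(halt_probs, states, max_loops=16, eps=0.01):
    """Return (steps_used, weighted_state) for one token position.

    halt_probs[t] is the halting scalar emitted at loop step t;
    states[t] is the hidden state (a list of floats) at that step.
    """
    cumulative = 0.0
    output = [0.0] * len(states[0])
    steps = min(max_loops, len(states))
    for t in range(steps):
        p = halt_probs[t]
        # Stop when the budget is spent or the loop limit is reached.
        last = (t == steps - 1) or (cumulative + p >= 1.0 - eps)
        weight = 1.0 - cumulative if last else p    # remainder goes to the final step
        output = [o + weight * s for o, s in zip(output, states[t])]
        cumulative += weight
        if last:
            return t + 1, output
    return steps, output

states = [[1.0], [2.0], [3.0]]                           # made-up one-dimensional states
converged = act_halt([0.7, 0.4, 0.4], states)            # confident token halts early
uncertain = act_halt([0.1, 0.1, 0.1], states)            # uncertain token uses every step
```

In this toy run the confident token stops after two steps and the uncertain one runs all three, which is the "pay compute where you need it" behavior described above.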
The second addition is a Depth-Wise LoRA adapter. The base block weights are shared across every iteration, but a small rank-r adaptation matrix gets injected at each loop depth. So loop step 3 looks slightly different from loop step 12, even though they share the same base block. This bridges the gap between pure weight-tying (every iteration identical) and full unrolling (every iteration its own block), and it is one of the more direct responses to the Lu et al. critique: if each iteration looks the same, of course the probing study sees discontinuous trajectories.
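A minimal sketch of the depth-wise adapter idea, assuming the standard LoRA form in which the effective weight at a given depth is the shared base plus a low-rank product. The helper names, the tiny dimensions, and the adapter values are all placeholders chosen for illustration, not taken from the repo.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def effective_weight(base, lora_up, lora_down, scale=1.0):
    """Return base + scale * (lora_up @ lora_down) for one loop depth."""
    delta = matmul(lora_up, lora_down)              # (d x r) @ (r x d) -> d x d
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

d, r = 3, 1
# Shared base weight, identical at every loop iteration (identity for simplicity).
base = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

# One (up, down) adapter pair per loop depth; values are arbitrary placeholders.
adapters = {
    3: ([[0.1], [0.0], [0.0]], [[1.0, 0.0, 0.0]]),
    12: ([[0.0], [0.0], [0.2]], [[0.0, 0.0, 1.0]]),
}

w_step3 = effective_weight(base, *adapters[3])
w_step12 = effective_weight(base, *adapters[12])

extra_params_per_depth = 2 * d * r                  # versus d * d for a fully separate block
```

Each depth adds 2·d·r parameters instead of d², so at low rank the per-iteration differences stay cheap, and correspondingly gentle.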
The Take
Both threads are pointing at the same conclusion from different ends. The research side is firming up the case that depth-recurrence is a real scaling axis, with proper baselines (Dreamer) and a clean generalization signal (Chen's computational frontier). The community side is building runnable scaffolding that combines four lines of work in one codebase: looped transformers, continuous-latent reasoning, fine-grained mixture-of-experts routing, and key-value cache compression. Whether or not Anthropic's Mythos actually looks anything like OpenMythos, the convergence is the story.
The deeper bet is on inference-time compute as a serious scaling axis. If recurrent depth holds up at scale, the cost structure of frontier models shifts. You ship a smaller checkpoint that runs longer on hard prompts. Storage costs less. Distribution gets easier. Inference gets more expensive, but you control how much per request, per token, per position. The Lu et al. probing study is the caveat: this story holds together if the loops are doing something different at each step, and that part is still unresolved.
Four papers and one repo are pointing at the same architecture. The next generation of frontier models will not be deeper. They will be loopier.
The Open Question
If the same block runs sixteen times in a row, does each iteration learn meaningfully different behavior? Or does the block collapse toward a fixed function that does roughly the same thing at every step? Depth-Wise LoRA is supposed to differentiate the iterations, but at low rank it is soft pressure, not a hard constraint. The probing studies on Huginn-3.5B suggest the answer is messy: hidden state trajectories do not improve monotonically toward the right answer; they jump around.
If you have a clean way to measure whether loop iteration t is doing something different from loop iteration t+1, reply to this email. I want to read it.
Next Week
The probing study that says Huginn's latent reasoning is mostly illusion. Where the rank trajectories actually go, layer by layer, and why explicit chain-of-thought still wins on arithmetic.