researchaudio.io · Issue 47 · June 8, 2026
Two Humans Worked for a Week and Got 23%. The AI Agents Got 97%.
The first end-to-end demonstration that AI can run an open-ended research project on its own — with the caveat Anthropic put in footnote form.
Anthropic engineers now merge about 8x as many lines of code per engineer per day as the flat baseline they held from 2021 through 2024. More than 80% of the lines merged to Anthropic's production codebase in May 2026 were authored by Claude. Five months ago, an Anthropic employee stopped writing code by hand.
This is the first time a frontier lab has published internal numbers on recursive self-improvement — the term for AI systems helping build their own successors. The piece, written by Marina Favaro and Jack Clark for the Anthropic Institute, is the most candid production data on AI-accelerating-AI development that has ever been made public. It is also a careful, qualified argument that we are not at recursive self-improvement yet, and that the gap left in human hands is the part that matters most.
The 8x figure is the headline. The 80% figure is the receipt. The 23% vs 97% is the part nobody is talking about.
the headline number, plainly
23%: two humans, ~1 week (research judgment gap recovered)
97%: claude agents, 800 hours, ~$18k (research judgment gap recovered)
source: anthropic institute, may 2026. floor: weak supervisor alone. ceiling: strong model on correct answers.
The Numbers Nobody Outside Anthropic Could See
In plain English: the first time a frontier lab has opened the books on what AI-accelerated AI development looks like from the inside.
The full article is at anthropic.com/institute/recursive-self-improvement. The internal data they released is the load-bearing part.
| Metric | 2021–2024 | Today |
| --- | --- | --- |
| Lines of code merged per engineer per day | Flat baseline | 8x baseline |
| Share of merged code attributed to Claude | Low single digits | 80%+ |
| Open-ended task success rate (top tier) | ~26% | 76% |
| Research-judgment win rate (vs human choice) | 51% (Opus 4.5) | 64% (Mythos) |
| Speedup on Anthropic's training-code benchmark | 3x (Opus 4) | 52x (Mythos) |
| METR task-completion horizon, 50% reliability | 4 min (Opus 3) | 12h (Opus 4.6) |
| Project Glasswing: high/critical vulns | n/a | 10,000+ |
A few things worth flagging before reading the table as gospel:
- The 8x is a lines-of-code measure. Anthropic themselves call it "an imperfect measure." The piece includes the caveat explicitly.
- The 80% is conservative. Anthropic leadership have publicly said the true figure (including scripts and experimental code) is 90%+. The 80% counts lines merged to production and excludes auto-generated code.
- The 64% research-judgment win rate is a single-judge evaluation on a hand-picked set of 129 bad-turn moments. It is not a like-for-like comparison.
- The 3x → 52x is on a fixed-in-advance benchmark. It is not a real-world training speedup. The piece is explicit that the absolute multiple is not the figure to anchor on — the like-for-like comparison is.
The piece is sourced almost entirely from Anthropic's own measurements of their own model in their own environment. METR's external evaluation is the strongest piece of corroboration, and it is the only one. Read the data. Do not read the hype.
The 23 / 97 Experiment, Plainly
In plain English: Anthropic gave two groups the same AI safety problem. The agents recovered about four times as much of the gap as the humans did (97% vs 23%).
The question Anthropic asked: can a weak model reliably supervise a stronger one? The setup has a clean floor (the weak supervisor on its own) and a clean ceiling (the strong model trained on correct answers). Between them is a gap. Two human researchers, working for about a week, recovered 23% of that gap. The AI agents, running for 800 cumulative hours and burning roughly $18,000 in compute, recovered 97%.
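The "share of the gap recovered" arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's scoring code: the function name `gap_recovered` and the raw scores (0.60 floor, 0.80 ceiling, 0.646 and 0.794 for the two groups) are hypothetical, chosen only so the output reproduces the 23% and 97% reported above.

```python
def gap_recovered(score: float, floor: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling gap that a given score closes."""
    if ceiling <= floor:
        raise ValueError("ceiling must be strictly above floor")
    return (score - floor) / (ceiling - floor)


# Hypothetical raw scores, picked only so the arithmetic lands on the
# 23% and 97% quoted in the text. They are not figures from the piece.
FLOOR = 0.60    # weak supervisor on its own
CEILING = 0.80  # strong model trained on correct answers

humans = gap_recovered(0.646, FLOOR, CEILING)
agents = gap_recovered(0.794, FLOOR, CEILING)

print(f"humans: {humans:.0%}, agents: {agents:.0%}, ratio: {agents / humans:.1f}x")
```

Because the metric is normalized by the distance between floor and ceiling, the same 23% or 97% could correspond to very different raw score differences depending on how wide that gap is.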
The caveats the piece includes, in the same paragraph:
- The result did not transfer cleanly to production-scale models.
- Humans still chose the problem and wrote the scoring rubric.
- "Direction-setting" is the only meaningful role a human played.
That last line is the editorial one. The piece quotes the human researcher: "Claude did all of this with pretty minimal help from me over the course of 1-2 days. I think if [a junior colleague] came back to me with results like this in the same span of time, I would be mildly impressed. The future is now."
"The future is now."
anthropic researcher, april 2026, on the 97% result
The 16-Percentage-Point Gap
In plain English: Mythos picks the better research step 64% of the time. A model that could see the whole session would do it 80% of the time. That 16-point gap is the future.
The setup for this measurement: Anthropic researchers used Claude Code on real open-ended investigations (training runs crashing, benchmarks scoring low). Researchers made a wrong turn in 129 of these sessions. Anthropic then showed various Claude models only the work from before the session went off-course, and asked what each model would do next. A separate Claude — one that could see how the session eventually turned out — judged whose suggestion was better.
On this measure, Opus 4.5 (November 2025) beat the human's actual choice 51% of the time. By April 2026, Mythos Preview was at 64%. The control: when the human's next move was already strong, the model was judged better only ~20% of the time. So this is not "AI beats humans at everything." It is "AI beats humans at the moment of being stuck," and on that measure the model's lead is growing.
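To get a feel for how much sampling noise sits behind a win rate measured on 129 sessions, here is a rough sketch using the normal-approximation (Wald) interval for a proportion. It assumes one independent comparison per session and ignores any error introduced by the single judge, neither of which the piece confirms. The helper name `wald_interval` is illustrative.

```python
import math


def wald_interval(win_rate: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    if n <= 0 or not 0.0 <= win_rate <= 1.0:
        raise ValueError("need n > 0 and a win rate between 0 and 1")
    half_width = z * math.sqrt(win_rate * (1.0 - win_rate) / n)
    return max(0.0, win_rate - half_width), min(1.0, win_rate + half_width)


# 129 comes from the text. Treating it as 129 independent trials is an assumption.
N_SESSIONS = 129

for label, rate in [("Opus 4.5", 0.51), ("Mythos Preview", 0.64)]:
    low, high = wald_interval(rate, N_SESSIONS)
    print(f"{label}: {rate:.0%} (95% interval roughly {low:.0%} to {high:.0%})")
```

Under these assumptions the interval around 64% is several points wide on each side, which is one concrete reason to treat the figure as an early signal and not a precise measurement.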
figure 1: the gap between Mythos Preview and the practical ceiling, on a measure where 50% is the floor (random) and 80% is the ceiling (model with hindsight). source: anthropic institute, may 2026.
Anthropic frames this as "an early signal that AI systems are getting better at making the kinds of judgement calls that AI research depends on." That is the most carefully hedged sentence in the whole piece. It is also the most honest.
When that 16-point gap closes, the next Anthropic Institute piece will not need a "What is recursive self-improvement?" framing. It will be an after-action report.
Where the Bottleneck Is Right Now
In plain English: every time AI gets faster at one job, the slowest part of the work becomes the new bottleneck. The bottleneck has moved three times in five years.
figure 2: where each stage's bottleneck has moved, 2021–2027. source: anthropic institute, may 2026.
This is the section of the piece that invokes Amdahl's law. It is the right frame. Amdahl's law says that speeding up one part of a system yields only a bounded overall gain: the total pace is capped by the parts you did not accelerate. The piece's argument is that we are watching the bottleneck migrate from "code writing" to "code review" to "problem selection" to "taste."
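Amdahl's law is easy to check numerically. The sketch below uses the standard formula, overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of the work being accelerated and s is how much faster that fraction runs. The 60/40 split between code writing and everything else is a made-up illustration, not a number from the piece.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work runs `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Hypothetical split: 60% of a project is code writing, and the remaining
# 40% (review, problem selection, taste) is not accelerated at all.
WRITING_SHARE = 0.6

for factor in (2, 8, 100, 1_000_000):
    overall = amdahl_speedup(WRITING_SHARE, factor)
    print(f"writing {factor}x faster -> project {overall:.2f}x faster")
```

With this split the project can never run more than 1 / 0.4 = 2.5x faster, however fast code writing becomes. That ceiling is why attention shifts to whatever is left unaccelerated.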
Anthropic has already started hitting it. They have run a retrospective analysis showing that an automated Claude code review would have caught roughly a third of the bugs behind past incidents on claude.ai before they reached production. The engineers who wrote that code are described as "among the best in the world at building these systems."
The same bottleneck pattern is showing up outside engineering. The piece describes "an explosion of new ideas, initiatives, tools, and simulations" — "far more than we have the capacity to pursue." The rate at which an organization can spot and fix its bottlenecks may be the skill that determines who wins.
What the Outside World Looks Like
In plain English: three external benchmarks confirm the same exponential curve Anthropic is seeing from the inside.
- SWE-bench Verified (swebench.com). The leaderboard is currently topped by Claude 4.5 Opus (high reasoning) at 76.8% — followed by Gemini 3 Flash at 75.8%, Claude Opus 4.6 at 75.6%, and GPT-5.2 Codex at 72.8%. Two years ago the leaders were scoring in the low single digits. "Saturated" is the word the piece uses.
- CORE-Bench (core-bench.github.io). Tests whether a model can reproduce the results of a published paper. Went from 20% success in 2024 to "saturated" fifteen months later.
- METR task-completion horizon (metr.org). The benchmark Anthropic cites for the "4 minutes → 12 hours" curve. METR's own measurements of Mythos Preview go further: "at least" 16 hours, "at the upper end of what [METR] can measure without new tasks."
A footnote worth quoting in full. GitHub saw roughly one billion code commits in all of 2025. By mid-2026 it was seeing 275 million a week — on pace for roughly 14 billion over the year. The company's COO has said they are "pushing incredibly hard" on capacity just to keep up.
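The "on pace for roughly 14 billion" figure is a straight annualization of the weekly rate, which a few lines of Python make explicit. The only assumption added here is that the mid-2026 weekly rate holds flat for 52 weeks. The constants are the rounded numbers quoted above.

```python
COMMITS_2025 = 1_000_000_000           # "roughly one billion" across all of 2025
WEEKLY_COMMITS_MID_2026 = 275_000_000  # weekly rate quoted for mid-2026
WEEKS_PER_YEAR = 52                    # assumes the weekly rate stays flat all year

annualized = WEEKLY_COMMITS_MID_2026 * WEEKS_PER_YEAR
growth = annualized / COMMITS_2025

print(f"annualized: {annualized / 1e9:.1f} billion commits, {growth:.1f}x the 2025 total")
```

If the weekly rate is itself still climbing, a flat annualization like this one would understate the full-year total.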
If you want one chart to send to a non-technical friend, it is the METR time horizon chart. It is the cleanest "this is going faster than people think" visual in the field.
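One way to read the "4 minutes → 12 hours" curve is to count doublings. The sketch below does that with a base-2 logarithm. The elapsed time between the two measurements is not stated in this issue, so `ELAPSED_MONTHS` is a loudly hypothetical placeholder to replace with the real gap between the two model releases.

```python
import math


def doublings(start_minutes: float, end_minutes: float) -> float:
    """Number of times the horizon doubled between two measurements."""
    if start_minutes <= 0 or end_minutes <= 0:
        raise ValueError("horizons must be positive")
    return math.log2(end_minutes / start_minutes)


def doubling_time_months(start_minutes: float, end_minutes: float,
                         elapsed_months: float) -> float:
    """Average months per doubling over the elapsed period."""
    return elapsed_months / doublings(start_minutes, end_minutes)


START_MINUTES = 4       # Opus 3 horizon quoted in the table
END_MINUTES = 12 * 60   # Opus 4.6 horizon quoted in the table
ELAPSED_MONTHS = 24     # HYPOTHETICAL placeholder, not a figure from the piece

n = doublings(START_MINUTES, END_MINUTES)
pace = doubling_time_months(START_MINUTES, END_MINUTES, ELAPSED_MONTHS)
print(f"{n:.1f} doublings; about {pace:.1f} months per doubling "
      f"if {ELAPSED_MONTHS} months elapsed")
```

Going from 4 minutes to 12 hours is a 180x increase, or about seven and a half doublings. The months-per-doubling figure depends entirely on the elapsed time you plug in.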
The Community Reaction on Glasswing
In plain English: the Hacker News thread on Project Glasswing surfaced the strongest skepticism, the strongest optimism, and the strongest question about who actually gets access.
The Anthropic Institute piece references Project Glasswing, the public-facing part of the Mythos Preview system, which has found more than 10,000 high- and critical-severity software vulnerabilities in its first weeks. The Hacker News thread on Glasswing (HN #47679121, 1541 pts, 836 comments) surfaced the strongest reactions.
| Take | Source |
| --- | --- |
| "Another Anthropic PR release based on Anthropic's own research, uncorroborated by any outside source... these are becoming increasingly transparent attempts to inflate the perceived capability of their products." | endluss, top of thread |
| "Let's fast forward the clock. Does software security converge on a world with fewer vulnerabilities or more? I'm not sure it converges equally in all places." | jryio |
| "We better fucking hope open source wins, because we aren't getting access if it doesn't. So they are only giving access to their smartest model to corporations." | impulser_ |
| "How long would it take to turn a defensive mechanism into an offensive one?" | agrishin |
| "Mythos Preview has already found thousands of high-severity vulnerabilities, including some in every major operating system and web browser. Scary but also cool." | zachperkel |
The skeptics have a point that does not go away with internal data. As noted earlier, nearly all of the evidence is Anthropic measuring its own model in its own environment, and METR's evaluation is the only external corroboration. The 800-fixes / 1,000x-error-reduction story is a sample size of one. The 64% research-judgment win rate is a single-judge evaluation with a known set of failure modes.
The optimists have a point that also does not go away. Mythos Preview is being held back from general release. The piece says it has already found vulnerabilities "in every major operating system and web browser." The bottleneck in cyber defense has, on Anthropic's data, already shifted from finding vulnerabilities to patching them fast enough.
What This Means for Engineers
For junior engineers
The writing of code is becoming a junior activity in a way it was not before. The reviewing of code is becoming the senior activity. Plan accordingly.
For senior engineers
Be the bottleneck-detector, not the bottleneck. Anthropic's piece describes "an explosion of new ideas, initiatives, tools, and simulations — far more than we have the capacity to pursue." The person who can pick the right one to fund is the person who becomes more valuable as the work multiplies.
For hiring managers
The new interview question: show me a piece of code you reviewed and rejected, and tell me why. The number of "I wrote this" stories is going to fall. The number of "I caught this" stories is going to rise.
For founders
The most important number in the piece is not the 8x. It is the gap between Mythos Preview's 64% win rate on research judgment and the practical ceiling of ~80%. That 16-percentage-point gap is what your junior colleagues are about to fill. The product you ship in 2027 is the product that takes advantage of that gap closing.
if you share one number from this issue, share this one
"The 16pp gap is the future, not the past."
Anthropic's Mythos Preview picks the better research step 64% of the time. A model that could see the whole session would do it 80% of the time. The 16-point gap is the room left for the next model to grow into. When it closes, recursive self-improvement stops being a future and becomes an after-action report.
"Claude-written code was somewhat worse than human-written code at Anthropic in late 2025, is roughly at parity today, and we expect it to be strictly better within the year."
— Anthropic Institute, May 2026
Reader Challenge
- The piece claims 80% of code is Claude-authored. Anthropic leadership have said the true figure is 90%+. Do you trust the conservative number more or less than the headline one? What does the gap tell you about how the number was measured?
- METR measures task-completion horizon at 50% reliability. What is the failure mode of measuring capability at the threshold where the model is right half the time? What does the curve look like at 80% reliability?
- The 16-hour task-completion figure is "at the upper end of what METR can measure without new tasks." What new tasks should METR build before Mythos Preview's successor arrives?
Next Issue Preview
The OpenAI announcement from last week on inference economics — what the per-token cost collapse does to the build-vs-buy calculation for AI startups. Issue 48 ships Tuesday.
Sources: Anthropic Institute, When AI builds itself (May 2026). METR, Task-Completion Time Horizons of Frontier AI Models. SWE-bench Verified leaderboard, swebench.com. Hacker News, Project Glasswing thread #47679121.