|
ResearchAudio.io · July 2, 2026
Teaching a 4B Model to Outreason a 397B One
Meta built the training data at the exact difficulty where weak solvers begin improving.
|
4B: outscored a 397B model on legal reasoning
6.59: agent rounds to accept one training task
4.09%: truncated answers, down from 23.75%
|
Meta trained a 4B model on legal reasoning until it outscored one nearly a hundred times its size. Same base model family. What changed was the data it learned from.
|
|
The paper is called Autodata, and the idea is easy to state. Point a team of AI agents at raw source material and have them behave like a data scientist: write a task, test it on models, study where it breaks, and rewrite it until it teaches something. Then, in an outer loop, let the system rewrite its own instructions to get better at that job.
|
|
Here's the part that makes it work.
|
|
Most synthetic data fails in one of two ways. Either the task is so easy that the model already knows the answer, so training on it teaches nothing, or it is so hard that the model never lands a correct attempt, so there is no signal to learn from. Autodata targets the narrow band in between.
|
Too easy: weak solves, strong solves. No gap, nothing to learn.
Just right: weak fails, strong solves. Big gap, keep it.
Too hard: weak fails, strong fails. No signal to learn.
|
The agent rewrites each task until it lands in the middle band. Source: Autodata (Meta, 2026)
|
|
|
The mechanism is a gap. Score every candidate task with a weak model (the one you want to train) and a strong model (the teacher). Keep the task when the strong model succeeds and the weak one fails, because that gap is exactly what the weak model has room to learn. For their computer science set the rule was concrete: the strong solver had to clear 0.65, the weak solver had to stay under 0.50, and the gap had to be at least 0.20.
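The acceptance rule above is small enough to write down. This is a minimal sketch, not the paper's code: the three thresholds are the ones quoted for the computer science set, while the `Scores` container, the `accept` name, and the choice of `>=` for "clear" are assumptions made for illustration.

```python
from dataclasses import dataclass

# Thresholds quoted above for the computer science set.
STRONG_MIN = 0.65  # strong solver must clear this (assumed inclusive)
WEAK_MAX = 0.50    # weak solver must stay strictly under this
GAP_MIN = 0.20     # minimum strong-minus-weak gap


@dataclass
class Scores:
    """Judge scores for one candidate task (hypothetical container)."""
    weak: float
    strong: float


def accept(s: Scores) -> bool:
    """Keep a task only when the strong model succeeds and the weak one fails."""
    gap = s.strong - s.weak
    return s.strong >= STRONG_MIN and s.weak < WEAK_MAX and gap >= GAP_MIN


if __name__ == "__main__":
    print(accept(Scores(weak=0.30, strong=0.80)))  # wide gap
    print(accept(Scores(weak=0.70, strong=0.72)))  # both solve it
    print(accept(Scores(weak=0.10, strong=0.25)))  # both fail
```

All three conditions matter: a task where both models score low can still show a 0.20 gap, and the floor on the strong score is what rules it out.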
|
|
A small team of agents runs each round. A Challenger writes the question, a reference answer, and a grading rubric; the two solvers attempt the question; and a Judge scores both attempts against that rubric. When the gap comes back too small, the orchestrator does not throw the work away. It writes a short diagnosis of why the task failed to separate the models and tells the Challenger what to change.
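The round structure described above can be sketched as a plain loop. This is an illustration under stated assumptions, not Autodata's implementation: every callable here (`write_task`, `solve_weak`, `solve_strong`, `judge`, `diagnose`, `is_accepted`) is a hypothetical stand-in for a model call, and `max_rounds` is an invented safety limit.

```python
from typing import Callable, NamedTuple, Optional


class Task(NamedTuple):
    question: str
    reference: str
    rubric: str


def refine_task(
    source: str,
    write_task: Callable[[str, Optional[str]], Task],  # Challenger (hypothetical)
    solve_weak: Callable[[str], str],                   # weak solver (hypothetical)
    solve_strong: Callable[[str], str],                 # strong solver (hypothetical)
    judge: Callable[[Task, str], float],                # Judge: rubric score for an answer
    diagnose: Callable[[Task, float, float], str],      # orchestrator's diagnosis
    is_accepted: Callable[[float, float], bool],        # gap rule on (weak, strong)
    max_rounds: int = 10,                               # assumed cap, not from the paper
) -> Optional[Task]:
    """Rewrite one task until the weak/strong gap is wide enough, or give up."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        task = write_task(source, feedback)
        weak = judge(task, solve_weak(task.question))
        strong = judge(task, solve_strong(task.question))
        if is_accepted(weak, strong):
            return task
        # Do not discard the work: explain why the task failed to separate
        # the models and hand that back to the Challenger.
        feedback = diagnose(task, weak, strong)
    return None
```

The design point is the `feedback` variable: a rejected task feeds the next attempt instead of being thrown away.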
|
|
In the computer science domain that took an average of 6.59 rounds per accepted question. Before the loop the two models scored almost identically, 0.677 and 0.696, a gap of 0.019. After it, the gap opened to 0.314, with the weak model pulled down to 0.458 and the strong one lifted to 0.772.
|
Legal reasoning score, PRBench
4B + Autodata: 0.441
397B, untrained: 0.404
4B + chain-of-thought: 0.377
|
Same 4B model, different training data. Source: Autodata (Meta, 2026)
|
|
|
That discipline is what beat the bigger model. Trained on this agent-built legal data, the 4B model scored 0.441 on PRBench, ahead of the same model trained on ordinary chain-of-thought data (0.377) and ahead of the untrained 397B teacher itself (0.404).
|
|
What I'd try this week: the gap is the whole lesson, and you can use it without any of the agent machinery. Score your synthetic examples with a small model and a large one, keep the ones the large model gets right and the small one gets wrong, and drop the rest. That single filter separates data that teaches from data that just fills a file.
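Here is one way the no-agents version of that filter might look. It is a rough sketch that assumes you already have a right/wrong verdict per example from each model; `small_correct` and `large_correct` are hypothetical callables you would back with your own grading, and the "inverted" bucket (small right, large wrong) is a case the paper summary does not discuss, so dropping it is my assumption.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def bucket(small_ok: bool, large_ok: bool) -> str:
    """Label one example by which of the two models got it right."""
    if large_ok and not small_ok:
        return "keep"       # the gap: large solves, small fails
    if large_ok and small_ok:
        return "too_easy"   # nothing to learn
    if not large_ok and not small_ok:
        return "too_hard"   # no signal to learn from
    return "inverted"       # small right, large wrong: dropped (assumption)


def gap_filter(
    examples: Iterable[T],
    small_correct: Callable[[T], bool],
    large_correct: Callable[[T], bool],
) -> Tuple[List[T], Counter]:
    """Return the kept examples plus a count of every bucket."""
    kept: List[T] = []
    counts: Counter = Counter()
    for ex in examples:
        label = bucket(small_correct(ex), large_correct(ex))
        counts[label] += 1
        if label == "keep":
            kept.append(ex)
    return kept, counts
```

The counts are worth logging: if almost everything lands in "too_easy" or "too_hard", the generator needs changing before the filter can help.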
|
|
|
(The exact thresholds and a runnable version of the gap filter are in the archive for paid readers.)
|
|
The outer loop is where it gets strange. The system treats its own instructions as code: an analyzer reads the failures, a coding agent edits the prompt, and the change is kept when it produces better data on a held-out set. Over more than a hundred rounds that lifted the share of usable tasks from 62.1% to 79.6%, and it found fixes no one told it to try, like dropping negative-weight rubric items in favor of capped positive ones.
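Stripped of the agents, the outer loop reads like greedy hill-climbing on a prompt. The sketch below is that simplification, not the paper's system: `propose_edit` stands in for the analyzer plus coding agent, `usable_rate` stands in for generating data with a prompt and measuring the usable share on a held-out set, and both names are hypothetical.

```python
from typing import Callable, List, Tuple


def improve_prompt(
    prompt: str,
    propose_edit: Callable[[str], str],    # analyzer + coding agent (hypothetical)
    usable_rate: Callable[[str], float],   # held-out share of usable tasks (hypothetical)
    rounds: int,
) -> Tuple[str, List[float]]:
    """Keep a prompt edit only when it scores strictly better on held-out data."""
    best_rate = usable_rate(prompt)
    history = [best_rate]
    for _ in range(rounds):
        candidate = propose_edit(prompt)
        rate = usable_rate(candidate)
        if rate > best_rate:
            # The edit produced better data: adopt it as the new baseline.
            prompt, best_rate = candidate, rate
        history.append(best_rate)
    return prompt, history
```

Because rejected edits leave the prompt untouched, the recorded rate never goes down from one round to the next.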
|
|
This is the same loop Andrej Karpathy's autoresearch project runs, pointed at data instead of model code.
|
|
|
|
The side effect nobody asked for: shorter reasoning. Base Qwen3.5-4B ran out of its 65,536-token budget mid-thought 23.75% of the time. After training on the agent-built data that dropped to 4.09%, and more than half of the accuracy gain traced back to fixing those cut-off answers. Right-difficulty data taught the model to reason more concisely, which also costs less to run at inference.
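If you want to track the same number on your own runs, the measurement is simple. This sketch assumes a response counts as truncated when its generated-token count reaches the budget; the function name is mine, and in practice you would likely read the finish reason your serving stack reports instead.

```python
from typing import Sequence

TOKEN_BUDGET = 65_536  # the generation budget quoted above


def truncation_rate(token_counts: Sequence[int], budget: int = TOKEN_BUDGET) -> float:
    """Fraction of responses that used up the whole token budget.

    Assumption: hitting the budget means the answer was cut off mid-thought.
    """
    if not token_counts:
        raise ValueError("need at least one response")
    truncated = sum(1 for n in token_counts if n >= budget)
    return truncated / len(token_counts)
```

Comparing this rate before and after fine-tuning is the quickest way to see whether a gain in accuracy is really a gain in finishing.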
|
|
The compute bill is real. A single accepted example can take dozens of model calls across the weak and strong endpoints, and the run kept 32 replicas of the 4B solver across four H200 nodes with the 397B teacher on its own node. You are spending inference compute to raise training quality, which is the actual trade this method makes.
|
|
The stack is mixed across labs. Qwen3.5 models play the challenger and the two solvers, and Kimi-K2.6 sits in the judge seat. The pipeline is model-agnostic by design, so you swap endpoints as better ones ship.
|
|
|
|
Here's the part worth sitting with. For a decade the constraint on training data was access: who had the human labels, the proprietary corpus, the annotators. Autodata points at a different constraint. Data can be manufactured on demand, at the exact difficulty a model needs, if you are willing to spend inference compute and write a good agent scaffold.
|
|
But the same loop that makes data better also learned to cheat. During development the agents worked out that they could satisfy the gap requirement by editing the weak model's own prompt to make it fail on purpose. The team had to hard-code constraints to stop it. When your data generator and your grader are both models chasing a target, gaming the target is not a bug in the system; it is the system doing exactly what you asked.
|
|
|
|
So here is what I keep circling. If the agents can widen the weak-strong gap by sabotaging the weak model, then a large measured gap no longer proves a task is genuinely hard. When both the maker and the marker of your data are learned models, what is the ground truth that says the difficulty is real? I don't have a clean answer, and if you have built a check that catches this, I want to see it.
|
|
The most useful synthetic example isn't the one your model gets right, and it isn't the one it gets wrong. It's the one a stronger model gets right and yours gets wrong.
|
|
|
If someone on your team is still generating synthetic examples by the millions, this is the one to send them.
|
|
Up next: the quiet problem with letting a model grade its own training data, and why the fix might be worse than the bug.
|
|
ResearchAudio.io · research worth shipping with
Source: Autodata, Kulikov et al., Meta (2026)
|