researchaudio.io
How a Tokenizer Mismatch Collapsed GSM8k to 2.56
Nvidia traced it through 1,100 numerals, then built a projection-matrix fix.
|
|
tl;dr
• The current best cross-tokenizer distillation method scored below using no teacher, because a tokenizer split silently suppressed multi-digit numbers.
• Nvidia's fix, a projection matrix built from tokenizer strings, recovered grade-school math sixfold and edged past same-family distillation.
• Plus, the past few days in AI: Anthropic near a 965 billion dollar valuation, the compute bill paying for it, and Microsoft Build going all in on agents.
|
|
|
2.56: GSM8k under Gold
15.54: GSM8k under X-Token
+3.82: average gain over Gold
|
|
|
Distilling a strong teacher into a smaller model usually helps. Here is a case where the current best method for doing it across different tokenizers made the student worse than using no teacher at all.
|
|
Knowledge distillation trains a small student to copy a large teacher's full next-token probabilities, not just its answers. The catch is that it normally needs both models to share a tokenizer, which locks you into one model family. Cross-tokenizer distillation removes that lock, so a Llama student can learn from a Qwen or Phi teacher, or from several at once.
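To make "copy the full next-token probabilities" concrete, here is a minimal sketch of a distillation loss for a single position. It assumes a forward KL divergence, which is a common choice but not confirmed as the one used in the paper, and the logits are made-up toy values. Note that the loss compares the two distributions index by index, which is why both models normally have to share a vocabulary.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits):
    """Forward KL(teacher || student) for one next-token distribution.

    Assumption: both logit lists index the SAME vocabulary, entry by entry.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy three-token vocabulary (hypothetical values).
teacher = [2.0, 1.0, 0.1]
close_student = [1.9, 1.1, 0.0]   # nearly agrees with the teacher
far_student = [0.0, 0.0, 3.0]     # bets on a token the teacher dislikes

print(kd_loss(teacher, close_student))
print(kd_loss(teacher, far_student))
```

The second loss comes out much larger than the first, which is the signal that pulls the student toward the teacher's whole distribution rather than only its top answer.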
|
|
Two methods dominate that setting. The prior state of the art, called Gold, splits tokens into a matched set trained one way and an unmatched remainder trained another. That split is where things break.
|
student: Llama writes 201 as one token → teacher: Qwen splits it into 2 / 0 / 1 → suppressed: stranded in the unmatched set
|
GSM8k accuracy, Llama-3.2-1B student, 3-shot
| Method | GSM8k accuracy |
| X-Token (P-KL) | 15.54 |
| Same tokenizer | 12.89 |
| No teacher | 10.25 |
| Gold (cross tokenizer) | 2.56 |
Source: X-Token, Nvidia (2026)
|
|
The reason is a quiet gradient effect, not a tuning problem. Llama packs most two- and three-digit numbers into single tokens, while Qwen splits them digit by digit. So every one of Llama's 1,100 multi-digit number tokens falls into the unmatched set under a Qwen teacher.
|
|
Gold trains that unmatched set by rank, pairing a student token with whatever teacher token sits at the same rank, which is often unrelated. Worse, the matched-set loss runs through the full softmax, so it pushes down the probability of every unmatched token at the same time. The paper proves this gradient is non-negative on each suppressed token, regardless of the correct answer.
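The suppression effect is easy to reproduce on a toy example. The sketch below is not Gold's actual loss. It assumes a plain cross-entropy restricted to the matched tokens but normalized by a softmax over the full student vocabulary, and it measures the gradient numerically. The vocabulary, logits, and teacher probabilities are all hypothetical.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matched_loss(logits, teacher_probs, matched):
    """Cross-entropy on matched tokens only, with a full-vocabulary softmax."""
    q = softmax(logits)
    return -sum(teacher_probs[i] * math.log(q[i]) for i in matched)

def numeric_grad(logits, teacher_probs, matched, eps=1e-6):
    """Central finite-difference gradient of the loss w.r.t. each logit."""
    grads = []
    for j in range(len(logits)):
        up, down = list(logits), list(logits)
        up[j] += eps
        down[j] -= eps
        diff = matched_loss(up, teacher_probs, matched) - matched_loss(down, teacher_probs, matched)
        grads.append(diff / (2 * eps))
    return grads

# Indices 0-2 are matched tokens; index 3 stands in for an unmatched
# number token such as "201" that the student currently prefers.
logits = [0.5, 0.2, -0.1, 3.0]
teacher_probs = {0: 0.6, 1: 0.3, 2: 0.1}
matched = [0, 1, 2]

grads = numeric_grad(logits, teacher_probs, matched)
print(grads[3])  # positive, so gradient descent lowers this logit
```

Under these assumptions the gradient on the unmatched logit equals the teacher mass on the matched set times the student's own probability for that token. It can never be negative, so every step pushes the unmatched token down, whether or not it was the right answer.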
|
|
The takeaway for anyone distilling across tokenizer families: check which token categories land in the unmatched set before you train. Numbers, punctuation, and non-Latin scripts are the usual suspects.
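A rough version of that check fits in a few lines. This sketch assumes the matched set is decided by exact string equality between vocabulary entries, which may not be how a given method defines it, and it runs on toy vocabularies. Real vocabularies would first need their space markers and byte encodings normalized. The `categorize` and `audit_unmatched` helpers are hypothetical names.

```python
import unicodedata
from collections import Counter

def categorize(token):
    """Bucket a token string into a coarse category."""
    text = token.strip()
    if not text:
        return "whitespace"
    if text.isascii() and text.isdigit():
        return "multi-digit number" if len(text) > 1 else "single digit"
    if all(unicodedata.category(ch).startswith("P") for ch in text):
        return "punctuation"
    if any(ch.isalpha() and "LATIN" not in unicodedata.name(ch, "") for ch in text):
        return "non-Latin script"
    return "other"

def audit_unmatched(student_vocab, teacher_vocab):
    """Count, per category, the student tokens with no identical teacher token."""
    teacher = set(teacher_vocab)
    unmatched = [tok for tok in student_vocab if tok not in teacher]
    return Counter(categorize(tok) for tok in unmatched)

# Toy vocabularies (hypothetical): the student has whole-number tokens,
# the teacher only has single digits.
student_vocab = ["the", " cat", "2", "0", "1", "201", "42", "...", "мир"]
teacher_vocab = ["the", " cat", "2", "0", "1", ".", "4"]

report = audit_unmatched(student_vocab, teacher_vocab)
for category, count in report.most_common():
    print(category, count)
```

If one category dominates the report and it matters for your benchmark, that is the cue to look harder at how the unmatched set gets trained.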
|
|
The key point: the teacher was strong and the data was fine. A tokenizer split alone dragged the student's grade-school math below the no-teacher baseline.
|
|
|
The fix is a projection matrix built from the tokenizer strings before training, described in the full paper. It maps each student token onto the teacher tokens it actually corresponds to, so 201 connects to 2, 0, and 1 instead of being stranded. One version drops the split entirely (P-KL), and another keeps it but relaxes the matching (H-KL).
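To show the idea of building the mapping from strings alone, here is a minimal sketch. It is not the paper's construction. It assumes a greedy longest-match split of each student token into teacher tokens and spreads the weight uniformly across the pieces, and it stores the matrix sparsely as a dict of rows. The `segment` and `build_projection` helpers and the toy vocabularies are hypothetical.

```python
def segment(text, teacher_vocab):
    """Greedy longest-match split of text into teacher tokens, or None if stuck."""
    pieces, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in teacher_vocab:
                pieces.append(text[start:end])
                start = end
                break
        else:
            return None  # no teacher token covers this position
    return pieces

def build_projection(student_vocab, teacher_vocab):
    """Sparse rows: student token id -> {teacher token id: weight}."""
    teacher_index = {tok: i for i, tok in enumerate(teacher_vocab)}
    teacher_set = set(teacher_vocab)
    projection = {}
    for s_id, tok in enumerate(student_vocab):
        pieces = segment(tok, teacher_set)
        if not pieces:
            continue  # leave tokens with no string-level cover unmapped
        weight = 1.0 / len(pieces)  # assumption: uniform weight per piece
        row = {}
        for piece in pieces:
            t_id = teacher_index[piece]
            row[t_id] = row.get(t_id, 0.0) + weight
        projection[s_id] = row
    return projection

student_vocab = ["the", "201", "11"]
teacher_vocab = ["the", "2", "0", "1"]

projection = build_projection(student_vocab, teacher_vocab)
print(projection[1])  # row for "201" touches the teacher ids of "2", "0", "1"
```

Because the rows come from the vocabulary strings, the whole matrix can be computed once before training starts, which matches the description above.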
|
|
With the Qwen teacher, that recovered GSM8k from 2.56 to 15.54, past the 12.89 from same-tokenizer distillation, and lifted the five-benchmark average 3.82 points over Gold.
|
|
|
|
Pick the loss by coverage, not by default. When the critical tokens land in the unmatched set, dropping the split wins. When they stay matched, keeping the split wins. Use the wrong one and the ranking flips: P-KL leads by 3.55 points on the Qwen teacher, H-KL leads by 1.68 on the Phi teacher.
|
|
Two teachers help, but they have to differ. A math teacher plus a commonsense teacher beat the best single teacher by 1.3 points. Two reasoning teachers together scored below the best single one, because they overlap. Plain static weighting also beat every adaptive scheme they tried.
|
|
The alignment step has a sharp edge. When one tokenizer adds a beginning-of-sequence marker and the other does not, a common substring-matching aligner dumps the whole sequence into one mismatched bucket. Their dynamic-programming aligner handles it with a single gap.
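Here is why dynamic programming copes with the extra marker. The sketch below is a generic global alignment in the Needleman-Wunsch style, not the paper's aligner, and it only pairs tokens whose strings are identical, so it ignores the harder case where the two tokenizers split a word differently. The scoring values and the `align` helper are assumptions for illustration.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists; returns (index_in_a, index_in_b) pairs.

    None on either side marks a gap.
    """
    n, m = len(a), len(b)

    def pair_score(i, j):
        return match if a[i] == b[j] else mismatch

    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i - 1, j - 1),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Walk back from the corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i - 1, j - 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs

with_bos = ["<s>", "The", " cat", " sat"]   # tokenizer that adds a marker
without_bos = ["The", " cat", " sat"]       # tokenizer that does not

pairs = align(with_bos, without_bos)
print(pairs)
```

The marker ends up paired with a single gap and every other token lines up one to one, instead of the whole sequence being thrown into one mismatched bucket.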
|
|
I put a runnable token-coverage audit, the check that tells you which categories fall into the unmatched set, in the archive for members.
|
|
|
|
Here is the part nobody is calling out. The headline is a new best result for cross-tokenizer distillation, but the real finding is a debugging lesson. On the Phi teacher, where the tokenizers mostly agree, the gain over Gold is 0.5 points, a rounding error.
|
|
The whole story lives in the Qwen case, where a tokenizer split quietly broke a capability and a string-built matrix brought it back. Treat tokenizer mismatch as a first-class failure surface, the same way you treat a bad learning rate or a data leak.
|
|
The current best method for cross-tokenizer distillation scored below using no teacher at all, because a tokenizer split quietly suppressed every multi-digit number in the training signal.
|
|
|
|
|
|
|
Anthropic moved past OpenAI in value. It raised roughly 65 billion dollars at a valuation near 965 billion, which makes it the most valuable private AI company and likely marks its last private round before the public markets, per TechCrunch.
|
|
The compute bill behind it is staggering. To lock capacity, Anthropic arranged a roughly 45 billion dollar compute agreement tied to SpaceX and a 1.8 billion dollar agreement with Akamai, and is sourcing chips and cloud from Google, per recent reporting. Capacity, not ideas, is the binding constraint now, which is exactly why squeezing more out of smaller models matters.
|
|
Microsoft Build opens June 2, all in on agents. The confirmed themes are the Windows Agent Framework, GitHub Copilot agent mode, the Azure model platform, and tooling to govern token spend across providers, per Notebookcheck.
|
|
|
|
We caught this because grade-school math has a clean, measurable score. How many quieter capability losses are riding along in models distilled across mismatched vocabularies, in skills we never benchmark token-category by token-category?
|
|
The lesson generalizes. When you mix models that were never trained to agree, check the seams before you trust the score.
|
|
Next week: the byte-level approach that tries to remove the tokenizer problem outright, and the trade-off that rides along with it.
|
|
Source: X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation, Nvidia (2026). Read it at arxiv.org/abs/2605.21699