|
researchaudio.io · Issue 14 · 2026-06-30
|
|
Headline comparison
v1: 40% mean word accuracy. v2: 61% mean word accuracy.
The progress is real. The product is not.
|
|
Meta’s Brain2Qwerty v2: more data, real-time, and still a long way from a patient
|
|
The headline “non-invasive brain-to-text” sounds patient-ready. The numbers are not.
|
|
Brain2Qwerty v2 is the second release from the same Meta / BCBL team behind v1 (the v1 paper just landed in Nature Neuroscience). v2 collects 10× more data per participant, runs in real time, and uses a three-module hierarchical decoder (Conformer encoder → word-level Aligner → LLM sentence rewriter). On those merits alone, it is a genuine engineering step.
But the framing matters. The blog and project page lead with “10× more data” and “78% best-participant word accuracy.” They do not lead with the 39% word error rate. They do not lead with the 9 healthy volunteers, the fridge-sized MEG scanner, the EEG results that were dropped from the v2 narrative, or the missing patient cohort.
That’s the story.
|
|
Section
What Meta says
|
|
The official line: “Brain2Qwerty v2 decodes complete and meaningful sentences solely from MEG signals of healthy volunteers, and reaches up to 78% word accuracy for the best participant” (project page, last-modified 2026-06-30). The blog leans on three claims: 10× more training data per person, real-time (online) inference with no keypress timing required, and a hierarchical architecture that decodes letters, words, and sentences jointly.
All three are true. None of them are the part the patient cares about.
|
|
Section
What the numbers actually show
|
| Metric | v1 | v2 |
| Typed sentences per participant | ~2,200 | ~22,000 (10×) |
| Healthy volunteers | 35 | 9 |
| Word error rate (mean) | ~60% | 39% |
| Word accuracy (mean) | ~40% | 61% |
| Best-participant word accuracy | 48% | 78% |
| Real-time inference | No (needs keypress timing) | Yes |
| Patient cohort | 0 | 0 |
|
|
Two things to notice.
First, v2 is genuinely better on the headline accuracy metric, both at the mean (61% vs 40% word accuracy) and at the best-participant ceiling (78% vs 48%). The 10× data increase appears to have moved the needle.
Second, the ceiling is still 78% for the best person, with 39% mean WER across only 9 healthy adults, all sitting inside a MEG machine. The bar for “useful” assistive communication is not “decodes three out of four words.” It is “decodes most of them, reliably, on someone who cannot type.” We are not there.
Word accuracy (mean, v1 vs v2; each block is about 5 points)
v1 [████████            ] 40%
v2 [████████████        ] 61%
Best-participant word accuracy
v1 [██████████          ] 48%
v2 [████████████████    ] 78%
The bar a patient actually needs (rough)
   [███████████████████ ] ~95%+
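A rough way to see why per-word accuracy understates the gap is to ask how often a whole sentence comes out right. The sketch below assumes, unrealistically, that word errors are independent, and that word accuracy equals 1 − WER, as the table above treats it. The 10-word sentence length and the function name are illustrative choices, not figures from the paper.

```python
def sentence_exact_rate(word_accuracy: float, n_words: int = 10) -> float:
    """Chance that every word in an n-word sentence is right.

    Assumes independent word errors, which real decoders do not have;
    treat the result as an order-of-magnitude illustration only.
    """
    return word_accuracy ** n_words


# Word accuracies quoted in this issue; 0.95 is the rough target bar.
levels = {"v1 mean": 0.40, "v2 mean": 0.61, "v2 best": 0.78, "rough target": 0.95}

for label, acc in levels.items():
    wer = 1.0 - acc  # word error rate implied by the accuracy figure
    rate = sentence_exact_rate(acc)
    print(f"{label:>12}: WER {wer:.0%}, 10-word sentence fully correct {rate:.1%}")
```

Under that assumption the v2 mean leaves under 1% of 10-word sentences fully correct, the best participant around 8%, and even 95% word accuracy only around 60%.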
|
|
|
|
Section
How it actually works
|
|
In plain English: someone wears a helmet full of magnetic sensors, types a sentence they just heard, and a model guesses the words from the shape of the brain signal alone, while they type, in real time.
The deeper version, in three parts:
- Conformer encoder reads the MEG signal and outputs character probabilities over time. This is the v1-style “what finger just moved” signal, now produced without keypress timestamps.
- Word-level Aligner clusters those characters into word embeddings, so the model reasons at the word level, not just the letter level.
- Character-level language model (a fine-tuned LLM) rewrites the noisy character stream into a clean English sentence. This is why the model can output “the robot moves very fast” when the raw character stream was something like “tha robut mooves vary fast.” The LLM is doing the cleanup, the same way your phone’s keyboard fixes typos.
The new piece in v2 is that the pipeline runs asynchronously: it does not need to know when a key was pressed. It watches a continuous MEG recording and emits characters as they appear. That is the engineering contribution. The accuracy bump is a data contribution.
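To make "emits characters from a continuous recording" concrete, here is a minimal sketch of greedy CTC decoding, the standard way to turn per-frame character probabilities into text without knowing where each character starts. This issue describes the character stream as CTC output; whether the paper decodes it greedily is an assumption here, and the toy alphabet, `BLANK` symbol, and `frame` helper are hypothetical.

```python
from typing import List, Sequence

BLANK = "_"  # hypothetical symbol for the CTC "no character" class
ALPHABET = [BLANK, " ", "a", "e", "h", "t"]  # toy alphabet for illustration


def ctc_greedy_decode(frame_probs: Sequence[Sequence[float]]) -> str:
    """Pick the most likely class per frame, merge repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars: List[str] = []
    prev = None
    for idx in best:
        # A class is emitted only when it differs from the previous frame
        # and is not the blank class.
        if idx != prev and ALPHABET[idx] != BLANK:
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)


def frame(ch: str, p: float = 0.9) -> List[float]:
    """Toy probability vector that puts mass p on one character."""
    rest = (1.0 - p) / (len(ALPHABET) - 1)
    return [p if c == ch else rest for c in ALPHABET]


# Seven frames, no timing labels: "t" and "e" each span two frames.
print(ctc_greedy_decode([frame(c) for c in "tt_hee_"]))
```

The blank class is what lets a decoder separate a held character from a repeated one: frames `t`, `_`, `t` decode to two letters, while `t`, `t` decode to one.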
Two honest caveats inside the paper itself:
- The full v2 pipeline is worse at character-level decoding than its Encoder alone (CER 0.31 vs 0.28). The LLM is producing semantically clean sentences by smoothing out hard character-level errors, which means the model is editing its own output. That is impressive, and it is also the part of the result that does not generalize.
- For the worst subject, v2’s output is a coherent but entirely different sentence. The paper says so plainly. The decoder can produce a sentence that reads fluently but is not the sentence the person typed. For a communication aid, that is a category of error you do not want.
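The two caveats are easier to reason about with the metrics written out. The sketch below computes character error rate (CER) and word error rate (WER) as edit distance divided by reference length, which is the usual definition; the paper's exact normalization is not confirmed here. The raw stream is the example garble from this issue, and the "fluent but different" sentence is a made-up illustration, not paper output.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits per reference character."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits per reference word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


typed = "the robot moves very fast"
raw_stream = "tha robut mooves vary fast"        # example garble from this issue
fluent_but_different = "the robot moved too fast"  # hypothetical rewriter output

for name, hyp in [("raw stream", raw_stream), ("fluent rewrite", fluent_but_different)]:
    print(f"{name:>14}: CER {cer(typed, hyp):.2f}, WER {wer(typed, hyp):.2f}")
```

On this toy input the raw stream scores CER 0.16 and WER 0.80, while the hypothetical fluent rewrite scores CER 0.20 and WER 0.40: fewer wrong words, more wrong characters, and a sentence nobody typed. That is the same direction as the paper's own caveat.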
|
|
|
|
Section
Where it works / where it collapses
|
|
Where it works
- Decodes full English sentences from non-invasive MEG, which v1 could not do in real time.
- Scales with data: the paper reports a log-linear improvement in accuracy as training data grows, with no plateau yet.
- Code is open (458 stars, 57 forks as of 2026-06-30). Anyone can re-run the v1 pipeline.
- For the best participant, 78% word accuracy on a real-time MEG stream is a real result.
|
|
Where it collapses
- 9 healthy volunteers. None of them have a condition that prevents them from typing. The clinical population is unstudied.
- MEG is a superconducting helmet the size of a small fridge, kept at liquid-helium temperatures, available in maybe a hundred research centers worldwide. “Non-invasive” describes the lack of surgery, not the size of the device.
- 39% mean WER means roughly 4 of every 10 words are wrong, on a good day, averaged over healthy subjects, in a research setting.
- The model can output the wrong sentence confidently. Worst-subject case: “coherent but entirely different sentence” (the paper’s own language).
- v2 dropped EEG from the main narrative. v1 had EEG at 67% CER. Wearable EEG is the actual mass-market non-invasive sensor. v2 is not on that sensor.
|
|
“For the best subject, the model produces either perfect or near-perfect decoding. For the worst subject, the output can be a coherent but entirely different sentence.”
From the Brain2Qwerty v2 paper
|
|
|
Section
Qualitative failures
No community signal worth reporting this week (HN thread: 10 points, 0 comments). Failure modes below are from the paper itself.
|
- Subject collapse. 9 healthy adults, 22,000 sentences each, 10 hours of recording. The paper’s “best participant” is one of nine. There is no result for a person with motor impairment.
- Modality collapse. v2 is MEG-only. v1 ran on EEG too, with a 67% CER. EEG is the only sensor that fits the “non-invasive, accessible” pitch. v2 abandons it.
- Error-shape failure. The LLM cleanup can produce fluent, plausible sentences that swap in words the person did not type. For casual reading that is fine. For a person trying to say “call my daughter, not my son,” it is not.
- Hardware failure. The MEG scanner is a clinical-grade installation. “Non-invasive” describes the absence of surgery, not the absence of friction. Most patients will never sit in one.
- Generalization failure. All v2 sentences are English, from a corpus participants heard and then typed. The model decodes that task. It does not decode free composition.
|
|
Section
What this means for
|
|
Junior engineer
The interesting part is the LLM cleaning up a noisy CTC stream. That pattern (encoder → LM rewriter) generalizes to speech, handwriting, sign-language gloves, anything with a noisy character stream and a high-level language model available. Worth re-implementing on your own noisy-signal dataset to see the lift.
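As a starting point for that re-implementation, here is a deliberately tiny stand-in for the rewriter stage: instead of an LLM, it snaps each noisy token to the closest word in a closed vocabulary using the standard library's `difflib`. The vocabulary, the `rewrite` name, and the 0.6 cutoff are all hypothetical; the paper's rewriter is a fine-tuned language model, not this.

```python
import difflib
from typing import List, Sequence

# Hypothetical closed vocabulary standing in for a language model's knowledge.
VOCAB: List[str] = ["the", "then", "robot", "rabbit", "moves", "very", "fast", "slow"]


def rewrite(noisy: str, vocab: Sequence[str] = VOCAB, cutoff: float = 0.6) -> str:
    """Snap each noisy token to its closest vocabulary word, if any is close."""
    out = []
    for token in noisy.split():
        match = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
        out.append(match[0] if match else token)  # keep the token when unsure
    return " ".join(out)


print(rewrite("tha robut mooves vary fast"))  # example garble from this issue
```

Even this toy shows both sides of the pattern: it repairs the garbled example, and it will just as readily turn a badly decoded token into a different real word, which is the error shape flagged under qualitative failures.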
|
|
|
Senior engineer
The “log-linear improvement with no plateau” claim is the one to watch. If it holds at 100k sentences, the gap to invasive BCIs gets uncomfortable for the implant folks. The right next move is replication, not architecture: a second lab, a second language, a second scanner.
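For a sense of what "holds at 100k sentences" would imply, here is a back-of-envelope sketch of a log-linear extrapolation. It is not the paper's scaling fit: it assumes the v1 and v2 mean word accuracies from this issue's table can stand in for two points on one curve, even though the cohorts differ (35 vs 9 volunteers). All function names are illustrative.

```python
import math
from typing import Tuple


def fit_log_linear(n1: float, acc1: float, n2: float, acc2: float) -> Tuple[float, float]:
    """Return (intercept, slope) for acc = intercept + slope * log10(n)."""
    slope = (acc2 - acc1) / (math.log10(n2) - math.log10(n1))
    intercept = acc1 - slope * math.log10(n1)
    return intercept, slope


def predict(intercept: float, slope: float, n: float) -> float:
    """Accuracy the fitted line gives at n sentences per participant."""
    return intercept + slope * math.log10(n)


def sentences_needed(intercept: float, slope: float, target_acc: float) -> float:
    """Sentences per participant at which the line reaches target_acc."""
    return 10 ** ((target_acc - intercept) / slope)


# Assumption: two points taken from this issue's table, not from the paper's curve.
a, b = fit_log_linear(2_200, 0.40, 22_000, 0.61)

print(f"slope: {b:.2f} accuracy per 10x data")
print(f"projected at 100k sentences: {predict(a, b, 100_000):.0%}")
print(f"sentences for 95%: {sentences_needed(a, b, 0.95):,.0f}")
```

With these two points the line gives roughly 75% at 100k sentences and needs on the order of 900k sentences per participant to reach 95%. A straight line in log space must bend before 100%, so this is a reason to want the replication data, not a forecast.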
|
|
|
Hiring manager
The skill profile this project actually rewards: signal processing fundamentals (the Conformer is doing real work), CTC training (a quietly rare skill), LLM fine-tuning for sequence cleanup, and the patience to collect 10 hours of brain data per person. That is not a job description you will find on a job board. You will find the people at BCBL, Meta FAIR, and a handful of academic MEG labs.
|
|
|
Founder
Do not build a startup on this. The hardware will not be wearable in 2026. The patient data does not exist. The most defensible move is to build the LM-cleanup layer for adjacent noisy-signal products (silent-speech interfaces, electromyography keyboards) and wait for the MEG hardware to catch up.
|
|
|
Section
The metric that actually matters
|
The number Meta leads with: 78%
The number Meta buries: 39% mean WER
The number that has to hit for clinical use: ~95% word accuracy at 60+ wpm, on patients
The number of patients tested: 0
The v1 → v2 jump is real. The framing is also real. v2 is a research artifact, not a communication aid, and the gap between those two things is the only number worth tracking.
|
|
If you share one thing
9 healthy volunteers, 39% mean WER, 0 patients tested. That is the gap between “non-invasive brain-to-text” and a product.
|
|
|
Closing
The progress is real. The product is not. Don’t confuse the curve for the destination.
|
|
Reader challenge
Three questions to sit with this week
|
- If the LLM rewriter is doing a large share of the decoding work, what does “Brain2Qwerty decoded a sentence” even mean?
- What is the smallest wearable MEG unit you would trust to type a sentence to your spouse?
- Would you publish a 78% accuracy number on a clinical communication aid?
|
|
Next issue
What the Nature Neuroscience acceptance of v1 actually implies for non-invasive BCIs, and why the regulatory path is the harder half of the story.
|
|
-- researchaudio.io
|