ResearchAudio.io

Meta Built a Digital Twin of Your Brain. It's More Accurate Than the Real Thing.

TRIBE v2 predicts your neural response to any sight or sound, no scanner needed.

70x finer resolution than prior SOTA | 700+ brains in the training pool | 29K brain locations predicted per frame
I read this result three times because I thought I was misunderstanding the claim.
Meta's FAIR team built a model called TRIBE v2. You show it a movie clip. You tell it to predict how a person's brain will respond. The catch: the model has never seen this movie, and it has never scanned this person. And its prediction of the group-average brain response is 2x more correlated with reality than the actual fMRI recording from any individual in the group.
The AI's guess about your brain is more reliable than your own brain scan. That's the paper's headline finding, and I spent the last few hours pulling it apart.

Why This Exists

If you want to know how the brain processes faces, you recruit 30 undergrads, show them face photos in an MRI tube, and publish a paper about the fusiform gyrus. That's neuroscience in 2025. Small datasets, narrow questions, and months of scanner time for each study.
Nobody has a model that handles vision, audio, and language together, mapping all of them to the whole brain at once. TRIBE v2 is the first serious attempt.

Three Frozen Giants, One Shared Brain

The architecture is elegant. TRIBE v2 takes three existing AI models, freezes them completely, and uses their internal representations as a lens into human neural activity.

How TRIBE v2 sees, hears, and reads: three frozen encoders feed one shared transformer.

- LLaMA 3.2-3B: 1,024 words of context per token
- V-JEPA2-Giant: 64 frames, 4-second window
- Wav2Vec-BERT 2.0: resampled to 2 Hz

→ Shared transformer: 8 layers, 8 heads, D = 3 × 384 = 1,152, 100-second window
→ Your brain: 29,286 locations (20K cortical + 8.8K subcortical)

Think of the three encoders like borrowed eyes, ears, and a reading brain. LLaMA 3.2 reads text with 1,024 words of preceding context per token. V-JEPA2-Giant watches 4-second video segments. Wav2Vec-BERT listens to audio. Each compresses its input into a 384-dimensional embedding at 2 Hz.
Those three streams get concatenated (D = 1,152) and fed into an 8-layer transformer. The transformer mixes signals across a 100-second context window, then projects down to 29,286 brain locations per time step.
One model. Three senses. Every measurable spot in the brain.
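The shape bookkeeping above is easy to sketch. This toy snippet walks the same path with random arrays standing in for real encoder outputs (the variable names are mine, not the paper's): three 384-dimensional streams at 2 Hz over a 100-second window, concatenated, then mapped to 29,286 brain locations.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 100-second context window sampled at 2 Hz -> 200 time steps per modality.
T = 200
text_emb = rng.standard_normal((T, 384), dtype=np.float32)   # LLaMA 3.2-3B stream
video_emb = rng.standard_normal((T, 384), dtype=np.float32)  # V-JEPA2-Giant stream
audio_emb = rng.standard_normal((T, 384), dtype=np.float32)  # Wav2Vec-BERT 2.0 stream

# Concatenate the three modalities: D = 3 x 384 = 1,152
fused = np.concatenate([text_emb, video_emb, audio_emb], axis=-1)
assert fused.shape == (T, 1152)

# Stand-in for the 8-layer transformer plus readout: a single linear map
# to the 29,286 brain locations (20K cortical + 8.8K subcortical).
readout = rng.standard_normal((1152, 29286), dtype=np.float32) * 0.01
predicted_bold = fused @ readout
assert predicted_bold.shape == (T, 29286)
```

The real model mixes time steps across the whole window before the readout; the point here is just how three modest streams fan out into a whole-brain prediction per time step.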
Here's what got me: TRIBE v2 follows a log-linear scaling law. Prediction accuracy climbs linearly with the logarithm of fMRI training hours, with no plateau in sight. We're watching a GPT-style scaling curve, but for neuroscience. The training set was 451 hours. Imagine what happens at 5,000.
More data in, better predictions out: encoding accuracy vs. fMRI hours (log-linear, no ceiling).

  50 hrs → 0.12
 100 hrs → 0.19
 200 hrs → 0.28
 451 hrs → 0.40
1,000+ hrs → ??? (projected: the data doesn't exist yet, but the curve says it will work)
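You can check the extrapolation yourself from the four reported points. A least-squares fit of accuracy against log(hours) lands close to every measured value and, taken naively, projects roughly 0.49 at 1,000 hours and 0.70 at 5,000:

```python
import numpy as np

# Accuracy at each training-data budget, read off the reported points.
hours = np.array([50, 100, 200, 451])
acc = np.array([0.12, 0.19, 0.28, 0.40])

# Log-linear scaling law: accuracy ~ a + b * log(hours).
# np.polyfit returns coefficients highest-degree first: [slope, intercept].
b, a = np.polyfit(np.log(hours), acc, deg=1)

# Naive extrapolation to budgets nobody has collected yet.
proj_1000 = a + b * np.log(1000)
proj_5000 = a + b * np.log(5000)
```

Whether the curve actually holds past 451 hours is exactly the open question; this fit just shows the projection is a two-line calculation, not a model.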

The Result That Broke My Intuition

I expected the model to produce decent approximations. Useful but imprecise. What I did not expect: TRIBE v2's zero-shot prediction of group-averaged brain activity achieved a group-level correlation (R_group) near 0.4 on the Human Connectome Project's 7T dataset, while the median individual subject's own recording correlated with the group average at roughly half that.
Your actual scan: ~0.2 correlation with group truth. You blinked. You shifted. You were thinking about lunch. Single-subject fMRI is noisy.

TRIBE v2's guess: ~0.4 correlation with group truth. Never scanned this person. Never seen this stimulus. Denoised by construction.
Why does this happen? Because individual fMRI is inherently noisy: you blinked, you shifted, you sneezed. The model's representation is an average over hundreds of brains, denoised by training. It sees the signal underneath the noise.
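A toy simulation makes the "model beats scan" effect less mysterious. The noise levels below are made up, chosen only so the numbers land near the paper's ~0.2 vs. ~0.4; the mechanism is the point, not the values:

```python
import numpy as np

rng = np.random.default_rng(0)

T, N = 2000, 300                  # time points, subjects in the pool
signal = rng.standard_normal(T)   # shared "true" group response

# Heavy single-subject noise (blinks, motion, lunch):
scans = signal + 4.9 * rng.standard_normal((N, T))
group_mean = scans.mean(axis=0)   # averaging across subjects denoises

# One real subject's scan vs. the group truth:
r_single = np.corrcoef(scans[0], group_mean)[0, 1]

# A model that has learned the shared signal, with modest error of its own:
prediction = signal + 2.3 * rng.standard_normal(T)
r_model = np.corrcoef(prediction, group_mean)[0, 1]
```

The individual scan carries the true signal plus a mountain of private noise; a model trained across many brains only needs to recover the shared part, so its correlation with the group average can exceed any single recording's.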
And when you do have real data? One hour of fMRI for a new subject plus one epoch of fine-tuning gives you 2-4x better predictions than a linear model trained from scratch on the same data. The foundation model does the heavy lifting.

What 1 hour of real data buys you:

Linear model trained from scratch → 1x baseline accuracy
TRIBE v2 after 1 epoch of fine-tuning → 2-4x (same data, better model)
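The paper fine-tunes the model itself for one epoch; as the cheapest stand-in for the idea, here is what fitting just a subject-specific linear head on one simulated hour of data looks like, via closed-form ridge regression. All arrays are random placeholders, not real features or fMRI:

```python
import numpy as np

rng = np.random.default_rng(0)

# One "hour" of scanning at 2 Hz -> 7,200 time steps.
# `features` stands in for the frozen backbone's output (D = 1,152);
# `fmri` is the subject's measured response at V locations (kept small for speed).
T, D, V = 7200, 1152, 500
true_head = rng.standard_normal((D, V)) / np.sqrt(D)
features = rng.standard_normal((T, D))
fmri = features @ true_head + 0.5 * rng.standard_normal((T, V))

# Fit the subject-specific projection head in closed form (ridge regression):
#   W = (X^T X + lambda * I)^(-1) X^T Y
lam = 10.0
head = np.linalg.solve(features.T @ features + lam * np.eye(D),
                       features.T @ fmri)

# Training error approaches the simulated noise floor (0.25 here).
mse = np.mean((features @ head - fmri) ** 2)
```

With 7,200 samples against 1,152 features, one hour is plenty to pin down a linear head once the hard representational work is already done upstream, which is the foundation-model bet in miniature.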

It Rediscovered Decades of Neuroscience on Its Own

This is the part that should make neuroscientists uncomfortable (in a good way). The team ran virtual experiments through TRIBE v2 and checked whether it could recover findings that took the field decades to establish. It did.

Figure: brain regions the model found without being told they existed.

But here's what surprised me most. When the team applied Independent Component Analysis to the transformer's final layer, five functional networks emerged on their own: auditory, language, motion, default mode, and visual. Nobody labeled these. Nobody guided the model toward them. It organized itself into the same structure that neuroscientists spent decades mapping by hand.
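For intuition on what that ICA step does, here is a minimal FastICA run on synthetic "activations" built by linearly mixing a few known independent time courses. Everything below is a toy stand-in, not the paper's pipeline; the point is that unmixing recovers the hidden components without labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for final-layer activations: T time steps x D units,
# built as linear mixtures of K independent "network" time courses.
T, D, K = 2000, 50, 5
sources = rng.laplace(size=(K, T))        # non-Gaussian, as ICA requires
activations = (rng.standard_normal((D, K)) @ sources).T   # (T, D)

# Whiten and keep the top K components.
X = activations - activations.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = np.sqrt(T) * U[:, :K].T               # (K, T), unit covariance

# Minimal symmetric FastICA with a tanh contrast function.
W = rng.standard_normal((K, K))
for _ in range(200):
    G = np.tanh(W @ Z)
    W_new = (G @ Z.T) / T - np.diag((1 - G**2).mean(axis=1)) @ W
    vals, vecs = np.linalg.eigh(W_new @ W_new.T)   # symmetric decorrelation
    W = vecs @ np.diag(vals**-0.5) @ vecs.T @ W_new

networks = W @ Z                          # K recovered component time courses
```

Each recovered row lines up (up to sign and order) with one of the original sources. In the paper the same trick, applied to real transformer activations, surfaced the auditory, language, motion, default-mode, and visual networks.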
Why this matters if you work on multi-modal AI: TRIBE v2 never saw a brain. It mapped frozen AI embeddings to fMRI signals. The fact that this alignment holds across vision, audio, and language simultaneously tells us something deep: AI representations and biological representations share more structure than most of us assumed. If you build representation learning systems, this result should change how you think about what your model is learning.
If someone on your team works on BCI, neural decoding, or multi-modal embeddings, send them this section. They will want to see the architecture diagram above.
Meta's TRIBE v2 predicts how your brain responds to any sight or sound, without scanning you, and its zero-shot prediction beats individual fMRI recordings. The model, weights, and code are all open-source.
I ran the interactive demo for about 20 minutes. You feed it a video clip and it shows predicted brain activation maps in real time. What struck me: the visual cortex lights up differently for fast cuts vs. slow pans, and the language network activates even for non-speech audio with semantic content. It's the fastest way to build intuition for what "representational alignment" actually means in practice.
If you want to go deeper: GitHub repo for the code, HuggingFace for the weights. Fine-tuning on your own data takes under an hour.

Five Things Worth Knowing

1. The training data is surprisingly small. 451 hours from 25 deep-scan subjects watching movies, listening to podcasts, and viewing silent videos. Evaluated on 1,117 hours from 720 subjects. The fact that 25 deeply-scanned brains can generalize to 700+ unseen ones is the real story here.

2. The encoders never learn. LLaMA 3.2-3B, V-JEPA2-Giant, and Wav2Vec-BERT 2.0 stay frozen. The trainable parts are the temporal transformer and the subject-specific projection head. That's a design choice worth studying: it suggests the hard part isn't perception, it's temporal integration.

3. It works across languages. Train on English audio. Predict brain responses to Mandarin. The model generalizes because the underlying neural representations for language processing are more universal than the surface-level signals. I didn't expect this to work as well as it did.

4. 70x resolution improvement. Compared to the previous best neural prediction models, TRIBE v2 maps brain activity at 70x finer spatial resolution. That's the difference between knowing "somewhere in the visual cortex" and knowing which specific patch.

5. v1 already proved the concept. The original architecture took first place in the Algonauts 2025 challenge. v2 scales it from a competition entry into a general-purpose foundation model. This is the rare case where the follow-up paper matters more than the original.
TRIBE v2 trains on healthy adult brains. The scaling law shows no ceiling. So here's the question nobody is asking yet: what happens when you train a version on patients with neurological conditions and compare their predicted activation patterns to the healthy baseline? Could the gap between "expected brain" and "actual brain" become a diagnostic biomarker?
Early Alzheimer's detection from a 10-minute video viewing session instead of an expensive PET scan. That's the paper I'm waiting for. Hit reply if you think it's plausible, or if you see a reason it won't work.
Next issue: a new method recovers 3D protein structures from single-frame cryo-EM images. I'm still checking the resolution numbers because they seem too good.
ResearchAudio.io
Source: A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience (Meta FAIR, 2026)
Code: github.com/facebookresearch/tribev2 | Weights: huggingface.co/facebook/tribev2 | Demo: aidemos.atmeta.com/tribev2