ResearchAudio.io

Meta Built a Digital Twin of Your Brain. It's More Accurate Than the Real Thing.

TRIBE v2 predicts your neural response to any sight or sound, no scanner needed.

70x finer resolution than prior SOTA | 700+ brains in the training pool | 29K brain locations predicted per frame
I read this result three times because I thought I was misunderstanding the claim.
Meta's FAIR team built a model called TRIBE v2. You show it a movie clip. You tell it to predict how a person's brain will respond. The catch: the model has never seen this movie, and it has never scanned this person. And its prediction of the group-average brain response is 2x more correlated with reality than the actual fMRI recording from any individual in the group.
The AI's guess about your brain is more reliable than your own brain scan. That's the paper's headline finding, and I spent the last few hours pulling it apart.

Why This Exists

If you want to know how the brain processes faces, you recruit 30 undergrads, show them face photos in an MRI tube, and publish a paper about the fusiform gyrus. That's neuroscience in 2025. Small datasets, narrow questions, and months of scanner time for each study.
Nobody has a model that handles vision, audio, and language together, mapping all of them to the whole brain at once. TRIBE v2 is the first serious attempt.

Three Frozen Giants, One Shared Brain

The architecture is elegant. TRIBE v2 takes three existing AI models, freezes them completely, and uses their internal representations as a lens into human neural activity.

How TRIBE v2 sees, hears, and reads: three frozen encoders feed one shared transformer.

- LLaMA 3.2-3B: 1,024 words of context per token
- V-JEPA2-Giant: 64 frames, 4-second window
- Wav2Vec-BERT 2.0: resampled to 2 Hz

→ Shared transformer: 8 layers, 8 heads, D = 3 × 384 = 1,152, 100-second window
→ Your brain: 29,286 locations (20K cortical + 8.8K subcortical)

Think of the three encoders like borrowed eyes, ears, and a reading brain. LLaMA 3.2 reads text with 1,024 words of preceding context per token. V-JEPA2-Giant watches 4-second video segments. Wav2Vec-BERT listens to audio. Each compresses its input into a 384-dimensional embedding at 2 Hz.
Those three streams get concatenated (D = 1,152) and fed into an 8-layer transformer. The transformer mixes signals across a 100-second context window, then projects down to 29,286 brain locations per time step.
One model. Three senses. Every measurable spot in the brain.
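The shape bookkeeping above is easy to sketch. This toy snippet walks the same path with random arrays standing in for real encoder outputs (the variable names are mine, not the paper's): three 384-dimensional streams at 2 Hz over a 100-second window, concatenated, then mapped to 29,286 brain locations.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 100-second context window sampled at 2 Hz -> 200 time steps per modality.
T = 200
text_emb = rng.standard_normal((T, 384), dtype=np.float32)   # LLaMA 3.2-3B stream
video_emb = rng.standard_normal((T, 384), dtype=np.float32)  # V-JEPA2-Giant stream
audio_emb = rng.standard_normal((T, 384), dtype=np.float32)  # Wav2Vec-BERT 2.0 stream

# Concatenate the three modalities: D = 3 x 384 = 1,152
fused = np.concatenate([text_emb, video_emb, audio_emb], axis=-1)
assert fused.shape == (T, 1152)

# Stand-in for the 8-layer transformer plus readout: a single linear map
# to the 29,286 brain locations (20K cortical + 8.8K subcortical).
readout = rng.standard_normal((1152, 29286), dtype=np.float32) * 0.01
predicted_bold = fused @ readout
assert predicted_bold.shape == (T, 29286)
```

The real model mixes time steps across the whole window before the readout; the point here is just how three modest streams fan out into a whole-brain prediction per time step.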
Here's what got me: TRIBE v2 follows a log-linear scaling law. Prediction accuracy climbs linearly with the logarithm of fMRI training hours, with no plateau in sight. We're watching a GPT-style scaling curve, but for neuroscience. The training set was 451 hours. Imagine what happens at 5,000.
More data in, better predictions out: encoding accuracy vs. fMRI hours (log-linear, no ceiling).

  50 hrs → 0.12
 100 hrs → 0.19
 200 hrs → 0.28
 451 hrs → 0.40
1,000+ hrs → ??? (projected: the data doesn't exist yet, but the curve says it will work)
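You can check the extrapolation yourself from the four reported points. A least-squares fit of accuracy against log(hours) lands close to every measured value and, taken naively, projects roughly 0.49 at 1,000 hours and 0.70 at 5,000:

```python
import numpy as np

# Accuracy at each training-data budget, read off the reported points.
hours = np.array([50, 100, 200, 451])
acc = np.array([0.12, 0.19, 0.28, 0.40])

# Log-linear scaling law: accuracy ~ a + b * log(hours).
# np.polyfit returns coefficients highest-degree first: [slope, intercept].
b, a = np.polyfit(np.log(hours), acc, deg=1)

# Naive extrapolation to budgets nobody has collected yet.
proj_1000 = a + b * np.log(1000)
proj_5000 = a + b * np.log(5000)
```

Whether the curve actually holds past 451 hours is exactly the open question; this fit just shows the projection is a two-line calculation, not a model.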

The Result That Broke My Intuition

I expected the model to produce decent approximations. Useful but imprecise. What I did not expect: TRIBE v2's zero-shot prediction of group-averaged brain activity achieved a group-level correlation (R_group) near 0.4 on the Human Connectome Project's 7T dataset, while the median individual subject's own recording correlated with the group average at roughly half that.
Your actual scan: ~0.2 correlation with group truth. You blinked. You shifted. You were thinking about lunch. Single-subject fMRI is noisy.

TRIBE v2's guess: ~0.4 correlation with group truth. Never scanned this person. Never seen this stimulus. Denoised by construction.
Why does this happen? Because individual fMRI is inherently noisy: you blinked, you shifted, you sneezed. The model's representation is an average over hundreds of brains, denoised by training. It sees the signal underneath the noise.
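A toy simulation makes the "model beats scan" effect less mysterious. The noise levels below are made up, chosen only so the numbers land near the paper's ~0.2 vs. ~0.4; the mechanism is the point, not the values:

```python
import numpy as np

rng = np.random.default_rng(0)

T, N = 2000, 300                  # time points, subjects in the pool
signal = rng.standard_normal(T)   # shared "true" group response

# Heavy single-subject noise (blinks, motion, lunch):
scans = signal + 4.9 * rng.standard_normal((N, T))
group_mean = scans.mean(axis=0)   # averaging across subjects denoises

# One real subject's scan vs. the group truth:
r_single = np.corrcoef(scans[0], group_mean)[0, 1]

# A model that has learned the shared signal, with modest error of its own:
prediction = signal + 2.3 * rng.standard_normal(T)
r_model = np.corrcoef(prediction, group_mean)[0, 1]
```

The individual scan carries the true signal plus a mountain of private noise; a model trained across many brains only needs to recover the shared part, so its correlation with the group average can exceed any single recording's.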
And when you do have real data? One hour of fMRI for a new subject plus one epoch of fine-tuning gives you 2-4x better predictions than a linear model trained from scratch on the same data. The foundation model does the heavy lifting.

What 1 hour of real data buys you:

Linear model trained from scratch → 1x baseline accuracy
TRIBE v2 after 1 epoch of fine-tuning → 2-4x (same data, better model)
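The paper fine-tunes the model itself for one epoch; as the cheapest stand-in for the idea, here is what fitting just a subject-specific linear head on one simulated hour of data looks like, via closed-form ridge regression. All arrays are random placeholders, not real features or fMRI:

```python
import numpy as np

rng = np.random.default_rng(0)

# One "hour" of scanning at 2 Hz -> 7,200 time steps.
# `features` stands in for the frozen backbone's output (D = 1,152);
# `fmri` is the subject's measured response at V locations (kept small for speed).
T, D, V = 7200, 1152, 500
true_head = rng.standard_normal((D, V)) / np.sqrt(D)
features = rng.standard_normal((T, D))
fmri = features @ true_head + 0.5 * rng.standard_normal((T, V))

# Fit the subject-specific projection head in closed form (ridge regression):
#   W = (X^T X + lambda * I)^(-1) X^T Y
lam = 10.0
head = np.linalg.solve(features.T @ features + lam * np.eye(D),
                       features.T @ fmri)

# Training error approaches the simulated noise floor (0.25 here).
mse = np.mean((features @ head - fmri) ** 2)
```

With 7,200 samples against 1,152 features, one hour is plenty to pin down a linear head once the hard representational work is already done upstream, which is the foundation-model bet in miniature.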

It Rediscovered Decades of Neuroscience on Its Own

This is the part that should make neuroscientists uncomfortable (in a good way). The team ran virtual experiments through TRIBE v2 and checked whether it could recover findings that took the field decades to establish. It did.

Figure: brain regions the model found without being told they existed.

But here's what surprised me most. When the team applied Independent Component Analysis to the transformer's final layer, five functional networks emerged on their own: auditory, language, motion, default mode, and visual. Nobody labeled these. Nobody guided the model toward them. It organized itself into the same structure that neuroscientists spent decades mapping by hand.
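For intuition on what that ICA step does, here is a minimal FastICA run on synthetic "activations" built by linearly mixing a few known independent time courses. Everything below is a toy stand-in, not the paper's pipeline; the point is that unmixing recovers the hidden components without labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for final-layer activations: T time steps x D units,
# built as linear mixtures of K independent "network" time courses.
T, D, K = 2000, 50, 5
sources = rng.laplace(size=(K, T))        # non-Gaussian, as ICA requires
activations = (rng.standard_normal((D, K)) @ sources).T   # (T, D)

# Whiten and keep the top K components.
X = activations - activations.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = np.sqrt(T) * U[:, :K].T               # (K, T), unit covariance

# Minimal symmetric FastICA with a tanh contrast function.
W = rng.standard_normal((K, K))
for _ in range(200):
    G = np.tanh(W @ Z)
    W_new = (G @ Z.T) / T - np.diag((1 - G**2).mean(axis=1)) @ W
    vals, vecs = np.linalg.eigh(W_new @ W_new.T)   # symmetric decorrelation
    W = vecs @ np.diag(vals**-0.5) @ vecs.T @ W_new

networks = W @ Z                          # K recovered component time courses
```

Each recovered row lines up (up to sign and order) with one of the original sources. In the paper the same trick, applied to real transformer activations, surfaced the auditory, language, motion, default-mode, and visual networks.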
Why this matters if you work on multi-modal AI: TRIBE v2 never saw a brain. It mapped frozen AI embeddings to fMRI signals. The fact that this alignment holds across vision, audio, and language simultaneously tells us something deep: AI representations and biological representations share more structure than most of us assumed. If you build representation learning systems, this result should change how you think about what your model is learning.
If someone on your team works on BCI, neural decoding, or multi-modal embeddings, send them this section. They will want to see the architecture diagram above.
Meta's TRIBE v2 predicts how your brain responds to any sight or sound, without scanning you, and its zero-shot prediction beats individual fMRI recordings. The model, weights, and code are all open-source.
I ran the interactive demo for about 20 minutes. You feed it a video clip and it shows predicted brain activation maps in real time. What struck me: the visual cortex lights up differently for fast cuts vs. slow pans, and the language network activates even for non-speech audio with semantic content. It's the fastest way to build intuition for what "representational alignment" actually means in practice.
If you want to go deeper: GitHub repo for the code, HuggingFace for the weights. Fine-tuning on your own data takes under an hour.

Five Things Worth Knowing

1. The training data is surprisingly small. 451 hours from 25 deep-scan subjects watching movies, listening to podcasts, and viewing silent videos. Evaluated on 1,117 hours from 720 subjects. The fact that 25 deeply-scanned brains can generalize to 700+ unseen ones is the real story here.

2. The encoders never learn. LLaMA 3.2-3B, V-JEPA2-Giant, and Wav2Vec-BERT 2.0 stay frozen. The trainable parts are the temporal transformer and the subject-specific projection head. That's a design choice worth studying: it suggests the hard part isn't perception, it's temporal integration.

3. It works across languages. Train on English audio. Predict brain responses to Mandarin. The model generalizes because the underlying neural representations for language processing are more universal than the surface-level signals. I didn't expect this to work as well as it did.

4. 70x resolution improvement. Compared to the previous best neural prediction models, TRIBE v2 maps brain activity at 70x finer spatial resolution. That's the difference between knowing "somewhere in the visual cortex" and knowing which specific patch.

5. v1 already proved the concept. The original architecture took first place in the Algonauts 2025 challenge. v2 scales it from a competition entry into a general-purpose foundation model. This is the rare case where the follow-up paper matters more than the original.
TRIBE v2 trains on healthy adult brains. The scaling law shows no ceiling. So here's the question nobody is asking yet: what happens when you train a version on patients with neurological conditions and compare their predicted activation patterns to the healthy baseline? Could the gap between "expected brain" and "actual brain" become a diagnostic biomarker?
Early Alzheimer's detection from a 10-minute video viewing session instead of an expensive PET scan. That's the paper I'm waiting for. Hit reply if you think it's plausible, or if you see a reason it won't work.
Next issue: a new method recovers 3D protein structures from single-frame cryo-EM images. I'm still checking the resolution numbers because they seem too good.
ResearchAudio.io
Source: A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience (Meta FAIR, 2026)
Code: github.com/facebookresearch/tribev2 | Weights: huggingface.co/facebook/tribev2 | Demo: aidemos.atmeta.com/tribev2