TL;DR
1. On May 11, Thinking Machines released TML-Interaction-Small, a 276B-parameter model with 12B active parameters.
2. It reads audio, video, and text in 200ms slices and writes audio plus text in parallel, in the same slice, forever.
3. This deletes four pieces every voice AI in production today depends on: voice activity detection, turn-boundary prediction, separate audio encoders, and bolted-on text-to-speech.
4. On standard tests it beats or ties GPT-realtime-2.0 and Gemini-3.1-flash-live. On new tests Thinking Machines had to invent, it wins by 10x or more, because nothing else can do them.
5. Research preview now. Wider release later this year. Every voice product stack will need a rewrite.
You have probably used ChatGPT voice mode. Or Gemini Live. Or Siri. And you have noticed something.
It is not really a conversation. It is a walkie-talkie.
You speak. It waits. It answers. It waits. You speak again.
Try to interrupt and it gets confused. Point your camera at something while talking and it cannot see it. Pause mid-sentence and it jumps in too early, or too late.
Here is why.
The harness no one talks about
Inside every commercial voice AI, the model itself is smart. But the model cannot listen to you on its own. It cannot decide when to speak. It cannot read your video feed.
So engineers wrap it in a layer of small specialist systems that handle all of that for it. This layer has a name. They call it the harness.
The harness has four main parts:
Voice activity detection (VAD). Watches your audio waveform and guesses when you stopped talking.
Turn-boundary prediction. Decides the exact moment the model should jump in to speak.
Audio encoder. A separate model that converts your voice into tokens the main model can read.
Text-to-speech. Another separate model that converts the main model's text back into spoken audio.
Each of these is much smaller and less intelligent than the main model. Each one caps how good the experience can get.
When VAD guesses wrong, the model interrupts you. When turn-boundary prediction misfires, the model speaks too early. The audio encoder strips out information the main model could have used. And TTS adds latency on every reply.
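To make the shape of this concrete, here is a rough sketch of one walkie-talkie turn through a harness. Every component name in it (vad, asr_encode, llm, tts) is an illustrative stand-in, not any vendor's actual API.

```python
# Sketch of a conventional harness pipeline. All component names are
# illustrative stand-ins, not a real vendor API.

def harness_turn(mic_stream, vad, asr_encode, llm, tts):
    """One walkie-talkie turn: wait for silence, then think, then speak."""
    audio = []
    for chunk in mic_stream:                 # 1. buffer audio until...
        audio.append(chunk)
        if vad.is_end_of_speech(chunk):      # 2. ...VAD guesses you stopped
            break
    tokens = asr_encode(audio)               # 3. separate encoder -> tokens
    reply_text = llm.generate(tokens)        # 4. main model reads, writes text
    return tts.synthesize(reply_text)        # 5. bolted-on TTS adds latency

# Every stage before and after the LLM is a smaller model that can misfire,
# and nothing happens until the whole user turn has finished.
```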
This is exactly the situation Rich Sutton warned about in 2019.
The Bitter Lesson says general methods that scale with compute eventually beat handcrafted systems. Every time.
Speech recognition went through this. Translation went through this. Image classification went through this.
Voice interaction was next on the list.
Thinking Machines just shipped the proof
On May 11, Thinking Machines released TML-Interaction-Small. A 276B-parameter model that throws the entire harness out.
No VAD. No turn prediction. No separate encoder. No bolted-on speech.
One model handles audio, video, and text. In 200-millisecond slices. In both directions. At the same time.
The architecture shift

Today: Turn-based. User speaks → model waits → model speaks. Required harness: VAD · turn predictor · audio encoder · TTS.

TM Labs: Micro-turn. User stream and model stream run in parallel, in 200ms slices. No harness: silence is a token; audio + video + text in one model.

Source: Thinking Machines Lab, Interaction Models (May 2026)
So how does a model talk while you talk
The whole trick is one idea: the micro-turn.
Instead of waiting for a complete user turn, then generating a complete model turn, the model chops time into 200-millisecond slices. In each slice, it reads 200ms of input and writes 200ms of output. Then it does it again. Forever.
Both directions stream. In the same slice, the model can hear what you are saying right now, see what your camera is pointing at right now, and produce its own voice and text response right now. The streams interleave.
And here is the thing that breaks the old design: silence is just a token. The model is always reading. So when you stop talking, it does not need a separate detector to figure that out. The pause is already in its context, like any other piece of information.
This single change kills the entire harness layer.
VAD is gone, because silence is a token. Turn prediction is gone, because there are no turns to predict. Separate audio encoders are gone, because the model takes raw audio through a lightweight embedding (dMel) and learns the rest itself. Bolted-on TTS is gone, because the model generates speech directly with a flow head.
All co-trained from scratch. One model. One context.
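To make the timing concrete, here is a minimal sketch of that loop. The 200ms slice comes from the release; every name in it (read_slice, step, dmel_embed, flow_decode, play) is a hypothetical stand-in, not Thinking Machines' actual interface.

```python
# Minimal sketch of the micro-turn loop. SLICE_MS matches the 200ms slice
# described in the release; every callable below is a hypothetical stand-in.

SLICE_MS = 200

def interaction_loop(mic, camera, speaker, model, dmel_embed, flow_decode):
    """Read 200ms of input and write 200ms of output, in the same slice, forever."""
    context = []
    while True:
        audio_in = mic.read_slice(SLICE_MS)      # raw audio, even if silent:
        video_in = camera.read_slice(SLICE_MS)   # silence is just another token
        context.append(dmel_embed(audio_in))     # lightweight audio embedding
        context.append(video_in)

        out = model.step(context)                # one step per slice; both
        context.append(out)                      # directions share one context

        speaker.play(flow_decode(out.audio))     # speech generated directly
        # out.text streams to the UI in parallel; no separate TTS stage
```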
But what about heavy thinking
Here is the natural objection. If the model has to respond every 200 milliseconds, how does it ever do deep reasoning, tool calls, or long web searches?
Thinking Machines split the work. The interaction model handles the live conversation. When something needs real thought, it hands the full context over to a second background model that runs asynchronously.
The interaction model stays present. It keeps the conversation going, answers follow-ups, takes new input. When the background model finishes its tool call or search, the result streams back in and the interaction model weaves it into the conversation at a natural moment.
Two models. Shared context.
One feels fast. The other thinks deep. The user gets both.
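Here is a rough sketch of what that hand-off could look like, using asyncio. The split and the method names (needs_deep_thought, reason, snapshot) are assumptions about the interface, not the published design.

```python
import asyncio

# Rough sketch of the two-model pattern: a fast interaction model stays present
# every slice while a slower background model reasons asynchronously over the
# same context. All object and method names are hypothetical stand-ins.

async def converse(interaction_model, background_model, shared_context, io):
    pending = None                                       # in-flight reasoning task
    while True:
        shared_context.append(await io.read_slice(200))  # 200ms of audio/video/text

        if pending is None and interaction_model.needs_deep_thought(shared_context):
            # Hand the full context to the background model without blocking.
            pending = asyncio.create_task(
                background_model.reason(shared_context.snapshot())
            )

        if pending is not None and pending.done():
            shared_context.append(pending.result())      # weave the result back in
            pending = None                               # at a natural moment

        out = interaction_model.step(shared_context)     # keep the conversation going
        await io.write_slice(out)
```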
The benchmarks tell two stories
The first story is normal. On benchmarks GPT-realtime-2.0 and Gemini-3.1-flash-live already exist on, TML-Interaction-Small wins or ties.
Turn-taking latency on FD-bench V1: 0.40 seconds. GPT-realtime-2.0 minimal sits at 1.18. Gemini-3.1 sits at 0.57.
Interaction quality on FD-bench V1.5: 77.8. GPT scores 46.8. Gemini scores 54.3.
On Audio MultiChallenge, an intelligence benchmark for voice, it scores 43.4. GPT minimal scores 37.6. It stays competitive even when GPT runs in its highest reasoning mode (48.5).
Fine. New model wins benchmarks. We have seen this movie before.
But then there is the second story. And it is much stranger.
Thinking Machines kept hitting capabilities the existing benchmarks could not measure. So they invented new ones. TimeSpeak asks whether a model can speak at a precise time on command ("remind me to breathe every 4 seconds"). CueSpeak asks whether it can react to a verbal cue while you are still speaking.
RepCount-A streams a video of someone doing pushups and asks the model to count them out loud as they happen. Charades streams a video and asks the model to say "start" and "stop" when a specific action begins and ends.
The numbers on these tests are not close.
New benchmarks · TML vs GPT-realtime-2.0

TimeSpeak (speak at a specific time): 64.7 vs 4.3
CueSpeak (react to a verbal cue): 81.7 vs 2.9
RepCount-A (count pushups on video): 35.4 vs 1.3
These are not small gaps. On Charades action detection, TML scores 32.4 mIoU. GPT-realtime-2.0 scores zero.
The harness-based systems do not just lose. They cannot attempt the task. The architecture forbids it.
You cannot count pushups in a video if your model is text-only and your harness only triggers on audio. You cannot say "stop" at the exact moment an action ends if your model has to wait for the user to finish a turn. You cannot interrupt your user when they say something wrong if you have no concept of speaking while listening.
What this actually unlocks
Translation that runs while the speaker is still mid-sentence. Live, no waiting for a pause.
A coding assistant that watches your screen and interrupts the second you write a bug. A workout coach that counts reps by looking at the camera. A language tutor that catches a mispronunciation in the middle of the word, not after the sentence.
A sports commentator that narrates the game as it happens. An interview rehearsal partner that backchannels ("mhmm", "go on") instead of staying silent. A driving assistant that warns you about something it sees in the camera while you are mid-question.
None of these work today. Not because the underlying intelligence is missing, but because the interface refuses to let them work. The harness is in the way.
Four things worth taking away
Insight 1: Every voice product in production today sits on top of components less intelligent than the underlying model. Sutton's Bitter Lesson says those components get outpaced. This release is that lesson arriving in voice.
Insight 2: The speed-vs-thinking tradeoff has a clean answer. One fast model for live presence, one slow model for reasoning. Shared context, async hand-off, results streamed back into the conversation. Expect this two-model pattern to spread fast.
Insight 3: When a new model needs new benchmarks to even measure its capabilities, pay attention. Audio MultiChallenge and BigBench Audio cannot see what is happening here. The capability gap is much larger than the leaderboards suggest.
Insight 4: If you are shipping voice products, audit your stack now. Which parts of your harness disappear the moment native full-duplex models reach your API? Plan the rewrite before competitors do.
The walkie-talkie era has an end date
TML-Interaction-Small is a 276B mixture-of-experts model with 12B active parameters. It is a research preview, not a public API yet. Thinking Machines says a limited preview is coming in the next few months, with wider access later this year. Larger models are on the way.
The point is the direction. Interaction is no longer a thin layer wrapped around a smart model. It is the model.
And every voice product built on a harness is now living on borrowed time.