“AI is Going to Fundamentally Change…Everything”
That’s what NVIDIA CEO Jensen Huang just said about the AI boom, even calling it “the largest infrastructure buildout in human history.”
NVIDIA’s chips made this real-time revolution possible, and now the company is collaborating with Miso to unlock new advances in robotics.
Already a first mover in the $1T fast-food industry, Miso’s AI-powered Flippy Fry Station robots have worked 200K+ hours for leading brands like White Castle and have just surpassed 5M+ baskets of fried food.
And this latest NVIDIA collaboration unlocks up to 35% faster performance for Miso’s robots, which can cook perfect fried foods 24/7. In an industry experiencing 144% labor turnover, where speed is key, those gains can be game-changing.
There are 100K+ US fast-food locations in desperate need, a $4B/year revenue opportunity for Miso. And you can become an early-stage Miso shareholder today. Hurry to unlock up to 7% bonus stock.
This is a paid advertisement for Miso Robotics’ Regulation A offering. Please read the offering circular at invest.misorobotics.com.
ResearchAudio.io

50 Tokens Predict Accuracy Better Than 50,000

Google's Deep-Thinking Ratio catches reasoning quality after a brief prefix, cutting inference costs by half.
Here's the part nobody's talking about. The AI community spent years assuming that longer chain-of-thought traces produce better reasoning. More tokens, more thinking, more accuracy. It sounds right. New research from the University of Virginia and Google shows the opposite is true. Raw token count has an average correlation of r = -0.59 with accuracy. That's not noise. That's a moderately strong negative relationship. As the model generates more text, it becomes more likely to be wrong. And meanwhile, you're paying for every one of those tokens.

Why longer traces fail

Engineers routinely use token count as a proxy for how hard an AI is working on a problem. The logic seems sound: more reasoning steps should yield better answers.

But the researchers found that longer traces often signal what they call "overthinking." The model gets caught in loops. It repeats redundant verification steps. It amplifies its own earlier mistakes by reasoning over flawed intermediate conclusions.
If you're running inference at scale, this means a portion of your compute budget is actively degrading output quality. You're paying for tokens that make answers worse.
What "deep thinking" actually looks like inside a transformer
The core insight of this paper is that real reasoning happens inside the model's layers, not in the length of its output. When a transformer predicts a token, it processes data through a series of layers (L). Not all tokens require the same depth of processing.
Shallow tokens are the easy predictions. For common words or obvious next steps, the model's internal "guess" stabilizes early, around layer 5 or so. The remaining 30 layers barely change the prediction.
Deep-thinking tokens are different. For a difficult logic step or a critical math symbol, the prediction shifts significantly in the deeper layers. The model is genuinely computing something new in those final layers, not just confirming what it already decided.
How they measure it
The team uses a technique to inspect the model's internal drafts at every layer. They project the intermediate states into vocabulary space using the model's unembedding matrix. This gives them a probability distribution over possible next tokens at each layer.
They then measure the Jensen-Shannon Divergence between each intermediate layer's distribution and the final layer's distribution. A token qualifies as "deep-thinking" if its prediction only stabilizes in the final 15% of layers (a depth fraction of 0.85).
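For readers who want the precise definition: the Jensen-Shannon Divergence used here is the standard symmetrized, smoothed form of KL divergence between a layer's distribution and the final layer's distribution:

```latex
M = \tfrac{1}{2}\left(P_\ell + P_L\right), \qquad
\mathrm{JSD}(P_\ell \,\|\, P_L)
  = \tfrac{1}{2}\,\mathrm{KL}(P_\ell \,\|\, M)
  + \tfrac{1}{2}\,\mathrm{KL}(P_L \,\|\, M)
```

where P_ℓ is the distribution read out at intermediate layer ℓ and P_L is the final layer's distribution. JSD is symmetric and bounded (by ln 2 in nats), which makes it a well-behaved signal for deciding when a token's prediction has settled.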
The Deep-Thinking Ratio (DTR) is the percentage of these hard tokens in a full output sequence. The higher the DTR, the more of the model's output involved genuine deep computation rather than shallow repetition.
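The measurement above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: it assumes you already have per-layer logit-lens projections for each generated token, and the stabilization threshold `tau` is an illustrative value, not one from the paper.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon Divergence between two probability vectors."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def deep_thinking_ratio(layer_logits, depth_frac=0.85, tau=0.1):
    """Fraction of tokens whose prediction only stabilizes in the
    final (1 - depth_frac) of layers.

    layer_logits: array of shape (num_tokens, num_layers, vocab_size),
    the logit-lens projection of each intermediate layer through the
    model's unembedding matrix. `tau` is an illustrative JSD
    stabilization threshold, not a value from the paper.
    """
    n_tokens, n_layers, _ = layer_logits.shape
    deep = 0
    for t in range(n_tokens):
        dists = softmax(layer_logits[t])
        final = dists[-1]
        divs = np.array([jsd(d, final) for d in dists])
        # earliest layer from which the prediction stays settled
        stable = n_layers - 1
        for l in range(n_layers):
            if np.all(divs[l:] < tau):
                stable = l
                break
        # deep-thinking if it only settled in the last 15% of depth
        if stable / (n_layers - 1) >= depth_frac:
            deep += 1
    return deep / n_tokens
```

In practice, `layer_logits` would come from a self-hosted open-weight model: capture each layer's hidden state for the token being generated and project it through the unembedding matrix before applying this function.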
They tested this across three models: DeepSeek-R1-70B, Qwen3-30B-Thinking, and GPT-OSS-120B. DTR showed a positive correlation of r = 0.683 with accuracy, consistently outperforming length-based and confidence-based alternatives.
Think@n: the practical method
The standard inference approach for hard problems is Self-Consistency (Cons@n). You sample 48 candidate answers, generate each one to completion, and pick the majority vote. It works well, but it's expensive: every candidate is fully generated.

Think@n changes this with a simple idea. Start generating all 48 candidates. After just 50 prefix tokens, compute the DTR for each candidate.
Immediately stop generating the candidates with low DTR scores. Only finish the candidates that are genuinely thinking hard.
The reasoning: if a candidate is producing mostly shallow tokens in the first 50, it's unlikely to produce a correct answer even if you let it run for 10,000 more tokens. The depth signal shows up early.
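The selection loop itself is tiny. A sketch under stated assumptions: `prefix_dtr` and `finish` are hypothetical callables standing in for the 50-token DTR scoring and full decoding steps, and `keep_frac` is an illustrative knob, not a value from the paper.

```python
from collections import Counter

def think_at_n(candidates, prefix_dtr, finish, keep_frac=0.25):
    """Early-halting variant of self-consistency (Think@n sketch).

    candidates: prefix objects (e.g. 50-token partial generations).
    prefix_dtr: callable scoring a prefix by its Deep-Thinking Ratio.
    finish:     callable that completes a prefix into a final answer.
    Only the top keep_frac of candidates by DTR are generated to
    completion; the rest are halted, saving their remaining tokens.
    """
    ranked = sorted(candidates, key=prefix_dtr, reverse=True)
    k = max(1, int(len(ranked) * keep_frac))
    answers = [finish(c) for c in ranked[:k]]
    # majority vote over the surviving candidates only
    return Counter(answers).most_common(1)[0][0]
```

With 48 candidates and `keep_frac=0.25`, only 12 are decoded to completion; the other 36 are charged just their 50-token prefixes, which is where the cost savings come from.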
| AIME 2025 | Cons@n | Think@n |
| --- | --- | --- |
| Accuracy | 92.7% | 94.7% |
| Avg. cost (k tokens) | 307.6 | 155.4 |
| Cost per correct answer | Baseline | ~48% less |
Higher accuracy at roughly half the cost. On the AIME 2025 math benchmark, Think@n scored 94.7% while consuming 155.4k tokens on average. Standard majority voting scored 92.7% at 307.6k tokens. You get better answers and a smaller bill.
What this means for your inference pipeline
There's a catch. DTR requires access to the model's internal layer activations. You need to be able to project intermediate states into vocabulary space and compute the divergence. This means it works with open-weight models you self-host (DeepSeek, Qwen, Llama), not with closed API-only services where you only see the final text output.
But if you're running open-weight models in production for math, code, or any reasoning-heavy workload, this is directly applicable. The implementation is straightforward: intercept activations at each layer, project through the unembedding matrix, compute Jensen-Shannon Divergence against the final layer, count the deep-thinking tokens, and divide by total tokens.
Key insights
Stop using token count as a quality proxy. If you're evaluating reasoning outputs by length, you're measuring the wrong thing. The paper shows length has a negative correlation with quality (r = -0.59). When DTR is high, the model is genuinely working on the problem.
Implement early halting for self-consistency sampling. If you're using Cons@n (majority voting over many samples) for math or reasoning tasks, switch to Think@n. Generate 50 prefix tokens per candidate, compute DTR, kill the low scorers immediately, and only finish the top candidates. The paper validates this across three major model families.

The 50-token window is the actionable number. You don't need thousands of tokens to assess quality. DTR can be estimated from just 50 prefix tokens, which means the cost of the assessment itself is negligible. This is what makes the approach practical at scale.
The industry spent years optimizing for longer chain-of-thought. This paper provides evidence that the real quality signal was always in the layers, not the length.
Quick Hits
Mamba-3 state space model lands. The new architecture uses exponential-trapezoidal discretization with complex-valued states. It achieves a 1.8 percentage point accuracy gain over GDN at the 1.5B parameter scale, with lower decode latency than prior linear models like Mamba-2.
Nvidia acquires Groq inference talent. Nvidia secured a licensing agreement for Groq's inference chip technology and hired Groq's founder Jonathan Ross along with senior engineers. Groq continues operating independently under new leadership. The move signals Nvidia is taking inference-specific hardware seriously.
Edge inference spend projected at $378B by 2028. IDC predicts AI use cases will push edge computing investment to nearly $378 billion, with companies like Akamai claiming sub-millisecond latency advantages over hyperscaler regions for real-time inference workloads.
Fast-WAM skips imagination at test time. Tsinghua researchers built a World Action Model that achieves 97.6% task success on robotics benchmarks while running 4x faster than imagine-then-execute approaches, by using video co-training during learning but skipping explicit future simulation during inference.
The Take
I expected this paper to be another incremental efficiency gain. It isn't. The negative correlation between token count and accuracy challenges a core assumption of how most inference pipelines are designed today.
If you're budgeting more tokens to improve quality, you may be actively making things worse. That's a hard message for teams who've built their systems around the "think longer, think better" paradigm.
The practical limitation is real: DTR requires internal layer access, so it only works with open-weight models you self-host. But for anyone running DeepSeek, Qwen, Llama, or similar models in production, this is a concrete path to cutting your inference bill in half while getting better answers. I'd start testing it this week.
The Open Question
DTR works because you can peek inside the transformer layers. But what about closed models behind APIs, where you only see the final text output? Is there a way to estimate reasoning depth from the text alone, without needing layer-state access?
If you've found a text-only proxy that correlates with answer quality better than token count, I'd like to hear about it. Hit reply.
Paid members get the full implementation walkthrough: DTR computation code for vLLM and HuggingFace, optimal early-halting thresholds by model family, and a cost calculator spreadsheet for estimating savings on your specific inference workload.
The deeper implication goes beyond cost. If the quality signal lives in the depth of layer processing rather than the volume of output, then inference scaling laws need rewriting. The question is no longer "how many tokens should the model generate" but "how hard should each token think."

Source: arXiv:2602.13517 (University of Virginia + Google, February 2026)


