Google's New Gemma Skips Next-Token Prediction
A language model that drafts the whole paragraph at once, and the physics behind it.

ResearchAudio.io

Discrete diffusion drafts whole blocks at once, up to 4x faster on a single GPU.

Open models · Inference · June 2026

Nearly every language model you have used writes one token at a time, left to right, like a typewriter. Google just shipped an open model that breaks from that pattern. DiffusionGemma drafts a 256-token block all at once and sharpens it over a few passes, the way an image model resolves static into a photo.

It decodes up to 4x faster than the autoregressive Gemma 4 it is built on, passing 1,100 tokens per second on a single H100, and the weights are on Hugging Face under Apache 2.0. The catch, which Google states plainly, is that it is also less accurate than that model.

4x faster decode · 1,100+ tok/s (H100) · 3.8B active / 25.2B total parameters · 256K context

Diffusion text generation is not a new idea. Researchers have chased it for years (LLaDA, Inception's Mercury, Google's own Gemini Diffusion), and every attempt to scale it ran into a quality wall. This is the first time a major lab has put a large, capable, multimodal diffusion model on the open web, with full benchmarks, tooling, and weights anyone can fine-tune.

Google DeepMind built it on the Gemma 4 26B Mixture-of-Experts backbone and added a diffusion head. The framing is candid for a launch: it targets researchers and developers working on fast, interactive, local workloads, and for anything that needs top quality, Google says to keep using standard Gemma 4. A vendor shipping a model and pointing you at a different one is rare enough to be worth noticing.

From typewriter to printing press

How the decode works

Autoregressive (Gemma 4 · the typewriter)
▸ 1 token per step, left to right
▸ Memory-bandwidth bound
▸ GPU idles between tokens

Diffusion (DiffusionGemma · the press)
▸ 256-token canvas per pass
▸ Compute bound
▸ GPU saturated, up to 4x

canvas of noise → iterative refine → clean text

Source: Google DeepMind launch blog and DiffusionGemma model card.

Here is the idea that is easy to miss under the speed numbers. Autoregressive decoding is bound by memory bandwidth: to produce one token, the GPU streams the model's active weights out of memory, uses them for a single step, and repeats, so most of the chip's compute sits idle waiting on memory. Diffusion hands it a 256-token block to work on at once, which turns generation into the kind of dense compute the hardware is built for. The bottleneck moves from memory to math, and the accelerator finally runs flat out.
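The shift from memory to math can be put in rough numbers. The sketch below is a back-of-the-envelope model, not a measurement. It assumes each decode pass reads the active weights from memory once, that a matrix multiply costs about 2 FLOPs per parameter per token, and that weights are stored at one byte each (FP8). It ignores attention, the KV cache, and Mixture-of-Experts routing. The function name is illustrative.

```python
def flops_per_weight_byte(tokens_per_pass: int, bytes_per_param: float = 1.0) -> float:
    """Rough arithmetic intensity of one decode pass.

    Assumes about 2 FLOPs per parameter per token (one multiply, one add)
    and that every active weight is read from memory once per pass.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


autoregressive = flops_per_weight_byte(tokens_per_pass=1)   # one token per step
diffusion = flops_per_weight_byte(tokens_per_pass=256)      # one 256-token canvas per pass

print(f"autoregressive: {autoregressive:.0f} FLOPs per byte of weights")
print(f"diffusion:      {diffusion:.0f} FLOPs per byte of weights")
print(f"ratio:          {diffusion / autoregressive:.0f}x more math per byte moved")
```

Under these assumptions the same weight traffic supports 256 times more arithmetic per pass. Whether a given chip actually becomes compute bound depends on its ratio of peak compute to memory bandwidth.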

The method Google calls multi-canvas sampling makes this concrete. The model starts with 256 random placeholder tokens, then makes several passes, locking in the ones it is sure about and using them as context to fix the rest, until the block resolves into clean text. It is block-autoregressive: diffusion inside each block, blocks in sequence.

Under the hood sits an encoder-decoder split tuned for inference. An autoregressive encoder prefills the prompt into a KV cache, and the decoder reads the whole canvas with bidirectional attention while cross-attending to that cache. Each pass commits roughly 15 to 20 tokens, the sampler caps at 48 denoising steps, and adaptive stopping ends early once the predictions stop changing.
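The figures in that paragraph fit together with simple arithmetic. The sketch below assumes, purely for illustration, that every pass commits a constant number of tokens. The real sampler varies this from pass to pass, so treat the results as a rough range rather than a specification.

```python
import math

CANVAS_TOKENS = 256   # block size quoted in the article
MAX_STEPS = 48        # denoising step cap quoted in the article


def passes_needed(tokens_per_pass: int, canvas: int = CANVAS_TOKENS) -> int:
    """Passes required to fill the canvas at a fixed commit rate."""
    return math.ceil(canvas / tokens_per_pass)


fast = passes_needed(20)   # upper end of "roughly 15 to 20"
slow = passes_needed(15)   # lower end of "roughly 15 to 20"

# Smallest constant commit rate that still finishes inside the step cap.
min_rate = math.ceil(CANVAS_TOKENS / MAX_STEPS)

print(f"passes at 20 tokens per pass: {fast}")
print(f"passes at 15 tokens per pass: {slow}")
print(f"minimum tokens per pass to finish in {MAX_STEPS} steps: {min_rate}")
```

At the quoted commit rate a full block resolves in well under half of the 48-step budget, which leaves room for harder blocks that commit fewer tokens per pass.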

Because the win is about hardware, it appears exactly where hardware is underused. Google measures up to 4x faster decoding: past 1,100 tokens per second on an H100 in FP8, and over 700 on a consumer RTX 5090. The same logic explains the limit, since batching thousands of cloud requests already saturates the chip and leaves diffusion's parallel decode adding cost instead of speed. The advantage lives at low concurrency, on one accelerator, close to the user.

How it decides what to keep

DiffusionGemma does not denoise blindly. Its recommended sampler, Entropy-Bounded Denoising with Adaptive Stopping, commits the tokens it is most certain about on each pass and renoises the rest, while the temperature cools from 0.8 to 0.4 as the block firms up. Adaptive stopping is the useful part: a canvas can finish well short of the 48-step cap once the model is both confident and stable, so a short structured answer resolves in a few passes while a hard one takes the full budget. That is why tokens per second is not a fixed number: it tracks how easy the task is.

sampler: entropy_bounded_denoising
max_denoising_steps: 48
temperature: 0.8 → 0.4   # linear decay
entropy_bound: 0.1   # commit lowest-entropy tokens
adaptive_stop:
  entropy_threshold: 0.005   # confident
  stable_steps: 2   # top token unchanged
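To make those settings concrete, here is a toy simulation of the loop they describe. It is a sketch under stated assumptions, not Google's implementation. The constants mirror the config above, but the commit rule (a running entropy budget with at least one commit per pass), the stop test (maximum entropy under the threshold), the small canvas, and `toy_denoiser` (a hypothetical stand-in for the real network) are all illustrative choices.

```python
import math
import random

VOCAB = 8          # toy vocabulary size (hypothetical)
CANVAS = 16        # toy canvas; the article's canvas is 256 tokens
MAX_STEPS = 48     # max_denoising_steps
T_START, T_END = 0.8, 0.4
ENTROPY_BOUND = 0.1
STOP_ENTROPY = 0.005
STABLE_STEPS = 2


def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_denoiser(target, committed, rng):
    """Stand-in for the network: the more of the canvas is committed,
    the more sharply the logits favour the (hidden) target token."""
    context = sum(tok is not None for tok in committed) / len(committed)
    rows = []
    for tok in target:
        row = [rng.random() for _ in range(VOCAB)]
        row[tok] += 2.0 + 12.0 * context
        rows.append(row)
    return rows


def sample_block(target, seed=0):
    rng = random.Random(seed)
    committed = [None] * len(target)   # None marks a still-noisy position
    prev_top, stable = None, 0
    for step in range(MAX_STEPS):
        # Linear temperature decay from T_START to T_END over the step budget.
        temperature = T_START + (T_END - T_START) * step / (MAX_STEPS - 1)
        probs = [softmax(row, temperature) for row in toy_denoiser(target, committed, rng)]
        ents = [entropy(p) for p in probs]
        top = [max(range(VOCAB), key=p.__getitem__) for p in probs]

        # Commit open positions from most to least certain while their summed
        # entropy stays within the bound; the first is always committed so
        # every pass makes progress. Everything else stays noisy.
        open_pos = sorted((i for i, t in enumerate(committed) if t is None),
                          key=ents.__getitem__)
        budget = 0.0
        for rank, i in enumerate(open_pos):
            budget += ents[i]
            if rank > 0 and budget > ENTROPY_BOUND:
                break
            committed[i] = top[i]

        # Adaptive stop: every position is confident and the top tokens have
        # been unchanged for STABLE_STEPS consecutive passes.
        stable = stable + 1 if top == prev_top else 0
        prev_top = top
        if max(ents) <= STOP_ENTROPY and stable >= STABLE_STEPS:
            break
    final = [t if t is not None else top[i] for i, t in enumerate(committed)]
    return final, step + 1


demo_rng = random.Random(1)
target = [demo_rng.randrange(VOCAB) for _ in range(CANVAS)]
final, steps = sample_block(target)
print(f"resolved {len(final)} tokens in {steps} of {MAX_STEPS} allowed passes")
```

Early passes commit only the single most certain token. As committed context accumulates, entropies fall and more tokens fit inside the bound, so the loop ends before the step cap. That is the behaviour adaptive stopping is meant to capture.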

Like Gemma 4, it also carries a toggleable thinking mode, switched on by a single control token in the system prompt, so you can spend a few denoising passes on internal reasoning when a task is worth it. Leave it off and the model answers straight from the canvas.

The part that isn't about speed

Speed is the headline. The more interesting shift is what the model can do. Because every token in a block can see every other token, DiffusionGemma can use the end of a passage to fix its beginning. Autoregression cannot: once it writes a token, that token is locked, and it never revises it in light of what comes after.
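The difference comes down to which positions each token is allowed to attend to. The sketch below prints the two attention masks for a four-token block. It is a generic illustration of causal versus bidirectional attention, not code from DiffusionGemma, and the helper names are made up for this example.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.
    Causal: a token sees itself and earlier tokens only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional: every token in the block sees every other token."""
    return [[True] * n for _ in range(n)]


def show(mask):
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


n = 4
print("causal (autoregressive):")
print(show(causal_mask(n)))
print("bidirectional (within a diffusion block):")
print(show(bidirectional_mask(n)))
```

In the causal mask the first row has a single `x`: the first token can never be conditioned on what follows it. In the bidirectional mask every row is full, which is what lets the end of a passage inform its beginning.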

This matters for a whole class of tasks where future context constrains the present: code infilling, in-line editing, closing brackets and markdown cleanly, even structured puzzles. Google's demo is the clearest tell: a fine-tuned DiffusionGemma that solves Sudoku, a puzzle autoregressive models handle poorly because the first square depends on squares they have not reached yet. A model that drafts the whole grid at once can reason across it, and the same property lets it render code or formatting in near real time, committing structure and content together.

The accuracy you trade away

DiffusionGemma vs Gemma 4 (26B A4B)

Benchmark                                    DiffusionGemma    Gemma 4
MMLU Pro                                     77.6%             82.6%
AIME 2026 (no tools)                         69.1%             88.3%
GPQA Diamond                                 73.2%             82.3%
LiveCodeBench v6                             69.1%             77.1%
Relative decode speed (local, single GPU)    up to 4x          1x baseline

Source: DiffusionGemma model card (instruction-tuned, Entropy-Bound sampler) and Google launch blog.

None of this comes without cost, and Google does not pretend otherwise. Against its own autoregressive twin, the diffusion model trails almost everywhere: MMLU Pro 77.6 to 82.6, AIME 2026 69.1 to 88.3, GPQA Diamond 73.2 to 82.3, LiveCodeBench v6 69.1 to 77.1. On vision the gap widens, with MMMU Pro at 54.3 against 73.8. One result breaks the pattern: on Humanity's Last Exam with no tools, the diffusion model edges ahead, 11.0 to 8.7, a hint that it may be reasoning differently rather than simply worse.

Two of those gaps land on the model's own turf. It is pitched for interactive and agentic local work, yet on Tau2, an agentic tool-use benchmark, it scores 56.2 against 68.2, and on MRCR long-context retrieval it reaches 32.0 against 44.1. Codeforces rating tells the same story, 1429 against 1718. The speed has to be worth those points, which is why the use case matters more here than the leaderboard.

What you get back is a small, local footprint. The Mixture-of-Experts design activates 3.8 billion of its 25.2 billion parameters per token (8 of 128 experts, plus one shared), and quantized it fits in 18GB of VRAM, which puts a 256K-context multimodal model on a single consumer card. Native NVFP4 support runs it 4-bit on Blackwell with near-lossless accuracy, tuned for the RTX 5090 and 4090, DGX Spark, and DGX Station.
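A quick calculation shows why the quoted footprint is plausible. The parameter counts and the 18GB figure come from the article. The 4.5 bits per weight is an assumption: it models a 4-bit format that also stores one 8-bit scale per 16 weights. The sketch also assumes every weight is quantized, which real checkpoints often are not.

```python
TOTAL_PARAMS = 25.2e9    # from the article
ACTIVE_PARAMS = 3.8e9    # from the article
VRAM_GB = 18.0           # quantized footprint quoted in the article


def weight_gb(params: float, bits_per_param: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


plain_4bit = weight_gb(TOTAL_PARAMS, 4.0)
with_scales = weight_gb(TOTAL_PARAMS, 4.5)   # assumed: 8-bit scale per 16 weights
active_share = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"weights at 4 bits:          {plain_4bit:.1f} GB")
print(f"weights at 4.5 bits:        {with_scales:.1f} GB")
print(f"left over out of {VRAM_GB:.0f} GB:     {VRAM_GB - with_scales:.1f} GB")
print(f"parameters active per token: {active_share:.1%}")
```

On these assumptions the weights take roughly 12.6 to 14.2 GB. The article does not break down how the rest of the 18GB is used, so the remainder is only an upper bound on what is left for activations and the KV cache.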

Adoption has not waited for the quality gap to close. In its first weeks the model passed 300,000 downloads, with community quantizations and fine-tunes already in the dozens. It runs through Hugging Face Transformers, vLLM, MLX, Unsloth, and NVIDIA NeMo, with llama.cpp support on the way, and Google shipped a modular JAX toolbox, Hackable Diffusion, for hands-on fine-tuning.

What to take away

Match it to the workload, not the headline. The 4x speedup is real at low concurrency on one accelerator, which means local and single-user. In high-throughput cloud serving, autoregressive batching already wins and diffusion can cost more. Reach for it when the user is one person waiting on one machine.

It is a hardware bet as much as a model. The whole advantage rests on having spare compute, so it shines on H100, Blackwell, and high-end consumer Nvidia cards, and may do little on memory-bandwidth-bound machines like Apple Silicon. Where you run it decides whether it is fast at all.

The frontier just opened up. A major lab has shipped a diffusion language model with the weights, the benchmarks, and the caveats, plus support across Unsloth, vLLM, MLX, NeMo, and NVIDIA NIM. Whatever the gap looks like today, the field can measure it, fine-tune against it, and work to close it.

Today the bet is narrow and honest: local, interactive, infill-heavy work where 4x speed is worth a few points of accuracy. But the gap to autoregression is points, not orders of magnitude, and the physics underneath favors diffusion as hardware keeps getting more compute-rich. If the next generation closes that gap, the typewriter stops being the default.

ResearchAudio.io · frontier AI research, explained for builders.

Model card: google/diffusiongemma-26B-A4B-it

Launch blog: blog.google

Developer guide: developers.googleblog.com
