Gemma 4 12B Deletes the Encoder Stack
Vision became one matrix multiply. Raw audio flows straight into reasoning.

ResearchAudio.io

Frontier Models · June 5, 2026

Vision is one matrix multiply. Audio drops the encoder entirely. It still runs on a 16GB laptop.

12B parameters · 16GB to run locally · 0 multimodal encoders

Start with the part that sounds like a typo. To handle images, Gemma 4 12B does not use a vision encoder. It uses a single matrix multiplication.
Google released Gemma 4 12B this week, a mid-sized open model that sits between the tiny E4B and the 26B Mixture-of-Experts. Most coverage led with the laptop angle: it runs a multimodal agent locally on 16GB of memory. That part is true and useful. The part worth a closer look is how it gets there.

[fig. 01 · Inside Gemma 4 12B: what the encoder-less build gives you. Six panels: 01 Encoder-less vision (image pixels, one matmul plus norms and a positional embedding, LLM backbone), 02 Encoder-less audio (raw waveform projected into the same dimensions as text), 03 Near-26B reasoning (at less than half the memory), 04 16GB laptop (system or unified memory, local deployment), 05 Drafters, built in (Multi-Token Prediction to cut latency), 06 Open and everywhere (Apache 2.0, weights on Hugging Face and Kaggle). Source: Google DeepMind · Introducing Gemma 4 12B (June 3, 2026)]

The standard recipe for multimodal models is to bolt encoders onto a language model. A vision encoder turns pixels into embeddings. An audio encoder turns waveforms into embeddings. Those embeddings get projected into the model, and each encoder adds its own parameters, latency, and memory.
Gemma 4 12B drops that pattern. For vision, Google replaced the encoder with a lightweight embedding module: one matrix multiplication, a positional embedding, and a few normalizations. The language model backbone takes over the visual processing itself. For audio, they went further and removed the encoder completely, projecting the raw audio signal into the same dimensional space as text tokens.
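
To make the vision path concrete, here is a minimal sketch of what "one matrix multiplication, a positional embedding, and a few normalizations" could look like. Everything specific in it is an assumption for illustration, not Google's implementation: the patch size, the model width, the use of RMS normalization, the order of the steps, and the random weights standing in for learned ones.

```python
import math
import random

PATCH = 2      # hypothetical patch size (pixels per side)
D_MODEL = 8    # hypothetical backbone width; real models are far wider

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    patches = []
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches

def rms_norm(vec, eps=1e-6):
    """Scale a vector to unit root-mean-square (one common normalization)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def embed(patches, weight, pos_emb):
    """Project each patch with one matrix multiply, add position, normalize."""
    tokens = []
    for i, patch_vec in enumerate(patches):
        # projected[j] = sum over k of patch_vec[k] * weight[k][j]
        projected = [sum(x * row[j] for x, row in zip(patch_vec, weight))
                     for j in range(len(weight[0]))]
        with_pos = [p + e for p, e in zip(projected, pos_emb[i])]
        tokens.append(rms_norm(with_pos))
    return tokens

random.seed(0)
image = [[random.random() for _ in range(4)] for _ in range(4)]  # toy 4x4 image
patches = patchify(image, PATCH)                                 # 4 patches of 4 pixels
weight = [[random.gauss(0, 1) for _ in range(D_MODEL)]
          for _ in range(PATCH * PATCH)]                         # stand-in for learned weights
pos_emb = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in patches]
tokens = embed(patches, weight, pos_emb)
print(len(tokens), "image tokens of width", len(tokens[0]))
```

The point of the sketch is how little sits between pixels and the backbone: no attention layers, no separate encoder weights. The audio path described above would presumably follow the same shape, with fixed-length waveform frames in place of image patches, though the source does not spell out the framing.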

Google's new multimodal model replaces the vision encoder with a single matrix multiplication and removes the audio encoder completely. The language model handles perception itself.

If you run a multimodal pipeline today with separate vision and audio encoders, this is a concrete memory and latency baseline to beat on the same hardware. Google reports benchmark performance nearing the 26B MoE at less than half the total memory footprint, small enough for consumer laptops with 16GB of system or unified memory. It also ships with Multi-Token Prediction drafters to cut latency, and it is the first mid-sized Gemma to take native audio input.
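
A quick back-of-envelope check shows why 16GB is plausible. This sketch only counts weight storage, and the precisions are assumptions: the source does not say which quantization Google used for its memory comparison, and KV cache, activations, and runtime overhead come on top.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# Assumed precisions for illustration; 12 and 26 are the nominal parameter counts.
for bits in (16, 8, 4):
    small = weight_memory_gb(12, bits)
    large = weight_memory_gb(26, bits)
    print(f"{bits:>2}-bit: 12B = {small:.1f} GB, 26B = {large:.1f} GB, "
          f"ratio = {small / large:.2f}")
```

At equal precision the weights alone come to about 46% of the 26B model's, which is in line with the "less than half" figure, and only the quantized variants leave headroom on a 16GB machine.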

The key insight: removing the encoders is more than a size trick. It is an architecture bet that a strong language backbone can absorb perception directly, and that bet is what makes the model small enough to live on your laptop.

Quick Hits

150 million downloads. The Gemma 4 family has crossed 150 million downloads. Google points to community builds that range from wearable robotic arms for physical assistance to enterprise-grade AI security.
Drafters in the box. Gemma 4 12B comes with Multi-Token Prediction drafters built in to reduce latency, so the speedup is part of the release rather than something you wire up afterward.
Open and everywhere. Weights are on Hugging Face and Kaggle under Apache 2.0. You can run it through Ollama, LM Studio, llama.cpp, vLLM, and SGLang, or fine-tune it with Unsloth.
A skills library for agents. Google also published an official Skills Repository for Gemma, a library of agent skills meant to help coding agents build with the models. The companion Developer Guide has the architecture breakdown.
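
For readers who have not worked with drafters, here is a toy sketch of the generic draft-then-verify loop that makes them pay off. It is an assumption that Gemma's Multi-Token Prediction drafters are used this way; the source only says they reduce latency. The two "models" below are hypothetical stand-ins: a slow target and a cheap drafter that is usually, but not always, right.

```python
def target_next(ctx):
    """Toy 'large model': next token is the sum of the last two, mod 10."""
    return (ctx[-1] + ctx[-2]) % 10

def draft_next(ctx):
    """Toy drafter: copies the target's rule but always guesses 0 after a 7."""
    return 0 if ctx[-1] == 7 else (ctx[-1] + ctx[-2]) % 10

def greedy_decode(prompt, n_new):
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens cheaply, then keep the prefix the target agrees with."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The drafter proposes k tokens.
        ctx = list(out)
        draft = []
        for _ in range(k):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # 2. The target checks every drafted position. A real system scores
        #    them in one batched forward pass; this loop simulates that pass.
        target_passes += 1
        ctx = list(out)
        for token in draft:
            expected = target_next(ctx)
            if expected != token:
                ctx.append(expected)          # keep the target's correction, stop
                break
            ctx.append(token)
        else:
            ctx.append(target_next(ctx))      # all accepted: one bonus token
        out = ctx
    return out[:len(prompt) + n_new], target_passes

spec_tokens, passes = speculative_decode([1, 1], n_new=20)
print("target passes:", passes, "for 20 new tokens")
```

Because every accepted token is one the target would have produced anyway, the output matches plain greedy decoding; the saving is in how many times the expensive model has to run.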

The Take

Here is the part most coverage skipped. The story is not that another open model runs on a laptop. It is that two encoders, which most of us treat as non-negotiable plumbing, turned out to be removable without collapsing reasoning, at least on Google's reported benchmarks.
If that holds up under real workloads, the default multimodal stack gets simpler and leaner, and a lot of on-device agent ideas that were blocked on memory budget suddenly fit. My move this week: pull Gemma 4 12B in Ollama or LM Studio on a 16GB machine, feed it an image and an audio clip, and measure memory and tokens per second against the encoder-based pipeline you run today. If the encoder-less path wins on the same hardware, that is your signal to rethink the stack.
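
If you go the Ollama route, a minimal sketch of the tokens-per-second half of that measurement might look like this. It relies on the `eval_count` and `eval_duration` fields that Ollama's `/api/generate` endpoint returns for non-streaming requests. The model tag is hypothetical, so substitute whatever name your install lists, and track memory separately with your OS tools.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:12b"  # hypothetical tag; check the name your runtime lists

def tokens_per_second(response):
    """Decode speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

def generate(prompt, images_b64=None, model=MODEL_TAG):
    """Send one non-streaming request and return the parsed JSON response."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if images_b64:
        payload["images"] = images_b64  # list of base64-encoded images
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())

# Example (needs a running Ollama server with the model pulled):
# result = generate("Describe this image.", images_b64=[encoded_image])
# print(round(tokens_per_second(result), 1), "tokens/s")
```

Run the same prompts through your encoder-based pipeline and compare like for like: same inputs, same output length, and a warm-up request first so model load time does not skew the numbers.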
If someone on your team is sizing a multimodal pipeline, this is the comparison to send them.

The Open Question

If a strong language backbone can absorb vision with one matrix multiply and audio with no encoder at all, what is the encoder actually doing for us in the models that still ship one? Reply with the workload where you think a dedicated encoder still earns its memory.
Dropping the encoders is the kind of architecture choice that looks like a footnote until half the field copies it.
Next issue: I am running Gemma 4 12B's encoder-less vision path head to head against an encoder-based pipeline on the same laptop, with the memory and latency numbers laid out. One side wins by more than I expected.

ResearchAudio.io · research worth shipping with

Source: Introducing Gemma 4 12B, Google DeepMind (June 3, 2026)
