Gemma 4 12B Deletes the Encoder Stack

ResearchAudio.io

Frontier Models · June 5, 2026

Vision is one matrix multiply. Audio drops the encoder entirely. It still runs on a 16GB laptop.

12B parameters · 16GB to run locally · no multimodal encoders

Start with the part that sounds like a typo. To handle images, Gemma 4 12B does not use a vision encoder. It uses a single matrix multiplication.

Google released Gemma 4 12B this week, a mid-sized open model that sits between the tiny E4B and the 26B Mixture-of-Experts (MoE) model. Most coverage led with the laptop angle: it runs a multimodal agent locally on 16GB of memory. That part is true and useful. The part worth a closer look is how it gets there.

fig. 01 · Inside Gemma 4 12B: what the encoder-less build gives you (six panels)

01 / 06 · Encoder-less vision (no vision encoder): image pixels → 1 matmul → LLM backbone. Module: matmul + norms. Replaces: vision encoder. Adds: pos. embedding. Perceives: backbone.

02 / 06 · Encoder-less audio (encoder removed): raw audio waveform → project → LLM backbone. Encoder: removed. Signal: raw → tokens. Same dims: as text. First: mid-size Gemma.

03 / 06 · Near-26B reasoning (at less than half the memory): Gemma 4 12B nears the 26B MoE, its larger sibling. Benchmarks: near 26B. Memory: under half. Unlocks: multi-step. For: agentic work.

04 / 06 · 16GB laptop (on-device): Gemma 4 12B weights → local → 16GB laptop. Memory: 16GB. Type: system / unified. Deploy: local. Between: E4B and 26B.

05 / 06 · Drafters, built in (ships included): model backbone → multi-token → lower latency. Feature: Multi-Token. Role: drafter. Goal: cut latency. Wiring: none.

06 / 06 · Open and everywhere (Apache 2.0): open weights → serve → Ollama, vLLM, more. License: Apache 2.0. Weights: HF + Kaggle. Tune: Unsloth. Agents: Skills repo.

Source: Google DeepMind · Introducing Gemma 4 12B (June 3, 2026)

The standard recipe for multimodal models is to bolt encoders onto a language model. A vision encoder turns pixels into embeddings. An audio encoder turns waveforms into embeddings. Those embeddings get projected into the model, and each encoder adds its own parameters, latency, and memory.

Gemma 4 12B drops that pattern. For vision, Google replaced the encoder with a lightweight embedding module: one matrix multiplication, a positional embedding, and a few normalizations. The language model backbone takes over the visual processing itself. For audio, they went further and removed the encoder completely, projecting the raw audio signal into the same dimensional space as text tokens.
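To make that concrete, here is a rough sketch of what an encoder-less front end could look like. The release only lists the ingredients (one matrix multiplication, a positional embedding, a few normalizations, and a projection of raw audio into the text token space). Everything else below is my assumption for illustration: the patch size, the frame length, the order of operations, the choice of RMS normalization, the random weights, and every function name. It is not Google's implementation.

```python
import math
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches

def matmul(a, b):
    """Plain (n x k) @ (k x m) matrix product on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (no learned gain, for brevity)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def embed_image(image, patch, weight, pos_emb):
    """Pixels -> one matmul -> add positional embedding -> normalize."""
    projected = matmul(patchify(image, patch), weight)  # the single matmul
    return [rms_norm([p + e for p, e in zip(tok, pos)])
            for tok, pos in zip(projected, pos_emb)]

def embed_audio(waveform, frame, weight):
    """Raw samples -> fixed-size frames -> linear projection to the text width."""
    starts = range(0, len(waveform) - frame + 1, frame)
    return matmul([waveform[i:i + frame] for i in starts], weight)

# Toy sizes chosen for readability; these are not Gemma's dimensions.
rng = random.Random(0)
d_model, patch, frame = 8, 4, 16
image = [[rng.random() for _ in range(8)] for _ in range(8)]
w_img = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(patch * patch)]
pos = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(4)]
w_aud = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(frame)]
waveform = [math.sin(0.1 * t) for t in range(64)]

image_tokens = embed_image(image, patch, w_img, pos)
audio_tokens = embed_audio(waveform, frame, w_aud)
print(len(image_tokens), len(image_tokens[0]))  # number of image tokens, width
print(len(audio_tokens), len(audio_tokens[0]))  # number of audio tokens, width
```

The point of the sketch is the shape of the data: both paths end as vectors with the same width as a text token embedding, so the backbone can consume them in one sequence without a separate encoder in front.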

If you run a multimodal pipeline today with separate vision and audio encoders, this is a concrete memory and latency baseline to beat on the same hardware. Google reports benchmark performance nearing the 26B MoE at less than half the total memory footprint, small enough for consumer laptops with 16GB of system or unified memory. It also ships with Multi-Token Prediction drafters to cut latency, and it is the first mid-sized Gemma to take native audio input.
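A quick back-of-envelope check helps put the 16GB figure in context. The snippet below computes weight memory alone for a 12-billion-parameter model at a few common precisions. The release does not say which precision the 16GB claim assumes, so the bit widths here are my assumption, and the numbers ignore the KV cache, activations, and whatever else is running on the machine.

```python
def weight_memory_gb(params, bits_per_param):
    """Memory for the weights alone, in decimal GB (1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

PARAMS = 12e9      # "12B" from the release; the exact count is not in the source
BUDGET_GB = 16     # the laptop memory figure quoted in the release

for bits in (16, 8, 4):  # assumed precisions, not stated by Google
    gb = weight_memory_gb(PARAMS, bits)
    verdict = "fits" if gb < BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB -> {verdict} in {BUDGET_GB} GB (weights only)")
```

At 16 bits per parameter the weights alone come to 24 GB, so a 16GB machine implies some form of reduced-precision weights. Treat the output as a sizing floor, not a prediction of real memory use.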

The key insight: Removing the encoders is not just a size trick. It is an architecture bet that a strong language backbone can absorb perception directly, and that bet is exactly what makes the model small enough to live on your laptop.

Quick Hits

150 million downloads. The Gemma 4 family has crossed 150 million downloads. Google points to community builds that range from wearable robotic arms for physical assistance to enterprise-grade AI security.

Drafters in the box. Gemma 4 12B comes with Multi-Token Prediction drafters built in to reduce latency, so the speedup is part of the release rather than something you wire up afterward.

Open and everywhere. Weights are on Hugging Face and Kaggle under Apache 2.0. You can run it through Ollama, LM Studio, llama.cpp, vLLM, and SGLang, or fine-tune it with Unsloth.

A skills library for agents. Google also published an official Skills Repository for Gemma, a library of agent skills meant to help coding agents build with the models. The companion Developer Guide has the architecture breakdown.
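For readers who have not worked with drafters, here is a toy sketch of the general draft-and-verify idea behind the "Drafters in the box" item. The release does not describe how Gemma's Multi-Token Prediction drafters are wired, so this is the generic greedy speculative decoding pattern, not Gemma's mechanism. The two "models" are stand-in functions I made up, and the sketch leaves out the bonus token that real implementations usually add after a fully accepted draft.

```python
def greedy_decode(next_token, prompt, max_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(next_token(tokens))
    return tokens

def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy draft-and-verify: the drafter guesses k tokens, the target checks them."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap drafter proposes k tokens, one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target verifies them; a real system does this in one batched pass.
        target_passes += 1
        accepted = []
        for guess in draft:
            truth = target_next(tokens + accepted)
            accepted.append(truth)   # the target's token is always the one kept
            if truth != guess:       # the first mismatch ends the accepted run
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new], target_passes

# Hypothetical toy models: the target counts upward mod 10,
# and the drafter agrees except right after a 4.
def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

baseline = greedy_decode(target_next, [0], 12)
fast, passes = speculative_decode(target_next, draft_next, [0], 12)
print(fast == baseline, passes)
```

The output matches plain greedy decoding token for token, but the target is consulted in far fewer rounds than the 12 calls the baseline needs. That gap is where the latency saving comes from when the drafter is much cheaper than the target.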

The Take

Here is the part nobody is talking about. The story is not that another open model runs on a laptop. It is that two encoders, which most of us treat as non-negotiable plumbing, turned out to be removable without collapsing reasoning.

If that holds up under real workloads, the default multimodal stack gets simpler and leaner, and a lot of on-device agent ideas that were blocked on memory budget suddenly fit. My move this week: pull Gemma 4 12B in Ollama or LM Studio on a 16GB machine, feed it an image and an audio clip, and measure memory and tokens per second against the encoder-based pipeline you run today. If the encoder-less path wins on the same hardware, that is your signal to rethink the stack.
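If you go the Ollama route, a minimal way to get the tokens-per-second half of that comparison is to read the timing fields Ollama returns from its local `/api/generate` endpoint. This is a sketch under assumptions: it expects an Ollama server on the default local port, the model tag is whatever your install lists (I am not guessing the tag for Gemma 4 12B), and it covers text and image input only because I have not confirmed how audio is passed. Memory still has to come from your OS tools.

```python
import json
import urllib.request

def generate(model, prompt, images_b64=None, host="http://localhost:11434"):
    """Call a local Ollama server once (non-streaming) and return its JSON reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if images_b64:
        payload["images"] = images_b64  # list of base64-encoded image strings
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def decode_tokens_per_second(reply):
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)

def prefill_tokens_per_second(reply):
    """Prompt processing speed from prompt_eval_count and prompt_eval_duration."""
    return reply["prompt_eval_count"] / (reply["prompt_eval_duration"] / 1e9)

# Made-up reply with the same field names, so the math can be checked offline.
sample_reply = {
    "eval_count": 120, "eval_duration": 4_000_000_000,
    "prompt_eval_count": 300, "prompt_eval_duration": 1_500_000_000,
}
print(decode_tokens_per_second(sample_reply), prefill_tokens_per_second(sample_reply))

# Against a live server (model tag is a placeholder you must fill in):
# reply = generate("<your model tag>", "Describe this image.", images_b64=[...])
# print(decode_tokens_per_second(reply))
```

Run the same prompt and image through both pipelines several times and compare medians. A single run mostly measures model load time and cache state.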

If someone on your team is sizing a multimodal pipeline, this is the comparison to send them.

(The latency probe I am using to compare the two pipelines is in the paid archive.)

The Open Question

If a strong language backbone can absorb vision with one matrix multiply and audio with no encoder at all, what is the encoder actually doing for us in the models that still ship one? Reply with the workload where you think a dedicated encoder still earns its memory.

Dropping the encoders is the kind of architecture choice that looks like a footnote until half the field copies it.

Next issue: I am running Gemma 4 12B's encoder-less vision path head to head against an encoder-based pipeline on the same laptop, with the memory and latency numbers laid out. One side wins by more than I expected.

ResearchAudio.io · research worth shipping with

Source: Introducing Gemma 4 12B, Google DeepMind (June 3, 2026)
