|
ResearchAudio.io
Frontier Models · June 5, 2026
Gemma 4 12B Deletes the Encoder Stack
Vision is one matrix multiply. Audio drops the encoder entirely. It still runs on a 16GB laptop.
|
12B parameters · 16GB to run locally · 0 multimodal encoders
|
Start with the part that sounds like a typo. To handle images, Gemma 4 12B does not use a vision encoder. In its place is a lightweight embedding module built around a single matrix multiplication.
|
|
Google released Gemma 4 12B this week, a mid-sized open model that sits between the tiny E4B and the 26B Mixture-of-Experts. Most coverage led with the laptop angle: it runs a multimodal agent locally on 16GB of memory. That part is true and useful. The part worth a closer look is how it gets there.
|
|
fig. 01 · Inside Gemma 4 12B: what the encoder-less build gives you
01 / 06 · Encoder-less vision: image pixels → 1 matmul → LLM backbone. The module is a matmul plus norms, with a positional embedding added. It replaces the vision encoder, and the backbone does the perceiving.
02 / 06 · Encoder-less audio: raw audio waveform → projection → LLM backbone. The encoder is removed and the raw signal becomes tokens with the same dimensions as text. First mid-size Gemma to do this.
03 / 06 · Near-26B reasoning: benchmarks near the 26B MoE at less than half the memory, which unlocks multi-step agentic work.
04 / 06 · On-device: Gemma 4 12B weights run locally on a laptop with 16GB of system or unified memory. Sits between E4B and 26B.
05 / 06 · Drafters, built in: Multi-Token Prediction drafters ship included to cut latency, with no extra wiring.
06 / 06 · Open and everywhere: Apache 2.0, weights on HF + Kaggle, serve with Ollama, vLLM, and more, tune with Unsloth, agents via the Skills repo.
Source: Google DeepMind · Introducing Gemma 4 12B (June 3, 2026)
|
|
The standard recipe for multimodal models is to bolt encoders onto a language model. A vision encoder turns pixels into embeddings. An audio encoder turns waveforms into embeddings. Those embeddings get projected into the model, and each encoder adds its own parameters, latency, and memory.
|
|
Gemma 4 12B drops that pattern. For vision, Google replaced the encoder with a lightweight embedding module: one matrix multiplication, a positional embedding, and a few normalizations. The language model backbone takes over the visual processing itself. For audio, they went further and removed the encoder completely, projecting the raw audio signal into the same dimensional space as text tokens.
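To make the shape of that idea concrete, here is a minimal, dependency-free sketch of an encoder-less front end. It is not Google's implementation. The patch size, frame length, toy model width, RMS-style normalization, and every function name below are my assumptions for illustration; the source only says the vision module is one matrix multiplication, a positional embedding, and a few normalizations, and that raw audio is projected into the text-token space.

```python
import math
import random

def linear(x, w):
    """Multiply a vector x (length n) by a weight matrix w (n rows, d columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def rms_norm(v, eps=1e-6):
    """Rescale a vector to roughly unit root-mean-square (no learned gain here)."""
    rms = math.sqrt(sum(a * a for a in v) / len(v) + eps)
    return [a / rms for a in v]

def image_patches(image, p):
    """Flatten an H x W grayscale image into non-overlapping p x p patches."""
    h, w = len(image), len(image[0])
    return [
        [image[r + dr][c + dc] for dr in range(p) for dc in range(p)]
        for r in range(0, h - p + 1, p)
        for c in range(0, w - p + 1, p)
    ]

def audio_frames(wave, n):
    """Cut a raw waveform into non-overlapping frames of n samples (assumed framing)."""
    return [wave[i:i + n] for i in range(0, len(wave) - n + 1, n)]

def rand_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

D_MODEL = 8  # toy width; the real model width is not given in the source
rng = random.Random(0)

# Vision: patch -> one matmul -> add positional embedding -> normalize.
image = [[rng.random() for _ in range(8)] for _ in range(8)]  # 8x8 toy image
patches = image_patches(image, 4)                              # 4 patches, 16 pixels each
w_vision = rand_matrix(16, D_MODEL, rng)
pos = rand_matrix(len(patches), D_MODEL, rng)
vision_tokens = [
    rms_norm([a + b for a, b in zip(linear(patch, w_vision), pos[i])])
    for i, patch in enumerate(patches)
]

# Audio: raw samples -> projection into the same width as text tokens.
wave = [math.sin(0.1 * t) for t in range(64)]                  # toy waveform
frames = audio_frames(wave, 16)                                # 4 frames, 16 samples each
w_audio = rand_matrix(16, D_MODEL, rng)
audio_tokens = [linear(frame, w_audio) for frame in frames]

print(len(vision_tokens), len(vision_tokens[0]))
print(len(audio_tokens), len(audio_tokens[0]))
```

Both paths end in vectors of width `D_MODEL`, so the backbone can consume them alongside text embeddings. That is the whole point of the design: nothing deeper than a projection sits between the raw input and the language model.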
|
|
|
Google's new multimodal model replaces the vision encoder with a single matrix multiplication and removes the audio encoder completely. The language model handles perception itself.
|
|
|
If you run a multimodal pipeline today with separate vision and audio encoders, this is a concrete memory and latency baseline to beat on the same hardware. Google reports benchmark performance nearing the 26B MoE at less than half the total memory footprint, small enough for consumer laptops with 16GB of system or unified memory. It also ships with Multi-Token Prediction drafters to cut latency, and it is the first mid-sized Gemma to take native audio input.
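A rough back-of-envelope check helps put the 16GB figure in context. The sketch below counts weight storage only and assumes a few common bits-per-weight settings; the source does not say which precision the 16GB figure is based on. The arithmetic also ignores the KV cache, activations, and whatever else is running on the machine. The function name and the chosen bit widths are mine.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

BUDGET_GB = 16  # laptop memory figure from the article

for bits in (16, 8, 4):  # assumed precisions, not stated in the source
    dense = weight_gb(12, bits)   # Gemma 4 12B
    moe = weight_gb(26, bits)     # 26B MoE, counting all parameters
    print(
        f"{bits:>2}-bit  12B: {dense:5.1f} GB  26B: {moe:5.1f} GB  "
        f"ratio: {dense / moe:.2f}  fits in {BUDGET_GB} GB: {dense < BUDGET_GB}"
    )
```

Two things fall out of this. At the same precision, 12B parameters take about 46% of the space of 26B, which lines up with the "less than half" claim on parameter count alone. And at 16 bits the 12B weights come to about 24 GB, so the laptop figure presumably assumes lower-precision weights.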
|
|
The key insight: Removing the encoders is not just a size trick. It is an architecture bet that a strong language backbone can absorb perception directly, and that bet is part of what lets the model fit on a 16GB laptop.
|
|
|
|
|
150 million downloads. The Gemma 4 family has crossed 150 million downloads. Google points to community builds that range from wearable robotic arms for physical assistance to enterprise-grade AI security.
|
|
Drafters in the box. Gemma 4 12B comes with Multi-Token Prediction drafters built in to reduce latency, so the speedup is part of the release rather than something you wire up afterward.
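The source does not describe how the drafters are wired in, so treat the following as a generic illustration of why a drafter cuts latency, not as Gemma's mechanism. It is a toy greedy draft-and-verify loop: a cheap drafter proposes a few tokens, the main model checks them in one pass, and the longest agreeing prefix is kept. Both "models" here are made-up deterministic functions, and all names are hypothetical.

```python
def target_next(prefix):
    """Toy 'main model': a deterministic next token computed from the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_next(prefix):
    """Toy 'drafter': agrees with the target except when len(prefix) % 4 == 0."""
    tok = target_next(prefix)
    return tok if len(prefix) % 4 else (tok + 1) % 50

def speculative_decode(prompt, n_new, k=3):
    """Greedy draft-and-verify. Returns (tokens, number of main-model passes)."""
    seq, passes = list(prompt), 0
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The drafter cheaply proposes k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. One main-model pass checks them (looped here, batched in practice).
        passes += 1
        accepted = []
        for tok in draft:
            want = target_next(seq + accepted)
            if tok != want:
                accepted.append(want)  # first mismatch: keep the main model's token
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the same pass yields one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], passes

def plain_decode(prompt, n_new):
    """Baseline: one main-model pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

tokens, passes = speculative_decode([1, 2, 3], n_new=16)
print(tokens == plain_decode([1, 2, 3], 16), passes)
```

The output is identical to plain decoding because every kept token is one the main model would have produced anyway. The saving is in the number of main-model passes, and it grows with how often the drafter guesses right.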
|
|
Open and everywhere. Weights are on Hugging Face and Kaggle under Apache 2.0. You can run it through Ollama, LM Studio, llama.cpp, vLLM, and SGLang, or fine-tune it with Unsloth.
|
|
A skills library for agents. Google also published an official Skills Repository for Gemma, a library of agent skills meant to help coding agents build with the models. The companion Developer Guide has the architecture breakdown.
|
|
|
|
Here is the part nobody is talking about. The story is not that another open model runs on a laptop. It is that two encoders, which most of us treat as non-negotiable plumbing, turned out to be removable without collapsing reasoning.
|
|
If that holds up under real workloads, the default multimodal stack gets simpler and leaner, and a lot of on-device agent ideas that were blocked on memory budget suddenly fit. My move this week: pull Gemma 4 12B in Ollama or LM Studio on a 16GB machine, feed it an image and an audio clip, and measure memory and tokens per second against the encoder-based pipeline you run today. If the encoder-less path wins on the same hardware, that is your signal to rethink the stack.
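If you go the Ollama route, a small helper makes the tokens-per-second number easy to pull out. This is a sketch under assumptions: the model tag is hypothetical (check what your install actually lists), it covers the image path only, and it relies on Ollama's local `/api/generate` endpoint returning `eval_count` and `eval_duration` (in nanoseconds) on a non-streaming request.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:12b"  # hypothetical tag; use whatever `ollama list` shows

def build_request(prompt, image_bytes=None, model=MODEL_TAG):
    """Build a non-streaming generate payload, optionally with one image."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if image_bytes is not None:
        payload["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return payload

def generate(payload, url=OLLAMA_URL):
    """POST the payload to a running Ollama server and return the parsed JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def tokens_per_second(response):
    """Decode throughput: generated tokens divided by generation time in seconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Canned response so the arithmetic can be checked without a server running.
sample = {"eval_count": 256, "eval_duration": 8_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

Against a live server, you would read an image file as bytes, pass it to `build_request`, and feed the result of `generate` into `tokens_per_second`. For the memory side, `ollama ps` reports the size of the loaded model, which is a reasonable first number to compare against your encoder-based pipeline.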
|
|
If someone on your team is sizing a multimodal pipeline, this is the comparison to send them.
|
|
(The latency probe I am using to compare the two pipelines is in the paid archive.)
|
|
|
|
If a strong language backbone can absorb vision with one matrix multiply and audio with no encoder at all, what is the encoder actually doing for us in the models that still ship one? Reply with the workload where you think a dedicated encoder still earns its memory.
|
|
Dropping the encoders is the kind of architecture choice that looks like a footnote until half the field copies it.
|
|
Next issue: I am running Gemma 4 12B's encoder-less vision path head to head against an encoder-based pipeline on the same laptop, with the memory and latency numbers laid out. One side wins by more than I expected.
|
|
ResearchAudio.io · research worth shipping with
Source: Introducing Gemma 4 12B, Google DeepMind (June 3, 2026)
|