
4B Active Parameters Score 1441 Elo. Here's How.

ResearchAudio.io


128 tiny experts inside one model. Released under Apache 2.0.

89.2% AIME math (was 20.8%) · 1452 Elo (No. 3 among open models) · 3.8B active params (of 26B)
Google released a model family whose MoE variant activates 3.8 billion parameters per token, yet scores 1441 Elo on LMArena, nearly matching the dense 31B version at 1452. Three days ago, on April 2, Google DeepMind shipped Gemma 4, and the benchmark jumps from its predecessor are not incremental. They are generational.
Competition math (AIME) went from 20.8% to 89.2%. Coding (LiveCodeBench) went from 29.1% to 80.0%. Science reasoning (GPQA) jumped from 42.4% to 84.3%. And the whole family ships under Apache 2.0, which is a first for Google's open model line.
Here's the part nobody's talking about: the architecture that makes this possible.

The Nesting Doll Architecture

Gemma 4 ships in four sizes: E2B (2.3B effective), E4B (4.5B effective), a 26B Mixture-of-Experts with 128 small experts, and a 31B dense model. The "E" prefix stands for "effective," and it matters. These models carry more raw parameters than they activate.
The edge models (E2B and E4B) use Per-Layer Embeddings (PLE), a technique carried forward from Gemma 3n. In a standard transformer, the embedding table sits entirely in GPU VRAM, a static cost before you process a single token. PLE offloads embedding weights to CPU RAM and streams the specific vectors needed for each layer during inference. The result: a model with 5B+ total parameters operates within the memory budget of a 2B model.
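To make the PLE idea concrete, here is a toy sketch (not Gemma's actual implementation; class and parameter names are illustrative) of the streaming pattern: the per-layer embedding tables stay in host RAM, and only the rows for the current token ids are copied into a small device-side buffer.

```python
import numpy as np

class PerLayerEmbedding:
    """Toy sketch of PLE-style streaming: full per-layer embedding
    tables live in 'host' (CPU) memory; only the rows needed for the
    current token ids are copied to a small 'device' buffer."""

    def __init__(self, vocab_size, num_layers, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Stays in host RAM: one small embedding table per decoder layer.
        self.host_tables = rng.standard_normal((num_layers, vocab_size, dim))

    def fetch(self, layer, token_ids):
        # Simulated host-to-device transfer: copy only the requested rows,
        # never the whole table.
        return self.host_tables[layer][token_ids].copy()

ple = PerLayerEmbedding(vocab_size=32_000, num_layers=4, dim=8)
rows = ple.fetch(layer=2, token_ids=[5, 17, 901])
print(rows.shape)  # (3, 8): three vectors moved, not the full 32k-row table
```

The device-resident footprint scales with tokens in flight rather than vocabulary size, which is why a 5B-parameter model can fit a 2B-class memory budget.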
But the 26B MoE is where it gets interesting. Google chose 128 small experts (compared to Llama 4's 16 large experts), activating 8 plus 1 shared expert per token. A total of 3.8B parameters fire per forward pass. The LMArena score of 1441 with that setup is competitive with models 8x its active size.
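The routing pattern described above (top-8 of 128 experts, plus one always-on shared expert) can be sketched in a few lines. This is a minimal illustration, not Gemma's router; the expert matrices and softmax weighting are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, DIM = 128, 8, 16  # expert counts from the article

router_w = rng.standard_normal((DIM, NUM_EXPERTS))
experts = rng.standard_normal((NUM_EXPERTS, DIM, DIM)) / np.sqrt(DIM)
shared_expert = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)

def moe_forward(x):
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]            # pick the top-8 experts
    w = np.exp(logits[top] - logits[top].max())  # stable softmax over top-k
    w /= w.sum()
    out = sum(wi * (x @ experts[i]) for i, wi in zip(top, w))
    return out + x @ shared_expert               # always-on shared expert

x = rng.standard_normal(DIM)
y = moe_forward(x)
# Only 9 of 128 expert matrices are touched for this token.
```

Small experts mean the router can compose fine-grained specializations per token, which is the design bet behind matching much larger dense models.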

Gemma 4 model family (edge → on-device → workstation/cloud):

- E2B: <1.5 GB RAM, + audio
- E4B: ~3 GB RAM, + audio
- 26B MoE: 128 experts, 3.8B active
- 31B Dense: 256K context, 1452 Elo

Source: Google DeepMind, Gemma 4 Technical Report (April 2, 2026)

Why the MoE Design Matters

The 26B MoE variant is where the engineering gets clever. All 25.2B parameters must live in memory (you pay the full storage cost at load time), but only 3.8B parameters' worth of compute fires per token. On a T4 GPU or a MacBook Air, that's the difference between "can't run it" and "runs comfortably."
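The back-of-envelope math, assuming 4-bit quantized weights (0.5 bytes per parameter), shows why this fits on consumer hardware:

```python
total_params = 25.2e9   # every expert must be resident in memory
active_params = 3.8e9   # parameters that fire per token

bytes_per_param_4bit = 0.5
weights_gb = total_params * bytes_per_param_4bit / 1e9
print(f"resident weights at 4-bit: ~{weights_gb:.1f} GB")    # ~12.6 GB

# Per-token compute scales with ACTIVE parameters only:
flops_ratio = active_params / total_params
print(f"per-token compute vs. fully dense: ~{flops_ratio:.0%}")  # ~15%
```

Roughly 12.6 GB of resident weights leaves headroom for KV cache and activations on a 24 GB machine, while per-token compute is about 15% of an equally sized dense model.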
The edge models go further. PLE adds a secondary embedding signal into every decoder layer, so a model with 5.1B total parameters carries the representational depth of its full parameter count while fitting in under 1.5 GB of memory with quantization. Google built this in collaboration with Qualcomm Technologies and MediaTek, and these models run completely offline with near-zero latency on phones, Raspberry Pi, and NVIDIA Jetson Orin Nano.

Gemma 3 27B vs. Gemma 4 31B:

- AIME math: 20.8% → 89.2%
- Code: 29.1% → 80.0%
- Science: 42.4% → 84.3%

Source: Google DeepMind benchmarks. AIME 2026, LiveCodeBench v6, GPQA Diamond.

What Actually Changed

Gemma 4 is built from the same research and technology as Gemini 3, Google's proprietary frontier model. Three architectural choices stand out.
Thinking mode. In thinking mode, Gemma 4 works step by step, producing 4,000+ tokens of reasoning before committing to an answer. This is likely the primary driver of the math jump: chain-of-thought at inference time, baked into the model's training rather than bolted on after.
Context that works. Gemma 3's 128K context was mostly theoretical. Multi-needle retrieval sat at 13.5%. Gemma 4 pushes that to 66.4% on the 31B, with context windows up to 256K tokens on the larger variants. That's an entire codebase in a single prompt.
Native tool use. Every model in the family supports function calling and structured JSON output without special prompting. Combined with the thinking mode, this makes Gemma 4 a strong candidate for agentic workflows where the model needs to reason about which tools to call and in what sequence.
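What an agentic loop does with structured output can be sketched in a few lines. The JSON shape and tool names here are hypothetical illustrations, not Gemma's documented format:

```python
import json

# Hypothetical tool registry; the name and return value are illustrative.
TOOLS = {
    "get_weather": lambda city: f"22C and clear in {city}",
}

def dispatch(model_output: str):
    """Parse a structured tool call emitted by the model and invoke it."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# A model with native function calling would emit JSON like this:
result = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
print(result)  # 22C and clear in Oslo
```

Reliable structured output is what makes this loop safe to run unattended; without it, every dispatch needs a regex and a prayer.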

Key Insight: The Apache 2.0 license is the real story. Previous Gemma releases used a custom license with restrictions. Apache 2.0 means no monthly active user limits, no acceptable-use policy enforcement, and full permission for commercial and sovereign AI deployments. For teams that avoided Gemma because of licensing, that barrier is gone.

Google shipped a 26B model that activates 3.8B parameters per token, scores 1441 Elo, and released it under Apache 2.0. The open model race has a new front-runner.

What to Do With This

If you run any workload on open models, test the 26B MoE this week. On a MacBook with 24 GB unified memory, it runs at 4-bit quantization. Compare it against whatever you're currently running. The benchmark claims are strong, but your data is the one benchmark that matters.
For on-device work, the E4B variant is worth evaluating. It handles text, images, and audio, runs in about 3 GB of memory, and has day-one support across Hugging Face Transformers, llama.cpp, MLX, Ollama, vLLM, and a dozen other frameworks. If latency matters more than accuracy, the E2B model runs 3x faster than E4B.
The step-by-step deployment guide with quantization benchmarks across all four variants and inference cost-per-token templates is in the paid archive.

Quick Hits

Gemma has crossed 400 million downloads since the first generation launched, with over 100,000 community-created variants. The ecosystem is mature enough that day-one framework support covers essentially every major inference tool.
The E2B and E4B models are the foundation for Gemini Nano 4. Code written for Gemma 4 today will work on Gemini Nano 4-enabled Android devices shipping later this year. If you're building for Android, this is forward-compatible development.
Codeforces Elo jumped from 110 to 2150. That's a move from "barely functional" to "expert competitive programmer" in one model generation. The coding gap between Gemma and the competition did not close. It reversed.

The Take

I think this is Google's strongest open model release. Not because the benchmarks are good (they are), but because the licensing, framework support, and size range finally align. Previous Gemma models had restrictive licenses that made enterprise teams nervous. Apache 2.0 removes that friction entirely.
The real competition this targets is Meta's Llama 4 and Alibaba's Qwen 3.5. Both have strong open-source positions. Google's bet is that developer ergonomics (on-device support, framework coverage, Android integration) will matter more than raw parameter counts. This surprised me. Google chose to compete on ecosystem, not benchmarks alone.

The Open Question

Benchmarks improved dramatically, but the pretraining data cuts off in January 2025. For any domain that has changed since then, you still need retrieval augmentation or tool access.
So here's the question: when models get this capable but still have knowledge cutoffs, does the model itself matter less than the tooling around it? Reply to this email with your take.
The gap between open and closed models narrowed again this week. When a 26B MoE with 3.8B active parameters competes with models 8x its size, the economics of inference change for everyone building on top of these models.
Next week: Why speculative decoding with nested MatFormer sub-models could make on-device inference 2x faster, and how one team is already using it in production.
Know someone evaluating open models for production workloads? They'll want to see this.


Source: Google DeepMind: Gemma 4 Announcement

Source: Gemma Developer Guide

Source: Hugging Face: Gemma 4 Release
