Issue 47
ResearchAudio
AI Research, Explained Clearly
Mixture of Experts (MoE) in Large Language Models
A complete guide to MoE types and NVIDIA Nemotron 3 architecture
MoE has become the dominant architecture for frontier AI models. This issue covers the core concept, the main types of MoE implementations, and NVIDIA's new hybrid Nemotron 3 architecture, released in December 2025.
Part One
What is Mixture of Experts
Dense models activate all parameters for every token. MoE splits the network into specialized sub-networks called experts and uses a learned router to activate only the most relevant ones for each token. This lets total parameter count grow far faster than per-token compute.
Dense model: 100% of parameters active for every token.
Sparse MoE: only the top 2 experts active per token.
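To make that trade-off concrete, here is a back-of-the-envelope sketch in Python. The layer sizes below are made-up placeholders, not the configuration of any specific model.

```python
# Back-of-the-envelope sketch: sparse activation decouples stored parameters
# from per-token compute. All sizes are illustrative placeholders.

d_model = 4096         # hidden size
d_ff = 14336           # feed-forward width of one expert
num_experts = 64       # experts per MoE layer
top_k = 2              # experts activated per token

params_per_expert = 2 * d_model * d_ff               # two weight matrices per expert FFN
total_ffn_params = num_experts * params_per_expert   # what you store
active_ffn_params = top_k * params_per_expert        # what each token actually uses

print(f"stored FFN params per layer : {total_ffn_params / 1e9:.1f}B")
print(f"active FFN params per token : {active_ffn_params / 1e9:.2f}B "
      f"({100 * top_k / num_experts:.0f}% of the layer)")
```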
Part Two
Types of MoE Architectures
MoE implementations vary significantly. Here are the main architectural variants used in production:
1. Sparse MoE (Top-K Routing)
The most common type. Router scores all experts, selects top-K (usually 2), and combines their outputs. Only selected experts compute, reducing cost dramatically.
Used by: Mixtral, DeepSeek, Grok, GPT-4 (rumored)
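Here is a minimal PyTorch sketch of top-K routing, just to make the mechanics concrete. The class name SparseMoE, the layer sizes, and the choice to renormalize the selected scores with a softmax are illustrative assumptions; production implementations add capacity limits, load-balancing losses, and fused expert kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal top-K MoE layer: no capacity limits, aux losses, or fused kernels."""

    def __init__(self, d_model, d_ff, num_experts, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        logits = self.router(x)                    # (tokens, num_experts)
        scores, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(scores, dim=-1)        # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # python loops for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```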
2. Token Choice Routing
Each token independently selects which experts to use; this is the assignment scheme behind standard top-K routing. Simple and parallelizable, but it can cause load imbalance if many tokens pick the same expert.
Used by: Switch Transformer, most standard MoE models
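The usual remedy for the load-imbalance problem is an auxiliary balancing loss in the style of the Switch Transformer. The sketch below is a hedged illustration assuming top-1 routing, with load_balancing_loss as an illustrative function name and router_logits as the per-token router outputs.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, num_experts):
    """Switch-Transformer-style auxiliary loss (top-1 routing assumed).

    Encourages the fraction of tokens sent to each expert (f_i) and the mean
    router probability for each expert (P_i) to both stay near 1 / num_experts.
    """
    probs = F.softmax(router_logits, dim=-1)           # (tokens, experts)
    assignment = probs.argmax(dim=-1)                   # top-1 expert per token
    f = torch.bincount(assignment, minlength=num_experts).float()
    f = f / router_logits.shape[0]                      # fraction of tokens per expert
    p = probs.mean(dim=0)                               # mean router prob per expert
    return num_experts * torch.sum(f * p)               # minimized when both are uniform
```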
3. Expert Choice Routing
Flips the paradigm: experts select tokens instead of tokens selecting experts. Each expert picks its top-K tokens, ensuring perfect load balance and better utilization, though an individual token may end up being processed by a variable number of experts.
Used by: EC-MoE, some Google research models
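A hedged sketch of the expert-choice idea: score every token-expert pair, then let each expert (each column) take its top `capacity` tokens. The function name expert_choice_dispatch and the exact normalization axis are illustrative assumptions; the original expert-choice formulation also derives capacity from tokens x capacity_factor / num_experts.

```python
import torch.nn.functional as F

def expert_choice_dispatch(x, router_weight, capacity):
    """Each expert picks its own top-`capacity` tokens (illustrative sketch).

    x: (tokens, d_model); router_weight: (d_model, num_experts).
    Every expert receives exactly `capacity` tokens, so load is perfectly
    balanced; a given token may be chosen by zero, one, or several experts.
    """
    scores = x @ router_weight                        # (tokens, num_experts)
    affinity = F.softmax(scores, dim=-1)              # token-to-expert affinities
    gate, token_idx = affinity.topk(capacity, dim=0)  # per column: its top tokens
    # gate, token_idx: (capacity, num_experts); expert e processes x[token_idx[:, e]]
    return gate, token_idx
```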
4. Soft MoE
Fully differentiable with soft assignments using softmax scores. No token dropping, no expert imbalance. Each slot contains a weighted average of all tokens.
Used by: Vision-MoE, research models requiring smooth gradients
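A hedged sketch of the Soft MoE dispatch/combine math (after Puigcerver et al.): each slot input is a softmax-weighted average over all tokens, and each token's output is a softmax-weighted average over all slot outputs. The function name soft_moe, the argument names, and phi (the learnable slot-embedding matrix) are illustrative.

```python
import torch
import torch.nn.functional as F

def soft_moe(x, phi, experts, slots_per_expert):
    """Illustrative Soft MoE layer: no hard routing, no token dropping.

    x: (tokens, d); phi: (d, num_slots); experts: list of callables,
    with num_slots == len(experts) * slots_per_expert.
    """
    logits = x @ phi                             # (tokens, num_slots)
    dispatch = F.softmax(logits, dim=0)          # each slot: weights over all tokens
    combine = F.softmax(logits, dim=1)           # each token: weights over all slots
    slots = dispatch.T @ x                       # (num_slots, d) weighted token averages
    outs = []
    for i, expert in enumerate(experts):         # each expert processes its own slots
        s = slots[i * slots_per_expert:(i + 1) * slots_per_expert]
        outs.append(expert(s))
    slot_out = torch.cat(outs, dim=0)            # (num_slots, d)
    return combine @ slot_out                    # (tokens, d): fully differentiable mix
```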
5. Shared + Routed Experts
Combines always-on shared experts (for common cross-domain knowledge) with routed experts (for specialized tasks). The shared path stabilizes general capabilities while the routed path enables specialization.
Used by: DeepSeek V3 (1 shared + 256 routed, 8 active per token), LLaMA-4, Nemotron 3
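A minimal sketch of how the two paths combine: the shared expert runs on every token, the routed part is any sparse MoE layer, and the outputs are summed. SharedPlusRoutedMoE is an illustrative name, and the plain summation is an assumption about the general pattern rather than any one model's exact formulation.

```python
import torch.nn as nn

class SharedPlusRoutedMoE(nn.Module):
    """Illustrative shared + routed MoE block.

    The shared expert runs on every token; `routed_moe` can be any sparse
    MoE layer (e.g. the SparseMoE sketch above). Their outputs are summed.
    """

    def __init__(self, d_model, d_ff, routed_moe):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.routed = routed_moe                  # e.g. SparseMoE(...)

    def forward(self, x):
        return self.shared(x) + self.routed(x)    # always-on path + specialized path
```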
6. Hybrid MoE (Mamba-Transformer-MoE)
Combines MoE with other architectures. Mamba handles long-range dependencies efficiently, Transformer attention handles precise reasoning, MoE provides scalable compute.
Used by: NVIDIA Nemotron 3, Jamba (AI21 Labs)
Part Three
NVIDIA Nemotron 3: Hybrid MoE Architecture
Released in December 2025, Nemotron 3 introduces a hybrid architecture that combines three technologies, Mamba-2, Transformer attention, and MoE, in a single backbone built for agentic AI systems.
Nemotron 3 Hybrid Architecture
Mamba-2: 23 layers, long-range memory with minimal overhead
Attention: 6 GQA layers, precise reasoning and structure
MoE routing: 23 MoE layers, 128 routed + 1 shared expert
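To show how the three pieces can sit in one backbone, here is a purely illustrative Python sketch that spreads the 6 attention layers among the 23 Mamba-2 layers and pairs each Mamba block with an MoE feed-forward block (giving 23 MoE layers). The interleaving pattern, the build_schedule helper, and the "dense FFN after attention" choice are assumptions made for illustration; they are not Nemotron 3's published layout.

```python
# Illustrative layer schedule for a hybrid Mamba-Transformer-MoE backbone.
# Layer counts come from the figures above; the interleaving is assumed.

NUM_MAMBA = 23
NUM_ATTENTION = 6

def build_schedule():
    """Spread the attention layers roughly evenly among the Mamba layers and
    pair each Mamba block with an MoE feed-forward block (an assumption)."""
    total = NUM_MAMBA + NUM_ATTENTION            # 29 mixer blocks in this sketch
    attn_every = total // NUM_ATTENTION          # one attention block every ~4 layers
    schedule, attn_used = [], 0
    for i in range(total):
        if (i + 1) % attn_every == 0 and attn_used < NUM_ATTENTION:
            mixer, ffn = "attention(GQA)", "dense_ffn"
            attn_used += 1
        else:
            mixer, ffn = "mamba2", "moe(128 routed + 1 shared, top-6)"
        schedule.append((mixer, ffn))
    return schedule

if __name__ == "__main__":
    for i, (mixer, ffn) in enumerate(build_schedule()):
        print(f"layer {i:02d}: {mixer:16s} + {ffn}")
```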
Nemotron 3 Family
Nano: 30B total / 3B active, available now
Super: 100B total / 10B active, H1 2026
Ultra: 500B total / 50B active, H1 2026
Context: 1M tokens across all models
Latent MoE (Super and Ultra only)
A new approach where experts share a common core representation and keep only a small part private. Think of chefs sharing one big kitchen but each having their own spice rack. This allows 4x more experts with the same inference cost.
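As a hedged sketch of that idea: every expert reuses one large shared projection (the common core) and keeps only a small private low-rank adapter. The class name LatentExpertFFN, the adapter structure, and the sizes are assumptions for illustration only; the actual Latent MoE design in Nemotron 3 Super and Ultra is not specified in this issue.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentExpertFFN(nn.Module):
    """Hedged sketch of the 'shared kitchen, private spice rack' idea.

    All experts reuse one shared projection (the common core); each expert
    adds only a small private low-rank correction, so adding experts is cheap.
    """

    def __init__(self, d_model, d_ff, num_experts, rank=64):
        super().__init__()
        self.shared_in = nn.Linear(d_model, d_ff)      # shared by all experts
        self.shared_out = nn.Linear(d_ff, d_model)
        self.private_down = nn.Parameter(0.02 * torch.randn(num_experts, d_model, rank))
        self.private_up = nn.Parameter(torch.zeros(num_experts, rank, d_model))

    def forward(self, x, expert_id):                   # x: (tokens, d_model)
        core = self.shared_out(F.gelu(self.shared_in(x)))
        private = (x @ self.private_down[expert_id]) @ self.private_up[expert_id]
        return core + private                          # cheap per-expert variation
```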
4x throughput vs Nemotron 2
60% fewer reasoning tokens
6 of 128 routed experts active per token
Part Four
MoE Models Comparison
MODEL          TOTAL   ACTIVE   EXPERTS   TOP-K   TYPE
Nemotron 3     30B     3B       128+1     6       Hybrid
DeepSeek V3    671B    37B      256+1     8       Shared
Mixtral 8x7B   47B     13B      8         2       Sparse
Grok-1         314B    78B      8         2       Sparse
Qwen3 235B     235B    22B      128       8       Sparse
Key Takeaways
1. MoE types include Sparse, Token Choice, Expert Choice, Soft MoE, Shared+Routed, and Hybrid.
2. NVIDIA Nemotron 3 combines Mamba + Transformer + MoE for 4x throughput gains.
3. Latent MoE lets experts share a common core, enabling 4x more experts at same cost.
4. Most frontier models now use MoE: DeepSeek, Mixtral, Grok, Qwen3, Nemotron.
If this was useful, consider sharing it with a colleague.
Until next time, ResearchAudio