Issue 47
ResearchAudio
AI Research, Explained Clearly
Mixture of Experts (MoE) in Large Language Models
A complete guide to MoE types and NVIDIA Nemotron 3 architecture
MoE has become the dominant architecture for frontier AI models. This issue covers the core concept, the main types of MoE implementations, and NVIDIA's new hybrid Nemotron 3 architecture, released in December 2025.
Part One
What is Mixture of Experts
Dense models activate all parameters for every token. MoE splits the network into specialized sub-networks called experts and uses a learned router to activate only the most relevant ones for each token. This lets total parameter count grow far faster than per-token compute.
Dense model: 100% of parameters active for every token.
Sparse MoE: only the top 2 experts active per token.
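To make that trade-off concrete, here is a back-of-the-envelope sketch in Python. The layer sizes below are made-up placeholders, not the configuration of any specific model.

```python
# Back-of-the-envelope sketch: sparse activation decouples stored parameters
# from per-token compute. All sizes are illustrative placeholders.

d_model = 4096         # hidden size
d_ff = 14336           # feed-forward width of one expert
num_experts = 64       # experts per MoE layer
top_k = 2              # experts activated per token

params_per_expert = 2 * d_model * d_ff               # two weight matrices per expert FFN
total_ffn_params = num_experts * params_per_expert   # what you store
active_ffn_params = top_k * params_per_expert        # what each token actually uses

print(f"stored FFN params per layer : {total_ffn_params / 1e9:.1f}B")
print(f"active FFN params per token : {active_ffn_params / 1e9:.2f}B "
      f"({100 * top_k / num_experts:.0f}% of the layer)")
```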
Part Two
Types of MoE Architectures
MoE implementations vary significantly. Here are the main architectural variants used in production:
1. Sparse MoE (Top-K Routing)
The most common type. Router scores all experts, selects top-K (usually 2), and combines their outputs. Only selected experts compute, reducing cost dramatically.
Used by: Mixtral, DeepSeek, Grok, GPT-4 (rumored)
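Here is a minimal PyTorch sketch of top-K routing, just to make the mechanics concrete. The class name SparseMoE, the layer sizes, and the choice to renormalize the selected scores with a softmax are illustrative assumptions; production implementations add capacity limits, load-balancing losses, and fused expert kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal top-K MoE layer: no capacity limits, aux losses, or fused kernels."""

    def __init__(self, d_model, d_ff, num_experts, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        logits = self.router(x)                    # (tokens, num_experts)
        scores, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(scores, dim=-1)        # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # python loops for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```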
2. Token Choice Routing
Each token independently selects which experts to use; this is the assignment scheme behind standard top-K routing. Simple and parallelizable, but it can cause load imbalance if many tokens pick the same expert.
Used by: Switch Transformer, most standard MoE models
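The usual remedy for the load-imbalance problem is an auxiliary balancing loss in the style of the Switch Transformer. The sketch below is a hedged illustration assuming top-1 routing, with load_balancing_loss as an illustrative function name and router_logits as the per-token router outputs.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, num_experts):
    """Switch-Transformer-style auxiliary loss (top-1 routing assumed).

    Encourages the fraction of tokens sent to each expert (f_i) and the mean
    router probability for each expert (P_i) to both stay near 1 / num_experts.
    """
    probs = F.softmax(router_logits, dim=-1)           # (tokens, experts)
    assignment = probs.argmax(dim=-1)                   # top-1 expert per token
    f = torch.bincount(assignment, minlength=num_experts).float()
    f = f / router_logits.shape[0]                      # fraction of tokens per expert
    p = probs.mean(dim=0)                               # mean router prob per expert
    return num_experts * torch.sum(f * p)               # minimized when both are uniform
```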
3. Expert Choice Routing
Flips the paradigm: experts select tokens instead of tokens selecting experts. Each expert picks its top-K tokens, ensuring perfect load balance and better utilization, though an individual token may end up being processed by a variable number of experts.
Used by: EC-MoE, some Google research models
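A hedged sketch of the expert-choice idea: score every token-expert pair, then let each expert (each column) take its top `capacity` tokens. The function name expert_choice_dispatch and the exact normalization axis are illustrative assumptions; the original expert-choice formulation also derives capacity from tokens x capacity_factor / num_experts.

```python
import torch.nn.functional as F

def expert_choice_dispatch(x, router_weight, capacity):
    """Each expert picks its own top-`capacity` tokens (illustrative sketch).

    x: (tokens, d_model); router_weight: (d_model, num_experts).
    Every expert receives exactly `capacity` tokens, so load is perfectly
    balanced; a given token may be chosen by zero, one, or several experts.
    """
    scores = x @ router_weight                        # (tokens, num_experts)
    affinity = F.softmax(scores, dim=-1)              # token-to-expert affinities
    gate, token_idx = affinity.topk(capacity, dim=0)  # per column: its top tokens
    # gate, token_idx: (capacity, num_experts); expert e processes x[token_idx[:, e]]
    return gate, token_idx
```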
4. Soft MoE
Fully differentiable with soft assignments using softmax scores. No token dropping, no expert imbalance. Each slot contains a weighted average of all tokens.
Used by: Vision-MoE, research models requiring smooth gradients
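A hedged sketch of the Soft MoE dispatch/combine math (after Puigcerver et al.): each slot input is a softmax-weighted average over all tokens, and each token's output is a softmax-weighted average over all slot outputs. The function name soft_moe, the argument names, and phi (the learnable slot-embedding matrix) are illustrative.

```python
import torch
import torch.nn.functional as F

def soft_moe(x, phi, experts, slots_per_expert):
    """Illustrative Soft MoE layer: no hard routing, no token dropping.

    x: (tokens, d); phi: (d, num_slots); experts: list of callables,
    with num_slots == len(experts) * slots_per_expert.
    """
    logits = x @ phi                             # (tokens, num_slots)
    dispatch = F.softmax(logits, dim=0)          # each slot: weights over all tokens
    combine = F.softmax(logits, dim=1)           # each token: weights over all slots
    slots = dispatch.T @ x                       # (num_slots, d) weighted token averages
    outs = []
    for i, expert in enumerate(experts):         # each expert processes its own slots
        s = slots[i * slots_per_expert:(i + 1) * slots_per_expert]
        outs.append(expert(s))
    slot_out = torch.cat(outs, dim=0)            # (num_slots, d)
    return combine @ slot_out                    # (tokens, d): fully differentiable mix
```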
5. Shared + Routed Experts
Combines always-on shared experts (for common cross-domain knowledge) with routed experts (for specialized tasks). The shared path stabilizes general capabilities while the routed path enables specialization.
Used by: DeepSeek V3 (1 shared + 256 routed, 8 active per token), LLaMA-4, Nemotron 3
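A minimal sketch of how the two paths combine: the shared expert runs on every token, the routed part is any sparse MoE layer, and the outputs are summed. SharedPlusRoutedMoE is an illustrative name, and the plain summation is an assumption about the general pattern rather than any one model's exact formulation.

```python
import torch.nn as nn

class SharedPlusRoutedMoE(nn.Module):
    """Illustrative shared + routed MoE block.

    The shared expert runs on every token; `routed_moe` can be any sparse
    MoE layer (e.g. the SparseMoE sketch above). Their outputs are summed.
    """

    def __init__(self, d_model, d_ff, routed_moe):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.routed = routed_moe                  # e.g. SparseMoE(...)

    def forward(self, x):
        return self.shared(x) + self.routed(x)    # always-on path + specialized path
```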
6. Hybrid MoE (Mamba-Transformer-MoE)
Combines MoE with other architectures. Mamba handles long-range dependencies efficiently, Transformer attention handles precise reasoning, MoE provides scalable compute.
Used by: NVIDIA Nemotron 3, Jamba (AI21 Labs)
Part Three
NVIDIA Nemotron 3: Hybrid MoE Architecture
Released in December 2025, Nemotron 3 introduces a hybrid architecture that combines three technologies, Mamba-2, Transformer attention, and MoE, in a single backbone built for agentic AI systems.
Nemotron 3 Hybrid Architecture
Mamba-2: 23 layers, long-range memory with minimal overhead
Attention: 6 GQA layers, precise reasoning and structure
MoE routing: 23 MoE layers, 128 routed + 1 shared expert
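To show how the three pieces can sit in one backbone, here is a purely illustrative Python sketch that spreads the 6 attention layers among the 23 Mamba-2 layers and pairs each Mamba block with an MoE feed-forward block (giving 23 MoE layers). The interleaving pattern, the build_schedule helper, and the "dense FFN after attention" choice are assumptions made for illustration; they are not Nemotron 3's published layout.

```python
# Illustrative layer schedule for a hybrid Mamba-Transformer-MoE backbone.
# Layer counts come from the figures above; the interleaving is assumed.

NUM_MAMBA = 23
NUM_ATTENTION = 6

def build_schedule():
    """Spread the attention layers roughly evenly among the Mamba layers and
    pair each Mamba block with an MoE feed-forward block (an assumption)."""
    total = NUM_MAMBA + NUM_ATTENTION            # 29 mixer blocks in this sketch
    attn_every = total // NUM_ATTENTION          # one attention block every ~4 layers
    schedule, attn_used = [], 0
    for i in range(total):
        if (i + 1) % attn_every == 0 and attn_used < NUM_ATTENTION:
            mixer, ffn = "attention(GQA)", "dense_ffn"
            attn_used += 1
        else:
            mixer, ffn = "mamba2", "moe(128 routed + 1 shared, top-6)"
        schedule.append((mixer, ffn))
    return schedule

if __name__ == "__main__":
    for i, (mixer, ffn) in enumerate(build_schedule()):
        print(f"layer {i:02d}: {mixer:16s} + {ffn}")
```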
Nemotron 3 Family
Nano: 30B total / 3B active, available now
Super: 100B total / 10B active, H1 2026
Ultra: 500B total / 50B active, H1 2026
Context: 1M tokens across all models
Latent MoE (Super and Ultra only)
A new approach where experts share a common core representation and keep only a small part private. Think of chefs sharing one big kitchen but each having their own spice rack. This allows 4x more experts with the same inference cost.
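As a hedged sketch of that idea: every expert reuses one large shared projection (the common core) and keeps only a small private low-rank adapter. The class name LatentExpertFFN, the adapter structure, and the sizes are assumptions for illustration only; the actual Latent MoE design in Nemotron 3 Super and Ultra is not specified in this issue.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentExpertFFN(nn.Module):
    """Hedged sketch of the 'shared kitchen, private spice rack' idea.

    All experts reuse one shared projection (the common core); each expert
    adds only a small private low-rank correction, so adding experts is cheap.
    """

    def __init__(self, d_model, d_ff, num_experts, rank=64):
        super().__init__()
        self.shared_in = nn.Linear(d_model, d_ff)      # shared by all experts
        self.shared_out = nn.Linear(d_ff, d_model)
        self.private_down = nn.Parameter(0.02 * torch.randn(num_experts, d_model, rank))
        self.private_up = nn.Parameter(torch.zeros(num_experts, rank, d_model))

    def forward(self, x, expert_id):                   # x: (tokens, d_model)
        core = self.shared_out(F.gelu(self.shared_in(x)))
        private = (x @ self.private_down[expert_id]) @ self.private_up[expert_id]
        return core + private                          # cheap per-expert variation
```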
4x throughput vs Nemotron 2
60% fewer reasoning tokens
6 of 128 routed experts active per token
Part Four
MoE Models Comparison
MODEL          TOTAL   ACTIVE   EXPERTS   TOP-K   TYPE
Nemotron 3     30B     3B       128+1     6       Hybrid
DeepSeek V3    671B    37B      256+1     8       Shared
Mixtral 8x7B   47B     13B      8         2       Sparse
Grok-1         314B    78B      8         2       Sparse
Qwen3 235B     235B    22B      128       8       Sparse
Key Takeaways
1. MoE types include Sparse, Token Choice, Expert Choice, Soft MoE, Shared+Routed, and Hybrid.
2. NVIDIA Nemotron 3 combines Mamba + Transformer + MoE for 4x throughput gains.
3. Latent MoE lets experts share a common core, enabling 4x more experts at same cost.
4. Most frontier models now use MoE: DeepSeek, Mixtral, Grok, Qwen3, Nemotron.
If this was useful, consider sharing it with a colleague.
Until next time, ResearchAudio