Sponsored by

Build a LinkedIn Growth Routine That Actually Compounds

Same post. Same person. Completely different results.

The difference? A growth routine.

Taplio is the all-in-one LinkedIn tool that helps you build a repeatable system: find proven content ideas in your niche, write posts faster with AI that matches your voice, engage with the right people using Smart Reply, and track what's working so you can do more of it.

Creators like Amanda Goetz used Taplio to grow 30,000+ followers. Teams like lemlist used it to generate over $3M in pipeline from LinkedIn.

Try Taplio free for 7 days and get your first month for $1 with code BEEHIIV1X1.

ResearchAudio.io

120B Parameters. 12B Active. NVIDIA's Efficiency Math.

LatentMoE routes each token to 22 of 512 experts at a fraction of the usual compute. The architecture breakdown.

NVIDIA just released Nemotron 3 Super, a 120 billion parameter model that activates 12 billion parameters per token. On the RULER long-context benchmark at 1 million tokens, it scores 91.75%. The model it was benchmarked against, GPT-OSS-120B, scores 22.3% at the same context length. That is not a typo.

The model combines three architectural ideas that rarely appear together: a hybrid Mamba-Transformer backbone, a new expert routing system called LatentMoE, and native 4-bit precision training from the first gradient update. The result is a model optimized for multi-agent workloads that delivers 2.2x higher throughput than GPT-OSS-120B while matching or exceeding it on most accuracy benchmarks.

The Problem: Agents Are Expensive

Multi-agent AI systems generate up to 15x the tokens of a standard chat interaction. Every turn re-sends conversation history, tool outputs, and reasoning traces. Over long tasks, this creates two compounding problems. First, a "context explosion" where agents gradually lose alignment with the original objective as context grows. Second, a "thinking tax" where using large reasoning models for every sub-task makes the system too slow and costly for production use.

Nemotron 3 Super is NVIDIA's attempt to solve both problems simultaneously. It follows the release of Nemotron 3 Nano (a 30B/3B-active model) in December 2025, and slots into a tiered deployment strategy: Nano handles high-frequency simple tasks, Super handles complex reasoning and coordination, and a forthcoming Ultra model will target the hardest problems.

How It Works: Three Architectural Bets

1. Hybrid Mamba-Transformer Backbone. The model interleaves three layer types in a repeating pattern. Mamba-2 layers (state space models) handle the majority of sequence processing with linear-time complexity, which is what makes the 1 million token context window practical rather than theoretical. Transformer attention layers are placed at key depths for precise associative recall, the kind of task where you need to retrieve one specific fact buried deep in context. MoE layers scale the effective parameter count without the cost of dense computation. Think of it as a division of labor: Mamba layers handle the broad sweep, attention layers handle the precision, and MoE layers handle the specialization.
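As a rough sketch of that division of labor (the repeating pattern below is illustrative, not NVIDIA's published layer order), the interleaving might look like:

```python
# Illustrative hybrid-backbone layout. The pattern is a made-up stand-in;
# the point is the ratio: Mamba-2 layers do the bulk of linear-time
# sequence processing, with sparse attention and MoE layers interleaved.
PATTERN = ["mamba2", "mamba2", "mamba2", "moe",
           "mamba2", "mamba2", "attention", "moe"]

def build_backbone(repeats: int) -> list[str]:
    """Stack the repeating pattern into a full layer list."""
    return PATTERN * repeats

stack = build_backbone(6)
counts = {kind: stack.count(kind) for kind in ("mamba2", "attention", "moe")}
# Mamba layers dominate; attention appears only at periodic depths.
```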

2. LatentMoE: Compression Before Routing. Standard MoE architectures route tokens directly from the model's full hidden dimension (4096) to the experts. LatentMoE adds a projection step: tokens are compressed into a 1024-dimensional latent space before routing decisions are made. Expert computation happens in this smaller dimension, and results are projected back afterward. This 4x compression means the model can afford 512 total experts with 22 active per token, at the same computational cost as a standard MoE running far fewer experts. In practice, this enables fine-grained specialization. Distinct experts can activate for Python syntax versus SQL logic versus conversational reasoning, all within a single multi-turn agent interaction.
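The flow is easier to see in code. Here is a scaled-down sketch of the compress-route-expand path: real dimensions are 4096 → 1024 with 512 experts and 22 active, but everything below is shrunk so it runs instantly, and all weights are random. This shows the flow, not NVIDIA's implementation.

```python
import numpy as np

# Stand-ins for the real sizes: 4096 -> 1024 (4x compression), 512 experts,
# 22 active. Shrunk here so the sketch runs instantly on random weights.
D_MODEL, D_LATENT = 64, 16
N_EXPERTS, TOP_K = 32, 4

rng = np.random.default_rng(0)
W_down = rng.standard_normal((D_MODEL, D_LATENT)) / np.sqrt(D_MODEL)
W_up = rng.standard_normal((D_LATENT, D_MODEL)) / np.sqrt(D_LATENT)
W_router = rng.standard_normal((D_LATENT, N_EXPERTS))
experts = rng.standard_normal((N_EXPERTS, D_LATENT, D_LATENT)) * 0.1

def latent_moe(x: np.ndarray) -> np.ndarray:
    z = x @ W_down                        # compress BEFORE routing
    logits = z @ W_router                 # routing decided in latent space
    top = np.argsort(logits)[-TOP_K:]     # select the top-k experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over selected experts only
    out = sum(g * (z @ experts[e]) for g, e in zip(gates, top))
    return out @ W_up                     # expand back to model dimension

y = latent_moe(rng.standard_normal(D_MODEL))
```

Because both routing and expert matmuls happen at the latent width, expert cost scales with 1024, not 4096, which is what makes 512 experts affordable.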

3. Multi-Token Prediction (MTP). Instead of predicting one token at a time, specialized prediction heads forecast several future tokens simultaneously from each position. NVIDIA uses a shared-weight design across all MTP heads, which keeps parameter overhead minimal while improving stability. On SPEED-Bench, the model achieves an average acceptance length of 3.45 tokens per verification step (compared to 2.70 for DeepSeek-R1). This enables up to 3x wall-clock speedups for structured generation tasks like code and tool calls, without requiring a separate draft model. The training benefit matters just as much: predicting multiple future tokens forces the model to internalize longer-range logical dependencies rather than just guessing plausible next words.
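A toy accept-until-mismatch loop shows why acceptance length matters. Both "models" below are simulated with a deterministic rule; in the real system the MTP heads act as the drafter and the full model verifies.

```python
# Toy speculative-decoding loop in the spirit of MTP. The deterministic
# "next token" rule stands in for both the drafter and the verifier.
def mtp_draft(context: list[int], k: int = 4) -> list[int]:
    """Simulated MTP heads: propose k future tokens in one step."""
    return [(context[-1] + i + 1) % 10 for i in range(k)]

def base_next(context: list[int]) -> int:
    """Simulated base model: the verified next token."""
    return (context[-1] + 1) % 10

def speculative_step(context: list[int], k: int = 4) -> list[int]:
    accepted = []
    for tok in mtp_draft(context, k):
        if tok != base_next(context + accepted):
            break                             # first mismatch: stop accepting
        accepted.append(tok)
    if not accepted:
        accepted.append(base_next(context))   # always emit one verified token
    return accepted

# Here the drafter is perfect, so every step emits k tokens at once.
out = speculative_step([3])
```

The reported 3.45 acceptance length means that, on average, each verification step commits nearly three and a half tokens instead of one.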

Training: 25 Trillion Tokens in 4-Bit

Most quantized models start as full-precision and get compressed after training, which inevitably introduces accuracy loss. Nemotron 3 Super takes a different approach: it trains natively in NVFP4, NVIDIA's 4-bit floating-point format optimized for Blackwell GPUs, from the very first gradient update. The pretraining corpus spans 10 trillion unique curated tokens, with the model seeing 25 trillion total tokens across the run.
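For intuition about what "4-bit" means here: an E2M1-style grid (the general shape NVFP4 builds on) has only 15 distinct values. The sketch below is "fake quantization" onto that grid; NVFP4's per-block scaling machinery is omitted, so treat this as intuition, not the format spec.

```python
# Snap values onto a 4-bit float grid. Magnitudes are the standard E2M1
# set; NVFP4's block-scaling details are deliberately left out.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({-m for m in E2M1_MAGNITUDES} | set(E2M1_MAGNITUDES))

def fake_quantize(x: float, scale: float = 1.0) -> float:
    """Snap x/scale to the nearest representable 4-bit value, then rescale."""
    t = x / scale
    return scale * min(FP4_VALUES, key=lambda v: abs(v - t))

# With so few representable values, a natively trained model must learn
# to be accurate under coarse rounding from the first gradient update.
q = fake_quantize(2.8)   # snaps to 3.0
```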

Post-training follows a two-stage process. First, supervised fine-tuning on approximately 7 million samples (drawn from a broader 40 million sample corpus) covering reasoning, instruction following, coding, safety, and multi-step agent tasks. Then, multi-environment reinforcement learning using asynchronous GRPO across 21 environment configurations, generating over 1.2 million rollouts. These environments test the model's ability to perform action sequences (correct tool calls, functional code, multi-part plans with verifiable criteria), not just its ability to produce satisfying single-turn responses.
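A toy "verifiable criteria" check captures the spirit of those RL environments: a rollout earns reward only if its action sequence satisfies programmatic checks. The action schema below is invented for illustration.

```python
# Invented action schema: a rollout is a list of {"type": ...} dicts.
# Reward is binary and computed by code, not by a judge model.
def verify_rollout(actions: list[dict]) -> float:
    made_valid_tool_call = any(
        a["type"] == "tool_call" and a.get("ok", False) for a in actions
    )
    ended_with_answer = bool(actions) and actions[-1]["type"] == "answer"
    return 1.0 if (made_valid_tool_call and ended_with_answer) else 0.0

good = verify_rollout([{"type": "tool_call", "ok": True}, {"type": "answer"}])
bad = verify_rollout([{"type": "answer"}])   # no tool use: fails the check
```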

2.2x   — throughput vs GPT-OSS-120B
60.5%  — SWE-Bench Verified
91.75% — RULER @ 1M tokens
85.6%  — PinchBench (best open)

Key Insights

LatentMoE changes the expert scaling equation. By compressing tokens to a quarter of their dimension before routing, you can run 4x as many experts for the same FLOP budget. This is the core mechanism that lets a 120B model serve at 12B-active speeds while retaining broad specialization. If you're building MoE architectures, the latent projection trick is worth studying independent of NVIDIA's specific implementation.
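The budget arithmetic is worth making explicit. One assumption here is mine, not the report's: per-expert FLOPs scale linearly with the routed dimension (i.e. expert inner width held fixed), which is where the 4x figure comes from.

```python
# Back-of-envelope for the expert-scaling claim, assuming per-expert
# FLOPs scale linearly with the routed dimension.
d_model, d_latent = 4096, 1024
active_experts = 22

cost_latent = active_experts * d_latent          # what Nemotron spends
full_dim_equivalent = cost_latent / d_model      # same budget at full width
# -> the same budget buys only ~5.5 experts routed at 4096 dims, versus
#    22 in the 1024-dim latent space: 4x compression becomes 4x experts.
```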

Native low-precision pretraining is becoming viable. Training in NVFP4 from the start (rather than quantizing afterward) means the model learns to be accurate within 4-bit constraints throughout its entire training run. NVIDIA reports no meaningful accuracy loss compared to higher-precision baselines. This approach is hardware-specific (optimized for Blackwell), but the principle of training at deployment precision is one to watch across the industry.

The "Super + Nano" pattern is the real product. NVIDIA is not positioning Nemotron 3 Super as a standalone frontier model. It is designed to sit in a tiered system: Nano (3B active) handles high-frequency simple tasks, Super (12B active) handles complex reasoning and planning, and Ultra (forthcoming, approximately 50B active) handles the hardest problems. If you are building multi-agent systems, this tiered routing approach (where model selection is part of the system architecture) is likely more cost-effective than running one large model for every task.
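A hypothetical tier router makes the pattern concrete. The signals and thresholds below are invented; the article only specifies the division of labor between tiers.

```python
# Invented routing heuristic for a Nano/Super/Ultra deployment: send each
# task to the cheapest tier that can plausibly handle it.
def pick_tier(task: dict) -> str:
    if task.get("frontier_difficulty", False):
        return "ultra"   # forthcoming, ~50B active: hardest problems
    if task.get("multi_step_plan", False) or task.get("context_tokens", 0) > 100_000:
        return "super"   # 12B active: complex reasoning and coordination
    return "nano"        # 3B active: high-frequency simple tasks

tier = pick_tier({"multi_step_plan": True, "context_tokens": 250_000})
```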

"Fully open" is doing heavy lifting here. Weights, datasets, training recipes, evaluation pipelines, RL environments, and deployment cookbooks are all published. The RL environments and datasets are released as part of NeMo Gym. For teams building custom agent systems, the open training recipe (including the async GRPO setup across 21 environments) may be more valuable than the model weights themselves.

The most interesting question Nemotron 3 Super raises is not about any single benchmark, but about architecture composition. Mamba for linear-time sequence processing, attention for precision recall, LatentMoE for low-cost expert scaling, MTP for speculative decoding, and NVFP4 for memory efficiency. Each of these ideas existed independently. NVIDIA's contribution is showing they compose into a single coherent training recipe that holds together at 120B parameters and 25 trillion tokens.


Sources: NVIDIA Technical Blog · Nemotron 3 Super Technical Report · Hugging Face Model Card
