|
ResearchAudio.io
How a 30-Person Team Trained a 400B Model in 33 Days
Arcee AI released Trinity Large this week: a sparse MoE with 400 billion parameters, trained on 17 trillion tokens. Here's the complete technical breakdown.
January 29, 2026
|
Summary
Arcee AI, a 30-person startup, released a 400B parameter sparse MoE trained entirely in the US. Cost was approximately $20M. Three checkpoints released: chat-ready preview, full base model, and pure pretraining checkpoint with no alignment. All Apache 2.0 licensed.
|
400B total params | 13B active/token | 17T tokens | 33 days
|
Architecture Overview
|
Model Specifications
| Parameter | Value |
| Total parameters | ~398B |
| Active per token | ~13B |
| Experts | 256 total, 4 active |
| Routing sparsity | 1.56% |
| Dense layers | 6 |
| Context length | 512K (native) |
| Training tokens | 17 trillion |
| License | Apache 2.0 |
|
Architecture Details
Sparse routing. 4-of-256 experts active per token (1.56%). Sparser than DeepSeek-V3 (3.13%) and Qwen3 (6.25%), though less sparse than Llama 4 Maverick (0.78%).
SMEBU load balancing. A new technique, Soft-clamped Momentum Expert Bias Updates: adjusts router biases with momentum and tanh soft-clamping to prevent expert collapse (see the sketch after this list).
Attention. Interleaved local/global attention with Grouped Query Attention (GQA) and gating.
Normalization. Depth-scaled sandwich norm + z-loss regularization on LM head.
Stability. Zero loss spikes across 17T tokens using Muon optimizer.
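Arcee has not published the exact SMEBU formula in this briefing, so the sketch below is a hypothetical reading of the description above: a DeepSeek-V3-style bias adjustment for expert selection, with a momentum term and a tanh soft-clamp added on top. The function name and constants (bias_lr, clamp_scale, momentum) are illustrative, not Arcee's.
# Hypothetical sketch of a SMEBU-style routing step; not Arcee's implementation.
import torch

def route_with_smebu(logits, bias, velocity, k=4, momentum=0.9,
                     bias_lr=1e-3, clamp_scale=1.0):
    """One step over router logits of shape [tokens, num_experts]."""
    num_experts = logits.shape[-1]

    # Select top-k experts from biased logits; the bias only steers selection.
    topk = torch.topk(logits + bias, k, dim=-1).indices              # [tokens, k]
    # Mixing weights come from the unbiased logits of the chosen experts.
    weights = torch.softmax(torch.gather(logits, -1, topk), dim=-1)

    # Fraction of token-expert assignments routed to each expert.
    counts = torch.zeros(num_experts, device=logits.device)
    counts.scatter_add_(0, topk.reshape(-1),
                        torch.ones(topk.numel(), device=logits.device))
    load = counts / topk.numel()

    # Under-used experts (load below the uniform share) get a positive error,
    # which raises their bias; momentum smooths the signal across steps.
    error = 1.0 / num_experts - load
    velocity = momentum * velocity + error
    # Soft clamp: tanh bounds each bias update to roughly +/- bias_lr * clamp_scale.
    bias = bias + bias_lr * clamp_scale * torch.tanh(velocity / clamp_scale)
    return topk, weights, bias, velocity
In this reading, balancing acts only on which experts get selected; the mixing weights still come from the unbiased router logits, so no auxiliary balancing loss enters the gradient.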
|
Sparsity Comparison
| Model | Routing | Sparsity |
| Llama 4 Maverick | 1-of-128 | 0.78% |
| Trinity Large | 4-of-256 | 1.56% |
| DeepSeek-V3 | 8-of-256 | 3.13% |
| GLM-4.5 | 8-of-160 | 5.0% |
| Qwen3-235B | 8-of-128 | 6.25% |
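The sparsity column is simply active experts divided by total routed experts; a quick check of the table's figures:
# Routing sparsity = experts active per token / total routed experts
configs = {
    "Llama 4 Maverick": (1, 128),
    "Trinity Large": (4, 256),
    "DeepSeek-V3": (8, 256),
    "GLM-4.5": (8, 160),
    "Qwen3-235B": (8, 128),
}
for name, (active, total) in configs.items():
    print(f"{name}: {active}/{total} = {active / total:.2%}")
# e.g. Trinity Large: 4/256 = 1.56% (lower = sparser routing)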
|
Benchmarks: TrueBase (Pure Pretraining)
10T tokens, no instruction tuning or RLHF.
| Benchmark | N-shot | Score |
| MMLU | 5 | 78.45% |
| HellaSwag | 5 | 88.13% |
| GSM8K (CoT) | 8 | 80.44% |
| MBPP+ | 3 | 80.95% |
| HumanEval+ | 0 | 51.83% |
| TriviaQA | 5 | 80.96% |
| GPQA Diamond | 5 | 40.91% |
| MMLU-Pro | 5 | 51.60% |
| ARC Challenge | 0 | 62.37% |
| WinoGrande | 5 | 81.45% |
| BBH | 3 | 57.84% |
| MATH Hard | 4 | 26.96% |
|
Benchmarks: Preview vs Llama 4 Maverick
Preview is instruct-tuned, not reasoning-focused.
| Benchmark | Llama 4 | Trinity |
| MMLU | 85.5% | 87.2% |
| MMLU-Pro | 80.5% | 75.2% |
| GPQA-Diamond | 69.8% | 63.3% |
| AIME 2025 | 19.3% | 24.0% |
|
Three Checkpoints Released
Preview: Lightly post-trained and chat-ready. Optimized for creative writing, storytelling, and agentic coding tools (OpenCode, Cline, Kilo Code). Not a reasoning model.
Base: Full 17T tokens with LR annealing. Best pretraining checkpoint for fine-tuning.
TrueBase: 10T tokens. No instruction data, no RLHF, no alignment. Pure pretraining only. 8K context. Rare at this scale: intended for interpretability research and for studying what pretraining alone produces.
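Since TrueBase is aimed at interpretability work, here is a minimal sketch of pulling per-layer hidden states with HuggingFace Transformers. It assumes the TrueBase repo loads through the standard Transformers API like the Preview checkpoint shown in the code section below; at ~400B total parameters this still needs a multi-GPU node.
# Hypothetical interpretability snippet: per-layer hidden states from TrueBase.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "arcee-ai/Trinity-Large-TrueBase"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One tensor per layer (plus the embedding layer): [batch, seq_len, hidden_dim]
print(len(out.hidden_states), out.hidden_states[-1].shape)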
|
Trinity Model Family
| Model | Total | Active | Use Case |
| Trinity Nano | 6B | ~1B | Edge, mobile |
| Trinity Mini | 26B | 3B | Agents, tools |
| Trinity Large | 400B | 13B | Frontier tasks |
|
Training Infrastructure
| Hardware | 2,048 NVIDIA B300 GPUs |
| Duration | 33 days (pretraining only) |
| Parallelism | HSDP + Expert Parallelism (EP=8) |
| Optimizer | Muon (Adam LR: 2e-4, Muon LR: 8e-4) |
| Compute partner | Prime Intellect |
| Data partner | DatologyAI |
| Total cost | ~$20M (compute, salaries, data, storage) |
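For scale, these numbers imply the usual back-of-envelope training compute of roughly 6 x active parameters x tokens; the per-GPU throughput below is derived from that approximation, not reported by Arcee.
# Back-of-envelope training compute (approximation, not an official figure)
active_params = 13e9      # active parameters per token
tokens = 17e12            # pretraining tokens
gpus = 2048               # NVIDIA B300s
days = 33

total_flops = 6 * active_params * tokens             # ~1.3e24 FLOPs
seconds = days * 24 * 3600
per_gpu = total_flops / (gpus * seconds)              # sustained FLOP/s per GPU
print(f"total: {total_flops:.2e} FLOPs, "
      f"~{per_gpu / 1e12:.0f} TFLOP/s per GPU sustained")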
|
Training Data
17T tokens curated by DatologyAI in three phases: 10T general, 4T high-quality, 3T STEM-heavy.
8+ trillion synthetic tokens across web, code, math, reasoning, multilingual.
14 non-English languages targeted.
|
Inference
2-3x faster inference than peers due to high sparsity. Native 512K context.
Supported: vLLM, SGLang, llama.cpp, LM Studio, HuggingFace Transformers.
|
How to Access
OpenRouter API: arcee-ai/trinity-large-preview (no cost during preview through Feb 2026)
HuggingFace: arcee-ai/Trinity-Large-Preview, arcee-ai/Trinity-Large-Base, arcee-ai/Trinity-Large-TrueBase
Arcee API: chat.arcee.ai
|
Code Examples
# OpenRouter (API key read from the environment; variable name is illustrative)
import os
import requests

API_KEY = os.environ["OPENROUTER_API_KEY"]
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "arcee-ai/trinity-large-preview",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])
# HuggingFace Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/Trinity-Large-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# vLLM (run from the shell)
vllm serve arcee-ai/Trinity-Large-Preview --dtype bfloat16
|
Known Limitations
TrueBase: Not aligned. May exhibit raw behaviors. 8K context only.
Preview: Not a reasoning model. Light post-training. Text-only.
API: Currently 128K context with 8-bit quantization while scaling.
|
Bottom Line
A 30-person startup trained a 400B sparse MoE in 33 days for ~$20M. Competitive with Llama 4 Maverick. Three checkpoints available including a rare pure-pretraining version for research. All Apache 2.0, weights on HuggingFace, API on OpenRouter.
|
ResearchAudio.io
Deep Mehta · Technical briefings on AI research
|