ResearchAudio.io

How a 30-Person Team Trained a 400B Model in 33 Days

Arcee AI released Trinity Large this week — a sparse MoE with 400 billion parameters, trained on 17 trillion tokens. Here's the complete technical breakdown.

January 29, 2026

Summary

Arcee AI, a 30-person startup, released a 400B-parameter sparse MoE trained entirely in the US for approximately $20M. Three checkpoints were released: a chat-ready preview, the full base model, and a pure pretraining checkpoint with no alignment. All are Apache 2.0 licensed.

400B total params · 13B active/token · 17T training tokens · 33 days

Architecture Overview

Model Specifications

Parameter | Value
Total parameters | ~398B
Active per token | ~13B
Experts | 256 total, 4 active
Routing sparsity | 1.56%
Dense layers | 6
Context length | 512K (native)
Training tokens | 17 trillion
License | Apache 2.0

Architecture Details

Sparse routing. 4-of-256 experts (1.56% of experts active per token). More aggressive than DeepSeek-V3 (3.13%) and Qwen3 (6.25%), though less aggressive than Llama 4 Maverick (0.78%).

SMEBU load balancing. New technique: Soft-clamped Momentum Expert Bias Updates. Adjusts router biases with tanh clipping and momentum to prevent expert collapse.
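Arcee hasn't published reference code in this briefing, but the description suggests an update along these lines: per-expert router biases nudged by a momentum-smoothed load-imbalance signal, with a tanh soft clamp bounding each step. A minimal sketch; the function name, hyperparameters, and exact rule below are assumptions, not Arcee's implementation:

import numpy as np

def smebu_step(bias, momentum, expert_load, beta=0.9, clamp=0.05):
    # One hypothetical SMEBU update for a 256-expert router.
    # bias, momentum, expert_load: arrays of shape (num_experts,);
    # expert_load is the fraction of tokens routed to each expert this step.
    target = 1.0 / len(bias)                  # uniform load, 1/256 for Trinity
    error = target - expert_load              # positive => expert is underused
    momentum = beta * momentum + (1.0 - beta) * error
    # tanh soft clamp: no single step can move a bias by more than `clamp`
    bias = bias + clamp * np.tanh(momentum / clamp)
    return bias, momentum

The biases would be added to the router logits before top-4 selection, so chronically underused experts gradually become more attractive and expert collapse is avoided.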

Attention. Interleaved local/global attention with Grouped Query Attention (GQA) and gating.
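Interleaved local/global means some layers attend over a sliding window while others see the full context, which keeps long-context cost manageable. A rough illustration of the masking; the window size and the every-4th-layer-global schedule here are illustrative assumptions, not Trinity's actual configuration:

import torch

def causal_mask(seq_len, window=None):
    # Boolean mask, True = may attend. window=None gives full (global) causal
    # attention; an integer restricts each token to the last `window` positions.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    mask = j <= i                        # causal
    if window is not None:
        mask &= (i - j) < window         # sliding-window (local)
    return mask

# hypothetical schedule: every 4th layer global, the rest local
masks = [causal_mask(1024, window=None if layer % 4 == 3 else 256)
         for layer in range(4)]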

Normalization. Depth-scaled sandwich norm + z-loss regularization on LM head.
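Z-loss is the PaLM-style regularizer that penalizes the squared log-normalizer of the output softmax so logits don't drift to extreme magnitudes. A minimal version (the 1e-4 coefficient is PaLM's published value, not necessarily Trinity's):

import torch

def z_loss(logits, coef=1e-4):
    # logits: (batch, seq, vocab). Penalizes log(Z)^2 where Z = sum(exp(logits)),
    # keeping the softmax normalizer near 1 and the LM head well-scaled.
    log_z = torch.logsumexp(logits, dim=-1)
    return coef * (log_z ** 2).mean()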

Stability. Zero loss spikes across 17T tokens using Muon optimizer.
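Muon's core move is orthogonalizing each weight matrix's momentum-averaged gradient with a few Newton-Schulz iterations before applying it, which bounds update scale and is widely credited with smoother loss curves. A bare-bones sketch of that step, using the coefficients from the public Muon reference implementation (Trinity's exact variant is not described in the release):

import torch

def newton_schulz(G, steps=5, eps=1e-7):
    # Approximately maps G to the nearest semi-orthogonal matrix.
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from the Muon repo
    X = G / (G.norm() + eps)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Muon update (sketch): buf = mu * buf + grad; W -= lr * newton_schulz(buf)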

Sparsity Comparison

Model | Routing | Sparsity
Llama 4 Maverick | 1-of-128 | 0.78%
Trinity Large | 4-of-256 | 1.56%
DeepSeek-V3 | 8-of-256 | 3.13%
GLM-4.5 | 8-of-160 | 5.0%
Qwen3-235B | 8-of-128 | 6.25%
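The sparsity column is simply active experts divided by total experts:

for name, active, total in [("Llama 4 Maverick", 1, 128),
                            ("Trinity Large", 4, 256),
                            ("DeepSeek-V3", 8, 256),
                            ("GLM-4.5", 8, 160),
                            ("Qwen3-235B", 8, 128)]:
    print(f"{name}: {active}-of-{total} = {active / total:.2%}")
# e.g. Trinity Large: 4-of-256 = 1.56%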

Benchmarks: TrueBase (Pure Pretraining)

Scores below are from the 10T-token TrueBase checkpoint: no instruction tuning, no RLHF.

Benchmark | N-shot | Score
MMLU | 5 | 78.45%
HellaSwag | 5 | 88.13%
GSM8K (CoT) | 8 | 80.44%
MBPP+ | 3 | 80.95%
HumanEval+ | 0 | 51.83%
TriviaQA | 5 | 80.96%
GPQA Diamond | 5 | 40.91%
MMLU-Pro | 5 | 51.60%
ARC Challenge | 0 | 62.37%
WinoGrande | 5 | 81.45%
BBH | 3 | 57.84%
MATH Hard | 4 | 26.96%

Benchmarks: Preview vs Llama 4 Maverick

Preview is instruct-tuned, not reasoning-focused.

Benchmark | Llama 4 Maverick | Trinity Large Preview
MMLU | 85.5% | 87.2%
MMLU-Pro | 80.5% | 75.2%
GPQA-Diamond | 69.8% | 63.3%
AIME 2025 | 19.3% | 24.0%

Three Checkpoints Released

Preview: Lightly post-trained, chat-ready. Optimized for creative writing, storytelling, and agent workflows (OpenCode, Cline, Kilo Code). Not a reasoning model.

Base: Full 17T tokens with LR annealing. Best pretraining checkpoint for fine-tuning.

TrueBase: 10T tokens. No instruction data, no RLHF, no alignment. Pure pretraining only. 8K context. Rare at this scale — intended for interpretability research and studying what pretraining produces.

Trinity Model Family

Model | Total | Active | Use Case
Trinity Nano | 6B | ~1B | Edge, mobile
Trinity Mini | 26B | 3B | Agents, tools
Trinity Large | 400B | 13B | Frontier tasks

Training Infrastructure

Hardware | 2,048 NVIDIA B300 GPUs
Duration | 33 days (pretraining only)
Parallelism | HSDP + Expert Parallelism (EP=8)
Optimizer | Muon (Adam LR: 2e-4, Muon LR: 8e-4)
Compute partner | Prime Intellect
Data partner | DatologyAI
Total cost | ~$20M (compute, salaries, data, storage)
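Those figures imply a throughput you can sanity-check in a few lines (rough arithmetic that ignores restarts and evaluation pauses):

tokens = 17e12                       # 17T training tokens
seconds = 33 * 24 * 3600             # 33 days of pretraining
gpus = 2048
print(f"{tokens / seconds:,.0f} tokens/sec cluster-wide")    # ~6.0M
print(f"{tokens / seconds / gpus:,.0f} tokens/sec per GPU")  # ~2,900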

Training Data

17T tokens curated by DatologyAI in three phases: 10T general, 4T high-quality, 3T STEM-heavy.

Over 8 trillion of those tokens are synthetic, spanning web, code, math, reasoning, and multilingual data.

The multilingual mix targets 14 non-English languages.

Inference

2-3x faster inference than peers thanks to high sparsity: only ~13B of ~398B parameters are active per token. Native 512K context.
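The claim follows from active parameter counts, since per-token FLOPs scale roughly with active rather than total parameters. For scale, comparing against DeepSeek-V3's reported ~37B active parameters (a figure from DeepSeek's own paper, not this briefing):

trinity_active = 13e9
deepseek_v3_active = 37e9    # reported by DeepSeek, not part of this release
print(deepseek_v3_active / trinity_active)   # ~2.85x fewer active params/token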

Supported: vLLM, SGLang, llama.cpp, LM Studio, HuggingFace Transformers.

How to Access

OpenRouter API: arcee-ai/trinity-large-preview (no cost during preview through Feb 2026)

HuggingFace: arcee-ai/Trinity-Large-Preview, arcee-ai/Trinity-Large-Base, arcee-ai/Trinity-Large-TrueBase

Arcee API: chat.arcee.ai

Code Examples

# OpenRouter (chat completions endpoint lives under /api/v1)
import os
import requests

API_KEY = os.environ["OPENROUTER_API_KEY"]   # your OpenRouter key
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "arcee-ai/trinity-large-preview",
          "messages": [{"role": "user", "content": "Hello"}]},
)
print(response.json()["choices"][0]["message"]["content"])

# HuggingFace Transformers
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Trinity-Large-Preview",
    torch_dtype=torch.bfloat16,   # load weights in bf16
    device_map="auto",            # shard across available GPUs
    trust_remote_code=True,
)

# vLLM (shell command)
vllm serve arcee-ai/Trinity-Large-Preview --dtype bfloat16

Known Limitations

TrueBase: Not aligned. May exhibit raw behaviors. 8K context only.

Preview: Not a reasoning model. Light post-training. Text-only.

API: Currently 128K context with 8-bit quantization while scaling.

Bottom Line

A 30-person startup trained a 400B sparse MoE in 33 days for ~$20M. Competitive with Llama 4 Maverick. Three checkpoints available including a rare pure-pretraining version for research. All Apache 2.0, weights on HuggingFace, API on OpenRouter.

ResearchAudio.io

Deep Mehta · Technical briefings on AI research
