ResearchAudio.io

How a 30-Person Team Trained a 400B Model in 33 Days

Arcee AI released Trinity Large this week — a sparse MoE with 400 billion parameters, trained on 17 trillion tokens. Here's the complete technical breakdown.

January 29, 2026

Summary

Arcee AI, a 30-person startup, released a 400B-parameter sparse MoE trained entirely in the US for approximately $20M. Three checkpoints were released: a chat-ready preview, the full base model, and a pure pretraining checkpoint with no alignment. All are Apache 2.0 licensed.

400B total params · 13B active/token · 17T training tokens · 33 days

Architecture Overview

Model Specifications

Parameter | Value
Total parameters | ~398B
Active per token | ~13B
Experts | 256 total, 4 active
Routing sparsity | 1.56%
Dense layers | 6
Context length | 512K (native)
Training tokens | 17 trillion
License | Apache 2.0

Architecture Details

Sparse routing. 4-of-256 experts (1.56% of experts active per token). More aggressive than DeepSeek-V3 (3.13%) and Qwen3 (6.25%), though less aggressive than Llama 4 Maverick (0.78%).

SMEBU load balancing. New technique: Soft-clamped Momentum Expert Bias Updates. Adjusts router biases with tanh clipping and momentum to prevent expert collapse.
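Arcee hasn't published reference code in this briefing, but the description suggests an update along these lines: per-expert router biases nudged by a momentum-smoothed load-imbalance signal, with a tanh soft clamp bounding each step. A minimal sketch; the function name, hyperparameters, and exact rule below are assumptions, not Arcee's implementation:

import numpy as np

def smebu_step(bias, momentum, expert_load, beta=0.9, clamp=0.05):
    # One hypothetical SMEBU update for a 256-expert router.
    # bias, momentum, expert_load: arrays of shape (num_experts,);
    # expert_load is the fraction of tokens routed to each expert this step.
    target = 1.0 / len(bias)                  # uniform load, 1/256 for Trinity
    error = target - expert_load              # positive => expert is underused
    momentum = beta * momentum + (1.0 - beta) * error
    # tanh soft clamp: no single step can move a bias by more than `clamp`
    bias = bias + clamp * np.tanh(momentum / clamp)
    return bias, momentum

The biases would be added to the router logits before top-4 selection, so chronically underused experts gradually become more attractive and expert collapse is avoided.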

Attention. Interleaved local/global attention with Grouped Query Attention (GQA) and gating.
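Interleaved local/global means some layers attend over a sliding window while others see the full context, which keeps long-context cost manageable. A rough illustration of the masking; the window size and the every-4th-layer-global schedule here are illustrative assumptions, not Trinity's actual configuration:

import torch

def causal_mask(seq_len, window=None):
    # Boolean mask, True = may attend. window=None gives full (global) causal
    # attention; an integer restricts each token to the last `window` positions.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    mask = j <= i                        # causal
    if window is not None:
        mask &= (i - j) < window         # sliding-window (local)
    return mask

# hypothetical schedule: every 4th layer global, the rest local
masks = [causal_mask(1024, window=None if layer % 4 == 3 else 256)
         for layer in range(4)]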

Normalization. Depth-scaled sandwich norm + z-loss regularization on LM head.
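Z-loss is the PaLM-style regularizer that penalizes the squared log-normalizer of the output softmax so logits don't drift to extreme magnitudes. A minimal version (the 1e-4 coefficient is PaLM's published value, not necessarily Trinity's):

import torch

def z_loss(logits, coef=1e-4):
    # logits: (batch, seq, vocab). Penalizes log(Z)^2 where Z = sum(exp(logits)),
    # keeping the softmax normalizer near 1 and the LM head well-scaled.
    log_z = torch.logsumexp(logits, dim=-1)
    return coef * (log_z ** 2).mean()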

Stability. Zero loss spikes across 17T tokens using Muon optimizer.
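Muon's core move is orthogonalizing each weight matrix's momentum-averaged gradient with a few Newton-Schulz iterations before applying it, which bounds update scale and is widely credited with smoother loss curves. A bare-bones sketch of that step, using the coefficients from the public Muon reference implementation (Trinity's exact variant is not described in the release):

import torch

def newton_schulz(G, steps=5, eps=1e-7):
    # Approximately maps G to the nearest semi-orthogonal matrix.
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from the Muon repo
    X = G / (G.norm() + eps)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Muon update (sketch): buf = mu * buf + grad; W -= lr * newton_schulz(buf)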

Sparsity Comparison

Model | Routing | Sparsity
Llama 4 Maverick | 1-of-128 | 0.78%
Trinity Large | 4-of-256 | 1.56%
DeepSeek-V3 | 8-of-256 | 3.13%
GLM-4.5 | 8-of-160 | 5.0%
Qwen3-235B | 8-of-128 | 6.25%
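The sparsity column is simply active experts divided by total experts:

for name, active, total in [("Llama 4 Maverick", 1, 128),
                            ("Trinity Large", 4, 256),
                            ("DeepSeek-V3", 8, 256),
                            ("GLM-4.5", 8, 160),
                            ("Qwen3-235B", 8, 128)]:
    print(f"{name}: {active}-of-{total} = {active / total:.2%}")
# e.g. Trinity Large: 4-of-256 = 1.56%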

Benchmarks: TrueBase (Pure Pretraining)

Scores below are from the 10T-token TrueBase checkpoint: no instruction tuning, no RLHF.

Benchmark | N-shot | Score
MMLU | 5 | 78.45%
HellaSwag | 5 | 88.13%
GSM8K (CoT) | 8 | 80.44%
MBPP+ | 3 | 80.95%
HumanEval+ | 0 | 51.83%
TriviaQA | 5 | 80.96%
GPQA Diamond | 5 | 40.91%
MMLU-Pro | 5 | 51.60%
ARC Challenge | 0 | 62.37%
WinoGrande | 5 | 81.45%
BBH | 3 | 57.84%
MATH Hard | 4 | 26.96%

Benchmarks: Preview vs Llama 4 Maverick

Preview is instruct-tuned, not reasoning-focused.

Benchmark | Llama 4 Maverick | Trinity Large Preview
MMLU | 85.5% | 87.2%
MMLU-Pro | 80.5% | 75.2%
GPQA-Diamond | 69.8% | 63.3%
AIME 2025 | 19.3% | 24.0%

Three Checkpoints Released

Preview: Lightly post-trained, chat-ready. Optimized for creative writing, storytelling, and agent workflows (OpenCode, Cline, Kilo Code). Not a reasoning model.

Base: Full 17T tokens with LR annealing. Best pretraining checkpoint for fine-tuning.

TrueBase: 10T tokens. No instruction data, no RLHF, no alignment. Pure pretraining only. 8K context. Rare at this scale — intended for interpretability research and studying what pretraining produces.

Trinity Model Family

Model | Total | Active | Use Case
Trinity Nano | 6B | ~1B | Edge, mobile
Trinity Mini | 26B | 3B | Agents, tools
Trinity Large | 400B | 13B | Frontier tasks

Training Infrastructure

Hardware | 2,048 NVIDIA B300 GPUs
Duration | 33 days (pretraining only)
Parallelism | HSDP + Expert Parallelism (EP=8)
Optimizer | Muon (Adam LR: 2e-4, Muon LR: 8e-4)
Compute partner | Prime Intellect
Data partner | DatologyAI
Total cost | ~$20M (compute, salaries, data, storage)
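Those figures imply a throughput you can sanity-check in a few lines (rough arithmetic that ignores restarts and evaluation pauses):

tokens = 17e12                       # 17T training tokens
seconds = 33 * 24 * 3600             # 33 days of pretraining
gpus = 2048
print(f"{tokens / seconds:,.0f} tokens/sec cluster-wide")    # ~6.0M
print(f"{tokens / seconds / gpus:,.0f} tokens/sec per GPU")  # ~2,900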

Training Data

17T tokens curated by DatologyAI in three phases: 10T general, 4T high-quality, 3T STEM-heavy.

Over 8 trillion of those tokens are synthetic, spanning web, code, math, reasoning, and multilingual data.

The multilingual mix targets 14 non-English languages.

Inference

2-3x faster inference than peers thanks to high sparsity: only ~13B of ~398B parameters are active per token. Native 512K context.
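The claim follows from active parameter counts, since per-token FLOPs scale roughly with active rather than total parameters. For scale, comparing against DeepSeek-V3's reported ~37B active parameters (a figure from DeepSeek's own paper, not this briefing):

trinity_active = 13e9
deepseek_v3_active = 37e9    # reported by DeepSeek, not part of this release
print(deepseek_v3_active / trinity_active)   # ~2.85x fewer active params/token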

Supported: vLLM, SGLang, llama.cpp, LM Studio, HuggingFace Transformers.

How to Access

OpenRouter API: arcee-ai/trinity-large-preview (no cost during preview through Feb 2026)

HuggingFace: arcee-ai/Trinity-Large-Preview, arcee-ai/Trinity-Large-Base, arcee-ai/Trinity-Large-TrueBase

Arcee API: chat.arcee.ai

Code Examples

# OpenRouter (chat completions endpoint lives under /api/v1)
import os
import requests

API_KEY = os.environ["OPENROUTER_API_KEY"]   # your OpenRouter key
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "arcee-ai/trinity-large-preview",
          "messages": [{"role": "user", "content": "Hello"}]},
)
print(response.json()["choices"][0]["message"]["content"])

# HuggingFace Transformers
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Trinity-Large-Preview",
    torch_dtype=torch.bfloat16,   # load weights in bf16
    device_map="auto",            # shard across available GPUs
    trust_remote_code=True,
)

# vLLM (shell command)
vllm serve arcee-ai/Trinity-Large-Preview --dtype bfloat16

Known Limitations

TrueBase: Not aligned. May exhibit raw behaviors. 8K context only.

Preview: Not a reasoning model. Light post-training. Text-only.

API: Currently 128K context with 8-bit quantization while scaling.

Bottom Line

A 30-person startup trained a 400B sparse MoE in 33 days for ~$20M. Competitive with Llama 4 Maverick. Three checkpoints available including a rare pure-pretraining version for research. All Apache 2.0, weights on HuggingFace, API on OpenRouter.

ResearchAudio.io

Deep Mehta · Technical briefings on AI research
