In partnership with

Start using AI the way top finance teams do.

The AI for Business & Finance Certificate Program from Columbia Business School Exec Ed and Wall Street Prep draws on real-world examples inspired by how firms like BlackRock, Citi, and Morgan Stanley approach AI enablement for their teams.

You'll go beyond theory to understand what's being implemented, why it works, and how to apply it in your own role.

  • Join LIVE office hours with Columbia Business School faculty

  • Earn a certificate from a top business school

  • Get lifetime access to program materials, meet-ups, and networking opportunities

Save $300 with code SAVE300 + $200 with early enrollment by Feb. 17.

ResearchAudio.io

How a 30-Person Team Trained a 400B Model in 33 Days

Arcee AI released Trinity Large this week: a sparse Mixture-of-Experts (MoE) model with 400 billion parameters, trained on 17 trillion tokens. Here's the complete technical breakdown.

January 29, 2026

Summary

Arcee AI, a 30-person startup, released a 400B parameter sparse MoE trained entirely in the US. Cost was approximately $20M. Three checkpoints released: chat-ready preview, full base model, and pure pretraining checkpoint with no alignment. All Apache 2.0 licensed.

400B total params · 13B active/token · 17T tokens · 33 days

Architecture Overview

Model Specifications

Parameter            Value
Total parameters     ~398B
Active per token     ~13B
Experts              256 total, 4 active
Routing sparsity     1.56%
Dense layers         6
Context length       512K (native)
Training tokens      17 trillion
License              Apache 2.0

Architecture Details

Sparse routing. 4-of-256 experts per token (1.56% of experts active). Sparser than DeepSeek-V3 (3.13%) and Qwen3 (6.25%), though not as sparse as Llama 4 Maverick (0.78%).

SMEBU load balancing. New technique: Soft-clamped Momentum Expert Bias Updates. Adjusts router biases with tanh clipping and momentum to prevent expert collapse.
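
Arcee hasn't published the exact SMEBU update rule, but a minimal sketch of what a soft-clamped, momentum-based bias update could look like is below, assuming a DeepSeek-V3-style per-expert routing bias nudged toward uniform load (the function name, shapes, and hyperparameters are illustrative, not from the release):

import torch

def smebu_update(bias, momentum, expert_load, beta=0.9, lr=1e-3, clamp=1.0):
    """Illustrative sketch: nudge per-expert router biases toward balanced load.

    bias, momentum, expert_load: tensors of shape (num_experts,).
    expert_load is the fraction of tokens routed to each expert this batch.
    """
    target = torch.full_like(expert_load, 1.0 / expert_load.numel())
    imbalance = expert_load - target                      # >0 means overloaded
    momentum = beta * momentum + (1 - beta) * imbalance   # smooth noisy loads
    # Soft-clamp each step with tanh so no single update moves a bias too far.
    bias = bias - lr * clamp * torch.tanh(momentum / clamp)
    return bias, momentum

# At routing time such a bias is typically added to the router logits only for
# top-k expert selection, not to the mixture weights (DeepSeek-V3 style).

The tanh keeps any single update bounded by lr * clamp, while the momentum term smooths per-batch load estimates, which matches the stated goal of preventing expert collapse.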

Attention. Interleaved local/global attention with Grouped Query Attention (GQA) and gating.
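
The announcement doesn't spell out the layer pattern or the gating form, so the sketch below only shows the general shape of such a design: GQA by sharing each key/value head across a group of query heads, a sliding-window mask for local layers, full causal attention for global layers, and an optional sigmoid gate on the output. The window size, schedule, and gate placement are all assumptions.

import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, local_window=None, gate=None):
    """q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq a multiple of Hkv."""
    B, Hq, T, D = q.shape
    Hkv = k.shape[1]
    # GQA: each group of Hq // Hkv query heads shares one key/value head.
    k = k.repeat_interleave(Hq // Hkv, dim=1)
    v = v.repeat_interleave(Hq // Hkv, dim=1)

    # Causal mask, optionally restricted to a sliding local window.
    i = torch.arange(T, device=q.device).view(T, 1)
    j = torch.arange(T, device=q.device).view(1, T)
    mask = j <= i
    if local_window is not None:
        mask = mask & ((i - j) < local_window)
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

    if gate is not None:              # e.g. sigmoid of a learned output gate
        out = out * torch.sigmoid(gate)
    return out

# Hypothetical interleaving schedule: one global (full-context) layer every 4.
def is_global_layer(layer_idx, every=4):
    return layer_idx % every == every - 1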

Normalization. Depth-scaled sandwich norm + z-loss regularization on LM head.
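
"Sandwich norm" usually means normalizing both the input and the output of each sub-block, "depth-scaled" suggests the residual branch is scaled by a factor that depends on layer depth, and z-loss is the PaLM-style penalty on the LM head's log-normalizer. A rough sketch under those assumptions (the actual scaling schedule isn't published; nn.RMSNorm requires PyTorch 2.4+):

import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    """Illustrative sandwich-norm residual block with a depth-dependent scale."""
    def __init__(self, d_model, sublayer, layer_idx):
        super().__init__()
        self.pre_norm = nn.RMSNorm(d_model)
        self.post_norm = nn.RMSNorm(d_model)
        self.sublayer = sublayer                 # attention or MoE FFN module
        self.scale = (layer_idx + 1) ** -0.5     # hypothetical depth scaling

    def forward(self, x):
        return x + self.scale * self.post_norm(self.sublayer(self.pre_norm(x)))

def z_loss(logits, coeff=1e-4):
    """PaLM-style z-loss: keep the softmax log-normalizer of the LM head small."""
    log_z = torch.logsumexp(logits, dim=-1)
    return coeff * (log_z ** 2).mean()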

Stability. Zero loss spikes across 17T tokens using Muon optimizer.
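
Muon (paired with Adam in the training setup listed below) replaces the raw momentum of each 2-D weight matrix with an approximately orthogonalized version before applying it, which bounds the size of any single update. A minimal conceptual sketch using the classic cubic Newton-Schulz iteration; the production optimizer uses a tuned higher-order polynomial and handles non-matrix parameters with Adam, so treat this as illustrative only:

import torch

def orthogonalize(m, steps=5, eps=1e-7):
    """Approximate the orthogonal polar factor of m (UV^T from its SVD)
    via Newton-Schulz, so the update has roughly unit singular values."""
    x = m / (m.norm() + eps)              # scale so the iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_step(weight, grad, momentum, beta=0.95, lr=8e-4):
    """One illustrative Muon-style step for a 2-D weight matrix
    (lr matches the Muon LR reported in the infrastructure table)."""
    momentum.mul_(beta).add_(grad)
    weight.data.add_(orthogonalize(momentum), alpha=-lr)
    return momentum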

Sparsity Comparison

Model              Routing    Sparsity
Llama 4 Maverick   1-of-128   0.78%
Trinity Large      4-of-256   1.56%
DeepSeek-V3        8-of-256   3.13%
GLM-4.5            8-of-160   5.0%
Qwen3-235B         8-of-128   6.25%
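
The sparsity column is simply the number of routed experts activated per token divided by the size of the expert pool, which is easy to check:

# Sparsity = experts activated per token / total routed experts
models = {
    "Llama 4 Maverick": (1, 128),
    "Trinity Large": (4, 256),
    "DeepSeek-V3": (8, 256),
    "GLM-4.5": (8, 160),
    "Qwen3-235B": (8, 128),
}
for name, (k, n) in models.items():
    print(f"{name}: {k}/{n} = {k / n:.2%}")
# Trinity Large: 4/256 = 1.56%, matching the table above.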

Benchmarks: TrueBase (Pure Pretraining)

10T tokens, no instruction tuning or RLHF.

Benchmark        N-shot   Score
MMLU             5        78.45%
HellaSwag        5        88.13%
GSM8K (CoT)      8        80.44%
MBPP+            3        80.95%
HumanEval+       0        51.83%
TriviaQA         5        80.96%
GPQA Diamond     5        40.91%
MMLU-Pro         5        51.60%
ARC Challenge    0        62.37%
WinoGrande       5        81.45%
BBH              3        57.84%
MATH Hard        4        26.96%

Benchmarks: Preview vs Llama 4 Maverick

Preview is instruct-tuned, not reasoning-focused.

Benchmark      Llama 4 Maverick   Trinity Preview
MMLU           85.5%              87.2%
MMLU-Pro       80.5%              75.2%
GPQA-Diamond   69.8%              63.3%
AIME 2025      19.3%              24.0%

Three Checkpoints Released

Preview: Lightly post-trained, chat-ready. Optimized for creative writing, storytelling, agents (OpenCode, Cline, Kilo Code). Not a reasoning model.

Base: Full 17T tokens with LR annealing. Best pretraining checkpoint for fine-tuning.

TrueBase: 10T tokens. No instruction data, no RLHF, no alignment. Pure pretraining only. 8K context. Rare at this scale; intended for interpretability research and studying what pretraining produces.

Trinity Model Family

Model           Total   Active   Use Case
Trinity Nano    6B      ~1B      Edge, mobile
Trinity Mini    26B     3B       Agents, tools
Trinity Large   400B    13B      Frontier tasks

Training Infrastructure

Hardware          2,048 NVIDIA B300 GPUs
Duration          33 days (pretraining only)
Parallelism       HSDP + Expert Parallelism (EP=8)
Optimizer         Muon (Adam LR: 2e-4, Muon LR: 8e-4)
Compute partner   Prime Intellect
Data partner      DatologyAI
Total cost        ~$20M (compute, salaries, data, storage)
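
A quick back-of-the-envelope from the table above; the release doesn't break the ~$20M down, so the per-GPU-hour figure is only an all-in upper bound, not a compute price:

# Rough scale of the pretraining run, from the infrastructure table.
gpus = 2048
days = 33
gpu_hours = gpus * days * 24
print(f"GPU-hours: {gpu_hours:,}")                    # 1,622,016

total_cost = 20_000_000                               # ~$20M, all-in
print(f"<= ${total_cost / gpu_hours:.2f} per GPU-hour, all-in")   # ~$12.33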

Training Data

17T tokens curated by DatologyAI in three phases: 10T general, 4T high-quality, 3T STEM-heavy.

8+ trillion synthetic tokens spanning web, code, math, reasoning, and multilingual data.

14 non-English languages targeted.

Inference

2-3x faster inference than comparably sized peers due to high sparsity (only ~13B of ~400B parameters active per token). Native 512K context.

Supported: vLLM, SGLang, llama.cpp, LM Studio, HuggingFace Transformers.

How to Access

OpenRouter API: arcee-ai/trinity-large-preview (no cost during preview through Feb 2026)

HuggingFace: arcee-ai/Trinity-Large-Preview, arcee-ai/Trinity-Large-Base, arcee-ai/Trinity-Large-TrueBase

Arcee API: chat.arcee.ai

Code Examples

# OpenRouter (OpenAI-compatible chat completions endpoint)
import os
import requests

API_KEY = os.environ["OPENROUTER_API_KEY"]  # set this to your OpenRouter key
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "arcee-ai/trinity-large-preview",
          "messages": [{"role": "user", "content": "Hello"}]},
)
print(response.json()["choices"][0]["message"]["content"])

# HuggingFace Transformers (weights are large; device_map="auto" shards across GPUs)
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Trinity-Large-Preview",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# vLLM (shell command)
vllm serve arcee-ai/Trinity-Large-Preview --dtype bfloat16

Known Limitations

TrueBase: Not aligned; may produce raw, unfiltered outputs. 8K context only.

Preview: Not a reasoning model. Light post-training. Text-only.

API: Currently serves 128K context with 8-bit quantization while capacity is scaled up.

Bottom Line

A 30-person startup trained a 400B sparse MoE in 33 days for ~$20M. Competitive with Llama 4 Maverick. Three checkpoints available including a rare pure-pretraining version for research. All Apache 2.0, weights on HuggingFace, API on OpenRouter.

ResearchAudio.io

Deep Mehta · Technical briefings on AI research
