On sourcing. This issue covers a model from a lab competing with the ones we usually write about. Every number below is first-party from MiniMax's launch post and benchmark methodology, cross-checked against third-party coverage, and flagged wherever it cannot yet be reproduced independently. The architecture explanation and the methodology critique are our own analysis. Where the launch post overstates, we say so.
There is a new attention architecture in production, and it is not from OpenAI, Anthropic, or Google. It is from a Chinese lab, and most of the English-language coverage has focused on the benchmark table rather than the thing underneath it.
MiniMax M3 launched on June 1, 2026. The headline numbers are the kind that get aggregator posts: 59.0% on SWE-Bench Pro, a 1-million-token context window, native multimodal training from step zero, and 70% on OSWorld-Verified for computer use, plus the claim that M3 is the first open-weight model to combine all of those. Those numbers are real, and they are also vendor-reported. The part worth your time is the architecture: MSA, MiniMax Sparse Attention, the first credible production alternative to standard full transformer attention to ship in 2026.
The architecture claim, in plain English
A normal language model looks at every earlier token when it decides the next one. Across a million tokens of context, that adds up to on the order of a trillion token-to-token comparisons per attention pass, because the cost grows with the square of the length. MSA filters the context first and skips most of that work, reading only the small fraction that matters. It is the difference between reading a 1,000-page book cover to cover and reading the 30 pages where the plot turns.
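To make the quadratic growth concrete, here is a minimal counting sketch. It only counts query-key score computations for a single head in a single full-attention pass; the function name is ours, and nothing here is taken from MiniMax's code.

```python
def attention_pairs(n_tokens: int, causal: bool = True) -> int:
    """Count query-key score computations for one head in one full-attention pass."""
    if causal:
        # The t-th token (1-based) scores against t keys: 1 + 2 + ... + n.
        return n_tokens * (n_tokens + 1) // 2
    return n_tokens * n_tokens


for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n, causal=False):.2e} pairs "
          f"({attention_pairs(n):.2e} with a causal mask)")
```

Doubling the context length quadruples the unmasked count, which is the scaling that sparse selection is meant to avoid.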
The technical version: MSA keeps a standard Grouped-Query Attention (GQA) backbone and adds block-level selection on top. The key-value cache is split into fixed blocks, a lightweight index branch scores those blocks and keeps the most relevant via top-k, and the main branch runs ordinary softmax attention over just those blocks of the real, uncompressed KV. This is the design choice that separates it from DeepSeek's Multi-head Latent Attention (MLA), which compresses keys and values into a smaller latent space and trades away some long-context precision. MSA does not compress, it selects, so it sidesteps that precision tax. The cost is that the cache itself does not shrink, which is a sensible bargain: spend memory to keep quality, recover compute through sparsity.
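The split, score, keep, attend sequence described above can be sketched for a single query in plain Python. This is an illustration under stated assumptions, not MiniMax's implementation: the index branch here scores each block by the dot product of the query with the block's mean key, which is our stand-in because the launch post does not specify the indexer, and `block_size` and `top_k` are hypothetical parameters.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend one query over only the top_k highest-scoring KV blocks."""
    n, dim = len(keys), len(query)
    # 1. Split: partition the uncompressed KV cache into fixed-size blocks.
    blocks = [(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    # 2. Score: ASSUMPTION, the indexer is a mean-pooled key per block.
    block_scores = []
    for start, end in blocks:
        mean_key = [sum(keys[i][d] for i in range(start, end)) / (end - start)
                    for d in range(dim)]
        block_scores.append(dot(query, mean_key))
    # 3. Keep: top-k block ids, restored to positional order.
    kept = sorted(range(len(blocks)), key=lambda b: block_scores[b], reverse=True)[:top_k]
    kept.sort()
    token_ids = [i for b in kept for i in range(*blocks[b])]
    # 4. Attend: ordinary scaled softmax attention over the real keys and values.
    scale = 1.0 / math.sqrt(dim)
    weights = softmax([dot(query, keys[i]) * scale for i in token_ids])
    out = [sum(w * values[i][d] for w, i in zip(weights, token_ids))
           for d in range(len(values[0]))]
    return out, kept
```

Setting `top_k` to the number of blocks recovers full attention, which is a convenient sanity check. The sketch handles one query at a time and does not model the shared top-k pick across a GQA group.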
| Property | Full attention | MSA (M3) |
| --- | --- | --- |
| Compute vs context length | Quadratic (n²) | Near-linear (sparse) |
| KV cache | Continuous | Block-partitioned, real KV kept |
| Operator | Standard flash attention | KV-outer gather-Q, each block read once |
| Per-token compute at 1M vs M2 | baseline | 1/20 |
| Prefill / decode at 1M | baseline | 9×+ / 15×+ faster |
| vs Flash-Sparse-Attn, flash-moba | baseline | 4×+ faster (M3 head config) |
The single most quoted line from the launch post: at a context length of one million tokens, M3's per-token compute is 1/20 that of the previous-generation M2 model. That is not a 10% efficiency win. It is the difference between a 1M-context model being deployable and being a research curiosity.
Figure: How one MSA layer reads a long context
1 · Split: the real KV cache is cut into fixed blocks.
2 · Score: a lightweight indexer ranks every block (example scores: 0.8, 0.1, 0.7, 0.2, 0.9, 0.3).
3 · Keep + attend: softmax runs on the few kept blocks, at full precision.
Query heads in one GQA group share a single top-k pick, so one block load fills on-chip memory and a FlashAttention-style kernel runs unchanged.
Source: MiniMax M3 launch blog (Jun 2026). Speedups are M3 vs the previous-generation M2 model at 1M context.
The hardware alignment is worth slowing down on. In standard attention the query drives the loop: the model asks what to look up, and the KV cache answers. MSA inverts that into a "KV outer, gather Q" pass: each block announces what it contains, and the queries that need it gather around. Each block is read from memory once, access is contiguous, and arithmetic intensity rises. MiniMax reports this runs more than 4 times faster than the open-source Flash-Sparse-Attention and flash-moba kernels under M3's head configuration, a relative claim about that configuration rather than a universal one.
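The loop inversion is easier to see as a toy count of block reads. The sketch below is our own simplification: it ignores caching, tiling, and everything GPU-specific, and the function names are hypothetical. It only shows that regrouping the same selections by block means each selected block is visited once.

```python
from collections import defaultdict
from typing import Dict, List


def query_outer_block_loads(selected: List[List[int]]) -> int:
    """Query-driven loop: every query fetches each of its selected blocks itself."""
    return sum(len(blocks) for blocks in selected)


def kv_outer_schedule(selected: List[List[int]]) -> Dict[int, List[int]]:
    """KV-driven loop: map each block to the queries that gather around it."""
    gather = defaultdict(list)
    for query_id, blocks in enumerate(selected):
        for block_id in blocks:
            gather[block_id].append(query_id)
    # One entry per block, so each selected block is read exactly once.
    return dict(sorted(gather.items()))


# Hypothetical selections: three queries, each keeping two blocks.
selected = [[0, 2], [2, 3], [0, 2]]
print(query_outer_block_loads(selected), "loads query-outer")
print(len(kv_outer_schedule(selected)), "loads KV-outer")
```

The more the queries' selections overlap, the larger the gap between the two counts, which is consistent with the shared top-k pick inside a GQA group helping this access pattern.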
Independent readers of MiniMax's published diagram describe MSA as a streamlined version of NSA (Native Sparse Attention), the sparse-attention design DeepSeek published in 2025. Where NSA ran three parallel branches (compress, select, and a sliding window) plus a learned gate, MSA keeps the selection branch and drops the rest. The companion MSA paper, public this week along with the open weights, reports the controlled test: on a 109-billion-parameter mixture-of-experts model trained from scratch on 3 trillion tokens, MSA matched GQA quality while cutting per-token attention compute by a factor of 28.4 at 1M context, with 14.2× prefill and 7.6× decode speedups on an H800. Those figures describe that ablation model, not M3 itself. The kernel is on GitHub.
Two claims to scrutinize. The launch post says ablations showed MSA matched full attention on the majority of capabilities, without saying which were lost or by how much. Sparse attention has a long history of trading a small accuracy loss for a large speedup (Longformer, BigBird), and a 1/20 compute ratio implies an aggressive sparsity. That ablation detail is the thing to read closely.
On the parameter count: M3 is a sparse mixture-of-experts model that activates a slice of its weights per token, in the M2 lineage (229.9B total, 9.8B active across 256 fine-grained experts). MiniMax did not state M3's exact size at launch. Vendors carry the 229.9B figure forward while at least one host lists it higher, so treat the total as unconfirmed until the model card settles it.
Where M3 wins, and where it does not
All of the following figures are MiniMax-reported, and most were run on MiniMax's internal infrastructure with agent scaffolding such as Claude Code, Mini-SWE-Agent, or Terminus.
| Benchmark | M3 | Reference |
| --- | --- | --- |
| SWE-Bench Pro | 59.0% | GPT-5.5 58.6, Gemini 3.1 Pro 54.2 |
| BrowseComp | 83.5 | Opus 4.7 79.3 (history dropped past 64K) |
| Terminal-Bench 2.1 | 66.0% | Opus 4.7 ~66.1 |
| OSWorld-Verified (computer use) | 70.06% | 361 samples, max steps 200 |
| MCP Atlas (tool use) | 74.2% | n/a |
| SWE-fficiency | 34.8% | efficiency-weighted SWE |
| PostTrainBench | 0.37 | trails Opus 4.7 0.42, GPT-5.5 0.39 |
Read that table carefully, because it cuts both ways. M3's 59.0% on SWE-Bench Pro edges the GPT-5.5 and Gemini 3.1 Pro figures MiniMax reports, and 83.5 on BrowseComp tops the Opus 4.7 number it cites. On PostTrainBench it trails both. So the honest framing is not "M3 wins coding," it is "M3 reaches frontier range on the headline coding benchmark while trailing on others."
Two caveats matter most. First, the comparisons are not apples to apples: SWE-Bench Pro was run on MiniMax's own infrastructure with Claude Code as the scaffold, so the gap to GPT-5.5 can flip depending on whose scaffold you use, and independent scores from Artificial Analysis and LMArena were not out at launch. Second, MiniMax benchmarked against Opus 4.7, but Anthropic shipped Opus 4.8 about three days earlier, with a reported 69.2% on SWE-Bench Pro. Against the current Opus, M3 trails by roughly ten points, not the few points the launch framing implies.
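The size of each gap is simple subtraction, shown here with the SWE-Bench Pro figures quoted in this issue. All scores are vendor-reported and come from different scaffolds, so the differences are indicative at best.

```python
# SWE-Bench Pro scores as quoted in this issue (vendor-reported, mixed scaffolds).
m3_score = 59.0
references = {"GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2, "Opus 4.8": 69.2}

# Positive means M3 leads, negative means M3 trails.
gaps = {name: round(m3_score - score, 1) for name, score in references.items()}

for name, gap in gaps.items():
    print(f"M3 vs {name}: {gap:+.1f} points")
```

A lead of under half a point over GPT-5.5 is well inside the range a scaffold change could erase, while the deficit to the current Opus is about ten points.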
The three autonomous tasks in the launch post
More interesting than the benchmark table are three long-horizon tasks MiniMax documented with timestamps and tool counts.
1 · Reproduce an ICLR Outstanding Paper
Given the ICLR 2025 paper Learning Dynamics of LLM Finetuning, M3 ran for nearly 12 hours, produced 18 commits and 23 figures, and replicated the core experiments with no human intervention. It needed multimodal reading (curves and formulas from the PDF), long context (paper, code, and logs in one window), and long-horizon coding at once.
2 · Optimize a CUDA kernel from a broken skeleton
Starting from a non-runnable Triton skeleton with no reference implementation, M3 optimized an FP8 matrix-multiply kernel on NVIDIA Hopper over roughly 24 hours.
Reported totals: 24 hours continuous, 1,959 tool calls, Hopper peak utilization from 7.6% to 71.3%, and a 9.4× speedup.
The line to flag: most other models stopped progressing within the first 30 submissions, while Opus 4.7 and M3 were the two that kept going. M3's best solution landed on submission 145. Whether you read that as a moat or a benchmark-design quirk depends on how you score a 147-submission curve.
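The reported speedup and the utilization figures can be checked against each other. The sketch assumes both utilization numbers are measured against the same hardware peak on the same problem size, so that speedup scales with utilization; the launch post does not state this explicitly.

```python
# Figures from the launch post, as quoted in this issue.
start_util_pct = 7.6   # share of Hopper peak at the start of the run
end_util_pct = 71.3    # share of Hopper peak for the best submission

# ASSUMPTION: same peak and same workload, so speedup is the utilization ratio.
implied_speedup = end_util_pct / start_util_pct
print(f"implied speedup: {implied_speedup:.2f}x")
```

Under that assumption the ratio rounds to the 9.4× MiniMax reports, so the two figures are at least internally consistent.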
3 · Post-train four base models
Given four base models fresh out of pretraining, M3 ran the full data-synthesis, training, evaluation, and iteration cycle within 12 hours and scored 0.37 on PostTrainBench, placing third behind Opus 4.7 (0.42) and GPT-5.5 (0.39), ahead of the rest of the set. The launch post calls this "significantly ahead of all other models," which holds against the open-weight set, less so against every frontier model.
What M3 does not settle
A short list of things the launch implies but does not prove:
Hallucination. Coding and agentic scores are strong, but the post cites no factuality benchmark (TruthfulQA, HaluEval, FActScore).
Safety. No red-teaming, jailbreak, or refusal data. An open-weight model with computer use has a risk surface that is not quantified here.
Long-context retrieval. MSA optimizes cost at 1M, not accuracy at 1M. The lost-in-the-middle problem is unsolved across long-context models, and the post cites no RULER or LongBench numbers.
Production stability. M3 is two weeks old. There is no third-party production-scale data yet, so treat every benchmark as first-party until independent reproduction.
What this means if you are building
A coding agent
At promo pricing near $0.30 input and $1.20 output per million tokens, M3 runs at a fraction of closed-frontier cost, roughly one-fifteenth the input cost of Opus and far below GPT-5.5. For the hardest slice of tasks you will still want a closed-weight fallback, which means justifying the dual-model router. The 70% OSWorld number is high enough to build a real desktop agent on, after you validate on your own task distribution.
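For budgeting, the promo rates translate into a per-run cost like this. The rates are the ones quoted above; the token counts in the example are a made-up workload, not a measurement of any real agent.

```python
INPUT_USD_PER_M = 0.30    # promo input price per million tokens, as quoted above
OUTPUT_USD_PER_M = 1.20   # promo output price per million tokens, as quoted above


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one agent run at the quoted promo rates."""
    return (input_tokens / 1e6) * INPUT_USD_PER_M + (output_tokens / 1e6) * OUTPUT_USD_PER_M


# Hypothetical long coding session: 40M tokens read, 2M tokens written.
print(f"${run_cost_usd(40_000_000, 2_000_000):.2f}")
```

Agent workloads are usually input-heavy because context is re-read on every step, so the input rate tends to dominate the bill.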
A long-context application
The wall that broke 1M-context deployments was cost, not memory, and MSA changes that. The reported 9×+ prefill and 15×+ decode gains help most on RAG pipelines, long-document summarization, and multi-hour coding sessions. If your average context sits under 50K tokens, attention cost is not your bottleneck and MSA changes little for you. Either way, expect the block-partitioning plus KV-outer pattern to show up in other open models within months.
Procurement
M3 is the clearest argument yet that open-weight models can match the closed frontier on coding, context, and multimodality at the same time. The token plans ($20 for ~1.7B tokens, $50 for ~5.1B, $120 for ~9.8B per month) undercut the consumer subscriptions on a per-token basis for sustained use. The risk is single-vendor dependency and license terms, though open weights put on-prem deployment and data residency back in your control.
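The plan prices can be converted to an effective rate per million tokens. This assumes the monthly allowance is fully used and treats the approximate token counts quoted above as exact, so the results are upper bounds on value rather than typical costs.

```python
# Monthly plan price in USD mapped to the approximate token allowance quoted above.
plans = {20: 1.7e9, 50: 5.1e9, 120: 9.8e9}


def usd_per_million(price_usd: float, tokens: float) -> float:
    """Effective price per million tokens if the whole allowance is consumed."""
    return price_usd / (tokens / 1e6)


effective = {price: round(usd_per_million(price, tokens), 4)
             for price, tokens in plans.items()}

for price, rate in effective.items():
    print(f"${price} plan: ${rate} per million tokens")
```

Any unused allowance raises the effective rate proportionally, so the comparison only holds for sustained use.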
If you want to test the claims yourself
1. Read the MSA paper for the sparsity ratio at training time versus inference time. If they differ, the inference-cost claim is softer than the headline.
2. Re-run SWE-Bench Pro with your own scaffold on M3 against the current Opus. The gap is small enough that scaffold choice can move the ranking.
3. Test the 1M window on a real long-document task and measure retrieval accuracy at the 900K-token position, not at 100K.
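For the third check, a minimal needle-at-depth harness might look like the sketch below. Everything in it is an assumption for illustration: `ask_model` is a hypothetical callable you would wire to whatever client you use, context length is approximated in whitespace-separated words rather than tokens, and repeated filler is far easier than a real document, so treat a pass here as necessary rather than sufficient.

```python
from typing import Callable, Dict, Sequence


def build_prompt(n_words: int, depth: float, code: str) -> str:
    """Bury one fact at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    filler = ["filler"] * n_words
    position = int(depth * n_words)
    needle = f"The passcode is {code}."
    words = filler[:position] + [needle] + filler[position:]
    return " ".join(words) + "\nQuestion: what is the passcode? Answer with the digits only."


def accuracy_by_depth(ask_model: Callable[[str], str], n_words: int,
                      depths: Sequence[float], trials: int = 5) -> Dict[float, float]:
    """Fraction of trials where the model's answer contains the buried code."""
    results = {}
    for depth in depths:
        correct = 0
        for t in range(trials):
            # A distinct six-digit code per trial, so stale answers do not count.
            code = str(100000 + 7919 * t + int(depth * 1000))
            answer = ask_model(build_prompt(n_words, depth, code))
            correct += code in answer
        results[depth] = correct / trials
    return results
```

Run it with `depths=(0.1, 0.5, 0.9)` at the largest context you can afford and compare the deep positions against the shallow ones.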
The number to walk away with
Per-token compute at 1M context: full attention (M2 baseline) = 1, MSA (M3) = about 1/20.
At a million tokens of context, M3 does roughly one-twentieth the compute of the previous generation. Not 1.2 times faster, not 2 times faster, twenty times less compute per token. The open question is whether block selection on real key-values becomes the default long-context recipe and pushes latent compression to the margins. With the weights and the kernel both public, the rest of the field can now test that directly.
Next issue: the MSA kernel, read line by line.