The $4.6M Model That Beat GPT-5
How Kimi-K2 Thinking scored 44.9% on "Humanity's Last Exam" (vs GPT-5's 41.7%), costs 10x less to use, and proves the AI frontier is now open-source
Bottom Line Up Front
Moonshot AI's Kimi-K2 Thinking beats GPT-5 on Humanity's Last Exam (44.9% vs 41.7%) while costing just $4.6M to train—proving efficient architecture beats hundred-million-dollar budgets.
The model executes 200-300 sequential tool calls without drift, runs natively at INT4 with 2x speed and zero accuracy loss, and costs 10x less than GPT-5's API ($0.09 vs $1.25 per million input tokens). The AI frontier just became collaborative, not proprietary.
A Thinking Agent That Reasons While Acting
Traditional AI models think, then act, then think again. K2 Thinking fundamentally breaks this pattern by interleaving chain-of-thought reasoning with function calls in real-time. It thinks about algorithm logic while executing code. It adjusts search strategies based on result quality. It maintains coherent goals across 200-300 consecutive tool calls—capabilities previous models couldn't approach.
The system exposes a separate reasoning_content stream showing its internal thought process transparently. Developers can observe reasoning traces, debug decisions, and understand failures—essential for production deployments.
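The interleaved think-act loop described above can be sketched in a few lines. Here `search` and `plan_next_step` are hypothetical stubs standing in for a real tool backend and the model's reasoning step; this is an illustration of the control flow, not Moonshot's implementation.

```python
def search(query: str) -> str:
    """Stub tool: stands in for a real search backend."""
    return f"results for {query!r}"

def plan_next_step(history: list) -> dict:
    """Stub reasoning step: decide whether to call a tool or finish."""
    if len(history) < 4:
        return {"thought": f"step {len(history)}: need more evidence",
                "tool": "search", "args": {"query": f"query {len(history)}"}}
    return {"thought": "enough evidence; answer now", "tool": None}

history = []
while True:
    step = plan_next_step(history)            # think...
    history.append(step["thought"])
    if step["tool"] is None:                  # ...until ready to answer
        break
    history.append(search(**step["args"]))    # ...then act, and think again

print(len(history))  # thoughts interleaved with tool results
```

The key property is that each tool result feeds back into the next reasoning step, which is what lets a model adjust strategy mid-task instead of committing to a fixed plan upfront.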
Architecture: Massive Yet Efficient
1.04 trillion parameters with a Mixture-of-Experts design that activates only 32 billion per inference. This sparse activation delivers the capacity of a trillion-parameter model at roughly the serving cost of a 32B dense model. The 256K token context (doubled from K2-Instruct's 128K) enables complex multi-turn reasoning while maintaining full history.
| Specification | Value |
| --- | --- |
| Total Parameters | 1.04 Trillion |
| Activated Parameters | 32 Billion |
| MoE Experts | 384 (8 selected per token) |
| Context Window | 256K tokens |
| Quantization | Native INT4 (QAT) |
| Training Cost | $4.6 Million |
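The sparse routing behind these numbers can be illustrated with a toy top-k router. The 384-expert, 8-selected figures match the published config; the scores are random stand-in data, since the real router is a learned network.

```python
import random

NUM_EXPERTS, TOP_K = 384, 8                  # K2's published MoE config
random.seed(0)

def route(scores, k=TOP_K):
    """Return indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

scores = [random.random() for _ in range(NUM_EXPERTS)]   # stand-in router logits
active = route(scores)

# Only ~2% of experts run per token, which is where the efficiency comes from.
print(len(active), round(len(active) / NUM_EXPERTS, 4))
```

In a real MoE layer each selected expert is a feed-forward block and the outputs are combined with router weights, but the top-k selection above is the core of why only 32B of the 1.04T parameters are touched per token.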
Three Training Breakthroughs That Changed Everything
1. Quantization-Aware Training (QAT)
Most models train in full precision and then quantize afterward, which costs accuracy. K2 was trained directly at INT4 during post-training, yielding a 2x inference speedup with no reported accuracy loss. All benchmarks are reported under INT4: honest real-world performance, not inflated research numbers.
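The fake-quantization step at the heart of QAT can be sketched framework-free: during the forward pass, weights are snapped to the 4-bit grid so training optimizes values that survive low-precision storage. The -8..7 range is the standard signed INT4 grid; the scale and weights below are illustrative.

```python
def fake_quant_int4(w: float, scale: float) -> float:
    """Round a weight to the nearest representable INT4 value (-8..7) * scale."""
    q = max(-8, min(7, round(w / scale)))     # quantize and clamp to 4-bit range
    return q * scale                          # dequantize back to float

scale = 0.1
weights = [0.33, -0.81, 0.07, 1.5]            # 1.5 exceeds the grid and clamps
quantized = [fake_quant_int4(w, scale) for w in weights]
print(quantized)
```

Because the rounding happens inside training (with a straight-through gradient in practice), the model learns weights that sit comfortably on the INT4 grid, which is why K2 can report "lossless" numbers at 4-bit.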
2. MuonClip Optimizer
Training trillion-parameter models usually hits loss "spikes" that waste millions in compute. MuonClip caps attention logits before they can explode. Result: K2 pre-trained on 15.5T tokens with zero training spikes.
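The idea of capping attention logits can be sketched as a simple rescale: when the largest logit exceeds a threshold, scale the projections down so it cannot blow up. This only illustrates the clipping concept; Moonshot's actual optimizer math differs in detail.

```python
def qk_clip_scale(logits, tau=100.0):
    """Scale factor that caps the largest |logit| at tau (1.0 if already under)."""
    peak = max(abs(x) for x in logits)
    return min(1.0, tau / peak) if peak > 0 else 1.0

spiking = [12.0, 250.0, -30.0]                # one attention logit exploding
gamma = qk_clip_scale(spiking)
clipped = [x * gamma for x in spiking]

print(gamma, max(abs(x) for x in clipped))    # peak capped at tau
```

Applied to the query/key projections during training, a rescale like this keeps softmax inputs in a stable range, which is what prevents the loss spikes that plague trillion-parameter runs.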
3. RL from Verifiable Rewards
LLMs generated diverse tool-calling trajectories, each scored against automatically checkable rewards. The model learned through trial-and-error reinforcement learning rather than by mimicking human demonstrations; this is the same class of breakthrough behind OpenAI's o1, now open-source.
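A verifiable reward replaces a learned preference model with a programmatic correctness check. The toy arithmetic task below is a stand-in, since K2's actual task suite and reward code are not public; the point is that the reward is computed, not judged.

```python
def verifiable_reward(problem: tuple, answer: int) -> float:
    """1.0 if the trajectory's final answer matches ground truth, else 0.0."""
    a, b = problem
    return 1.0 if answer == a + b else 0.0

# (problem, model's final answer) pairs from three sampled trajectories
trajectories = [((2, 3), 5), ((7, 8), 14), ((10, 1), 11)]
rewards = [verifiable_reward(p, ans) for p, ans in trajectories]
print(rewards)  # [1.0, 0.0, 1.0]
```

Because the check is exact, the RL signal cannot be gamed the way a learned reward model can, which is what makes large-scale trial-and-error training on math and tool-use tasks tractable.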
Benchmark Dominance: K2 vs The World
On Humanity's Last Exam (designed to be "Google-proof"), K2 achieves 44.9% with tools—beating GPT-5 (41.7%), Grok-4 (41.0%), and Claude 4.5 (32.0%). In heavy mode: 51.0%, tied with Grok-4 and ahead of GPT-5.
On BrowseComp (autonomous web research), K2 scores 60.2%—destroying GPT-5's 54.9% and more than doubling Claude's 24.1%. On mathematics, K2 hits 99.1% on AIME 2025 with Python, 95.1% on HMMT, and 78.6% on IMO AnswerBench.
| Benchmark | K2 | GPT-5 | Claude 4.5 |
| --- | --- | --- | --- |
| HLE (tools) | **44.9%** | 41.7% | 32.0% |
| BrowseComp | **60.2%** | 54.9% | 24.1% |
| AIME 2025 (Python) | 99.1% | 99.6% | **100%** |
| SWE-bench Verified | 71.3% | 74.9% | **77.2%** |
| LiveCodeBench V6 | 83.1% | **87.0%** | 64.0% |
| τ²-Bench Telecom | **93.0%** | — | — |

Best score per row in bold. All K2 results at INT4 precision, temperature = 1.0.
Getting Started with K2 Thinking
K2 uses an OpenAI-compatible API format. The key feature is the separate reasoning_content field, which exposes the model's internal thought process alongside the final answer.
```python
import openai

client = openai.OpenAI(
    api_key="your_key",
    base_url="https://api.moonshot.cn/v1",
)

response = client.chat.completions.create(
    model="kimi-k2-thinking",
    messages=[
        {"role": "system", "content": "You are Kimi AI."},
        {"role": "user", "content": "Which is bigger: 9.11 or 9.9?"},
    ],
    temperature=1.0,
    max_tokens=4096,
)

print("Answer:", response.choices[0].message.content)
# reasoning_content is Moonshot's extension to the OpenAI response schema
print("Reasoning:", response.choices[0].message.reasoning_content)
```
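Since a live API call cannot run offline, the handling of the two streams can be shown against a mocked response object shaped like the payload above. `split_streams` is an illustrative helper, not part of any SDK.

```python
from types import SimpleNamespace

def split_streams(response):
    """Separate the user-facing answer from the internal reasoning trace."""
    msg = response.choices[0].message
    return msg.content, getattr(msg, "reasoning_content", None)

# Mocked response mirroring the structure the API returns.
mock = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(
    content="9.9 is bigger.",
    reasoning_content="Compare 9.11 vs 9.90: 0.90 > 0.11, so 9.9 wins."))])

answer, trace = split_streams(mock)
print(answer)
print(trace)
```

Using `getattr` with a default keeps the same code working against providers that omit the reasoning field, which matters if you swap base URLs between OpenAI-compatible endpoints.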
Deployment Options
| Option | Details |
| --- | --- |
| Cloud API | platform.moonshot.ai ($0.09/M input, $0.90/M output) |
| Self-Hosted | vLLM, SGLang, or KTransformers; 594GB model size (INT4) |
| Apple Silicon | MLX on Mac Studio M3 Ultra; ~15 tokens/sec reported |
Real-World Applications
Research Workflows: Execute complete literature reviews with hundreds of searches, paper retrievals, and cross-references—autonomously.
Software Development: 71.3% on SWE-bench Verified means it auto-repairs real GitHub issues. Works across Python, JavaScript, Java, C++.
Financial Analysis: 47.4% on FinSearchComp-T3 shows sophisticated numerical reasoning. Can analyze markets, retrieve data, synthesize insights.
Customer Service: 93% on τ²-Bench Telecom. Handles multi-step troubleshooting without escalation to humans.
Why This Changes Everything
The Cost Equation Flipped: K2 trained for $4.6M. GPT-5 likely cost $500M+. K2 achieves superior performance at <1% of the cost. This isn't incremental—it's a paradigm shift.
Test-Time Scaling Validated: Former OpenAI VP Bob McGrew identified three AI pillars: pre-training, post-training, test-time scaling. K2 proves the third pillar works. Models can "think harder" at inference rather than requiring exponentially more training compute.
Geopolitics: Chinese labs (DeepSeek, Moonshot, Qwen) release frontier models despite chip export restrictions. Nvidia CEO Jensen Huang: "China is nanoseconds behind America in AI." K2 proves resource constraints drive innovation.
Open vs Closed: The gap narrowed from years to months. Reasoning capabilities once exclusive to proprietary systems are now accessible to anyone. The question isn't whether open models can compete—it's why pay 10x more for similar performance?
The Frontier Is Now Collaborative
K2 Thinking's 44.9% on HLE, 60.2% on BrowseComp, and near-perfect math scores prove $4.6M in smart training beats hundred-million-dollar budgets. Three innovations made this possible: INT4 quantization-aware training (2x speed, zero loss), interleaved reasoning + tool use (200-300 calls without drift), and efficient MoE (32B active of 1T total).
The gap between open and closed models has collapsed. Reasoning capabilities once locked behind proprietary APIs are now accessible, modifiable, and deployable by anyone. The future Bob McGrew predicted—test-time scaling as AI's third pillar—is here. And it's open-source.
The $4.6M model that beat GPT-5 isn't just a technical win—it's proof the AI frontier became collaborative. And that changes everything.
Get Started Now
Model Weights: Hugging Face
API Access: platform.moonshot.ai
Try It: kimi.com
License: Modified MIT (commercially friendly)
ResearchAudio.io
Deep Technical Analysis for AI Professionals