Why a 70B Model Beat a 280B Model

Inside the Chinchilla paper: how DeepMind proved that GPT-3, Gopher, and nearly every other major LLM were undertrained

In March 2022, a team at DeepMind published a paper that fundamentally changed how the AI industry thinks about training large language models. The paper, titled "Training Compute-Optimal Large Language Models," made a simple but devastating claim: nearly every major LLM at the time - GPT-3, Gopher, Jurassic-1, Megatron-Turing NLG - was dramatically undertrained. These models had too many parameters and had seen too little data.

The proof was Chinchilla: a 70 billion parameter model that outperformed DeepMind's own 280 billion parameter Gopher on virtually every benchmark. Chinchilla was 4x smaller but trained on 4x more data. It used the same compute budget but allocated it differently. The implications rippled through every AI lab on the planet.

This article explains what the Chinchilla scaling laws are, how DeepMind derived them, why they matter for understanding how modern AI systems are built, and what practical lessons they hold for anyone working with or thinking about large language models.

The Problem: Everyone Was Following the Wrong Scaling Laws

In 2020, OpenAI published influential research by Kaplan et al. that established "scaling laws" for language models. These laws described how model performance improves as you increase compute, parameters, and data. The key finding that shaped the industry: when you have more compute budget, you should primarily increase model size. Specifically, Kaplan suggested that with a 10x increase in compute, model size should increase 5.5x while training tokens should only increase 1.8x.

This advice drove a parameter arms race. GPT-3 launched with 175 billion parameters trained on 300 billion tokens. Gopher scaled to 280 billion parameters, also trained on roughly 300 billion tokens. Megatron-Turing NLG pushed to 530 billion parameters trained on 270 billion tokens. The pattern was clear: labs were racing to build bigger models while keeping training data roughly constant.

DeepMind suspected this was wrong. The Kaplan scaling laws had a methodological flaw: all models were trained with the same fixed number of tokens and learning rate schedule regardless of size. This prevented the researchers from observing how the optimal training duration changes with model size. It was like measuring sprinting speed while forcing all runners to stop after 100 meters, regardless of whether they had reached their top speed.

The Research: 400 Models to Find the Truth

DeepMind ran an extensive empirical study. They trained over 400 language models ranging from 70 million to 16 billion parameters, each trained on varying amounts of data from 5 billion to 500 billion tokens. Crucially, they adjusted the learning rate schedule to match the training duration for each run - something Kaplan had not done.

They used three independent approaches to answer the same question: given a fixed compute budget, what is the optimal balance between model size and training data?

The first approach fixed model sizes and varied training tokens. For each parameter count, they trained models for different durations and extracted the minimum loss achieved at each compute level. By interpolating these curves, they could identify which model size achieved the lowest loss for any given FLOP budget.

The second approach used "IsoFLOP profiles." For nine different fixed compute budgets, they trained models of varying sizes and observed which size minimized loss. This directly answered: for a given FLOP budget, what is the optimal parameter count?
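
To make the IsoFLOP idea concrete, here is a minimal sketch in Python with hypothetical model sizes and losses (the actual study used its own measured values): at a single fixed FLOP budget, fit a parabola to loss as a function of log model size and read off the minimum. Repeating this across several budgets and fitting a power law to the resulting optima is what yields the scaling exponent.

```python
import numpy as np

# Sketch of the IsoFLOP idea at ONE fixed compute budget. The parameter
# counts and final losses below are hypothetical, for illustration only.
params = np.array([0.4e9, 1e9, 2.5e9, 6e9, 16e9])   # model sizes tried
losses = np.array([2.31, 2.21, 2.17, 2.19, 2.26])   # hypothetical final losses

# Fit a parabola to loss as a function of log10(parameters); its vertex
# estimates the compute-optimal model size for this FLOP budget.
logN = np.log10(params)
a, b, c = np.polyfit(logN, losses, deg=2)
optimal_N = 10 ** (-b / (2 * a))

print(f"estimated optimal size at this budget: ~{optimal_N / 1e9:.1f}B parameters")
```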

The third approach fit a parametric loss function to all the data. They modeled loss as a function of parameters N and data D, decomposing it into three terms: an irreducible entropy term (the theoretical minimum loss), a term capturing model capacity limitations, and a term capturing optimization suboptimality from finite data.

The Core Finding: Scale Parameters and Data Equally

All three approaches converged on the same conclusion: for compute-optimal training, model size and training tokens should scale in equal proportion. For every doubling of model size, the training data should also double - equivalently, any increase in compute should be split evenly between the two, with each growing roughly as the square root of the budget. The exponents were approximately a = 0.5 for parameters and b = 0.5 for tokens, where optimal parameters scale as compute^a and optimal tokens as compute^b.

This was dramatically different from Kaplan's recommendation of a = 0.73 and b = 0.27. The practical implications were stark:

Scaling Law Comparison

Given 10x more compute, how should you allocate it?

                        Model Size      Data           Result
Kaplan et al. (2020)    5.5x larger     1.8x more      Huge models, undertrained
Chinchilla (2022)       ~3.2x larger    ~3.2x more     Balanced, compute-optimal
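
The multipliers in the comparison above follow directly from the exponents. A minimal sketch of the arithmetic, assuming optimal parameters scale as compute^a and optimal tokens as compute^b with the values quoted in the text:

```python
def allocation(compute_multiplier, a, b):
    """How much to grow model size and data when compute grows by a given
    factor, assuming optimal parameters scale as C^a and tokens as C^b."""
    return compute_multiplier ** a, compute_multiplier ** b

# Kaplan et al. (2020): a ~ 0.73, b ~ 0.27
print(allocation(10, 0.73, 0.27))   # ~ (5.4, 1.9): much bigger model, a little more data

# Chinchilla (2022): a ~ 0.5, b ~ 0.5
print(allocation(10, 0.5, 0.5))     # ~ (3.2, 3.2): grow both roughly equally
```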

What This Meant for Existing Models

Model                  Parameters   Tokens   Status
GPT-3                  175B         300B     Undertrained
Gopher                 280B         300B     Undertrained
Jurassic-1             178B         300B     Undertrained
Megatron-Turing NLG    530B         270B     Undertrained
Chinchilla             70B          1.4T     Optimal

The ratio tells the story:

GPT-3: 175B params / 300B tokens = 0.58 params per token

Chinchilla: 70B params / 1.4T tokens = 0.05 params per token

Chinchilla used ~10x more tokens per parameter.

The paper provided concrete recommendations. For the compute budget used to train Gopher (5.76 x 10^23 FLOPs), the optimal model would be around 67 billion parameters trained on 1.5 trillion tokens - not 280 billion parameters on 300 billion tokens. A 175 billion parameter model (GPT-3's size) would need 3.7 trillion tokens to be compute-optimal. A hypothetical 1 trillion parameter model would need over 21 trillion tokens.
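
These figures line up with the common approximation that training compute is about 6 x parameters x tokens (discussed in the math section below). A quick back-of-envelope check in Python, using the Gopher budget quoted above:

```python
# Back-of-envelope check using the common approximation C ~ 6 * N * D
# (training FLOPs ~ 6 x parameters x tokens).
gopher_budget = 5.76e23   # FLOPs, as quoted above

def affordable_tokens(n_params, flops=gopher_budget):
    """Tokens you can afford for a model of n_params within the FLOP budget."""
    return flops / (6 * n_params)

print(f"{affordable_tokens(280e9):.2e}")  # ~3.4e+11 -> roughly Gopher's 300B tokens
print(f"{affordable_tokens(70e9):.2e}")   # ~1.4e+12 -> roughly Chinchilla's 1.4T tokens
```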

The Proof: Chinchilla vs Gopher

DeepMind validated their predictions by training Chinchilla, a 70 billion parameter model trained on 1.4 trillion tokens. This used the same total compute as Gopher but reallocated it according to the new scaling laws. The results were unambiguous.

On MMLU, the massive multitask language understanding benchmark, Chinchilla achieved 67.6% accuracy compared to Gopher's 60.0% - a 7.6 percentage point improvement. This even exceeded expert forecasts for what state-of-the-art would achieve by June 2023. Chinchilla reached over 90% accuracy on four individual MMLU subjects, something no previous model had accomplished.

On reading comprehension, Chinchilla improved RACE-h accuracy from 71.6% to 82.3% and RACE-m from 75.1% to 86.8%. On BIG-bench, the average accuracy jumped from 54.4% to 65.1%. On every subset of The Pile language modeling benchmark, Chinchilla outperformed Gopher. On closed-book question answering with Natural Questions, 5-shot accuracy improved from 24.5% to 31.5%.

The pattern was consistent across virtually every evaluation. A model with 4x fewer parameters, trained on 4x more data, using the same compute budget, achieved uniformly better results. This was not a marginal improvement - it represented a fundamental shift in how capable models could be built.

The Math: Understanding the Loss Function

The Chinchilla paper proposed a parametric model for how loss depends on model size and data. The formula decomposes loss into three intuitive components:

L(N, D) = E + A/N^alpha + B/D^beta

The first term E represents the entropy of natural text - the irreducible minimum loss that even a perfect model could not beat. This is the inherent unpredictability in language itself. They estimated E at approximately 1.69.

The second term A/N^alpha captures model capacity limitations. Even a perfectly trained transformer with N parameters will underperform the ideal predictor. This term shrinks as you add more parameters, but with diminishing returns governed by the exponent alpha (approximately 0.34).

The third term B/D^beta captures optimization suboptimality from limited data. You only make a finite number of optimization steps on a finite sample. This term shrinks as you train on more data, governed by beta (approximately 0.28).

The compute-optimal frontier emerges from minimizing this loss subject to a fixed compute constraint. Since compute scales as approximately 6ND (six times parameters times tokens), you are trading off between reducing the capacity term (bigger N) and the data term (bigger D). The optimal allocation depends on the ratio of the exponents, which turns out to favor roughly equal scaling.
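
As a rough illustration, here is the fitted loss model in Python. E, alpha, and beta are the values quoted above; A ~ 406 and B ~ 411 are the approximate fitted constants reported in the paper (not quoted in this article, so treat them as assumptions). Plugging in the Gopher and Chinchilla configurations shows roughly the same compute buying a lower predicted loss when it is allocated the Chinchilla way:

```python
# Sketch of the fitted loss model. E, alpha, beta are the values quoted in the
# text; A and B are the paper's approximate fitted constants (assumed here).
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def predicted_loss(n_params, n_tokens):
    """Predicted pretraining loss for n_params parameters and n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

def training_flops(n_params, n_tokens):
    """Approximate training compute, C ~ 6 * N * D."""
    return 6 * n_params * n_tokens

# Roughly the same compute budget, allocated two different ways:
for name, n, d in [("Gopher", 280e9, 300e9), ("Chinchilla", 70e9, 1.4e12)]:
    print(f"{name}: ~{training_flops(n, d):.1e} FLOPs, "
          f"predicted loss ~{predicted_loss(n, d):.2f}")
# Gopher:     ~5.0e+23 FLOPs, predicted loss ~1.99
# Chinchilla: ~5.9e+23 FLOPs, predicted loss ~1.94
```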

Why Kaplan Got It Wrong

The Kaplan scaling laws were not wrong in their observations - they accurately measured what they measured. The problem was methodological. By using a fixed learning rate schedule for all models regardless of training duration, they observed intermediate loss values rather than final converged losses for shorter training runs.

When you train a model with a cosine learning rate schedule designed for 130 billion tokens but stop at 50 billion tokens, the loss you measure is higher than what a properly scheduled 50 billion token run would achieve, because the learning rate has not yet decayed to its final value. This systematically underestimates the effectiveness of training smaller models on less data, biasing the conclusions toward larger models.

DeepMind found that setting the cosine cycle length to approximately match the training duration was critical. Overshooting by more than 25% led to noticeable performance degradation. This meant that determining optimal training length was not just about compute allocation - it was essential for achieving the best possible loss at any given scale.
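
A minimal sketch of the mismatch, with hypothetical step counts and learning rates: a cosine schedule sized for a longer run leaves the learning rate far from its final value when you stop early.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    """Cosine decay from lr_max to lr_min over total_steps."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Hypothetical run: 50B tokens at 1M tokens per step -> 50,000 steps.
steps_taken = 50_000

# Cycle matched to the run: the LR has fully decayed by the final step.
print(f"{cosine_lr(steps_taken, total_steps=50_000):.1e}")   # 3.0e-05

# Cycle sized for a 130B-token run but stopped at 50B tokens: the LR is
# still high, so the measured loss overstates what a properly scheduled
# 50B-token run would reach.
print(f"{cosine_lr(steps_taken, total_steps=130_000):.1e}")  # ~2.1e-04
```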

Additionally, Kaplan's analysis used mostly smaller models, many under 100 million parameters. DeepMind observed slight curvature in the compute-loss frontier at larger scales, meaning extrapolations from small models could diverge from actual large-scale behavior. Their analysis included models up to 16 billion parameters and focused more heavily on the 500 million+ parameter range.

Why This Matters: Smaller Models, Lower Costs, Wider Access

The Chinchilla results have profound practical implications that extend far beyond training. A compute-optimal model is not just better during training - it is better for everything that comes after.

Inference costs scale with parameter count. Every time you run a 280 billion parameter model, you need to load and process 280 billion weights. A 70 billion parameter model with equivalent capability costs roughly 4x less per query. For services handling millions of daily requests, this difference is measured in millions of dollars per year and megawatts of power consumption.

Memory requirements drop proportionally. Running Gopher requires specialized infrastructure with enough memory to hold 280 billion parameters. Chinchilla fits on more accessible hardware configurations. This democratizes deployment - smaller organizations and researchers can run capable models without hyperscale infrastructure.

Fine-tuning becomes more tractable. Adapting a 70 billion parameter model to a specific domain or task requires less compute and memory than fine-tuning a 280 billion parameter model. This accelerates development cycles and enables more specialized applications.

The environmental impact is significant. Training and running smaller models that achieve the same capability means less energy consumption. When you multiply this across all the LLM deployments worldwide, the aggregate savings are substantial.

How the Industry Responded

The Chinchilla paper landed in an industry that had been sprinting toward ever-larger models. The reception was a mixture of validation, course correction, and in some cases, denial. But the data was compelling, and labs began adapting.

Meta's LLaMA models, released in early 2023, were built directly on the Chinchilla insight that data had been undervalued: LLaMA-7B and LLaMA-13B were trained on 1 trillion tokens, and LLaMA-33B and LLaMA-65B on 1.4 trillion tokens - in fact beyond the compute-optimal ratio, a deliberate choice to favor inference efficiency. The smallest LLaMA model achieved performance comparable to GPT-3 despite being 25x smaller, precisely because it was trained on so much more data.

The open-source community embraced these findings enthusiastically. Properly trained 7 billion or 13 billion parameter models became viable alternatives to massive proprietary systems for many use cases. This catalyzed an explosion of fine-tuned and specialized models that would have been impractical with undertrained giants.

The focus shifted to data quality and quantity. Labs began investing more heavily in data collection, curation, and synthetic data generation. The realization that you need trillions of high-quality tokens to optimally train even medium-sized models changed priorities. Dataset engineering became as important as model architecture research.

Limitations and Ongoing Debates

The Chinchilla scaling laws are not the final word on optimal training. The paper itself notes several limitations that subsequent research has explored.

The analysis assumes training for less than one epoch - each data point is seen only once. What happens when you exhaust available high-quality data and must repeat? This "multi-epoch" regime may have different scaling properties. Some evidence suggests that data quality matters more than quantity in this regime, and that repeated passes over excellent data can outperform single passes over mediocre data.

The paper observed slight curvature in the scaling relationships at higher compute levels. Extrapolating to truly massive scales (10^26 FLOPs and beyond) involves uncertainty. The optimal allocation might shift as we push into unprecedented territory.

Inference cost was not part of the optimization objective. If you plan to serve a model billions of times, the total compute includes both training and inference. A slightly larger model trained on slightly less data might have higher training efficiency but worse lifetime cost when inference dominates. Some researchers argue for "inference-optimal" scaling laws that account for expected usage.
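
A rough sketch of that inference-aware accounting, using the common approximations of about 6ND FLOPs for training and about 2N FLOPs per token at inference. The configurations and serving volume below are hypothetical, and whether the smaller model actually matches the larger one's quality is exactly what inference-optimal scaling laws try to settle.

```python
# Lifetime compute = training compute + serving compute, under the common
# approximations of ~6*N*D training FLOPs and ~2*N FLOPs per served token.
def lifetime_flops(n_params, train_tokens, served_tokens):
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

served = 1e13   # hypothetical: ten trillion tokens served over the model's life

# Two configurations with a similar *training* budget (~6e23 FLOPs):
print(f"{lifetime_flops(70e9, 1.4e12, served):.1e}")   # ~2.0e+24 (Chinchilla-style 70B)
print(f"{lifetime_flops(30e9, 3.3e12, served):.1e}")   # ~1.2e+24 (smaller, over-trained 30B)
```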

The laws describe dense transformers. Mixture-of-experts models, retrieval-augmented models, and other architectural innovations have their own scaling properties. The Chinchilla findings apply most directly to the standard transformer architecture that dominated 2020-2022.

Lessons for Practitioners

If you are training models, selecting models, or trying to understand AI capabilities, the Chinchilla paper offers several actionable insights.

First, parameter count alone tells you little about capability. A 7 billion parameter model trained on 1 trillion tokens can outperform a 70 billion parameter model trained on 100 billion tokens. When evaluating models, ask about training data quantity and quality, not just size.

Second, compute budgets should be allocated thoughtfully. If you have a fixed training budget, use the Chinchilla ratios as a starting point: roughly 20 tokens per parameter for compute-optimal training. A 1 billion parameter model should train on approximately 20 billion tokens.
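
A tiny helper capturing that rule of thumb, with training compute approximated as 6 x parameters x tokens; the 20-tokens-per-parameter ratio is a heuristic starting point, not a precise law.

```python
# Rule-of-thumb budget: ~20 training tokens per parameter, C ~ 6 * N * D.
def chinchilla_budget(n_params, tokens_per_param=20):
    tokens = tokens_per_param * n_params
    flops = 6 * n_params * tokens
    return tokens, flops

tokens, flops = chinchilla_budget(1e9)   # a 1B-parameter model
print(f"~{tokens:.0e} tokens, ~{flops:.1e} training FLOPs")  # ~2e+10 tokens, ~1.2e+20 FLOPs
```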

Third, learning rate schedules matter enormously. Match your cosine cycle to your intended training duration. Using a schedule designed for longer training will hurt final performance. This is an easy mistake to make when adapting code from larger runs.

Fourth, data investment pays off. Building or acquiring high-quality training datasets is not a one-time cost but a continuous competitive advantage. The labs with the best data pipelines can train better models at any compute level.

Fifth, smaller can be better. For deployment, a properly trained smaller model often beats an undertrained larger one while costing less to run. Do not assume that the biggest available model is the best choice for your application.

Quick Reference: Compute-Optimal Training Tokens

Based on the Chinchilla analysis, here are the approximate token counts needed for compute-optimal training at various model sizes:

Model Size     Optimal Tokens   Tokens per Parameter
400 Million    8 Billion        20x
1 Billion      20 Billion       20x
10 Billion     200 Billion      20x
67 Billion     1.5 Trillion     ~22x
175 Billion    3.7 Trillion     ~21x
280 Billion    5.9 Trillion     ~21x
1 Trillion     21 Trillion      ~21x

Rule of thumb: ~20 tokens per parameter for compute-optimal training.

Note: These assume single-epoch training on high-quality data. Multi-epoch or lower-quality data may shift the optimal balance.

The Takeaway

The Chinchilla paper is a case study in how empirical rigor can overturn conventional wisdom. For two years, the AI industry followed scaling laws that systematically misallocated compute toward oversized, undertrained models. A careful experimental study with proper methodology revealed a fundamentally different and more efficient path.

The core insight is simple but profound: capability comes from the interaction of model capacity and training signal. A model needs to be large enough to represent complex patterns, but it also needs to see enough data to learn those patterns. Neglecting either side leaves performance on the table.

For practitioners, the message is clear: do not be seduced by parameter counts. A well-trained smaller model beats an undertrained larger one. Data quality and quantity deserve as much attention as architecture. And when planning training runs, balance your compute allocation thoughtfully rather than defaulting to "bigger is better."

The Chinchilla scaling laws did not end the story - they opened new chapters about data efficiency, multi-epoch training, and inference-aware optimization. But they established a foundation that continues to shape how the most capable AI systems are built. Understanding them is essential context for anyone trying to make sense of modern machine learning.

