
GPT-3 Was 11x Too Big. Here's the Math.

ResearchAudio.io · Issue No. 14

From GPT-3's overshoot to DeepMind's 2026 return. Four eras of scaling laws, plainly explained.

The 60-second version

Scaling laws are the recipes that tell AI labs how to build models. In 2020, OpenAI said bigger always wins. In 2022, DeepMind proved that wrong: GPT-3 was 11 times too big for its training data. Since then, a family of corrections has shown the optimal recipe depends on data supply, inference cost, precision, architecture, and language mix. In 2024, a fifth lever appeared: letting the model think longer at the moment of answering also makes it smarter.

7 orders of magnitude · power-law range (Kaplan)
20:1 tokens per parameter · Chinchilla optimum
2000:1 tokens per parameter · Llama-3-8B, overtraining

In 2020, OpenAI told the world that bigger models were better. In 2022, DeepMind proved the rule was wrong. GPT-3 had 175 billion parameters trained on 300 billion tokens. Under the corrected math, it should have been about 15 billion parameters, roughly 11 times smaller.

That correction reshaped the industry. Then a family of follow-up laws between 2023 and 2026 corrected the correction, and test-time compute opened a fifth dimension nobody had measured before. Here is the full story.

What a scaling law actually is

A scaling law is a recipe. Double one ingredient, your cake gets a predictable amount better. Same amount, every time, across thousands of experiments. The three ingredients in a language model:

Parameters (N): the tunable dials inside the model. GPT-3 had 175 billion.

Data (D): training text, measured in tokens (roughly three-quarters of a word each).

Compute (C): total math operations during training, measured in FLOPs.

A scaling law connects these to loss, a measure of how wrong the model is on average. Lower loss = better next-token predictions.

Era 1: Kaplan 2020. The discovery.

In January 2020, Jared Kaplan and a team at OpenAI trained dozens of language models at different sizes, data amounts, and compute budgets, then plotted everything.

The result was a clean curve. Loss followed a power law in each of the three variables, with some trends holding across more than seven orders of magnitude. The relationship looked the same whether the model had a million parameters or a billion.
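A power law has one defining property: each doubling of an ingredient multiplies loss by the same factor, wherever you start. Here is a minimal sketch, assuming the single-variable form L(N) = (N_c / N)^alpha. The constants are illustrative placeholders, not fitted values from the paper.

```python
def power_law_loss(n_params: float, n_c: float = 1e14, alpha: float = 0.08) -> float:
    """Loss as a pure power law in parameter count (illustrative constants)."""
    return (n_c / n_params) ** alpha

# Doubling N multiplies loss by 2**-alpha, no matter how big N already is.
for n in (1e6, 1e9, 1e12):
    ratio = power_law_loss(2 * n) / power_law_loss(n)
    print(f"N={n:.0e}: loss ratio after doubling = {ratio:.4f}")
```

With these placeholder constants every doubling trims loss by a little over 5 percent, at a million parameters and at a trillion alike. That scale-independence is what makes the curve usable as a planning tool.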

Kaplan compared this to the ideal gas law in physics. Loss, parameters, and data appeared to follow a universal relationship across model scales the way pressure, volume, and temperature do for a gas.

Then came the practical conclusion. With a fixed compute budget, Kaplan estimated that the best strategy was to build very large models and train them on a relatively small amount of data.

The optimal number of parameters scaled with compute to the power of roughly 0.73, while data scaled to only 0.27. Translation: when your budget goes up, mostly make the model bigger.
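To see what those exponents mean in practice, here is a small sketch that turns a compute increase into parameter and data multipliers. The function name is hypothetical; the exponents are the ones quoted above.

```python
def scale_allocation(compute_multiplier: float, n_exp: float, d_exp: float) -> tuple[float, float]:
    """Return (parameter multiplier, data multiplier) for a given compute increase."""
    return compute_multiplier ** n_exp, compute_multiplier ** d_exp

# Kaplan-style split: most of the extra budget goes into model size.
n_mult, d_mult = scale_allocation(10.0, n_exp=0.73, d_exp=0.27)
print(f"10x compute: {n_mult:.2f}x parameters, {d_mult:.2f}x data")
```

Under this split, ten times the compute buys a model a bit over five times larger but less than twice the data.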

This is the rule that justified GPT-3. Bigger is better, mostly.

Era 2: Chinchilla 2022. The correction.

Two years later, Jordan Hoffmann and a team at DeepMind ran the experiment again, bigger and more careful: around 400 language models from 70M to over 16B parameters, trained on sets from 5B to 500B tokens.

Their conclusion was simple and uncomfortable. For compute-optimal training, model size and training tokens should scale at the same rate, roughly 20 tokens per parameter. Double the parameters, double the data.

By that math, the giants of 2020 to 2022 were undertrained: GPT-3 with 175B parameters and 300B tokens, Gopher with 280B parameters and 300B tokens, Megatron-Turing NLG with 530B parameters. All of them had way more parameters than their training data could feed.

To prove it, DeepMind trained Chinchilla on Gopher's compute budget but with 70B parameters instead of 280B, and 1.4 trillion tokens instead of 300 billion. It outperformed Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on nearly every benchmark.
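The "same compute" claim can be sanity-checked with a common rule of thumb: training costs about 6 FLOPs per parameter per token (C ≈ 6ND). That approximation is an assumption brought in here, not something stated above, so treat the result as a rough check.

```python
FLOPS_PER_PARAM_TOKEN = 6  # assumed rule of thumb: C ~= 6 * N * D

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for a dense model."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

gopher = training_flops(280e9, 300e9)
chinchilla = training_flops(70e9, 1.4e12)
print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"ratio:      {chinchilla / gopher:.2f}")
```

Under this approximation the two budgets land within about 17 percent of each other: a quarter of the parameters, nearly five times the tokens, roughly the same bill.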

Same compute. Very different choices.
Comparing GPT-3 to its Chinchilla-optimal twin

Model | Parameters | Tokens trained
GPT-3 (actual) | 175 B | 300 B
Gopher (actual) | 280 B | 300 B
Chinchilla (optimal) | 70 B | 1,400 B

The ratio that changed everything: 20 training tokens for every parameter. GPT-3's actual ratio was about 1.7. Chinchilla's was exactly 20.
Source: Hoffmann et al. 2022, Training Compute-Optimal Large Language Models

This is where the 11x number comes from. If GPT-3 had stuck to its 300B token budget, the Chinchilla rule says it should have been roughly 15B parameters. Instead it was 175B, about 11 times bigger than optimal. Read it the other way: to make a 175B model compute-optimal, you would need 3.5 trillion training tokens, roughly 12 times more than GPT-3 actually saw.
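The arithmetic is short enough to run yourself. A minimal sketch using only the 20:1 rule and the GPT-3 figures quoted above; the function names are hypothetical.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb

def optimal_params(n_tokens: float) -> float:
    """Compute-optimal model size for a given token budget."""
    return n_tokens / TOKENS_PER_PARAM

def optimal_tokens(n_params: float) -> float:
    """Compute-optimal token budget for a given model size."""
    return n_params * TOKENS_PER_PARAM

gpt3_params, gpt3_tokens = 175e9, 300e9

ideal_params = optimal_params(gpt3_tokens)   # model size the data could support
ideal_tokens = optimal_tokens(gpt3_params)   # data the model size would need

print(f"actual tokens per parameter: {gpt3_tokens / gpt3_params:.2f}")
print(f"optimal size for 300B tokens: {ideal_params / 1e9:.0f}B")
print(f"oversize factor: {gpt3_params / ideal_params:.1f}x")
print(f"tokens needed for 175B: {ideal_tokens / 1e12:.1f}T")
print(f"token shortfall factor: {ideal_tokens / gpt3_tokens:.1f}x")
```

Both directions give the same factor, 175 divided by 15, or about 11.7. The "11 times" and "12 times" in the text are that one number rounded two ways.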

Key insight: Kaplan and Chinchilla are not really in conflict. A 2024 replication by Pearce and Song traced most of the disagreement to a measurement choice. Kaplan counted only non-embedding parameters and analyzed at smaller scale. The Chinchilla coefficients hold up. Practitioners now treat 20 tokens per parameter as the canonical compute-optimal ratio.

In practice

This is why models got smaller after 2022. The Chinchilla rule rewrote everyone's playbook: most flagship models built after 2022 are smaller than GPT-3 but trained on far more text. Claude, Gemini, GPT-4, Llama 2 and 3, Mistral are all downstream of Chinchilla. If today's chatbots feel more capable per dollar than GPT-3, this is most of the reason.

Era 3: The post-Chinchilla family (2023 to 2026).

Chinchilla answered one question: given a fixed training budget, how should you split it between parameters and data? Real model builders quickly noticed that this question was incomplete. What if you do not have enough fresh text? What if inference cost matters? What if you train in low precision? What if your model is sparse? What if your users do not speak English?

Between 2023 and 2026, five major refinements landed. None of them invalidate Chinchilla. They each add a dimension that Chinchilla held fixed. The most recent one comes from DeepMind itself, returning to scaling laws four years after Chinchilla.

Take Muennighoff first. Across more than 400 training runs, the team found that training on the same data for up to 4 epochs produces almost the same loss as fresh data. Past 16 epochs, returns collapse: each repeated token retains only about 63 percent of fresh-token value.
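One way to picture the diminishing value of repeated data is an exponential decay in the worth of each extra pass. The sketch below is an assumption for illustration: it uses a saturating-exponential form with a decay constant of 15 repetitions, chosen because it reproduces both figures quoted above. It is not the paper's fitted law.

```python
import math

R_STAR = 15.0  # assumed decay constant, in repetitions (epochs beyond the first)

def effective_tokens(unique_tokens: float, epochs: int) -> float:
    """Fresh-token equivalent of `epochs` passes over the same unique data."""
    repeats = epochs - 1
    return unique_tokens * (1 + R_STAR * (1 - math.exp(-repeats / R_STAR)))

def avg_repeat_value(epochs: int) -> float:
    """Average worth of one repeated token relative to a fresh one (epochs >= 2)."""
    if epochs < 2:
        raise ValueError("need at least one repeated pass")
    repeats = epochs - 1
    return R_STAR * (1 - math.exp(-repeats / R_STAR)) / repeats

for epochs in (2, 4, 16, 40):
    efficiency = effective_tokens(1.0, epochs) / epochs
    print(f"{epochs:>2} epochs: overall efficiency {efficiency:.2f}, "
          f"avg repeated-token value {avg_repeat_value(epochs):.2f}")
```

Under this toy curve, 4 epochs still deliver over 90 percent of what fresh data would, and at 16 epochs the average repeated token is worth about 0.63 of a fresh one.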

Then Sardana, Frankle, and later Gadre pointed out the obvious. Chinchilla optimizes training cost, but real models live on inference cost. A model serving billions of queries pays training once and inference forever. The fix: train smaller models on much more data. Gadre trained 100 models at token-to-parameter ratios from 20 to 640 and the loss curve stayed clean across the whole range.

That theory is now standard practice. Meta's Llama-3-8B trained at roughly 2000 tokens per parameter. Google's Gemma-2 series trained past 1000:1. Both are deliberately far from Chinchilla-optimal because both will run for billions of inference calls.
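A back-of-envelope version of the inference argument, assuming two common rules of thumb that are not stated above: training costs about 6 FLOPs per parameter per token, and serving costs about 2 FLOPs per parameter per generated token.

```python
def training_flops(n_params: float, train_tokens: float) -> float:
    return 6 * n_params * train_tokens    # assumed: forward plus backward pass

def inference_flops(n_params: float, served_tokens: float) -> float:
    return 2 * n_params * served_tokens   # assumed: forward pass only

n_params = 8e9
train_tokens = 2000 * n_params            # ~2000 tokens per parameter, as quoted above

# Served tokens at which cumulative inference cost equals the training cost.
break_even = training_flops(n_params, train_tokens) / (2 * n_params)
print(f"training tokens:          {train_tokens:.1e}")
print(f"break-even served tokens: {break_even:.1e}")
print(f"inference cost at break-even: {inference_flops(n_params, break_even):.2e} FLOPs")
```

Under these assumptions the break-even point is always three times the training token count, whatever the model size. Past it, every extra token served costs compute in proportion to N, which is the case for shrinking N and spending the savings on more training data.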

Kumar and colleagues added precision. Training in FP8 instead of BF16 reduces the model's effective parameter count. A 70B model at low precision is no longer really 70B in representational capacity. Their law suggests larger models in lower precision can sometimes be compute-optimal, with one unsettling corollary: for a heavily overtrained low-precision model, more pretraining data eventually hurts post-quantization quality.

Finally architecture. Mixture-of-Experts models route each token to a small subset of experts, so total parameters far exceed active ones. A wave of 2024 to 2025 papers extended Chinchilla to MoE: optimal ratios depend on expert count, routing granularity, and whether memory or compute is the binding constraint.
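The gap between total and active parameters is easy to make concrete. The configuration below is hypothetical, chosen for round numbers, and the split into "shared" and "per-expert" parameters is a simplification.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    shared_params: float      # attention, embeddings, etc. used by every token
    params_per_expert: float  # one expert's feed-forward weights, summed over layers
    num_experts: int
    experts_per_token: int    # top-k routing

    @property
    def total_params(self) -> float:
        """What has to fit in memory."""
        return self.shared_params + self.num_experts * self.params_per_expert

    @property
    def active_params(self) -> float:
        """What each token actually computes with."""
        return self.shared_params + self.experts_per_token * self.params_per_expert

cfg = MoEConfig(shared_params=2e9, params_per_expert=1e9, num_experts=64, experts_per_token=2)
print(f"total:  {cfg.total_params / 1e9:.0f}B")
print(f"active: {cfg.active_params / 1e9:.0f}B")
```

Memory scales with the total, per-token compute with the active count. A tokens-per-parameter ratio therefore means something different depending on which N you plug in, and on whether memory or compute is the constraint you hit first.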

And in January 2026, DeepMind came back.

Four years after Chinchilla, the same lab released ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining. It is the first scaling law that explicitly models the language mixture as a training variable. Most prior work, including Chinchilla, was English-only. ATLAS extends the framework to where over half of real AI users actually live.

The scale is striking: 774 training runs from 10M to 8B parameters, on 400+ languages, evaluated on 48. The output is a transfer matrix for 1,400 language pairs. Positive transfer tracks shared script or family: Norwegian benefits from Swedish and German, Arabic from Hebrew. English, French, and Spanish are broadly useful sources. Transfer is asymmetric: A can help B more than B helps A.

The practical rule, in numbers:

The ATLAS doubling rule
To double the number of supported languages, scale model and data by these factors:

1.18× model parameters · a mild capacity tax
1.66× total training data · the binding constraint
83% data per language · vs single-language baseline

Less data per language, but positive cross-lingual transfer offsets the loss. The capacity tax is small enough that adding languages is usually worth it.
Source: Longpre & Ebrahimi et al., ATLAS, Google DeepMind, ICLR 2026
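The doubling rule compounds, so it can be turned into a quick planning estimate. A minimal sketch: the function name is hypothetical, and applying the per-doubling factors across several (or fractional) doublings is an extrapolation assumed here, not something the rule above guarantees.

```python
import math

PARAM_FACTOR = 1.18  # per doubling of languages, from the rule above
DATA_FACTOR = 1.66

def scale_for_languages(base_langs: int, target_langs: int) -> dict[str, float]:
    """Multipliers for going from base_langs to target_langs supported languages."""
    growth = target_langs / base_langs
    doublings = math.log2(growth)
    data_mult = DATA_FACTOR ** doublings
    return {
        "doublings": doublings,
        "param_multiplier": PARAM_FACTOR ** doublings,
        "data_multiplier": data_mult,
        "data_per_language": data_mult / growth,  # relative to the starting point
    }

plan = scale_for_languages(base_langs=10, target_langs=40)
for name, value in plan.items():
    print(f"{name}: {value:.2f}")
```

One doubling reproduces the 83 percent figure directly, since 1.66 divided by 2 is 0.83. Going from 10 to 40 languages is two doublings: about 1.39 times the parameters and 2.76 times the data.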

ATLAS also produced a budget-aware rule for a common decision: train a new model from scratch on your target language, or fine-tune from a strong multilingual checkpoint. For a 2B parameter model, the crossover point is between roughly 144 billion and 283 billion training tokens depending on the language. Below that budget, fine-tune. Above it, pretraining from scratch finishes ahead. The threshold scales predictably with model size.

Key insight: The Chinchilla 20:1 ratio is a special case, not a universal truth. It is the answer when you have unlimited unique data, no inference cost, full precision, a dense model, and English as your only language. Change any of those assumptions and the optimal ratio moves. DeepMind's return with ATLAS makes the lineage explicit: each generation of scaling laws relaxes one assumption the previous generation held fixed.

In practice

This is why your phone can now run AI offline. Llama 3 8B fits on consumer hardware because Meta deliberately ignored the Chinchilla rule, training a smaller model on far more data (about 2,000 tokens per parameter instead of 20). Gemma 2 on Android, Phi on Windows, Llama on a MacBook: the post-Chinchilla family is why small models stopped being dumb.

Era 4: Test-time compute. The new frontier.

Kaplan and Chinchilla are both about training, how to spend compute before the model ever talks to a user. Starting in late 2024, a new scaling law showed up: what happens after the model is trained.

OpenAI's o1 showed the pattern. Letting the model think longer at inference time produced predictable accuracy gains on reasoning tasks. Not from a bigger model, not from more training data. Just from more thinking.

In January 2025, DeepSeek-R1 reproduced the effect at scale. On the AIME 2024 math benchmark, the RL-only R1-Zero variant went from 15.6 percent to 71 percent as reinforcement learning taught it to write longer chains of thought, reaching 86.7 percent with majority voting. Same base model, no bigger network. Just more tokens per answer.

A Stanford paper called s1 took it further: a 32B model finetuned on only 1,000 reasoning examples, plus budget forcing (suppress the stop token, append "Wait" to keep the model thinking). The result: s1-32B exceeds o1-preview by up to 27 percent on competition math, and budget forcing alone pushes its AIME24 score from 50 to 57 percent.
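Budget forcing is simple enough to sketch as a loop. The version below is a toy under stated assumptions: `step` is a hypothetical stand-in for a model call, the stop marker is a made-up string, and length is counted in whitespace-separated words rather than real tokens.

```python
from typing import Callable

def budget_forced_generate(
    step: Callable[[str], str],
    prompt: str,
    min_thinking_words: int,
    stop_marker: str = "<stop>",
    nudge: str = " Wait",
    max_nudges: int = 4,
) -> str:
    """Keep the model thinking until the trace is long enough or nudges run out.

    `step` takes the text so far and returns a continuation ending in `stop_marker`.
    """
    trace = ""
    nudges = 0
    while True:
        chunk = step(prompt + trace)
        trace += chunk.removesuffix(stop_marker)   # suppress the stop
        if len(trace.split()) >= min_thinking_words or nudges >= max_nudges:
            return trace
        trace += nudge                             # push the model to continue
        nudges += 1

def toy_step(text: str) -> str:
    # Hypothetical model: always emits the same short thought, then tries to stop.
    return " let me check that again." + "<stop>"

out = budget_forced_generate(toy_step, "Q: 2+2?", min_thinking_words=12)
print(out)
```

The toy model tries to stop after five words each time. The loop overrides it twice before the 12-word budget is met, so the final trace contains two "Wait" nudges and no stop marker.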

Four scaling eras of modern AI
What you scale, and what gets better when you do

2020 · Kaplan: Scale model size hard, data lightly. What gets better: loss vs N, D, C.
2022 · Chinchilla: Scale model and data equally. 20:1. What gets better: loss at optimal N:D.
2023-26 · The family: Data, inference, precision, sparsity, language mix. DeepMind returns. What gets better: loss under real constraints.
2024+ · Test-time: Scale thinking, not training. What gets better: accuracy vs thinking tokens.

Synthesis: Kaplan 2020 · Hoffmann 2022 · Muennighoff/Sardana/Gadre/Kumar 2023-24 · OpenAI o1, DeepSeek-R1, s1 2024-25

Test-time scaling has two flavors. Sequential: longer chain-of-thought. Parallel: sample many answers, pick the best via voting or verification.
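The parallel flavor fits in a few lines. A minimal sketch of majority voting, where `noisy_solver` is a hypothetical stand-in for sampling a model and its 60 percent accuracy is invented for illustration.

```python
import random
from collections import Counter
from typing import Callable

def majority_vote(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples answers and return the most common one."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

rng = random.Random(0)

def noisy_solver() -> str:
    # Hypothetical model: right 60% of the time, otherwise a scattered wrong answer.
    return "42" if rng.random() < 0.6 else rng.choice(["40", "41", "43", "44"])

print("1 sample:  ", majority_vote(noisy_solver, n_samples=1))
print("33 samples:", majority_vote(noisy_solver, n_samples=33))
```

Voting helps when wrong answers scatter and right answers agree. If the model is confidently wrong in the same way every time, more samples buy nothing, which is where verification comes in.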

The economic shift is significant: Introl projects inference compute demand will exceed training compute by 118x in 2026. The industry is moving from training-heavy to thinking-heavy.

In practice

This is what you are paying for when ChatGPT, Claude, or Gemini "thinks" before answering. Reasoning models generate huge internal monologues you never see, sometimes 10 to 50 times more tokens than the visible answer. Deep Research, Claude's extended thinking, OpenAI's o-series: test-time scaling, sold per query. ChatGPT Pro at $200 a month exists because thinking longer literally costs more.

In production: Meta's Muse Spark, April 2026

Two weeks ago, Meta Superintelligence Labs released Muse Spark, its first model under Alexandr Wang. The blog post is unusual because Meta openly framed the release as a scaling-laws story. The official title is "Scaling Towards Personal Superintelligence." Muse Spark is described as "the first step on our scaling ladder."

More usefully, the post defines three scaling axes Meta tracks: pretraining, reinforcement learning, and test-time reasoning. The pretraining headline alone is striking: their new recipe reaches Llama 4 Maverick capability with over 10 times less compute. Each axis is a clean instance of the eras above.

The RL axis is newer territory. Meta plots two curves against RL training steps: pass@1 (single-attempt accuracy) and pass@16 (one success in 16 attempts). Both grow log-linearly, on training and held-out data. Pass@16 growing alongside pass@1 means the model is improving reliability without collapsing its reasoning diversity. That is a non-trivial finding. Many RL recipes sharpen at the cost of mode collapse.
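For readers who have not met pass@k before, here is the combinatorial estimator commonly used on code benchmarks. Nothing above says how Meta computed its curves, so treat this as the standard definition, not their method.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, of which c were correct.

    Probability that at least one of k attempts drawn from the n is correct.
    """
    if n - c < k:
        return 1.0  # not enough failures to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 8 correct answers out of 64 samples.
print(f"pass@1:  {pass_at_k(n=64, c=8, k=1):.3f}")
print(f"pass@16: {pass_at_k(n=64, c=8, k=16):.3f}")
```

Pass@1 is plain accuracy. Pass@16 rewards a model for having at least one good path among many. If RL collapses the model onto a single reasoning style, pass@1 can rise while pass@16 stalls, which is why watching both curves is informative.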

The test-time finding is the most novel. Trained with a length penalty during RL, Muse Spark shows a phase transition on AIME: it first improves by thinking longer, then collapses into a compressed reasoning mode, then extends again at higher capability. The first published account of thought compression emerging from RL optimization.

At deployment, Muse Spark's "Contemplating mode" scales test-time compute through parallel agents rather than longer chains, merging multiple agents' answers. It reaches 58 percent on Humanity's Last Exam and 38 percent on FrontierScience Research, in range of Gemini Deep Think and GPT Pro modes. Wall-clock latency stays comparable while total compute per query rises.

Key insight: Muse Spark is interesting independent of its benchmark numbers because it is the first frontier-lab release to publicly diagram its three scaling axes and fit a scaling law to each. The "scaling ladder" is no longer a research artifact. It is now the explicit operating system of a frontier lab.

Which scaling law actually affects you?

The eras above describe the science. Here is the practical version. Find the row that describes you, see which law you actually need to care about.

Find your row
Different audiences, different laws to track

If you are | Law that matters most | Why it matters to you
A daily user of ChatGPT, Claude, or Gemini | Era 4: test-time compute | You pay for thinking time. Higher tiers purchase longer thinking per query.
A startup building on top of an LLM | Era 3: inference-aware (overtraining) | Choose providers whose models are cheap and fast per call. Llama 3, Gemma 2, Mistral owe their economics to overtraining.
Training your own model at a small lab | Era 3: data-constrained + precision | You will not have unlimited fresh text. Repeat carefully (up to 4 epochs carries almost no penalty). Train in lower precision to fit your budget.
A frontier-lab researcher | All five at once | Pretraining, post-Chinchilla refinements, test-time, RL compute, and multilingual all compete for the same compute budget.
Building a product for non-English users | ATLAS (DeepMind, 2026) | The 1.18x params, 1.66x data per-doubling rule is the only public scaling guide for multilingual models.
An investor or AI strategist | The whole evolution | Whichever axis a lab emphasizes most shapes its product strategy. Watch the axis to predict the moat.

Quick hits

Kaplan and Chinchilla were not really fighting. Pearce and Song's 2024 replication showed most of the gap came from Kaplan counting only non-embedding parameters and analyzing at smaller scale. Re-run at the same conditions, the Chinchilla coefficients hold.

Data quality bends the ratio. Bi et al. 2024 showed that higher-quality data lets you train compute-optimally at lower tokens-per-parameter. Chinchilla's 20:1 assumes web-text quality. Curated data shifts the optimum down.

Test-time has its own architecture. Sequential scaling means longer chain-of-thought. Parallel scaling means many samples plus voting or verification. Recent 2025 work shows the two combine sub-additively, not multiplicatively.

A calculator for your own project. The paid archive includes a downloadable spreadsheet with Chinchilla-optimal, inference-aware overtraining, and ATLAS multilingual formulas. Plug in your model size, get the recommended training mix.

The take

Here is the part nobody talks about. Scaling laws are empirical, not theoretical. Nobody has derived them from first principles; they are tight fits to expensive experiments, which is why they keep getting revised. Treat the eras as a sequence of better measurements, not as truth. The pattern is not "we keep being wrong." It is "each generation measures one more variable the previous one held fixed." The next paper will add another.

The open question

Test-time scaling assumes the model has enough latent reasoning ability to unlock by thinking longer. But 2025 work shows smaller distilled models stop improving past a certain thinking length while larger ones keep going. Is there a pretraining scaling law for test-time-scalability itself? If so, training compute and inference compute are no longer independent choices. Reply if you have seen good work here. I would like to read it.
