ResearchAudio.io · Issue No. 14
GPT-3 Was 11x Too Big.
In 2020, OpenAI told the world that bigger models were better. In 2022, DeepMind proved the rule was wrong.

GPT-3 had 175 billion parameters trained on 300 billion tokens. Under the corrected math, it should have been about 15 billion parameters, roughly 11 times smaller.

That correction reshaped the industry. Then a family of follow-up laws between 2023 and 2026 corrected the correction, and test-time compute opened a fifth dimension nobody had measured before. Here is the full story.

What a scaling law actually is

A scaling law is a recipe. Double one ingredient, your cake gets a predictable amount better. Same amount, every time, across thousands of experiments.

The three ingredients in a language model:

Parameters (N): the tunable dials inside the model. GPT-3 had 175 billion.
Data (D): training text, measured in tokens (roughly three-quarters of a word each).
Compute (C): total math operations during training, measured in FLOPs.

A scaling law connects these to loss, a measure of how wrong the model is on average. Lower loss = better next-token predictions.

Era 1: Kaplan 2020. The discovery.

In January 2020, Jared Kaplan and a team at OpenAI trained dozens of language models at different sizes, data amounts, and compute budgets, then plotted everything.

The result was a clean curve. Loss followed a power law in each of the three variables, holding across more than seven orders of magnitude: the relationship looked the same whether the model had a million parameters or a billion.

Kaplan compared this to the ideal gas law in physics. Loss, parameters, and data appeared to follow a universal relationship across model scales the way pressure, volume, and temperature do for a gas.

Then came the practical conclusion. With a fixed compute budget, Kaplan estimated that the best strategy was to build very large models and train them on a relatively small amount of data.
The optimal number of parameters scaled with compute to the power of roughly 0.73, while data scaled to only 0.27. Translation: when your budget goes up, mostly make the model bigger.

This is the rule that justified GPT-3. Bigger is better, mostly.

Era 2: Chinchilla 2022. The correction.

Two years later, Jordan Hoffmann and a team at DeepMind ran the experiment again, bigger and more careful: around 400 language models from 70M to over 16B parameters, trained on sets from 5B to 500B tokens.

Their conclusion was simple and uncomfortable. For compute-optimal training, model size and training tokens should scale at the same rate, roughly 20 tokens per parameter. Double the parameters, double the data.

By that math, the giants of 2020 to 2022 were undertrained. GPT-3 with 175B parameters and 300B tokens. Gopher with 280B parameters and 300B tokens. Megatron-Turing NLG with 530B parameters. All of them had way more parameters than their training data could feed.

To prove it, DeepMind trained Chinchilla on Gopher's compute budget but with 70B parameters instead of 280B, and 1.4 trillion tokens instead of 300 billion. It outperformed Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on nearly every benchmark.
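To make the two allocation rules concrete, here is a minimal sketch of what each one does with a 10x larger compute budget. It rests on two assumptions that are not stated above: "scale at the same rate" is read as equal exponents of 0.5 for parameters and data, and training cost is approximated with the common rule of thumb C ≈ 6 × N × D. The function names are illustrative.

```python
def scale_allocation(compute_multiplier, param_exponent, data_exponent):
    """Return (parameter growth, data growth) for a given growth in compute."""
    return (compute_multiplier ** param_exponent,
            compute_multiplier ** data_exponent)


def training_flops(params, tokens):
    """Rule-of-thumb training cost (assumption): about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


KAPLAN = (0.73, 0.27)     # exponents quoted in the text
CHINCHILLA = (0.5, 0.5)   # assumption: "same rate" read as equal exponents

kaplan_n, kaplan_d = scale_allocation(10, *KAPLAN)
chin_n, chin_d = scale_allocation(10, *CHINCHILLA)
print(f"10x compute, Kaplan:     params x{kaplan_n:.2f}, data x{kaplan_d:.2f}")
print(f"10x compute, Chinchilla: params x{chin_n:.2f}, data x{chin_d:.2f}")

# Sanity check on "Gopher's compute budget": both runs should cost about the same.
gopher_flops = training_flops(280e9, 300e9)
chinchilla_flops = training_flops(70e9, 1.4e12)
print(f"Gopher ~{gopher_flops:.2e} FLOPs, Chinchilla ~{chinchilla_flops:.2e} FLOPs")
```

Under Kaplan's exponents, a 10x budget buys a model about 5.4 times larger on under 2 times the data. Under the equal split, both grow by about 3.2 times. The rule-of-thumb costs for Gopher and Chinchilla land within about 20 percent of each other, which is consistent with the "same budget" framing.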
This is where the 11x number comes from. If GPT-3 had stuck to its 300B token budget, the Chinchilla rule says it should have been roughly 15B parameters. Instead it was 175B, a bit over 11 times bigger than optimal (175 / 15 ≈ 11.7). Read it the other way: to make a 175B model compute-optimal, you would need 3.5 trillion training tokens, the same factor of roughly 11.7 more than GPT-3 actually saw.
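The arithmetic behind that number fits in a few lines. This is a sketch that treats the 20 tokens-per-parameter figure as an exact constant, which it is not; the function names are made up for illustration.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted in the text


def optimal_params(tokens):
    """Parameter count the 20:1 rule suggests for a given token budget."""
    return tokens / TOKENS_PER_PARAM


def optimal_tokens(params):
    """Token count the 20:1 rule suggests for a given parameter count."""
    return params * TOKENS_PER_PARAM


gpt3_params, gpt3_tokens = 175e9, 300e9

right_sized = optimal_params(gpt3_tokens)      # model size for 300B tokens
oversize_factor = gpt3_params / right_sized    # how much bigger GPT-3 was
tokens_needed = optimal_tokens(gpt3_params)    # data a 175B model would want
data_shortfall = tokens_needed / gpt3_tokens   # how much more data that is

print(f"Right-sized model: {right_sized / 1e9:.0f}B parameters")
print(f"GPT-3 was {oversize_factor:.1f}x that size")
print(f"Tokens needed at 175B: {tokens_needed / 1e12:.1f}T ({data_shortfall:.1f}x more)")
```

Both ratios are the same number seen from two sides, because the rule is a single fixed ratio between parameters and tokens.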
Era 3: The post-Chinchilla family (2023 to 2026).

Chinchilla answered one question: given a fixed training budget, how should you split it between parameters and data?

Real model builders quickly noticed that this question was incomplete. What if you do not have enough fresh text? What if inference cost matters? What if you train in low precision? What if your model is sparse? What if your users do not speak English?

Between 2023 and 2026, five major refinements landed. None of them invalidate Chinchilla. They each add a dimension that Chinchilla held fixed. The most recent one comes from DeepMind itself, returning to scaling laws four years after Chinchilla.

Take Muennighoff first. Across more than 400 training runs, the team found that training on the same data for up to 4 epochs produces almost the same loss as fresh data. Past 16 epochs, returns collapse: each repeated token retains only about 63 percent of fresh-token value.

Then Sardana, Frankle, and later Gadre pointed out the obvious. Chinchilla optimizes training cost, but real models live on inference cost. A model serving billions of queries pays for training once and for inference forever. The fix: train smaller models on much more data. Gadre trained 100 models at token-to-parameter ratios from 20 to 640, and the loss curve stayed clean across the whole range.

That theory is now standard practice. Meta's Llama-3-8B trained at roughly 2000 tokens per parameter. Google's Gemma-2 series trained past 1000:1. Both are deliberately far from Chinchilla-optimal because both will run for billions of inference calls.

Kumar and colleagues added precision. Training in FP8 instead of BF16 reduces the model's effective parameter count. A 70B model at low precision is no longer really 70B in representational capacity.
Their law suggests larger models in lower precision can sometimes be compute-optimal, with one unsettling corollary: for a heavily overtrained low-precision model, more pretraining data eventually hurts post-quantization quality.

Finally, architecture. Mixture-of-Experts models route each token to a small subset of experts, so total parameters far exceed active ones. A wave of 2024 to 2025 papers extended Chinchilla to MoE: optimal ratios depend on expert count, routing granularity, and whether memory or compute is the binding constraint.

And in January 2026, DeepMind came back.

Four years after Chinchilla, the same lab released ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining. It is the first scaling law that explicitly models the language mixture as a training variable. Most prior work, including Chinchilla, was English-only. ATLAS extends the framework to where over half of real AI users actually live.

The scale is striking: 774 training runs from 10M to 8B parameters, on 400+ languages, evaluated on 48. The output is a transfer matrix for 1,400 language pairs.

Positive transfer tracks shared script or family: Norwegian benefits from Swedish and German, Arabic from Hebrew. English, French, and Spanish are broadly useful sources. Transfer is asymmetric: A can help B more than B helps A.

The practical rule, in numbers:
ATLAS also produced a budget-aware rule for a common decision: train a new model from scratch on your target language, or fine-tune from a strong multilingual checkpoint. For a 2B parameter model, the crossover point is between roughly 144 billion and 283 billion training tokens depending on the language. Below that budget, fine-tune. Above it, pretraining from scratch finishes ahead. The threshold scales predictably with model size.
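Before moving on, the inference-cost argument from earlier in this era is worth putting in numbers. The sketch below compares lifetime compute for two hypothetical models and finds how many tokens must be served before the smaller, overtrained one comes out cheaper. It assumes the common approximations of about 6 × N × D FLOPs for training and about 2 × N FLOPs per generated token for inference, neither of which is stated above. Both model configurations are invented for illustration, and the code says nothing about whether the two would reach the same quality.

```python
def training_flops(params, tokens):
    """Rule-of-thumb training cost (assumption): ~6 FLOPs per parameter per token."""
    return 6 * params * tokens


def inference_flops(params, tokens_served):
    """Rule-of-thumb inference cost (assumption): ~2 FLOPs per parameter per token."""
    return 2 * params * tokens_served


def lifetime_flops(params, train_tokens, tokens_served):
    """Training paid once plus inference paid for every token served."""
    return training_flops(params, train_tokens) + inference_flops(params, tokens_served)


def break_even_tokens_served(big, small):
    """Tokens served at which the small model's lifetime cost matches the big one's.

    `big` and `small` are (params, train_tokens) pairs. Returns 0.0 when the
    small model is already cheaper to train.
    """
    (n_big, d_big), (n_small, d_small) = big, small
    if n_small >= n_big:
        raise ValueError("small model must have fewer parameters than big model")
    extra_training = training_flops(n_small, d_small) - training_flops(n_big, d_big)
    saving_per_token = 2 * (n_big - n_small)
    return max(0.0, extra_training / saving_per_token)


# Hypothetical configurations: a 70B model at 20 tokens per parameter,
# and an 8B model at 2000 tokens per parameter.
big_model = (70e9, 1.4e12)
small_model = (8e9, 16e12)

break_even = break_even_tokens_served(big_model, small_model)
print(f"Small model is cheaper after ~{break_even / 1e12:.2f}T tokens served")
```

In this made-up comparison the extra training bill is repaid after roughly 1.5 trillion served tokens, which a popular model can pass quickly. That is the shape of the argument for overtraining small models.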
Era 4: Test-time compute. The new frontier.

Kaplan and Chinchilla are both about training: how to spend compute before the model ever talks to a user. Starting in late 2024, a new scaling law showed up: what happens after the model is trained.

OpenAI's o1 showed the pattern. Letting the model think longer at inference time produced predictable accuracy gains on reasoning tasks. Not from a bigger model, not from more training data. Just from more thinking.

In January 2025, DeepSeek-R1 reproduced the effect at scale. On the AIME math benchmark, accuracy went from 15.6 percent to 71 percent through extended chain-of-thought reasoning, reaching 86.7 percent with majority voting. Same model. Just more tokens per answer.

A Stanford paper called s1 took it further: a 32B model finetuned on only 1,000 reasoning examples, plus budget forcing (suppress the stop token, append "Wait" to keep the model thinking). The result: s1-32B exceeds o1-preview by up to 27 percent on competition math, and budget forcing alone pushes its AIME24 score from 50 to 57 percent.
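The budget-forcing loop described above can be sketched in a few lines. This is a toy illustration, not the s1 authors' implementation: `generate` is a hypothetical stand-in for a model call that returns text up to the point where the model would stop, whitespace-separated words stand in for tokens, and the `max_extensions` cap is an added safety limit.

```python
from typing import Callable


def budget_force(generate: Callable[[str], str], prompt: str,
                 min_thinking_words: int, max_extensions: int = 4,
                 nudge: str = " Wait") -> str:
    """Extend a reasoning trace until it reaches a minimum length.

    Each time the (hypothetical) model stops before the budget is met,
    the stop is ignored, `nudge` is appended, and generation resumes.
    """
    trace = generate(prompt)
    extensions = 0
    while len(trace.split()) < min_thinking_words and extensions < max_extensions:
        trace += nudge                     # stand-in for suppressing the stop token
        trace += generate(prompt + trace)  # let the model keep thinking
        extensions += 1
    return trace


def toy_model(text: str) -> str:
    """Toy stand-in for a model: always 'thinks' for three words, then stops."""
    return " step step step"


trace = budget_force(toy_model, "Q: 2+2?", min_thinking_words=10)
print(trace)
```

With a real model the interesting part is what the model writes after each "Wait". The toy model only shows the control flow: stop early, get nudged, continue until the budget is met.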
Test-time scaling has two flavors. Sequential: longer chain-of-thought. Parallel: sample many answers, pick the best via voting or verification. The economic shift is significant: Introl projects inference compute demand will exceed training compute by 118x in 2026. The industry is moving from training-heavy to thinking-heavy.
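The parallel flavor is simple to sketch when the selection step is plain majority voting. The sample answers below are invented, and a real system would first need to normalize answers so that equivalent ones compare equal.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the share of samples that agreed with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Eight hypothetical samples from the same model on the same question.
samples = ["42", "41", "42", "42", "17", "42", "41", "42"]
answer, share = majority_vote(samples)
print(f"Voted answer: {answer} ({share:.0%} of samples)")
```

Ties go to whichever answer appeared first in the sample list, which is how `Counter.most_common` orders equal counts. A verifier-based selector would replace the vote with a scoring step.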
In production: Meta's Muse Spark, April 2026

Two weeks ago, Meta Superintelligence Labs released Muse Spark, its first model under Alexandr Wang. The blog post is unusual because Meta openly framed the release as a scaling-laws story.

The official title is "Scaling Towards Personal Superintelligence." Muse Spark is described as "the first step on our scaling ladder." More usefully, the post defines three scaling axes Meta tracks: pretraining, reinforcement learning, and test-time reasoning. The pretraining headline alone is striking: their new recipe reaches Llama 4 Maverick capability with over 10 times less compute.

Each axis is a clean instance of the eras above.

The RL axis is newer territory. Meta plots two curves against RL training steps: pass@1 (single-attempt accuracy) and pass@16 (one success in 16 attempts). Both grow log-linearly, on training and held-out data. Pass@16 growing alongside pass@1 means the model is improving reliability without collapsing its reasoning diversity. That is a non-trivial finding. Many RL recipes sharpen at the cost of mode collapse.

The test-time finding is the most novel. Trained with a length penalty during RL, Muse Spark shows a phase transition on AIME: it first improves by thinking longer, then collapses into a compressed reasoning mode, then extends again at higher capability. The first published account of thought compression emerging from RL optimization.

At deployment, Muse Spark's "Contemplating mode" scales test-time compute through parallel agents rather than longer chains, merging multiple agents' answers. It reaches 58 percent on Humanity's Last Exam and 38 percent on FrontierScience Research, in range of Gemini Deep Think and GPT Pro modes. Wall-clock latency stays comparable while total compute per query rises.
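For readers who have not worked with pass@1 and pass@16, here is a minimal sketch of how such numbers are commonly computed. It uses the widely used combinatorial pass@k estimator; Meta's post does not say how its curves were estimated, so this is an assumption, and the per-problem counts are invented.

```python
from math import comb


def pass_at_k(n, c, k):
    """Chance that at least one of k attempts is correct, given c correct out of n sampled.

    Computed as 1 - C(n - c, k) / C(n, k): one minus the chance that
    all k attempts are drawn from the n - c incorrect ones.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems, each sampled n times."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical benchmark: 4 problems, 16 attempts each, correct counts per problem.
correct_counts = [4, 0, 16, 1]
p1 = mean_pass_at_k(correct_counts, n=16, k=1)
p16 = mean_pass_at_k(correct_counts, n=16, k=16)
print(f"pass@1 = {p1:.3f}, pass@16 = {p16:.3f}")
```

The gap between the two numbers is the point. A problem solved once in 16 attempts barely moves pass@1 but counts fully toward pass@16, so pass@16 falls when a model stops exploring varied solutions.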
Which scaling law actually affects you?

The eras above describe the science. Here is the practical version. Find the row that describes you and see which law you actually need to care about.
Quick hits

Kaplan and Chinchilla were not really fighting. Pearce and Song's 2024 replication showed most of the gap came from Kaplan counting only non-embedding parameters and analyzing at smaller scale. Re-run at the same conditions, the Chinchilla coefficients hold.

Data quality bends the ratio. Bi et al. 2024 showed that higher-quality data lets you train compute-optimally at lower tokens-per-parameter. Chinchilla's 20:1 assumes web-text quality. Curated data shifts the optimum down.

Test-time has its own architecture. Sequential scaling means longer chain-of-thought. Parallel scaling means many samples plus voting or verification. Recent 2025 work shows the two combine sub-additively, not multiplicatively.

A calculator for your own project. The paid archive includes a downloadable spreadsheet with Chinchilla-optimal, inference-aware overtraining, and ATLAS multilingual formulas. Plug in your model size, get the recommended training mix.

The take

Here is the part nobody talks about. Scaling laws are empirical, not theoretical. Nobody has derived them from first principles; they are tight fits to expensive experiments, which is why they keep getting revised. Treat the eras as a sequence of better measurements, not as truth.

The pattern is not "we keep being wrong." It is "each generation measures one more variable the previous one held fixed." The next paper will add another.

The open question

Test-time scaling assumes the model has enough latent reasoning ability to unlock by thinking longer. But 2025 work shows smaller distilled models stop improving past a certain thinking length while larger ones keep going. Is there a pretraining scaling law for test-time-scalability itself? If so, training compute and inference compute are no longer independent choices.

Reply if you have seen good work here. I would like to read it.