Why Your AI Can't Remember What You Told It Yesterday
Google researchers discovered that deep learning has a hidden memory problem. Their solution could change how we build AI assistants, coding tools, and enterprise applications.
Current AI models like ChatGPT have a form of "digital amnesia"—they can use information in your current conversation but can't actually learn from it permanently. Google researchers discovered that neural networks are really nested memory systems operating at different speeds. By adding "medium-term memory" between fast context and frozen knowledge, they built HOPE—a model that keeps learning after deployment. This matters because it could lead to AI assistants that actually remember your preferences, coding tools that learn your codebase, and enterprise systems that improve with every interaction.
Your AI Has Amnesia (And You're Paying for It)
Here's something that might surprise you: ChatGPT, Claude, and Gemini all have a form of amnesia.
Think about how you learn. You have a conversation, extract key insights, and update your mental model. Tomorrow, you remember what you learned. AI models? They're stuck in an eternal present.
"Their knowledge is limited to either the immediate context that fits into their context window, or the knowledge in MLP layers that stores long-past, before the onset of 'end of pre-training.'"
The researchers compare this to anterograde amnesia—the condition where people can't form new long-term memories. Every conversation feels new to them, even if they just discussed the same topic five minutes ago.
Current LLMs have exactly two modes:
Short-term memory (the context window): holds your current conversation, is capped at the window length, and is completely erased when the chat ends.
Long-term memory (frozen MLP weights): everything learned during training. It never changes after deployment and can't incorporate new information.
There's nothing in between. No medium-term memory. No way to consolidate important information from conversations into persistent knowledge. This is why AI assistants keep asking you the same questions.
Everything is Memory (Including Your Optimizer)
The paper's core insight is beautifully simple: every component of a neural network is an associative memory trying to compress patterns into its parameters.
The layers? Memory. The attention mechanism? Memory. Even the optimizer that trains the model? Also memory.
The key difference between components isn't what they do—it's how often they update. Just as brain waves operate at different frequencies (delta, theta, alpha, beta, gamma), neural network components operate at different "speeds", from attention state that changes with every token to feed-forward weights that never change after pre-training.
The problem with current Transformers? They only have two speeds: "every token" and "never." There's no middle ground where the model can gradually consolidate useful patterns.
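To make the "speeds" idea concrete, here is a minimal sketch (my illustration, not code from the paper; the component names and periods are placeholders) of a loop in which each component declares how often it is allowed to update. A vanilla Transformer effectively sets every period to either 1 or infinity:

```python
# Illustrative only: imagine each component declaring how often it may update.
# In a standard Transformer the periods are effectively {1, infinity};
# Nested Learning argues the levels in between are what's missing.
UPDATE_PERIOD = {
    "attention_state": 1,           # refreshed every token (the context window)
    "mid_term_memory": 1_000,       # hypothetical middle level, absent today
    "mlp_weights": float("inf"),    # frozen at the end of pre-training
}

def should_update(component: str, step: int) -> bool:
    """A component updates only on steps its period divides (never if infinite)."""
    period = UPDATE_PERIOD[component]
    return period != float("inf") and step % period == 0

for step in range(1, 3_001):
    if should_update("attention_state", step):
        pass   # fold the new token into short-term state
    if should_update("mid_term_memory", step):
        pass   # consolidate recent patterns: the "missing middle"
    # "mlp_weights" never fires: its period is infinite
```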
The Optimizer Revelation
Here's where it gets wild. The paper proves that common optimizers like Adam and SGD with Momentum are actually memory systems themselves:
# What we thought momentum was doing:
m[t+1] = 0.9 * m[t] + 0.1 * gradient
# What it's ACTUALLY doing (solving an optimization problem):
m[t+1] = argmin_m  -⟨m, gradient⟩ + λ‖m - m[t]‖²
# Translation: Momentum is a memory learning to compress
# the history of gradients into a single vector!
This explains why the Muon optimizer works so well—it's essentially using a more expressive memory for gradient history. Better optimizer = better memory for learning signals.
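As a quick sanity check on the memory framing (a numerical illustration, not code from the paper), you can unroll the momentum recursion and confirm that m really is an exponentially weighted compression of every past gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
grads = [rng.normal(size=4) for _ in range(50)]   # a made-up gradient history

# The usual momentum recursion from the block above.
m = np.zeros(4)
for g in grads:
    m = 0.9 * m + 0.1 * g

# Unrolled: m is an exponentially weighted sum of every past gradient,
# i.e. one vector that "remembers" (compresses) the whole gradient history.
unrolled = sum(0.1 * (0.9 ** k) * g for k, g in enumerate(reversed(grads)))

assert np.allclose(m, unrolled)
```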
5 Real-World Applications That Become Possible
This isn't just academic theory. Nested Learning and the HOPE architecture unlock applications that current AI simply cannot do well:
1. Enterprise Knowledge Assistants
Imagine an AI assistant that learns your company's terminology, processes, and preferences over weeks of interaction—without expensive fine-tuning. Current systems require RAG (retrieval) for everything or periodic retraining. HOPE-style architectures could consolidate frequently-used patterns into medium-term memory, making responses faster and more accurate over time.
→ 40-60% reduction in repeated explanations
2. Coding Tools That Learn Your Codebase
Today's coding assistants treat every file independently. They don't remember that you prefer functional programming, that your team uses specific naming conventions, or that there's a utility function you wrote last week. A nested learning approach could build a persistent model of your codebase patterns that improves with every coding session.
→ Code suggestions aligned with team conventions
3. Longitudinal Healthcare AI
Healthcare AI needs to track patient context across months of interactions. Current approaches require explicit retrieval of medical records. A continuum memory system could maintain an evolving understanding of each patient's history, medications, and responses—updating its internal model as new information arrives without forgetting critical history.
→ Better longitudinal patient understanding
4. Adaptive Tutoring Systems
A tutoring AI should remember what concepts a student struggles with, what explanations clicked, and how their understanding evolves. Current systems either forget between sessions or require complex external databases. Nested learning enables genuine adaptive learning that builds a model of each student's knowledge state.
→ Learning paths that truly personalize
5. Self-Improving AI Agents
The coming wave of AI agents needs to learn from failures and successes across tasks. If an agent figures out a better way to search documentation or structure API calls, that knowledge should persist. HOPE's self-modifying architecture lets agents update their own behavior based on experience—the holy grail for autonomous systems.
→ Agents that get better at their job
HOPE: The Self-Modifying Architecture
The paper introduces HOPE—a self-referential learning module. It combines two key innovations:
Continuum Memory System
Instead of just fast attention and frozen MLP weights, HOPE uses multiple MLP blocks that update at different frequencies (a toy sketch follows this list):
Fast memory: updates every ~16 tokens. Captures immediate patterns for the current task. Like working memory.
Medium-term memory: updates every ~1,000 tokens. Consolidates repeated patterns. The "missing middle" in current architectures.
Slow memory: updates rarely (~16M tokens). Stores fundamental knowledge. Changes only when truly necessary.
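Here is the toy sketch promised above: an invented illustration of memory levels that each consolidate on their own schedule. The class name, the plain gradient-descent consolidation step, and the hyperparameters are my assumptions, not HOPE's actual design:

```python
import numpy as np

class MemoryLevel:
    """A toy MLP-style memory block that consolidates every `period` tokens.

    Illustration only: HOPE's actual update rules are richer than the plain
    gradient step used here, and the class/attribute names are invented.
    """

    def __init__(self, dim: int, period: int, lr: float, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.02, size=(dim, dim))
        self.period = period
        self.lr = lr
        self.buffer = []            # inputs seen since the last consolidation

    def read(self, x: np.ndarray) -> np.ndarray:
        self.buffer.append(x)
        return self.W @ x

    def maybe_consolidate(self, step: int, grad_fn) -> None:
        # Only fold the buffered chunk into the weights on this level's schedule.
        if step % self.period == 0 and self.buffer:
            self.W -= self.lr * grad_fn(self.W, self.buffer)
            self.buffer.clear()

# Three levels matching the frequencies quoted above: fast / medium / slow.
levels = [
    MemoryLevel(dim=64, period=16, lr=1e-2),
    MemoryLevel(dim=64, period=1_000, lr=1e-3),
    MemoryLevel(dim=64, period=16_000_000, lr=1e-4),
]
```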
Self-Modifying Updates
The model doesn't just store information—it learns how to update itself using an improved gradient rule:
# Standard gradient descent (what everyone uses)
W[t+1] = W[t] - η * gradient
# HOPE's delta-rule update
W[t+1] = W[t] * (I - x·xᵀ) - η * gradient
# The (I - x·xᵀ) term does something clever:
# It considers correlations between inputs,
# helping the memory manage its capacity better
This "delta rule" comes from classical neuroscience and helps the model avoid overwriting important information when learning new patterns—a key requirement for continual learning.
Does It Actually Work?
At 1.3B parameters trained on 100B tokens, HOPE outperforms both Transformers and recent state-of-the-art alternatives:
| Model | Wiki PPL ↓ | Benchmark Avg ↑ |
|---|---|---|
| Transformer++ | 18.53 | 52.25% |
| RetNet | 19.08 | 52.02% |
| Mamba (Samba) | 16.13 | 54.00% |
| Titans (LMM) | 15.60 | 56.82% |
| HOPE | 15.11 | 57.23% |
The full paper includes results on continual learning (where HOPE really shines), long-context reasoning, and in-context learning emergence. The complete arXiv version drops November 13.
What This Means for Practitioners
Depth Is Overrated, Frequency Is Underrated
Stacking more layers that all update on the same schedule doesn't give you more learning capability. What matters is having components that update at different timescales.
Your Optimizer Is Doing More Than You Think
Momentum, Adam, and other optimizers are learnable memory systems. Better optimizers = better memory for learning signals. Expect more research here.
The "Middle Ground" Is Missing
Current architectures lack medium-term memory. This is why RAG exists—it's a workaround. Native medium-term memory could reduce RAG dependency.
Continual Learning Is Now Architecturally Possible
With the right structure, models can learn after deployment without catastrophic forgetting. This changes the economics of AI deployment.
🎧 Listen to the Full Deep Dive
Get the 15-minute audio breakdown with implementation insights and practical takeaways.
Coming Next Week
Mixture of Experts 2.0: DeepSeek's new architecture that trains faster and runs cheaper than GPT-4. We'll break down how it works and what it means for open-source AI.
Have a paper you'd like us to cover? Reply to this email—we read every response.