The AI Self-Improvement Revolution
How models are learning to learn
October 17, 2025 • Technical Deep Dive • 12 min read
⚡ Bottom Line Up Front
AI models are now generating their own training data, deciding how they want to learn, and maximizing human capability instead of just completing tasks. These aren't incremental improvements—they're fundamental shifts in how AI systems evolve.
Here's something that should make you uncomfortable:
A 7-billion-parameter AI model just created training materials that taught it better than data generated by GPT-4.1, a model many times its size.
The student became its own teacher. And it did it better than the expert.
This isn't science fiction. It's happening right now in MIT labs, and the implications cascade far beyond just "better AI."
🧬 SEAL: When AI Learns to Study
THE PAPER
Self-Adapting Language Models (SEAL) • MIT • June 2025
The Problem: Language models are static. Give them new information, and they consume it "as-is" through finetuning. No personalization. No optimization. They're like students forced to memorize textbooks word-for-word instead of taking notes.
The Breakthrough: SEAL transforms how models incorporate new knowledge. Instead of passive consumption, the model generates "self-edits"—instructions that specify BOTH the data format AND the optimization parameters for its own weight updates.
How It Actually Works (sketched in code after the list):
- Input Phase: You feed SEAL a passage of text (like a Wikipedia article)
- Generation Phase: The model produces "implications"—restatements, questions, logical consequences—restructuring the content into learnable chunks
- Optimization Phase: It finetunes on these self-generated materials using LoRA (Low-Rank Adaptation)
- Reward Signal: Performance on questions about the passage (without access to the original text) determines success
- Reinforcement Loop: The system learns which self-edits produce the best learning outcomes
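Here's a minimal sketch of that loop in Python. Every helper is a hypothetical stub of mine: the real system uses an actual LM to propose self-edits, LoRA for the inner finetuning, and an RL update that reinforces edits which score well.

```python
# Toy sketch of SEAL's outer loop. Helper names are illustrative stubs,
# not the paper's code.
import random

def generate_self_edits(passage: str, n: int = 4) -> list[str]:
    """Stub: the model would rewrite the passage into implications,
    questions, and restatements. Placeholders stand in here."""
    return [f"Implication {i} of: {passage[:40]}..." for i in range(n)]

def finetune_and_eval(self_edit: str, questions: list[str]) -> float:
    """Stub: LoRA-finetune on the self-edit, then score held-out QA
    accuracy WITHOUT the original passage in context."""
    return random.random()  # random reward stands in for real accuracy

def seal_step(passage: str, questions: list[str]) -> tuple[str, float]:
    # 1) Sample candidate self-edits, 2) reward each by downstream QA
    # accuracy, 3) keep the best as a positive example for the policy.
    candidates = generate_self_edits(passage)
    scored = [(edit, finetune_and_eval(edit, questions)) for edit in candidates]
    best_edit, best_reward = max(scored, key=lambda pair: pair[1])
    # In SEAL, (passage -> best_edit) pairs reinforce the edit generator.
    return best_edit, best_reward

if __name__ == "__main__":
    edit, reward = seal_step(
        "The Eiffel Tower was completed in 1889 ...",
        ["When was the Eiffel Tower completed?"],
    )
    print(f"selected self-edit: {edit!r}  reward={reward:.2f}")
```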
The Results That Matter:
- Base model accuracy: 32.7% on SQuAD questions (no context)
- Training on raw passage: 33.5% (barely any improvement)
- Training on GPT-4.1-generated synthetic data: 46.3%
- SEAL (self-generated data after RL): 47.0%
💡 Practical Application: Imagine corporate knowledge bases that continuously reorganize themselves for optimal learning. Your AI assistant doesn't just read your company wiki—it transforms it into the exact format that makes it easiest to internalize and apply.
Why This Changes Everything: We're approaching a data wall; projections suggest frontier models will have consumed essentially all publicly available human text by around 2028. SEAL shows a path forward: models that create their OWN high-utility training signal by learning how to restructure information.
Future implication: Models that continuously improve themselves in production, adapting to new domains without waiting for human-curated training data.
🤝 Empower: AI That Makes YOU Stronger
THE PAPER
Training LLM Agents to Empower Humans • UC Berkeley / Stanford • October 2025
The Hidden Problem: Current AI assistants optimize for task completion. They want to finish the job. But true assistance isn't about doing everything FOR you—it's about expanding YOUR capabilities.
Think about it: A coding assistant that writes all your code makes you dependent. One that teaches you patterns and steps back for critical design decisions makes you a better engineer.
The Empowerment Principle:
Empower measures "effective empowerment": the human's ability to effect desired changes in their environment. It's rooted in information theory, where empowerment is the channel capacity (the maximum mutual information) between the actions an agent takes and the states that result.
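In standard notation (the textbook definition of empowerment; this exact formula isn't quoted from the paper), the empowerment of a state $s$ is the channel capacity from the human's actions $A$ to the resulting states $S'$:

$$\mathcal{E}(s) = \max_{p(a)} I(A; S' \mid s) = \max_{p(a)} \big[ H(S' \mid s) - H(S' \mid A, s) \big]$$

An assistant that maximizes this quantity gives the human more reachable futures per action, instead of collapsing everything into one completed task.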
The Technical Implementation:
Traditional methods rely on mimicking expert behavior or RL finetuning on inferred rewards. Both push agents to complete tasks independently. Empower flips this (see the toy sketch after this list):
- Uses only offline text data (no costly human feedback)
- Trains agents to maximize the human's channel capacity for affecting outcomes
- Agents learn to autocomplete LOW-empowerment text (predictable boilerplate)
- Leaves HIGH-empowerment decisions (creative, strategic) to the human
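As a toy illustration (my own simplification, using next-token entropy as a crude stand-in for the paper's learned empowerment estimate), the autocomplete-or-defer decision looks like this:

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy in bits of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def should_autocomplete(next_token_probs: list[float],
                        threshold_bits: float = 1.0) -> bool:
    # Low entropy: predictable boilerplate, safe for the agent to write.
    # High entropy: a genuine decision point, so leave it to the human.
    return entropy(next_token_probs) < threshold_bits

boilerplate = [0.97, 0.02, 0.01]      # e.g. the tail of an import line
design_choice = [0.3, 0.3, 0.2, 0.2]  # e.g. which architecture to pick
print(should_autocomplete(boilerplate))    # True  -> agent completes it
print(should_autocomplete(design_choice))  # False -> defer to the human
```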
🎯 Real-World Use Cases:
- Code Generation: Writes repetitive imports and function templates, stops before architectural decisions
- Research Assistance: Gathers sources and summarizes, lets you synthesize insights
- Data Analysis: Handles data cleaning and basic visualizations, defers to you for interpretation
- Writing Assistance: Suggests structure and fixes grammar, preserves your voice and key ideas
Results: In an 18-person user study on code generation tasks, Empower-trained assistants doubled the Pass@1 rate (from ~30% to 60%+) compared to baseline agents, without any explicit human feedback during training.
The insight: The best AI tools don't replace human capability—they amplify it strategically.
🎨 Verbalized Sampling: Unlocking Creativity
THE PAPER
Verbalized Sampling: Mitigating Mode Collapse • Stanford • October 2025
The Mode Collapse Problem: After RLHF training, models become noticeably less creative. Ask for a joke and you'll get the same predictable structure. Request a story and it follows the same narrative arc. This isn't a bug in the algorithm—it's a bug in the DATA.
The Root Cause—Typicality Bias: Human annotators unconsciously favor familiar text. It's cognitive psychology at work: mere-exposure effect, processing fluency, schema congruity. When training on preference data, models learn to collapse to "safe" outputs that feel familiar.
The Solution—Stunningly Simple:
Instead of asking "Tell me a joke about coffee," ask "Generate 5 jokes about coffee with their probabilities."
Ready-to-Use Prompt Template
Generate 5 responses to the user query, each within a separate <response> tag.
Each <response> must include the response text and a numeric probability estimate.
Randomly sample responses from the full distribution of possibilities.
[Your actual query here]
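If you'd rather script it than paste it into a chat UI, here's a minimal sketch using the OpenAI Python SDK (the model name and plain-string prompt packaging are illustrative; any chat-capable model works):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VS_TEMPLATE = (
    "Generate 5 responses to the user query, each within a separate "
    "<response> tag. Each <response> must include the response text and "
    "a numeric probability estimate. Randomly sample responses from the "
    "full distribution of possibilities.\n\nQuery: {query}"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; swap in any chat model
    messages=[{
        "role": "user",
        "content": VS_TEMPLATE.format(
            query="Write a marketing headline for cold-brew coffee"
        ),
    }],
)
print(resp.choices[0].message.content)  # five tagged responses w/ probabilities
```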
Why It Works:
Different prompts collapse to different modes:
- Instance-level prompt: Collapses to single most typical response
- List-level prompt: Generates related items uniformly
- Distribution-level prompt (VS): Approximates the pretraining distribution, recovering full diversity
Measured Impact:
- 1.6-2.1× improvement in creative writing diversity
- Maintains factual accuracy and safety
- Works across ALL models (GPT, Claude, Gemini, Llama)
- Zero training required—pure prompting strategy
- More capable models benefit MORE from VS
🔥 Immediate Applications:
- Marketing Copy: Generate diverse ad variations for A/B testing
- Synthetic Training Data: Create varied examples for fine-tuning
- Social Simulation: Model human opinion diversity accurately
- Creative Writing: Explore narrative possibilities
- Brainstorming: Generate genuinely different ideas, not variations on a theme
Try it right now. Copy the prompt template above into ChatGPT and watch the difference.
🔬 Three More Breakthroughs
📄 Paper2Agent: Research Papers as AI Agents
The Problem: You find a breakthrough paper. Great! Now spend 6 hours setting up dependencies, configuring environments, interpreting undocumented code, and debugging edge cases.
The Solution: Paper2Agent automatically converts research papers into conversational AI agents built on Model Context Protocol (MCP) servers.
How it works:
- Analyzes paper + codebase systematically using multiple agents
- Constructs MCP server with discoverable tools and workflows
- Iteratively generates and runs tests to robustify the implementation
- Connects to chat interface (like Claude Code) for natural language queries
Example: "Can you adapt the AlphaGenome model for this new chromosome sequence?" → The agent imports modules, configures parameters, handles API keys, and executes—all through conversation.
Practical win: Turns research from "months to adopt" into "minutes to deploy."
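To make that concrete, here's a hypothetical skeleton of the kind of MCP server Paper2Agent could emit, using FastMCP from the official MCP Python SDK (the server name, tool, and body are invented for illustration):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("alphagenome-paper")  # hypothetical server for one paper

@mcp.tool()
def predict_variant_effect(sequence: str) -> str:
    """Run the paper's model on a DNA sequence (stubbed out here)."""
    # A real Paper2Agent server would import the paper's modules,
    # load configs and API keys, then call the actual model.
    return f"stub prediction for a {len(sequence)}-bp sequence"

if __name__ == "__main__":
    mcp.run()  # exposes the tool to any MCP client, e.g. Claude Code
```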
🎯 DeepPlanner: Teaching AI to Plan Like Humans
The Discovery: When AI agents plan complex tasks, their "planning tokens" show significantly higher entropy than action tokens. Translation: they're uncertain exactly when certainty matters most.
The Innovation: DeepPlanner uses "advantage shaping"—allocating larger gradient updates to high-entropy tokens (planning decisions) while preserving entropy to prevent collapse.
Key mechanisms (toy illustration after the list):
- Entropy-based advantage shaping: Amplifies learning on uncertain tokens
- Selective advantage upweighting: Rewards planning-intensive rollouts more heavily
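A toy version of entropy-based advantage shaping (my own simplified form, not the paper's exact equation) might look like:

```python
import numpy as np

def shape_advantages(advantages: np.ndarray, entropies: np.ndarray,
                     alpha: float = 0.5) -> np.ndarray:
    """Scale each token's advantage by its normalized entropy so that
    uncertain planning tokens receive larger policy-gradient updates."""
    norm_ent = entropies / (entropies.max() + 1e-8)  # per-token weight in [0, 1]
    return advantages * (1.0 + alpha * norm_ent)

adv = np.array([0.2, 0.2, 0.2])    # equal baseline advantages
ent = np.array([0.1, 2.5, 0.4])    # token 2 is a high-entropy planning token
print(shape_advantages(adv, ent))  # [0.204 0.3 0.216] -> plan token amplified
```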
Results: SOTA performance on deep research benchmarks with 10× fewer training samples than previous best framework (EvolveSearch).
Use case: Research agents that systematically decompose "Find all papers on topic X published since Y that contradict finding Z" into optimal search strategies.
🛠️ Claude Skills: Progressive Disclosure for Agents
The Challenge: How do you give agents domain-specific expertise without overwhelming their context window?
Anthropic's Solution: Skills are organized folders, each anchored by a SKILL.md file, containing instructions, scripts, and resources that agents load dynamically using progressive disclosure:
- Level 1: Metadata (name + description) preloaded in system prompt
- Level 2: Full SKILL.md loaded when relevant to task
- Level 3+: Additional linked files discovered and loaded as needed
- Code can be executed as tools without loading into context
Example PDF Skill:
SKILL.md → references forms.md and reference.md → Claude reads forms.md only when filling out forms → executes Python script to extract form fields without loading PDF into context
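As a rough sketch (the file contents below are invented for illustration; the YAML frontmatter provides the Level 1 metadata), a SKILL.md might look like:

```markdown
---
name: pdf-processing
description: Read, fill, and extract data from PDF files
---

# PDF Processing

For filling out forms, read [forms.md](forms.md).
For the full API surface, read [reference.md](reference.md).

To extract form fields without loading the PDF into context,
run `scripts/extract_fields.py <path-to-pdf>`.
```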
Think of it as: giving your AI an onboarding manual that it reads chapter-by-chapter, only when needed.
📰 This Week in Production AI
Sora 2 Hits Azure • OpenAI/Microsoft
First commercial API access to Sora 2 via Azure AI Foundry. $0.10/second of generated video. Marks shift from research demo to production infrastructure.
Gemini 3.0 Pro Limited Rollout • Google DeepMind
Select users receiving "smartest model to date." Phased A/B testing in Google AI Studio. Broader release expected late October.
Claude Haiku 4.5 • Anthropic
Compact model for low-latency applications. $1/M input tokens, $5/M output tokens. Optimized for real-time chat and customer service.
ChatGPT Memory Management • OpenAI
Automated memory clearing based on temporal decay. Plus/Pro users can prioritize critical memories. A notable step toward practical long-term memory in production assistants.
🚀 What To Do Monday Morning
Three Quick Actions
1. Test Verbalized Sampling (2 min)
Copy the VS prompt template. Ask for 5 different marketing headlines. Compare diversity to direct prompting. You'll see the difference immediately.
2. Audit Your AI Tools (5 min)
Which tools are completing tasks vs. empowering you? Are your code assistants teaching patterns or creating dependency? Adjust accordingly.
3. Explore Paper2Agent (10 min)
Identify one research paper your team struggled to implement. Check whether the Paper2Agent framework could convert it into a conversational agent. It's still early-stage, but watch this space.
The Meta-Pattern
These aren't separate breakthroughs. They're variations on one theme: AI systems learning to improve their own learning process. SEAL generates better training data. Empower optimizes for human capability growth. Verbalized Sampling unlocks latent diversity. DeepPlanner learns where to focus learning.
We're not just building smarter models. We're building models that get better at getting better.
Next week: Multimodal breakthroughs, o3 analysis, and the race to AGI
Technical insights without the hype • Research that matters
