AGI scorecards, tiny-net reasoning, “early childhood” for agents, and coder AIs in your browser
Long, simple explanations. Links you can click. Audio-friendly notes.
- What is AGI, concretely? One paper proposes a simple 10-part scorecard so we can track progress by ability, not vibes.
- Tiny models can still reason: a 27M-parameter tiny recursion model does well on puzzle-style tasks with ~1k examples.
- Give agents a “childhood”: brief early experience (world-modeling or self-reflection) before RL improves multi-turn tool use.
- Agentic data science: DeepAnalyze-8B plans, uses tools, and writes reports with a hybrid reward for accuracy + process.
- Hands-on tools: ChatGPT Atlas (a browser with ChatGPT built in) and Claude Code on the web (cloud coding sessions & PRs).
- Market & hiring: reported Nvidia→OpenAI 5M-chip lease chatter; Meta hires Tim Brooks into Superintelligence Labs.
- Bonus: Karpathy’s NanoChat (learn-by-building chat stack) + the Andrej × Dwarkesh interview.
🧠 5 Papers, Explained Simply
A Definition of AGI — a scorecard you can actually use
This paper says: stop arguing about fuzzy AGI definitions. Instead, measure whether an AI matches a well-educated adult across 10 abilities (things like reasoning, reading/writing, math, working memory, long-term memory storage & retrieval, visual & auditory processing, and speed). It borrows from human psychometrics (the CHC model) so we can report an overall AGI Score plus a “profile” that shows strengths and weaknesses.
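To make the idea concrete, here is a minimal sketch of an ability-profile scorecard. The ability names, the 0–100 scale, and the equal weighting are illustrative assumptions, not the paper's exact scheme (see the PDF for that):

```python
# Illustrative ability scorecard: per-ability scores plus an overall mean.
# Ability list and equal weighting are assumptions for this sketch.

ABILITIES = [
    "reasoning", "reading_writing", "math", "working_memory",
    "long_term_memory", "visual", "auditory", "speed",
]

def agi_score(profile: dict) -> float:
    """Overall score = unweighted mean of per-ability scores (0-100)."""
    return sum(profile[a] for a in ABILITIES) / len(ABILITIES)

def weakest(profile: dict, k: int = 2) -> list:
    """The k lowest-scoring abilities -- targets for next quarter."""
    return sorted(ABILITIES, key=profile.__getitem__)[:k]

# A hypothetical model profile: strong at language, weak at memory.
profile = {
    "reasoning": 80, "reading_writing": 90, "math": 75,
    "working_memory": 40, "long_term_memory": 30,
    "visual": 60, "auditory": 55, "speed": 85,
}
```

The profile view is the point: a single high mean can hide a memory score that would sink a real product.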
Why it matters: Research and product teams can set targets per ability (e.g., “improve working memory this quarter”) and avoid being misled by one high score.
DeepSeek-OCR — “optical compression” for fast, high-res reading
The idea: an image of a page can carry its text in far fewer tokens than the text itself. A slim vision DeepEncoder squeezes a high-res page into a small number of “vision tokens,” and a modest 3B MoE decoder reads them back out. At compression ratios under 10× (text tokens per vision token), the model reports ~97% OCR precision; even around 20×, it still manages ~60%. On OmniDocBench it beats heavier models while using far fewer tokens, and it can churn through 200k+ pages per day on a single A100-40G.
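The trade-off is easy to sanity-check with back-of-envelope arithmetic. The page sizes below are made up; only the ratio-vs-precision pairing comes from the reported figures:

```python
# Back-of-envelope: "optical compression" trades vision tokens for precision.
# A hypothetical page of ~1000 text tokens, encoded two ways.

def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens

r_safe = compression_ratio(1000, 100)   # 10x regime: reported ~97% precision
r_lossy = compression_ratio(1000, 50)   # 20x regime: reported ~60% precision
```

In other words, halving the vision-token budget doubles the compression ratio but drops you from near-lossless to "gist-only" reading.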
Less Is More: Recursive Reasoning with Tiny Networks (TRM)
A tiny recursion model (~27M params) does step-by-step reasoning on puzzle-style tasks (Sudoku, mazes, ARC-AGI), trained on roughly 1k examples. Reported results include solid scores on ARC-AGI-1 and non-trivial scores on ARC-AGI-2, showing that smart structure plus recursion can beat raw size on some reasoning tasks.
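The core trick is reusing one small network many times instead of one big network once. This is only a schematic of that loop; the toy update, step counts, and scalar state are stand-ins, not the paper's architecture:

```python
# Schematic of recursive reasoning: one tiny network, applied repeatedly,
# refines a latent scratchpad and a candidate answer.
# tiny_net here is a toy averaging update, NOT the real ~27M-param model.

def tiny_net(x: float, y: float, z: float) -> float:
    """Stand-in for the shared tiny network."""
    return (x + y + z) / 3.0

def recursive_reason(x: float, steps: int = 6, inner: int = 3) -> float:
    y, z = 0.0, 0.0                  # candidate answer, latent scratchpad
    for _ in range(steps):
        for _ in range(inner):       # refine the latent a few times...
            z = tiny_net(x, y, z)
        y = tiny_net(x, y, z)        # ...then update the answer once
    return y
```

With this toy update the answer drifts toward the input over successive steps; the real model does the analogous thing in a learned latent space, which is where the parameter savings come from.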
Agent Learning via Early Experience — give agents a “childhood”
Before full RL, let the agent gather some low-stakes early experience and learn from it. Two flavors: (1) implicit world-modeling (it learns how the environment behaves) and (2) self-reflection (it critiques and improves its own attempts). Across many environments, this improves success and out-of-domain generalization. On tasks with verifiable rewards (like WebShop or multi-turn function calling like BFCLv3), adding a standard RL step on top gives big additional gains.
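The two flavors can be pictured as two ways of building a pre-RL training set. The `env`, `agent`, and `critic` interfaces below are hypothetical stand-ins, not the paper's API:

```python
# Two "childhood" recipes before RL, sketched as training-data builders.
# All interfaces (env.reset/step, agent, critic) are hypothetical.

def world_modeling_data(env, policy, episodes: int) -> list:
    """Flavor 1: implicit world-modeling.
    Collect (state, action, next_state) tuples to train a dynamics predictor."""
    data = []
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, done = env.step(a)
            data.append((s, a, s_next))   # supervised world-model target
            s = s_next
    return data

def self_reflection_data(agent, critic, tasks) -> list:
    """Flavor 2: self-reflection.
    Keep (task, first_try, critique, retry) tuples for fine-tuning."""
    data = []
    for task in tasks:
        first = agent(task)
        critique = critic(task, first)
        retry = agent(task, critique)     # second attempt, conditioned on critique
        data.append((task, first, critique, retry))
    return data
```

Either dataset is cheap to gather (no reward signal needed), which is the point: the agent arrives at RL already knowing roughly how the world works.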
DeepAnalyze: Agentic LLMs for Autonomous Data Science
DeepAnalyze-8B is trained to plan, use tools, and write reports across end-to-end data workflows. Its hybrid reward mixes rule-based checks (format, correctness) with LLM-as-a-judge scoring (process quality and report quality) and uses a curriculum (easy→hard). The promise: more robust, auditable data-science agents that care about both the answer and the process.
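A hybrid reward of this kind is simple to sketch. The specific checks, the 60/40 weighting, and the judge interface below are assumptions for illustration, not DeepAnalyze's actual reward:

```python
# Sketch of a hybrid reward: hard rule-based checks blended with an
# LLM-as-a-judge score. Weights and checks are illustrative assumptions.

def rule_reward(output: dict) -> float:
    """Rule-based half: right format and a correct final answer."""
    fmt_ok = isinstance(output.get("report"), str) and "answer" in output
    correct = output.get("answer") == output.get("expected")
    return 0.5 * fmt_ok + 0.5 * correct

def hybrid_reward(output: dict, judge, w_rule: float = 0.6) -> float:
    """Blend rule checks with a judge score in [0, 1].
    judge(output) would be an LLM grading process/report quality."""
    return w_rule * rule_reward(output) + (1 - w_rule) * judge(output)
```

The design intuition: rules alone invite shortcut-hacking of the format, and a judge alone drifts; blending the two scores both the answer and the process, which is exactly the paper's pitch.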
🧪 Tools to Try (now)
ChatGPT Atlas
A web browser with ChatGPT built in. It can understand pages you’re viewing, remember optional “browser memories,” and use an agent mode to take actions while you supervise.
Claude Code on the web
Kick off cloud coding sessions in your browser, connect repos, and let Claude open PRs. Runs in isolated sandboxes with security controls.
Karpathy’s NanoChat
A minimal, end-to-end chat stack you can train cheaply. Great to truly understand the whole pipeline—from tokenizer to serving.
📈 Market & Talent
Nvidia/OpenAI mega-deal chatter: Reporting says Nvidia may lease up to ~5M chips to OpenAI over time (framed as roughly $350B of exposure). If accurate, it shows how fast compute demand is scaling—and the financial risk behind frontier models. Market take: NVDA dipped slightly on the rumor.
Meta hire: Meta reportedly hired Tim Brooks (ex-OpenAI/DeepMind) into Superintelligence Labs—another sign they’re leaning into video-centric world models and longer-horizon reasoning.
🎙 Listening Pick: Andrej Karpathy × Dwarkesh
Watch on YouTube. Themes to notice: tiny, hackable systems (see NanoChat), agent loops that don’t “run away,” and why evals should measure process—not just answers.
🛠 Cheat-Sheet: Borrow These Ideas
- Make an “AGI Scorecard” for your product: pick 6–10 abilities that matter (tool use, memory, vision, latency) and track them monthly.
- Prototype with tiny models + recursion: it’s cheaper to find bottlenecks before you scale.
- Give agents a short “childhood”: a week of world-modeling or self-reflection data, then RL. You’ll see steadier multi-turn behavior.
- Use hybrid rewards: score both outputs and process quality to avoid “shortcut” agents.
- Try a coder-in-the-browser: evaluate Claude Code on the web and Atlas’s agent mode in supervised flows.
- A Definition of AGI (PDF)
- DeepSeek-OCR (PDF)
- Less Is More: Recursive Reasoning with Tiny Networks (PDF)
- Agent Learning via Early Experience (PDF)
- DeepAnalyze: Agentic LLMs for Autonomous Data Science (PDF)
- OpenAI: Introducing ChatGPT Atlas
- Anthropic: Claude Code on the web
- Analytics Vidhya: Karpathy’s NanoChat
- Andrej Karpathy × Dwarkesh (YouTube)
You’re reading researchaudio. If a friend sent this, subscribe here.