researchaudio

AGI scorecards, tiny-net reasoning, “early childhood” for agents, and coder AIs in your browser

Long, simple explanations. Links you can click. Audio-friendly notes.

🎧 Prefer listening? This issue comes with a narration script for audio (see the end of this email). You can paste it into ElevenLabs.
TL;DR
  • What is AGI, concretely? One paper proposes a simple 10-part scorecard so we can track progress by ability, not vibes.
  • Tiny models can still reason: a 27M-parameter tiny recursion model does well on puzzle-style tasks with ~1k examples.
  • Give agents a “childhood”: brief early experience (world-modeling or self-reflection) before RL improves multi-turn tool use.
  • Agentic data science: DeepAnalyze-8B plans, uses tools, and writes reports with a hybrid reward for accuracy + process.
  • Hands-on tools: ChatGPT Atlas (a browser with ChatGPT built in) and Claude Code on the web (cloud coding sessions & PRs).
  • Market & hiring: reported Nvidia→OpenAI 5M-chip lease chatter; Meta hires Tim Brooks into Superintelligence Labs.
  • Bonus: Karpathy’s NanoChat (learn-by-building chat stack) + the Andrej × Dwarkesh interview.

🧠 5 Papers, Explained Simply

A Definition of AGI — a scorecard you can actually use

This paper says: stop arguing about fuzzy AGI definitions. Instead, measure whether an AI matches a well-educated adult across 10 abilities (things like reasoning, reading/writing, math, working memory, long-term memory storage & retrieval, visual & auditory processing, and speed). It borrows from human psychometrics (the Cattell-Horn-Carroll model, CHC) so we can report an overall AGI Score plus a “profile” that shows strengths and weaknesses.

Explain it like I’m new: Think of an AI report card with 10 subjects. Instead of bragging about one test, we track all subjects over time.

Why it matters: Research and product teams can set targets per ability (e.g., “improve working memory this quarter”) and avoid being misled by one high score.
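
Here’s what that could look like in practice: a minimal sketch of scorecard aggregation, assuming ten ability scores on a 0–100 scale and an equal-weight average. The ability names and numbers below are illustrative, not the paper’s exact rubric.

```python
from statistics import mean

# Hypothetical per-ability scores on a 0-100 scale. Names and numbers
# are illustrative, not the paper's actual rubric or data.
profile = {
    "reasoning": 72, "reading_writing": 88, "math": 65,
    "working_memory": 40, "memory_storage": 35, "memory_retrieval": 55,
    "visual": 60, "auditory": 30, "speed": 90, "knowledge": 85,
}

# One simple aggregation choice: an equal-weight average across abilities.
agi_score = mean(profile.values())
print(f"AGI Score: {agi_score:.1f}")

# The per-ability profile matters as much as the single number:
for ability, score in sorted(profile.items(), key=lambda kv: kv[1]):
    print(f"{ability:>16}: {score}")
```

The jagged profile (strong on speed and reading, weak on memory) is exactly the kind of pattern a single headline score hides.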

DeepSeek-OCR — “optical compression” for fast, high-res reading

The idea: a picture of text can act as a compressed version of the text itself. A slim vision encoder (DeepEncoder) squeezes a high-res page into a small number of “vision tokens,” then a modest 3B MoE decoder reads them back out. At compression ratios under 10× (text tokens per vision token), the model reports ~97% OCR precision; even at ~20×, it still lands around 60%. On OmniDocBench, it beats heavier models using far fewer tokens. The team even reports generating 200k+ pages of training data per day on a single A100-40G.

Takeaway: If you need to read lots of PDFs fast, this optical compression trick means fewer tokens, lower cost, and still good accuracy.
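
To make the ratio concrete, here’s a back-of-the-envelope sketch. The token counts are invented; only the <10× / ~97% and ~20× / ~60% figures come from the paper’s reported results.

```python
# An invented example page; only the <10x / ~97% and ~20x / ~60% figures
# come from the paper's reported results.
text_tokens = 6000    # tokens if the page went to the decoder as plain text
vision_tokens = 640   # tokens after the encoder compresses the page image

ratio = text_tokens / vision_tokens
print(f"compression ratio: {ratio:.1f}x")  # ~9.4x, inside the <10x regime

savings = 1 - vision_tokens / text_tokens
print(f"decoder input tokens saved: {savings:.0%}")  # ~89%
```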

Less Is More: Recursive Reasoning with Tiny Networks (TRM)

A tiny recursion model (~27M params) does step-by-step reasoning on puzzle-style tasks (Sudoku, mazes, ARC-AGI), trained on only ~1k examples. Reported results include solid scores on ARC-AGI-1 and non-trivial scores on ARC-AGI-2, evidence that smart structure plus recursion can beat raw size on some reasoning tasks.

Explain it simply: Instead of a huge brain, use a small brain that “thinks in loops” and refines its answer.
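
A minimal PyTorch sketch of the “think in loops” idea: one small network reused across refinement steps. TRM’s actual architecture, training loss, and halting scheme differ from this toy.

```python
import torch
import torch.nn as nn

class TinyRefiner(nn.Module):
    """One small network applied repeatedly to refine a latent answer."""
    def __init__(self, dim=128):
        super().__init__()
        self.step = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x, n_loops=8):
        y = torch.zeros_like(x)                  # initial answer guess
        for _ in range(n_loops):                 # "think in loops"
            y = y + self.step(torch.cat([x, y], dim=-1))  # refine the guess
        return y

model = TinyRefiner()
x = torch.randn(4, 128)   # a batch of 4 puzzle encodings
answer = model(x)         # same weights reused every loop
```

Notice the parameter count never grows: extra “thinking” comes from running more loops, not from adding layers.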

Agent Learning via Early Experience — give agents a “childhood”

Before full RL, let the agent gather some low-stakes early experience and learn from it. Two flavors: (1) implicit world-modeling (it learns how the environment behaves) and (2) self-reflection (it critiques and improves its own attempts). Across many environments, this improves success rates and out-of-domain generalization. On tasks with verifiable rewards (e.g., WebShop, or multi-turn function calling in BFCLv3), adding a standard RL step on top gives big additional gains.

Why it matters: If you’re building practical agents, a short “childhood” phase (world-modeling or self-reflection) can make them more reliable later.
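
Here’s a toy, self-contained illustration of the world-modeling flavor. The environment, tabular model, and greedy planner are invented stand-ins (the paper works with LLM agents, and the self-reflection flavor is omitted here):

```python
import random

random.seed(0)

class ToyEnv:
    """State is an integer; actions move it by -1/+1; reward at state 5."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action
        return self.state, 1.0 if self.state == 5 else 0.0

# Phase 1: implicit world-modeling. Take cheap exploratory actions and
# record what the environment does; no reward signal is needed yet.
env = ToyEnv()
state = env.reset()
world_model = {}
for _ in range(200):
    action = random.choice([-1, 1])
    next_state, _ = env.step(action)
    world_model[(state, action)] = next_state
    state = next_state

# Phase 2: exploit the "childhood". A greedy planner on the learned
# model stands in here for the paper's RL step on verifiable rewards.
state = env.reset()
for t in range(10):
    action = min([-1, 1],
                 key=lambda a: abs(world_model.get((state, a), state) - 5))
    state, reward = env.step(action)
    if reward:
        print(f"goal reached at step {t + 1}")
        break
```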

DeepAnalyze: Agentic LLMs for Autonomous Data Science

DeepAnalyze-8B is trained to plan, use tools, and write reports across end-to-end data workflows. Its hybrid reward mixes rule-based checks (format, correctness) with LLM-as-a-judge scoring (process quality and report quality) and uses a curriculum (easy→hard). The promise: more robust, auditable data-science agents that care about both the answer and the process.

Builder tip: Reward not just the final number but also the steps taken. You’ll get cleaner, more reproducible pipelines.
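
In code, a hybrid reward can be as simple as a weighted sum. The checks, prompt, and 50/50 weights below are assumptions for illustration, not DeepAnalyze’s actual implementation.

```python
def hybrid_reward(answer, report, trace, judge, w_rule=0.5, w_judge=0.5):
    # Rule-based component: cheap, deterministic checks.
    rule_score = (
        (0.5 if answer.get("format_ok") else 0.0)
        + (0.5 if answer.get("value_correct") else 0.0)
    )

    # LLM-as-a-judge component: grades process and report quality in [0, 1].
    judge_score = judge(
        "Rate from 0 to 1 the quality of this analysis process and report.\n"
        f"Trace: {trace}\nReport: {report}"
    )

    return w_rule * rule_score + w_judge * judge_score

# Usage with a stub judge; a real setup would call an LLM here.
score = hybrid_reward(
    answer={"format_ok": True, "value_correct": True},
    report="Sales rose 12% QoQ, driven by ...",
    trace=["load csv", "group by region", "plot trend"],
    judge=lambda prompt: 0.8,
)
print(score)  # 0.9 with these illustrative weights
```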

🧪 Tools to Try (now)

ChatGPT Atlas

A web browser with ChatGPT built in. It can understand pages you’re viewing, remember optional “browser memories,” and use an agent mode to take actions while you supervise.

Claude Code on the web

Kick off cloud coding sessions in your browser, connect repos, and let Claude open PRs. Runs in isolated sandboxes with security controls.

Karpathy’s NanoChat

A minimal, end-to-end chat stack you can train cheaply. Great for truly understanding the whole pipeline, from tokenizer to serving.

📈 Market & Talent

Nvidia/OpenAI mega-deal chatter: Reporting says Nvidia may lease up to ~5M chips to OpenAI over time (framed as roughly $350B of exposure). If accurate, it shows how fast compute demand is scaling—and the financial risk behind frontier models. Market take: NVDA dipped slightly on the rumor.

Meta hire: Meta reportedly hired Tim Brooks (ex-OpenAI/DeepMind) into Superintelligence Labs—another sign they’re leaning into video-centric world models and longer-horizon reasoning.

🎙 Listening Pick: Andrej Karpathy × Dwarkesh

Watch on YouTube. Themes to notice: tiny, hackable systems (see NanoChat), agent loops that don’t “run away,” and why evals should measure process—not just answers.

🛠 Cheat-Sheet: Borrow These Ideas

  • Make an “AGI Scorecard” for your product: pick 6–10 abilities that matter (tool use, memory, vision, latency) and track them monthly.
  • Prototype with tiny models + recursion: it’s cheaper to find bottlenecks before you scale.
  • Give agents a short “childhood”: a week of world-modeling or self-reflection data, then RL. You’ll see steadier multi-turn behavior.
  • Use hybrid rewards: score both outputs and process quality to avoid “shortcut” agents.
  • Try a coder-in-the-browser: evaluate Claude Code on the web and Atlas’s agent mode in supervised flows.

You’re reading researchaudio. If a friend sent this, subscribe here.
