
🤖 Building Autonomous Agents + 🎨 Training Image Editors

Two papers with real code you can use today

Hey! 👋

This week I'm breaking down two papers that aren't just research—they're actually practical. DeepAgent shows how to build agents that discover tools on-demand, and Pico-Banana-400K gives you 400K labeled image edits you can train on today.

Both have open-source code. Both solve real problems. Let's dig in.

🧠 DeepAgent: Autonomous Tool Discovery + Memory Management

From Renmin University and Xiaohongshu, DeepAgent introduces end-to-end reasoning where the agent autonomously thinks, discovers tools, and manages memory—all in one coherent process instead of rigid ReAct loops.

🎯 Three Core Innovations

1. Dynamic Tool Retrieval

The agent generates queries during reasoning. The system uses bge-large-en-v1.5 embeddings to retrieve top-k tools from the index.

Example flow: query "movie database API" → retrieved candidates (TMDB API, IMDB, ...) → tool call {"name": "tmdb_search", "args": {...}}

Why it matters: Works with toolsets of 16,000+ APIs (ToolBench scale). No need to pre-select which tools the agent can use.

2. Autonomous Memory Folding

When the context gets too long or the agent is stuck, it triggers a fold that compresses the full history into three structured memories:

Episodic Memory: High-level task progress and key decisions

Working Memory: Current subgoal and immediate context

Tool Memory: Which tools worked/failed and usage patterns

All stored in JSON format for stability. Then reasoning restarts with compressed memory instead of full history.
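
To make that concrete, here's a rough sketch of what a folded memory could look like. The JSON storage is from the paper, but the exact keys below are my assumptions based on the three categories, not the paper's schema:

# Hypothetical folded-memory layout; keys are illustrative assumptions
folded_memory = {
    "episodic": {                      # high-level task progress
        "task": "find the director of the top-grossing 2019 film",
        "progress": ["retrieved box-office list", "identified the film"],
        "key_decisions": ["preferred TMDB over scraping IMDB"],
    },
    "working": {                       # current subgoal + immediate context
        "current_subgoal": "look up the director credit",
        "context": "film_id=299534",
    },
    "tool": {                          # which tools worked/failed
        "worked": ["tmdb_search"],
        "failed": ["imdb_scrape (rate-limited)"],
    },
}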

3. ToolPO Training Method

Training uses an LLM to simulate API responses (avoids hitting thousands of real APIs). Fine-grained advantage attribution rewards correct tool invocations.

Setup: 100 training steps, batch size 64, K=8 rollouts per prompt, trained on 64× H20 GPUs with a QwQ-32B backbone.
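
Here's a schematic of what fine-grained advantage attribution can look like in code: give every token the trajectory-level advantage, then add extra credit on the token spans of correct tool calls. This is a sketch of the general technique, not ToolPO's exact implementation:

import torch

def token_advantages(seq_len, outcome_adv, tool_spans, tool_advs):
    # Broadcast the trajectory-level advantage to every token, then add
    # per-call credit on each tool invocation's tokens.
    # tool_spans: (start, end) token indices per tool call (assumed format)
    # tool_advs: per-call credit, e.g. positive for a correct invocation
    adv = torch.full((seq_len,), float(outcome_adv))
    for (s, e), a in zip(tool_spans, tool_advs):
        adv[s:e] += a
    return adv

# Feed adv into a PPO-style clipped surrogate with per-token advantages:
# loss = -torch.min(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv).mean()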

📊 Verified Performance Results

  • ToolBench: 64.0% (vs 54.0% for CodeAct)
  • GAIA: 53.3% (vs 42.5% for HiRA)
  • ALFWorld: 91.8% (vs 84.3% for HiRA)

Key strength: 20-30% improvement in open-set scenarios where tools aren't pre-selected. On ToolHop (requiring 3-7 sequential tool calls): 40.6% vs 29.0% for best baseline.

🔧 Implementation Approach

Step 1: Build tool index with embeddings

# Use bge-large-en-v1.5 (same as paper); one description string per tool
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
tool_docs = ["tmdb_search: look up movies by title",
             "weather_now: current conditions for a city"]  # illustrative docs
tool_embeddings = model.encode(tool_docs, normalize_embeddings=True)
# Retrieve top-k at inference (normalized vectors: cosine = dot product)
query_emb = model.encode("movie database API", normalize_embeddings=True)
top_k = (tool_embeddings @ query_emb).argsort()[::-1][:5]

Step 2: Parse agent outputs

The agent emits structured tags in its output for tool search, tool calls, and memory folds. The system intercepts each tag, executes the corresponding action, and injects the result back into the reasoning stream.
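
A minimal interception loop might look like this. The tag names (<tool_call>, <result>) are placeholders, not the paper's literal markers, and llm_generate/tools are stand-ins you'd wire up yourself:

import json

def step(llm_generate, context, tools):
    # Generate until the model emits a closing tool-call tag (if it does);
    # we assume the generator strips the stop string from its output.
    out = llm_generate(context, stop="</tool_call>")
    if "<tool_call>" not in out:
        return context + out, True              # finished: no tool requested
    payload = out.split("<tool_call>", 1)[1]    # JSON after the opening tag
    call = json.loads(payload)                  # {"name": ..., "args": {...}}
    result = tools[call["name"]](**call["args"])
    # Splice the observation back into the stream and keep reasoning
    new_context = context + out + "</tool_call><result>" + json.dumps(result) + "</result>"
    return new_context, False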

Step 3: Memory folding trigger

When the context exceeds a threshold or the agent loops, use an auxiliary LLM (Qwen2.5-32B-Instruct in the paper) to generate structured JSON memory, then restart reasoning from it.
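
Sketched out (the token threshold, prompt wording, and function names below are assumptions, not the paper's):

MAX_CONTEXT_TOKENS = 24_000   # assumed threshold; tune for your model

def maybe_fold(context, aux_llm, count_tokens, is_looping):
    # Fold only when the context is too long or the agent is stuck
    if count_tokens(context) < MAX_CONTEXT_TOKENS and not is_looping:
        return context
    memory_json = aux_llm(
        "Compress this agent trajectory into JSON with keys "
        "'episodic', 'working', and 'tool':\n\n" + context
    )
    # Restart reasoning from the compact memory instead of full history
    return "<memory>" + memory_json + "</memory>\n"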

💡 Real-World Applications

  • DevOps automation: Discover kubectl, terraform, docker commands on-demand based on infrastructure state
  • Customer support: Search 100+ internal tools (CRM, ticketing, knowledge base) without pre-selection
  • AI monitoring systems: Compress agent interaction logs into structured episodes for debugging
  • Data engineering: Discover and chain transformation tools (dbt, pandas, SQL) dynamically

🎨 Pico-Banana-400K: 400K Real Image Edits from Apple

Apple released 386K examples in total (the "400K" in the name rounds up), built from OpenImages photos, edited by Nano-Banana (Gemini-2.5-Flash-Image), and quality-verified by Gemini-2.5-Pro. Cost to produce: approximately $100K USD.

📦 Exact Dataset Composition

  • 258K single-turn SFT examples (66.8%)
  • 72K multi-turn sequences (18.7%)
  • 56K preference pairs (14.5%)

35 Edit Types in 8 Categories

Pixel & Photometric: Color tone, film grain

Object-Level Semantic: Add/remove/replace objects, change attributes

Scene Composition: Weather, seasons, backgrounds, lighting

Stylistic: Artistic transfer, photo→cartoon, modern↔historical

Text & Symbol: Replace/translate signs, add text, change fonts

Human-Centric: 13 types including accessories, clothing, poses, age/gender, Pixar/anime/LEGO conversions

Scale: Zoom in

Spatial/Layout: Outpainting

✅ Quality Control: Gemini-2.5-Pro as Judge

Every edit is scored on four weighted criteria (pass threshold: ~0.7):

  • 40% Instruction Compliance: fulfills the prompt
  • 25% Seamlessness: natural, no artifacts
  • 20% Preservation Balance: unchanged parts stay consistent
  • 15% Technical Quality: sharpness, color, exposure

Auto-retry: If score < 0.7, retry up to 3 times. Success → dataset. Failures → preference data (chosen vs. rejected pairs).
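
Here's the gist of that pipeline as a sketch. The weights and 0.7 threshold come from the article above; edit_fn and judge_fn are hypothetical stand-ins for Nano-Banana and the Gemini-2.5-Pro judge:

WEIGHTS = {"instruction_compliance": 0.40, "seamlessness": 0.25,
           "preservation": 0.20, "technical_quality": 0.15}

def weighted_score(scores):
    # scores: per-criterion values in [0, 1] returned by the judge
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def generate_with_retries(edit_fn, judge_fn, max_retries=3, threshold=0.7):
    rejected = []
    for _ in range(1 + max_retries):             # first attempt + retries
        candidate = edit_fn()
        if weighted_score(judge_fn(candidate)) >= threshold:
            return candidate, rejected           # success -> SFT dataset
        rejected.append(candidate)               # failures -> preference pairs
    return None, rejected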

📊 Success Rates (Table 1 from Paper)

✅ Easy (90%+ success)

  • Strong artistic style transfer: 93.40% (15,285 examples)
  • Film grain/vintage filter: 90.68% (15,443 examples)
  • Modern↔historical: 88.75% (14,856 examples)

⚠️ Moderate (75-85% success)

  • Remove object: 83.28% (15,111 examples)
  • Replace category: 83.48% (14,549 examples)
  • Photo→cartoon/sketch: 80.06% (12,736 examples)
  • Seasonal change: 80.15% (13,439 examples)

❌ Hard (below 75%)

  • Relocate object: 59.23% - among the hardest edit types
  • Change font/color of text: 57.59% - Text rendering issues
  • Caricature: 58.84% - Identity drift problems
  • Outpainting: 66.34% - Boundary continuity issues
  • Pixar/Disney 3D conversion: 64.63%

🔧 Three Training Approaches

Option 1: Supervised Fine-Tuning (258K single-turn)

Standard InstructPix2Pix setup. Each example has original_image, edited_image, instruction (long and short versions), and edit_type. Train a diffusion model conditioned on [image + text].
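
A single record plausibly looks like this (field names follow the description above; the released dataset's actual column names may differ):

# Illustrative SFT record; column names in the released files may differ
example = {
    "original_image": "openimages/abc123.jpg",
    "edited_image": "edits/abc123_color_tone.jpg",
    "instruction_long": "Shift the color tone toward warm golden-hour light...",
    "instruction_short": "Make the photo warmer.",
    "edit_type": "color_tone",
}
# InstructPix2Pix-style training: the diffusion model denoises edited_image
# conditioned on the encoded original_image plus the text instruction.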

Option 2: DPO/Preference Learning (56K pairs)

Each example: original_image, instruction, chosen_edit (passed judge), rejected_edit (failed). Use for Direct Preference Optimization or reward model training.
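
For reference, here's the standard DPO objective you'd apply to those pairs. Getting a log-probability out of an image editor is model-specific (Diffusion-DPO, for instance, uses a denoising-loss surrogate), so treat these inputs as placeholders:

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Policy and frozen-reference log-probs for chosen/rejected edits
    pi_logratio = logp_chosen - logp_rejected
    ref_logratio = ref_chosen - ref_rejected
    # Push the policy to prefer judge-approved edits over rejected ones
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()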

Option 3: Multi-Turn Sequences (72K, 2-5 turns each)

Consecutive edits with referential language. Example: Turn 1: "Add hat" → Turn 2: "Make it red" (refers to hat). Perfect for iterative editing UX.
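
One session, as a data structure (field names are illustrative):

# A 3-turn session; later instructions resolve references like "it"
# against earlier turns. Field names here are assumptions.
session = [
    {"turn": 1, "instruction": "Add a hat to the man.", "image": "t1.jpg"},
    {"turn": 2, "instruction": "Make it red.",          "image": "t2.jpg"},
    {"turn": 3, "instruction": "Move him to a beach.",  "image": "t3.jpg"},
]
# Train with the running dialogue + previous output image as conditioning.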

License: MIT for code, CC-BY 4.0 for data. Fully shareable and commercial-use friendly.

🔗 Access Everything

  • DeepAgent
  • Pico-Banana-400K

Both papers are immediately practical:

  • DeepAgent shows autonomous tool discovery for 16K+ APIs
  • Pico-Banana gives you 386K training examples for image editing
  • Both have open-source implementations ready to use

Pick one and build something this weekend. That's how you actually learn.

Keep building,
ResearchAudio

ResearchAudio.io • Weekly AI research that actually matters

Simplify Training with AI-Generated Video Guides

Are you tired of repeating the same instructions to your team? Guidde revolutionizes how you document and share processes with AI-powered how-to videos.

Here’s how:

1️⃣ Instant Creation: Turn complex tasks into stunning step-by-step video guides in seconds.
2️⃣ Fully Automated: Capture workflows with a browser extension that generates visuals, voiceovers, and call-to-actions.
3️⃣ Seamless Sharing: Share or embed guides anywhere effortlessly.

The best part? The browser extension is 100% free.
