
🤖 Building Autonomous Agents + 🎨 Training Image Editors

Two papers with real code you can use today

Hey! 👋

This week I'm breaking down two papers that aren't just research—they're actually practical. DeepAgent shows how to build agents that discover tools on-demand, and Pico-Banana-400K gives you 400K labeled image edits you can train on today.

Both have open-source code. Both solve real problems. Let's dig in.

🧠 DeepAgent: Autonomous Tool Discovery + Memory Management

From Renmin University and Xiaohongshu, DeepAgent introduces end-to-end reasoning where the agent autonomously thinks, discovers tools, and manages memory—all in one coherent process instead of rigid ReAct loops.

🎯 Three Core Innovations

1. Dynamic Tool Retrieval

The agent generates queries during reasoning. The system uses bge-large-en-v1.5 embeddings to retrieve top-k tools from the index.

Example flow: query "movie database API" → retrieved candidates (TMDB API, IMDB, ...) → tool call {"name": "tmdb_search", "args": {...}}

Why it matters: Works with toolsets of 16,000+ APIs (ToolBench scale). No need to pre-select which tools the agent can use.

2. Autonomous Memory Folding

When the context gets too long or the agent is stuck, it triggers a fold that compresses the full history into three structured memories:

Episodic Memory: High-level task progress and key decisions

Working Memory: Current subgoal and immediate context

Tool Memory: Which tools worked/failed and usage patterns

All stored in JSON format for stability. Then reasoning restarts with compressed memory instead of full history.
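
To make that concrete, here's a rough sketch of what a folded memory could look like. The JSON storage is from the paper, but the exact keys below are my assumptions based on the three categories, not the paper's schema:

# Hypothetical folded-memory layout; keys are illustrative assumptions
folded_memory = {
    "episodic": {                      # high-level task progress
        "task": "find the director of the top-grossing 2019 film",
        "progress": ["retrieved box-office list", "identified the film"],
        "key_decisions": ["preferred TMDB over scraping IMDB"],
    },
    "working": {                       # current subgoal + immediate context
        "current_subgoal": "look up the director credit",
        "context": "film_id=299534",
    },
    "tool": {                          # which tools worked/failed
        "worked": ["tmdb_search"],
        "failed": ["imdb_scrape (rate-limited)"],
    },
}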

3. ToolPO Training Method

Training uses an LLM to simulate API responses (avoids hitting thousands of real APIs). Fine-grained advantage attribution rewards correct tool invocations.

Setup: 100 training steps, batch size 64, K=8 rollouts per prompt, trained on 64× H20 GPUs with a QwQ-32B backbone.
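
Here's a schematic of what fine-grained advantage attribution can look like in code: give every token the trajectory-level advantage, then add extra credit on the token spans of correct tool calls. This is a sketch of the general technique, not ToolPO's exact implementation:

import torch

def token_advantages(seq_len, outcome_adv, tool_spans, tool_advs):
    # Broadcast the trajectory-level advantage to every token, then add
    # per-call credit on each tool invocation's tokens.
    # tool_spans: (start, end) token indices per tool call (assumed format)
    # tool_advs: per-call credit, e.g. positive for a correct invocation
    adv = torch.full((seq_len,), float(outcome_adv))
    for (s, e), a in zip(tool_spans, tool_advs):
        adv[s:e] += a
    return adv

# Feed adv into a PPO-style clipped surrogate with per-token advantages:
# loss = -torch.min(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv).mean()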

📊 Verified Performance Results

  • ToolBench: 64.0% (vs 54.0% for CodeAct)
  • GAIA: 53.3% (vs 42.5% for HiRA)
  • ALFWorld: 91.8% (vs 84.3% for HiRA)

Key strength: 20-30% improvement in open-set scenarios where tools aren't pre-selected. On ToolHop (requiring 3-7 sequential tool calls): 40.6% vs 29.0% for best baseline.

🔧 Implementation Approach

Step 1: Build tool index with embeddings

# Use bge-large-en-v1.5 (same as paper); one description string per tool
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
tool_docs = ["tmdb_search: look up movies by title",
             "weather_now: current conditions for a city"]  # illustrative docs
tool_embeddings = model.encode(tool_docs, normalize_embeddings=True)
# Retrieve top-k at inference (normalized vectors: cosine = dot product)
query_emb = model.encode("movie database API", normalize_embeddings=True)
top_k = (tool_embeddings @ query_emb).argsort()[::-1][:5]

Step 2: Parse agent outputs

The agent emits structured tags in its output for tool search, tool calls, and memory folds. The system intercepts each tag, executes the corresponding action, and injects the result back into the reasoning stream.
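
A minimal interception loop might look like this. The tag names (<tool_call>, <result>) are placeholders, not the paper's literal markers, and llm_generate/tools are stand-ins you'd wire up yourself:

import json

def step(llm_generate, context, tools):
    # Generate until the model emits a closing tool-call tag (if it does);
    # we assume the generator strips the stop string from its output.
    out = llm_generate(context, stop="</tool_call>")
    if "<tool_call>" not in out:
        return context + out, True              # finished: no tool requested
    payload = out.split("<tool_call>", 1)[1]    # JSON after the opening tag
    call = json.loads(payload)                  # {"name": ..., "args": {...}}
    result = tools[call["name"]](**call["args"])
    # Splice the observation back into the stream and keep reasoning
    new_context = context + out + "</tool_call><result>" + json.dumps(result) + "</result>"
    return new_context, False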

Step 3: Memory folding trigger

When the context exceeds a threshold or the agent loops, use an auxiliary LLM (Qwen2.5-32B-Instruct in the paper) to generate structured JSON memory, then restart reasoning from it.
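
Sketched out (the token threshold, prompt wording, and function names below are assumptions, not the paper's):

MAX_CONTEXT_TOKENS = 24_000   # assumed threshold; tune for your model

def maybe_fold(context, aux_llm, count_tokens, is_looping):
    # Fold only when the context is too long or the agent is stuck
    if count_tokens(context) < MAX_CONTEXT_TOKENS and not is_looping:
        return context
    memory_json = aux_llm(
        "Compress this agent trajectory into JSON with keys "
        "'episodic', 'working', and 'tool':\n\n" + context
    )
    # Restart reasoning from the compact memory instead of full history
    return "<memory>" + memory_json + "</memory>\n"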

💡 Real-World Applications

  • DevOps automation: Discover kubectl, terraform, docker commands on-demand based on infrastructure state
  • Customer support: Search 100+ internal tools (CRM, ticketing, knowledge base) without pre-selection
  • AI monitoring systems: Compress agent interaction logs into structured episodes for debugging
  • Data engineering: Discover and chain transformation tools (dbt, pandas, SQL) dynamically

🎨 Pico-Banana-400K: 400K Real Image Edits from Apple

Apple released 386K examples in total (the "400K" in the name rounds up), built from OpenImages photos, edited by Nano-Banana (Gemini-2.5-Flash-Image), and quality-verified by Gemini-2.5-Pro. Cost to produce: approximately $100K USD.

📦 Exact Dataset Composition

  • 258K single-turn SFT examples (66.8%)
  • 72K multi-turn sequences (18.7%)
  • 56K preference pairs (14.5%)

35 Edit Types in 8 Categories

Pixel & Photometric: Color tone, film grain

Object-Level Semantic: Add/remove/replace objects, change attributes

Scene Composition: Weather, seasons, backgrounds, lighting

Stylistic: Artistic transfer, photo→cartoon, modern↔historical

Text & Symbol: Replace/translate signs, add text, change fonts

Human-Centric: 13 types including accessories, clothing, poses, age/gender, Pixar/anime/LEGO conversions

Scale: Zoom in

Spatial/Layout: Outpainting

✅ Quality Control: Gemini-2.5-Pro as Judge

Every edit is scored on four weighted criteria (pass threshold: ~0.7):

  • 40% Instruction Compliance: fulfills the prompt
  • 25% Seamlessness: natural, no artifacts
  • 20% Preservation Balance: unchanged parts stay consistent
  • 15% Technical Quality: sharpness, color, exposure

Auto-retry: If score < 0.7, retry up to 3 times. Success → dataset. Failures → preference data (chosen vs. rejected pairs).
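
Here's the gist of that pipeline as a sketch. The weights and 0.7 threshold come from the article above; edit_fn and judge_fn are hypothetical stand-ins for Nano-Banana and the Gemini-2.5-Pro judge:

WEIGHTS = {"instruction_compliance": 0.40, "seamlessness": 0.25,
           "preservation": 0.20, "technical_quality": 0.15}

def weighted_score(scores):
    # scores: per-criterion values in [0, 1] returned by the judge
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def generate_with_retries(edit_fn, judge_fn, max_retries=3, threshold=0.7):
    rejected = []
    for _ in range(1 + max_retries):             # first attempt + retries
        candidate = edit_fn()
        if weighted_score(judge_fn(candidate)) >= threshold:
            return candidate, rejected           # success -> SFT dataset
        rejected.append(candidate)               # failures -> preference pairs
    return None, rejected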

📊 Success Rates (Table 1 from Paper)

✅ Easy (90%+ success)

  • Strong artistic style transfer: 93.40% (15,285 examples)
  • Film grain/vintage filter: 90.68% (15,443 examples)
  • Modern↔historical: 88.75% (14,856 examples)

⚠️ Moderate (75-85% success)

  • Remove object: 83.28% (15,111 examples)
  • Replace category: 83.48% (14,549 examples)
  • Photo→cartoon/sketch: 80.06% (12,736 examples)
  • Seasonal change: 80.15% (13,439 examples)

❌ Hard (below 75%)

  • Relocate object: 59.23% - among the hardest edit types
  • Change font/color of text: 57.59% - Text rendering issues
  • Caricature: 58.84% - Identity drift problems
  • Outpainting: 66.34% - Boundary continuity issues
  • Pixar/Disney 3D conversion: 64.63%

🔧 Three Training Approaches

Option 1: Supervised Fine-Tuning (258K single-turn)

Standard InstructPix2Pix setup. Each example has original_image, edited_image, instruction (long and short versions), and edit_type. Train a diffusion model conditioned on [image + text].
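
A single record plausibly looks like this (field names follow the description above; the released dataset's actual column names may differ):

# Illustrative SFT record; column names in the released files may differ
example = {
    "original_image": "openimages/abc123.jpg",
    "edited_image": "edits/abc123_color_tone.jpg",
    "instruction_long": "Shift the color tone toward warm golden-hour light...",
    "instruction_short": "Make the photo warmer.",
    "edit_type": "color_tone",
}
# InstructPix2Pix-style training: the diffusion model denoises edited_image
# conditioned on the encoded original_image plus the text instruction.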

Option 2: DPO/Preference Learning (56K pairs)

Each example: original_image, instruction, chosen_edit (passed judge), rejected_edit (failed). Use for Direct Preference Optimization or reward model training.
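
For reference, here's the standard DPO objective you'd apply to those pairs. Getting a log-probability out of an image editor is model-specific (Diffusion-DPO, for instance, uses a denoising-loss surrogate), so treat these inputs as placeholders:

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Policy and frozen-reference log-probs for chosen/rejected edits
    pi_logratio = logp_chosen - logp_rejected
    ref_logratio = ref_chosen - ref_rejected
    # Push the policy to prefer judge-approved edits over rejected ones
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()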

Option 3: Multi-Turn Sequences (72K, 2-5 turns each)

Consecutive edits with referential language. Example: Turn 1: "Add hat" → Turn 2: "Make it red" (refers to hat). Perfect for iterative editing UX.
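
One session, as a data structure (field names are illustrative):

# A 3-turn session; later instructions resolve references like "it"
# against earlier turns. Field names here are assumptions.
session = [
    {"turn": 1, "instruction": "Add a hat to the man.", "image": "t1.jpg"},
    {"turn": 2, "instruction": "Make it red.",          "image": "t2.jpg"},
    {"turn": 3, "instruction": "Move him to a beach.",  "image": "t3.jpg"},
]
# Train with the running dialogue + previous output image as conditioning.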

License: MIT for code, CC-BY 4.0 for data. Fully shareable and commercial-use friendly.

🔗 Access Everything

  • DeepAgent
  • Pico-Banana-400K

Both papers are immediately practical:

  • DeepAgent shows autonomous tool discovery for 16K+ APIs
  • Pico-Banana gives you 386K training examples for image editing
  • Both have open-source implementations ready to use

Pick one and build something this weekend. That's how you actually learn.

Keep building,
ResearchAudio

ResearchAudio.io • Weekly AI research that actually matters

Simplify Training with AI-Generated Video Guides

Are you tired of repeating the same instructions to your team? Guidde revolutionizes how you document and share processes with AI-powered how-to videos.

Here’s how:

1️⃣ Instant Creation: Turn complex tasks into stunning step-by-step video guides in seconds.
2️⃣ Fully Automated: Capture workflows with a browser extension that generates visuals, voiceovers, and call-to-actions.
3️⃣ Seamless Sharing: Share or embed guides anywhere effortlessly.

The best part? The browser extension is 100% free.
