🤖 Building Autonomous Agents + 🎨 Training Image Editors
Two papers with real code you can use today
Hey! 👋
This week I'm breaking down two papers that aren't just research—they're actually practical. DeepAgent shows how to build agents that discover tools on-demand, and Pico-Banana-400K gives you 400K labeled image edits you can train on today.
Both have open-source code. Both solve real problems. Let's dig in.
🧠 DeepAgent: Autonomous Tool Discovery + Memory Management
From Renmin University and Xiaohongshu, DeepAgent introduces end-to-end reasoning where the agent autonomously thinks, discovers tools, and manages memory—all in one coherent process instead of rigid ReAct loops.
🎯 Three Core Innovations
1. Dynamic Tool Retrieval
The agent generates queries during reasoning. The system uses bge-large-en-v1.5 embeddings to retrieve top-k tools from the index.
Why it matters: Works with toolsets of 16,000+ APIs (ToolBench scale). No need to pre-select which tools the agent can use.
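To make the retrieval step concrete, here is a minimal sketch of top-k selection by cosine similarity. It assumes the tool descriptions are already embedded; the tool names and 3-dimensional vectors are made up for illustration, and in practice the vectors would come from bge-large-en-v1.5.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec, tool_index, k=2):
    """Return the names of the k tools whose embeddings best match the query."""
    ranked = sorted(
        tool_index,
        key=lambda name: cosine(query_vec, tool_index[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy index: tool name -> embedding of its description.
TOOL_INDEX = {
    "get_weather": [0.9, 0.1, 0.0],
    "search_flights": [0.1, 0.9, 0.1],
    "convert_currency": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    # The query vector stands in for an embedded agent-generated query.
    print(retrieve_top_k([0.8, 0.3, 0.0], TOOL_INDEX, k=2))
```

At 16,000+ tools a linear scan like this still works, but a vector index would likely be the more practical choice.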
2. Autonomous Memory Folding
When the context gets too long or the agent gets stuck, it triggers memory folding, compressing its history into three structured memories:
Episodic Memory: High-level task progress and key decisions
Working Memory: Current subgoal and immediate context
Tool Memory: Which tools worked/failed and usage patterns
All stored in JSON format for stability. Then reasoning restarts with compressed memory instead of full history.
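One way to picture the folded state is a small container with one slot per memory type that round-trips through JSON. This is only a sketch: the class name, field names, and example keys are assumptions, not the schema used in the DeepAgent code.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class FoldedMemory:
    """Hypothetical container for the three folded memories."""
    episodic: dict = field(default_factory=dict)  # task progress, key decisions
    working: dict = field(default_factory=dict)   # current subgoal, immediate context
    tool: dict = field(default_factory=dict)      # which tools worked or failed

    def to_json(self) -> str:
        """Serialize to the JSON text that would seed the restarted reasoning."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "FoldedMemory":
        """Parse JSON produced by the folding model, tolerating missing keys."""
        data = json.loads(text)
        return cls(
            episodic=data.get("episodic", {}),
            working=data.get("working", {}),
            tool=data.get("tool", {}),
        )

if __name__ == "__main__":
    memory = FoldedMemory(
        episodic={"progress": "located the booking API"},
        working={"subgoal": "fetch seat availability"},
        tool={"search_flights": "worked", "get_seats": "failed: bad date format"},
    )
    print(memory.to_json())
```

Parsing with defaults means a partially filled response from the folding model still yields a usable object.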
3. ToolPO Training Method
Training uses an LLM to simulate API responses (avoids hitting thousands of real APIs). Fine-grained advantage attribution rewards correct tool invocations.
Setup: 100 steps, batch size 64, K=8 rollouts per prompt, trained on 64x H20 GPUs with QwQ-32B backbone
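Here is one plausible reading of "fine-grained advantage attribution", as a rough sketch. It assumes a task-level advantage normalized across the K rollouts of a prompt, plus an extra tool-call advantage added only on tokens that belong to tool invocations. The function names, the mask convention, and `tool_weight` are illustrative assumptions, not the paper's exact formulation.

```python
from statistics import mean, pstdev

def group_normalize(rewards):
    """Normalize rewards across the rollouts sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

def token_advantages(task_adv, tool_adv, tool_call_mask, tool_weight=1.0):
    """Per-token advantage: every token gets the task-level term;
    tokens inside a tool call (mask == 1) also get the tool-call term."""
    return [task_adv + tool_weight * tool_adv * m for m in tool_call_mask]

if __name__ == "__main__":
    # K=8 rollouts for one prompt: 1.0 = task solved, 0.0 = failed (toy values).
    task_rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
    task_advs = group_normalize(task_rewards)
    # Rollout 0: five tokens, the middle three form a tool call.
    print(token_advantages(task_advs[0], tool_adv=0.5, tool_call_mask=[0, 1, 1, 1, 0]))
```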
📊 Verified Performance Results
(vs 54.0% CodeAct)
(vs 42.5% HiRA)
(vs 84.3% HiRA)
Key strength: 20-30% improvement in open-set scenarios where tools aren't pre-selected. On ToolHop (requiring 3-7 sequential tool calls): 40.6% vs 29.0% for best baseline.
🔧 Implementation Approach
Step 1: Build tool index with embeddings
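from sentence_transformers import SentenceTransformer
# tool_docs: list of tool description strings, one per API (you supply this)
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
# Normalized vectors let a plain dot product act as cosine similarity
tool_embeddings = model.encode(tool_docs, normalize_embeddings=True)
# At inference: embed the agent's query the same way and keep the top-k scores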
Step 2: Parse agent outputs
Agent generates
Step 3: Memory folding trigger
When context exceeds threshold or agent loops, use auxiliary LLM (Qwen2.5-32B-Instruct in paper) to generate structured JSON memory, then restart reasoning.
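The trigger condition can be sketched as a small external check. DeepAgent lets the agent decide on its own when to fold, so this is a simplification; the token budget, the window size, and the "same action repeated" loop test are all assumptions.

```python
def should_fold(token_count, recent_actions, max_tokens=8000, loop_window=3):
    """Return True when history should be compressed.

    token_count:    current context length in tokens
    recent_actions: list of action strings, oldest first
    max_tokens:     hypothetical context budget
    loop_window:    how many identical trailing actions count as a loop
    """
    if token_count > max_tokens:
        return True
    tail = recent_actions[-loop_window:]
    # A loop here means the last `loop_window` actions are all identical.
    return len(tail) == loop_window and len(set(tail)) == 1

if __name__ == "__main__":
    print(should_fold(9500, ["search", "call"]))           # over budget
    print(should_fold(1200, ["call_x", "call_x", "call_x"]))  # stuck repeating
    print(should_fold(1200, ["search", "call_x", "call_y"]))  # healthy
```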
💡 Real-World Applications
- DevOps automation: Discover kubectl, terraform, docker commands on-demand based on infrastructure state
- Customer support: Search 100+ internal tools (CRM, ticketing, knowledge base) without pre-selection
- AI monitoring systems: Compress agent interaction logs into structured episodes for debugging
- Data engineering: Discover and chain transformation tools (dbt, pandas, SQL) dynamically
🎨 Pico-Banana-400K: 400K Real Image Edits from Apple
Apple released 386K total examples (400K rounded) built from OpenImages photos, edited by Nano-Banana (Gemini-2.5-Flash-Image), and quality-verified by Gemini-2.5-Pro. Cost to produce: approximately $100K USD.
📦 Exact Dataset Composition
Single-turn SFT examples: 258K (66.8%)
Multi-turn sequences: 72K (18.7%)
Preference pairs: 56K (14.5%)
35 Edit Types in 8 Categories
Pixel & Photometric: Color tone, film grain
Object-Level Semantic: Add/remove/replace objects, change attributes
Scene Composition: Weather, seasons, backgrounds, lighting
Stylistic: Artistic transfer, photo→cartoon, modern↔historical
Text & Symbol: Replace/translate signs, add text, change fonts
Human-Centric: 13 types including accessories, clothing, poses, age/gender, Pixar/anime/LEGO conversions
Scale: Zoom in
Spatial/Layout: Outpainting
✅ Quality Control: Gemini-2.5-Pro as Judge
Every edit is scored on four weighted criteria, with a pass threshold of roughly 0.7.
Auto-retry: If score < 0.7, retry up to 3 times. Success → dataset. Failures → preference data (chosen vs. rejected pairs).
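The retry logic can be sketched as a short loop that also collects failed attempts for preference data. `edit_fn` and `judge_fn` are hypothetical stand-ins for the editing model and the judge, and reading "retry up to 3 times" as one initial attempt plus three retries is an assumption.

```python
def generate_with_retries(edit_fn, judge_fn, threshold=0.7, max_retries=3):
    """Try an edit until the judge accepts it or retries run out.

    Returns (accepted_edit_or_None, list_of_rejected_edits).
    """
    rejected = []
    for _ in range(1 + max_retries):  # first attempt plus retries
        candidate = edit_fn()
        if judge_fn(candidate) >= threshold:
            return candidate, rejected
        rejected.append(candidate)
    return None, rejected

def preference_pairs(accepted, rejected):
    """Pair the accepted edit with each failed attempt (chosen vs. rejected)."""
    if accepted is None:
        return []
    return [{"chosen": accepted, "rejected": r} for r in rejected]

if __name__ == "__main__":
    # Toy stand-ins: three attempts with judge scores 0.4, 0.6, 0.9.
    attempts = iter(["edit_a", "edit_b", "edit_c"])
    scores = {"edit_a": 0.4, "edit_b": 0.6, "edit_c": 0.9}
    accepted, failed = generate_with_retries(lambda: next(attempts), scores.get)
    print(accepted, preference_pairs(accepted, failed))
```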
📊 Success Rates (Table 1 from Paper)
✅ Easy (about 89%+ success)
- Strong artistic style transfer: 93.40% (15,285 examples)
- Film grain/vintage filter: 90.68% (15,443 examples)
- Modern↔historical: 88.75% (14,856 examples)
⚠️ Moderate (75-85% success)
- Remove object: 83.28% (15,111 examples)
- Replace category: 83.48% (14,549 examples)
- Photo→cartoon/sketch: 80.06% (12,736 examples)
- Seasonal change: 80.15% (13,439 examples)
❌ Hard (below 75%)
- Relocate object: 59.23% - One of the hardest edit types
- Change font/color of text: 57.59% - Lowest rate listed here; text rendering issues
- Caricature: 58.84% - Identity drift problems
- Outpainting: 66.34% - Boundary continuity issues
- Pixar/Disney 3D conversion: 64.63%
🔧 Three Training Approaches
Option 1: Supervised Fine-Tuning (258K single-turn)
Standard InstructPix2Pix setup. Each example has: original_image, edited_image, instruction (long + short versions), edit_type. Train diffusion model conditioned on [image + text].
Option 2: DPO/Preference Learning (56K pairs)
Each example: original_image, instruction, chosen_edit (passed judge), rejected_edit (failed). Use for Direct Preference Optimization or reward model training.
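For reference, here is the standard DPO objective for a single chosen/rejected pair, written on plain log-probabilities. This is a minimal sketch: `beta=0.1` is an illustrative default, and for diffusion-based editors the log-probabilities are typically replaced by denoising-loss terms (as in Diffusion-DPO), which is not shown here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(margin)).

    The margin measures how much more the policy prefers the chosen edit
    over the rejected one, relative to a frozen reference model.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

if __name__ == "__main__":
    # Policy identical to reference: loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy favors the chosen edit more than the reference does: lower loss.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```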
Option 3: Multi-Turn Sequences (72K, 2-5 turns each)
Consecutive edits with referential language. Example: Turn 1: "Add hat" → Turn 2: "Make it red" (refers to hat). Perfect for iterative editing UX.
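Here is one way to turn a multi-turn session into per-turn training samples: each turn edits the previous turn's output and carries the earlier instructions as context. The dictionary keys and the string image handles are placeholders, not the dataset's actual field names.

```python
def unroll_session(source_image, turns):
    """Expand a multi-turn session into one training sample per turn.

    turns: ordered list of (instruction, edited_image) tuples.
    Each sample edits the previous turn's output and includes the prior
    instructions, so a reference like "it" can be resolved.
    """
    samples = []
    current_image = source_image
    history = []
    for instruction, edited_image in turns:
        samples.append({
            "input_image": current_image,
            "history": list(history),  # copy, so later turns don't mutate it
            "instruction": instruction,
            "target_image": edited_image,
        })
        history.append(instruction)
        current_image = edited_image
    return samples

if __name__ == "__main__":
    session = [("Add hat", "img_1.png"), ("Make it red", "img_2.png")]
    for sample in unroll_session("img_0.png", session):
        print(sample)
```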
License: the dataset is released under CC BY-NC-ND 4.0 (non-commercial research use). Verify the license terms in the repository before any commercial use.
🔗 Access Everything
DeepAgent
- GitHub Repository
- Paper: arXiv:2510.21618
- Backbone: QwQ-32B
Pico-Banana-400K
- GitHub Repository
- Paper: arXiv:2510.19808
- Dataset on HuggingFace
Both papers are immediately practical:
- DeepAgent shows autonomous tool discovery for 16K+ APIs
- Pico-Banana gives you 386K training examples for image editing
- Both have open-source implementations ready to use
Pick one and build something this weekend. That's how you actually learn.
Keep building,
ResearchAudio
ResearchAudio.io • Weekly AI research that actually matters

