In partnership with

The Future of AI in Marketing. Your Shortcut to Smarter, Faster Marketing.

Unlock a focused set of AI strategies built to streamline your work and maximize impact. This guide delivers the practical tactics and tools marketers need to start seeing results right away:

  • 7 high-impact AI strategies to accelerate your marketing performance

  • Practical use cases for content creation, lead gen, and personalization

  • Expert insights into how top marketers are using AI today

  • A framework to evaluate and implement AI tools efficiently

Stay ahead of the curve with these top AI strategies for marketers, built for real-world results.

ResearchAudio.io

Databricks Trained a Search Agent with Zero Human Labels

33% lower cost and 47% lower latency than frontier models. The multi-task RL recipe inside.

Databricks built a search agent that matches Claude Opus 4.6 on enterprise knowledge tasks, at 33% lower cost per query and 47% lower latency. The model, called KARL (Knowledge Agents via Reinforcement Learning), was trained entirely on synthetic data the agent generated itself, with no human labeling required. The training cost: a few thousand GPU hours.

The surprising part? It generalizes to tasks it was never trained on.

33% lower cost · 47% lower latency · 6 search tasks evaluated

The Problem with Enterprise Search Agents

Enterprise knowledge tasks (searching internal documents, cross-referencing information, aggregating facts from meeting notes) are fundamentally different from math or coding. They are hard-to-verify: there is often no single correct answer, and the information needed is scattered across dozens of noisy, unstructured documents.

Most existing "deep research" agents rely on public web search and black-box tools. It remains unclear whether those results generalize to proprietary, enterprise data. Meanwhile, benchmarks like HotpotQA or BrowseComp only capture a narrow slice of real-world search behaviors. Databricks wanted to answer a harder question: can a single agent master multiple types of grounded reasoning at once?

How KARL Works: Three Core Components

1. KARLBench: six search regimes in one benchmark. Rather than testing on a single task, Databricks built KARLBench to evaluate six distinct search capabilities: constraint-driven entity search (BrowseComp-Plus), cross-document report synthesis (TREC-Biogen), tabular numerical reasoning over financial reports (FinanceBench), exhaustive entity retrieval (QAMPARI), procedural reasoning over technical docs (FreshStack), and fact aggregation over internal company meeting notes (PMBench). The agent is restricted to a single tool, vector search, isolating retrieval and reasoning quality from broader tool orchestration effects.
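Restricting the agent to a single vector-search tool keeps the rollout loop simple. Here is a minimal sketch of such a loop; all names are hypothetical, and a naive token-overlap ranking stands in for real embedding search:

```python
# Minimal single-tool search agent sketch (hypothetical interface; the
# paper's actual agent API is not specified in this summary).
from dataclasses import dataclass, field

@dataclass
class VectorSearchAgent:
    """Agent restricted to one tool: vector search over a document corpus."""
    corpus: dict[str, str]                       # doc_id -> text
    history: list[str] = field(default_factory=list)

    def vector_search(self, query: str, k: int = 3) -> list[str]:
        # Stand-in for embedding search: rank documents by naive token
        # overlap with the query. A real system would use dense embeddings.
        def score(text: str) -> int:
            return len(set(query.lower().split()) & set(text.lower().split()))
        ranked = sorted(self.corpus.values(), key=score, reverse=True)
        return ranked[:k]

    def step(self, query: str) -> list[str]:
        results = self.vector_search(query)
        self.history.append(query)   # track queries for later context handling
        return results

agent = VectorSearchAgent(corpus={
    "d1": "Q3 revenue grew 12 percent year over year",
    "d2": "The onboarding checklist covers laptop setup",
})
hits = agent.step("revenue growth Q3")
print(hits[0])
```

Because retrieval is the only tool, any performance difference between agents reflects retrieval and reasoning quality rather than tool-orchestration tricks.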

KARL Training Pipeline

Stage I (Agentic Synthesis): the agent explores the corpus via vector search → generates Q&A pairs → a deduplication filter removes test leaks.

Stage II (Solution + Filtering): multiple solvers attempt each question → a pass-rate filter drops all-correct and all-wrong questions → a quality filter removes ambiguous questions.

Stage III (Multi-Task RL, OAPL): off-policy RL on filtered rollouts → iterative bootstrapping from the improved model → a multi-task loss across tasks.

Base: GLM 4.5 Air → Synthetic Data → OAPL Training → KARL Agent

Source: Chang et al., "KARL: Knowledge Agents via Reinforcement Learning," arXiv:2603.05218 (2026)

2. Agentic data synthesis (no humans needed). KARL generates its own training data through a two-stage pipeline. In Stage I, a synthesis agent explores a document corpus via vector search and proposes grounded question-answer pairs. A deduplication agent then filters out any overlap with the evaluation set. In Stage II, multiple solver agents independently attempt each question. Questions where every solver succeeds (too easy) or every solver fails (too hard or broken) are discarded. Only questions in the "learning sweet spot," where some solvers succeed and others fail, survive to become training data.
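The Stage II pass-rate filter is easy to express in code. A minimal sketch, assuming each question comes with a list of per-solver pass/fail outcomes (the data shapes here are hypothetical):

```python
# Sketch of Stage II pass-rate filtering: keep only questions where some
# solvers succeed and some fail (the "learning sweet spot").

def pass_rate_filter(questions, solver_results):
    """questions: list of question ids.
    solver_results: dict mapping question id -> list of bool outcomes,
    one per independent solver attempt."""
    kept = []
    for q in questions:
        outcomes = solver_results[q]
        rate = sum(outcomes) / len(outcomes)
        # Discard all-correct (too easy) and all-wrong (too hard or broken).
        if 0.0 < rate < 1.0:
            kept.append(q)
    return kept

results = {
    "q1": [True, True, True],      # trivial: dropped
    "q2": [False, False, False],   # broken or too hard: dropped
    "q3": [True, False, True],     # learning sweet spot: kept
}
print(pass_rate_filter(list(results), results))  # -> ['q3']
```

The filter needs no labels at all: the solvers' disagreement is itself the signal that a question carries useful gradient.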

3. OAPL: off-policy RL that actually scales. Standard RL for language models (like GRPO) assumes the model generating training data and the model being updated stay in sync. In distributed training, they never do. Previous fixes (clipped importance weighting, data deletion) introduced instability. Databricks developed OAPL (Optimal Advantage-based Policy Optimization with Lagged Inference policy), which embraces off-policyness by design. Think of it as a regression objective that fits the model toward the optimal policy, rather than fighting the lag between data generation and training. OAPL remains stable even when the model generating rollouts is more than 400 gradient steps behind the model being trained. That is roughly 100x more off-policy than previous approaches tolerated. In code generation experiments, OAPL matched GRPO using approximately 3x fewer training samples.
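The paper's exact OAPL objective is not reproduced in this summary, but the regression-versus-ratio intuition can be illustrated with a toy contrast (all numbers illustrative, not from the paper):

```python
# Toy contrast (NOT the actual OAPL math): a regression-style objective
# stays bounded under arbitrary policy lag, while the importance-weight
# ratio used by on-policy-style objectives grows as exp(lag).
import math

def regression_style_loss(logp_new, logp_behavior, advantages):
    # Squared-error fit of log-prob shifts toward (already scaled)
    # advantages: defined and bounded however stale logp_behavior is.
    errs = [(n - b - a) ** 2
            for n, b, a in zip(logp_new, logp_behavior, advantages)]
    return sum(errs) / len(errs)

def importance_ratio(logp_new, logp_behavior):
    # The pi_new / pi_behavior ratio central to GRPO-style objectives;
    # it explodes exponentially as the two policies drift apart.
    return [math.exp(n - b) for n, b in zip(logp_new, logp_behavior)]

logp_b = [-2.0, -1.5, -3.0]          # stale behavior-policy log-probs
logp_n = [b + 5.0 for b in logp_b]   # current model, far off-policy
adv = [0.3, -0.1, 0.8]

print(round(regression_style_loss(logp_n, logp_b, adv), 2))
print(max(importance_ratio(logp_n, logp_b)))  # ~148: ratios explode
```

This is why clipped importance weighting needs careful tuning while a regression target does not: the squared error degrades gracefully with lag instead of exponentially.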

Two Findings That Stood Out

Multi-task RL generalizes; single-task RL does not. KARL-TREC (trained only on TREC-Biogen) scored 85.0 on its target task but failed to transfer to BrowseComp-Plus. KARL-BCP (trained only on BrowseComp-Plus) reached 59.6 on its task but similarly failed on TREC-Biogen. Training on both tasks simultaneously, KARL matched or improved performance on each while also generalizing to four held-out tasks it had never seen during training.

The agent learned to compress its own context, end-to-end. Some KARLBench tasks require over 200 sequential vector database queries, exhausting the context window many times. Rather than training a separate summarization model, the team included compression as part of the RL training loop. The agent learned what to keep and what to discard, guided only by the final task outcome. Removing this learned compression dropped accuracy on BrowseComp-Plus from 57% to 39%.
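The in-loop compression idea can be sketched as follows. In KARL the keep/discard policy is learned via RL from the final task reward; this sketch substitutes a simple usefulness heuristic and hypothetical shapes, purely to show where compression sits inside the rollout:

```python
# Sketch of in-loop context compression: the agent trims its own context
# whenever it exceeds a budget. Because compression happens inside the
# rollout, the final task reward can credit or penalize what was kept.
# KARL learns this policy end-to-end; here a heuristic stands in for it.

def compress(context, budget, usefulness):
    """Keep the highest-usefulness snippets that fit within the budget."""
    ranked = sorted(context, key=lambda s: usefulness.get(s, 0.0), reverse=True)
    kept, used = [], 0
    for snippet in ranked:
        if used + len(snippet) <= budget:
            kept.append(snippet)
            used += len(snippet)
    return kept

def rollout(queries, search, budget=60):
    context, usefulness = [], {}
    for q in queries:
        for snippet in search(q):
            context.append(snippet)
            # Heuristic usefulness: token overlap with the current query.
            usefulness[snippet] = len(set(q.split()) & set(snippet.split()))
        if sum(len(s) for s in context) > budget:
            context = compress(context, budget, usefulness)
    return context

def fake_search(q):
    return [f"doc about {q}", "boilerplate footer text repeated often"]

ctx = rollout(["revenue growth", "churn rate"], fake_search)
print(ctx)
```

The key design point is that `compress` runs between search steps, so a policy trained on final accuracy implicitly learns what is safe to discard mid-search.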

Performance Snapshot (Notable KARLBench Tasks)

BrowseComp-Plus (constraint-driven entity search): KARL+VGS 70.4 · KARL 57.0 · Base (GLM 4.5 Air) 39.0

TREC-Biogen (cross-document report synthesis): KARL-TREC 85.0 · KARL ~80

Source: Chang et al., arXiv:2603.05218, Tables 4 & related (2026). VGS = Value-Guided Search.

Key Insights

Off-policy RL reduces infrastructure complexity. OAPL's regression-based objective stays stable at 400+ gradient steps of policy lag. This eliminates the need for clipped importance weighting, data deletion, or router replay that previous online RL methods required for large mixture-of-experts models. If you are running distributed RL training, off-policy-by-design may be simpler than trying to keep everything on-policy.

Training data diversity matters more than volume. The pass-rate filtering strategy (discard all-correct and all-wrong questions, keep only the "learning sweet spot") ensures the RL signal is neither trivial nor impossible. Combined with quality filtering for ambiguity and factual errors, this creates a small, high-signal training set from entirely synthetic data.

Let the agent learn its own memory management. Instead of building a separate compression model, KARL trains context compression end-to-end via the task outcome signal. The agent learns what information to preserve and what to discard during long multi-step searches. This single design choice accounts for an 18-point accuracy difference on BrowseComp-Plus (57% vs 39% without compression).

The key takeaway here is not KARL itself but the recipe. Databricks is now offering these same RL pipelines to customers for building custom agents on their own enterprise tasks. If a few thousand GPU hours and zero human labels can produce a model that is Pareto-optimal against frontier systems on six search tasks, the barrier to building domain-specific agents just dropped considerably.


Source: Chang et al., "KARL: Knowledge Agents via Reinforcement Learning" (arXiv:2603.05218)
