Grep Beats Vector in 10 of 10 Agent Setups

ResearchAudio.io

Same Claude Opus, same data: switching harness moves accuracy 16 points.

PwC researchers ran Claude Opus 4.6 over the same conversation corpus with the same retrieval tool. Switching the agent harness from a custom build to Claude Code dropped accuracy from 93.1% to 76.7%. The retrieval choice mattered. The harness mattered more.
Retrieval-augmented generation has settled into a default. Index documents in a vector store, embed the query, return the nearest neighbors. Most production RAG systems run this pipeline without revisiting the assumption.
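For readers who want the default pipeline in concrete terms, here is a minimal sketch of index, embed, and nearest-neighbor lookup. It is an illustration only: the `embed` function is a toy bag-of-words stand-in for a learned embedding model, and the documents, function names, and query are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Index step: store each document next to its vector."""
    return [(doc, embed(doc)) for doc in docs]


def nearest(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Query step: embed the query and return the k most similar documents."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


# Hypothetical conversation snippets, not taken from LongMemEval.
docs = [
    "I adopted a cat named Miso last spring.",
    "My flight to Lisbon leaves on 2023-05-14.",
    "We talked about sourdough starters.",
]
index = build_index(docs)
top = nearest(index, "When does my flight to Lisbon leave?", k=1)
```

A production system would swap `embed` for a real embedding model and the sorted scan for an approximate nearest-neighbor index, but the three steps stay the same.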
A team from PwC tested whether the default holds when the retriever sits inside an agent harness instead of a fixed pipeline. They evaluated lexical search (grep) against semantic vector search across four harnesses (a custom build called Chronos, plus Claude Code, Codex, and Gemini CLI) and five frontier models. They ran two tool-delivery modes: inline (results injected into the chat context) and programmatic (results written to a file the agent must read). The evaluation used 116 questions from LongMemEval, graded by GPT-4o.
The headline finding broke the assumption.

Three Findings

Inline grep vs vector. Grep beat vector in 10 of 10 inline harness-model pairs. Widest gap: Chronos + Gemini Flash-Lite, grep 86.2% vs vector 62.9%.
Harness effect. Same Claude Opus 4.6, same corpus, inline grep: Chronos 93.1%, Claude Code 76.7%. A 16.4-point shift from harness alone.
Programmatic flip. Codex + GPT-5.4 with grep: 93.1% inline, 55.2% programmatic. A 37.9-point regression from changing tool delivery.

Source: Sen et al., arXiv:2605.15184 (May 2026)

Inline grep wins 10 of 10

Across all ten harness-model pairs tested, inline grep beat inline vector search. The narrowest gap was Claude Opus 4.6 on Claude Code: 76.7% versus 75.0%. The widest was Gemini 3.1 Flash-Lite on Chronos: 86.2% versus 62.9%, a 23.3-point spread.
The pattern held across model families. Anthropic, OpenAI, and Google backbones all scored higher with regex matching than with embedding-based retrieval. The authors attribute this to LongMemEval's structure: many answers depend on recovering literal spans (dates, counts, preferences) that survive tokenization but get blurred by embedding compression.
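To make "literal spans" concrete, the sketch below runs a grep-style regex search over a few invented conversation sessions. The session texts, the `grep` helper, and the patterns are hypothetical and are not the tool used in the paper; the point is only that a date or a count can be matched character for character.

```python
import re

# Hypothetical sessions; the real benchmark data is not reproduced here.
sessions = {
    "session_01": "User: I booked the dentist for March 3.\nAssistant: Noted.",
    "session_02": "User: I own 3 bikes now.\nAssistant: That is a lot of bikes.",
    "session_03": "User: I prefer oat milk in coffee.\nAssistant: Got it.",
}


def grep(pattern: str, corpus: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (session_id, line_number, line) for every line matching the regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for session_id, text in corpus.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((session_id, line_number, line))
    return hits


# A count and a date, each recovered as an exact span.
count_hits = grep(r"\b\d+\s+bikes\b", sessions)
date_hits = grep(r"\bmarch\s+\d{1,2}\b", sessions)
```

Note that the second line of session_02 also mentions bikes but carries no number, so the count pattern skips it. That kind of precision is what the authors credit for grep's advantage on this benchmark.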

The harness is the missing variable

Holding the model and retriever fixed and changing the harness alone produced accuracy shifts comparable to swapping retrievers. Claude Opus 4.6 with inline grep scored 93.1% on Chronos but 76.7% on Claude Code. Same model, same corpus, same retrieval mechanism. The 16.4-point delta came entirely from how the harness constructs prompts, surfaces tool results, and decides when to stop searching.
Chronos applies category-conditioned dynamic prompting tuned to LongMemEval's question types. Claude Code inherits Anthropic's general-purpose CLI ergonomics. Neither is wrong, but they produce different agents over identical data.
Insight 1. Retrieval benchmarks that compare BM25 against ANN in a static pipeline underestimate variance. The agent scaffolding contributes shifts on par with the retriever itself. If you are reporting RAG numbers without naming the harness, you are reporting an incomplete result.
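The "on par" claim can be checked with simple arithmetic on the numbers quoted above. The snippet is just a worked calculation; the variable names are mine, and the accuracies are the inline figures reported in this article.

```python
# Inline accuracies (percent) as quoted in the article.
chronos_opus_grep = 93.1
claude_code_opus_grep = 76.7
claude_code_opus_vector = 75.0
chronos_flash_lite_grep = 86.2
chronos_flash_lite_vector = 62.9


def gap(a: float, b: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(a - b, 1)


# Same model, same retriever, different harness.
harness_shift = gap(chronos_opus_grep, claude_code_opus_grep)

# Same harness and model, different retriever.
narrowest_retriever_gap = gap(claude_code_opus_grep, claude_code_opus_vector)
widest_retriever_gap = gap(chronos_flash_lite_grep, chronos_flash_lite_vector)

# The harness shift falls inside the range of retriever gaps.
harness_shift_in_range = narrowest_retriever_gap <= harness_shift <= widest_retriever_gap
```

The harness shift (16.4 points) sits between the narrowest retriever gap (1.7 points) and the widest (23.3 points), which is what "comparable to swapping retrievers" means here.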

Programmatic delivery inverts the comparison

When tool results are written to a file the agent must explicitly read, the picture changes. Programmatic vector exceeded programmatic grep on five of ten pairs. The sharpest collapse was Codex with GPT-5.4: from 93.1% inline grep to 55.2% programmatic grep, a 37.9-point regression.
The mechanism is the read-integrate-retry cycle. Inline results land in front of the model immediately. Programmatic results require the agent to issue another tool call to access them, which raises the bar on compositional tool use. Fast retrieval at the index layer does not make the end-to-end task easy if the harness turns each hit into a multi-step workflow.
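The difference between the two delivery modes can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the paper's harness code: `search`, `deliver_inline`, `deliver_programmatic`, and `read_file` are hypothetical names, and the corpus is invented.

```python
import tempfile
from pathlib import Path


def search(query: str) -> list[str]:
    """Hypothetical retriever stub; stands in for either grep or vector search."""
    corpus = ["flight to Lisbon on 2023-05-14", "cat named Miso"]
    return [line for line in corpus if query.lower() in line.lower()]


def deliver_inline(query: str) -> dict:
    """Inline mode: the hits come back inside the tool response itself."""
    return {"tool_calls": 1, "context": "\n".join(search(query))}


def deliver_programmatic(query: str, workdir: Path) -> dict:
    """Programmatic mode: the tool response only names a file holding the hits."""
    out = workdir / "results.txt"
    out.write_text("\n".join(search(query)), encoding="utf-8")
    return {"tool_calls": 1, "context": f"results written to {out.name}", "path": out}


def read_file(path: Path) -> str:
    """The second tool the agent must choose to call in programmatic mode."""
    return path.read_text(encoding="utf-8")


inline = deliver_inline("lisbon")

with tempfile.TemporaryDirectory() as tmp:
    prog = deliver_programmatic("lisbon", Path(tmp))
    # After one call, the evidence is not yet in the model's context.
    evidence_visible_after_one_call = "Lisbon" in prog["context"]
    # The agent has to issue a follow-up read to see the same evidence.
    prog_context = read_file(prog["path"])
    prog_calls = prog["tool_calls"] + 1
```

Both paths end with identical evidence, but the programmatic path only gets there if the agent decides to make the second call and then integrates what it reads.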
Insight 2. Programmatic tool delivery looks like pure upside (no context pressure, arbitrarily large result sets) until you realize it taxes the model's ability to close a multi-step loop. If your agent struggles with read-integrate-retry, programmatic mode surfaces that weakness, not retrieval quality.

Vendor harnesses carry stable biases

Experiment 2 swept the per-question session limit from 5 to roughly 60, exposing each retriever to progressively more irrelevant conversation. Both methods stayed remarkably stable. At full haystack, mean grep accuracy was 83.6% versus 78.4% for vector across the rows reported.
The more interesting pattern was that retriever preferences were stable per vendor. Claude Code favored grep for Opus and Haiku at every configuration tested. Gemini CLI Pro favored vector throughout, with the gap between the two retrievers widening to 89.7% versus 78.5% at full haystack. These preferences held across noise levels, suggesting that default prompts, transcript chunking, and tool error surfaces introduce vendor-specific biases that persist across tasks.
Insight 3. Migration between CLI agents is not retrieval-interchangeable, even when the on-disk corpus is byte-identical. Switching from Claude Code to Gemini CLI changes which retriever you should pick, because the harness shapes the agent's search policy in ways that held at every noise level tested.

What this does not show

The study uses 116 questions from LongMemEval, a long-memory conversational QA benchmark where answers often live in verbatim spans. The authors caution in Section 5 that domains with paraphrased evidence (scientific synthesis, code semantics, visual documents) may instead favor dense or hybrid retrieval. Codex programmatic-vector intermediates are also pending.
The contribution is not that grep beats vector in general. It is that retrieval choice, harness orchestration, and tool delivery path are a single jointly-evaluated system, not three independent design choices.
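One way to act on this is to record every accuracy number together with the full configuration that produced it. The sketch below is a hypothetical record format, not anything from the paper; the `Run` type and `best_retriever` helper are my own names, and the accuracies are only those quoted in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    """One result, always tagged with harness, model, retriever, and delivery."""
    harness: str
    model: str
    retriever: str
    delivery: str
    accuracy: float  # percent


# Only figures quoted in the article; missing cells are left missing.
runs = [
    Run("Chronos", "Claude Opus 4.6", "grep", "inline", 93.1),
    Run("Claude Code", "Claude Opus 4.6", "grep", "inline", 76.7),
    Run("Claude Code", "Claude Opus 4.6", "vector", "inline", 75.0),
    Run("Chronos", "Gemini 3.1 Flash-Lite", "grep", "inline", 86.2),
    Run("Chronos", "Gemini 3.1 Flash-Lite", "vector", "inline", 62.9),
    Run("Codex", "GPT-5.4", "grep", "inline", 93.1),
    Run("Codex", "GPT-5.4", "grep", "programmatic", 55.2),
]


def best_retriever(runs: list[Run], harness: str, model: str, delivery: str) -> Optional[str]:
    """Pick the better retriever within one (harness, model, delivery) cell.

    Returns None when the cell does not contain both retrievers, so a
    comparison is never made across mismatched configurations.
    """
    cell = [r for r in runs if (r.harness, r.model, r.delivery) == (harness, model, delivery)]
    if len({r.retriever for r in cell}) < 2:
        return None
    return max(cell, key=lambda r: r.accuracy).retriever


claude_code_choice = best_retriever(runs, "Claude Code", "Claude Opus 4.6", "inline")
codex_prog_choice = best_retriever(runs, "Codex", "GPT-5.4", "programmatic")
```

The helper refuses to answer when a cell is incomplete, which mirrors the article's point: a retriever comparison is only meaningful inside a fixed harness and delivery mode.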
Lexical and semantic retrieval optimize different failure modes. Grep rewards precision when answers live in literal spans. Vector rewards coverage when evidence is paraphrased. The paper shows that this tradeoff cannot be evaluated in isolation from the agent harness orchestrating the search. If you are picking a retrieval strategy for your next agent, pick the harness first.

Source

Sen, S., Kasturi, A., Lumer, E., Gulati, A., Subbiah, V. K. (2026). Is Grep All You Need? How Agent Harnesses Reshape Agentic Search. arXiv:2605.15184
