Dictate code. Ship faster.
Wispr Flow understands code syntax, technical terms, and developer jargon. Say async/await, useEffect, or try/catch and get exactly what you said. No hallucinated syntax. No broken logic.
Flow works system-wide in Cursor, VS Code, Windsurf, and every IDE. Dictate code comments, write documentation, create PRs, and give coding agents detailed context, all by talking instead of typing.
89% of messages sent with zero edits. 4x faster than typing. Millions of developers use Flow worldwide, including teams at OpenAI, Vercel, and Clay.
Available on Mac, Windows, iPhone, and now Android - free and unlimited on Android during launch.
ResearchAudio.io

20% Higher Cost. 3% Fewer Fixes. The AGENTS.md Problem.

ETH Zurich tested 4 models across 438 tasks. Context files hurt more than they help.
Everyone building with coding agents has heard the advice: add an AGENTS.md file to your repo. It tells the agent how your codebase works, what tools to use, and how to run tests. Over 60,000 public GitHub repositories now include one. But a new study from ETH Zurich reveals something uncomfortable: those context files are making agents worse at solving real tasks, while costing you more to run.

The Problem Nobody Measured

Context files like AGENTS.md and CLAUDE.md have been promoted by every major agent developer: Anthropic, OpenAI, and Alibaba's Qwen team all provide built-in commands to auto-generate them. The idea is simple: give the agent a map of your repository, your preferred tools, and your coding style, and it should perform better. The problem is that until now, nobody had rigorously tested whether this actually works.

Thibaud Gloaguen, Niels Mundler, Mark Muller, Veselin Raychev, and Martin Vechev at ETH Zurich decided to find out. They built a new benchmark called AGENTbench (138 instances across 12 repos with developer-written context files) and combined it with SWE-bench Lite (300 tasks from 11 popular repos). Then they ran four models through three settings: no context file, LLM-generated context file, and human-written context file.
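The study's three-setting comparison can be sketched as a small evaluation grid. This is a hypothetical harness, not the paper's actual code: `run_agent` is a stub standing in for launching a real agent (Claude Code, Codex, or Qwen Code) on a repo task and checking its patch against the task's tests, and the outcome logic is a placeholder so the aggregation runs.

```python
"""Sketch of the paper's three-setting evaluation grid (hypothetical harness)."""
from statistics import mean

SETTINGS = ["no_context", "llm_generated", "human_written"]

def run_agent(model: str, task_id: int, setting: str) -> dict:
    # Stub: a real harness would invoke the agent on the repository task
    # and verify whether its patch passes the task's test suite.
    solved = (task_id + len(setting)) % 3 != 0   # placeholder outcome
    steps = 20 + (5 if setting != "no_context" else 0)  # context files added steps
    return {"solved": solved, "steps": steps}

def evaluate(model: str, task_ids: list[int]) -> dict:
    """Run every task under each setting and aggregate success rate and steps."""
    results = {}
    for setting in SETTINGS:
        runs = [run_agent(model, t, setting) for t in task_ids]
        results[setting] = {
            "success_rate": mean(r["solved"] for r in runs),
            "avg_steps": mean(r["steps"] for r in runs),
        }
    return results

report = evaluate("sonnet-4.5", list(range(10)))
```

In the study, this grid spans 4 agent-model pairs and 438 tasks (138 AGENTbench + 300 SWE-bench Lite); the stub above only illustrates the shape of the comparison.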
What They Tested

The team evaluated four agent-model pairs: Claude Code with Sonnet-4.5, Codex with GPT-5.2, Codex with GPT-5.1 mini, and Qwen Code with Qwen3-30B. Each agent was tested in three settings: with no context file, with an LLM-generated context file (created using each agent's recommended initialization), and (on AGENTbench) with the developer-written context file already in the repository.

[Figure: AGENTS.md evaluation pipeline]

Why Context Files Make Things Worse

The results paint a clear picture. LLM-generated context files reduced success rates in 5 out of 8 settings. On AGENTbench, the average drop was 2%. On SWE-bench Lite, it was 0.5%. Meanwhile, every single setting showed increased step counts and inference costs, averaging a 20-23% cost increase across both benchmarks.

The core issue is obedience. Agents follow context file instructions faithfully, even when those instructions are counterproductive. This obedience leads agents to explore more broadly: they run more tests, search more files, read more files, and write more files. All of this adds steps and cost without improving outcomes. The agents also do not find relevant files any faster with context files present. The codebase overviews that 100% of Sonnet-4.5-generated context files and 99% of GPT-5.2-generated files include are essentially redundant with existing documentation.
Human vs. LLM: A Meaningful Gap

Developer-written context files outperformed LLM-generated ones for all four agents, despite not being tailored to any specific agent. Human files improved success rates by about 4% on average (compared to no context file), while LLM files decreased them by about 3%. The difference comes down to content: human-written files tend to include non-obvious information that agents cannot discover from the codebase itself, like specific tooling requirements or unusual build processes.

One notable finding: using stronger models to generate context files did not produce better results. GPT-5.2 and Sonnet-4.5 generated context files that performed no better than those from GPT-5.1 mini or Qwen3-30B. The researchers attribute this to stronger models already having enough parametric knowledge about common libraries, making the extra context redundant noise.
What Actually Works in Context Files

The paper's behavioral analysis reveals what makes the difference. Context files that help are the ones containing minimal, non-obvious requirements: which specific tools to use (uv instead of pip, bun instead of npm), how to run tests, and unusual build steps. This is information the agent cannot easily discover by reading the codebase.

What hurts are comprehensive codebase overviews, directory trees, style guides, and detailed documentation. These are things the agent can find on its own, and including them just adds tokens that the agent wastes steps processing. The Claude Code prompt was already closer to the right approach: it warns against listing components that are easily discoverable. The data confirms this intuition.
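Put together, a context file aligned with the study's findings might look like this minimal sketch. The tool choices echo the examples above; the repository specifics (test command, build step) are hypothetical placeholders, not from the paper:

```markdown
# AGENTS.md

## Tooling (non-obvious)
- Manage dependencies with `uv`, not `pip`.
- Run the test suite with: `uv run pytest tests/`

## Build quirks
- Regenerate generated code before building (hypothetical example: `make codegen`).

Deliberately omitted: directory trees, codebase overviews, and style guides.
The agent can discover those on its own, and per the study they only add cost.
```

The point is the shape, not the exact contents: a handful of requirements the agent could not infer by reading the repo, and nothing else.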
The most interesting tension in this paper is that agents are too obedient for their own good. They follow instructions in context files faithfully, even when those instructions lead them down longer, costlier paths. The implication for the field is clear: the default advice to "add an AGENTS.md" needs a serious revision. Less context, carefully chosen, beats more context every time.

Source: Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? by Gloaguen, Mundler, Muller, Raychev, Vechev (ETH Zurich, 2026)


