Harness for the Claude Code


The 3 Ways Your Coding Agent Quietly Fails

ResearchAudio.io


Anthropic named all three failures last week, then shipped a structural fix.

Three failures, now named. Six patterns that beat them. One JavaScript file does it.

Hand your coding agent a 50 item security review and watch what it does.

It gets through 35. Then it tells you the job is done.

Anthropic has a name for that now. It has names for two more failures exactly like it. And last week it shipped the structural fix, as dynamic workflows in Claude Code, where Claude writes its own harness on the fly instead of running the one fixed loop a human gave it.

Start with the three failures. The fix makes sense once you have felt them.

Three ways it fails

All three show up the moment a task gets long, parallel, or adversarial. Not one of them is a prompt problem.

Failure 01: Agentic laziness. Stops early and calls a multi-part job done. The post's example: 35 of 50 security items handled, then finished. The fix: one subagent, one small goal, nothing to cut short.

Failure 02: Self-preferential bias. Prefers its own output, especially when asked to grade it against a rubric. The fix: a separate agent grades it, with no stake in the work.

Failure 03: Goal drift. Loses the objective across turns. Compaction is lossy, so edge cases and "do not" constraints fall out. The fix: the objective lives in the JS loop, not the window.
Source: Anthropic, dynamic workflows in Claude Code (2026)

The one thing they share

Line the three fixes up and the shape is obvious. Smaller goals. A separate grader. An objective kept outside the context window. Every one is doing the same job: prying planning and execution apart.

Agentic laziness, self-preferential bias, and goal drift are not flaws in your prompt. They are what happens when one context window has to plan and execute at the same time.

Which raises the obvious question, and the answer is smaller than you would guess.

What a workflow actually is

A dynamic workflow is a JavaScript file. It runs a few special functions that spawn and coordinate subagents, plus the standard JavaScript building blocks (parsing, math, arrays) for moving data between steps. Picture the flow first.

Diagram: one window plans the run › subagents for chunk A, chunk B, and chunk C › barrier waits for all › one result, merged once.

Fan out into parallel subagents, then a barrier waits for every branch before one agent merges the pieces.

Here is the same shape in code, stripped to the essentials.

The shape, not the exact API

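// A workflow is plain JavaScript that spawns and coordinates subagents.
// `subagent`, `items`, and `combine` are placeholders for this sketch,
// not the exact API.

// Fan out: one subagent per item, all running in parallel.
const checks = await Promise.all(
  items.map((it) => subagent({
    model: "sonnet",   // intelligence level chosen per step
    worktree: true,    // isolate each run in its own worktree
    goal: `review ${it}`
  }))
);

// Barrier: Promise.all resolves only after every branch finishes,
// then one agent merges the pieces.
const out = await subagent({
  model: "opus",
  goal: combine(checks)
});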

Read it back against the three failures. Each subagent has one small goal, so there is nothing to cut short. A different agent grades the work, so nothing is marking its own paper. And the goal sits in the loop, in code, not in a summary that thins out every time the context is compacted.

The workflow can also decide which model each subagent runs, and whether each runs in its own worktree, so Claude picks the intelligence level and the isolation per step. If you quit the terminal mid-run, resuming the session picks the workflow back up where it stopped.

The key insight: the deterministic JavaScript loop is the part that does not drift. Your objective is encoded in code instead of living in a summary that degrades every time the context is compacted, which is why the structure holds even when any single agent's window fills up.
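To make that concrete, here is a minimal sketch of a loop that owns the checklist. Everything named here is an assumption for illustration: `runSubagent` is a stub, not the Claude Code workflow API, and it fails on purpose for every seventh item on the first attempt. The only point is that the 50 items live in a plain array, so no agent gets to declare the job done at item 35.

```javascript
// Hypothetical stand-in for a real subagent call (an assumption, not the
// actual workflow API). It pretends to review one checklist item and
// comes back unfinished on the first attempt for every seventh item.
const attempts = new Map();
async function runSubagent({ goal }) {
  const n = (attempts.get(goal) ?? 0) + 1;
  attempts.set(goal, n);
  const itemNumber = Number(goal.split("#")[1]);
  const done = itemNumber % 7 !== 0 || n > 1;
  return { goal, done };
}

// The objective lives here, in code: every item must come back done.
// Unfinished items are re-queued for the next round, up to maxRounds.
async function reviewAll(items, maxRounds = 3) {
  let pending = [...items];
  const finished = [];
  for (let round = 0; round < maxRounds && pending.length > 0; round++) {
    const results = await Promise.all(
      pending.map((item) => runSubagent({ goal: `review item #${item}` }))
    );
    const stillPending = [];
    results.forEach((r, i) => {
      (r.done ? finished : stillPending).push(pending[i]);
    });
    pending = stillPending;
  }
  return { finished, pending };
}

const items = Array.from({ length: 50 }, (_, i) => i + 1);
const reviewPromise = reviewAll(items);
reviewPromise.then(({ finished, pending }) => {
  // prints: finished 50 of 50, pending 0
  console.log(`finished ${finished.length} of ${items.length}, pending ${pending.length}`);
});
```

The retry cap is a design choice in this sketch: anything still pending after `maxRounds` is reported rather than silently dropped.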

Why it had to be now

You could already wire Claude Code instances together by hand, with claude -p or similar tooling. The catch is that a hand-built workflow has to anticipate every edge case, so it ends up generic. A dynamic one is written for the single task in front of you, which works because Claude Opus 4.8 is sharp enough to write a working harness on the spot.

Static harness

Written once, by hand

Has to anticipate every edge case, so it stays generic. The Agent platform, or claude -p. Reusable and predictable, but blunt.

Dynamic workflow

Written per task, by Claude

Tailored to the one job in front of you. Needs a model sharp enough to author it. Specific and precise, then thrown away.

Six moves it makes

Claude mixes a handful of these when it builds a workflow. Knowing the names lets you ask for the right one.

Classify and act: A small classifier reads each task and routes it to the right agent.
Fan out and synthesize: Split into pieces, run them in parallel, then merge at a barrier into one result.
Adversarial verification: A fresh agent checks the first one's output, with no stake in defending it.
Generate and filter: Many ideas in, dedupe and test, the strongest ones come out.
Tournament: Pairwise duels, a judge advances the stronger of two each round.
Loop until done: Keep spawning agents until a stop condition is met, not a fixed count.
Source: Anthropic, dynamic workflows in Claude Code (2026)

Two of these hide the real lessons. Fan out works because synthesize is a hard barrier: every branch finishes before the merge. And tournament beats scoring because comparative judgment, this or that, is steadier than asking for an absolute number.

tournament, not scoring

Diagram: option A, option B, option C, and option D pair off (A or B, C or D) › one advances.

Pairwise duels instead of one absolute score. The stronger of each pair advances until a single answer is left.
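A bracket like that is a few lines of plain JavaScript. This is a sketch under stated assumptions: `judge` is a hypothetical stand-in for a comparison subagent, stubbed here to prefer the longer string so the example runs on its own, and `tournament` is an illustrative name, not part of any real API.

```javascript
// Hypothetical judge: a real workflow would ask a subagent "A or B?".
// Stubbed to prefer the longer string so the example is deterministic.
async function judge(a, b) {
  return a.length >= b.length ? a : b;
}

// Single-elimination bracket: pair up candidates, advance each winner,
// and repeat until one is left. An odd candidate out gets a bye.
async function tournament(candidates, pick = judge) {
  if (candidates.length === 0) {
    throw new Error("need at least one candidate");
  }
  let round = [...candidates];
  while (round.length > 1) {
    const duels = [];
    for (let i = 0; i < round.length; i += 2) {
      duels.push(i + 1 < round.length ? pick(round[i], round[i + 1]) : round[i]);
    }
    // Duels within a round are independent, so they run in parallel.
    round = await Promise.all(duels);
  }
  return round[0];
}

const winnerPromise = tournament(["option A", "option BB", "option C", "option DDD"]);
winnerPromise.then((winner) => console.log(winner)); // prints: option DDD
```

The judge only ever sees two candidates at a time, which is the whole trick: no agent is asked for an absolute score.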

One of these patterns moved an entire JavaScript runtime to a new language. That story is two scrolls down.

Where teams are pointing it

The post walks through where this already earns its tokens. A sample of the shapes showing up in real work:

Root-cause hunts: Spin disjoint evidence hypotheses, then verifiers and refuters argue each one down.
Rule adherence: One verifier per rule, plus a skeptic that assumes a violation until proven otherwise.
Model routing: A classifier sends easy work to Sonnet and hard work to Opus, item by item.
Evals at scale: Run a grader fleet over thousands of cases instead of one judge in one window.
Exploration and taste: Generate many directions, then a filter keeps the few worth your time.
Triage in bulk: Classify, dedupe, then act, with a quarantine boundary around untrusted input.
Source: Anthropic, dynamic workflows in Claude Code (2026)
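The rule adherence shape is the easiest of these to picture in code. What follows is a rough sketch, not the post's implementation: the rules, `verifyRule`, and `skeptic` are all hypothetical, and the verifiers are plain regex checks standing in for subagents that would each get their own context.

```javascript
// Hypothetical rules. In a real workflow each check would be a separate
// subagent; here they are plain functions so the example runs as-is.
const rules = [
  { name: "no-console-log", violatedBy: (code) => /console\.log/.test(code) },
  { name: "no-var", violatedBy: (code) => /\bvar\s/.test(code) },
  { name: "no-eval", violatedBy: (code) => /\beval\(/.test(code) },
];

// One verifier per rule: each looks at the code with exactly one question.
async function verifyRule(rule, code) {
  return { rule: rule.name, violated: rule.violatedBy(code) };
}

// The skeptic assumes a violation until the report proves otherwise:
// a missing or malformed report counts as a violation.
function skeptic(report) {
  return report && report.violated === false ? "pass" : "violation";
}

async function checkAdherence(code) {
  const reports = await Promise.all(rules.map((r) => verifyRule(r, code)));
  return Object.fromEntries(reports.map((r) => [r.rule, skeptic(r)]));
}

const adherencePromise = checkAdherence("var x = 1; console.log(x);");
adherencePromise.then((verdicts) => console.log(verdicts));
```

The default matters more than the checks. A verifier that crashes or returns nothing does not get read as a pass.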

What it costs

This burns tokens. A workflow spends far more than a single pass, which is exactly why you cap it with a budget like "use 10k tokens" and route the easy steps to a smaller model. The math pays on work where being right matters more than being fast, and nowhere else.

Diagram: incoming task › classifier › Sonnet for routine work, Opus for the hard cases.

A classifier reads each item and sends routine work to a smaller model, the hard cases to a bigger one.
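Routing plus a budget cap might look something like this. Treat every name and number as an assumption: `classify` is a keyword rule standing in for a small classifier agent, and the per-task token estimates are made up purely to show the budget check.

```javascript
// All names here are illustrative assumptions, not the Claude Code API.
// classify() stands in for a small classifier agent; it is a keyword
// rule so the example is deterministic.
function classify(task) {
  return /race condition|security|migration/i.test(task) ? "hard" : "routine";
}

const MODEL_FOR = { routine: "sonnet", hard: "opus" };
// Made-up per-task token estimates, only to demonstrate the cap.
const EST_TOKENS = { sonnet: 1000, opus: 4000 };

// Route each task to a model, skipping any task that would push the
// running total over the budget.
function plan(tasks, budgetTokens) {
  const routed = [];
  const skipped = [];
  let spent = 0;
  for (const task of tasks) {
    const model = MODEL_FOR[classify(task)];
    const cost = EST_TOKENS[model];
    if (spent + cost > budgetTokens) {
      skipped.push(task);
      continue;
    }
    spent += cost;
    routed.push({ task, model });
  }
  return { routed, skipped, spent };
}

const tasks = [
  "rename a variable",
  "fix race condition in cache",
  "update README",
  "audit security headers",
];
const routingPlan = plan(tasks, 10000);
console.log(routingPlan.routed.map((r) => `${r.model}: ${r.task}`));
console.log(`spent ${routingPlan.spent}, skipped ${routingPlan.skipped.length}`);
// last line prints: spent 10000, skipped 0
```

Drop the budget to 9000 and the last Opus task no longer fits, so it lands in `skipped` instead of quietly overspending.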

Try this in the next hour

The lowest-risk way to feel the difference today: take a doc or a blog post you are about to ship, open Claude Code, and run: "verify every technical claim in this against the codebase using a workflow." That is the deep verification pattern in miniature. One agent finds the claims, others check each one, and you never touch your main branch.

deep verification, one pass

1	Pull every claim out of the draft.
2	Hand each claim to its own verifier.
3	Each one checks it against the actual codebase.
4	Collect the claims that did not hold, and report them.
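Those four steps map onto code almost line for line. A toy sketch, with loud assumptions: `extractClaims` and `verifyClaim` are hypothetical stubs for what would be subagents, the "codebase" is a tiny in-memory object, and claims are restricted to the form "module exports name" so the check can be deterministic.

```javascript
// Step 1 stand-in: a real workflow would ask an agent to extract claims.
// Here each non-empty line of the draft counts as one claim.
function extractClaims(draft) {
  return draft.split("\n").map((s) => s.trim()).filter(Boolean);
}

// Steps 2 and 3 stand-in: one verifier per claim. A real verifier would
// be a subagent reading the repo; this stub checks an in-memory object.
async function verifyClaim(claim, codebase) {
  const match = claim.match(/^(\w+) exports (\w+)$/);
  const holds = Boolean(match && codebase[match[1]]?.includes(match[2]));
  return { claim, holds };
}

// Step 4: collect only the claims that did not hold.
async function deepVerify(draft, codebase) {
  const claims = extractClaims(draft);
  const verdicts = await Promise.all(claims.map((c) => verifyClaim(c, codebase)));
  return verdicts.filter((v) => !v.holds).map((v) => v.claim);
}

const codebase = { parser: ["parse", "tokenize"], cache: ["get", "set"] };
const draft = `parser exports parse
cache exports evict
cache exports get`;

const failedPromise = deepVerify(draft, codebase);
failedPromise.then((failed) => console.log(failed)); // prints: [ 'cache exports evict' ]
```

Note that a claim the verifier cannot parse is reported as not holding. For a pre-publish check, that is the safer default.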

Want to force a workflow on something bigger? The trigger word is ultracode. For something small, ask for a quick workflow, like a single adversarial check on one assumption.

The controls

Six ways to steer a workflow, all from the prompt or the menu:

ultracode	Forces Claude to build a workflow for the task.
quick workflow	Asks for a small one, like a single adversarial check.
use 10k tokens	Caps the spend before it runs.
/loop	Runs a stored workflow on a schedule.
/goal	Sets a hard completion bar it cannot skip.
press s	Keeps the workflow on disk to reuse.

Store them in your .claude/workflows folder, or ship one inside a skill by pointing to the JavaScript files and telling Claude to treat them as templates, not scripts to run verbatim.

Quick hits

Bun's rewrite. This is the rebuild teased a moment ago. Bun, the JavaScript runtime, was moved from Zig to Rust using workflows. The recipe: break the migration into units (callsites, failing tests, modules), spin up a subagent per fix in its own worktree, have a second agent adversarially review, then merge. One tip from the team is to tell the agents to avoid resource-heavy commands so you can run more of them in parallel without starving your machine.

Deep research is built on this. The deep research skill inside Claude Code uses dynamic workflows. It fans out web searches, fetches sources, adversarially verifies their claims, and synthesizes a cited report. The same shape works on a Slack history or a codebase, far beyond the open web.

Sorting at scale. Sorting a thousand support tickets by severity in one prompt degrades fast and will not fit in context. Run pairwise comparison agents instead, because comparative judgment is more reliable than absolute scoring. The deterministic loop holds the bracket, and the running sequence is all that stays in context.
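Here is one way that could look. The post does not say which sorting algorithm to use, so merge sort is my choice for this sketch, and `moreSevere` is a hypothetical comparison agent stubbed with a keyword rank so the example runs without a model.

```javascript
// Hypothetical comparison agent: answers "is ticket a at least as severe
// as ticket b?". Stubbed with a keyword rank so it is deterministic.
const RANK = { outage: 3, "data loss": 3, error: 2, typo: 1 };
function severity(ticket) {
  let best = 0;
  for (const [word, rank] of Object.entries(RANK)) {
    if (ticket.toLowerCase().includes(word)) best = Math.max(best, rank);
  }
  return best;
}
async function moreSevere(a, b) {
  return severity(a) >= severity(b);
}

// Merge sort driven by pairwise judgments. The loop holds the ordering;
// each comparison only ever sees two tickets.
async function sortBySeverity(tickets, isMoreSevere = moreSevere) {
  if (tickets.length <= 1) return [...tickets];
  const mid = Math.floor(tickets.length / 2);
  const [left, right] = await Promise.all([
    sortBySeverity(tickets.slice(0, mid), isMoreSevere),
    sortBySeverity(tickets.slice(mid), isMoreSevere),
  ]);
  const merged = [];
  let i = 0;
  let j = 0;
  while (i < left.length && j < right.length) {
    const leftWins = await isMoreSevere(left[i], right[j]);
    merged.push(leftWins ? left[i++] : right[j++]);
  }
  return merged.concat(left.slice(i), right.slice(j));
}

const sortedPromise = sortBySeverity([
  "Typo on pricing page",
  "Full outage in EU",
  "Error on export",
]);
sortedPromise.then((sorted) => console.log(sorted));
// prints: [ 'Full outage in EU', 'Error on export', 'Typo on pricing page' ]
```

Merge sort needs on the order of n log n comparisons, so a thousand tickets is roughly ten thousand small judgments. That is a lot of calls, but each one is tiny and none of them needs the full list in context.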

The workflow files I actually run for each of these are going into the paid archive this month.

The take

The feature will grab the headlines, but the more useful thing in this post is the vocabulary. Agentic laziness, self-preferential bias, and goal drift are things every agent builder has hit and struggled to name. Having shared words for why long runs degrade, plus a structural fix instead of more prompt whispering, is worth more than any single workflow.

The honest part of the post is the section on when not to use this. Most coding tasks do not need a panel of five reviewers. The failure mode I expect is people reaching for a six-agent tournament to rename a variable and burning tokens to prove a point. The skill is knowing when one clean context window is still the right tool.

One thing I am still watching: adversarial verification still uses Claude to check Claude. Separate context windows help because the verifier has not fallen for the work yet, but it is the same underlying model. Whether that kills self-preferential bias rather than merely denting it is an open empirical question.

The open question

So where is the real line between a task that needs a workflow and one that does not? The post hands you a heuristic, "does it really need more compute?", but that is a judgment you calibrate with reps. Tell me the smallest task where a workflow actually paid for itself for you, and what the tell was. I want to map where that boundary really sits, because the docs will not draw it for you.

Worth screenshotting

The harness stopped being the thing you build and became the thing you describe.

The hard part of agent engineering moves from writing orchestration code to specifying the task and its checks well enough that the model can build the orchestration for you.

Coming up: the quarantine pattern, in detail. How a triage workflow can let agents read untrusted public content without letting those same agents take destructive actions, and why that one boundary is what makes leaving an agent running overnight defensible.


Source: A harness for every task: dynamic workflows in Claude Code, by Thariq Shihipar and Sid Bidasaria, Anthropic.
