In partnership with

Big Pharma's $240B White Flag Is One Startup's Ticket

Big Pharma spent decades and billions trying to solve osteoarthritis, a $500B market they’ve never cracked.

Thankfully, Cytonics figured out why they keep failing: joints are attacked by multiple culprits at once, and Big Pharma only ever went after one at a time.

So Cytonics discovered a way to get them all, creating the first therapy with the potential to actually address the root cause of osteoarthritis at the molecular level. It’s already proven across 10,000+ patients. Now, they’re pushing toward FDA approval on a 200% more potent version that can be manufactured at scale.

The first human safety trial is already complete with zero adverse events. If approved, it could finally deliver a long-needed solution to the more than 500M osteoarthritis patients worldwide.

Big Pharma created this opening. Now Cytonics is prepared to seize it.

1M Lines. Zero Typed by Humans. The Harness Did the Work.

ResearchAudio.io


OpenAI and Anthropic converged on the same architecture for agent-built software.

1M+ lines of code · 1,500 automated PRs · 3 engineers

The Lead

Here's the part nobody's talking about: two competing AI labs, working independently, arrived at the same conclusion about how to build software with agents. And the conclusion wasn't about model intelligence.
In February 2026, OpenAI published their "Harness Engineering" report. Three engineers used Codex agents to ship an internal product containing over one million lines of code. Zero lines were typed by hand. The entire codebase was generated, tested, and deployed by AI agents across roughly 1,500 pull requests over five months. That's about 3.5 PRs per engineer per day, at roughly one-tenth the time manual coding would have taken.
Then in March, Anthropic published their harness design post. Prithvi Rajasekaran from the Labs team described building a three-agent system (planner, generator, evaluator) that autonomously produced full-stack applications over multi-hour sessions. A solo agent spent 20 minutes and $9 on a retro game maker and produced something that looked impressive but was fundamentally broken. The harness version spent 6 hours and $200 on the same prompt, and produced a working application.
Both teams discovered the same thing: the model is not the bottleneck. The harness is. Change the model and output quality shifts by 10-15%. Change the harness, and you change whether the system works at all.
Component | OpenAI (Codex) | Anthropic (Claude)
Context strategy | 100-line map + docs/ | Context resets + handoffs
Quality control | Linters + CI tests | Evaluator agent + Playwright
Architecture | Single agent + guardrails | Planner + generator + evaluator
Core insight | Map, don't encyclopedia | Separate generation from judgment
Source: OpenAI (Feb 2026) + Anthropic (Mar 2026)

OpenAI: Give Agents a Map, Not an Encyclopedia

OpenAI's first instinct was to give agents everything: a massive agents.md packed with coding standards, architecture decisions, and project context. It backfired. Too much context crowded out the actual task.
The fix: a roughly 100-line agents.md that works as a table of contents, pointing to a structured docs/ directory. Progressive disclosure. The same pattern that makes good UIs work: show what's needed at the current level, link to depth when it's relevant.
Here's a concrete example of what that docs/ directory looked like. Design documentation was catalogued and indexed, including verification status. Architecture docs provided a top-level map of domains and package layering. A quality document graded each product domain and architectural layer, tracking gaps over time. Cross-linking between these docs was mechanically enforced by linters and CI validation.
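To make the pattern tangible, here is a hypothetical sketch of what a table-of-contents style agents.md could look like. The file names and wording are illustrative, not OpenAI's actual layout:

```
# agents.md  (map, not encyclopedia)
Read only the doc that matches your task; do not load everything.

- Architecture map:          docs/architecture.md
- Design docs index:         docs/design/index.md   (includes verification status)
- Quality grades by domain:  docs/quality.md
- Coding standards:          docs/standards.md

Rules:
- Dependencies flow Types -> Config -> Repo -> Service -> Runtime -> UI.
- Cross-links between these docs are validated by linters and CI.
```

The point is that every line here points outward; the agent pulls in depth only when the task demands it.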
But documentation alone wasn't enough. OpenAI enforced architectural boundaries mechanically. Custom linters had remediation instructions baked directly into error messages. When an agent violated a dependency boundary, the error told it what the boundary was, why it existed, and how to fix it. Think of it like guardrails with a built-in driving instructor. Dependencies flowed in a strict sequence: Types, then Config, then Repo, then Service, then Runtime, then UI. Structural tests validated compliance.
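The remediation-in-the-error-message idea can be sketched in a few lines. This is an illustrative toy linter built around the layering order described above, not OpenAI's actual tooling:

```python
from typing import Optional

# Strict dependency order: earlier layers must not import later ones.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]

def check_import(importer: str, imported: str) -> Optional[str]:
    """Return a remediation-rich error if `importer` depends on a later layer."""
    if LAYERS.index(imported) > LAYERS.index(importer):
        return (
            f"Boundary violation: '{importer}' may not import '{imported}'. "
            f"Dependencies flow {' -> '.join(LAYERS)} so lower layers stay "
            f"reusable and testable in isolation. Fix: move the shared code "
            f"into '{importer}' or a lower layer, or depend on an interface "
            f"defined below the boundary."
        )
    return None  # allowed: importing from the same or a lower layer
```

The error is the lesson: an agent that trips the rule is told what the boundary is, why it exists, and two ways to fix it.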
The result: engineers shifted from writing code to designing environments and feedback loops. The error message became the teaching moment. The linter became the code reviewer. And human time became the sole scarce resource.
Context (agents.md: 100-line map) → Enforce (linters + CI: error = lesson) → Generate (Codex agent: scoped PRs) → Ship (production: 3.5 PRs/day)
Source: OpenAI, "Harness Engineering" (Feb 2026)

Anthropic: Separate the Builder from the Judge

Anthropic approached the problem differently, but landed in the same place. Their insight: agents are terrible at evaluating their own work. When asked to self-evaluate, models confidently praise mediocre output. This surprised me less than it probably should have.
The solution was inspired by GANs: separate generation from judgment entirely. A planner expands a 1-4 sentence prompt into a full product spec. A generator builds the application sprint by sprint. An evaluator uses Playwright to actually navigate the running app, clicking through features like a real user before filing specific bug reports.
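The control flow of that three-agent loop is simple to state. Here is a minimal sketch with stand-in agent functions (the specs, bug strings, and round limit are invented for illustration; the real agents are LLM calls, and the real evaluator drives Playwright):

```python
def plan(prompt):
    # Planner: expand a short prompt into a list of sprint specs (stub).
    return ["canvas", "tools", "export"]

def generate(spec, feedback):
    # Generator: build one sprint, applying any bug reports from last round (stub).
    return {"spec": spec, "fixes": list(feedback)}

def evaluate(build):
    # Evaluator: exercise the running app and file bug reports (stub).
    # Empty list means the sprint passes.
    if build["spec"] == "tools" and not build["fixes"]:
        return ["rect fill places tiles only at drag endpoints"]
    return []

def run_harness(prompt, max_rounds=3):
    shipped = []
    for spec in plan(prompt):
        feedback = []
        for _ in range(max_rounds):
            build = generate(spec, feedback)
            feedback = evaluate(build)
            if not feedback:
                break  # evaluator signed off on this sprint
        shipped.append(build)
    return shipped
```

The key structural choice: the generator never grades itself. Bugs only enter the loop through the evaluator, which is why the separation matters.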
The creative leaps were unexpected. In one frontend design experiment, the team prompted the model to create a website for a Dutch art museum. By the ninth iteration, it had produced a clean, dark-themed landing page. Polished, but predictable. Then on the tenth cycle, it scrapped everything and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on the walls in scattered positions, and doorway-based navigation between gallery rooms. That kind of creative leap doesn't come from a single pass.
The evaluator's findings were precise. One example from the blog: the rectangle fill tool placed tiles at drag start/end points instead of filling the region. Another: a route for rearranging animation frames was defined after a route that parsed the same path segment as an integer, returning a 422 error. A third: a delete key handler required both a selection and an active entity to be set, but clicking an entity set one without the other. These are the kinds of bugs that a single-pass agent would never catch in its own work.
Solo agent: 20 min / $9. Looks polished, core feature broken.
3-agent harness: 6 hr / $200. 16-feature spec, working game.
Source: Anthropic, retro game maker experiment (Mar 2026)
Anthropic also discovered that harness complexity should track model capability. Their earlier harness used sprint decomposition and context resets because Sonnet 4.5 exhibited "context anxiety," prematurely wrapping up work as the context window filled. Opus 4.6 removed that behavior, so they dropped the sprint structure entirely. The updated harness ran a digital audio workstation build in about 4 hours for $124, with the generator working coherently for over 2 hours straight.
The prompt was one sentence: "Build a fully featured digital audio workstation in the browser using the Web Audio API." What came out had a working arrangement view, mixer, and transport. An integrated Claude agent set the tempo and key, laid down a melody, built a drum track, adjusted mixer levels, and added reverb. Here's where the $124 went:
Phase | Time | Cost
Planner | 4.7 min | $0.46
Build (round 1) | 2 hr 7 min | $71.08
Eval (round 1) | 8.8 min | $3.24
Build + eval (rounds 2-3) | 1 hr 30 min | $49.92
Source: Anthropic, audio workstation experiment (Mar 2026)
Most of the cost went to the first build round. But the evaluator earned its keep. In round 1, it flagged that audio recording was stub-based (the button toggled but captured nothing), clips couldn't be dragged on the timeline, and effect visualizations were numeric sliders instead of graphical curves. The generator fixed these in subsequent rounds.

The key insight: Every component in a harness encodes an assumption about what the model can't do on its own. Those assumptions are worth stress testing, because they go stale as models improve. The interesting harness design space doesn't shrink with better models. It moves.

The Convergence: 3 Shared Principles

I expected different labs to produce different conclusions. They didn't. Three principles emerged from both approaches.
1. Structured documentation is infrastructure, not nice-to-have. OpenAI built a docs/ directory as the single source of truth, mechanically enforced with linters and CI. Anthropic used structured handoff artifacts to carry context between agent sessions. Both discovered that agents need precise, scoped knowledge, not a dump of everything.
2. Mechanical enforcement beats verbal instruction. OpenAI's custom linters with remediation in error messages. Anthropic's evaluator agent using Playwright to test the actual running application. Neither trusted the model to follow instructions. Both built systems that caught failures automatically.
3. The harness should be rippable. Both labs explicitly recommend starting simple, adding complexity when needed. As Anthropic put it: find the simplest solution possible. As OpenAI demonstrated: give agents a map, not an encyclopedia. Over-engineering the harness is as dangerous as under-engineering it.
The engineer's job shifted from writing code to designing environments where agents write code. The instruction file became the primary artifact that compounds.

Quick Hits

Context anxiety is real. Anthropic found that Sonnet 4.5 started wrapping up work prematurely as the context window filled. Context resets (not just compaction) were the fix. Opus 4.6 largely eliminated this behavior, allowing 2+ hour coherent coding sessions.
Sprint contracts prevent scope drift. Before each sprint, Anthropic's generator and evaluator negotiate what "done" looks like. The generator proposes what it will build and how completion will be verified. The evaluator reviews the proposal. They iterate until they agree. One sprint had 27 testable criteria for the level editor alone. This front-loaded negotiation kept the work faithful to the spec without over-specifying implementation.
Martin Fowler endorsed the framing. His commentary on the MartinFowler.com blog noted that harness engineering fills a gap in the industry's vocabulary for controlling AI agents. He broke the discipline into three components: context engineering, architectural constraints, and entropy management.
Evaluator tuning is hard. Anthropic's evaluator was too lenient out of the box. It would identify legitimate issues, then talk itself into approving the work anyway. It also tested superficially, so subtle bugs in nested features slipped through. It took several rounds of calibrating the evaluator's prompt against human judgment before it graded reasonably.
Planners add ambition that solo agents lack. Given a one-sentence prompt for a retro game maker, Anthropic's planner expanded it into a 16-feature spec across ten sprints, including a sprite animation system, behavior templates, sound effects, an AI-assisted sprite generator, and game export with shareable links. The solo agent, given the same prompt, built something far narrower.
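The sprint-contract negotiation from the Quick Hits reduces to a small loop: propose criteria, review them, repeat until the reviewer has no objections. A sketch with stub agents (the criteria strings and agreement rule are invented; the real propose/review steps are LLM prompts):

```python
def negotiate_contract(propose, review, max_rounds=5):
    """Iterate until the evaluator accepts the generator's 'done' criteria."""
    criteria = propose([])  # generator's opening proposal
    for _ in range(max_rounds):
        objections = review(criteria)
        if not objections:
            return criteria  # both agents agree on what "done" means
        criteria = propose(objections)  # fold objections into a new proposal
    raise RuntimeError("agents could not agree on done criteria")

# Example with stub agents: the reviewer insists on at least two criteria.
proposer = lambda objections: ["rect fill covers the dragged region"] + objections
reviewer = lambda criteria: [] if len(criteria) >= 2 else [
    "add a check that undo reverts the fill"
]
contract = negotiate_contract(proposer, reviewer)
```

Front-loading this negotiation is what pins scope before any code exists, which is why one sprint could end up with 27 testable criteria.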

What I'd Actually Use

If I were setting up a harness today, I'd start with three things from these posts.
First, restructure your agents.md. Keep it under 100 lines. Make it a table of contents pointing to a docs/ directory, not a knowledge dump. Index design docs, architecture maps, and quality grades as separate files. This is the single highest-leverage change you can make this week.
Second, put remediation in your lint errors. When a custom linter flags a violation, don't just say "rule X violated." Say what the rule is, why it exists, and how to fix it. The error message becomes the teaching moment. Your agents get smarter with every constraint you enforce this way.
Third, separate your evaluator. Even a simple script that runs the app with Playwright and clicks through 5 key flows will catch the class of bugs that generators confidently ship. Start with smoke tests. Expand the criteria over time.
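A minimal version of that evaluator script might look like this. It assumes Playwright for Python is installed (`pip install playwright` plus `playwright install`), an app running at APP_URL, and the flow selectors below are placeholders you would replace with your own key flows:

```python
APP_URL = "http://localhost:3000"  # assumed address of the app under test

# Each flow: (name, selector to click, selector that must be visible afterwards).
# These selectors are hypothetical examples, not from either lab's posts.
FLOWS = [
    ("new project", "text=New Project", "#editor"),
    ("save", "text=Save", "text=Saved"),
]

def smoke_test(flows=FLOWS, url=APP_URL):
    """Click through each key flow in a real browser; return the failures."""
    from playwright.sync_api import sync_playwright  # imported lazily
    failures = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for name, click_sel, expect_sel in flows:
            page.goto(url)
            page.click(click_sel)
            if not page.locator(expect_sel).is_visible():
                failures.append(name)
        browser.close()
    return failures

# Usage: bugs = smoke_test(); print(bugs or "all flows passed")
```

Even this crude loop catches the "button toggles but nothing happens" class of bug that a generator will confidently ship.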

The Take

Harness engineering is to AI agents what DevOps was to deployment. It's the recognition that the infrastructure around the thing matters as much as the thing itself. And just like DevOps, it's going to require a new skillset: specification writing, environment design, quality system design, agent supervision, and workflow orchestration.
If you're still writing all your own code, you're fine for now. But if you're managing agents and wondering why they keep going off the rails, the answer is almost certainly in the harness.

The Open Question

Martin Fowler raised a gap that neither lab addressed: verification of behavior. Both harnesses constrain how code is written and organized. Neither validates that the code does what users actually need. The harness catches structural bugs, but does it catch product bugs? And if not, what does that harness component look like? Reply and tell me how you'd build it.
The competitive advantage in 2026 won't go to teams with the largest model. It will go to teams with the most effective harness.
Next issue: Why Karpathy's autoresearch loop might be the first harness that improves itself, and what that means for every engineering team running overnight experiments.


Source: OpenAI, "Harness Engineering" (Feb 2026)

Source: Anthropic, "Harness Design for Long-Running Apps" (Mar 2026)
