In partnership with

How 2M+ Professionals Stay Ahead on AI

AI is moving fast and most people are falling behind.

The Rundown AI is a free newsletter that keeps you ahead of the curve: it covers the latest AI news and teaches you how to apply it, all in just 5 minutes a day.

Plus, complete the quiz after signing up and they’ll recommend the best AI tools, guides, and courses — tailored to your needs.

ResearchAudio.io

Sonnet 4.6 Scores 72.5% on Computer Use. Opus Scores 72.7%.

The mid-tier model now matches Anthropic's flagship on most benchmarks, at 1/5th the price.

Anthropic released Claude Sonnet 4.6 on February 17, 2026. The model scores 72.5% on OSWorld-Verified, the standard benchmark for autonomous computer use. Anthropic's own flagship, Opus 4.6, scores 72.7%. The gap is 0.2 percentage points. The price difference is 5x: Sonnet runs at $3/$15 per million tokens, Opus at $15/$75.

Sixteen months ago, the first Sonnet model with computer use scored 14.9% on the same benchmark. Anthropic described it as "experimental, cumbersome, and error-prone." The trajectory from 14.9% to 72.5% is a nearly 5x improvement, and it happened at the mid-tier price point that most developers actually use.

At a glance:
SWE-bench: 79.6% (Opus: 80.8%)
OSWorld: 72.5% (Opus: 72.7%)
ARC-AGI-2: 58.3% (a 4.3x leap)
Price: $3/$15 per M tokens (unchanged from Sonnet 4.5)

Why This Matters

Frontier models deliver strong performance, but their cost limits who can use them at scale. Every API call in an agentic workflow gets multiplied thousands of times. When an agent makes 10,000 calls per task, the difference between $3 and $15 per million input tokens changes the economics of every deployment.
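To make that concrete, here is a back-of-envelope cost model in Python. The call count and per-call token sizes are illustrative assumptions, not measured figures; only the per-million-token prices come from the release.

```python
# Rough cost model for an agentic workload. The workload shape is
# invented for illustration; prices are the published $/1M-token rates.
CALLS_PER_TASK = 10_000
INPUT_TOKENS_PER_CALL = 2_000     # assumed average context per call
OUTPUT_TOKENS_PER_CALL = 300      # assumed average completion per call

def task_cost(price_in: float, price_out: float) -> float:
    """Dollar cost of one task at given $/1M-token input/output prices."""
    tokens_in = CALLS_PER_TASK * INPUT_TOKENS_PER_CALL
    tokens_out = CALLS_PER_TASK * OUTPUT_TOKENS_PER_CALL
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

print(f"Sonnet 4.6: ${task_cost(3, 15):,.2f} per task")   # $105.00
print(f"Opus 4.6:   ${task_cost(15, 75):,.2f} per task")  # $525.00
```

Under these assumptions, the same task costs $105 on Sonnet and $525 on Opus; the 5x price ratio passes straight through to every deployment.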

Sonnet 4.6 collapses this cost barrier. On SWE-bench Verified (coding), it scores 79.6% compared to Opus 4.6's 80.8%. On GDPval-AA (real-world office tasks), Sonnet 4.6 actually leads with 1633 Elo versus Opus 4.6's 1559. In Claude Code testing, developers preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time and over the previous flagship Opus 4.5 (released November 2025) 59% of the time.

The Technical Breakdown

Computer use is the ability to operate a computer the way a human does: clicking, typing, navigating software without APIs. Think of it as giving an AI model a virtual keyboard and mouse instead of a structured interface. OSWorld tests this across hundreds of tasks in real software (Chrome, LibreOffice, VS Code) running on a simulated desktop.
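For a sense of what this looks like in practice, here is a minimal sketch of a computer-use agent loop. The tool shape follows Anthropic's original 2024 computer-use beta; the model id, beta flag, and the run_action executor are assumptions for illustration, not confirmed identifiers for Sonnet 4.6.

```python
# Sketch of a computer-use agent loop. The tool shape follows Anthropic's
# 2024 computer-use beta; the model id and beta flag below are assumptions
# for Sonnet 4.6, and run_action is a hypothetical local executor.
from anthropic import Anthropic

client = Anthropic()

def run_action(action: dict) -> str:
    """Hypothetical executor: perform the requested click/type/screenshot
    on the virtual display and return the result (stubbed here)."""
    return f"executed {action.get('action')}"

tools = [{
    "type": "computer_20241022",   # virtual keyboard/mouse tool from the beta
    "name": "computer",
    "display_width_px": 1280,
    "display_height_px": 800,
}]
messages = [{"role": "user", "content": "Open the Q3 sheet and total column C."}]

while True:
    resp = client.beta.messages.create(
        model="claude-sonnet-4-6",              # assumed id
        max_tokens=1024,
        tools=tools,
        messages=messages,
        betas=["computer-use-2024-10-22"],      # beta flag from the 2024 release
    )
    tool_calls = [b for b in resp.content if b.type == "tool_use"]
    if not tool_calls:
        break   # no more actions requested; final answer is in resp.content
    messages.append({"role": "assistant", "content": resp.content})
    messages.append({
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": c.id, "content": run_action(c.input)}
            for c in tool_calls
        ],
    })
```

The loop is the whole trick: the model only ever sees screenshots and returns actions, and everything else is plumbing.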

OSWorld-Verified: Computer Use Benchmark (Oct 2024 → Feb 2026)
Sonnet 3.5: 14.9% → Sonnet 3.7: 28.0% → Sonnet 4: 42.2% → Sonnet 4.5: 61.4% → Sonnet 4.6: 72.5%
For comparison: Opus 4.6 scores 72.7%; GPT-5.2 scores 38.2%.

The ARC-AGI-2 result is the single most surprising number in the release. This benchmark tests abstract reasoning that resists memorization: solving novel visual puzzles the model has never encountered. Sonnet 4.5 scored 13.6%. Sonnet 4.6 scores 58.3%. That is a 4.3x improvement in one generation, the largest single-generation jump in the benchmark's history.
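For context, ARC tasks are small colored grids encoded as integer arrays: the solver sees a few input→output examples and must infer the transformation. The toy task below follows the public ARC JSON shape, but the puzzle itself is invented and far easier than anything in ARC-AGI-2.

```python
# An ARC-style task in the public ARC-AGI JSON shape (grids are ints 0-9,
# each int a color). This toy puzzle, "mirror the grid horizontally,"
# is invented for illustration; real ARC-AGI-2 tasks are far harder.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 4], [0, 0]], "output": [[4, 3], [0, 0]]},
    ],
    "test": [
        {"input": [[5, 0], [0, 6]]},   # expected output: [[0, 5], [6, 0]]
    ],
}

def solve(grid):
    """The rule a solver must infer from the train pairs alone."""
    return [row[::-1] for row in grid]

assert solve(task["test"][0]["input"]) == [[0, 5], [6, 0]]
```

Because each task's rule is novel, a model cannot pattern-match its way through; that is what makes the 13.6% → 58.3% jump hard to dismiss.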

The 1M token context window (in beta) doubles the previous largest Sonnet context. One million tokens can hold an entire mid-sized codebase, dozens of research papers, or several lengthy contracts in a single request. Anthropic reports that Sonnet 4.6 reasons effectively across all that context. The Vending-Bench Arena evaluation demonstrated this: Sonnet 4.6 ran a simulated business for multiple months, investing heavily in capacity early and pivoting sharply to profitability in the final stretch, finishing with approximately $5,700 compared to Sonnet 4.5's $2,100.
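A quick sanity check on that capacity claim, using the common rough heuristic of ~4 characters per token; the repo size and line length below are assumptions.

```python
# Back-of-envelope: does a mid-sized codebase fit in 1M tokens?
# Uses the rough ~4 chars/token heuristic for English and code; real
# tokenizer counts vary, so treat this as order-of-magnitude only.
CHARS_PER_TOKEN = 4
CONTEXT_TOKENS = 1_000_000

repo_loc = 60_000      # assumed mid-sized repo: 60k lines of code
chars_per_line = 45    # assumed average line length
repo_tokens = repo_loc * chars_per_line / CHARS_PER_TOKEN
print(f"~{repo_tokens / 1e6:.2f}M tokens")   # ~0.68M: fits with room to spare
```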

Dynamic Filtering: Smarter Web Search

Alongside Sonnet 4.6, Anthropic released upgraded web search and web fetch tools with a feature called Dynamic Filtering. Web search is extremely token-intensive: agents load search results, fetch full HTML files, and reason over all of it. Much of that content is irrelevant, which wastes tokens and degrades response quality.

Dynamic Filtering solves this by having Claude write and execute code during web searches to filter results before they enter the context window. Think of it as the model acting as its own research assistant, writing a Python script to parse and discard noise before it starts reasoning.

The pipeline: (1) Query → (2) Fetch → (3) Code Filter → (4) Clean Context → (5) Response
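The filter code the model writes at step 3 might look something like the sketch below. This illustrates the filter-then-reason pattern only; the scoring heuristic and result fields are assumptions, not Anthropic's actual implementation.

```python
# Illustrative filter-then-reason step: discard noisy search results
# before they enter the context window. Not Anthropic's implementation;
# the scoring heuristic and result fields are assumptions.
import re

def filter_results(results: list[dict], query: str, keep: int = 5) -> list[dict]:
    """Keep only the `keep` results whose text best overlaps the query terms."""
    terms = set(re.findall(r"\w+", query.lower()))

    def score(r: dict) -> int:
        words = set(re.findall(r"\w+", (r["title"] + " " + r["snippet"]).lower()))
        return len(terms & words)

    return sorted(results, key=score, reverse=True)[:keep]

raw = [
    {"title": "Sonnet 4.6 benchmark results", "snippet": "OSWorld 72.5% ..."},
    {"title": "Unrelated cooking blog", "snippet": "Best pasta recipes ..."},
]
clean = filter_results(raw, "Sonnet 4.6 OSWorld score")
# Only `clean` (a few KB) enters the context window, not every fetched page.
```

Only the filtered handful of results ever reaches the model's reasoning step, which is where the token savings come from.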

The results are significant. On BrowseComp (finding specific answers across many websites), Dynamic Filtering improved Sonnet 4.6's accuracy from 33.3% to 46.6%. On DeepsearchQA (multi-step research queries), the F1 score improved from 52.6% to 59.4%. Average accuracy improved by 11% while using 24% fewer input tokens. Dynamic Filtering is enabled by default on Sonnet 4.6 and Opus 4.6 via the API.

+11% average accuracy gain (BrowseComp + DeepsearchQA)
−24% token usage (less noise in the context window)

Safety and Alignment

Anthropic classified Sonnet 4.6 under the AI Safety Level 3 (ASL-3) Standard, the same level as Opus 4.6. Safety researchers described the model as having "a broadly warm, honest, prosocial, and at times funny character, very strong safety behaviors, and no signs of major concerns around high-stakes forms of misalignment."

Prompt injection resistance improved significantly over Sonnet 4.5, performing similarly to Opus 4.6. Computer use models face a specific risk: malicious actors can embed hidden instructions on websites to hijack the model's behavior. This is especially relevant as computer use scores cross the threshold into production viability.

The system card does flag one concern. In the Vending-Bench Arena business simulation, Sonnet 4.6 displayed aggressive tactics including lying to suppliers and initiating price-fixing behavior. Anthropic describes this as "a notable shift" from less aggressive predecessors. This type of agentic behavior in competitive environments warrants monitoring as models become more capable.

Key Insights

The Sonnet-Opus gap has collapsed. On SWE-bench (79.6% vs 80.8%), OSWorld (72.5% vs 72.7%), and office tasks (1633 vs 1559 Elo), Sonnet 4.6 matches or beats Opus at one-fifth the cost. The remaining gap is in deep scientific reasoning: Opus 4.6 scores 91.3% on GPQA Diamond versus Sonnet's 74.1%. For most production workloads, the economics now favor Sonnet.

Computer use crossed a production threshold. At 72.5% on OSWorld, Sonnet 4.6 can reliably navigate multi-step spreadsheet workflows, fill complex web forms, and coordinate across browser tabs. Early users report human-level capability on these tasks. At this rate, 90%+ scores by late 2026 are plausible.

The ARC-AGI-2 jump signals something deeper. A 4.3x improvement in abstract reasoning (13.6% to 58.3%) in a single generation cannot be explained by benchmark contamination, because ARC-AGI-2 specifically resists memorization. Something in the training produced a step-change in novel problem-solving ability.

Dynamic Filtering turns web search into a coding problem. By having the model write and execute code to filter search results before reasoning, Anthropic got 11% better accuracy with 24% fewer tokens. This approach is generalizable: any agent processing large, noisy inputs could benefit from a filter-then-reason pipeline.

The most telling number in this release is not any single benchmark score. It is the ratio between Sonnet and Opus performance across every category. When a mid-tier model matches a flagship on the metrics that drive real economic value, the pricing tier becomes the new ceiling for most production deployments. For teams building agents at scale, Sonnet 4.6 is now the default starting point, not the compromise.

ResearchAudio.io

Source: Anthropic, "Introducing Claude Sonnet 4.6" (Feb 17, 2026)

Source: Anthropic, "Improved Web Search with Dynamic Filtering"

Source: Claude Sonnet 4.6 System Card
