Here’s how I use Attio to run my day.

Attio is the AI CRM with conversational AI built directly into your workspace. Every morning, Ask Attio handles my prep:

  • Surfaces insights from calls and conversations across my entire CRM

  • Updates records and creates tasks without manual entry

  • Answers questions about deals, accounts, and customer signals that used to take hours to find

All in seconds. No searching, no switching tabs, no manual updates.

Ready to scale faster?

ResearchAudio.io

Inside Claude Opus 4.6: 1M Tokens, Agent Teams, and 500 Zero-Days

Anthropic's new flagship outperforms GPT-5.2 by 144 Elo on knowledge work. Here's the technical breakdown.

Anthropic shipped Claude Opus 4.6 today. Before the public launch, the company's red team pointed the model at well-tested open-source codebases and let it hunt for bugs. No special instructions. No vulnerability-specific knowledge. Opus 4.6 found over 500 previously unknown zero-day vulnerabilities, each validated by Anthropic's team or external security researchers. Some of these flaws had gone undetected for decades.

The vulnerability discovery is notable, but the broader story here is about what Opus 4.6 changes across context handling, agentic workflows, and enterprise knowledge work.

At a glance:

  • 1M token context window
  • 128K output tokens
  • 65.4% on Terminal-Bench 2.0
  • 68.8% on ARC AGI 2

The 1M Token Context Window

Context windows define how much information a model can hold in memory during a single session. Think of it as the size of the desk you can spread your documents across. Previous Opus models were limited to 200K tokens. Opus 4.6 expands this to 1 million tokens in beta, enough to hold an entire codebase or hundreds of pages of documents at once.
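If you want to know how much of that window a given project would actually consume, the Anthropic Python SDK exposes a token-counting endpoint. A minimal sketch, assuming the SDK is installed and ANTHROPIC_API_KEY is set; the file paths are placeholders and the model string is the one listed in the pricing section below:

```python
# Sketch: estimate how much of the 1M-token window a set of files would use.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

# Placeholder paths: concatenate whatever you want the model to hold in context.
combined = "\n\n".join(Path(p).read_text() for p in ["src/app.py", "src/db.py", "docs/spec.md"])

# count_tokens returns the input token count for a prospective request.
count = client.messages.count_tokens(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": combined}],
)

print(f"{count.input_tokens:,} input tokens "
      f"({count.input_tokens / 1_000_000:.1%} of the 1M window)")
```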

The raw number matters less than what the model does with it. On MRCR v2, a needle-in-a-haystack benchmark that tests whether a model can find and reason over specific facts buried in massive prompts, Opus 4.6 scores 76%. Claude Sonnet 4.5 scores 18.5% on the same test. That gap represents a qualitative shift. The model can now actually use the context it holds, rather than losing track of information buried deep in a long session.

Anthropic also introduced context compaction, a beta feature for API users. When context fills up during a long-running task, Claude can automatically summarize older portions to make room for new information. This means the model can sustain tasks far longer without crashing into its context limits.
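The post doesn't spell out the compaction parameters, so here is a client-side sketch of the same idea rather than the beta API itself: once a rough token estimate crosses a budget, the older transcript is summarized into a single turn and the task continues from there.

```python
# Client-side sketch of the compaction idea (not the beta API parameters).
# Token counts use a crude chars/4 estimate to keep the example self-contained;
# the budget and the summarization prompt are illustrative.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-6"
BUDGET_TOKENS = 150_000  # compact once the transcript estimate crosses this


def rough_tokens(messages) -> int:
    return sum(len(m["content"]) for m in messages) // 4


def send(messages, new_user_message):
    """Append a user turn, summarizing the transcript first if it has grown too large."""
    if rough_tokens(messages) > BUDGET_TOKENS:
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        summary = client.messages.create(
            model=MODEL,
            max_tokens=2_000,
            messages=[{"role": "user",
                       "content": "Summarize this transcript, keeping every fact, "
                                  "decision, and open question needed to continue:\n\n"
                                  + transcript}],
        )
        messages = [
            {"role": "user", "content": "Summary of the work so far:\n" + summary.content[0].text},
            {"role": "assistant", "content": "Understood. Continuing from that summary."},
        ]
    messages = messages + [{"role": "user", "content": new_user_message}]
    reply = client.messages.create(model=MODEL, max_tokens=4_000, messages=messages)
    return messages + [{"role": "assistant", "content": reply.content[0].text}]
```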

Key Insight: The 1M context window combined with 76% MRCR v2 accuracy addresses "context rot," where models lose performance over long conversations. For enterprise use cases like codebase audits, legal document review, and financial analysis, this means the model can hold the full picture in memory while still locating specific details.

Benchmark Performance: Where Opus 4.6 Leads

Benchmarks tell part of the story. Here's where Opus 4.6 places relative to its predecessor and competitors, according to Anthropic's published evaluations.

Benchmark                  Opus 4.6    Opus 4.5    GPT-5.2
GDPval-AA (Elo)            1,606       1,416       1,462
Terminal-Bench 2.0         65.4%       59.8%       -
ARC AGI 2                  68.8%       37.6%       54.2%
OSWorld (Computer Use)     72.7%       66.3%       -
MRCR v2 (8-needle)         76%         -           -

The GDPval-AA benchmark is worth examining closely. It measures performance on economically valuable knowledge work across 44 professional occupations, including finance and legal domains. Opus 4.6 outperforms GPT-5.2 by approximately 144 Elo points, which translates to obtaining a higher score roughly 70% of the time. It beats its own predecessor, Opus 4.5, by 190 Elo points.
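That 70% figure falls out of the standard Elo expected-score formula, which you can check in a few lines using only the Elo gaps quoted above:

```python
# Expected score under the Elo model: E = 1 / (1 + 10 ** (-diff / 400)).
def elo_expected_score(diff: float) -> float:
    return 1 / (1 + 10 ** (-diff / 400))

print(f"{elo_expected_score(144):.1%}")  # vs GPT-5.2:  ~69.6%
print(f"{elo_expected_score(190):.1%}")  # vs Opus 4.5: ~74.9%
```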

The ARC AGI 2 result stands out for a different reason. This benchmark tests problems that are easy for humans but hard for AI. Opus 4.5 scored 37.6%. Gemini 3 Pro scored 45.1%. GPT-5.2 scored 54.2%. Opus 4.6 jumped to 68.8%, nearly doubling its predecessor's score.

Worth noting: Opus 4.6 showed small regressions on SWE-bench (the standard version) and the MCP Atlas benchmark for tool usage, even while performing well on similar benchmarks like Terminal-Bench 2.0 and τ2-bench. Benchmarks can tell conflicting stories.

Agent Teams: Parallel AI Workers in Claude Code

Until now, Claude Code ran one agent at a time. If you needed to review a large codebase, make backend changes, and update the frontend, each task happened sequentially. Agent Teams changes this. Multiple agents can now work in parallel, splitting tasks and coordinating autonomously.

Developer task: "Review codebase, fix auth bug, update API docs"
                            ↓
Claude Code coordinator: splits work, assigns agents, merges results
         ↓                          ↓                          ↓
Agent 1: Code review        Agent 2: Bug fix          Agent 3: API docs

Agent Teams in Claude Code: parallel agents coordinate autonomously on split tasks

Anthropic describes Agent Teams as particularly useful for read-heavy work like codebase reviews. One agent can scan the frontend, another the API layer, and a third the migration scripts, each owning its piece and coordinating directly with the others. This is currently available as a research preview in Claude Code.
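Anthropic hasn't documented the internals of Agent Teams here, so the sketch below is just the fan-out-and-merge pattern it describes, written against the plain Messages API: one call per slice in parallel, then a final call to merge the reports. The slice names and prompts are illustrative.

```python
# Conceptual sketch of the fan-out/merge pattern (not Claude Code's implementation).
from concurrent.futures import ThreadPoolExecutor

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-6"

# Independent slices of a codebase review; names and instructions are illustrative.
slices = {
    "frontend": "Review the frontend code for bugs and accessibility issues.",
    "api layer": "Review the API layer for auth and input-validation problems.",
    "migrations": "Review the migration scripts for destructive or irreversible steps.",
}


def run_agent(name: str, task: str) -> str:
    reply = client.messages.create(
        model=MODEL,
        max_tokens=2_000,
        messages=[{"role": "user", "content": task}],
    )
    return f"## {name}\n{reply.content[0].text}"


# Fan out: one worker per slice, then merge the reports in a final pass.
with ThreadPoolExecutor(max_workers=len(slices)) as pool:
    reports = list(pool.map(run_agent, slices.keys(), slices.values()))

merged = client.messages.create(
    model=MODEL,
    max_tokens=2_000,
    messages=[{"role": "user",
               "content": "Merge these review reports into one prioritized summary:\n\n"
                          + "\n\n".join(reports)}],
)
print(merged.content[0].text)
```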

Early adopter data supports the value. Rakuten reported that Opus 4.6 autonomously closed 13 GitHub issues and assigned 12 issues to the correct team members in a single day, managing a roughly 50-person organization across 6 repositories. The model handled both product and organizational decisions while knowing when to escalate to a human.

Adaptive Thinking and Effort Controls

Previous Claude models gave developers a binary choice: extended thinking on or off. Opus 4.6 introduces adaptive thinking, where the model reads contextual clues to determine how much reasoning effort a task requires. Simple questions get fast responses. Complex multi-step problems get deeper analysis.

Developers also get explicit control through four effort levels: low, medium, high (the default), and max. Low effort skips thinking entirely for simple tasks. Max effort uses extended thinking with no constraints on depth. This gives developers a direct knob to trade off between intelligence, speed, and cost.

  • Low: skips thinking. Fast, low cost.
  • Medium: moderate thinking; may skip it for simple queries.
  • High (default): always thinks. Deep reasoning.
  • Max: no constraints on depth. Full power.
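I haven't confirmed the exact request field that carries the effort setting, so treat the parameter in this sketch as an assumption about the API shape; the point is that one model can serve both a cheap chatbot path and an expensive review path by changing a single value.

```python
# Sketch only: the "effort" field name and its values are assumptions inferred
# from the four levels above, not a verified API signature.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-6"


def ask(prompt: str, effort: str = "high") -> str:
    reply = client.messages.create(
        model=MODEL,
        max_tokens=4_000,
        # Passed via extra_body because the field is hypothetical here:
        # "low" | "medium" | "high" | "max".
        extra_body={"effort": effort},
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.content[0].text


# A support chatbot can run at low effort...
print(ask("What are your support hours?", effort="low"))
# ...while a code-review pipeline runs the same model at max effort
# ("change.diff" is a placeholder path).
print(ask("Review this diff for security regressions:\n" + open("change.diff").read(), effort="max"))
```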

500 Zero-Days Found Before Launch

Anthropic's frontier red team placed Opus 4.6 inside a sandboxed virtual machine with access to standard utilities like Python, along with vulnerability analysis tools such as debuggers and fuzzers. The model received no special instructions and no pre-loaded knowledge about specific vulnerabilities.

The model found over 500 previously unknown zero-day vulnerabilities in open-source code. Each was validated by either an Anthropic team member or an external security researcher. The flaws ranged from denial-of-service conditions to memory corruption vulnerabilities across widely used projects.

Three specific examples from the blog post: a flaw in Ghostscript (a PDF and PostScript processing utility) that could crash systems, and buffer overflow bugs in OpenSC (smart card data processing) and CGIF (GIF file processing).

How it works differently from fuzzers: Traditional fuzzers throw massive amounts of random inputs at code to see what breaks. Opus 4.6 reads and reasons about code, looking at past fixes to find similar unaddressed bugs, spotting patterns that tend to cause problems, or understanding logic deeply enough to know exactly what input would break it. Anthropic noted that some of these vulnerabilities existed in codebases that had fuzzers running against them for years, accumulating millions of hours of CPU time.
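A toy illustration of the difference, using an invented Python parser rather than any of the real targets: random fuzzing almost never gets past the header check, while reading the code tells you exactly which length byte to lie about.

```python
# Toy illustration, not a real vulnerability: a parser that trusts a declared
# length field. Random inputs almost never pass the magic-bytes check, so the
# fuzz loop rarely reaches the buggy line; reasoning about the code does.
import os
import random
import struct

BUF_SIZE = 16


def parse_record(data: bytes) -> bytes:
    """Parse 'REC' + 1-byte declared length + payload into a fixed 16-byte buffer."""
    if not data.startswith(b"REC") or len(data) < 4:
        raise ValueError("not a record")
    declared_len = data[3]
    payload = data[4:4 + declared_len]
    buf = bytearray(BUF_SIZE)
    # Bug: trusts declared_len instead of checking it against BUF_SIZE.
    struct.pack_into(f"{declared_len}s", buf, 0, payload)
    return bytes(buf)


# Random fuzzing: 100,000 random blobs, essentially none survive the header check.
crashes = 0
for _ in range(100_000):
    blob = os.urandom(random.randint(0, 32))
    try:
        parse_record(blob)
    except ValueError:
        pass
    except struct.error:
        crashes += 1
print("crashes found by random fuzzing:", crashes)  # usually 0

# Reading the code: keep the header valid and lie about the length.
try:
    parse_record(b"REC" + bytes([255]) + b"A" * 255)
except struct.error as exc:
    print("targeted input crashes the parser:", exc)
```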

To manage the dual-use risk, Anthropic developed six new cybersecurity probes that measure model activations during response generation to detect potential misuse. The company also stated it may institute real-time traffic blocking for requests it detects as malicious.

Enterprise Integrations: Excel and PowerPoint

Anthropic upgraded Claude in Excel to handle longer-running, more complex tasks and multi-step changes in a single pass. The model can now interpret messy, unstructured spreadsheets without explicit explanations of the data layout.

Claude in PowerPoint launches as a research preview for Max, Team, and Enterprise users. The integration reads slide masters, fonts, and layouts from existing templates, generating presentations that match corporate branding. Anthropic described this as technically challenging because, unlike data-driven Excel, PowerPoint requires aesthetic judgments about design elements like colors and text placement.

These Office integrations contributed to a notable market reaction. Legal and financial software stocks have dropped this week, with analysts attributing part of the selloff to concerns that Claude's Cowork capabilities could replace specialized enterprise software packages.

Safety Profile

According to Anthropic's system card, Opus 4.6 shows low rates of misaligned behavior across safety evaluations, including deception, sycophancy, and encouragement of user delusions. The company states these intelligence gains did not come at the cost of safety alignment. Opus 4.6 also exhibits fewer unnecessary refusals compared to prior Claude models.

On the life sciences front, Opus 4.6 performs nearly 2x better than Opus 4.5 on computational biology, structural biology, organic chemistry, and phylogenetics tests. This is an area where Anthropic applies additional scrutiny given the dual-use potential.

Pricing and Availability

Pricing remains unchanged at 5 dollars per million input tokens and 25 dollars per million output tokens. The model is available today on claude.ai, through the Anthropic API (model string: claude-opus-4-6), and across all major cloud platforms including AWS Bedrock, Google Cloud Vertex AI, Microsoft Foundry, and GitHub Copilot.

For prompts exceeding 200K tokens, a premium tier applies: 10 dollars per million input tokens and 37.50 dollars per million output tokens. Anthropic also added an option for US-based workload processing at a 10% higher rate.
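Putting the published rates together, a quick per-request cost estimate looks like this; the tier cutoff and the 10% US-processing uplift are as stated above, and any billing details beyond that are not covered in this post.

```python
# Cost estimate from the published rates: $5/$25 per million tokens standard,
# $10/$37.50 per million for prompts over 200K tokens, +10% for US-based processing.
def request_cost(input_tokens: int, output_tokens: int, us_processing: bool = False) -> float:
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50   # premium long-context tier
    else:
        in_rate, out_rate = 5.00, 25.00    # standard tier
    cost = input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate
    return cost * 1.10 if us_processing else cost


# A 500K-token codebase audit producing a 20K-token report:
print(f"${request_cost(500_000, 20_000):.2f}")  # $5.75
```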

Technical Takeaways

Context quality matters more than context size. The jump from 18.5% to 76% on MRCR v2 shows that a large context window is only as valuable as the model's ability to reason across its full contents. Previous models had long contexts but lost track of information. Opus 4.6 addresses this with what appear to be better mechanisms for maintaining focus across the full window.

AI vulnerability discovery complements, not replaces, traditional security tools. Opus 4.6 found bugs that fuzzers missed because it reasons about code semantics rather than just testing random inputs. The combination of both approaches is stronger than either alone.

Effort controls shift the cost-quality tradeoff to developers. The four-level effort parameter lets teams use the same model across different price and latency requirements. A chatbot can run at low effort while a code review pipeline runs at max. This is more granular than the previous binary thinking toggle.

Opus 4.6 arrives the same day OpenAI launched GPT-5.3-Codex and the OpenAI Frontier enterprise platform. The frontier model competition continues to accelerate, with each lab pushing toward more autonomous, long-running AI agents capable of complex professional work.

ResearchAudio.io - AI research, explained visually

Sources: Anthropic Blog | Anthropic Cybersecurity Blog | System Card

Keep Reading