ResearchAudio.io
Anthropic Shipped a Worker. Google Shipped a Director.
Two frontier models, 24 hours apart, betting on opposite jobs.
|
|
64.3%: Opus 4.7 SWE-bench Pro
1,211: Gemini TTS Elo
200+: Audio tags shipped
|
|
The Lead
|
|
On April 15, Google shipped Gemini 3.1 Flash TTS. One day later, Anthropic shipped Claude Opus 4.7. Both are frontier-grade and priced for production. And they are solving completely different problems.
|
|
Opus 4.7 is a coding and agent model. It scores 64.3% on SWE-bench Pro, up from 53.4% on Opus 4.6, and leads GPT-5.4 (57.7%) and Gemini 3.1 Pro (54.2%) on the same test. Anthropic introduced a new xhigh effort level between high and max, letting you pay for deeper reasoning exactly when the task demands it. Pricing stays at $5 in, $25 out per million tokens.
|
|
Gemini 3.1 Flash TTS is a speech model with 200+ natural-language audio tags, 70+ languages, and native multi-speaker dialogue. It hit an Elo of 1,211 on the Artificial Analysis TTS leaderboard, second overall, with a "most attractive quadrant" placement for quality-per-dollar. Pricing: $1 per million input tokens, $20 per million audio output tokens, with batch mode cutting both rates in half.
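To make those rates concrete, here is a minimal cost sketch. It uses only the prices quoted above; the function name and the example token counts are hypothetical, and how text or audio length maps to token counts is not covered here.

```python
# Rates quoted above, in USD per million tokens.
INPUT_USD_PER_M = 1.00
AUDIO_OUTPUT_USD_PER_M = 20.00
BATCH_MULTIPLIER = 0.5  # batch mode cuts both rates in half


def tts_cost_usd(input_tokens: int, audio_output_tokens: int, batch: bool = False) -> float:
    """Estimate the cost of one TTS job from its token counts."""
    multiplier = BATCH_MULTIPLIER if batch else 1.0
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_M
    output_cost = audio_output_tokens / 1_000_000 * AUDIO_OUTPUT_USD_PER_M
    return (input_cost + output_cost) * multiplier


# Hypothetical job: 2,000 input tokens and 50,000 audio output tokens.
standard = tts_cost_usd(2_000, 50_000)
batched = tts_cost_usd(2_000, 50_000, batch=True)
print(f"standard: ${standard:.4f}, batch: ${batched:.4f}")
```

At these rates the audio output side dominates the bill, so batch mode matters most for long-form jobs that can wait.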
|
|
Here is the part worth sitting with. Anthropic is optimizing for models that do work with less supervision. Google is optimizing for models that perform on direction. These are two different bets about where the next year of AI product value actually comes from.
|
|
Side by side
Two releases, two different problems
|
|
OPUS 4.7
Model as worker
Runs long-horizon tasks with less oversight. Verifies its own outputs. Takes instructions literally.
Best at: Agentic coding, tool use, CI/CD
Default Claude Code effort: xhigh
|
|
GEMINI 3.1 TTS
Model as performer
Takes stage directions in natural language. Controls pacing, emotion, accent per line.
Best at: Podcasts, audiobooks, voice agents
Output watermarked with SynthID
|
|
|
Sources: Anthropic (Apr 16, 2026), Google DeepMind (Apr 15, 2026)
|
|
[Chart: SWE-bench Pro, real engineering tasks. Higher is better. Source: Anthropic launch post.]
|
|
|
What actually changed for builders
|
|
On Opus 4.7, two behavioral shifts will bite you if you migrate blindly. First, token usage goes up. The new tokenizer maps the same input to between 1.0x and 1.35x as many tokens, depending on content type, and the model thinks more at higher effort levels, especially on later turns in agent loops. Expect a token-usage uplift on production traces, even with identical prompts.
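One way to budget for the tokenizer change is to re-price an existing Opus 4.6 trace at both ends of the 1.0x to 1.35x range. The sketch below is a rough estimate under stated assumptions: it applies the multiplier to input and output tokens alike, it does not model the extra thinking at higher effort levels, and the function names and trace sizes are hypothetical.

```python
# Opus 4.7 rates quoted in this issue, in USD per million tokens.
INPUT_USD_PER_M = 5.00
OUTPUT_USD_PER_M = 25.00
TOKENIZER_UPLIFT_RANGE = (1.0, 1.35)  # same text, new tokenizer


def opus_cost_usd(input_tokens: float, output_tokens: float) -> float:
    """Cost of a trace at the quoted per-token rates."""
    return (input_tokens / 1_000_000 * INPUT_USD_PER_M
            + output_tokens / 1_000_000 * OUTPUT_USD_PER_M)


def cost_range_after_uplift(old_input_tokens: int, old_output_tokens: int) -> tuple[float, float]:
    """Return (low, high) cost after rescaling old token counts.

    Assumption: the uplift applies equally to input and output tokens.
    Extra reasoning tokens are NOT included.
    """
    low, high = TOKENIZER_UPLIFT_RANGE
    return (
        opus_cost_usd(old_input_tokens * low, old_output_tokens * low),
        opus_cost_usd(old_input_tokens * high, old_output_tokens * high),
    )


# Hypothetical 4.6 trace: 1M input tokens, 200k output tokens.
low_cost, high_cost = cost_range_after_uplift(1_000_000, 200_000)
print(f"budget between ${low_cost:.2f} and ${high_cost:.2f}")
```

Treat the upper bound as a floor for planning, not a ceiling, since deeper thinking on later agent turns adds tokens on top of the tokenizer effect.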
|
|
Second, Opus 4.7 takes instructions literally. Hex's CTO noted that low-effort 4.7 is roughly equivalent to medium-effort 4.6, which is a real speed gain you get for nothing. But prompts tuned for 4.6's looser interpretation often need re-tuning. The model does what you asked, not what it guessed you meant.
|
|
On Gemini TTS, the shift is bigger than it looks. Audio tags like [whispers], [excited], and [pause], embedded directly in the text, let you steer delivery line by line. Simon Willison called the prompting guide "surprising" because Google's own example is a full scene description, not a text string. You prompt TTS now instead of calling it.
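If you treat tags as a prompting surface, it helps to keep them as data rather than hand-typing brackets into strings. The sketch below only builds the tagged script text; it does not call any API. The `Line` type and `render_script` helper are hypothetical, and placing the tag at the start of each line is an assumption about layout, not something confirmed by Google's guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    text: str
    tag: Optional[str] = None  # e.g. "whispers", "excited", "pause"


def render_script(lines: list[Line]) -> str:
    """Render lines as text with inline [tag] markers, one line per entry."""
    rendered = []
    for line in lines:
        pieces = []
        if line.tag:
            pieces.append(f"[{line.tag}]")
        if line.text:
            pieces.append(line.text)
        rendered.append(" ".join(pieces))
    return "\n".join(rendered)


script = render_script([
    Line("Welcome back to the show.", tag="excited"),
    Line("", tag="pause"),
    Line("Nobody else knows this yet.", tag="whispers"),
])
print(script)
```

Keeping tags in a structure like this makes it easy to A/B different deliveries of the same copy without touching the words.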
|
|
The key insight: Frontier model progress just split into two tracks. One track makes the model more autonomous. The other track makes the model more controllable. You pick the track per product, not per provider.
|
|
Quick Hits
|
|
Vision got real on Opus 4.7. Images up to 2,576 pixels on the long edge, roughly 3.3x the pixel count of earlier Claude models. Screenshots, dense diagrams, and PDFs with small text finally stop losing information at the ingestion layer. OSWorld-Verified for computer use climbed to 78.0%, up 5.3 points from Opus 4.6.
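If your screenshots are larger than that, a client-side resize keeps you in control of what detail survives. The helper below is plain arithmetic with a hypothetical name; whether the API downsamples or rejects oversized images is not stated here, so treat the 2,576 figure as the only grounded input.

```python
MAX_LONG_EDGE_PX = 2576  # long-edge limit quoted above


def fit_long_edge(width: int, height: int, max_long_edge: int = MAX_LONG_EDGE_PX) -> tuple[int, int]:
    """Return dimensions scaled down so the longer side fits the limit.

    Images already within the limit are returned unchanged (no upscaling).
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    # Round to whole pixels and never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_long_edge(5152, 2898))  # oversized landscape screenshot
print(fit_long_edge(1920, 1080))  # already within the limit
```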
|
|
Anthropic held back a stronger model. Claude Mythos Preview, their cybersecurity-focused frontier model, stays behind Project Glasswing with limited access. Opus 4.7 was trained with differential reduction of cyber capabilities, plus automatic safeguards blocking high-risk cybersecurity requests. Safety is now part of the product lineup, not a post-training filter.
|
|
Gemini TTS ships with SynthID by default. Every audio output gets an imperceptible watermark for AI-generated-content detection. Worth knowing if you build products that mix human and synthetic narration, because downstream detectors can flag your output whether or not you disclose it.
|
|
Pricing stayed flat on Opus. $5 input, $25 output per million tokens, same as Opus 4.6. Box reported 56% fewer model calls, 50% fewer tool calls, and 24% faster response on their workload. The effective cost per completed task dropped even though per-token cost did not.
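A back-of-envelope check shows why fewer calls can outweigh the tokenizer uplift. This is a sketch under a strong assumption: tokens per call stay the same apart from the 1.0x to 1.35x tokenizer multiplier mentioned in this issue. Box's real token mix is not public here, and the function name is hypothetical.

```python
def relative_token_spend(call_reduction: float, tokens_per_call_multiplier: float) -> float:
    """Token spend per task relative to the old baseline (1.0 = unchanged)."""
    return (1.0 - call_reduction) * tokens_per_call_multiplier


CALL_REDUCTION = 0.56  # "56% fewer model calls" reported by Box

best_case = relative_token_spend(CALL_REDUCTION, 1.0)    # no tokenizer uplift
worst_case = relative_token_spend(CALL_REDUCTION, 1.35)  # full tokenizer uplift

print(f"relative spend per task: {best_case:.1%} to {worst_case:.1%} of baseline")
```

Even at the top of the uplift range, spend per task under these assumptions lands well below the old baseline, which is the mechanism behind a flat per-token price and a lower per-task cost.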
|
The Take
|
|
If you are shipping with frontier models this quarter, the action is concrete. For coding-heavy products and agent loops, migrate your hardest tasks to Opus 4.7 at xhigh effort, and budget for the tokenizer uplift while retesting prompts for literal interpretation. For voice products, rebuild TTS prompts from scratch and treat audio tags as a new prompting surface. The teams that adapt fastest stop asking "which model is smarter" and start asking "which model is the right tool for this job."
|
|
Here is the part nobody is saying out loud. Both labs shipped in the same week, which means the next six months of product differentiation will come from how you compose these models, not which one you picked. The engineer who knows when to hand work to Opus 4.7 and when to hand direction to Gemini TTS has a real edge over the one who picks a provider and commits.
|
The Open Question
|
|
If a single prompt can now direct a voice the way a director directs an actor, how long before we prompt agent behavior the same way? Not "run these tools." Imagine instead: [cautious] verify the schema, [confident] ship the migration. Gemini just shipped that interface for speech, and someone is going to ship it for agents.
|
|
|
|
Next issue: The quiet result buried in the Opus 4.7 system card that suggests Anthropic's alignment evaluations are converging with its capability evaluations, and what that means for the release cadence of frontier models.
|
Sources: Anthropic launch post, Google DeepMind launch post, Vellum benchmark breakdown.
ResearchAudio.io · AI research for engineers shipping with frontier models.