Sponsored by

Scale your customer experience across all channels

Most teams scale customer service by hiring, but with ElevenAgents, real-time voice and chat agents handle high volumes of interactions and resolve them faster.

When routine requests come in, they get answered in the moment, in 70+ languages. However, the complex ones route to humans with full context, so your team focuses only on the conversations that need judgment. The experience holds up because the agents sound human, not robotic. And with Expressive Mode, they read tone and adapt - de-escalating, reassuring, and guiding each conversation to a clear resolution.

You can train every agent on your knowledge base, SOPs, and policies, deploy across voice, chat, and WhatsApp, then keep improving with real-time CSAT tracking, A/B testing, and Guardrails.

The outcome is the one leaders care about: higher resolution rates, higher CSAT, and support that scales with demand instead of headcount. Pricing is transparent and flat at $0.08 per minute, with no add-ons to work around.

Get started free

ResearchAudio.io

Sakana Built a Model That Beats the Models Inside It

It scored 95.5 on GPQA-Diamond without solving a single question itself.

Sakana AI's new model never solves your task. It decides which other model should. Fugu is a compact orchestrator trained to read a query, pick the right frontier model from a pool that includes GPT-5.5, Claude Opus 4.8, and Gemini 3.1 Pro, and combine their work into one answer.

On GPQA-Diamond it scored 95.5, ahead of every model it routes to, and ahead of two Anthropic models (Mythos Preview at 94.6 and Fable 5 at 92.6) that are not publicly accessible. The orchestrator wrote none of those answers. It just learned who to ask, and when.

Why route instead of scale

Frontier models have started to specialize. The report notes that GPT-series models often lead on mathematical reasoning and planning, Opus-series models specialize in software engineering and cybersecurity, and Gemini is strong on scientific recall. No single model leads everywhere, even within one domain.

Teams already combine models with multi-agent frameworks, but those rely on hand-designed routing rules and fixed roles that you build and maintain. Earlier learned routers mostly map a query to a single best model in one shot. The gap Sakana targets is an orchestrator that adapts per query and per step, learned rather than scripted, and exposed as a single model.

The timing made the point for them. Days before the June 22 launch, an export-control directive in the United States cut access to Anthropic's Fable and Mythos models. Sakana frames orchestration over a swappable pool as a hedge against depending on any single vendor.

Two variants, one API

Sakana ships two models behind one OpenAI-compatible endpoint. Fugu is latency-aware: it selects a single worker per query, so its response time is close to a direct call to a frontier model. Fugu-Ultra prioritizes answer quality, composing workflows of up to five steps over a larger worker pool, trading latency for depth on the hardest tasks.

Here is how Fugu makes a choice. A lightweight selection head sits on top of a pre-trained backbone. Think of it as a dispatcher reading the room before anyone speaks. It takes a hidden state from the backbone and outputs one score per worker model, then sends the query to the highest-scoring worker. The design choice that matters is that it reads the orchestrator's internal logits, not generated text, so it skips the expensive autoregressive decoding step entirely. A very small parameter set is trained: the selection head plus the singular-value scales of particular weight matrices (a method called singular-value fine-tuning). Press coverage puts the backbone near 7 billion parameters, though the report itself states just that the trainable set is extremely small.

Training Fugu runs in two stages. First, supervised fine-tuning on single-step tasks: run every worker several times on each question, score the outputs against ground truth, convert those scores into a soft distribution over workers with a softmax, and train the head to match that distribution with a KL objective. Learning a soft distribution rather than a single best label keeps selection robust when several workers are similarly capable. Second, evolutionary search (sep-CMA-ES) on real multi-turn coding trajectories collected from Claude Code, Codex, and OpenCode, optimizing end-to-end task completion directly rather than per-step labels.

Fugu-Ultra coordinates differently. It writes a full agentic workflow in natural language: a list of subtasks, the worker assigned to each, and an access list controlling which earlier outputs each worker can see. It is trained with GRPO, rewarding well-formed workflows whose final output is correct. That lets it build structures ranging from best-of-N to trees, and, unusually, choose a different aggregator per task. On a trivia-heavy question it put Gemini at the top to synthesize; on a math-heavy one it used GPT.

How Fugu works, and how it scores

Agent pool

GPT-5.5

math, planning

Claude Opus 4.8

software, debugging

Gemini 3.1 Pro

science, recall

⇄

Fugu

picks & combines
per query, per step

→

One
answer

Fugu-Ultra vs best single model in the pool

SWE-Bench Pro

	73.7 Fugu-Ultra

	69.2 Opus 4.8

Terminal Bench 2.1

	82.1 Fugu-Ultra

	78.2 GPT-5.5

GPQA-Diamond

	95.5 Fugu-Ultra

	94.3 Gemini 3.1

Source: Sakana Fugu Technical Report (arXiv:2606.21228), Table 1 and Figure 1. Baseline scores are provider-reported.

The detail that explains the scores is per-step switching. Even Fugu, which picks one model per query, beats GPT-5.5 on Terminal Bench (82.1 versus 78.2) because it alternates models across a single solution. In the trajectories, GPT-5.5 builds and Opus 4.8 is called in at critical debugging points. On one task, after GPT built a package server, Opus caught that it had used a plain static HTTP server instead of a real package index and that its reachability check was hitting an orphaned process. Relaying that back let GPT finish the build.

What to take from it

Orchestration is a scaling axis. If capability can be amplified by composing existing models, progress need not depend solely on the largest training runs. New models fold into the pool as they ship, and the gains pass through without retraining the orchestrator.

The ceiling is the pool. Coordination cannot invent a skill no worker has. On a long-context retrieval test (MRCRv2), GPT-5.5 still leads at 94.8 against Fugu-Ultra's 93.6. Where a single model holds a raw-capability edge, routing does not erase it.

More steps is not always better. The faster Fugu beats Fugu-Ultra on SciCode (60.1 versus 58.7) and on the conversational task they tested. Deeper orchestration adds latency and is not automatically better, so the variant should match the task.

Vendor independence becomes a feature. The pool is swappable: opt specific providers out for compliance, or route around an access change. The tradeoff for builders is real, though. It is a hosted API over a third-party pool, and one Fugu-Ultra call can fan out across several models and use more tokens than a single direct call.

Fugu and Fugu-Ultra are generally available through one API with subscription and pay-as-you-go plans. Launch coverage reports pay-as-you-go pricing for Fugu-Ultra near 5 USD per million input tokens and 30 USD per million output tokens, with the service not yet available in the EU and EEA while Sakana works toward GDPR compliance.

The bet underneath Fugu is that the unit of competition is shifting from the single model to the system that conducts many. The open question is whether the gains hold as the pool's models converge, or whether they depend on the workers staying as specialized as they are today.

ResearchAudio.io

Source: Sakana Fugu Technical Report · Sakana release

Foundations: Trinity and Conductor (both ICLR 2026)

Sakana's Fugu routes to GPT-5.5, Claude Opus, and Gemini. It beats all three.

Scale your customer experience across all channels

Sakana Built a Model That Beats the Models Inside It

Why route instead of scale

Two variants, one API

What to take from it

Keep Reading

Quick Links

Stay Updated