|
ResearchAudio.io GPT-5.4 Outscores Humans. 2.5M Users Quit.47% fewer tokens via Tool Search. CoT controllability at 0.3%. Full breakdown. |
||||
|
OpenAI released GPT-5.4 today, March 5, 2026. On the OSWorld-Verified benchmark for desktop navigation, the model scored 75.0% success. Reported human performance on the same benchmark: 72.4%. It is the first time a general-purpose AI model has surpassed human-level results on autonomous computer operation. And yet, the launch landed in the middle of a user revolt. An estimated 2.5 million people have canceled subscriptions or pledged support for a boycott called QuitGPT, triggered by OpenAI signing a contract with the U.S. Department of Defense. Anthropic had publicly declined the same contract days earlier, after the Pentagon refused to include language prohibiting autonomous weapons deployment. The technical story and the institutional story are now inseparable. Here is what GPT-5.4 actually does, what it costs, what the safety evaluations reveal, and why this release matters for the competitive landscape between OpenAI and Anthropic.
What GPT-5.4 CombinesGPT-5.4 is the first OpenAI model that merges the coding capabilities of GPT-5.3-Codex with improved reasoning, agentic workflows, and native computer use into a single model. There was no GPT-5.3 Thinking variant, only the Codex-specific model. The numbering jump from 5.3 to 5.4 reflects this unification. The model comes in three variants: GPT-5.4 Thinking (available in ChatGPT for Plus, Team, and Pro users), GPT-5.4 (API access), and GPT-5.4 Pro (for ChatGPT Pro, Enterprise, Edu, and API). Free ChatGPT users only get GPT-5.4 when queries are auto-routed to the model. GPT-5.2 Thinking moves to Legacy Models and will be discontinued on June 5, 2026. Tool Search: A Structural Fix for Token BloatTool Search changes how GPT-5.4 interacts with external tools in the API. Previously, every tool definition was loaded into the prompt upfront. For systems with dozens of MCP servers and hundreds of tool definitions, this added thousands of tokens to every request, regardless of whether the model needed those tools. Think of it as a library index versus carrying every book. Instead of receiving full tool schemas in the prompt, GPT-5.4 receives a lightweight tool list and retrieves full definitions only when needed. On 250 tasks from Scale's MCP Atlas benchmark with all 36 MCP servers enabled, the tool-search configuration reduced total token usage by 47% while achieving the same accuracy as loading all tools directly. This matters for cost. GPT-5.4 input tokens cost $2.50 per million (up 43% from GPT-5.2's $1.75). But if Tool Search cuts your actual token consumption by nearly half in tool-heavy workflows, net costs stay comparable or drop. Requests exceeding 272,000 input tokens are billed at 2x the normal rate, so the 1M context window comes with a premium tier built in. Native Computer Use: Beyond Chatbot TerritoryGPT-5.4 is the first general-purpose OpenAI model with native computer-use capabilities in Codex and the API. The model can write Playwright code to operate software programmatically, and it can also issue direct mouse and keyboard commands from screenshots. The benchmarks tell the story. On OSWorld-Verified (desktop navigation with screenshots, keyboard, and mouse), GPT-5.4 scored 75.0% versus GPT-5.2's 47.3%. Reported human performance sits at 72.4%. On WebArena-Verified, it reaches 67.3% using both DOM and screenshot-based interaction. On Online-Mind2Web, the model hits 92.8% using screenshots alone. OpenAI also launched ChatGPT for Excel integrations alongside GPT-5.4, enabling the model to work directly inside Microsoft Excel and Google Sheets cells. For investment banking modeling tasks, GPT-5.4 scored 87.3% compared to GPT-5.2's 68.4%. Human evaluators preferred its presentation output 68% of the time over GPT-5.2, citing better aesthetics and visual variety. Batch and Flex pricing are available at half the standard rate. Priority processing costs 2x the standard rate. OpenAI argues that higher per-token pricing is offset by the model's improved token efficiency: GPT-5.4 uses fewer reasoning tokens than GPT-5.2 for equivalent problems. CoT Monitorability: The Safety Story Under the HoodThe GPT-5.4 Thinking system card contains the most detailed chain-of-thought monitorability evaluation OpenAI has published to date. CoT monitorability measures whether an external monitor can correctly infer safety-relevant properties of a model's behavior by inspecting its reasoning trace, not just its final output. |

The headline finding: averaged across 13 evaluations spanning 24 environments, GPT-5.4 Thinking has lower aggregate CoT monitorability than GPT-5 Thinking (the original). The decline is not uniform. On agentic misalignment tasks, monitorability improved substantially and is near 100%. But on health queries with missing evidence and impossible tasks, monitorability dropped due to higher false positive rates in the intervention arm.
CoT controllability measures whether a model can deliberately reshape or hide its reasoning when instructed to. Lower is better: it means the model cannot easily obfuscate its thinking to evade monitors. GPT-5.4 Thinking's controllability sits at 0.3% for 10,000-character reasoning chains, compared to 0.2% for GPT-5.2 Thinking. OpenAI concludes that increased capabilities have not made the model meaningfully better at concealing its reasoning.
|
Key Insight (Safety): OpenAI separately claims that deception is less likely in GPT-5.4 Thinking than prior models, stating the model "lacks the ability to hide its reasoning and that CoT monitoring remains an effective safety tool." But the system card data tells a more nuanced story: aggregate monitorability actually declined, even though controllability stayed low. This means the model is not better at hiding its reasoning on purpose, but monitors are finding it harder to read the reasoning correctly on certain task types. The distinction matters for anyone building safety infrastructure on top of CoT traces. |
GPT-5.4 Thinking also received a "High Capability" cybersecurity classification, the first time a general-purpose reasoning model (not a specialized coding model) has reached this designation. This triggers OpenAI's most restrictive safeguard tier for cyber capabilities.
The Competitive and Institutional Context
GPT-5.4 is aimed squarely at the enterprise market where Anthropic has built strength. The model's focus on spreadsheets, presentations, computer use, and document generation mirrors capabilities Anthropic launched with Claude for Financial Services and later expanded with Cowork. The ChatGPT for Excel integration directly competes with Anthropic's Claude in Excel feature.
But independent benchmarks on Arena.ai and Artificial Analysis currently show Anthropic and Google models ranking ahead of OpenAI's offerings. Whether GPT-5.4 changes those rankings is an open question: direct third-party comparisons are not yet available.
The DoD contract controversy adds a dimension that benchmarks cannot address. OpenAI's user growth had already slowed below internal projections before the boycott. An estimated 1.5 million users left after the initial contract announcement, with the total reaching 2.5 million as of this week. OpenAI CEO Sam Altman has faced repeated questions about the gap between the company's stated safety commitments and the terms of the defense contract.
Key Insights
|
Tool Search changes the economics of agentic systems. If you are building MCP-heavy agent architectures, the 47% token reduction means you can connect more tools without proportional cost increases. This is the first time a frontier model provider has addressed the "tool definition bloat" problem at the model level rather than leaving it to developers. |
|
CoT monitorability is not monotonically improving with capability. The system card shows aggregate monitorability declined from GPT-5 Thinking to GPT-5.4 Thinking, even as controllability stayed flat. For AI safety researchers, this means the assumption that "more capable models produce more readable reasoning" does not hold across the board. Task-specific monitorability evaluation is necessary. |
|
The 1M context window is opt-in and priced aggressively. The default window remains 272K tokens. Exceeding it doubles the per-token input rate. Developers should benchmark whether long-context performance justifies the cost versus alternative approaches like RAG or chunked processing. The 128K output token limit also remains unchanged. |
|
Enterprise is the battleground, but trust is the variable. On pure technical merit, GPT-5.4 makes a strong case for enterprise adoption: 83% GDPval, 87.3% on financial modeling tasks, native computer use. But enterprise buyers weigh institutional trust alongside capability. The QuitGPT movement and DoD controversy introduce reputational risk that does not appear in any benchmark table. |
GPT-5.4 is technically the strongest general-purpose model OpenAI has shipped. Whether the company's institutional decisions let that technical strength translate to market position is the question that no system card can answer.
|
ResearchAudio.io Sources: OpenAI GPT-5.4 blog post · GPT-5.4 Thinking System Card · TechCrunch · VentureBeat |
Attio is the AI CRM for modern teams.
Connect your email and calendar, and Attio instantly builds your CRM. Every contact, every company, every conversation, all organized in one place.
Then Ask Attio anything:
Prep for meetings in seconds with full context from across your business
Know what’s happening across your entire pipeline instantly
Spot deals going sideways before they do
No more digging and no more data entry. Just answers.


