
ResearchAudio.io

OpenAI's First Self-Building Coding Agent

GPT-5.3-Codex debugged its own training. It also triggered OpenAI's first "High" cybersecurity rating.

OpenAI released GPT-5.3-Codex on February 5, 2026. The model combines GPT-5.2-Codex's coding performance with GPT-5.2's general reasoning into a single system. But the release came with an unusual detail: early versions of GPT-5.3-Codex helped debug its own training runs, manage its own deployment, and diagnose its own evaluation results. OpenAI calls it "the first model that was instrumental in creating itself."

The model also crossed a line OpenAI has been watching for months. GPT-5.3-Codex is the first model they classify as "High" capability for cybersecurity under their Preparedness Framework. They delayed full API access and deployed their most comprehensive safety stack to date. Here is the full technical breakdown.

SWE-Bench Pro: 57% | Terminal-Bench 2.0: 77.3% | OSWorld: 64% | 25% faster than 5.2

What Changed from 5.2 to 5.3

GPT-5.2-Codex was a coding-optimized variant of GPT-5.2. It was strong at code but relied on a separate GPT-5.2 model for general reasoning and professional knowledge tasks. GPT-5.3-Codex merges both capabilities into one model. Think of it as going from two specialists to one engineer who can both write code and reason about architecture, documentation, and business logic together.

The practical result: GPT-5.3-Codex can handle tasks that require research, tool use, and complex multi-step execution in a single session. OpenAI reports it can build and iterate on full web apps and games over multiple days, handling debugging, deployment, monitoring, and even documentation without switching models.
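To make the change concrete, here is a minimal sketch of what consolidation means for a developer workflow, assuming the OpenAI Python SDK's Chat Completions interface and using the model names from this article. The article notes that full API access to GPT-5.3-Codex is still gated, so treat this as illustrative rather than runnable against the live API today:

```python
from openai import OpenAI

client = OpenAI()

# Before (v5.2 era): route planning and coding to separate specialists,
# then stitch the results together yourself.
plan = client.chat.completions.create(
    model="gpt-5.2",  # general reasoning
    messages=[{"role": "user", "content": "Outline an architecture for a rate limiter."}],
)
code = client.chat.completions.create(
    model="gpt-5.2-codex",  # coding specialist
    messages=[{
        "role": "user",
        "content": f"Implement this plan:\n{plan.choices[0].message.content}",
    }],
)

# After (v5.3): one unified model handles planning, implementation,
# and documentation in a single session.
result = client.chat.completions.create(
    model="gpt-5.3-codex",  # API access pending per OpenAI
    messages=[{"role": "user", "content": "Design, implement, and document a rate limiter."}],
)
```

The point is not the specific calls but the orchestration that disappears: no hand-off code, and no context lost between the planner and the coder.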

GPT-5.2 vs GPT-5.3-Codex Architecture

v5.2: GPT-5.2-Codex (coding only) + GPT-5.2 (reasoning only)
v5.3: GPT-5.3-Codex (coding + reasoning + professional knowledge, unified)

Source: OpenAI blog, Feb 5, 2026

Benchmark Breakdown

GPT-5.3-Codex sets new highs on the benchmarks OpenAI uses to measure real-world coding and agentic ability. SWE-Bench Pro tests software engineering across four programming languages with contamination-resistant challenges. The model scored 57% here, up slightly from GPT-5.2-Codex's 56.4%. The jump is incremental, but it maintains the top position.

The more striking result is Terminal-Bench 2.0, which measures the terminal skills a coding agent needs. GPT-5.3-Codex scored 77.3%, far exceeding previous models. It also did this with fewer tokens than any prior model, meaning the same work costs less to run. OSWorld, an agentic computer-use benchmark where models complete productivity tasks in visual desktop environments, came in at 64%.

On GDPval, OpenAI's evaluation of professional knowledge work across 44 occupations, GPT-5.3-Codex matches GPT-5.2's performance. GDPval tests tasks like making presentations, spreadsheets, and other work products. This is notable because GPT-5.2-Codex was a coding specialist that couldn't match GPT-5.2 on these tasks. The unified model closes that gap.

The Self-Improvement Detail

OpenAI states that early versions of GPT-5.3-Codex were used during its own development cycle. The Codex team used the model to debug training runs, manage deployment infrastructure, diagnose evaluation results, and assist with operational tasks like adapting test harnesses and scaling GPU clusters as traffic changed. Sam Altman noted the team "was blown away by how much Codex was able to accelerate its own development."

Worth noting: The system card explicitly states GPT-5.3-Codex "does not reach High capability on AI self-improvement." OpenAI draws a clear line between a model assisting with its own infrastructure and a model autonomously improving its own weights or training process. The distinction matters for safety evaluations.

This is practical recursion, not recursive self-improvement. The model helped with engineering tasks around its training, like any developer might. It did not modify its own architecture or training objectives. Still, the pattern of models accelerating their own development cycles is one to watch as capabilities increase.

First "High" Cybersecurity Rating

GPT-5.3-Codex is the first OpenAI model treated as "High" capability for cybersecurity under the Preparedness Framework. OpenAI's system card states they "do not have definitive evidence" that the model reaches the High threshold, but because they cannot rule it out, they are treating it as High on a precautionary basis.

What this means in practice: the Preparedness Framework says OpenAI will not release a model rated "High" in any risk area without first implementing safeguards. For GPT-5.3-Codex, those safeguards include safety training baked into the model, automated monitoring of usage, a trusted-access program that gates advanced cybersecurity capabilities behind vetting, and enforcement pipelines backed by threat intelligence.

Cybersecurity Safety Stack

Safety training: baked into the model weights
Automated monitoring: real-time usage analysis
Trusted access program: vetted security researchers only
Threat intelligence: enforcement pipelines

Source: GPT-5.3-Codex System Card, OpenAI

Full API access is delayed while these controls are finalized. Paid ChatGPT users can access the model through the Codex app, CLI, IDE extensions, and web interface. But developers who want to automate the model at scale via the API will need to wait. OpenAI also announced a $10 million commitment in API credits for cybersecurity defense work, an expanded beta of Aardvark (their security research agent), and free vulnerability scanning for major open-source projects.

Mid-Turn Steering

A notable UX change: GPT-5.3-Codex now provides frequent progress updates while it works. Instead of waiting for a final answer, developers can interact with the model mid-execution: ask questions, discuss an approach, and change direction without losing context. Think of it as pair programming where you can tap your partner on the shoulder at any point.

Steer mode is now stable and enabled by default. In the Codex app, pressing Enter sends a message immediately during running tasks, while Tab queues follow-up input. This moves the interaction model from "submit a task and wait" to something closer to collaborative work.
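The interaction pattern behind steer mode is easy to picture. Below is an illustrative sketch, not the actual Codex implementation, of how a steerable loop differs from fire-and-forget execution: the agent drains a queue of user messages between steps, so feedback can redirect work already in flight. Every name here is hypothetical.

```python
import queue

# Hypothetical sketch of a mid-turn steerable agent loop.
# A fire-and-forget agent would run all steps, then report once at the end.
steering_inbox = queue.Queue()  # of str; fed by the UI (Enter / Tab in the app)

def run_task(steps: list[str]) -> None:
    for step in steps:
        # Drain any steering messages queued since the last step.
        while not steering_inbox.empty():
            feedback = steering_inbox.get()
            print(f"[agent] incorporating mid-turn feedback: {feedback}")
            # A real agent would re-plan the remaining steps here.
        print(f"[agent] progress update: working on {step!r}")

# Simulated session: the user steers while the task is running.
steering_inbox.put("use Redis instead of in-memory storage")
run_task(["draft plan", "write code", "run tests", "write docs"])
```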

Key Takeaways

Unified models are the direction. Splitting coding and reasoning into separate models created friction for complex tasks. GPT-5.3-Codex's single-model approach for code, reasoning, and professional knowledge reflects a broader trend toward general-purpose agents.

Cybersecurity capability scales with coding ability. OpenAI has tracked a sharp jump in cyber capability with each Codex release since GPT-5-Codex. The same improvements that make a model a better developer also make it a more capable offensive tool. Expect this tension to define future model releases.

Token efficiency matters as much as raw performance. GPT-5.3-Codex achieves its Terminal-Bench 2.0 results with fewer tokens than any prior model. For teams running Codex at scale, this translates directly to lower costs and faster completion times; a rough cost sketch follows these takeaways.

Interactivity changes how developers use agents. Mid-turn steering moves Codex from a fire-and-forget tool to a collaborative partner. This is closer to how human developers actually work together, and it may change expectations for all coding agents.
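To see why token efficiency compounds at scale, here is the back-of-the-envelope cost sketch promised above. Every number is an invented placeholder, not OpenAI pricing or a measured token count:

```python
# All figures are hypothetical placeholders for illustration only.
PRICE_PER_1K_OUTPUT_TOKENS = 0.01  # placeholder USD rate, not OpenAI pricing
TASKS_PER_DAY = 10_000             # assumed fleet-scale workload

def daily_cost(tokens_per_task: int) -> float:
    """Daily spend for a fleet of agent tasks at a flat per-token rate."""
    return TASKS_PER_DAY * tokens_per_task / 1_000 * PRICE_PER_1K_OUTPUT_TOKENS

old = daily_cost(tokens_per_task=50_000)  # assumed prior-model token usage
new = daily_cost(tokens_per_task=35_000)  # assumed 30% reduction
print(f"${old:,.0f}/day -> ${new:,.0f}/day ({1 - new / old:.0%} saved)")
```

With these placeholders, a 30% token reduction cuts the daily bill from $5,000 to $3,500, and the saving scales linearly with workload.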

Competitive Context

GPT-5.3-Codex launched at the same moment Anthropic released Claude Opus 4.6, and the synchronized timing was deliberate on both sides. Enterprise data from a16z shows OpenAI's average share of enterprise AI spend is projected to shrink from 62% in 2024 to 53% in 2026, while Anthropic's grows from 14% to 18%. Notably, 89% of Anthropic's customers are testing or using its most capable models, the highest rate among providers.

The model was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. API access will follow once OpenAI has safely enabled it. For developers whose workflows depend on API availability, this means sticking with GPT-5.2-Codex for automated pipelines in the near term.

ResearchAudio.io - AI research, explained visually

Sources: OpenAI Blog | System Card | VentureBeat
