GPT-5.2 Released: What Changed and Why It Matters
A technical deep-dive into OpenAI's latest model release
Last Thursday, OpenAI released GPT-5.2 — just one month after GPT-5.1. That's an unusually fast release cadence, driven by competitive pressure from Google's Gemini 3.
When Gemini 3 launched in mid-November and topped multiple benchmarks, Sam Altman reportedly issued an internal "Code Red" memo, redirecting engineering resources to ChatGPT and accelerating this release by several weeks.
But GPT-5.2 is more than a reactive patch. It includes the first verified AI contribution to an unsolved math problem, a 3x improvement in abstract reasoning, and architectural changes worth understanding. Here's what actually shipped.
Internal Codename: Garlic
GPT-5.2's internal codename was "Garlic." According to The Information, there was also a failed predecessor project called "Shallotpeat" (OpenAI's naming convention apparently follows gardening metaphors for difficult training challenges).
The Garlic architecture reportedly solved pretraining bottlenecks that plagued earlier projects, allowing OpenAI to inject "big model knowledge" into more efficient architectures. This matters for the long game: future models might get smarter without proportional compute increases.
Key Technical Specifications
Context Window: 400,000 tokens input, 128,000 tokens max output. That 128K output is significant — you can generate entire codebases, complete legal documents, or book-length reports in a single API call. The effective input window maintains near-100% accuracy up to 256K tokens on needle-in-haystack tests.
ARC-AGI-2 Score: 52.9% (Thinking) and 54.2% (Pro). This is notable because GPT-5.1 scored 17.6% on this abstract reasoning benchmark designed to resist memorization. That's a 3x improvement in one month. Gemini 3 Pro scores 31.1%. For context, the benchmark is calibrated so that every ARC-AGI-2 task has been solved by humans.
ARC-AGI-1: GPT-5.2 Pro is the first model to cross 90% (achieving 90.5%), while reducing the cost to reach that performance by approximately 390x compared to o3-preview.
AIME 2025: 100% without tools. Gemini 3 Pro only matches this with code execution enabled.
Knowledge Cutoff: August 31, 2025
The Statistical Learning Theory Contribution
OpenAI published a companion blog post describing how GPT-5.2 Pro contributed to solving an open problem in statistical learning theory.
The question: "If you collect more data, do your results reliably get better?" Researchers have known since 2019 that even simple statistical models can have non-monotonic learning curves where more data actually increases error.
The methodology is notable: researchers did not feed the model a proof outline or intermediate steps. They asked GPT-5.2 Pro to solve the open problem directly. The model proposed a proof that was subsequently verified by the paper authors and reviewed by external subject-matter experts. When prompted with follow-up questions, it extended the result to higher-dimensional settings.
The resulting paper is titled "On Learning-Curve Monotonicity for Maximum Likelihood Estimators." This appears to be the first verified AI contribution to an unsolved mathematical problem without human mathematical scaffolding.
Three Model Variants
OpenAI is now explicitly segmenting models by compute and latency tradeoffs:
GPT-5.2 Instant: Optimized for speed. Daily tasks like writing, translation, quick lookups.
GPT-5.2 Thinking: Extended reasoning chains for complex structured work. Features a "thinking time" dial (Light/Medium/Heavy) that trades latency for depth. Includes a new "xhigh" reasoning effort setting.
GPT-5.2 Pro: Maximum accuracy for difficult questions. Some requests take several minutes. Designed for scenarios where errors are costly. Supports reasoning.effort settings of medium, high, and xhigh.
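To make the variant-and-effort tradeoff concrete, here's a minimal sketch of assembling a Responses API payload with a reasoning-effort setting. The field names follow OpenAI's Responses API conventions, but treat the exact payload shape (and the per-model effort table) as assumptions based on this post, not documented spec:

```python
# Sketch: pick a model variant and reasoning effort for a Responses API call.
# Payload shape and allowed-effort table are assumptions, not official spec.

def build_request(model: str, effort: str, prompt: str) -> dict:
    """Assemble a Responses API payload with a reasoning.effort setting."""
    allowed = {
        "gpt-5.2": {"medium", "high", "xhigh"},      # Thinking (default)
        "gpt-5.2-pro": {"medium", "high", "xhigh"},  # Pro
    }
    if effort not in allowed.get(model, set()):
        raise ValueError(f"{model} does not support effort={effort!r}")
    return {
        "model": model,
        "input": prompt,
        "reasoning": {"effort": effort},
    }

req = build_request("gpt-5.2-pro", "xhigh", "Prove the lemma step by step.")
print(req["reasoning"])  # {'effort': 'xhigh'}
```

The practical takeaway: effort is now a first-class request parameter, so you can route cheap lookups to Instant and reserve xhigh for the requests where minutes of latency are acceptable.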
API Pricing
Pricing increased 40% over GPT-5.1:
GPT-5.2: $1.75 input / $14 output per million tokens
GPT-5.2 Pro: $21 input / $168 output per million tokens
Cached inputs: $0.175 per million (90% discount)
For comparison, o1-pro costs $150/$600 per million tokens, so GPT-5.2 Pro undercuts OpenAI's most expensive reasoning model significantly.
OpenAI argues the higher cost is offset by "greater token efficiency" — the model solves tasks in fewer turns. Their internal testing suggests total cost per task actually decreased for agentic workflows despite higher per-token prices.
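You can sanity-check the pricing claims yourself with a back-of-envelope calculator using the per-million-token prices quoted above (the request sizes below are illustrative, not from OpenAI):

```python
# Per-million-token prices (USD) as listed in this post.
PRICES = {
    "gpt-5.2":     {"input": 1.75, "cached": 0.175, "output": 14.0},
    "gpt-5.2-pro": {"input": 21.0, "output": 168.0},
    "o1-pro":      {"input": 150.0, "output": 600.0},
}

def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Estimate one request's cost; cached tokens bill at the cached rate."""
    p = PRICES[model]
    fresh = input_tokens - cached_tokens
    usd = (fresh * p["input"]
           + cached_tokens * p.get("cached", p["input"])
           + output_tokens * p["output"]) / 1_000_000
    return round(usd, 4)

# A 100K-token prompt (40K of it cached) producing a 20K-token answer:
print(request_cost("gpt-5.2", 100_000, 20_000, cached_tokens=40_000))  # 0.392
```

Running the same shape through the other tiers shows the headline comparison: GPT-5.2 Pro comes to about $5.46 for this request versus $27.00 for o1-pro, which is where the "undercuts significantly" claim comes from.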
The /compact Endpoint
For workflows that exceed even the 400K context window, GPT-5.2 Thinking introduces compatibility with a new Responses /compact endpoint. This performs context compaction to extend effective context for tool-heavy, long-running agent jobs.
If you're building agents that iteratively call tools over many steps and need to maintain state beyond raw token limits, this endpoint addresses that gap. The model can now sustain multi-hour autonomous workflows that would previously have overflowed the context window.
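The agent-loop pattern looks roughly like the sketch below. OpenAI hasn't published the /compact schema in this post, so the trigger threshold and the post-compaction summary size are purely illustrative assumptions:

```python
# Illustrative bookkeeping for a long-running agent job: track context usage
# and compact when the window fills. Threshold and summary size are assumed.

CONTEXT_WINDOW = 400_000
COMPACT_AT = 0.8  # compact at 80% of the window (arbitrary choice)

def should_compact(used_tokens: int) -> bool:
    return used_tokens >= CONTEXT_WINDOW * COMPACT_AT

used = 0
for step_tokens in [120_000, 120_000, 90_000]:  # successive tool-call turns
    used += step_tokens
    if should_compact(used):
        # A call to the /compact endpoint would go here (request shape
        # unpublished); assume it shrinks history to a ~20K-token summary.
        used = 20_000
print(used)  # 20000
```

The point is that compaction turns context management from a hard crash into a periodic maintenance step inside the loop.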
Vision Improvements
GPT-5.2 Thinking cuts error rates roughly in half on chart reasoning and software interface understanding:
- Better spatial layout understanding (where elements are positioned in images)
- More accurate interpretation of dashboards, technical diagrams, product screenshots
- Improved bounding-box identification on low-quality images
- New state-of-the-art on ScreenSpot-Pro (professional GUI reasoning)
Tool Use Performance
GPT-5.2 Thinking achieves 98.7% on Tau2-bench Telecom, a multi-turn customer support benchmark where the model orchestrates tool calls across realistic workflows.
OpenAI's example: a traveler with a delayed flight, missed connection, lost bag, and medical seating requirement. GPT-5.2 manages rebooking, special assistance seating, and compensation in a consistent sequence. GPT-5.1 left steps unfinished.
GDPval Benchmark Results
OpenAI introduced GDPval earlier this year — a benchmark measuring "well-specified knowledge work tasks" across 44 occupations spanning the top nine industries contributing to U.S. GDP.
Results:
- GPT-5.2 Thinking beats or ties top industry professionals on 70.9% of comparisons
- GPT-5.1 scored 38.8%, so GPT-5.2 nearly doubles its predecessor's win/tie rate
- Outputs generated at over 11x the speed and under 1% the cost of expert professionals
Tasks include creating spreadsheets, presentations, legal briefs, financial models, and manufacturing diagrams. On internal benchmarks for junior investment banking spreadsheet modeling (LBO models), average scores rose from 59.1% to 68.4%.
Note: This is OpenAI's own benchmark and has not been independently validated.
Developer Details
API Endpoints:
- gpt-5.2 (Thinking, default)
- gpt-5.2-chat-latest (Instant)
- gpt-5.2-pro
Supported Features:
- Reasoning tokens built in (the chain-of-thought mechanism introduced with the o1 series)
- xhigh reasoning effort setting (Thinking and Pro)
- /compact endpoint for extended context
- Streaming, function calling, structured outputs
- No fine-tuning yet, but distillation is supported
Rate Limits: Tier 1 starts at 500 RPM / 500K TPM, scaling to Tier 5 at 15,000 RPM / 40M TPM.
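At Tier 1's 500 RPM you will hit 429s under any real load, so plan for retries. Here's a generic client-side pattern (exponential backoff with jitter); this is not an OpenAI SDK feature, just standard practice:

```python
# Exponential backoff with jitter for 429 rate-limit responses.
import random

def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 60.0):
    """Yield sleep durations (seconds) for successive retries."""
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        yield delay * random.uniform(0.5, 1.0)  # jitter avoids thundering herds

delays = list(backoff_delays())
print(len(delays))  # 5
```

In production you'd wrap your API call in a loop that sleeps for each yielded delay until the request succeeds or retries are exhausted.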
Competitive Comparison
Where GPT-5.2 leads:
- SWE-Bench Pro: 55.6% vs Gemini 3 Pro at 43.3%
- ARC-AGI-1: 86.2% for Thinking (90.5% for Pro) vs Gemini 3 at 75%
- AIME 2025: 100% without tools vs Gemini 3 at 95%
- Professional knowledge work benchmarks
Where competitors lead:
- Claude Opus 4.5: 80.9% on SWE-bench Verified (GPT-5.2: 80.0%)
- Gemini 3 Pro: Superior multimodal processing for video and image generation
- Gemini 3 Deep Think: 93.8% on GPQA Diamond (GPT-5.2 Pro: 93.2%)
- Gemini 3: Top spot on Humanity's Last Exam
Known Limitations
Speed penalty: Multiple reviewers note GPT-5.2 Thinking is noticeably slower than its predecessor. The Thinking dial trades latency for depth.
Formatting issues: Some users report overly verbose responses with excessive markdown formatting.
No image generation improvements: Despite reports that image generation was a priority, GPT-5.2 shipped without improvements in this area. OpenAI reportedly plans a January release addressing this.
Benchmark concerns: Some researchers question benchmark validity without reproducibility. ARC-AGI-2 scores correlate with compute spending at inference, suggesting results may reflect test-time compute rather than architectural improvements.
Timeline
January 2026: OpenAI reportedly plans another model release with image improvements and "better personality."
Q1 2026: "Adult mode" expected with age estimation systems.
Coming weeks: A version of GPT-5.2 optimized for Codex.
GPT-5.1 sunset: Will remain available in the legacy models list for three months, then be deprecated.
Summary
GPT-5.2 delivers meaningful improvements — particularly the 3x jump in abstract reasoning on ARC-AGI-2, the verified math research contribution, and tool-use reliability. The 400K context window with 128K output is a legitimate capability unlock for enterprise workflows.
However, this is clearly a competitive response release. The 40% price increase, formatting regressions, and speed penalties suggest trade-offs that got shipped under time pressure.
For enterprise users building production systems: GPT-5.2 Thinking is worth evaluating for complex reasoning and multi-step tool workflows. For everyday use: the improvements are incremental but noticeable in professional document creation.
The speed of this release cycle — from "comfortable lead" to "code red response" in a single month — reflects how competitive the frontier model space has become.
Thanks for reading. See you in the next one.
— Deep

