Why Your AI Keeps Making Things Up (And What You Can Actually Do About It)
New research reveals the real reason ChatGPT hallucinates—and it's not what you think. Plus: 5 practical strategies to get more reliable answers starting today.
RESEARCH PAPER
"Why Language Models Hallucinate"
Kalai et al. (OpenAI & Georgia Tech, 2025)
→ Read the full paper on arXiv
Picture this: You ask ChatGPT about Adam Kalai's birthday. It confidently responds: "March 7th."
You ask again. This time: "June 15th."
One more try: "January 1st."
All three answers are wrong. The correct date is in the autumn. But here's the fascinating part—this isn't a bug. It's a feature of how these models are designed.
💡 The Uncomfortable Truth: Groundbreaking research from OpenAI and Georgia Tech reveals that AI hallucinations aren't mysterious glitches—they're the inevitable result of how we train and evaluate these models. And the worst part? We've been accidentally encouraging them.
🎓 Your AI is Basically a Student Taking an Exam (And It's Always Guessing)
Think back to your hardest exam in college. You studied hard, but there's one question you just don't know. Do you:
A) Leave it blank and write "I don't know"
B) Make your best educated guess
If you're trying to maximize your score, you always pick B. Blank answers get zero points. Wrong answers get zero points. But a guess has a chance of being right.
This is exactly how ChatGPT and other LLMs are trained to behave.
The researchers analyzed dozens of popular AI benchmarks—MMLU, GPQA, SWE-bench, and others. Their finding? Nearly all of them use binary grading that rewards guessing over admitting uncertainty.
90%+ of mainstream AI evaluations penalize models for saying "I don't know"—even when uncertainty is the honest answer.
📊 The Three Statistical Reasons AI Hallucinates (Explained Like You're Five)
The research identified three fundamental causes of hallucinations during the training phase. Let's break them down with real examples:
The Hallucination Triangle:
- Singleton Problem: rare facts in training data
- Poor Model Problem: architecture limitations
- GIGO Problem: bad training data
1. 🎲 The "Singleton Problem" (Arbitrary Facts)
The Issue: If a fact appears only once in training data, the model has essentially nothing to generalize from, so when asked about it, it's reduced to guessing among plausible-sounding alternatives.
Real Example: Einstein's birthday appears thousands of times online. The birthday of a moderately notable professor? Maybe once, in an obituary. Guess which one ChatGPT gets wrong?
📐 The Math (from the paper):

Hallucination rate ≥ Singleton rate − δ

Where singleton rate = fraction of facts appearing exactly once in training, and δ = calibration error (typically small)
⚠️ The Bottom Line: If 20% of the facts in the training data appear exactly once, the model can be expected to hallucinate on at least roughly 20% of prompts about that kind of fact, no matter how good the model is.
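If you want to see the bound in action, here's a minimal Python sketch that computes the singleton rate of a toy fact corpus and the hallucination floor it implies. The corpus, the (entity, attribute, value) encoding, and the δ value are all invented for illustration; this is not the paper's estimator.

```python
from collections import Counter

def singleton_rate(facts):
    """Fraction of distinct facts that appear exactly once in the corpus."""
    counts = Counter(facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy "corpus" of (entity, attribute, value) facts extracted from training text.
facts = [
    ("einstein", "birthday", "1879-03-14"),   # appears many times online
    ("einstein", "birthday", "1879-03-14"),
    ("einstein", "birthday", "1879-03-14"),
    ("adam_kalai", "birthday", "autumn"),     # appears exactly once
    ("obscure_professor", "birthday", "1961-10-02"),  # appears exactly once (made-up example)
]

sr = singleton_rate(facts)
delta = 0.02  # assumed small calibration error, just for illustration
print(f"singleton rate ~= {sr:.2f}")                               # 0.67
print(f"implied hallucination floor ~= {max(sr - delta, 0):.2f}")  # 0.65
```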
2. 🧠 The "Poor Model Problem"
The Issue: Some tasks are just hard for the model's architecture to learn.
Real Example: Ask ChatGPT "How many Ds are in DEEPSEEK?" Most models get this wrong, answering "2" or "3" or even "6" (the correct answer is 1). Why? They process text as tokens (D/EEP/SEEK), not individual characters.
❌ Standard Model Response:
User: How many Ds are in DEEPSEEK?
AI: Let me count: D-E-E-P-S-E-E-K
I see the letter D at the beginning and... there are 2 Ds.
❌ WRONG (correct answer: 1)
✅ Reasoning Model (DeepSeek-R1) Response:
User: How many Ds are in DEEPSEEK?
AI: Let me spell it out letter by letter:
D - that's one D (count: 1)
E - not a D
E - not a D
P - not a D
S - not a D
E - not a D
E - not a D
K - not a D
Total: 1 D
✅ CORRECT
✅ The Fix: Reasoning models solve this by explicitly breaking down the task into steps. Sometimes you need to change the approach, not just add more data.
3. 🗑️ The "Garbage In, Garbage Out" Problem
The Issue: Training data contains myths, conspiracy theories, and misinformation.
Real Example: Early models confidently repeated conspiracy theories because those narratives appeared in their training data. Post-training (like RLHF) helps, but it's playing catch-up to bad data that's already baked in.
🛠️ 5 Strategies to Get More Reliable AI Answers (Starting Right Now)
Okay, enough theory. Here's what you can actually do when you're building with or using these models:
Strategy #1: Explicitly Ask for Uncertainty
Don't just ask questions. Add confidence thresholds to your prompts.
❌ BAD PROMPT:
"What is the market size for AI agents in 2025?"
✅ BETTER PROMPT (75% confidence):
"What is the market size for AI agents in 2025? Only answer if you're >75% confident in the accuracy of your response. If you're not confident, explain what information is missing or uncertain, and suggest how I could find more reliable data."
🎯 ADVANCED PROMPT (90% confidence):
"What is the market size for AI agents in 2025? Answer only if you're >90% confident. If uncertain, provide:
1) What you DO know with confidence
2) What's uncertain and why
3) Specific sources I should check"
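If you're calling models programmatically, it's worth baking this into a helper so every question gets the same treatment. Here's a small sketch using the wording from the prompts above; the function name and defaults are just suggestions:

```python
def with_confidence_threshold(question: str, threshold: int = 75) -> str:
    """Wrap a question so the model is asked to answer only above a confidence threshold."""
    return (
        f"{question}\n\n"
        f"Only answer if you're >{threshold}% confident in the accuracy of your response. "
        "If uncertain, provide:\n"
        "1) What you DO know with confidence\n"
        "2) What's uncertain and why\n"
        "3) Specific sources I should check"
    )

prompt = with_confidence_threshold("What is the market size for AI agents in 2025?", threshold=90)
print(prompt)  # paste into any chat UI, or send through whatever API client you already use
```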
Strategy #2: Use "Recent Facts" Triggers Wisely
Training data is frozen in the past, so models should always search when asked about recent events, elections, prices, or anything else that changes frequently (daily or monthly).
🔍 Keywords that should trigger a web search (illustrative examples): "latest," "current," "today," "this week," "price," "stock," "election results," "who won," or any date near the present.
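In code, a rough first pass is a keyword/regex check that flags freshness-sensitive queries before they ever reach the model. The patterns below are illustrative only; a production system would pair them with a proper classifier:

```python
import re

# Illustrative patterns for "freshness-sensitive" queries; tune for your domain.
FRESHNESS_PATTERNS = [
    r"\blatest\b", r"\bcurrent\b", r"\btoday\b", r"\bthis (week|month|year)\b",
    r"\bprice\b", r"\bstock\b", r"\belection\b", r"\bwho won\b",
    r"\b20(2[4-9]|3\d)\b",  # recent or upcoming years
]

def needs_web_search(query: str) -> bool:
    """Return True if the query looks like it depends on fresh, fast-changing facts."""
    q = query.lower()
    return any(re.search(p, q) for p in FRESHNESS_PATTERNS)

print(needs_web_search("What is the market size for AI agents in 2025?"))  # True
print(needs_web_search("Explain how transformers use attention."))         # False
```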
Strategy #3: Break Down Complex Tasks (Chain-of-Thought)
Remember the letter-counting problem? Give the model explicit steps:
"Count the number of [LETTER] in [WORD] by following these exact steps:
Step 1: Spell out [WORD] letter by letter with spaces between each letter
Step 2: Go through each letter one at a time and mark whether it matches [LETTER]
Step 3: Count the total number of matches
Step 4: Provide only the final number"
Pro Tip: This forces the model into a better reasoning path and makes errors easier to spot.
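And for letter counting specifically, you don't need the model at all; a few lines of Python give you ground truth to check the model's chain of thought against. A minimal sketch:

```python
def count_letter(word: str, letter: str) -> int:
    """Step through the word one character at a time and count matches."""
    letter = letter.upper()
    matches = [ch for ch in word.upper() if ch == letter]
    return len(matches)

print(count_letter("DEEPSEEK", "D"))  # 1
print(count_letter("DEEPSEEK", "E"))  # 4
```

The broader point: whenever a task has a cheap deterministic check, run it alongside the model instead of trusting the model's arithmetic.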
Strategy #4: Verify "Singleton Facts" (The Hallucination Checklist)
Any fact that's obscure or specific should be treated as high-risk.
⚠️ High-Risk Hallucination Signals:
Specific dates for non-famous people or events
Quotes or statistics without attribution
Technical specifications for obscure products
Financial data for non-public companies
Any claim that returns <10 Google results
✅ Rule of Thumb: If Google returns fewer than 10 results for a specific claim, consider it a hallucination risk. Always verify with original sources.
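Here's one way to turn that checklist into a quick pre-flight check on a model's output. The regexes are rough heuristics I'm assuming for illustration, not a validated detector:

```python
import re

# Rough heuristics for claims that tend to be "singleton facts".
RISK_PATTERNS = {
    "specific_date": r"\b(19|20)\d{2}-\d{2}-\d{2}\b|\b[A-Z][a-z]+ \d{1,2}, (19|20)\d{2}\b",
    "unattributed_stat": r"\d+(\.\d+)?%",
    "money_figure": r"[$€£]\s?\d",
}

def hallucination_risk_flags(claim: str) -> list[str]:
    """Return the names of the risk patterns that match the claim text."""
    return [name for name, pattern in RISK_PATTERNS.items() if re.search(pattern, claim)]

claim = "The professor was born on March 7, 1962 and manages a $2.3M research budget."
print(hallucination_risk_flags(claim))  # ['specific_date', 'money_figure']
```

Anything that gets flagged goes through the Google-results sanity check (or a proper source lookup) before you repeat it.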
Strategy #5: Use Reasoning Models for High-Stakes Decisions
Models like OpenAI's o1, DeepSeek-R1, or Claude with extended thinking show their "chain of thought." This helps you spot where they might be guessing.
| Use Case | Recommended Model Type |
|---|---|
| ✍️ Content writing, brainstorming | Standard (GPT-4, Claude Sonnet) |
| 📊 Data analysis, research | Standard + Web Search |
| 💰 Financial decisions | Reasoning (o1, DeepSeek-R1) |
| ⚖️ Legal/medical review | Reasoning + Human verification |
| 🔢 Math, logic problems | Reasoning (o1, DeepSeek-R1) |
👨‍💻 For Developers: Building More Reliable AI Products
If you're building AI products, this research has direct implications for your architecture:
🏗️ Architectural Recommendations
1. Implement Confidence-Based Routing
Don't just rely on the LLM's output. Build a routing layer that:
- Detects low-confidence scenarios (obscure facts, recent events, calculations)
- Routes to web search, specialized tools, or human review
- Returns "Unable to verify" rather than hallucinated answers
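Here's a deliberately simplified sketch of what that routing layer can look like. The risk categories mirror the checklist in this article; the keyword checks are placeholders you'd replace with real detectors (embeddings, a small classifier, or the regex helpers above):

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    route: str    # "llm", "web_search", "reasoning_model", or "unable_to_verify"
    reason: str

def classify_risk(query: str) -> str:
    """Very rough risk classifier; swap in your own detectors."""
    q = query.lower()
    if any(k in q for k in ("latest", "today", "price", "election", "2025")):
        return "recent_event"
    if any(k in q for k in ("birthday", "founded", "serial number", "revenue of")):
        return "singleton_fact"
    if any(k in q for k in ("how many", "calculate", "sum of")):
        return "calculation"
    return "low_risk"

def route(query: str) -> RoutingDecision:
    """Send high-risk queries to search, reasoning, or an honest 'unable to verify'."""
    risk = classify_risk(query)
    if risk == "recent_event":
        return RoutingDecision("web_search", "fact may have changed since training")
    if risk == "singleton_fact":
        return RoutingDecision("unable_to_verify", "rare fact; verify against a primary source")
    if risk == "calculation":
        return RoutingDecision("reasoning_model", "needs explicit step-by-step decomposition")
    return RoutingDecision("llm", "low hallucination risk")

print(route("What's the price of NVDA today?"))
print(route("When is Adam Kalai's birthday?"))
```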
2. Create Custom Evaluation Metrics
The researchers found that standard benchmarks encourage hallucination. For your use case:
- Reward uncertain responses with partial credit
- Penalize confident wrong answers more than "I don't know"
- Track hallucination rate as a separate metric from accuracy
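As a starting point, here's a sketch of a per-response scoring function in that spirit. The t/(1-t) penalty is my reading of the confidence-target scheme the paper proposes, so treat the exact constants as tunable:

```python
def score_response(is_correct: bool, abstained: bool, confidence_target: float = 0.75) -> float:
    """Reward correct answers, give 'I don't know' zero (not negative) credit,
    and penalize confident wrong answers harder than abstentions."""
    if abstained:
        return 0.0
    if is_correct:
        return 1.0
    t = confidence_target
    return -t / (1 - t)  # at t = 0.75, a wrong answer costs 3 points

print(score_response(is_correct=False, abstained=True))   #  0.0  (honest uncertainty)
print(score_response(is_correct=False, abstained=False))  # -3.0  (confident hallucination)
```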
3. Implement Verification Layers
For critical facts:
- Use multiple models and check for consensus
- Automatically fact-check claims against authoritative sources
- Flag inconsistencies between related answers
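For the multi-model consensus check, one simple pattern is to ask several models (or the same model several times), normalize the answers, and only accept one that a clear majority agrees on. A sketch, where ask_model is a hypothetical wrapper around whatever API you use:

```python
from collections import Counter
from typing import Callable, Optional

def consensus_answer(
    question: str,
    ask_model: Callable[[str, str], str],  # hypothetical: (model_name, question) -> answer
    models: list[str],
    min_agreement: float = 0.7,
) -> Optional[str]:
    """Query several models and accept an answer only if most of them agree."""
    answers = [ask_model(m, question).strip().lower() for m in models]
    best, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return best
    return None  # no consensus: escalate to web search or human review

# Usage (with your own ask_model wrapper):
# consensus_answer("When was X founded?", ask_model, ["model-a", "model-b", "model-c"])
```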
🚀 Quick Implementation Checklist
- Classify query risk based on keywords (singleton signals, recent events, calculations)
- Route high-risk queries to web search, reasoning models, or return "unable to verify"
- Track hallucinations separately from accuracy in your metrics
- Penalize confident wrong answers more than uncertain responses
- Verify critical facts with multi-model consensus + web search
🔮 The Bigger Picture: Why This Matters for AI's Future
Here's the uncomfortable truth: we've been measuring AI progress all wrong.
The researchers analyzed the most influential AI leaderboards—HELM, Open LLM Leaderboard, SWE-bench, and Humanity's Last Exam. Nearly all of them use metrics that inadvertently reward confident hallucinations over honest uncertainty.
Suppose Model A is aligned—it correctly signals uncertainty and never hallucinates. Let Model B be similar except it never indicates uncertainty and always 'guesses' when unsure.
Model B will outperform Model A under 0-1 scoring, the basis of most current benchmarks.
— Kalai et al., "Why Language Models Hallucinate"
→ Read the full paper

This creates an "epidemic" of penalizing uncertainty. Models that honestly say "I don't know" get lower benchmark scores. So the entire industry optimizes for models that confidently bullshit instead.
Key Research Insights from the Paper
- 📈 Statistical Guarantee: Error rate ≥ 2× IIV classification error (even with perfect calibration)
- 🎯 Singleton Bound: Hallucination rate ≥ fraction of facts appearing exactly once in training data
- ⚖️ Binary Grading Problem: 90%+ of benchmarks penalize "I don't know" responses equally to wrong answers
- 🔧 Proposed Solution: Add explicit confidence thresholds to evaluation prompts (e.g., "answer only if >75% confident")
The fix? We need to change how we evaluate AI—not just add more hallucination tests. The researchers propose adding explicit "confidence targets" to evaluation prompts, similar to how some standardized tests penalize wrong answers.
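To see why an explicit confidence target changes the incentive, here's the test-taker's expected-value arithmetic, assuming the same penalty scheme sketched earlier (1 point for a correct answer, 0 for abstaining, a t/(1-t) point deduction for a wrong one):

```python
def expected_score_of_guessing(p_correct: float, confidence_target: float = 0.75) -> float:
    """Expected points from answering when you think you're right with probability p_correct."""
    t = confidence_target
    penalty = t / (1 - t)  # wrong answers cost t/(1-t) points
    return p_correct * 1.0 - (1 - p_correct) * penalty

# Abstaining always scores exactly 0, so guessing only pays above the stated target.
for p in (0.50, 0.75, 0.90):
    print(p, round(expected_score_of_guessing(p), 2))  # -1.0, 0.0, 0.6
```

Below the target, guessing has negative expected value and "I don't know" becomes the rational answer; that's exactly the behavior such benchmarks would finally reward.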
🎯 Key Takeaways
1. Hallucinations aren't bugs—they're the result of training AI to maximize test scores, which rewards guessing over honesty.
2. Three statistical factors guarantee hallucinations: rare facts (singleton problem), model limitations (poor models), and bad training data (GIGO).
3. You can get better answers by explicitly asking for confidence levels, breaking down complex tasks, and verifying singleton facts.
4. Developers should build confidence-based routing, custom metrics that reward uncertainty, and multi-layer verification systems.
5. The industry needs better benchmarks that don't penalize models for admitting what they don't know.
💭 Final Thought: Living with Uncertainty
The most fascinating insight from this research isn't technical—it's philosophical.
We've built a technology that mirrors one of humanity's most persistent flaws: our discomfort with saying "I don't know."
Students guess on exams because blank answers get zero points. Experts make confident predictions because "it depends" doesn't get you on TV. And now our AI systems hallucinate because benchmarks punish uncertainty.
Perhaps the path to more truthful AI starts with building systems—and cultures—that value honest uncertainty over confident bullshit.
After all, "I don't know, but I can find out" is a much more intelligent response than a confidently wrong answer.
📚 Want to Dive Deeper?
The full research paper "Why Language Models Hallucinate" is available on arXiv. It's technical but worth reading if you're building AI products or researching reliability.
What strategies have you found effective for reducing hallucinations? Hit reply and let me know—I read every response.
📎 Additional Resources
- Full Research Paper (arXiv) - "Why Language Models Hallucinate"
- OpenAI Research Blog - Related hallucination research
- Anthropic's AI Safety Work - Constitutional AI and RLHF
- DeepSeek - Try reasoning models mentioned in the article
Until next time, stay curious and always verify those singleton facts.
Become An AI Expert In Just 5 Minutes
If you’re a decision maker at your company, you need to be on the bleeding edge of, well, everything. But before you go signing up for seminars, conferences, lunch ‘n learns, and all that jazz, just know there’s a far better (and simpler) way: Subscribing to The Deep View.
This daily newsletter condenses everything you need to know about the latest and greatest AI developments into a 5-minute read. Squeeze it into your morning coffee break and before you know it, you’ll be an expert too.
Subscribe right here. It’s totally free, wildly informative, and trusted by 600,000+ readers at Google, Meta, Microsoft, and beyond.


