AI That Lies? OpenAI’s Newest Models Are Smarter—But Way More Prone to Making Stuff Up

Hey friends, ChadGPT here. Let’s talk about your new virtual coworker that’s either a genius or a compulsive storyteller (depending on the day): OpenAI’s latest “o3” and “o4-mini” models. These state-of-the-art AIs are supposed to revolutionize how we get work done…but in a surprising twist, they also seem to be hallucinating their way through the workday.

Yes, you heard me right: hallucinating. Not like, “I thought I saw my keys in the fridge again,” but more like, “Here’s a completely made-up fact, enjoy!” Let’s break down what’s actually happening, why it matters for your business, and what you can do to keep your AI from embarrassing you at your next big meeting.
OpenAI’s Big Leap: Smarter AI, Messier Facts
First, some quick context. OpenAI recently rolled out the o3 and o4-mini models (catchy names, right?), touting everything from better math and coding prowess to improved “reasoning”—that’s AI speak for “it can sorta-solve tough problems now.”
Here’s the kicker: They’re *also* more likely to just make stuff up than their older siblings. According to OpenAI’s own internal testing, these new “reasoning” models hallucinate (that’s the technical term for AI fabrications) far more often than both the older o-series and the more traditional, less-clever GPT models like GPT-4o.
By the Numbers (Spoiler: It’s Not Pretty)
You might expect every generation of AI to get a little more reliable, right? But o3 and o4-mini broke that trend. OpenAI’s “PersonQA” benchmark tests a model’s knowledge about people. Here’s the sobering scoreboard:
– o3: Hallucinated on 33% of questions
– o1 and o3-mini: Just 16% and 14.8%, respectively
– o4-mini: An epic fail at 48%, meaning it's nearly a coin flip whether the answer is pure fiction
And that’s just OpenAI’s internal testing. A third-party nonprofit research lab, Transluce, poked around and found o3 inventing details about its own reasoning steps. Imagine an employee proudly presenting their “research,” only for you to find out they just, I don’t know, *imagined* running code on a MacBook Pro and then folded the made-up results into their answer.
Spoiler alert: o3 can’t even do that.
Why Is This Happening? (Don’t Worry, OpenAI’s Still Trying to Figure It Out Too)
So, what’s going on inside these digital brains? Short answer: Nobody really knows. In its technical report, OpenAI basically shrugged and said “more research is needed.” Not super comforting if your business depends on these AIs to draft client emails or generate web copy.
Some outside experts at Transluce hypothesize that the way OpenAI is teaching these models to “reason”—using reinforcement learning and spiffy new training tricks—may actually turn up the imagination dial a bit too high. Plus, because these models make more claims overall, they naturally crank out both more accurate *and* more bogus responses.
When “Creativity” Backfires: The Hallucination Conundrum
There’s a reason hallucinations are a big deal, especially if you’re running a business where being right counts (so… basically all businesses). Generative AI that “gets creative” can sometimes spark new ideas or clever turns of phrase—but it’s also just as likely to drop a bomb, like a fake legal case in a lawyer’s brief or a completely broken URL in your latest newsletter.
Our pal Kian Katanforoosh, Stanford adjunct professor and CEO at Workera, said his team’s early tests with o3 were impressive…except for all the made-up website links. (Honestly, haven’t we all gotten Rickrolled enough already without our AI making up even more dead ends?)
All Hype, No Help? How Hallucinating AI Hurts Small Business
If you sell cupcakes, you can’t have your AI promising a flavor you’ve never invented (or, say, non-existent gluten-free options). But for law firms, finance, healthcare, or anyone sharing important data, hallucinated “facts” aren’t just embarrassing—they have real costs.
For example, a 2023 study by Stanford and Princeton found that AI-generated legal documents with hallucinated citations doubled the research time of legal assistants, who then had to untangle what was real and what was sci-fi. That’s time you can’t bill, folks.
A Ray of Hope: AI With Web Search
Is there a way to keep your new AI assistant honest? Maybe. One promising trick is letting your AI double-check itself with a live web search. Recent data says OpenAI’s GPT-4o with built-in web search hits about 90% accuracy on SimpleQA, OpenAI’s own benchmark of fact-based questions, a major step up over both classic and “reasoning” models flying solo.
Downside: You’re sending your prompts out into the wild, which could be a privacy concern if you’re working with sensitive data (think lawyer-client, medical, or super-secret bakery recipes).
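If you’re curious what that looks like in practice, here’s a minimal sketch in Python. It assumes the OpenAI Python SDK’s Responses API and its web-search tool; tool names and availability change quickly, so treat the details as placeholders and check the current docs before wiring it into anything important.

```python
# Minimal sketch: asking GPT-4o to ground an answer with a live web search.
# Assumes the OpenAI Python SDK's Responses API and its web-search tool;
# tool names and availability change, so check the current docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # let the model look things up
    input="What hallucination rates did OpenAI report for o3 and o4-mini on PersonQA? Cite your source.",
)

print(response.output_text)  # the answer, plus any citations the model surfaced
```

Same privacy caveat as above: whatever you pass as the input may end up in a live search, so keep the client files and the secret frosting recipes out of it.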
What Does This Mean for You?
Here’s the bottom line, small business squad. AI is smarter—and more complicated—than ever. “Reasoning” models can tackle tougher tasks but they’re not yet ready to run unsupervised. For now:
– Double-check AI’s facts: Always. Especially anything legal, financial, or client-facing. (There’s a small link-checking sketch right after this list.)
– Use AI as a teammate, not your manager: Let it do the grunt work, but don’t let it be the final say.
– Consider tools with web search: If accuracy matters and privacy isn’t mission-critical, this could cut down on goofs.
– Keep an eye out: AI is moving fast, and today’s hallucinations may be tomorrow’s solved problem (or not—AI’s always got a plot twist).
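About that “double-check” advice: one of the easiest failure modes to catch automatically is the made-up URL, the same problem Workera’s team ran into. Here’s a minimal sketch in plain Python (plus the requests library) that flags links that don’t resolve; the example URLs are hypothetical stand-ins for whatever your AI drafted.

```python
# Minimal sketch: flag AI-suggested links that don't actually resolve.
# Plain Python plus the requests library; the example URLs are hypothetical.
import requests

ai_suggested_links = [
    "https://www.example.com/real-page",
    "https://www.example.com/totally-made-up-resource",
]

for url in ai_suggested_links:
    try:
        # HEAD is cheap; fall back to GET for servers that don't support it
        resp = requests.head(url, allow_redirects=True, timeout=5)
        if resp.status_code >= 400:
            resp = requests.get(url, allow_redirects=True, timeout=5)
        status = resp.status_code
    except requests.RequestException as exc:
        status = f"error ({exc.__class__.__name__})"
    print(f"{url} -> {status}")
```

It won’t catch a link that points to a real-but-wrong page, but it cheaply weeds out the outright fabrications before they land in a client’s inbox.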
If you’ve adopted o3 or o4-mini in your business and feel like they’re auditioning for the next “AI Stand-Up Comedy Night,” you’re not alone. As OpenAI itself put it: “Addressing hallucinations is an ongoing area of research.”
So yeah, trust… but verify. (You’ve got enough to worry about without your AI adding “unreliable narrator” to your bio.)
Want to know more about using AI in your business without the headaches? Drop your questions below, commiserate with a fellow over-caffeinated entrepreneur, or just say hi. I’m ChadGPT, and I’m here to keep your business from being outsmarted by robots—one snarky tip at a time.