It's not just a figment of your imagination — ChatGPT models indeed hallucinate more nowadays.

Internal tests indicate that newer models such as o3 and o4-mini exhibit a considerably higher rate of hallucination compared to older versions, and OpenAI is uncertain about the reasons behind this.

      OpenAI published a paper last week outlining various internal assessments and discoveries regarding its o3 and o4-mini models. The key distinctions between these newer models and the original versions of ChatGPT introduced in 2023 are their enhanced reasoning abilities and multimodal features. The o3 and o4-mini models can produce images, browse the internet, perform tasks automatically, recall previous conversations, and tackle intricate problems. However, these advancements have seemingly led to unforeseen consequences.

      What do the tests reveal?

      OpenAI uses a specific test called PersonQA to gauge hallucination rates. This test consists of a collection of facts about individuals that the model can "learn" from, followed by questions related to those individuals. The model’s accuracy is evaluated based on its response attempts. The o1 model from last year achieved an accuracy of 47% and a hallucination rate of 16%.

      Since these figures do not add up to 100%, we can infer that the remaining responses were neither accurate nor hallucinations. The model may occasionally indicate that it doesn’t know or cannot find the required information, may choose not to make any assertions at all and provide relevant information instead, or could make a minor error that isn’t deemed a full-blown hallucination.
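
      To make that arithmetic concrete, here is a minimal sketch in Python (not OpenAI's actual grading harness) of how a PersonQA-style benchmark can report accuracy and hallucination figures that do not sum to 100%: every graded answer falls into one of three buckets, and the third bucket of abstentions and partial answers accounts for the remainder. The grade labels are illustrative assumptions; the percentages are the o1 figures quoted above.

```python
from collections import Counter

# Hypothetical grades for 100 answers; the labels are illustrative, not OpenAI's.
grades = ["correct"] * 47 + ["hallucinated"] * 16 + ["other"] * 37

counts = Counter(grades)
total = len(grades)

accuracy = counts["correct"] / total            # 0.47 -- o1's reported accuracy
hallucination = counts["hallucinated"] / total  # 0.16 -- o1's reported hallucination rate
other = counts["other"] / total                 # 0.37 -- abstentions, partial or hedged answers

print(f"accuracy: {accuracy:.0%}, hallucination: {hallucination:.0%}, other: {other:.0%}")
```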

      When o3 and o4-mini were evaluated against this standard, their hallucination rates were considerably higher than that of o1. OpenAI noted that this was somewhat anticipated for the o4-mini model due to its smaller size and limited world knowledge, leading to increased hallucinations. Nonetheless, the 48% hallucination rate recorded by o4-mini appears excessively high when considering it is a commercially available product used for various information and advice searches.

      On the other hand, o3, the full-sized model, exhibited hallucinations in 33% of its responses during the test, which was an improvement compared to o4-mini but still twice the hallucination rate of o1. Despite this, it maintained a high accuracy rate, which OpenAI attributes to its tendency to make more assertions overall. Therefore, if you’ve utilized either of these newer models and observed numerous hallucinations, it’s not merely your imagination. (Perhaps I should make a joke like, “Don’t worry, you’re not the one hallucinating.”)

      What are AI “hallucinations” and why do they occur?

      While you may have encountered the term "hallucinations" in reference to AI models, its precise meaning can be unclear. Whenever you interact with an AI system, whether from OpenAI or another source, there’s a high likelihood you’ll encounter a disclaimer stating its responses may be inaccurate, emphasizing the need for personal verification.

      Inaccurate information can arise from various sources: occasionally an erroneous fact makes its way onto Wikipedia, or users share incorrect claims on platforms like Reddit, and that misinformation can find its way into an AI's responses. For instance, in one recent incident, Google's AI suggested adding "non-toxic glue" to a pizza recipe, advice that was ultimately traced back to a joke on Reddit.

      However, these instances are not considered "hallucinations"; they are more accurately described as traceable errors stemming from faulty data or misinterpretation. Hallucinations occur when an AI model generates claims without any clear source or justification. This phenomenon often arises when the model cannot locate the necessary information to respond to a specific question, and OpenAI has characterized it as "a tendency to invent facts in moments of uncertainty." Other industry experts refer to it as "creative gap-filling."

      Hallucinations can be prompted by asking leading questions such as, "What are the seven iPhone 16 models available right now?" Since there aren’t seven models, the model is likely to provide some accurate information along with fabricated models to complete the response.

      I decided to test this with GPT-4o to see if it would follow suit — and it did!
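
      If you want to try the same leading-question experiment yourself, a rough sketch using OpenAI's Python SDK is below. It assumes the openai package is installed and an OPENAI_API_KEY environment variable is set; the prompt is the example from above, and gpt-4o is used simply because that is the model tested here.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

leading_prompt = "What are the seven iPhone 16 models available right now?"

response = client.chat.completions.create(
    model="gpt-4o",  # swap in another model (e.g. o3 or o4-mini) if you have access
    messages=[{"role": "user", "content": leading_prompt}],
)

# If the model hallucinates, the answer will pad out the real iPhone 16 lineup
# with invented model names rather than pushing back on the premise.
print(response.choices[0].message.content)
```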

      Chatbots like ChatGPT are trained not only on internet data, which shapes the content of their responses, but also on how to respond. They are presented with thousands of example queries paired with ideal answers to promote the desired tone, attitude, and level of politeness.

      This aspect of training can cause an LLM to appear to align with your views or comprehend your statements, even when the rest of its output contradicts those claims. It’s plausible that this training contributes to the frequency of hallucinations — since a confident response that addresses the query is reinforced as a more desirable outcome compared to an answer that fails to respond adequately.
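
      To illustrate that incentive, here is a deliberately oversimplified toy in Python. It is not how OpenAI scores or trains its models; it only shows how a naive preference signal that rewards direct, confident answers would rank a fabricated claim above an honest admission of uncertainty.

```python
def toy_reward(answer: str) -> int:
    """Naive preference score that favors direct, confident-sounding answers."""
    score = 0
    if "don't know" not in answer.lower():
        score += 1  # abstaining never earns this point
    if answer.rstrip().endswith("."):
        score += 1  # complete, assertive sentences earn a point
    return score

confident_fabrication = "The seventh model is the iPhone 16 Ultra Mini."  # invented
honest_abstention = "I don't know of seven iPhone 16 models."

print(toy_reward(confident_fabrication))  # 2: the fabrication scores higher
print(toy_reward(honest_abstention))      # 1: honesty scores lower under this signal
```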

      To us, it is evident that generating random falsehoods is worse than simply not knowing the answer — yet LLMs do not “lie.” They do not even comprehend what lying entails. Some argue that AI errors mirror human mistakes, and since "we don't always get it right, we shouldn't expect AI to either." Nonetheless, it’s crucial to understand that AI inaccuracies result from imperfect systems we have developed.

      AI models do not lie; they form misunderstandings
