Google reports that AI chatbots are only 69% accurate, at most.
AI chatbots still give inaccurate answers roughly one-third of the time
Google has released a straightforward evaluation regarding the reliability of current AI chatbots, and the findings are not particularly favorable. Utilizing its newly developed FACTS Benchmark Suite, the company discovered that even the leading AI models find it challenging to surpass a factual accuracy rate of 70%. The highest scorer, Gemini 3 Pro, achieved an overall accuracy of 69%, while other top models from OpenAI, Anthropic, and xAI recorded even lower results. The key takeaway is clear and somewhat troubling: these chatbots still deliver incorrect answers approximately one in three times, even when they respond with apparent confidence.
This benchmark is significant because most existing AI assessments primarily concentrate on whether a model can perform a task, rather than verifying the accuracy of the information it generates. In sectors like finance, healthcare, and law, this discrepancy can be costly. A fluent and confident-sounding response that contains inaccuracies can have serious repercussions, especially when users assume the chatbot is knowledgeable.
Insights from Google’s accuracy evaluation
The FACTS Benchmark Suite was developed by Google’s FACTS team in collaboration with Kaggle to systematically assess factual accuracy across four practical applications. One test evaluates parametric knowledge, verifying if a model can answer fact-based questions using only information acquired during training. Another assesses search performance to determine how effectively models utilize web tools to access accurate data. A third test focuses on grounding, assessing whether the model adheres to a given document without introducing false information. The final evaluation scrutinizes multimodal understanding, which entails accurately interpreting charts, diagrams, and images.
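Google has not published the scoring code behind these tests, but the idea of the grounding check can be sketched in a few lines of Python. The sketch below is a hypothetical illustration, not the FACTS suite's actual implementation: the GroundingCase class, the claim_supported helper, and the naive substring "judge" are all stand-ins for a real grader.

```python
# Toy illustration of a grounding-style factuality check, loosely in the
# spirit of the FACTS suite's grounding test. The "judge" here is a naive
# substring check standing in for a real grader.
from dataclasses import dataclass, field

@dataclass
class GroundingCase:
    document: str   # source text the model must stay faithful to
    answer: str     # the model's response
    claims: list[str] = field(default_factory=list)  # atomic claims extracted from the answer

def claim_supported(document: str, claim: str) -> bool:
    """Naive stand-in for a grader: a claim counts as grounded only if it
    appears verbatim (case-insensitively) in the source document."""
    return claim.lower() in document.lower()

def grounding_score(cases: list[GroundingCase]) -> float:
    """Fraction of responses whose every claim is supported by its document."""
    fully_grounded = sum(
        all(claim_supported(c.document, claim) for claim in c.claims)
        for c in cases
    )
    return fully_grounded / len(cases)

cases = [
    GroundingCase(
        document="Revenue grew 12% in Q3, driven by cloud services.",
        answer="Revenue grew 12% in Q3.",
        claims=["revenue grew 12% in q3"],
    ),
    GroundingCase(
        document="Revenue grew 12% in Q3, driven by cloud services.",
        answer="Revenue grew 20% in Q3.",
        claims=["revenue grew 20% in q3"],  # unsupported: figure not in the document
    ),
]

print(f"Grounding score: {grounding_score(cases):.0%}")  # -> Grounding score: 50%
```

A real benchmark would use a far more robust judge than substring matching, but the aggregation idea is the same: the score is the share of responses whose claims all stay within the source document.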
The findings show significant variation among models. Gemini 3 Pro topped the rankings with a FACTS score of 69%, with Gemini 2.5 Pro and OpenAI’s ChatGPT-5 close behind at nearly 62%. Grok 4 achieved roughly 54%, and Claude 4.5 Opus scored around 51%. The weakest performances came on multimodal tasks, which often fell below 50% accuracy. That is particularly concerning because these tasks involve interpreting charts, diagrams, or images: a chatbot may confidently misread a sales graph or pull the wrong figure from a document, producing mistakes that are easy to overlook and hard to rectify.
The conclusion is not that chatbots lack utility, but rather that blind trust in them is hazardous. Data from Google indicates that while AI is progressing, it still requires verification, safeguards, and human oversight before being considered a truly reliable source of accurate information.
