ChatGPT, Gemini, and other AI bots give poor medical advice about half the time.
People are increasingly using AI chatbots the way they use search engines for everyday health questions. That trend looks more concerning in light of a recent study published in BMJ Open, which found that half of the answers from five prominent bots were problematic, even when the replies sounded polished and confident.
Researchers assessed ChatGPT, Gemini, Grok, Meta AI, and DeepSeek using 250 prompts relating to cancer, vaccines, stem cells, nutrition, and athletic performance.
The prompts were designed to reflect frequent health concerns and common misinformation topics, measuring whether the bots adhered to scientific evidence or veered into misleading and possibly dangerous advice.
Broad inquiries revealed significant deficiencies
The most significant shortcomings arose from open-ended prompts. These broader questions yielded far more problematic responses than anticipated, while closed, structured prompts tended to produce safer answers.
This is significant because people typically do not pose medical questions in a structured, multiple-choice format. Instead, they inquire about the effectiveness of treatments, the safety of vaccines, or ways to enhance athletic performance.
In the study, such prompts led the bots to provide answers that combined solid evidence with less reliable or misleading claims.
High confidence, poor sourcing
The issues extended beyond the answers themselves. Reference quality was poor, with an average completeness score of only 40%, and none of the chatbots produced a fully accurate reference list.
This undermines one of the key reasons for trusting chatbot responses. A reply may appear well-sourced and authoritative but may not hold up once the citations are scrutinized.
The researchers also pointed out instances of fabricated references, even as the bots answered with certainty and provided minimal disclaimers.
Implications beyond a single study
The findings have limitations: the study covered just five chatbots, these products are evolving rapidly, and the prompts were deliberately designed to challenge the models, which may overstate how often poor answers occur in everyday use.
Nonetheless, the primary takeaway is difficult to ignore. These systems were evaluated on evidence-based medical subjects, and yet half of the responses still ventured into flawed or incomplete territory.
At this point, chatbots might assist in summarizing information or shaping follow-up questions, but they currently lack the reliability needed for significant medical decision-making.
