A large language model (LLM)-based chatbot gave the wrong diagnosis for the majority of pediatric cases, researchers found.
ChatGPT version 3.5 reached an incorrect diagnosis in 83 of 100 pediatric case challenges. Of those 83, 72 were outright incorrect and 11 were clinically related to the correct diagnosis but too broad to be considered correct, reported Joseph Barile, BA, of Cohen Children’s Medical Center in New Hyde Park, New York, and colleagues in JAMA Pediatrics.
For example, ChatGPT got it wrong in a case of rash and arthralgias in a teenager with autism. The physician diagnosis was “scurvy,” and the chatbot diagnosis was “immune thrombocytopenic purpura.”
One case in which the chatbot diagnosis was judged to not fully capture the diagnosis involved a draining papule on the lateral neck of an infant. The physician diagnosis was “branchio-oto-renal syndrome,” and the chatbot diagnosis was “branchial cleft cyst.”
“Despite the high error rate of the chatbot, physicians should continue to investigate the applications of LLMs to medicine,” Barile and colleagues wrote. “LLMs and chatbots have potential as an administrative tool for physicians, demonstrating proficiency in writing research articles and generating patient instructions.”
They reported a representative example of a correct diagnosis, the case of a 15-year-old girl with unexplained intracranial hypertension. The physician diagnosis was “primary adrenal insufficiency (Addison disease),” and the chatbot diagnosis was “adrenal insufficiency (Addison disease).”
A prior study had found that a chatbot rendered a correct diagnosis in 39% of cases, suggesting that LLM-based chatbots “could be used as a supplementary tool for clinicians in diagnosing and developing a differential list for complex cases,” Barile and colleagues wrote. “To our knowledge, no research has explored the accuracy of LLM-based chatbots in solely pediatric scenarios, which require the consideration of the patient’s age alongside symptoms.”
Overall, “the underwhelming diagnostic performance of the chatbot observed in this study underscores the invaluable role that clinical experience holds,” the authors wrote. “The chatbot evaluated in this study — unlike physicians — was not able to identify some relationships, such as that between autism and vitamin deficiencies.”
“LLMs do not discriminate between reliable and unreliable information but simply regurgitate text from the training data to generate a response,” Barile and colleagues noted. Some also lack real-time access to medical information, they added.
More selective training is likely needed to improve chatbots’ diagnostic accuracy, they suggested.
For the study, Barile and colleagues drew pediatric case challenges from JAMA Pediatrics and the New England Journal of Medicine. Text from 100 cases was pasted into ChatGPT version 3.5 with the following prompt: “List a differential diagnosis and a final diagnosis.”
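The researchers did this by hand in the ChatGPT interface. Purely as an illustrative sketch, a similar query could be scripted against the OpenAI API; the model name, client setup, and placeholder case text below are assumptions for illustration, not part of the study’s methods.

    # Illustrative only: scripting the study's prompt via the OpenAI Python client.
    # The authors pasted case text into the ChatGPT (version 3.5) web interface by hand;
    # the model name and placeholder case text here are assumptions.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    case_text = "..."  # full pediatric case challenge text would go here

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": case_text + "\n\nList a differential diagnosis and a final diagnosis.",
        }],
    )

    print(response.choices[0].message.content)

Scoring the generated diagnoses against the published case answers, as described next, would remain a manual step performed by clinicians.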
Two physician researchers scored the chatbot-generated diagnosis as “correct,” “incorrect,” or “did not fully capture diagnosis.”
More than half of the incorrect diagnoses generated by the chatbot did belong to the same organ system as the correct diagnosis, Barile and colleagues noted. Additionally, 36% of the final case report diagnoses were included in the chatbot-generated differential list.
Jennifer Henderson joined MedPage Today as an enterprise and investigative writer in Jan. 2021. She has covered the healthcare industry in NYC, life sciences and the business of law, among other areas.
Disclosures
The authors reported no conflicts of interest.
Primary Source
JAMA Pediatrics
Source Reference: Barile J, et al “Diagnostic Accuracy of a Large Language Model in Pediatric Case Studies” JAMA Pediatr 2024; DOI: 10.1001/jamapediatrics.2023.5750.