GPT-4 language model gives limited-quality medical answers in study

Dive Brief:

  • The successor to ChatGPT delivers answers that agree with health guideline recommendations 60% of the time but gives low- to moderate-quality information, according to a study published in the Journal of Medical Internet Research.
  • Researchers asked the text-generating artificial intelligence program 25 questions on guideline recommendations for five conditions. GPT-4's responses matched 15 of the 25 recommendations.
  • Based on the analysis, the researchers concluded that the AI technology provides medical information of similar quality to what is available online and that responses could improve if training datasets were limited to peer-reviewed studies.

Dive Insight:

The ability of generative AI models such as ChatGPT and its successor GPT-4 to interpret and respond to questions is driving interest in using the technologies to provide medical information and help diagnose disease. 

To understand the strengths and limitations of the models, researchers asked GPT-4 25 questions based on the guidelines for five diseases — gallstone disease, pancreatitis, liver cirrhosis, pancreatic cancer and hepatocellular carcinoma — and assessed the answers it provided using a tool designed to measure the quality of information available online.
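The paper does not publish its prompting code, but the basic setup is straightforward to reproduce. The following is a minimal sketch, assuming the OpenAI Python SDK; the question wording, model name and parameters are illustrative assumptions rather than the study's protocol, and scoring with the EQIP tool would still be done by human reviewers.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical guideline-derived questions; the study's exact wording is not reproduced here.
    questions = [
        "What is the recommended first-line treatment for symptomatic gallstone disease?",
        "When is radiotherapy indicated in the management of pancreatic cancer?",
    ]

    for question in questions:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
            temperature=0,  # deterministic output makes answers easier to score consistently
        )
        answer = response.choices[0].message.content
        print(f"Q: {question}\nA: {answer}\n")

Each answer would then be rated against the relevant guideline recommendation and scored with a quality instrument such as EQIP.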

The analysis showed the AI provides information of similar quality to what is available online. It achieved an Ensuring Quality Information for Patients (EQIP) score of 16 for gallstone disease, compared with a median of 15 in studies of online information. The researchers attribute the similarity to the fact that GPT-4 is trained on text drawn from the internet.

One limitation of the model is its failure to highlight medical advice that is contested. The researchers found the AI listed surgery, chemotherapy and radiotherapy as treatments for pancreatic cancer in a way that suggested equivalence between the interventions and failed to explain the sequencing of care. The role of radiotherapy is limited and a subject of debate, nuances that the model was unable to express. 

“The AI does not inform its user which medical information is controversial, which information is clearly evidence based and backed by high-quality studies, and even which information represents a standard of care,” the authors wrote. 

In light of those limitations, the researchers propose limiting the medical information used by models such as GPT-4 to peer-reviewed studies and adding a bibliography feature so users can read the papers that underpin answers. If the technology is refined in that way, the authors think, “chatbots might even replace guidelines, as clinicians will be able to rapidly obtain information and guidance, eliminating the need to find, download, and read large documents.”