Doctor vs. ChatGPT face-off showed the promise and blind spots of generative AI

Generative AI tools built on large language models are already helping doctors transcribe visits and summarize patient records. The technology behind ChatGPT, trained on vast amounts of data from the internet, made headlines when it correctly answered more than 80% of board exam questions. In July, a team at Beth Israel saw promising results when using GPT-4 during a diagnosis workshop for medical residents.

But the tool is by no means ready for prime time. When Stanford researchers posed questions about real-life medical scenarios, GPT frequently disagreed with human clinicians or offered irrelevant information. The models are prone to "hallucinating," or making things up because they sound plausible, a tendency that could cause immeasurable harm if let loose on patients. AI leaders and policymakers alike have called for more regulation.


STAT decided to put ChatGPT to the test at its annual summit on Wednesday, pitting the tool against Ann Woolley, an infectious disease specialist at Brigham and Women's Hospital. Marc Succi, the innovation chair at Mass General, fed two patient scenarios to the tool while Woolley explained her own diagnoses.

In the first case, a 64-year-old man arrives at the hospital with fever, dizziness, headaches, and limb soreness. Woolley laid out her thinking, arriving ultimately at a diagnosis of Covid-19. ChatGPT, on the other hand, was far more general in its answers: the man could have a viral illness or a bacterial infection. The bot also recommended ordering a whole host of tests, including blood cultures and a head CT scan.

“The rest of this stuff is, if you had unlimited money and time, what would you order?” Succi said, referring to the many expensive tests GPT listed.


GPT never explicitly named Covid-19 as the diagnosis, a blind spot that an audience member pointed out. Instead, the tool described the condition as pneumonia, without recommending another Covid PCR test.

The next scenario was more complicated, involving a man with respiratory failure, fever, and a previous fungal infection. GPT mirrored many of Woolley’s diagnostic suggestions, and landed on the same primary diagnosis: an invasive fungal infection. Succi noted that the tool performs better when given more information in the prompt.

“This, compared to the first case, is much more on point,” Woolley said. “This one is a lot more aligned with the complexity of the patient.”

Woolley said that GPT's recommendations could be helpful for a clinician who is able to validate the tool's output and recognize its errors. They would be less helpful for a patient going in blind.

“As a clinician, this could be helpful because you have enough knowledge to say this doesn’t make sense,” Woolley said. “This isn’t what I want.”

In addition to the risk of wrong diagnoses, GPT's recommendations are often unrealistic. The bot lives in a world where cost is no object and doctors can order as many tests as they wish.

“It totally doesn’t care about health care costs, by the way,” Succi said. “This would be a nice world to live in.”

The experiment showed the tool's potential, but it also underscored the importance of keeping a human in the loop.