As ChatGPT and similar large language model (LLM) tools gather and organize data into their responses, it’s important to understand how the data they use has been curated. Medical science is constantly evolving, with previously accepted conclusions and standards being supplanted by newer information. Yet the outdated and sometimes disproven body of literature often remains quite large.
For example, extracranial to intracranial (EC/IC) bypass for cerebral ischemia due to cerebrovascular stenosis had a decade of literature demonstrating its merits, right up until it was disproven by a well-designed definitive study. As expected, once the benefits of EC/IC bypass were shown to be limited, few further studies were done.
Yet, in terms of sheer volume, the older, disproven information still substantially outweighs the newer, more accurate data. Further confusing the issue, vast amounts of information are created every year based on biased and fraudulent “studies” driven by financial gain and other motivations.
Given how current LLMs are being trained, there’s a significant risk these tools could ingest and homogenize old, new, and biased data alike. As a result, an LLM could offer the clinical end-user inaccurate information and conclusions.
With the rise of generative models, there’s growing concern that the current regulatory frameworks for ensuring AI is safe and effective are inadequate. As more use cases emerge in which clinicians turn to AI at or near the moment of clinical decision-making, many in the industry believe we don’t have the right tools to ensure that LLMs are aligned with the patient’s best interests.
With these risks in mind, how can we curate medical information to address model training risk and help algorithms avoid simply averaging all information? And can we avoid having LLMs create “hallucinations” where conclusions are fabricated?
Here are three approaches to consider:
- Periodically retrain models on professionally curated data that reflect the most recent and best thinking in the field. This includes eliminating outdated and/or disproven information from the data set. Doing this requires access to the underlying base model, which for most leading LLMs currently isn’t open source. (A minimal sketch of this kind of curation follows this list.)
- Connect an LLM to auxiliary private data and information sources. This can be done with retrieval-augmented LLMs over private data, or through “plugins” to public data/knowledge bases such as Wolfram Alpha for ChatGPT. Such approaches would not only improve the information LLMs provide but also help reduce their hallucinations. (See the retrieval sketch after this list.)
- Establish best practices in generative model development, implementation, and governance over time. Regulatory bodies must make a concerted effort to secure greater transparency in model development, independently monitor model performance over time, and develop governance frameworks to ensure appropriate life cycle management. These regulatory bodies should also support the development of a clear set of evaluation metrics and methods to assess the safety and effectiveness of generative models, as well as best practices for building strong guardrails for these models moving forward.
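As a rough illustration of the first approach, here is a minimal sketch of date- and status-based data curation before retraining. The record fields (`published`, `disproven`) and the `curate` helper are hypothetical; real curation would rest on expert review by professional organizations, not metadata filters alone.

```python
from datetime import date

# Hypothetical record format: each entry carries a claim plus curation
# metadata (publication date, whether later evidence disproved it).
corpus = [
    {"text": "EC/IC bypass benefits patients with cerebrovascular stenosis.",
     "published": date(1977, 6, 1), "disproven": True},
    {"text": "EC/IC bypass shows no benefit over medical therapy for most "
             "patients with cerebrovascular stenosis.",
     "published": date(1985, 11, 7), "disproven": False},
]

def curate(records, cutoff):
    """Keep only records published after the cutoff and never disproven."""
    return [r for r in records
            if not r["disproven"] and r["published"] >= cutoff]

training_set = curate(corpus, cutoff=date(1985, 1, 1))
print(f"{len(training_set)} of {len(corpus)} records retained for fine-tuning")
```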
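And as a sketch of the second approach, the snippet below shows retrieval augmentation in its simplest form: retrieve the most relevant curated passages, then instruct the model to answer only from them. The keyword-overlap scorer, the example passages, and the `build_prompt` helper are illustrative stand-ins for a production embedding-based retriever and whichever LLM API is in use.

```python
def score(query: str, passage: str) -> int:
    """Crude relevance score: number of word tokens shared with the query."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most relevant to the query."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Ground the answer in retrieved, curated sources rather than
    whatever the model absorbed during training."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages))
    return ("Answer using ONLY the sources below. If they are "
            "insufficient, say so rather than guessing.\n"
            f"Sources:\n{context}\n\nQuestion: {query}")

curated = [
    "EC/IC bypass shows no benefit over medical therapy for most patients.",
    "Aspirin reduces recurrent stroke risk in selected patients.",
]
prompt = build_prompt(
    "Is EC/IC bypass indicated for cerebrovascular stenosis?", curated)
# `prompt` would then be sent to whichever LLM is in use; answering from
# vetted sources is what reduces fabricated ("hallucinated") conclusions.
print(prompt)
```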
No doubt other approaches can and will be considered. But regardless of approach, the greatest challenge will be obtaining the curated data sets for LLM training. To do this, professional healthcare organizations must curate medical information that can be utilized for training and/or fine-tuning LLMs and produce the summary information essential to improving healthcare diagnosis, treatment, and prevention.
As AI becomes increasingly integrated into healthcare, AI assurance and regulatory frameworks will be essential to ensure these tools provide safe, reliable, and equitable information. This will both advance healthcare services and avoid unintended consequences that could harm patients and the future of the healthcare ecosystem. The challenge, of course, is to do this without hindering innovation.
Public/private collaborations, such as the Coalition for Health AI (CHAI), are well positioned to achieve this difficult balancing act. Together, they can establish best practices, develop sandboxes for model and regulation testing, and produce other guidelines. In April, CHAI published its blueprint for using AI in healthcare, which is closely related to the White House’s proposed Blueprint for an AI Bill of Rights. These frameworks are voluntary for the industry, at least for now.
Despite the uncertainty, concerns, and controversy, there’s no doubt that AI and its LLM offspring will continue to evolve and shape the future of healthcare. It’s time to create the proper controls to guide this development in ways that will optimize the potential benefit and avoid the pitfalls that have accompanied new technologies from atomic fission to the internet.
About Dr. Brian Anderson
Dr. Brian Anderson is the chief digital health physician at MITRE, where he leads research and development efforts across major strategic initiatives in digital health alongside industry and the U.S. government. He is a cofounder of the Coalition for Health AI.
About Dr. Stephen Ondra
Dr. Stephen L. Ondra is the chief medical adviser for MITRE’s work as operator of the CMS Alliance to Modernize Healthcare Federally Funded Research and Development Center (Health FFRDC).