A growing number of AI tools are being used to predict everything from sepsis to strokes, with the hope of accelerating the delivery of life-saving care. But over time, new research suggests, these predictive models can become victims of their own success, sending their performance into a nosedive and generating inaccurate, potentially harmful results.
“There is no accounting for this when your models are being tested,” said Akhil Vaid, an instructor of data-driven and digital medicine at the Icahn School of Medicine at Mount Sinai and author of the new research, published Monday in the Annals of Internal Medicine. “You can’t run validation studies, do external validation, run clinical trials — because all they’ll tell you is that the model works. And when it starts to work, that is when the problems will arise.”
Vaid and his Mount Sinai colleagues simulated the deployment of two models that predicted a patient’s risk of dying and of acute kidney injury within five days of entering the ICU. Their simulations assumed that the models did what they were supposed to do: lower deaths and kidney injuries by identifying patients for earlier intervention.
But when patients started faring better, the models became far less accurate at predicting the likelihood of kidney injury and death. And neither retraining the models nor other methods of stopping the decay helped.
The findings serve as a cautionary note at a time when few health systems track the performance of their AI models after deployment. They also raise questions about what such degradation means for patient outcomes, especially in settings where multiple deployed AI systems could be eroding each other’s performance.
Last year, an investigation from STAT and the Massachusetts Institute of Technology captured how model performance can degrade over time by testing three predictive algorithms. Over the course of a decade, their accuracy at predicting sepsis, length of hospitalization, and mortality varied significantly. The culprit? A combination of clinical changes, including new medical coding standards at the hospital, and an influx of patients from new communities.
When models fail like this, it’s due to a problem called data drift. “There’s been lots of conversation about how the input data may change over time and have an unexpected output,” said Matthew Robinson, an infectious disease and health informatics researcher at Johns Hopkins University School of Medicine who co-authored an editorial on the Mount Sinai research.
The new study identified a different, counterintuitive problem that can hobble predictive models’ performance over time. Successful predictive models create a feedback loop: As the AI helps drive interventions to keep patients healthier, electronic health records within a system may start to reflect lower rates of kidney injury or mortality — the same data that other predictive models are applied to, and that are used to retrain models over time.
“As long as your data is getting polluted or corrupted by the output of the model, then you have a problem,” said Vaid.
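To make the loop concrete, here is a toy sketch in Python (illustrative only, and not the study’s code) of how a model’s own alerts can reshape the outcomes it is later retrained on. The synthetic cohort, the 60% effectiveness of the intervention, and the alert threshold are all invented for the example.

```python
# Toy illustration (not the study's code) of the feedback loop described above:
# a model trained on historical outcomes triggers interventions that prevent
# some events, and the "improved" records are what the next generation sees.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate_cohort(n=20_000):
    """Synthetic ICU-style cohort: a single feature drives the true event risk."""
    x = rng.normal(size=(n, 1))
    p_event = 1 / (1 + np.exp(-(2.0 * x[:, 0] - 1.0)))   # true probability of the outcome
    return x, p_event

# Generation 0: train on historical data untouched by any model.
x0, p0 = simulate_cohort()
y0 = rng.binomial(1, p0)
model_gen0 = LogisticRegression().fit(x0, y0)

# Deployment: the model flags high-risk patients in a new cohort, and flagged
# patients receive an intervention that prevents (say) 60% of their events.
x1, p1 = simulate_cohort()
y1 = rng.binomial(1, p1)                                  # what would have happened
flagged = model_gen0.predict_proba(x1)[:, 1] > 0.5
prevented = flagged & (rng.random(len(x1)) < 0.6)
y1_recorded = np.where(prevented, 0, y1)                  # the chart now shows "no event"

# Generation 1: retrain on the post-deployment records, a common anti-drift step.
model_gen1 = LogisticRegression().fit(x1, y1_recorded)

# The retrained model underestimates risk for a clearly high-risk patient,
# because its training data was already shaped by the interventions.
high_risk_patient = np.array([[2.0]])
print("gen 0 risk:", round(model_gen0.predict_proba(high_risk_patient)[0, 1], 2))
print("gen 1 risk:", round(model_gen1.predict_proba(high_risk_patient)[0, 1], 2))
```

In runs of this toy setup, the second-generation model assigns a visibly lower risk to the same high-risk patient, which is the kind of underprediction the Mount Sinai team describes.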
The researchers demonstrated how the problem emerges in three scenarios, each reflecting how health systems commonly deploy AI today. First, they deployed the mortality prediction model on its own and retrained it on new patient data, a common strategy for avoiding data drift. Counterintuitively, retraining the model on data from patients it had already helped made it likely to underpredict mortality risk, and its specificity plummeted by as much as 39%.
“That’s huge,” said Vaid. “That means that once you retrain your model, it’s effectively useless.”
In the two other scenarios, the acute kidney injury predictor and the mortality predictor were used together. When the kidney model’s predictions helped patients avoid acute kidney injury, deaths also fell, so a mortality predictor later built on that data saw its specificity suffer. And when both models were deployed simultaneously, the changes in care that each encouraged rendered the other’s predictions ineffective.
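Specificity, the metric that suffers in these scenarios, is the share of patients who did not have the outcome whom the model correctly labels as low risk; sensitivity is its counterpart for patients who did. A minimal monitoring sketch, using invented numbers rather than the study’s data, shows how both can be tracked from one deployment period to the next:

```python
# Hypothetical monitoring snippet (not from the study): compute sensitivity and
# specificity of a model's alerts against recorded outcomes for each period.
import numpy as np
from sklearn.metrics import confusion_matrix

def sensitivity_specificity(y_true, y_alert):
    """y_true: recorded outcomes (1 = event); y_alert: model alerts (1 = high risk)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_alert, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# Invented outcomes and alerts from two deployment periods, for illustration only.
period_1 = (np.array([1, 0, 0, 1, 0, 0, 1, 0]), np.array([1, 0, 0, 1, 0, 1, 1, 0]))
period_2 = (np.array([0, 0, 0, 1, 0, 0, 1, 0]), np.array([1, 1, 0, 0, 1, 1, 1, 1]))

for name, (y_true, y_alert) in [("period 1", period_1), ("period 2", period_2)]:
    sens, spec = sensitivity_specificity(y_true, y_alert)
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```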
Vaid said he’s spoken with health systems that claim to have deployed 15 or 20 models simultaneously. “This is a recipe for something going horribly wrong,” he said. And the longer health systems use predictive models without accounting for this feedback loop of degraded performance, the less reliable they’ll become. “It’s like a ticking time bomb.”
“We’ve long recognized that successful implementations affecting patient outcomes and downstream feedback within EHR data will require new approaches to model updating,” Sharon Davis, a professor of biomedical informatics at Vanderbilt University Medical Center, wrote in an email to STAT. “The interactive effects of the sequential and simultaneous release of AI-based tools are another layer of complexity for model managers that will need innovative solutions.”
While many health systems are thinking critically about how to manage problems like data drift, nobody’s yet thought through how to manage the performance of so many models operating simultaneously and over successive generations of patient data that have been influenced by their use, said senior author Girish Nadkarni, system chief of Mount Sinai’s division of data-driven and digital medicine. “A bunch of models are being introduced without proper monitoring, proper testing, proper validation to the system, and all of them are interacting with each other and interacting with clinicians and patients.”
Adam Yala, an assistant professor of computational precision health at UC Berkeley and UCSF, commended the work for bringing the issue to the attention of the clinical community. “It’s a super underappreciated problem,” he said. “Our current best practices, model monitoring, our regulatory practices, the way the tools we have are built, none of them address this.”
The authors acknowledge that real-world performance degradation could look different from their simulations, which were based on 130,000 ICU admissions from Mount Sinai and Beth Israel Deaconess Medical Center. They had to make assumptions about how closely clinicians within a health system would follow the models’ recommendations, and about how effective clinical interventions would be at reducing kidney injuries and deaths.
“There’s always limitations because the interventions are simulated, but that’s not the point,” said Yala. “It’s to show that this is a real phenomenon and that nothing that we’re doing can address it, even in a simple toy setting.”
To catch models when their performance starts to suffer, health systems will have to proactively track metrics like specificity over time, yet many do not. “Institutions might receive funding or glory to create models, to deploy them, but there’s less excitement in the important work of seeing how they perform over time,” said Robinson.
And even if monitoring catches models when their performance falls off, the Mount Sinai research suggests it will be difficult to correct for this kind of data contamination, because retraining didn’t revive the models’ performance in the simulation. When health systems train new models or retrain old ones, they will need to make sure they’re using patient data that’s uncorrupted by previous AI implementations. That means they’ll have to get a lot more rigorous about tracking when and how doctors use AI predictions to make clinical decisions. Robinson and his editorial coauthors suggest that adopting new variables to retrain models could help.
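One way to make that tracking concrete, sketched here with a hypothetical record format rather than any existing EHR schema, is to tag every encounter with whether a model alert fired and whether clinicians acted on it, so that later training runs can set aside model-influenced outcomes:

```python
# Hypothetical provenance sketch (not a real product or the study's method):
# tag each encounter with model-alert and intervention flags so future training
# sets can exclude outcomes that a model-driven intervention may have changed.
from dataclasses import dataclass

@dataclass
class Encounter:
    patient_id: str
    features: dict          # inputs the model saw
    outcome: int            # recorded outcome (1 = event occurred)
    model_alert: bool       # did a predictive model flag this patient?
    acted_on: bool          # did clinicians intervene because of the alert?

def untainted_training_set(encounters):
    """Keep only records whose outcomes were not shaped by a model-driven intervention."""
    return [e for e in encounters if not (e.model_alert and e.acted_on)]

encounters = [
    Encounter("a", {"creatinine": 2.1}, outcome=0, model_alert=True, acted_on=True),
    Encounter("b", {"creatinine": 0.9}, outcome=0, model_alert=False, acted_on=False),
    Encounter("c", {"creatinine": 1.8}, outcome=1, model_alert=True, acted_on=False),
]
print(len(untainted_training_set(encounters)))   # 2: patient 'a' is excluded
```

Simply excluding those records is a blunt instrument, since it can introduce its own selection bias, but even that option depends on provenance that most systems do not capture today.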
“There need to be regulations around this,” said Vaid. “Currently it’s just the Wild West out there. You make a model, you deploy it.”
In March, the FDA issued draft guidance that attempts to address the reality of clinical AI performance degrading over time, giving manufacturers a framework for updating models in a predetermined fashion that doesn’t require agency review for each change. But the new research suggests that the steps in that “change control plan,” including model retraining, shouldn’t be implemented unthinkingly.
“That needs to be thought about a little bit more,” said Nadkarni. “The lifecycle plan of the FDA currently includes retraining, evaluation, and updating, but implementing them wholesale without thinking about the predictive performance, the intervention effect, and adherence might actually worsen the problem.”
As many health systems continue to put off assessment of existing AI models, Robinson points out that these issues extend to the next generation of clinical tools powered by large language models. LLMs retrained on their own output tend to perform worse and worse over successive generations. “As radiology reports, pathology reports, or even clinical notes are more and more built by LLMs, future iterations will get trained on that data,” said Robinson. “And there could be unintended consequences.”
Vaid puts it more simply: We’re living in a model-eat-model world.
This story is part of a series examining the use of artificial intelligence in health care and practices for exchanging and analyzing patient data. It is supported with funding from the Gordon and Betty Moore Foundation.