PROVIDENCE, R.I. — In a cavernous converted chapel at Brown University, physician and data scientist Leo Celi observed from the sidelines as a tableful of high school students passed around a plastic, crocodilian device. The pulse oximeter clamped down on one student’s fingertip: “Ninety-seven,” he read aloud, before handing it off to the student next to him.
“We as doctors get a bit unhappy when you’re below 90, 92,” said physician Jack Gallifant, who recently finished a postdoc at the MIT Laboratory for Computational Physiology where Celi directs clinical research.
There was a broader point to this demonstration. Celi travels the globe coaching students and medical trainees to design artificial intelligence algorithms that predict patients’ futures — their likelihood of recovering from an illness, say, or falling ill in the first place. That work depends on reliable data, and pulse oximeters are notorious for a troubling feature: They deliver less accurate blood-oxygen readings for patients with darker skin tones.
That systematic bias is sometimes an immediate threat for Black patients, whose faulty readings can delay care and result in poorer outcomes. And it’s an example of a growing challenge for machine learning researchers like Gallifant and Celi — along with the local data science students spending a June weekend with them in Providence — who will design the next generation of algorithms in health care.
In the past four years, clinical medicine has been forced to reckon with the role of race in simpler iterations of these algorithms. Common calculators, used by doctors to inform care decisions, sometimes adjust their predictions depending on a patient’s race — perpetuating the false idea that race is a biological construct, not a social one.
Machine learning techniques could chart a path forward. They could allow clinical researchers to crunch reams of real-world patient records to deliver more nuanced predictions about health risks, obviating the need to rely on race as a crude — and sometimes harmful — proxy. But what happens, Gallifant asked his table of students, if that real-world data is tainted, unreliable? What happens to patients if researchers train their high-powered algorithms on data from biased tools like the pulse oximeter?
Over the weekend, Celi’s team of volunteer clinicians and data scientists explained, the students would go hunting for that embedded bias in a massive open-source clinical dataset, the first step to making sure it doesn’t influence the clinical algorithms that shape patient care. The pulse oximeter continued to make the rounds, landing with a student named Ady Suy, who some day wants to care for people whose concerns might be ignored, as a nurse or a pediatrician. “I’ve known people that didn’t get the care that they needed,” she said. “And I just really want to change that.”
At Brown and in events like this around the world, Celi and his team have been priming medicine’s next cohort of researchers and clinicians to cross-examine the data they intend to use. As scientists and regulators sound alarm bells about the risks of novel artificial intelligence, Celi believes the most alarming thing about AI isn’t its newness: It’s that it repeats an age-old mistake in medicine, continuing to use flawed, incomplete data to make decisions about patients.
“The data that we use to build AI reflects everything about the systems that we would like to disrupt,” said Celi: “Both the good and the bad.” And without action, AI stands to cement bias into the health care system at disquieting speed and scale.
MIT launched its machine learning events 10 years ago, as AI was beginning to explode in medicine. Then, as now, hospitals were under pressure to implement new models as quickly as researchers and companies could build them. “It’s so intoxicating,” said Maia Hightower, the former chief digital transformation officer at UChicago Medicine, when algorithm creators make lofty promises to solve physician burnout and improve patient care.
In 2019, a paper led by University of California, Berkeley machine learning and health researcher Ziad Obermeyer gave many technology boosters pause. Health systems had widely used an algorithm from Optum to predict how sick patients were by identifying patterns in their health care costs: The sicker the patients, the more bills they rack up. But Obermeyer’s research showed the algorithm likely ended up diverting care from a huge number of Black patients that it labeled as healthier than they really were. They had low health costs not because they were healthier, but because they had unequal access to medical care.
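The mechanism is easy to reproduce on synthetic data. The sketch below is purely illustrative, with made-up group labels, an assumed 40% cost penalty for patients with less access to care, and no connection to the real Optum model; it only shows how ranking patients by predicted cost can under-select a group whose underlying illness burden is identical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic population: true illness burden is drawn identically for both
# groups, so any gap a cost-based ranking finds is an artifact of the proxy.
group = rng.choice(["A", "B"], size=n)   # "B" stands in for patients with less access to care
illness = rng.gamma(shape=2.0, scale=1.0, size=n)

# Costs track illness, but group B generates ~40% less cost for the same
# illness because of unequal access (an assumption made for illustration).
access = np.where(group == "B", 0.6, 1.0)
cost = illness * access * rng.lognormal(0.0, 0.2, size=n)

df = pd.DataFrame({"group": group, "illness": illness, "cost": cost})

# A cost-trained model effectively ranks patients by expected cost. Compare
# who lands in the "highest risk" decile when ranking by cost vs. by illness.
for label in ["cost", "illness"]:
    top = df[df[label] >= df[label].quantile(0.9)]
    print(f"top 10% by {label}: {(top['group'] == 'B').mean():.1%} from group B")
# Ranking by cost under-selects group B even though illness is distributed identically.
```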
Suddenly, researchers and policymakers were acutely aware of how hastily deployed algorithms could preserve existing inequities in health care in amber.
Those risks shouldn’t have come as a surprise. “This is not an AI problem,” Marzyeh Ghassemi, who leads the Healthy Machine Learning lab at MIT, said at a recent National Academies of Sciences, Engineering, and Medicine meeting examining the role of race in biomedical algorithms. Traditional risk scores in clinical medicine — no whiz-bang machine learning techniques required — have long suffered from bias.
Fundamentally, all clinical prediction tools are built on the same flawed data. In population health research, epidemiologists have limited access to information about disadvantaged groups — whether racial and ethnic minorities, rural patients, or people who don’t speak English as their first language. The same obstacles apply to algorithm developers who train their models on real health records.
“It’s a huge problem, because it means that the algorithms are not learning from their experiences,” said Obermeyer. “They’re not going to produce as accurate predictions on those people.” And in an increasingly digitized and automated health system, those often-excluded patients will be left behind yet again.
None of this was news to Celi. He had watched as ever-more algorithms continued to spit out biased results, widening the divide between privileged and marginalized patients. His MIT events had been intentionally global, aiming to foster a diverse group of AI researchers who would counter that trend.
But two years ago, his team stopped short. In their hackathons, attendees worked to train a machine learning model in just two days, often using a database of intensive care patients from the Boston area.
“We realized that trying to rush building models in two days — without really understanding the data — is probably the best recipe for artificial intelligence to truly encode, encrypt the inequities that we’re seeing now,” said Celi.
Encouraging the next crop of physicians and researchers to build models from flawed data? “That’s no longer going to cut it,” said Celi. They’d need to learn to interrogate medical data for bias from the ground up.
Sometimes, it’s high school students doing the digging. In February, at the University of Pittsburgh, it was a group of more than 20 doctors, biomedical data students, and medical trainees: A nursing professor calling in from Saudi Arabia. An Indian American resident in pediatric critical care. A health data engineer, fresh from the University of Wyoming, on the second week of his new job at Pitt. A cadre of critical care residents building machine learning models for children’s hospitals.
Pulse oximeters are a stark example of how racial bias can sneak into seemingly objective medical data. But as the Pittsburgh participants were quickly learning, there were even more insidious ways that bias could lurk within the 90,000 rows of medical data in front of them. Each row represented a person who had found themselves in a critical care unit somewhere in the world. And each, Celi explained, reflected the social patterns that influenced their health status and the care they received.
“We are not collecting data in an identical fashion across all our patients,” he told them. To find themselves in a database of intensive care patients, a person has to get to the ICU first. And depending on how good a hospital they reach, their workup looks different.
Olga Kravchenko, a biomedical informatician at the University of Pittsburgh, started using ChatGPT to crank out code, looking for odd patterns in the paths that patients took to the ICU and their care once they were admitted. Here was something: Patients labeled as Native American and Caucasian, compared to other racial groups, had much higher rates of needing a ventilator.
“Maybe they are sicker,” mulled Sumit Kapoor, a critical care doctor at UPMC. The data could be mirroring accumulated generations of mistreatment: Indigenous Americans, on the whole, have higher rates of diabetes and other chronic illnesses than the average patient.
But what about the white patients? Could unconscious favoring of these patients influence treatment enough that it showed up as a signal in the data? “Maybe they are more privileged that they get the mechanical ventilation compared to minorities,” said Kapoor.
The answers to those questions are rarely clear. But asking why a racial disparity appears in data is a critical step toward ensuring that its signal doesn’t get misused in, say, a predictive algorithm that helps hospitals determine who’s most likely to benefit from a limited supply of ventilators.
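The kind of pattern-hunting Kravchenko was doing takes only a few lines of standard analysis code. A rough sketch, using hypothetical file and column names (race, received_ventilation, severity_score) rather than the dataset’s real schema, might look like this:

```python
import pandas as pd

# Hypothetical file and column names; the real ICU dataset's schema will differ.
df = pd.read_csv("icu_patients.csv")

# Ventilation rate by recorded race, with group sizes so that small,
# noisy groups are easy to spot.
vent_by_race = (
    df.groupby("race")["received_ventilation"]
      .agg(rate="mean", n="size")
      .sort_values("rate", ascending=False)
)
print(vent_by_race)

# A crude severity adjustment: compare ventilation rates within bands of an
# illness-severity score, so differences in how sick patients were can at
# least be partially separated from differences in how they were treated.
df["severity_band"] = pd.qcut(df["severity_score"], q=4, labels=False)
print(df.pivot_table(index="severity_band", columns="race",
                     values="received_ventilation", aggfunc="mean"))
```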
Sometimes, a racial signal is embedded so deeply in medical data that it’s invisible to humans. Research has shown that machine learning algorithms can predict a patient’s race from their X-rays and CT scans, something even a highly trained radiologist can’t do. If a model can guess a patient’s race from medical images, warn MIT’s Celi and Ghassemi, it could start to base its predictions on a patient’s race instead of their underlying cause of disease — and doctors would be none the wiser. “The machines learn all sorts of things that are not true,” said Celi.
Just as important is the data that doesn’t appear in medical datasets. Family cancer history is one of the strongest factors in an individual’s chance of getting cancer — and therefore a critical input for any algorithm that aims to predict cancer risk. But not every person knows that history, and it’s not reliably collected by every doctor or researcher. In one large data set, Obermeyer and his colleagues showed in a recent paper, family history of colorectal cancer is less complete for Black patients — making any algorithms built on that data less likely to be accurate for those groups. In those cases, they argued, race can be an important variable to include in algorithms to help account for differing data quality between groups.
At the Pittsburgh datathon, participants were looking for more gaps in the database. One group, led by two pediatric critical care residents, found lab values missing at an oddly high rate at certain hospitals. “They’re not missing data randomly,” said Allan Joseph, one of the residents. “If there’s some systematic reason that people are missing data — either because they are being provided care at a lower quality hospital, or because they’re not receiving labs — those could bias your estimates.”
Sarah Nutman, the other resident, nodded vigorously: “Garbage in, garbage out.”
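A first pass at the residents’ question is likewise a short missingness audit. The sketch below again assumes hypothetical file and column names (hospital_id, race, and a few lab columns); the point is the shape of the check, not the exact schema:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("icu_patients.csv")
lab_cols = ["lactate", "creatinine", "bilirubin"]

# Share of missing lab values per hospital: if a handful of sites account
# for most of the gaps, the data are not missing at random.
row_missing = df[lab_cols].isna().mean(axis=1)
print(row_missing.groupby(df["hospital_id"]).mean()
                 .sort_values(ascending=False).head(10))

# The same check by recorded race flags missingness that tracks who the
# patient is rather than how sick they are.
print(row_missing.groupby(df["race"]).mean())
```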
At events like these at Brown and Pitt, Celi continues to evangelize the risks of bias in medical data. AI developers, whether they’re in training or already working, have a responsibility to account for those flaws before training new algorithms, he said — “otherwise, we’re going to be complicit in the crime.”
But training young researchers to spot data distortions is just a first step. It’s not at all clear how developers — and the health systems using their AI — will avoid repeating the errors of the past. “That’s the hard part,” agreed Joseph, the UPMC resident, after he and his datathon team identified several holes in the ICU data.
Today, health equity often takes a backseat to the economic realities of medicine. Most hospitals choose their AI tools based on the problem they promise to solve, not the data used to train them. Hightower left her UChicago position last year to work full-time as CEO of Equality AI, a company that aims to help hospital systems evaluate their machine learning models for bias and fairness. But there’s a “lack of urgency” to address the problem, she said. “I lead in with AI bias, and they’re like, ‘Okay, but that’s the least of my worries. I got burnt-out physicians. I got revenue demands.’”
When health systems do try to vet their AI algorithms, they can check against old patient records to see if they deliver accurate results across demographic groups. But that doesn’t prove they won’t result in biased care in the future. “This question is actually very complicated,” said Obermeyer, who recently launched Dandelion Health to help organizations train and test their algorithms on datasets with more racial, ethnic, and geographic diversity. “You can’t answer it by simply going through some checklist or running a piece of code that’s going to just magically tell you it’s biased.”
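In practice, that vetting usually starts with a retrospective subgroup audit: score old patient records with the model and compare its performance group by group. The sketch below is one generic version of such a check, with assumed column names (race, outcome, risk_score) and no claim to being anyone’s actual validation pipeline; as Obermeyer notes, passing it does not prove the model won’t cause biased care once deployed.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical inputs: historical outcomes, the model's risk scores, and each
# patient's demographic group, one row per patient.
audit = pd.read_csv("retrospective_predictions.csv")

rows = []
for group, g in audit.groupby("race"):
    if g["outcome"].nunique() < 2:
        continue  # AUC is undefined if a group has only one outcome class
    rows.append({
        "group": group,
        "n": len(g),
        "auc": roc_auc_score(g["outcome"], g["risk_score"]),
        "mean_predicted_risk": g["risk_score"].mean(),
        "observed_event_rate": g["outcome"].mean(),
    })
print(pd.DataFrame(rows))
# Gaps in AUC, or predicted risk that consistently over- or under-shoots the
# observed event rate for one group, are signals a checklist alone won't catch.
```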
There are no easy solutions. To create unbiased algorithms, developers say they need better, less biased data. They need medical records from parts of the country and the world that aren’t well-represented. They need better access to patients’ genetic data and social background — and in structured formats that can be easily plugged into computers. Data-gathering and infrastructure efforts like All of Us and AIM-AHEAD, funded by the National Institutes of Health, are chipping away at those gaps — but it’s a painstaking process.
And as long as systemic disparities exist in the world, they will appear in medical data. Clinical AI developers will always have to stay vigilant to ensure their models don’t perpetuate bias. To help, Celi envisions building a “bias glossary” for every medical dataset, a summary of the data distortions that responsible model developers should be careful to avoid. And he advocates loudly for AI developers to reflect the patients they’re building for. “If this is a model for predicting maternal complications,” he said, “I don’t want to see a team consisting of 90% men.”
Celi doesn’t pretend he has all the answers. But at Brown, he preached his vision for equitable algorithms to the youngest data scientists — the ones who, sensitized to these issues, might be part of the solution. His words of encouragement were muffled, disappearing into the chapel’s chevron-paneled rafters.
He can only hope he’s getting the message across.
STAT’s coverage of health inequities is supported by a grant from the Commonwealth Fund. Our financial supporters are not involved in any decisions about our journalism.