Several years ago, teams of scholars published groundbreaking research that pointed to racial biases in algorithms that helped direct patient care at major health systems in the U.S. Those algorithms, the studies found, adversely affected the care of Black and Latinx patients across multiple medical categories. For instance, the researchers uncovered racial biases in the prediction algorithms used to identify more medically complex patients, such that Black patients were far less likely to qualify for additional care than their white counterparts.
Coverage of the Covid pandemic at the time somewhat buried these findings, but the recent STAT series “Embedded Bias” has turned a new spotlight on the issue. The broad takeaway: Algorithms cannot be trusted to make safe and fair decisions about patient care.
The timing is critical, given the rise of generative AI. Pharma companies, care-delivery organizations, and health insurers are being deluged with pitches from AI companies that promise to automate everything from the creation of marketing materials to the drafting of patient visit notes following a doctor’s appointment.
And the reaction is almost reflexive: AI cannot be trusted.
As the co-founder of a generative AI company that helps insurers manage the health of their members, I don’t completely disagree. Many generative AI companies should not, indeed, be trusted with sensitive medical decisions.
But some are indeed working hard to try to build AI that can be trusted.
I should know, because my family has been directly impacted by racial bias in medical technology, and I’m on a personal quest to make AI more trustworthy.
Four years ago, I was at my father’s bedside as he was dying from Covid-19. As many who’ve been through such ordeals know all too well, one of the key ways to track a patient’s condition is through a pulse oximeter, which measures how much oxygen is being carried by the blood. This device, which clips to a patient’s finger, works by scanning the blood vessels beneath a patient’s skin.
In my father’s case, though, we were puzzled because one moment his pulse oximeter would give a fairly normal reading, and the next his oxygen levels would be alarmingly low. This continued throughout his hospitalization, and while it did not directly contribute to his death, it was tremendously distressing, given how heavily clinicians rely on that reading to guide care.
Because I was already part of a medical research group at the Massachusetts Institute of Technology, I joined emerging research efforts that sought to understand how pulse oximeters were failing those with darker skin.
Patients like my father.
These findings made me even more determined not to perpetuate such biases in the generative AI models I was building at the time, models meant to provide impartial decision-making in medical claims resolution and prior-authorization processes for medications and surgeries.
It was not easy. My colleagues and I already enjoyed a data set that was far bigger than those of most other health care-related generative AI companies. We boasted source data from Mayo Clinic and other industry leaders, including one of the largest global pharma companies. But we also understood that those data sets themselves could be biased, simply because they might not reflect the racial and ethnic diversity of patients generally.
We spent months testing and refining our algorithms, and adding a layer of what is known as “dataset diversification and balancing,” to increase the odds that our data would reflect all races and ethnicities fairly.
Here’s what that means in practical terms.
In collaboration with Mayo, we focused on early detection of cardiovascular events and quickly noticed that our model accurately detected such events in populations that were well-represented in the training data. But it had a significant blind spot for Black Americans, who, as we know, disproportionately suffer from cardiovascular disease. By working with our Mayo partners, we came to understand that this algorithmic blind spot stemmed from the historical exclusion of Black patients and other underrepresented minority groups from medical device trials.
By diversifying our data set to account for such blind spots, we significantly improved detection rates for these communities.
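To make the idea concrete, here is a deliberately simplified sketch of group-level oversampling, one common form of dataset diversification and balancing. The column names, toy cohort, and resampling rule are illustrative placeholders, not our production pipeline, which also involves clinical review of the resampled records.

```python
import pandas as pd

def balance_by_group(df: pd.DataFrame, group_col: str = "race_ethnicity",
                     random_state: int = 0) -> pd.DataFrame:
    """Oversample smaller groups until each group matches the largest one.
    A crude form of dataset diversification and balancing."""
    target = df[group_col].value_counts().max()
    resampled = [
        grp.sample(n=target, replace=len(grp) < target, random_state=random_state)
        for _, grp in df.groupby(group_col)
    ]
    return pd.concat(resampled).reset_index(drop=True)

# Toy cohort skewed toward one group (illustrative labels and outcomes)
cohort = pd.DataFrame({
    "race_ethnicity": ["White"] * 80 + ["Black"] * 15 + ["Hispanic"] * 5,
    "had_cv_event":   [0, 1] * 40 + [0, 1, 0] * 5 + [1, 0, 1, 0, 1],
})
print(cohort["race_ethnicity"].value_counts())
print(balance_by_group(cohort)["race_ethnicity"].value_counts())
```

Naive oversampling alone cannot invent information that was never collected, which is why the validation checkpoints described next matter.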
We also implemented manual validation checkpoints at different stages, to ensure the datasets were both balanced and accurate. Such manual checks helped us spotlight demographic data, like age and family medical history, that was often missing for minority patients and that further skewed heart disease predictions for these groups. By correcting our model’s assumptions for these populations, we further improved its accuracy.
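A simple version of that kind of checkpoint can be expressed as a per-group missing-data report that human reviewers inspect before a dataset feeds the model. The field names below are hypothetical stand-ins for whatever a real schema contains.

```python
import numpy as np
import pandas as pd

def missingness_report(df: pd.DataFrame, group_col: str, fields: list) -> pd.DataFrame:
    """Share of missing values per demographic field, broken out by group.
    Meant to feed a human review step, not to fix anything automatically."""
    return df.groupby(group_col)[fields].apply(lambda g: g.isna().mean()).round(3)

# Toy records with hypothetical field names; real schemas will differ
records = pd.DataFrame({
    "race_ethnicity":     ["White", "White", "Black", "Black", "Hispanic", "Hispanic"],
    "age":                [64, 71, np.nan, 58, np.nan, np.nan],
    "family_history_cvd": [1, 0, np.nan, np.nan, 0, np.nan],
})
print(missingness_report(records, "race_ethnicity", ["age", "family_history_cvd"]))
```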
In cases where we lacked enough real-world data for certain minority groups, we used synthetic data to fill these gaps. For example, early on, our model for detecting cardiovascular events lacked strong data for Hispanic populations. By simulating realistic scenarios of cardiovascular risk based on clinical research and population statistics, we exposed our model to a more racially and ethnically diverse sample.
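In simplified form, that kind of synthetic augmentation can be as basic as drawing records from parametric distributions whose parameters come from published clinical studies and population statistics. The numbers below are placeholders for illustration, not real population figures or our actual generator.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def synthesize_records(n: int, group: str, age_mean: float, age_sd: float,
                       sbp_mean: float, sbp_sd: float, event_rate: float) -> pd.DataFrame:
    """Generate synthetic patient records from simple parametric distributions.
    In practice the parameters would come from clinical research and
    population statistics for the group being augmented."""
    return pd.DataFrame({
        "race_ethnicity": group,
        "age":            rng.normal(age_mean, age_sd, n).clip(18, 95).round(),
        "systolic_bp":    rng.normal(sbp_mean, sbp_sd, n).round(),
        "had_cv_event":   rng.binomial(1, event_rate, n),
    })

# Hypothetical parameters, for illustration only
synthetic_hispanic = synthesize_records(n=500, group="Hispanic", age_mean=58,
                                        age_sd=12, sbp_mean=132, sbp_sd=15,
                                        event_rate=0.12)
# The synthetic rows would then be concatenated with the real training table
# before retraining, e.g. pd.concat([real_cohort, synthetic_hispanic]).
print(synthetic_hispanic.head())
```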
As we further audited our model for biases, we also discovered inconsistencies in socioeconomic data that had been gathered for members of different ethnic groups. For instance, data for lower-income patients often failed to account for their access to health care, or their medical histories.
To address this, we designed an algorithmic intervention that flagged such problematic data points, and automatically re-weighted our predictions to account for them.
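One lightweight way to express that kind of intervention is to flag records with incomplete socioeconomic context and down-weight them during training, so they pull less on the model’s predictions. The columns and the 0.5 factor below are illustrative assumptions, not tuned values from our system.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def flag_and_weight(df: pd.DataFrame, socio_fields: list) -> np.ndarray:
    """Return per-record training weights that down-weight rows with
    incomplete socioeconomic context (the 0.5 factor is illustrative)."""
    incomplete = df[socio_fields].isna().any(axis=1)
    return np.where(incomplete, 0.5, 1.0)

# Toy data with hypothetical columns
df = pd.DataFrame({
    "age":                 [55, 63, 48, 70, 59, 66],
    "care_access_score":   [0.8, np.nan, 0.4, 0.9, np.nan, 0.7],
    "prior_history_score": [1.0, np.nan, 0.9, 0.8, np.nan, 0.2],
    "had_cv_event":        [0, 1, 0, 1, 1, 0],
})
weights = flag_and_weight(df, ["care_access_score", "prior_history_score"])
model = LogisticRegression().fit(df[["age"]], df["had_cv_event"],
                                 sample_weight=weights)
```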
Finally, we established guardrails in our algorithms to prevent racially related blind spots, including a fairness audit mechanism that’s triggered whenever the model’s performance for any racial or ethnic group falls below a certain threshold. During one testing phase, for example, we found that the model’s accuracy for Asian populations lagged behind the overall model’s accuracy. Now, when this threshold is crossed, our model automatically diversifies the dataset to ensure fair results for a given population.
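The core of such an audit is easy to describe in code: compare each group’s accuracy with the overall accuracy and flag any group whose gap exceeds a threshold. The threshold and toy labels below are illustrative, not the values we use in production.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

FAIRNESS_GAP = 0.05  # illustrative threshold

def fairness_audit(y_true, y_pred, groups):
    """Return overall accuracy plus any groups whose accuracy trails it by
    more than FAIRNESS_GAP; those groups would trigger re-balancing."""
    overall = accuracy_score(y_true, y_pred)
    frame = pd.DataFrame({"y": y_true, "p": y_pred, "g": groups})
    flagged = {}
    for group, sub in frame.groupby("g"):
        acc = accuracy_score(sub["y"], sub["p"])
        if overall - acc > FAIRNESS_GAP:
            flagged[group] = round(acc, 3)
    return overall, flagged

# Toy held-out labels and predictions
overall, flagged = fairness_audit(
    y_true=[1, 0, 1, 0, 1, 0, 1, 0],
    y_pred=[1, 0, 1, 0, 0, 1, 1, 0],
    groups=["White"] * 4 + ["Asian"] * 2 + ["Black"] * 2)
print(overall, flagged)  # in this toy example, the Asian group is flagged
```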
Taken as a whole, these efforts have paid off, but with a nuance that’s worth noting for like-minded AI leaders. Namely, before implementing these changes, our overall accuracy was about 98%. After making these changes, we actually observed a roughly 7% drop in accuracy.

While the initial high accuracy rate seemed impressive, it actually masked critical issues. The algorithm was previously overconfident in certain areas, particularly where biases in the data had skewed its predictions. By addressing these biases, the accuracy metric adjusted downward, but the model now makes fewer incorrect confident predictions, especially in sensitive cases. For users, this trade-off means a more reliable system that errs less in crucial scenarios, resulting in fairer and more trustworthy decision-making. The experience underscored the ways in which aggregated accuracy benchmarks can mask troubling issues within the data.
The bottom line for my company: We’re now 15-20% more confident in our model’s race-related objectivity, and our model outperforms humans when it comes to matching patients with the appropriate care — though of course our model is designed to allow humans to make the final decision.
That said, no model can ever be perfect. We have provisions for continuous monitoring and improvement, with systems in place to regularly refine our models and detect any unintended biases that may emerge. This ongoing oversight ensures that we treat fairness as an evolving commitment, essential to achieving equitable outcomes in health care.
What not enough people understand about generative AI is that companies like mine can actually help solve the problem of racial bias in health care. After all, racial bias is a human problem first, and a technology problem only to the extent that humans design our technologies.
What it comes down to is attention and intention. If health care leaders pay close enough attention to racial biases, they will understand the importance of solving this problem in all areas of their organization, not just in IT. And if they bring the right level of intention to this issue, they will dedicate actual resources and time to solving it.
In the process, they should bear in mind that generative AI is far from the bogeyman that some wish to make it out to be. Rather, with the proper guardrails in place, it can be part of the solution. It’s up to health care leaders to ask the right questions, and it’s up to us as AI leaders to answer all such questions the right way, or else point our technology to less life-or-death venues.
Amber Nigam is the co-founder and CEO of basys.ai, a generative AI-driven health care company. A Harvard-trained health data scientist, Amber is dedicated to eliminating biases in medical technology and advancing equitable health care delivery.