AI assurance labs intended to test health care technology have an equity problem

As leaders across federal agencies swiftly advance regulations for AI in health care, one proposal now seems too big to fail.

That proposal is to implement AI assurance laboratories: places where developers can build and test AI models according to standard criteria that would be defined with regulators.


Many of the biggest names and organizations in health AI have embraced the concept, which was described on stage at the annual Office of the National Coordinator for Health Information Technology meeting, published in a prominent JAMA Special Communication and featured in an exclusive STAT report. The proposal is being advanced by leaders of the Coalition for Health AI (CHAI) with strong support from two top regulators — the National Coordinator for Health IT and the director of the Digital Health Center of Excellence at the Food and Drug Administration. The proposal responds to President Biden’s recent executive order, which calls for the development of AI assurance infrastructure. The proposal in JAMA ends with a request for funding a “small number of assurance labs that experiment with these diverse approaches and gather evidence that the creation of such labs can meet the goals laid out in the Executive Order.”

It makes sense that there’s so much energy behind this concept. AI assurance laboratories can solve a narrow slice of AI challenges faced by health care providers. The proposed network of AI assurance laboratories could differentiate products that perform well across sites from those that don’t. For instance, the assurance laboratory network could test dozens of sepsis prediction models to efficiently identify the AI products that perform best across sites. Health care providers and patients could benefit from increased transparency that prevents adoption of faulty AI products.

But the proposal leaves substantial gaps and would likely worsen the digital divide that prevents many low-resource medical sites from using AI safely and effectively. We say this as AI experts with direct experience vetting and managing AI products in high-resource settings.


To begin with, the proposal for AI assurance laboratories fails to address the unequal distribution of AI governance capabilities. STAT has reported that the initial AI assurance laboratories would be at Duke, Mayo Clinic, and Stanford, all large health systems and academic medical centers that already have significant AI expertise. While our high-resource institutions may benefit from AI assurance laboratory funding, we recognize the need to direct attention and resources toward settings unlike ours that are less able to effectively conduct AI assurance. In addition to investing in a small number of AI assurance laboratories, we urge federal regulators to boldly invest in AI capabilities, infrastructure, and technical assistance to advance the safe, effective, and equitable use of AI in low-resource settings.

The proposal makes an unconvincing argument that federal investments in AI assurance laboratories at sites like Duke, Mayo Clinic, and Stanford advance health equity. The authors, including prominent regulators, state: “An alternative would be to enable health systems to create their own local assurance labs. While possible for the larger health systems and academic medical centers, this alternative would not scale … having such labs would also exacerbate health system level inequity, with better-resourced systems able to provide stronger protections.” In a country plagued by systemic health inequities, it seems disingenuous to suggest that investing in a small number of elite health care institutions improves AI assurance in low-resource settings more than directly investing in building capabilities within low-resource settings.

Our own work on collaborative governance reveals that immediate accountability for the impacts of AI system use will remain with frontline health care providers, high-resource or low. They are the ones ultimately responsible for “last mile” evaluation of AI. Federal regulators and national collaboratives, like CHAI and the Health AI Partnership (HAIP), must grapple with the challenge of scale to create programs that equip all health care providers to carry out these critical functions.

There are also two main gaps in the proposal. First, regulations and investments must account for the differences among the priorities and concerns of AI product developers, regulators, and implementers. In practice, the most complex challenges health care providers face lie beyond areas of consensus. For example, in interviews with nearly 90 stakeholders across 10 health care organizations, we found that most organizations prefer to validate AI products locally prior to clinical use. Analyses performed at Stanford, Mayo Clinic, and Duke cannot account for the on-the-ground variation in resources, populations, and operations of the more than 6,000 hospitals in the United States. Even if an AI vendor offers a health care provider validation data from an AI assurance laboratory, the provider will still want to confirm the analysis locally.

We also found that health care providers often assess AI products in ways that extend beyond the bounds of FDA and Office of the National Coordinator for Health Information Technology regulations. To date, representatives from these two federal agencies have been the most prominent proponents of AI assurance laboratories. For example, many health care providers are developing internal processes to assess AI products for impacts on health inequities in response to state attorney general offices and local public health offices. Health care providers are subject to competing pressures from the public, developers, and regulators at different levels, and they need tailored support.

The second gap is that AI products cannot be meaningfully evaluated in well-controlled, in silico environments, which is why AI assurance laboratories would have limited impact on the frontlines of health care. In his famous book “Deep Medicine,” Eric Topol distinguished in silico testing, like what would occur in the proposed labs, from prospective clinical studies. Topol emphasized that in silico testing that involves “analyzing an existing dataset is quite different from collecting data in a real clinical environment.” The moment an AI technology is put into use on the frontlines, it becomes a sociotechnical system that alters the way people behave and interact. The performance of the AI solution depends far more on the behaviors of users and changes to the work environment than on the quantitative measures an AI assurance laboratory can compute.

Numerous studies have borne this out: the Epic sepsis model failed to meet technical performance targets at Michigan Medicine yet reduced mortality in a prospective randomized trial at Metro Health in Ohio; expert heuristics were incorporated into an HIV model to improve trust among clinician users, even though the change worsened model specificity; and an effort at Duke to implement a peripheral artery disease model was designed to address health inequities yet failed to identify structural challenges that perpetuate those inequities. Such studies often require embedding social scientists in the clinical environments where AI is implemented. And these kinds of lessons are impossible to surface without empowering health care providers to carry out AI assurance activities themselves.

Thankfully, there is a path forward: combine AI assurance laboratories with large investments in technical infrastructure and regional extension centers that provide on-the-ground technical assistance to low-resource health care providers. This approach draws inspiration from previous federal investments, beginning in 2009 and totaling over $30 billion, that supported the implementation of electronic health records across nearly all health care organizations. While there are many valid critiques of EHRs that we do not want to replicate for AI (e.g., poor design, burdensome workflows, limited improvements in patient outcomes), there was one big thing federal agencies got right: enabling the rapid rollout. The funding covered purchase of the technology as well as a network of 62 regional extension centers “to provide on-the-ground technical assistance for individual and small provider practices, medical practices lacking resources to implement and maintain EHRs, and those who provide primary care services in public and critical access hospitals, community health centers, and other settings that mostly serve those who lack adequate coverage or medical care.”

We should approach AI assurance with the same ambition and urgency today as we did EHR implementations 15 years ago. Regulators must invest in a portfolio of programs that move beyond in silico testing to support use of AI in real, diverse clinical environments.

Mark Sendak, M.D., MPP, is the population health and data science lead at the Duke Institute for Health Innovation and a co-lead of the Health AI Partnership. Nicholson Price, J.D., Ph.D., is a professor of law at the University of Michigan. Karandeep Singh, M.D., MMSc, is the chief health AI officer at University of California, San Diego Health. Suresh Balu is the director of the Duke Institute for Health Innovation and a co-lead of the Health AI Partnership.