AI has the potential to revolutionize human health. It is used to detect potentially cancerous lesions in medical images, to screen for eye disease, and to predict whether a patient in the intensive care unit is at risk of a brain-damaging seizure. Even your smartwatch has AI built into it: it can estimate your heart rate and detect whether you have atrial fibrillation. But how good are these algorithms, really? The truth is, we just don’t know.
Answering this question is a nightmare. The only way to evaluate AI models, or to create them in the first place, is to have a large, diverse medical dataset. The dataset must include enough patients of all kinds to ensure the AI model behaves well across different groups of people. It must be representative of all the situations in which the model might be used, whether in regional hospitals or major medical centers. The dataset also has to include medical outcomes, so that an AI model trying to predict those outcomes can be evaluated against the truth.
The FDA requires this kind of large-scale testing (and checks on the quality of AI model training), which means the companies that develop these technologies have access to such datasets. These datasets, which can come from health care providers, may include data you produced as part of your own medical care or through clinical trials. However, these data are not accessible to other companies for building even better models, and they are certainly not available to researchers wanting to evaluate the existing ones; third parties have had to create their own datasets for evaluation. And, of course, consumers cannot make informed decisions in choosing products they may one day depend on.
So where does that leave human health? Consider AI within smartwatches: Before smartwatches, people might not have known they had atrial fibrillation (afib) until after a stroke landed them in the hospital. Smartwatches change everything: Now atrial fibrillation can be detected quickly and treated early, and companies are racing to build the most accurate afib detectors. Heart-monitoring algorithms within smartwatches need to be FDA-approved, which means large datasets must be created.
However, none of the massive datasets used in FDA-approved devices are available to researchers, nor can we, as researchers or consumers, ever perform a head-to-head comparison of, for instance, the Apple Watch vs. Fitbit vs. the latest academic algorithms. As medical patients, we can’t know which smartwatch is best for people “like us.” A few public datasets exist for training AI algorithms on sensor data from wearable devices (the WESAD, DaLiA, and Stanford datasets), but the algorithms inside actual smartwatches cannot be evaluated on them because those algorithms are proprietary. The latest academic algorithms may be substantially better (or substantially worse) at detecting afib than Fitbits, but we just don’t know.
In previous independent evaluations of the Apple Watch, the detection algorithm was unable to classify 27.9% of 30-second ECG recordings, more than one-quarter of the collected data. It seems to struggle particularly during intense exercise. That leaves ample room for improvement.
One possible solution is for a federal agency to evaluate algorithms for major health-related applications such as afib detection. This agency would maintain giant hidden test sets and publish accuracy results on them for each algorithm, including a breakdown by demographic group. The FDA would require those test results before approving a new product. The agency would also release a public dataset to help algorithm designers, lowering the barrier to entry, particularly for scientific researchers and small companies that cannot afford the up-front cost of creating their own massive datasets.
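To make the demographic breakdown concrete, here is a minimal sketch of what per-group accuracy reporting could look like. It is purely illustrative: the function name, the choice of sensitivity and specificity as metrics, and the randomly generated data are our own assumptions, not any agency’s actual protocol.

```python
import numpy as np

def stratified_report(y_true, y_pred, groups):
    """Per-group sensitivity and specificity for a binary afib detector.

    y_true, y_pred: arrays of 0/1 labels (1 = afib present/detected).
    groups: array of demographic tags (e.g., age bracket, sex).
    """
    for g in np.unique(groups):
        mask = groups == g
        t, p = y_true[mask], y_pred[mask]
        pos, neg = t == 1, t == 0
        # Sensitivity: fraction of true afib cases the detector caught.
        sens = (p[pos] == 1).mean() if pos.any() else float("nan")
        # Specificity: fraction of non-afib cases correctly cleared.
        spec = (p[neg] == 0).mean() if neg.any() else float("nan")
        print(f"{g}: n={mask.sum()}, sensitivity={sens:.3f}, specificity={spec:.3f}")

# Made-up data for illustration only: two age groups, random predictions.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_pred = rng.integers(0, 2, 1000)
groups = rng.choice(["under_65", "65_and_over"], 1000)
stratified_report(y_true, y_pred, groups)
```

A real report would, of course, use many more strata and hidden data the vendor never sees; the point is that the per-group numbers are cheap to compute once such data exist.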
There is precedent for a government agency testing AI algorithms in high-stakes applications for the public. The National Institute of Standards and Technology (NIST) runs the Face Recognition Vendor Test, which evaluates any developer’s facial recognition software and issues a publicly available report. If more accurate facial recognition systems can improve security and save lives, imagine how many more lives could be saved by accurate AI health technologies.
How would this agency procure giant hidden test sets? It could partner with ongoing national initiatives such as the NIH’s Bridge2AI program, whose data-generation projects are underway to produce ethically sourced, large, diverse human health data. The FDA’s Medical Device Development Tools (MDDT) program also seems like a promising way to collect data, but because vendors’ use of a qualified MDDT is voluntary, uptake may be limited.
Some vendors would balk at a government agency evaluating their AI software. They might argue that their software can be used only with their own hardware because its sampling rate is different; the Apple, Fitbit, and Huawei heart studies each tested algorithms only on the company’s own devices. The way to address this is to build the agency’s hidden test database with data from multiple devices, spanning multiple sampling rates. The data should be collected at the highest available resolution, so that it can be downsampled to match any device and no vendor can claim the dataset is not good enough.
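To illustrate why high-resolution collection settles the sampling-rate objection: a reference recording can be resampled down to any device’s native rate with standard signal-processing tools. The sketch below uses SciPy’s polyphase resampler; the rates and the sine-wave stand-in for an ECG are made up for the example, and no vendor’s actual pipeline is implied.

```python
from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly

def match_device_rate(signal, source_hz, target_hz):
    """Downsample a high-resolution recording to a device's native rate."""
    ratio = Fraction(target_hz, source_hz).limit_denominator(1000)
    return resample_poly(signal, ratio.numerator, ratio.denominator)

# Example: a 500 Hz reference recording resampled to a 125 Hz wearable rate.
t = np.arange(0, 10, 1 / 500)          # 10 seconds sampled at 500 Hz
ecg = np.sin(2 * np.pi * 1.2 * t)      # stand-in for a real ECG trace
wearable_ecg = match_device_rate(ecg, 500, 125)
print(len(ecg), len(wearable_ecg))     # 5000 -> 1250 samples
```

Going the other direction, from a low-resolution recording up to a high-resolution one, cannot recover lost detail, which is exactly why the shared database should be collected at the highest resolution in the first place.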
The AI revolution is upon us. Let’s allow it to go full speed ahead on human health by enabling direct, transparent, and comprehensive evaluation of medical AI, supporting agile development and letting consumers choose the best methods. Doing so will save lives.
Cynthia Rudin is a computer science professor at Duke University. Zhicheng Guo is a Ph.D. student at Duke University. Cheng Ding is a Ph.D. student at Georgia Tech and Emory. Xiao Hu is a professor at Emory University. They study machine learning in biomedical applications.