THE ABILITY of AI models to infer demographic information from medical images, such as chest X-rays, is well established; this observation led researchers at the Massachusetts Institute of Technology (MIT) to question whether these demographic predictions affect the fairness of diagnostic conclusions.
The study, led by Marzyeh Ghassemi, MIT, Cambridge, Massachusetts, USA, systematically investigated this question and examined whether alternative shortcuts could be used to create locally and globally optimal models.
Fairness was evaluated by assessing the rate of detrimental errors, such as a false negative in a chest X-ray screening, which would delay treatment for an unwell patient. Conducting this analysis across sex and race revealed that the most negatively affected groups were women and Black patients. These investigations also demonstrated that the models showing the greatest accuracy in predicting a patient’s race also showed the greatest discrepancies in their ability to make an accurate diagnosis across these demographic groups.
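As a rough illustration of this kind of fairness audit, the sketch below computes the false-negative rate per subgroup and reports the spread between groups; the data, column names, and groups are hypothetical placeholders, not values from the study.

```python
import numpy as np
import pandas as pd

# Hypothetical screening results: 1 = disease present / flagged, 0 = absent.
# Column names and values are illustrative, not from the study's datasets.
df = pd.DataFrame({
    "group":      ["A", "A", "A", "A", "B", "B", "B", "B"],
    "label":      [1,   1,   0,   0,   1,   1,   1,   0],
    "prediction": [1,   0,   0,   0,   0,   1,   0,   0],
})

def false_negative_rate(sub: pd.DataFrame) -> float:
    """FNR = missed positives / all true positives in the subgroup."""
    positives = sub[sub["label"] == 1]
    if positives.empty:
        return np.nan
    return float((positives["prediction"] == 0).mean())

fnr_by_group = df.groupby("group").apply(false_negative_rate)
print(fnr_by_group)

# The "fairness gap" is the spread in FNR across subgroups.
print("FNR gap:", fnr_by_group.max() - fnr_by_group.min())
```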
To investigate reducing this fairness gap, AI models were trained using publicly available chest X-ray datasets from Beth Israel Deaconess Medical Center in Boston, Massachusetts, USA. Two training strategies were explored: one optimising for “subgroup robustness” and the other taking a “group adversarial” approach. The former aimed to increase diagnostic accuracy within particular subgroups, while the latter forced the model to remove demographic information from the images, eliminating it as a source of bias in the conclusions.
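A minimal sketch of what a group-adversarial objective can look like is shown below, assuming a PyTorch setup with a gradient-reversal layer; the network architecture, input shapes, and demographic attribute here are placeholder assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward
    pass, so the encoder is pushed AWAY from features the adversary can use."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Placeholder model: tiny encoder plus two heads (not the study's architecture).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU())
diagnosis_head = nn.Linear(128, 1)   # predicts disease present / absent
adversary_head = nn.Linear(128, 2)   # tries to predict a demographic attribute

opt = torch.optim.Adam(
    list(encoder.parameters())
    + list(diagnosis_head.parameters())
    + list(adversary_head.parameters()),
    lr=1e-3,
)
bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()

# One hypothetical batch of 64x64 single-channel "X-rays" with random labels.
images = torch.randn(8, 1, 64, 64)
disease = torch.randint(0, 2, (8, 1)).float()
demographic = torch.randint(0, 2, (8,))

features = encoder(images)
task_loss = bce(diagnosis_head(features), disease)
# Reversed gradients train the encoder to strip demographic signal,
# while the adversary head simultaneously learns to recover it.
adv_loss = ce(adversary_head(GradReverse.apply(features, 1.0)), demographic)

(task_loss + adv_loss).backward()
opt.step()
```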
These “debiased” models were tested on patient data from five other hospitals and, while their accuracy remained high, the fairness gaps persisted. The “group adversarial” models did show slightly more fairness than the “subgroup robustness” models; however, the researchers concluded that debiasing a model on one set of patients does not guarantee that fairness will be preserved among groups of patients at new hospitals and locations.
These results suggest that a single, universal approach to creating fair AI models is most likely incompatible with the clinical healthcare setting, given the significant diversity in patient demographics between hospitals. The researchers suggested that, until a model capable of making fairer predictions on new datasets is developed, hospitals using these AI models may benefit from first evaluating their own patient populations to foresee and avoid potentially detrimental diagnostic errors.
Katie Wright, EMJ
Reference
Yang Y et al. The limits of fair medical imaging AI in real-world generalization. Nat Med. 2024; DOI: 10.1038/s41591-024-03113-4.