Study Finds Age-Related Bias in NLP Tools for Chest X-ray Annotation - EMJ

RECENT research has revealed that four commercially available natural language processing (NLP) tools used for annotating chest x-ray reports show high overall accuracy but exhibit significant age-related biases. The study evaluated CheXpert, RadReportAnnotator, ChatGPT-4, and cTAKES, which achieved accuracies between 82.9% and 94.3% in labelling thoracic diseases from chest x-ray reports. However, all four tools performed worst on reports from patients over 80 years old, according to the study team.

“While NLP tools can facilitate [deep learning] development in radiology, they must be vetted for demographic biases prior to widespread deployment to prevent biased labels from being perpetuated at scale,” the researchers warned. 

NLP is a key technology for automating text analysis, and its integration into medical imaging can help build large datasets for training artificial intelligence (AI) systems. These AI models can be essential for improving diagnostic accuracy and efficiency in healthcare. However, without careful evaluation, biases within NLP models could exacerbate existing disparities in healthcare, particularly those related to age and socioeconomic status. 

The study tested the four NLP tools on two datasets: a 692-report subset of the Medical Information Mart for Intensive Care (MIMIC) dataset, balanced for age, sex, race, and ethnicity, and the Indiana University (IU) chest x-ray dataset, which includes 3,665 reports. Three board-certified radiologists annotated the reports for 14 thoracic disease labels, and these annotations served as the benchmark for evaluating the NLP tools.

ChatGPT-4 and CheXpert were the top performers on the IU dataset, achieving 94.3% and 92.6% accuracy, respectively. RadReportAnnotator and ChatGPT-4 led on the MIMIC dataset with 92.2% and 91.6% accuracy, respectively. Despite their high accuracy, all four tools demonstrated significant biases across age groups, with the highest error rates (an average of 15.8%) in patients over 80 years old.
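
For readers who want to see what a subgroup comparison of this kind involves, the following is a minimal Python sketch, not the study's method. It compares a tool's labels with reference labels and reports the error rate within each age group. The function names, the age bins, and the toy records are all hypothetical and are not taken from the study, which the article does not describe in statistical detail.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Return {group: error rate} for (group, predicted, reference) tuples.

    A record counts as an error when the tool's label differs from the
    reference label (here standing in for a radiologist annotation).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, predicted, reference in records:
        totals[group] += 1
        if predicted != reference:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def largest_gap(rates):
    """Difference between the highest and lowest group error rates."""
    return max(rates.values()) - min(rates.values())


# Toy, made-up records: (age group, tool label, reference label).
# The age bins are illustrative assumptions, not the study's bins.
records = [
    ("18-40", 1, 1), ("18-40", 0, 0), ("18-40", 1, 1), ("18-40", 0, 0),
    ("41-80", 1, 1), ("41-80", 0, 1), ("41-80", 0, 0), ("41-80", 1, 1),
    (">80", 1, 0), (">80", 0, 1), (">80", 1, 1), (">80", 0, 0),
]

rates = error_rate_by_group(records)
for group, rate in rates.items():
    print(f"{group}: {rate:.1%} error rate")
print(f"Largest gap between groups: {largest_gap(rates):.1%}")
```

A real audit would also need enough reports per group and a statistical test before any gap could be called significant. This sketch covers only the bookkeeping.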

To address these biases, the researchers suggested diversifying training data and incorporating contemporary demographic trends into NLP models. They also recommended employing techniques such as fairness awareness and bias auditing during algorithm training. Ensuring demographic balance in NLP tools is crucial for preventing biased AI models and improving the fairness and effectiveness of AI in radiology.

Reference 

Santomartino SM et al. Evaluating the performance and bias of natural language processing tools in labeling chest radiograph reports. Radiology. 2024;313(1):e232746. 
