RESEARCH conducted earlier this year evaluating the performance of large language models (LLMs) in interpreting breast imaging reports has revealed moderate agreement with human experts but concerning levels of discordance that could harm clinical decision-making.
The research assessed the ability of GPT-3.5, GPT-4, and Google’s Bard (now known as Gemini) to assign Breast Imaging Reporting and Data System (BI-RADS) categories based on imaging reports written in Italian, English, and Dutch. BI-RADS categories are critical for guiding clinical decisions in breast cancer diagnosis and screening.
The retrospective study analysed 2,400 reports from three referral centres, encompassing MRI, mammography, and ultrasound cases collected between January 2000 and October 2023. Each report had been previously categorised by board-certified radiologists using BI-RADS criteria. The LLMs were tasked with independently assigning categories based solely on the findings in the reports.
Agreement between the LLMs and human radiologists was evaluated using the Gwet first-order agreement coefficient (AC1). Results showed almost perfect agreement between the original radiologists and reviewing radiologists (AC1 = 0.91). The LLMs, however, achieved only moderate agreement, with AC1 scores of 0.52 for GPT-4, 0.48 for GPT-3.5, and 0.42 for Bard.
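For readers unfamiliar with the statistic, Gwet's AC1 corrects observed agreement for chance agreement, much like Cohen's kappa, but is more stable when category prevalences are skewed (as with BI-RADS distributions dominated by a few categories). A minimal sketch of the two-rater calculation is below; the function name and example ratings are illustrative, not from the study.

```python
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's first-order agreement coefficient (AC1) for two raters
    assigning categorical labels (e.g. BI-RADS categories).

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement
    and p_e is the chance-agreement probability based on the average
    prevalence of each category across both raters.
    """
    assert len(ratings_a) == len(ratings_b), "raters must score the same cases"
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    q = len(categories)

    # Observed agreement: fraction of cases both raters classify identically.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: pi_k is the average proportion of ratings in
    # category k across both raters; p_e = sum_k pi_k*(1-pi_k) / (q-1).
    counts = Counter(ratings_a) + Counter(ratings_b)
    p_e = sum((counts[k] / (2 * n)) * (1 - counts[k] / (2 * n))
              for k in categories) / (q - 1)

    return (p_a - p_e) / (1 - p_e)

# Hypothetical example: two raters agree on 3 of 4 BI-RADS assignments.
ac1 = gwet_ac1(["3", "4", "4", "5"], ["3", "4", "5", "5"])
```

Perfect agreement yields AC1 = 1, while values in the 0.4–0.6 range (as the LLMs scored here) are conventionally read as moderate.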
Notably, the LLMs produced a much higher rate of category misassignments that would alter clinical management. While human reviewers changed clinical management categories in 4.9% of cases, Bard did so in 25.5%, GPT-3.5 in 23.9%, and GPT-4 in 18.1%. Particularly troubling were downgraded or upgraded BI-RADS categories that could negatively affect patient care: these occurred in 1.5% of cases for human readers, but in 18.1% for Bard, 14.3% for GPT-3.5, and 10.6% for GPT-4.
The authors concluded that while LLMs hold promise for enhancing clinical workflows, their current performance in assigning BI-RADS categories introduces risks that outweigh benefits. “LLMs achieved moderate agreement with human-assigned BI-RADS categories but also yielded a high percentage of discordant assignments that could negatively impact clinical management,” the study authors noted.
These findings underline the need for further refinement of LLMs before they can be reliably integrated into complex clinical tasks, such as radiology interpretation, where accuracy is paramount.
Reference
Cozzi A et al. BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study. Radiology. 2024;311(1):e232133.