Large language models (LLMs) are praised for offering support and easing workloads in healthcare settings. A recent study presented at the European Respiratory Society (ERS) Congress in Vienna demonstrated that ChatGPT significantly outperformed trainee doctors in diagnosing complex respiratory conditions in children.
The study group assessed the performance of AI tools ChatGPT (OpenAI, San Francisco, California), Bard (now Gemini; Google AI, Mountain View, California) and Bing (Microsoft, Redmond, Washington) against trainee doctors in providing responses to complex pediatric respiratory scenarios such as cystic fibrosis, asthma, and chest infections.
Six clinical scenarios were presented to the LLMs (respondents) and the trainee doctors. The doctors were given a 1-hour time limit and internet access that excluded LLMs. Responses were randomised and scored overall by six experts against the criteria of correctness, comprehensiveness, utility, plausibility, coherence, and ‘humanness’. Differences in overall scores between respondents, and pairwise differences, were tested for statistical significance.
ChatGPT (version 3.5) outperformed the trainee doctors with an average score of 7 out of 9, and its responses were often mistaken by the experts for human-generated. Bard scored 6, while Bing matched the trainee doctors, both averaging a score of 4. The experts were able to identify Bard and Bing responses as non-human. No fabricated facts were present in LLM responses.
The study was led by Manjith Narayanan, consultant in paediatric pulmonology at the Royal Hospital for Children and Young People, Edinburgh, UK. He stated: “Our study is the first, to our knowledge, to test LLMs against trainee doctors in situations that reflect real-life clinical practice.”
Narayanan highlighted the potential of LLMs to support clinicians with complex cases. By allowing trainee doctors internet access, mimicking real-world conditions, the study avoided testing simple memory recall and focused instead on clinical reasoning and problem-solving. He added: “… this study shows us another way we could be using LLMs and how close we are to regular day-to-day clinical application.” Experts also stressed the need for extensive testing to ensure clinical safety, accuracy, and fairness before adopting AI in routine care.
Reference: Narayanan M et al. Clinical scenarios in paediatric pulmonology: Can large language models fare better than trainee doctors? Abstract OA2762. European Respiratory Society (ERS) Congress, September 9, 2024.
Anaya Malik | AMJ