Understanding the limitations of medical reasoning in large language models
2025
Large language models demonstrate impressive performance on standardized healthcare benchmarks, yet their readiness for deployment in real-world clinical environments remains poorly understood. Current medical benchmarks present idealized scenarios that misrepresent the complexity of actual clinical data. We systematically evaluate LLM robustness by introducing clinician-validated perturbations to MedQA that mirror authentic healthcare settings: medically irrelevant information (red herrings), clinical writing styles, and standard medical abbreviations. Our comprehensive evaluation across nine models reveals substantial fragility, with diagnostic accuracy dropping by up to 9.4%. Notably, semantic distractions pose the greatest threat, while some models demonstrate relative resilience to stylistic variations and medical abbreviations. Our work addresses the gap between benchmark performance and clinical deployment readiness, and provides a systematic framework for assessing AI robustness that generalizes to other healthcare domains.