Evaluation of General Large Language Models in Understanding Clinical Concepts Extracted from Adult Critical Care Electronic Health Record Notes
41 Pages. Posted: 29 Feb 2024
Abstract
Objective: The field of healthcare has increasingly turned its focus toward Large Language Models (LLMs) because of their remarkable performance across many applications. While LLMs have shown promise on standardized medical exams, their performance in real clinical settings remains underexplored. We evaluated LLMs in the complex clinical context of adult critical care medicine using systematic and comprehensible analytic methods, including clinician annotation and adjudication.

Methods: We investigated the performance of three widely available general LLMs (GPT-4, GPT-3.5, and text-davinci-003) in understanding and processing real-world clinical notes. Text from 150 clinical notes was mapped with MetaMap to standardized medical concepts aligned with the Unified Medical Language System (UMLS) and then adjudicated by 9 clinicians. Each LLM's proficiency was evaluated by identifying the temporality and negation of these concepts, using a range of prompts for in-depth analysis. The performance of a fine-tuned LLaMA 2 model was compared with the other LLMs in a zero-shot setting. The LLMs were also assessed on 6 qualitative performance metrics.

Results: We developed a dataset of 2,288 clinical concepts annotated and adjudicated by 9 multidisciplinary clinicians. Across 3 different tasks, GPT-4 showed overall superior performance compared with the other LLMs. In a comparison of prompting strategies, GPT-4 performed consistently well across all prompts, indicating that such an advanced model does not require extensive prompt engineering to achieve optimal results. Across all 6 dimensions of qualitative assessment, GPT-4 was also superior, reaching 98.4% overall comprehensibility across all tasks. The GPT-family models were approximately 18 times faster than human experts while incurring only a quarter of the cost.

Conclusion: We developed and operationalized a comprehensive qualitative performance evaluation framework for LLMs in this context. The framework goes beyond single performance aspects such as relevance or correctness, encompassing factuality, relevance, completeness, logicality, clarity, and overall comprehensibility. Enhanced with expert annotations, this methodology not only validates LLMs' capabilities in processing complex medical data but also establishes a benchmark for future LLM evaluations across specialized domains.
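The abstract does not reproduce the prompt templates used in the study. Purely as an illustration of the kind of zero-shot query described in the Methods, the sketch below asks an LLM to judge the negation and temporality of a MetaMap-extracted concept within a note excerpt. The helper name `classify_concept`, the prompt wording, and the model choice are hypothetical assumptions, not the authors' actual protocol; the sketch assumes the OpenAI Python client and an API key in the environment.

```python
# Illustrative sketch only: a zero-shot query asking an LLM to judge the
# negation and temporality of a UMLS concept in a clinical note excerpt.
# Prompt wording, model choice, and helper name are hypothetical; they are
# not the prompts used in the paper. Assumes `pip install openai` and an
# OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def classify_concept(note_text: str, concept: str) -> str:
    """Ask the model whether `concept` is negated and whether it is
    current or historical in `note_text` (hypothetical prompt)."""
    prompt = (
        "You are reviewing an adult critical care note.\n"
        f"Note excerpt:\n{note_text}\n\n"
        f"For the clinical concept '{concept}', answer two questions:\n"
        "1. Negation: is the concept affirmed or negated?\n"
        "2. Temporality: is it current (present) or historical (past)?\n"
        "Respond as: negation=<affirmed|negated>; "
        "temporality=<current|historical>"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # GPT-3.5 or another model could be swapped in
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content

# Example: a concept MetaMap might extract from a note
print(classify_concept(
    "Patient denies chest pain. History of atrial fibrillation.",
    "chest pain",
))
```

In a setup like this, a clinician-adjudicated label for each concept would serve as the reference against which the model's parsed response is scored.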
Note:
Funding declaration: Research reported in this manuscript was partially supported by the Office of the Director, National Institutes of Health [OT award number OT2OD032701]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. SP was additionally supported by NIH R01 [grant numbers NS131606, NS129760]. JY was supported by NIH K23 [grant number GM138984].
Conflict of Interests: All authors declare no financial or non-financial competing interests.
Keywords: large language model, natural language processing, electronic health record, clinical note, GPT-3.5, GPT-4, LLaMA 2, text-davinci-003