Evaluation of General Large Language Models in Understanding Clinical Concepts Extracted from Adult Critical Care Electronic Health Record Notes

41 Pages
Posted: 29 Feb 2024


Darren Liu

Emory University

Cheng Ding

Georgia Institute of Technology

Delgersuren Bold

Emory University

Monique Bouvier

Emory University

Jiaying Lu

Emory University

Benjamin Shickel

University of Florida

Craig S. Jabaley

Emory University School of Medicine - Department of Anesthesiology

Wenhui Zhang

Emory University

Soojin Park

Columbia University - Division of Hospital and Critical Care Neurology

Michael Young

Department of Neurology, Massachusetts General Hospital

Mark S. Wainwright

University of Washington

Gilles Clermont

University of Pittsburgh

Parisa Rashidi

University of Florida

Eric S. Rosenthal

Massachusetts General Hospital

Laurie Dimisko

Emory University

Xiao Ran

Emory University

JooHeung Yoon

University of Pittsburgh

Carl Yang

Emory University

Xiao Hu

Emory University

Abstract

Objective: The field of healthcare has increasingly turned its focus toward Large Language Models (LLMs) because of their remarkable performance across many applications. While LLMs have shown promise on standardized medical examinations, their performance in real clinical settings remains underexplored. We sought to evaluate the performance of LLMs in the complex clinical context of adult critical care medicine using systematic and comprehensible analytic methods, including clinician annotation and adjudication.

Methods: We investigated the performance of three widely available general LLMs (GPT-4, GPT-3.5, and text-davinci-003) in understanding and processing real-world clinical notes. Text from 150 clinical notes was mapped with MetaMap to standardized medical concepts aligned with the Unified Medical Language System (UMLS) and then adjudicated by 9 clinicians. Each LLM's proficiency was evaluated on identifying the temporality and negation of these concepts, using a range of prompts for in-depth analysis. The performance of a fine-tuned LLaMA 2 model was compared with that of the other LLMs in a zero-shot setting. LLM performance was also assessed with 6 different qualitative performance metrics.

Results: We developed a dataset of 2,288 clinical concepts annotated and adjudicated by 9 multidisciplinary clinicians. Across 3 different tasks, GPT-4 showed overall superior performance compared with the other LLMs. In comparing different prompting strategies, GPT-4 demonstrated consistently high performance across all prompts, indicating that such an advanced model does not require extensive prompt engineering to achieve optimal results. Across all 6 dimensions of qualitative assessment, GPT-4 also showed superior performance, reaching 98.4% overall comprehensibility across all tasks. The GPT-family models completed the tasks approximately 18 times faster than human experts while incurring only a quarter of the cost.

Conclusion: We developed and operationalized a comprehensive qualitative performance evaluation framework for LLMs in this context. The framework goes beyond singular performance aspects such as relevance or correctness, encompassing metrics including factuality, relevance, completeness, logicality, clarity, and overall comprehensiveness. Enhanced with expert annotations, this methodology not only validates LLMs’ capabilities in processing complex medical data but also establishes a benchmark for future LLM evaluations across specialized domains.
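As an illustration of the concept-level evaluation described in the Methods, the sketch below shows one way a general LLM could be queried about the negation and temporality of a MetaMap-extracted UMLS concept. This is a minimal example assuming the OpenAI Python client (v1.x) and an API key in the environment; the prompt wording, the classify_concept helper, and the sample sentence are hypothetical and are not taken from the study.

```python
# Minimal illustrative sketch (not the authors' actual prompts or pipeline):
# given a sentence from a clinical note and a UMLS concept extracted by MetaMap,
# ask a general LLM to judge the concept's negation and temporality.
# Assumes the OpenAI Python client (>=1.0) and OPENAI_API_KEY set in the environment.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_concept(sentence: str, concept: str, model: str = "gpt-4") -> str:
    """Ask the model whether `concept` is negated in `sentence` and whether it
    refers to a current or historical finding. The prompt wording below is a
    hypothetical example, not the prompt used in the study."""
    prompt = (
        "You are reviewing a sentence from an adult ICU clinical note.\n"
        f'Sentence: "{sentence}"\n'
        f'Concept: "{concept}"\n'
        "Answer two questions:\n"
        "1. Negation: is the concept affirmed or negated in the sentence?\n"
        "2. Temporality: does the concept describe a current or a historical finding?\n"
        "Reply in the form: negation=<affirmed|negated>; temporality=<current|historical>."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output to make evaluation repeatable
    )
    return response.choices[0].message.content.strip()


# Example usage with a synthetic sentence (not from the study's dataset):
# print(classify_concept("Patient denies chest pain on admission.", "chest pain"))
# expected output along the lines of: "negation=negated; temporality=current"
```

In the study, model outputs of this kind were scored against clinician annotations and adjudications; the structured reply format shown here is simply one convenient way to make such outputs machine-comparable.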

Note:
Funding declaration: Research reported in this manuscript was partially supported by the Office of the Director, National Institutes of Health [OT award number OT2OD032701]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. SP was additionally supported by NIH R01 [grant numbers NS131606, NS129760]. JY was supported by NIH K23 [grant number GM138984].

Conflict of Interests: All authors declare no financial or non-financial competing interests.

Keywords: large language model, natural language processing, electronic health record, clinical note, GPT-3.5, GPT-4, LLaMA 2, text-davinci-003

Suggested Citation

Liu, Darren and Ding, Cheng and Bold, Delgersuren and Bouvier, Monique and Lu, Jiaying and Shickel, Benjamin and Jabaley, Craig S. and Zhang, Wenhui and Park, Soojin and Young, Michael and Wainwright, Mark S. and Clermont, Gilles and Rashidi, Parisa and Rosenthal, Eric S. and Dimisko, Laurie and Ran, Xiao and Yoon, JooHeung and Yang, Carl and Hu, Xiao, Evaluation of General Large Language Models in Understanding Clinical Concepts Extracted from Adult Critical Care Electronic Health Record Notes. Available at SSRN: https://ssrn.com/abstract=4734730 or http://dx.doi.org/10.2139/ssrn.4734730

Darren Liu

Emory University ( email )

201 Dowman Drive
Atlanta, GA 30322
United States

Cheng Ding

Georgia Institute of Technology ( email )

Atlanta, GA 30332
United States

Delgersuren Bold

Emory University ( email )

Monique Bouvier

Emory University ( email )

201 Dowman Drive
Atlanta, GA 30322
United States

Jiaying Lu

Emory University ( email )

Benjamin Shickel

University of Florida ( email )

PO Box 117165, 201 Stuzin Hall
Gainesville, FL 32610-0496
United States

Craig S. Jabaley

Emory University School of Medicine - Department of Anesthesiology ( email )

201 Dowman Drive
Atlanta, GA 30332
United States
404-778-7777 (Phone)

Wenhui Zhang

Emory University

Soojin Park

Columbia University - Division of Hospital and Critical Care Neurology ( email )

Michael Young

Department of Neurology, Massachusetts General Hospital ( email )

55 Fruit Street
Wang Ambulatory Care Center, 8th Floor, Suite 835
Boston, MA 02114
United States

Mark S. Wainwright

University of Washington ( email )

Seattle, WA 98195
United States

Gilles Clermont

University of Pittsburgh ( email )

135 N Bellefield Ave
Pittsburgh, PA 15260
United States

Parisa Rashidi

University of Florida ( email )

PO Box 117165, 201 Stuzin Hall
Gainesville, FL 32610-0496
United States

Eric S. Rosenthal

Massachusetts General Hospital ( email )

Laurie Dimisko

Emory University ( email )

201 Dowman Drive
Atlanta, GA 30322
United States

Xiao Ran

Emory University ( email )

201 Dowman Drive
Atlanta, GA 30322
United States

JooHeung Yoon

University of Pittsburgh ( email )

Carl Yang

Emory University ( email )

Xiao Hu (Contact Author)

Emory University ( email )

201 Dowman Drive
Atlanta, GA 30322
United States

