Abstract
Background: Recent advances in large language models (LLMs) such as ChatGPT and LLaMA have shown promise for medical applications, but their performance in medical language understanding still needs improvement. This study aims to develop foundational medical LLMs by training open-source LLaMA models on large-scale, domain-specific datasets to enhance their efficacy across a variety of medical text analysis tasks and a medical diagnosis task.
Methods: We developed Me-LLaMA, a new family of medical LLMs comprising the foundation models Me-LLaMA 13B/70B and their chat-enhanced versions Me-LLaMA 13B/70B-chat, through continual pre-training and instruction tuning of LLaMA2 on both biomedical literature and clinical notes. Me-LLaMA used the largest and most comprehensive medical training data to date, including 129B pre-training tokens and 214K instruction tuning samples from diverse biomedical and clinical data sources; training required substantial computing resources, e.g., over 100,000 A100 GPU hours for the 70B models. We then applied Me-LLaMA to six important biomedical text analysis tasks (Question Answering, Named Entity Recognition, Relation Extraction, Text Classification, Text Summarization, and Natural Language Inference) and evaluated its performance on 12 benchmark datasets. To further assess Me-LLaMA's potential clinical utility, we also evaluated the Me-LLaMA models on a complex clinical case diagnosis task and compared their performance with that of commercial LLMs, using both automatic and human evaluation.
Findings: Our extensive evaluation shows that Me-LLaMA models outperform LLaMA, as well as other existing open-source medical LLMs, in both zero-shot and supervised learning settings for most text analysis tasks. With task-specific instruction tuning, Me-LLaMA models also surpass leading commercial LLMs, including ChatGPT on 7 of 8 datasets and GPT-4 on 5 of 8 datasets. Moreover, on the complex clinical case diagnosis task, Me-LLaMA's performance is comparable to that of ChatGPT and GPT-4.
Interpretation: Domain-specific data is important for building medical foundation LLMs that can improve diverse downstream text analysis tasks and medical applications. The computing costs associated with training medical foundation models are substantial and require careful consideration when selecting among training strategies (i.e., pre-training vs. fine-tuning). Me-LLaMA models are now publicly available through appropriate user agreements, making them a valuable resource for medical AI applications.
Funding: National Institutes of Health (NIH); Patient-Centered Outcomes Research Institute (PCORI).
Declaration of Interest: The authors have no financial or non-financial conflicts of interest to disclose.
Xie, Qianqian; Chen, Qingyu; Chen, Aokun; Peng, Cheng; Hu, Yan; Lin, Fongci; Peng, Xueqing; Huang, Jimin; Zhang, Jeffrey; Keloth, Vipina K.; Zhou, Xinyu; Qian, Lingfei; He, Huan; Shung, Dennis; Ohno-Machado, Lucila; Wu, Yonghui; Xu, Hua; Bian, Jiang. Me-LLaMA: Medical Foundation Large Language Models for Comprehensive Text Analysis and Beyond. Available at SSRN: https://ssrn.com/abstract=4943761 or http://dx.doi.org/10.2139/ssrn.4943761