Preprints with The Lancet is part of SSRN's First Look, a place where journals identify content of interest prior to publication. Authors have opted in at submission to The Lancet family of journals to post their preprints on Preprints with The Lancet. The usual SSRN checks and a Lancet-specific check for appropriateness and transparency have been applied. Preprints available here are not Lancet publications or necessarily under review with a Lancet journal. These preprints are early stage research papers that have not been peer-reviewed. The findings should not be used for clinical or public health decision making and should not be presented to a lay audience without highlighting that they are preliminary and have not been peer-reviewed. For more information on this collaboration, see the comments published in The Lancet about the trial period, and our decision to make this a permanent offering, or visit The Lancet's FAQ page, and for any feedback please contact preprints@lancet.com.
Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness
45 Pages Posted: 2 Jul 2024
Abstract
Purpose: Large Language Models (LLMs) offer potential for medical applications but often lack the specialized knowledge needed for clinical tasks. Retrieval Augmented Generation (RAG) is a promising approach that allows LLMs to be customized with domain-specific knowledge, making it well suited for healthcare. We assessed the accuracy, consistency, and safety of RAG models in determining a patient's fitness for surgery and providing additional crucial preoperative instructions.
Methods: We developed LLM-RAG models using 35 local and 23 international preoperative guidelines and tested them against human-generated responses, with a total of 3682 responses evaluated. Clinical documents were processed, stored, and retrieved using LlamaIndex. Ten LLMs (GPT-3.5, GPT-4, GPT-4o, Llama2-7B, Llama2-13B, Llama2-70B, Llama3-8B, Llama3-70B, Gemini-1.5-Pro, and Claude-3-Opus) were evaluated in three configurations: 1) the native model without retrieval, 2) RAG with local preoperative guidelines, and 3) RAG with international preoperative guidelines. Fourteen clinical scenarios were assessed, focusing on seven aspects of preoperative instructions. Correct responses were determined by established guidelines and expert physician judgment. Human-generated answers from senior attending anesthesiologists and junior doctors served as a comparison. Comparative analysis was conducted using Fisher's exact test, and inter-rater agreement was assessed within the human and LLM responses.
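The retrieve-then-generate pattern underlying these models can be sketched in a few lines. The snippet below is an illustrative stand-in only: the guideline snippets, the query, and the keyword-overlap scoring are hypothetical, not the authors' LlamaIndex pipeline, which uses embedding-based retrieval over the actual clinical guidelines.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# Guideline text and scoring are illustrative placeholders, not the
# study's real clinical content or its LlamaIndex implementation.

GUIDELINES = [
    "Patients should fast from solid food for 6 hours before surgery.",
    "Clear fluids are permitted up to 2 hours before anaesthesia.",
    "Continue beta-blockers on the morning of surgery.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank snippets by word overlap with the query (a crude stand-in
    for the embedding-based similarity search a real RAG system uses)."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(q & set(doc.lower().split())))
    return ranked[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Ground the LLM prompt in the retrieved guideline text, so the
    model answers from domain knowledge rather than from memory alone."""
    context = "\n".join(retrieve(query, corpus))
    return f"Using only these guidelines:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How long should the patient fast before surgery?", GUIDELINES)
```

In the study's setup, the assembled prompt would then be sent to one of the ten LLMs; swapping the guideline corpus (local vs. international) changes only the retrieval store, which is what makes the approach upgradable without retraining the model.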
Results: The LLM-RAG models were efficient, generating answers within 20 seconds, with guideline retrieval taking less than 5 seconds; this is faster than the 10 minutes typically estimated by clinicians. Notably, the LLM-RAG model using GPT-4 achieved the highest accuracy in assessing fitness for surgery, surpassing human-generated responses (96.4% vs. 86.6%, p=0.016). The RAG models generalized well, performing similarly with both international and local guidelines. Additionally, the GPT-4 LLM-RAG model produced no hallucinations and generated correct preoperative instructions comparable to those from clinicians.
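The accuracy comparison above (96.4% vs. 86.6%, p=0.016) relies on Fisher's exact test. As a worked illustration of how that test operates on a 2x2 table of correct/incorrect counts, here is a self-contained two-sided implementation using the hypergeometric distribution. The counts fed into any real analysis would come from the study's response tallies, which are not reported in this abstract, so no attempt is made to reproduce the published p-value.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]],
    e.g. rows = rater groups, columns = correct/incorrect responses.
    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def p_table(x: int) -> float:
        # Probability of the table whose top-left cell is x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo = max(0, col1 - row2)   # smallest feasible top-left cell
    hi = min(row1, col1)       # largest feasible top-left cell
    # Small relative tolerance so ties with p_obs are counted despite
    # floating-point rounding.
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-12))
```

For a perfectly balanced table such as [[1, 1], [1, 1]] the test returns 1.0 (no evidence of association), while a maximally skewed table like [[10, 0], [0, 10]] yields a very small p-value.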
Conclusions: This study successfully implements LLM-RAG models for preoperative healthcare tasks, emphasizing the benefits of grounded knowledge, upgradability, and scalability for effective deployment in healthcare settings.
Funding: The authors have not declared a specific grant for this research from any funding agency in the public, commercial, or not-for-profit sectors.
Declaration of Interest: None declared.
Ethical Approval: This study involved the analysis of de-identified patient data. As the study did not involve the collection, use, or disclosure of identifiable private information, and since the data were not collected through interaction or intervention with individuals specifically for research purposes, it was determined that IRB oversight was not required. All data used were accessed in compliance with applicable privacy laws and institutional policies.
Keywords: Large language model, Artificial intelligence, Retrieval-augmented generation