Development and Testing of Retrieval Augmented Generation in Large Language Models
27 Pages · Posted: 9 Feb 2024
Abstract
Purpose: Large Language Models (LLMs) hold significant promise for medical applications. Yet their practical implementations often fail to incorporate current, guideline-grounded knowledge specific to clinical specialties and tasks. Additionally, conventional accuracy-enhancing methods such as fine-tuning pose considerable computational challenges.
Retrieval Augmented Generation (RAG) has emerged as a promising approach for customizing domain knowledge in LLMs, and it is particularly well suited to the needs of healthcare implementations. This case study presents the development and evaluation of an LLM-RAG pipeline tailored for healthcare, focusing specifically on preoperative medicine. The accuracy and safety of the responses generated by the LLM-RAG system were evaluated as primary endpoints.
Methods: We developed an LLM-RAG model using 35 preoperative guidelines and tested it against human-generated responses, with a total of 1260 responses evaluated (336 human-generated, 336 LLM-generated, and 588 LLM-RAG-generated).
The RAG process involved converting clinical documents into text using Python-based frameworks such as LangChain and LlamaIndex, then splitting the texts into chunks for embedding and retrieval. Embedding models and vector storage techniques were selected to optimize data retrieval, with Pinecone used for vector storage (dimensionality of 1536, cosine similarity as the similarity metric). LLMs including GPT3.5, GPT4.0, Llama2-7B, and Llama2-13B, along with their LLM-RAG counterparts, were evaluated.
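For illustration, the following is a minimal sketch of such an ingest-and-retrieve pipeline, assuming LangChain's circa-2023 API, OpenAI embeddings, and a pre-created Pinecone index; the file name, index name, chunking parameters, and example query are hypothetical and not taken from the study.

```python
# Minimal LLM-RAG sketch (not the authors' exact code): ingest a guideline
# PDF, chunk it, embed the chunks into Pinecone, then answer a question
# with retrieved context. Assumes LangChain's circa-2023 API and the v2
# pinecone-client; file/index names and parameters are illustrative.
import os
import pinecone
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

pinecone.init(api_key=os.environ["PINECONE_API_KEY"],
              environment=os.environ["PINECONE_ENV"])

# 1. Convert a clinical guideline document into text.
docs = PyPDFLoader("preop_guideline.pdf").load()

# 2. Split the text into overlapping chunks for embedding and retrieval.
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100).split_documents(docs)

# 3. Embed each chunk; text-embedding-ada-002 produces 1536-dimensional
#    vectors, matching the dimensionality reported in the abstract. The
#    Pinecone index is assumed to be configured with cosine similarity.
store = Pinecone.from_documents(
    chunks, OpenAIEmbeddings(model="text-embedding-ada-002"),
    index_name="preop-guidelines")

# 4. Retrieve the chunks most similar to the question and let the LLM
#    ground its answer in them (the RAG step).
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4", temperature=0),
    retriever=store.as_retriever(search_kwargs={"k": 4}))
print(qa.run("How long should a patient fast before elective surgery?"))
```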
We evaluated the system using 14 de-identified clinical scenarios, focusing on six key aspects of preoperative instructions. The correctness of the responses was determined against established guidelines and by expert panel review. Human-generated answers, provided by junior doctors, served as the comparator. Comparative analysis used Cohen's h for effect size and chi-square tests for significance.
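As an illustration of this analysis, the sketch below computes a chi-square test and Cohen's h for two accuracy proportions; the 2x2 counts are back-derived from the accuracies reported in the Results (86.3% of 336 human responses, 91.4% of 336 GPT4.0-RAG responses) and are approximate, not the study's raw data.

```python
# Illustrative comparison of two response-accuracy proportions using a
# chi-square test and Cohen's h. Counts are back-derived from the
# abstract's reported accuracies, not taken from the study's raw data.
import numpy as np
from scipy.stats import chi2_contingency

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# Hypothetical 2x2 table: rows = human vs GPT4.0-RAG, cols = correct vs incorrect.
table = np.array([[290, 46],    # human: ~86.3% of 336 correct
                  [307, 29]])   # GPT4.0-RAG: ~91.4% of 336 correct
chi2, p_value, dof, _ = chi2_contingency(table)
h = cohens_h(table[0, 0] / table[0].sum(), table[1, 0] / table[1].sum())
print(f"chi2={chi2:.2f}, p={p_value:.3f}, Cohen's h={h:.2f}")
```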
Results: The LLM-RAG model generated answers in 15-20 seconds on average, substantially faster than the roughly 10 minutes typically required by humans. Among the base LLMs, GPT4.0 achieved the highest accuracy (80.1%), which rose to 91.4% when the model was augmented with RAG. Against the human-generated instructions (86.3% accuracy), the GPT4.0-RAG model demonstrated non-inferiority (p=0.610).
Conclusions: In this case study, we demonstrated an LLM-RAG model for healthcare implementation. The model generated complex preoperative instructions across different clinical tasks with accuracy non-inferior to that of humans and a low rate of hallucination. The pipeline's grounded knowledge, easy updatability, and scalability are important advantages for healthcare LLM deployment.
Funding: The authors have not declared a specific grant for this research from any funding agency in the public, commercial, or not-for-profit sectors.
Declaration of Interest: None to declare.
Keywords: Large language model, artificial intelligence, retrieval-augmented generation