Prompt Design for Medical Question Answering with Large Language Models
11 Pages · Posted: 8 Mar 2025
Abstract
Large language models (LLMs) are increasingly being evaluated in the medical domain. Given the scarcity of suitable datasets and the difficulty of evaluating free-text outputs, datasets with multiple-choice questions are often used for such studies. We evaluated six large LLMs (from families such as Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1, and Mistral) and six smaller models (from families such as Gemma 2B, Mistral Nemo, Llama 3.1, and Gemini 1.5 Flash) across five prompting techniques on neuro-oncology exam questions. Using the established MedQA dataset and a novel neuro-oncology question set, we compared basic prompting, chain-of-thought reasoning, and more complex agent-based methods incorporating external search capabilities. Results showed that the Reasoning and Acting (ReAct) approach, combined with giving the LLM access to Google Search, performed best with large models such as Claude 3.5 Sonnet (81.7% accuracy). However, the performance of prompting techniques varies across foundation models. While large models significantly outperformed smaller open-source ones on the MedQA dataset (79.3% vs. 51.2% accuracy), complex agentic patterns such as Language Agent Tree Search provided minimal benefits despite 5x higher latency. We recommend that practitioners keep experimenting with various techniques for their specific use case and chosen foundation model, and favor simple prompting patterns with large models, as these offer the best balance of accuracy and efficiency.
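For readers unfamiliar with the ReAct pattern referenced above, the sketch below illustrates the interleaved Thought/Action/Observation loop in schematic form. It is a minimal illustration, not the paper's implementation: `call_llm` and `google_search` are hypothetical placeholders for a foundation-model API and a search tool, and the prompt wording and step budget are assumptions.

```python
# Illustrative ReAct loop for multiple-choice medical QA.
# `call_llm` and `google_search` are hypothetical placeholders, not the
# paper's code or any specific library's API.
import re

REACT_PROMPT = """Answer the multiple-choice medical question.
Alternate between:
Thought: reason about what you know and what you need to look up.
Action: Search[query] to query the web, or Finish[letter] to answer.
Observation: the result of your last search (provided to you).

Question: {question}
{transcript}"""

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM, return one Thought/Action step."""
    raise NotImplementedError

def google_search(query: str) -> str:
    """Placeholder: return a short snippet of search results for `query`."""
    raise NotImplementedError

def react_answer(question: str, max_steps: int = 5) -> str | None:
    transcript = ""
    for _ in range(max_steps):
        step = call_llm(REACT_PROMPT.format(question=question,
                                            transcript=transcript))
        transcript += step + "\n"
        finish = re.search(r"Finish\[(.+?)\]", step)
        if finish:
            return finish.group(1)  # model committed to an answer choice
        search = re.search(r"Search\[(.+?)\]", step)
        if search:
            # Feed the search result back as an Observation for the next step.
            transcript += f"Observation: {google_search(search.group(1))}\n"
    return None  # no answer within the step budget
```

The key design point, and the likely source of the accuracy gains the abstract reports, is that the model's reasoning is grounded by fresh search results at each step rather than relying solely on parametric knowledge.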
Note:
Funding Information: The authors did not use any external funding or grants.
Conflict of Interest: The authors do not have any competing interests.
Keywords: Large Language Models, Generative AI, Medical Question-Answering, Agentic AI