Prompt Design for Medical Question Answering with Large Language Models

11 Pages · Posted: 8 Mar 2025

Leonid Kuligin (Contact Author)

affiliation not provided to SSRN

Jacqueline Lammert

affiliation not provided to SSRN

Aleksandr Ostapenko

affiliation not provided to SSRN

Maximilian Tschochohei

Google - Google Germany

Martin Boeker

affiliation not provided to SSRN

Keno Bressem

affiliation not provided to SSRN

Abstract

Large language models (LLMs) are increasingly being evaluated in the medical domain. Given the lack of datasets and the difficulty of evaluating free-text outputs, datasets with multiple-choice questions are often used for such studies. We evaluated six large LLMs (from families such as Claude 3.5 Sonnet, Gemini 1.5-pro, Llama 3.1, and Mistral) and six smaller models (from families such as Gemma 2B, Mistral Nemo, Llama 3.1, and Gemini 1.5-flash) across five prompting techniques on neuro-oncology exam questions. Using the established MedQA dataset and a novel neuro-oncology question set, we compared basic prompting, chain-of-thought reasoning, and more complex agent-based methods incorporating external search capabilities. Results showed that the Reasoning and Acting (ReAct) approach, combined with giving the LLM access to Google Search, performed best on large models such as Claude 3.5 Sonnet (81.7% accuracy). However, the performance of prompting techniques varies across foundational models. While large models significantly outperformed smaller open-source ones on the MedQA dataset (79.3% vs. 51.2% accuracy), complex agentic patterns such as Language Agent Tree Search provided minimal benefits despite 5x higher latency. We recommend that practitioners keep experimenting with various techniques for their specific use case and chosen foundational model, and favor simple prompting patterns with large models, as they offer the best balance of accuracy and efficiency.
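For readers unfamiliar with the prompting patterns named in the abstract, the minimal sketch below (not taken from the paper) contrasts a basic prompt with a chain-of-thought prompt for a hypothetical multiple-choice neuro-oncology question. The question text, answer options, and the commented-out `call_llm` client are illustrative placeholders, not the authors' evaluation code.

```python
# Minimal sketch of the "basic" vs. "chain-of-thought" prompting patterns
# compared in the paper. All content below is illustrative; `call_llm` stands
# in for whichever model client (Claude, Gemini, Llama, Mistral) is evaluated.

QUESTION = "Which imaging modality is first-line for a suspected glioblastoma?"
OPTIONS = {
    "A": "Contrast-enhanced MRI",
    "B": "Plain skull X-ray",
    "C": "Abdominal ultrasound",
    "D": "Bone scintigraphy",
}

def format_options(options: dict[str, str]) -> str:
    # Render options as "A. ...", "B. ..." lines for the prompt.
    return "\n".join(f"{key}. {text}" for key, text in options.items())

def basic_prompt(question: str, options: dict[str, str]) -> str:
    # Basic pattern: ask directly for the letter of the correct answer.
    return (
        "Answer the following multiple-choice question.\n\n"
        f"{question}\n{format_options(options)}\n\n"
        "Reply with the letter of the correct option only."
    )

def chain_of_thought_prompt(question: str, options: dict[str, str]) -> str:
    # Chain-of-thought pattern: ask the model to reason step by step
    # before committing to a final answer.
    return (
        "Answer the following multiple-choice question.\n\n"
        f"{question}\n{format_options(options)}\n\n"
        "Think step by step about each option, then state the final answer "
        "as a single letter on the last line."
    )

# Hypothetical usage with an unspecified model client:
# answer = call_llm(chain_of_thought_prompt(QUESTION, OPTIONS))
```

The agentic patterns mentioned in the abstract (ReAct with Google Search, Language Agent Tree Search) extend this idea by letting the model interleave reasoning steps with tool calls; they are not shown here.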

Note:
Funding Information: The authors did not use any external funding or grants.

Conflict of Interest: The authors do not have any competing interests.

Keywords: Large Language Models, Generative AI, Medical Question-Answering, Agentic AI

Suggested Citation

Kuligin, Leonid and Lammert, Jacqueline and Ostapenko, Aleksandr and Tschochohei, Maximilian and Boeker, Martin and Bressem, Keno, Prompt Design for Medical Question Answering with Large Language Models. Available at SSRN: https://ssrn.com/abstract=5165960 or http://dx.doi.org/10.2139/ssrn.5165960
