
Preprints with The Lancet is a collaboration between The Lancet Group of journals and SSRN to facilitate the open sharing of preprints for early engagement, community comment, and collaboration. Preprints available here are not Lancet publications or necessarily under review with a Lancet journal. These preprints are early-stage research papers that have not been peer-reviewed. The usual SSRN checks and a Lancet-specific check for appropriateness and transparency have been applied. The findings should not be used for clinical or public health decision-making or presented without highlighting these facts. For more information, please see the FAQs.
Evaluation of LLMs' Accuracy and Application in Oncology Principles and Practice
20 Pages · Posted: 10 Mar 2025
Abstract
Background: In recent years, large language models (LLMs) have offered physicians and patients a new avenue for tumor diagnosis and treatment, showcasing distinctive potential. Our study assessed 16 LLMs, including ChatGPT, DeepSeek, Claude, Grok, and Llama, with particular focus on their diagnostic precision and the comprehensibility of their answers to oncology-related inquiries, while investigating performance variation among these models.
Methods: We developed 549 single-choice/true-false questions and 10 short-answer questions to evaluate clinical oncology knowledge, based on standard textbooks, guidelines, and literature. Our study enrolled five participant groups for diagnostic and treatment testing in oncology and thoracic oncology: attending physicians, resident physicians, academic professionals, the general public, and LLMs. We prompted the sixteen generative LLMs to adopt an oncologist persona and answer the questions independently. Readability was measured with the Flesch Reading Ease Score (FRES). Three consultant-level oncology specialists independently rated the LLMs' responses to the short-answer questions on a 3-point accuracy scale.
Findings: GPT o1, DeepSeek-R1, and GPT o3-mini achieved the top overall accuracy (89.44%-90.16%), while Llama-3.2-1B performed lowest at 32.42%. GPT o1 demonstrated the highest accuracy in the management of tumor complications and emergencies (97.50%) and in hematologic tumors (94.44%), while DeepSeek-R1 excelled in the molecular biology of cancer (94.44%) and achieved perfect accuracy (100%) in cancer pain management. The LLM group outperformed all other groups with an accuracy of 89.39% in the comprehensive test and achieved performance comparable to attending physicians in thoracic oncology testing (91.6% vs. 89.4%, p = 0.5887). The FRES evaluation revealed that DeepSeek-R1 had the highest response readability. Additionally, Grok 3 and DeepSeek-R1 outperformed the other models in response quality, each garnering 50% "excellent" ratings.
Interpretation: These results can guide the selection of oncology-specific LLMs through clinical evaluation of their capabilities and limitations, and future refinements may further enhance their utility in optimizing the efficiency and accuracy of cancer care.
Funding: None.
Keywords: Large Language Models, Oncology, Diagnostic Accuracy, Answer Readability