A comprehensive qualitative evaluation framework for large language models (LLMs) in healthcare that expands beyond accuracy and traditional quantitative metrics is needed. We propose five key aspects for the evaluation of LLMs: Safety, Consensus & Context, Objectivity, Reproducibility, and Explainability (S.C.O.R.E.). We suggest that S.C.O.R.E. may form the basis of an evaluation framework for future LLM-based models that are safe, reliable, trustworthy, and ethical for healthcare and clinical applications.
Keywords: large language models, healthcare, evaluation, generative artificial intelligence
Tan, Ting Fang; Elangovan, Kabilan; Ong, Jasmine Chiat Ling; Lee, Aaron; Shah, Nigam H.; Sung, Joseph J. Y.; Wong, Tien Yin; Lan, Xue; Liu, Nan; Wang, Haibo; Kuo, Chang-Fu; Chesterman, Simon; Yeong, Zee Kin; Ting, Daniel Shu Wei. A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models – Safety, Consensus & Context, Objectivity, Reproducibility and Explainability. Available at SSRN: https://ssrn.com/abstract=5029562 or http://dx.doi.org/10.2139/ssrn.5029562
This version of the paper has not been formally peer reviewed.