Robust and Interpretable Machine Learning Assessment of Variable Importance with Moderate to Small Sample Sizes: A Study of Survival after Out-Of-Hospital Cardiac Arrest
23 Pages · Posted: 20 Apr 2023
Abstract
Background: There is increasing interest in updating regression-based evidence on variable importance using advanced machine learning (ML) methods. However, findings from black-box ML methods may not align well with clinical understanding, and the performance of both ML and regression approaches deteriorates with small sample sizes. We introduce an alternative method, the Shapley variable importance cloud (ShapleyVIC), that is less restricted by sample size.
Methods: ShapleyVIC integrates a regression-based approach with ML techniques for interpretable and robust variable importance assessment. By analyzing an ensemble of regression models, ShapleyVIC explicitly accounts for uncertainty in variable importance, reducing bias and improving resistance to sampling variability compared with conventional inference on a single model. In a study of 30-day survival after out-of-hospital cardiac arrest (OHCA), we compared ShapleyVIC with logistic regression and two commonly used ML methods (random forest and XGBoost) for assessing variable importance from the full cohort (n=7490) and for reproducing the findings in smaller subsets (n=2500 and n=500).
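The general idea of analyzing an ensemble of regression models can be illustrated with a minimal sketch. This is not the authors' ShapleyVIC implementation: ShapleyVIC uses Shapley-based importance values and a formal criterion for eligible models, whereas the sketch below substitutes a simple permutation-based loss increase and an ad hoc coefficient-perturbation rule for generating "nearly optimal" logistic regressions; all function names, thresholds, and the synthetic data are illustrative assumptions.

```python
# Illustrative sketch (assumptions, not the ShapleyVIC implementation):
# 1) fit an optimal logistic regression,
# 2) retain randomly perturbed coefficient vectors whose loss stays close to the optimum,
# 3) score variable importance for each retained model,
# 4) summarize the spread of importance across the ensemble.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

def nearly_optimal_ensemble(X, y, n_draws=200, tolerance=1.10):
    """Keep perturbed coefficient vectors whose loss is within `tolerance` x the optimal loss."""
    base = LogisticRegression(max_iter=1000).fit(X, y)
    beta_opt = np.concatenate([base.intercept_, base.coef_.ravel()])
    loss_opt = log_loss(y, base.predict_proba(X)[:, 1])
    kept = []
    for _ in range(n_draws):
        beta = beta_opt + rng.normal(scale=0.2, size=beta_opt.shape)  # random perturbation
        p = 1.0 / (1.0 + np.exp(-(X @ beta[1:] + beta[0])))
        if log_loss(y, p) <= tolerance * loss_opt:  # ad hoc "nearly optimal" check
            kept.append(beta)
    return np.array(kept)

def permutation_importance(beta, X, y, n_repeats=10):
    """Mean loss increase when each column is permuted (a crude stand-in for Shapley values)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta[1:] + beta[0])))
    base_loss = log_loss(y, p)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            pp = 1.0 / (1.0 + np.exp(-(Xp @ beta[1:] + beta[0])))
            increases.append(log_loss(y, pp) - base_loss)
        scores[j] = np.mean(increases)
    return scores

# Synthetic example: summarize each variable's importance as a mean and 95% interval
X = rng.normal(size=(500, 5))
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(2 * X[:, 0] - X[:, 1])))).astype(int)
ensemble = nearly_optimal_ensemble(X, y)
all_scores = np.array([permutation_importance(b, X, y) for b in ensemble])
for j in range(X.shape[1]):
    lo, hi = np.percentile(all_scores[:, j], [2.5, 97.5])
    print(f"x{j}: mean importance {all_scores[:, j].mean():.3f} (95% interval {lo:.3f} to {hi:.3f})")
```

The printed interval for each variable conveys the "cloud" of importance values across nearly optimal models, which is the kind of uncertainty statement that a single fitted model cannot provide.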
Findings: Both ShapleyVIC and conventional logistic regression identified important factors previously reported in the literature, but the low importance of race and the moderate importance of three prehospital interventions found by ShapleyVIC were more plausible than the opposite findings from the regression analysis. Random forest and XGBoost generated questionable variable rankings from the full cohort and were not applied to the smaller subsets. ShapleyVIC was generally consistent in shortlisting important variables at n=2500 and n=500, whereas logistic regression had attenuated statistical power and consistently identified only two variables when n=500.
Interpretation: ShapleyVIC is an interpretable and robust alternative to regression-based analyses and commonly used ML approaches for assessing variable importance in clinical applications with varying sample sizes.
Funding: This research received support from SingHealth Duke-NUS ACP Programme Funding (15/FY2020/P2/06-A79), National Medical Research Council, Clinician Scientist Award, Singapore (NMRC/CSA/024/2010, NMRC/CSA/0049/2013 and NMRC/CSA-SI/0014/2017) and Ministry of Health, Health Services Research Grant, Singapore (HSRG/0021/2012). YN is supported by the Khoo Postdoctoral Fellowship Award (project no. Duke-NUS-KPFA/2021/0051) from the Estate of Tan Sri Khoo Teck Puat.
Declaration of Interest: All other authors have no conflicts of interest to declare.
Ethical Approval: This study was approved by the Centralised Institutional Review Board (2013/604/C) and the Domain Specific Review Board (2013/00929).
Keywords: interpretable machine learning, variable importance, out-of-hospital cardiac arrest