A Knowledge-Informed Spectral Variable Interval Identification and Combination Based On the Hierarchical Clustering for Robust and Interpretable Analysis
21 Pages Posted: 29 Apr 2024
Abstract
Variable selection in regression problems can be viewed as an optimization process. However, purely data-driven selection methods may overlook physically relevant variables or feature structures that could otherwise be exploited to make the selection results more robust and interpretable. In this paper, we propose a knowledge-informed spectral variable hierarchical-clustering and optimal interval combination (HCIC) strategy to capture and exploit the underlying correlations among spectral wavelengths. In the first step, spectral variable hierarchical clustering (SVHC) is used to measure the correlation between adjacent variables and to generate a set of non-uniform intervals. These intervals are designed to separate the patterns or structural regions that arise from interactions between infrared light and chemical bonds, so that physically relevant characteristics can be exploited. In the second step, a Bayesian linear regression based optimal interval combination (BLR-OIC) strategy with weighted bootstrap sampling (WBS) is introduced to search for the most effective interval combinations. This strategy aims to emulate the synergistic effect among functional bands or functional groups in the spectral data. We conduct extensive experiments on publicly available and private datasets acquired with various spectroscopic techniques to verify the efficacy of the proposed algorithm. The results show improved prediction performance and robustness compared with benchmark methods, and they also demonstrate interpretable and consistent selection results.
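The first step described above, grouping adjacent wavelengths into non-uniform intervals by their correlation, can be illustrated with a minimal sketch. This is not the authors' SVHC implementation: the function names (`cluster_adjacent_variables`, `interval_profile`, `make_demo_spectra`), the similarity measure (Pearson correlation between the per-sample mean signals of neighbouring intervals), the stopping threshold `min_corr`, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def interval_profile(X, start, stop):
    """Per-sample mean signal over the variables in the half-open range [start, stop)."""
    return [sum(row[start:stop]) / (stop - start) for row in X]


def cluster_adjacent_variables(X, min_corr=0.9):
    """Agglomerative clustering restricted to neighbouring variables.

    X is a list of spectra (rows = samples, columns = wavelengths).
    Returns contiguous, non-overlapping (start, stop) index intervals.
    Assumption: similarity is the correlation between interval mean profiles.
    """
    # Start with every wavelength as its own interval.
    intervals = [(j, j + 1) for j in range(len(X[0]))]
    while len(intervals) > 1:
        profiles = [interval_profile(X, s, e) for s, e in intervals]
        # Only neighbouring intervals are candidates for merging.
        sims = [pearson(profiles[k], profiles[k + 1])
                for k in range(len(intervals) - 1)]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < min_corr:
            break  # no remaining pair of neighbours is similar enough
        intervals[best:best + 2] = [(intervals[best][0], intervals[best + 1][1])]
    return intervals


def make_demo_spectra(n_samples=40, seed=0):
    """Synthetic spectra (hypothetical data): columns 0-3 follow one latent
    signal, columns 4-9 follow a second, uncorrelated one, plus small noise."""
    rng = random.Random(seed)
    gains = [1.0, 0.8, 1.2, 0.9, 0.5, 0.7, 1.1, 1.3, 0.6, 1.0]
    X = []
    for i in range(n_samples):
        angle = 2 * math.pi * i / n_samples
        latents = (math.sin(angle), math.cos(angle))
        X.append([g * latents[0 if j < 4 else 1] + rng.gauss(0, 0.01)
                  for j, g in enumerate(gains)])
    return X


intervals = cluster_adjacent_variables(make_demo_spectra(), min_corr=0.9)
print(intervals)
```

Because merging is restricted to neighbouring intervals, every resulting cluster is a contiguous wavelength range, and interval widths adapt to the data rather than being fixed in advance. The intervals produced this way would then be the candidates passed to an interval combination search such as the BLR-OIC step.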
Keywords: Variable Interval Selection, Bayesian Linear Regression, Hierarchical Clustering, Multivariate Calibration, Chemometrics