Application of Data Quality Optimization and Machine Learning Techniques to Predict the Dissolved Oxygen Concentration in the Tide-Sensitive Tanjiang River
46 Pages Posted: 8 Nov 2023
Abstract
The dissolved oxygen (DO) concentration is an important index for evaluating the quality of the water environment. Here, we constructed three models (SVR, MIC-SVR, and WT-MIC-SVR) that combine data quality optimization techniques (the maximum information coefficient (MIC) and wavelet transform (WT) noise reduction) with a machine learning technique (support vector regression (SVR)) to predict the DO concentration in the Tanjiang River, a typical tide-sensitive river in southern China, and to identify the optimal model. Specifically, the MIC was used to screen the main environmental factors affecting DO changes, the WT was used for data noise reduction, and SVR was used to predict DO concentrations in the resulting hybrid models. The MIC technique effectively screened the main environmental factors affecting the DO concentration, whereas the WT technique did not improve the predictive performance of the model. The best performance was achieved by the MIC-SVR model: compared with the SVR model, its root mean square error (RMSE) was 4.46% lower, and its coefficient of determination (R²) and Nash-Sutcliffe efficiency (NSE) were 23.26% and 45.85% higher, respectively. In addition, a study of kernel function selection revealed that considering as many kernel functions as possible is necessary for improving the performance of the SVR model. The main factors affecting DO changes in different regions (basins) were basically the same as those observed in previous studies; however, the secondary factors differed. Screening these secondary factors as a supplement to the model input variables can improve the predictive performance of the model to a certain extent.
Overall, the proposed MIC-SVR model can be used to analyze the relationship between DO and environmental factors in order to find the main causes of low DO, and accurately predict the DO concentration in the Tanjiang River (especially in its tidally sensitive reaches), thus providing assistance in water environment protection and water resource management.
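The model comparison above is reported as relative changes in RMSE, R2, and NSE. The following is a minimal sketch of how such metrics and percentage changes could be computed, not the authors' code. The function names are hypothetical, R2 is assumed here to be the squared Pearson correlation between observed and predicted values (the abstract does not say which definition was used), and the relative change is assumed to be measured against the plain SVR baseline.

```python
import math


def rmse(obs, pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the residual sum of
    squares to the sum of squares of observations about their mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - p) ** 2 for o, p in zip(obs, pred))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / spread


def r_squared(obs, pred):
    """Squared Pearson correlation (an assumed definition of R2).
    Undefined when either series is constant."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov * cov / (var_o * var_p)


def percent_change(baseline, candidate):
    """Relative change of a candidate metric against a baseline, in percent.
    Negative values mean the candidate is lower than the baseline."""
    return 100.0 * (candidate - baseline) / baseline
```

With these helpers, a statement such as "the RMSE was 4.46% lower" would correspond to `percent_change(rmse_svr, rmse_mic_svr)` being about -4.46, where `rmse_svr` and `rmse_mic_svr` are hypothetical names for the two models' test-set RMSE values.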
Keywords: DO prediction, Support vector machine, Maximum information coefficient, Kernel function selection, Data quality optimization