A Systematic Method for Selecting Molecular Descriptors as Features When Training Models for Predicting Physiochemical Properties

29 Pages Posted: 20 Dec 2021

Ana E. Comesana

University of California, Berkeley - Lawrence Berkeley National Laboratory (Berkeley Lab)

Tyler Huntington

University of California, Berkeley - Lawrence Berkeley National Laboratory (Berkeley Lab); University of California, Berkeley - Biosciences Area; University of California, Berkeley - Biological Systems and Engineering Division

Corinne D. Scown

University of California, Berkeley - Biological Systems and Engineering Division; University of California, Berkeley - Lawrence Berkeley National Laboratory (Berkeley Lab)

Kyle E. Niemeyer

Oregon State University

Vi Rapp

Lawrence Berkeley Laboratory

Abstract

Machine learning has proven to be a powerful tool for accelerating biofuel development. Although numerous models are available to predict a range of properties using chemical descriptors, there is a trade-off between interpretability and performance. Neural networks provide predictive models with high accuracy at the expense of some interpretability, while simpler models such as linear regression often lack accuracy. In addition to model architecture, feature selection is also critical for developing interpretable and accurate predictive models. We present a method for systematically selecting molecular descriptor features and developing interpretable machine learning models without sacrificing accuracy. Our method simplifies the process of selecting features by reducing feature multicollinearity and enables discoveries of new relationships between global properties and molecular descriptors. To demonstrate our approach, we developed models for predicting melting point, boiling point, flash point, yield sooting index, and net heat of combustion with the help of the Tree-based Pipeline Optimization Tool (TPOT). For training, we used publicly available experimental data for up to 8351 molecules. Our models accurately predict various molecular properties for organic molecules (mean absolute percent error (MAPE) ranging from 3.3% to 10.5%) and provide a set of features that are well correlated with the property. This method enables researchers to explore sets of features that significantly contribute to the prediction of the property, offering new scientific insights. To help accelerate early-stage biofuel research and development, we also integrated the data and models into an open-source, interactive web tool.
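
The abstract does not spell out the selection procedure, so the following is only a generic sketch of two ideas it mentions: reducing multicollinearity among descriptors and scoring predictions with MAPE. It assumes a simple greedy filter based on Pearson correlation. The function names (`drop_collinear`, `mape`), the 0.9 threshold, the descriptor names, and all numbers are hypothetical illustrations, not the authors' method or data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def drop_collinear(features, target, threshold=0.9):
    """Greedy filter (an assumed approach, not the paper's algorithm).

    Rank descriptors by |correlation with the target|, then keep a descriptor
    only if its |correlation| with every already-kept descriptor is below
    the threshold.
    """
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def mape(y_true, y_pred):
    """Mean absolute percent error, in percent. Assumes no true value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: three descriptors for five molecules.
descriptors = {
    "mol_weight": [10, 20, 30, 40, 50],
    "heavy_atoms": [1, 2, 3, 4, 6],   # nearly redundant with mol_weight
    "ring_count": [0, 1, 0, 1, 0],
}
property_values = [12, 25, 31, 44, 50]

selected = drop_collinear(descriptors, property_values)
error = mape([100, 200], [110, 180])
```

In this toy case `heavy_atoms` is discarded because it is almost perfectly correlated with `mol_weight`, which is the kind of redundancy a multicollinearity filter is meant to remove before model training.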

Keywords: Chemical Descriptors, Machine Learning, biofuels, TPOT

Suggested Citation

Comesana, Ana E. and Huntington, Tyler and Scown, Corinne D. and Niemeyer, Kyle E. and Rapp, Vi, A Systematic Method for Selecting Molecular Descriptors as Features When Training Models for Predicting Physiochemical Properties. Available at SSRN: https://ssrn.com/abstract=3990072 or http://dx.doi.org/10.2139/ssrn.3990072

Kyle E. Niemeyer

Oregon State University

Bexell Hall 200
Corvallis, OR 97331
United States

Vi Rapp (Contact Author)

Lawrence Berkeley Laboratory
