EDMCV: An Unsupervised Method for Classifying Thermophilic Proteins Based on Sequences

23 Pages Posted: 19 Aug 2024

Lin Cheng

Dali University

Qimeng Du

Dali University

Zhuohang Yu

Dali University

Anqi Mao

Dali University

Junpeng Zhang

Dali University

Chong Peng

Tianjin University - Tianjin University of Science and Technology

Xin Wu

affiliation not provided to SSRN

Chi-Chun Zhou

Dali University

Le Gao

affiliation not provided to SSRN

Abstract

Thermophilic proteins, characterized by unique thermal stability, offer valuable insights into protein design and optimization. Correctly classifying this property supports efforts to improve enzyme catalytic efficiency and is a prerequisite for exploring the relationship between sequence and thermal stability. Traditional supervised learning methods depend on costly experimental annotations to classify properties, which raises expenses and restricts the analysis of unannotated sequence data. Here, we introduce the ESM-based deep multi-clustering voting (EDMCV) method, an unsupervised method that achieves high accuracy in thermophilic protein classification without relying on annotated data. EDMCV involves three key steps: 1) initial encoding of thermophilic protein sequences using the feature extraction capabilities of ESM; 2) optimization through deep clustering algorithms; and 3) a multi-clustering voting strategy to ensure reliable results. EDMCV achieved clustering accuracies of 96.04%, 96.23% and 95.16% on three publicly available thermophilic protein datasets, matching or surpassing existing supervised classification approaches. Furthermore, analysis of the clustering results enabled the identification of sequence motifs closely related to thermal stability within the classified thermophilic protein sequences. EDMCV thus provides a new unsupervised method for classifying proteins by thermal stability, offering a low-cost way to classify the properties of unlabeled sequences and thereby improving the mining of unknown sequence data. In addition, mining feature motifs from classified thermophilic protein sequences provides important guidance for building a knowledge base and further optimizing protein property design. The efficiency and scalability of this method lay a solid foundation for research into thermophilic proteins and their applications.
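
The voting step (step 3) can be illustrated with a minimal sketch. The abstract does not say how EDMCV aligns or combines its clustering runs, so the code below assumes one simple scheme rather than the authors' implementation: each run's arbitrary cluster IDs are first matched to a reference run by trying every ID permutation, and each sequence then takes the label chosen by most runs. The function names `align_labels` and `majority_vote` and the toy label lists are hypothetical and introduced only for illustration.

```python
from collections import Counter
from itertools import permutations


def align_labels(reference, labels, n_clusters):
    """Relabel `labels` so it agrees with `reference` as much as possible.

    Cluster IDs are arbitrary, so two runs can describe the same partition
    with swapped IDs. Assumption: n_clusters is small enough that trying
    every permutation of IDs is affordable.
    """
    best_mapping, best_agreement = None, -1
    for perm in permutations(range(n_clusters)):
        agreement = sum(1 for r, l in zip(reference, labels) if r == perm[l])
        if agreement > best_agreement:
            best_mapping, best_agreement = perm, agreement
    return [best_mapping[l] for l in labels]


def majority_vote(runs, n_clusters=2):
    """Combine several clustering runs into one consensus labeling."""
    reference = runs[0]
    aligned = [list(reference)]
    aligned += [align_labels(reference, run, n_clusters) for run in runs[1:]]
    # For each sequence, keep the label that most runs agree on.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*aligned)]


# Three hypothetical clustering runs over six sequences (toy data).
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition as the first run, IDs swapped
    [0, 0, 1, 1, 1, 1],  # disagrees with the others on the third sequence
]
consensus = majority_vote(runs)
print(consensus)
```

With an odd number of runs and two clusters, a tie cannot occur. For more clusters or an even number of runs, a tie-breaking rule would need to be chosen explicitly.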

Keywords: Thermophilic proteins, Unsupervised methods, Deep clustering, Protein sequence classification

Suggested Citation

Cheng, Lin and Du, Qimeng and Yu, Zhuohang and Mao, Anqi and Zhang, Junpeng and Peng, Chong and Wu, Xin and Zhou, Chi-Chun and Gao, Le, EDMCV: An Unsupervised Method for Classifying Thermophilic Proteins Based on Sequences. Available at SSRN: https://ssrn.com/abstract=4929891

Lin Cheng

Dali University

Dali
China

Qimeng Du

Dali University

Dali
China

Zhuohang Yu

Dali University

Dali
China

Anqi Mao

Dali University

Dali
China

Junpeng Zhang

Dali University

Dali
China

Chong Peng

Tianjin University - Tianjin University of Science and Technology

China

Xin Wu

affiliation not provided to SSRN

No Address Available

Chi-Chun Zhou (Contact Author)

Dali University

Dali
China

Le Gao

affiliation not provided to SSRN

No Address Available