EDMCV: An Unsupervised Method for Classifying Thermophilic Proteins Based on Sequences
23 Pages Posted: 19 Aug 2024
Abstract
Thermophilic proteins, characterized by unique thermal stability, offer valuable insights into protein design and optimization. Accurate classification of this property supports efforts to enhance enzyme catalytic efficiency and is a prerequisite for exploring the relationship between sequence and thermal stability. Traditional supervised learning methods depend on expensive experimental annotations to classify properties, which increases costs and restricts the analysis of unannotated sequence data. Here, we introduce the ESM-based deep multi-clustering voting (EDMCV) method, an unsupervised method that achieves high accuracy in thermophilic protein classification without relying on annotated data. EDMCV involves three key steps: 1) initial encoding of thermophilic protein sequences using the feature extraction capabilities of ESM; 2) optimization through deep clustering algorithms; and 3) a multi-clustering voting strategy to ensure reliable results. EDMCV achieved clustering accuracies of 96.04%, 96.23%, and 95.16% on three publicly available thermophilic protein datasets, matching or surpassing existing supervised classification approaches. Furthermore, analysis of the clustering results enabled the identification of sequence motifs closely related to thermal stability within the classified thermophilic protein sequences. EDMCV thus provides a new unsupervised method for classifying proteins by thermal stability, offering a low-cost way to classify the properties of unlabeled sequences and thereby improving the mining of unknown sequence data. In addition, mining feature motifs from classified thermophilic protein sequences provides important guidance for building a knowledge base and further optimizing protein property designs. The efficiency and scalability of this method lay a solid foundation for research into thermophilic proteins and their applications.
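The abstract does not specify how the multi-clustering voting step combines results, so the following is only a minimal sketch of one plausible reading: several binary clusterings of the same sequences are aligned to a common label naming (cluster IDs are arbitrary across runs) and then merged by per-sequence majority vote. The function names `align_to_reference`, `majority_vote`, and `clustering_accuracy`, the toy label lists, and the choice of the first run as the reference are all illustrative assumptions, not the authors' implementation. The ESM encoding and deep clustering steps are not reproduced here.

```python
from collections import Counter


def align_to_reference(labels, reference):
    """Flip binary labels (0/1) when they agree with the reference on
    fewer than half of the items, since cluster names are arbitrary."""
    agree = sum(a == b for a, b in zip(labels, reference))
    if agree * 2 < len(labels):
        return [1 - a for a in labels]
    return list(labels)


def majority_vote(runs):
    """Merge several binary clusterings of the same items by majority vote.
    Assumption: the first run serves as the reference naming, and an odd
    number of runs is used so that ties cannot occur."""
    reference = runs[0]
    aligned = [align_to_reference(r, reference) for r in runs]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*aligned)]


def clustering_accuracy(pred, truth):
    """Binary clustering accuracy: best agreement over the two possible
    mappings between cluster IDs and class labels."""
    agree = sum(p == t for p, t in zip(pred, truth))
    return max(agree, len(truth) - agree) / len(truth)


# Hypothetical toy data: three clustering runs over six sequences.
runs = [
    [0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 1],  # mostly the same partition with swapped names
    [0, 1, 1, 1, 0, 1],
]
consensus = majority_vote(runs)
print(consensus)
```

In this sketch, accuracy is only computed when labels are available for evaluation; the voting itself uses no annotations, which matches the unsupervised setting described above.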
Keywords: Thermophilic proteins, Unsupervised methods, Deep clustering, Protein sequence classification