Large Language Model for Horizontal Transfer of Resistance Gene: From Resistance Gene Prevalence Detection to Plasmid Conjugation Rate Evaluation

34 Pages Posted: 20 Feb 2024

Jiabin Zhang

Harbin Institute of Technology

Lei Zhao

Harbin Institute of Technology

Wei Wang

Harbin Institute of Technology

Quan Zhang

Harbin Institute of Technology

Xueting Wang

Harbin Institute of Technology

Defeng Xing

Harbin Institute of Technology

Nanqi Ren

Harbin Institute of Technology - State Key Laboratory of Urban Water Resource and Environment

Duu-Jong Lee

National Taiwan University - Department of Chemical Engineering

Chuan Chen

Harbin Institute of Technology - State Key Laboratory of Urban Water Resource and Environment

Abstract

The growing dissemination of plasmid-mediated antibiotic resistance genes (ARGs) poses a significant threat to environmental integrity. However, prediction of ARG prevalence has been overlooked, especially for emerging ARGs that are potentially evolving gene exchange hotspots. Here, we used DNABERT to classify sequences as plasmid or chromosome and to detect resistance gene prevalence. First, DNABERT fine-tuned on plasmid and chromosome sequences and followed by a multilayer perceptron (MLP) classifier achieved an AUC (area under the curve) of 0.764 on external datasets spanning 23 genera, outperforming a traditional statistics-based model by 0.02 AUC. Second, single-genus models for Escherichia and Pseudomonas were trained to assess their performance in detecting ARG prevalence. Integrating k-mer frequency attributes improved the prediction of ARG prevalence on an external dataset by 0.0281 to 0.0615 AUC for Escherichia and by 0.0196 to 0.0928 AUC for Pseudomonas. Finally, drawing on data from the existing literature, we built a random forest model to forecast the relative conjugation transfer rate of plasmids, which reached an AUC of 0.7956. This model identifies the plasmid's repression status, cell density, and temperature as the most important factors influencing transfer frequency. Combined, the two models provide a useful reference for quick, low-cost, integrated evaluation of resistance gene transfer.
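The k-mer frequency attributes mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the value of k, how ambiguous bases are handled, or how these features are joined with the DNABERT representation, so the function name `kmer_frequencies`, the default `k=3`, and the choice to skip windows containing non-ACGT characters are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int = 3) -> dict:
    """Return the relative frequency of every possible A/C/G/T k-mer.

    Hypothetical helper: k=3 and the handling of ambiguous bases are
    illustrative assumptions, not settings taken from the paper.
    """
    seq = sequence.upper()
    # Count overlapping windows, skipping any that contain ambiguous bases (e.g. N).
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    # Fixed vocabulary so every sequence maps to the same 4**k features.
    vocab = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocab}


# Toy sequence for illustration; not data from the study.
freqs = kmer_frequencies("ATGCGATGNA", k=3)
```

A fixed-length vector like this could, for example, be concatenated with a sequence embedding before the classifier, although the abstract does not say whether the authors combined the features this way.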

Keywords: Deep Learning, Large language model, BERT, ARGs prevalence prediction, plasmid conjugation rate

Suggested Citation

Zhang, Jiabin and Zhao, Lei and Wang, Wei and Zhang, Quan and Wang, Xueting and Xing, Defeng and Ren, Nanqi and Lee, Duu-Jong and Chen, Chuan, Large Language Model for Horizontal Transfer of Resistance Gene: From Resistance Gene Prevalence Detection to Plasmid Conjugation Rate Evaluation. Available at SSRN: https://ssrn.com/abstract=4732229 or http://dx.doi.org/10.2139/ssrn.4732229

Jiabin Zhang

Harbin Institute of Technology

92 West Dazhi Street
Nan Gang District
Harbin, 150001
China

Lei Zhao

Harbin Institute of Technology

92 West Dazhi Street
Nan Gang District
Harbin, 150001
China

Wei Wang

Harbin Institute of Technology

92 West Dazhi Street
Nan Gang District
Harbin, 150001
China

Quan Zhang

Harbin Institute of Technology

92 West Dazhi Street
Nan Gang District
Harbin, 150001
China

Xueting Wang

Harbin Institute of Technology

92 West Dazhi Street
Nan Gang District
Harbin, 150001
China

Defeng Xing

Harbin Institute of Technology

92 West Dazhi Street
Nan Gang District
Harbin, 150001
China

Nanqi Ren

Harbin Institute of Technology - State Key Laboratory of Urban Water Resource and Environment

China

Duu-Jong Lee

National Taiwan University - Department of Chemical Engineering

Taipei
Taiwan

Chuan Chen (Contact Author)

Harbin Institute of Technology - State Key Laboratory of Urban Water Resource and Environment

China
