Publication : USDA ARS

Publication #404207

Research Project: New Sustainable Processes, Preservation Technologies, and Product Concepts for Specialty Crops and Their Co-Products

Location: Healthy Processed Foods Research

Title: pLM4ACE: A protein language model-based machine learning predictor for screening peptides with high antihypertensive activity

Author

	DU, ZHENJIAO - Kansas State University
	DING, XINGJIAN - Kansas State University
	HSU, WILLIAM - Kansas State University
	MUNIR, ARSLAN - Kansas State University
	Xu, Yixiang
	LI, YONGHUI - Kansas State University

Submitted to: Food Chemistry
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 8/13/2023
Publication Date: 8/14/2023
Citation: Du, Z., Ding, X., Hsu, W., Munir, A., Xu, Y., Li, Y. 2023. pLM4ACE: A protein language model-based machine learning predictor for screening peptides with high antihypertensive activity. Food Chemistry. 431. Article 137162. https://doi.org/10.1016/j.foodchem.2023.137162.
DOI: https://doi.org/10.1016/j.foodchem.2023.137162

Interpretive Summary: Hypertension, a critical risk factor for cardiovascular diseases, affects approximately 1 billion people worldwide. Angiotensin-I converting enzyme (ACE) is a key enzyme regulating the renin-angiotensin system and a drug target in clinical hypertension treatment. Because wet-lab experiments are costly, inefficient, and dependent on advanced techniques and experienced technicians, bioinformatics has emerged as a promising approach to reverse the traditional workflow and guide efficient bioactive peptide screening. Protein language models (pLM) have recently been applied successfully to predict the structure and function of proteins, while confident learning theory has been proposed for cleaning real-world data. These advances have great potential for building a practical and powerful ACE inhibitory peptide classification model. This study aims to develop a pLM-based classification model that uses evolutionary scale modeling (ESM-2) embeddings to screen for peptides with strong ACE inhibitory activity, trained entirely on experimental data. To the best of our knowledge, this is the first classification model for peptides with high ACE inhibitory activity built entirely on experimental datasets. UMAP results confirmed the validity of confident learning for data cleaning and of ESM-2 for peptide embeddings. Five machine learning methods were used to build models from the peptide embeddings generated by ESM-2. For comparison, benchmark models were built with the same machine learning methods, using twelve popular features for peptide representation. The results showed that logistic regression, support vector machine, and multilayer perceptron perform very well with ESM-2-generated embeddings. Furthermore, model performance was significantly higher than that of the other feature-based models, which demonstrates the superiority of ESM-2 for peptide representation. The performance improvement over state-of-the-art models also supports this conclusion. The original scripts are also provided so that the method can be reproduced and applied to other peptide bioactivity prediction tasks.

Technical Abstract: Protein language models (pLM) have been successfully applied to predict the structure and function of proteins, while confident learning theory has been proposed for cleaning real-world data. These advances have great potential for building a practical and powerful angiotensin-I converting enzyme (ACE) inhibitory peptide classification model. This study aims to develop a pLM-based classification model that uses evolutionary scale modeling (ESM-2) embeddings to screen for peptides with strong ACE inhibitory activity, trained entirely on experimental data. Twelve conventional peptide embedding approaches were also tested as benchmark features and compared with the ESM-2-based embeddings, each combined with five machine learning (ML) methods for modeling. Among the resulting 65 classifiers (13 peptide representations × 5 ML methods), logistic regression with ESM-2 embeddings showed the best performance, with balanced accuracy (BACC), Matthews correlation coefficient (MCC), and area under the curve of 0.883 ± 0.017, 0.77 ± 0.032, and 0.96 ± 0.009, respectively. In addition, multilayer perceptron and support vector machine also exhibited great compatibility with ESM-2 embeddings. Using a pLM with simple ML methods resulted in much better prediction performance than the latest models, which rely on feature selection and complicated ML methods. To the best of our knowledge, this is the first classification model for predicting peptides with high ACE inhibitory activity that was developed entirely from experimental data and uses a pLM for peptide embeddings.
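
For readers unfamiliar with two of the metrics reported above, the following minimal Python sketch shows how balanced accuracy (BACC) and the Matthews correlation coefficient (MCC) are computed from the confusion matrix of a binary classifier. It is an illustration only, not the authors' script: the function names and the toy labels (1 = high ACE inhibitory activity, 0 = otherwise) are assumptions made for this example and do not come from the study's data.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity; assumes both classes are present."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, invented for illustration (not from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
bacc_value = balanced_accuracy(*counts)
mcc_value = mcc(*counts)
print(counts, round(bacc_value, 3), round(mcc_value, 3))
```

Both metrics account for performance on each class separately, which makes them more informative than plain accuracy when the positive and negative classes differ in size.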
