
Research Project: New Sustainable Processes, Preservation Technologies, and Product Concepts for Specialty Crops and Their Co-Products

Location: Healthy Processed Foods Research

Title: pLM4Alg: Protein language model-based predictors for allergenic proteins and peptides

Author
Du, Zhenjiao - Kansas State University
Xu, Yixiang
Liu, Changqi - San Diego State University
Li, Yonghui - Kansas State University

Submitted to: Journal of Agricultural and Food Chemistry
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 12/8/2023
Publication Date: 12/19/2023
Citation: Du, Z., Xu, Y., Liu, C., Li, Y. 2023. pLM4Alg: Protein language model-based predictors for allergenic proteins and peptides. Journal of Agricultural and Food Chemistry. 72(1):752-760. https://doi.org/10.1021/acs.jafc.3c07143.
DOI: https://doi.org/10.1021/acs.jafc.3c07143

Interpretive Summary: Allergy is an abnormal immune response to otherwise harmless foreign substances (e.g., foods, dust mites, pollens, and chemicals). Most allergens are proteins or peptides. To make allergen identification more efficient and to accelerate allergenic risk evaluation, computational approaches have been proposed continuously and have progressed significantly over the past decades. Recently, pre-trained protein language models (pLMs) have successfully predicted protein structure and function. However, to the best of our knowledge, they have not been used to predict allergenic proteins or peptides. This study therefore develops robust models for allergenic protein and peptide prediction using five pLMs of varying sizes and systematically assesses their performance through fine-tuning with a convolutional neural network. The resulting pLM4Alg models achieved state-of-the-art performance, with accuracy of 94.1-95.5%, Matthews correlation coefficient of 0.882-0.911, and area under the curve of 98.3-99%. Moreover, pLM4Alg is the first model capable of handling sequences with missing residues and sequences containing non-standard amino acid residues. For easy access, a user-friendly web server (https://f6wxpfd3sh.us-east-1.awsapprunner.com/) has been established. pLM4Alg is expected to become the leading machine learning-based prediction model for allergenic peptides and proteins. Used together with other predictors, it holds great promise for accelerating allergy research.

Technical Abstract: The rising prevalence of allergy demands efficient and accurate bioinformatic tools that expedite allergen identification and risk assessment while reducing the cost and time of wet-lab experiments. Recently, pre-trained protein language models (pLMs) have successfully predicted protein structure and function. However, to the best of our knowledge, they have not been used to predict allergenic proteins or peptides. This study therefore develops robust models for allergenic protein and peptide prediction using five pLMs of varying sizes and systematically assesses their performance through fine-tuning with a convolutional neural network. The resulting pLM4Alg models achieved state-of-the-art performance, with accuracy of 94.1-95.5%, Matthews correlation coefficient of 0.882-0.911, and area under the curve of 98.3-99%. Moreover, pLM4Alg is the first model capable of handling sequences with missing residues and sequences containing non-standard amino acid residues. For easy access, a user-friendly web server (https://f6wxpfd3sh.us-east-1.awsapprunner.com/) has been established. pLM4Alg is expected to become the leading machine learning-based prediction model for allergenic peptides and proteins. Used together with other predictors, it holds great promise for accelerating allergy research.
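
The three metrics reported above (accuracy, Matthews correlation coefficient, and area under the ROC curve) can be illustrated with a minimal sketch for a binary allergen/non-allergen classifier. The labels, scores, 0.5 decision threshold, and function names below are invented for illustration and are not taken from the publication or from pLM4Alg; the sketch only shows how each metric is defined.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = allergen)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC ranges from -1 to 1; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as half (pairwise definition, O(n^2))."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 4 allergens, 4 non-allergens.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("MCC:", matthews_corrcoef(*counts))
print("AUC:", roc_auc(y_true, scores))
```

Accuracy and MCC depend on the chosen threshold, while AUC is computed from the raw scores, which is one reason the three are usually reported together.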