Location: Healthy Processed Foods Research
Title: pLM4Alg: Protein language model-based predictors for allergenic proteins and peptides
Author:
DU, ZHENJIAO - Kansas State University
Xu, Yixiang
LIU, CHANGQI - San Diego State University
LI, YONGHUI - Kansas State University
Submitted to: Journal of Agricultural and Food Chemistry
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 12/8/2023
Publication Date: 12/19/2023
Citation: Du, Z., Xu, Y., Liu, C., Li, Y. 2023. pLM4Alg: Protein language model-based predictors for allergenic proteins and peptides. Journal of Agricultural and Food Chemistry. 72(1):752-760. https://doi.org/10.1021/acs.jafc.3c07143.
DOI: https://doi.org/10.1021/acs.jafc.3c07143
Interpretive Summary: Allergy is an abnormal immune response to otherwise harmless foreign substances (e.g., foods, dust mites, pollens, and chemicals). Most allergens are proteins or peptides. To make allergen identification more efficient and to accelerate allergenic risk evaluation, computational approaches have been proposed continuously and have made significant progress over the past decades. Recently, pre-trained protein language models (pLMs) have successfully predicted protein structure and function. However, to the best of our knowledge, they have not been used to predict allergenic proteins or peptides. This study therefore develops robust models for allergenic protein/peptide prediction using five pLMs of varying sizes and systematically assesses their performance through fine-tuning with a convolutional neural network. The resulting pLM4Alg models achieved state-of-the-art performance, with accuracy of 94.1-95.5%, Matthews correlation coefficient of 0.882-0.911, and area under the curve of 98.3-99%. Moreover, pLM4Alg is the first model capable of handling prediction tasks involving sequences with missing residues and sequences containing non-standard amino acid residues. To facilitate easy access, a user-friendly web server (https://f6wxpfd3sh.us-east-1.awsapprunner.com/) has been established. pLM4Alg is expected to become the leading machine learning-based prediction model for allergenic peptides and proteins. Used together with other predictors, it holds great promise for accelerating allergy research.
Technical Abstract: The rising prevalence of allergy demands efficient and accurate bioinformatic tools to expedite allergen identification and risk assessment while reducing the cost and time of wet-lab experiments. Recently, pre-trained protein language models (pLMs) have successfully predicted protein structure and function. However, to the best of our knowledge, they have not been used to predict allergenic proteins or peptides. This study therefore develops robust models for allergenic protein/peptide prediction using five pLMs of varying sizes and systematically assesses their performance through fine-tuning with a convolutional neural network. The resulting pLM4Alg models achieved state-of-the-art performance, with accuracy of 94.1-95.5%, Matthews correlation coefficient of 0.882-0.911, and area under the curve of 98.3-99%. Moreover, pLM4Alg is the first model capable of handling prediction tasks involving sequences with missing residues and sequences containing non-standard amino acid residues. To facilitate easy access, a user-friendly web server (https://f6wxpfd3sh.us-east-1.awsapprunner.com/) has been established. pLM4Alg is expected to become the leading machine learning-based prediction model for allergenic peptides and proteins. Used together with other predictors, it holds great promise for accelerating allergy research.
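For readers unfamiliar with the three reported metrics, the following is a minimal Python sketch of how accuracy, the Matthews correlation coefficient (MCC), and area under the ROC curve (AUC) can be computed for a binary allergen/non-allergen classifier. It is not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and the numbers it prints have no relation to the results reported above.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (1 = allergen, 0 = non-allergen)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative
    (ties count as half). Quadratic in sample count, fine for small examples."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical, not from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("MCC:", mcc(*counts))
print("AUC:", auc(y_true, scores))
```

Accuracy and MCC depend on the chosen threshold, whereas AUC is computed from the raw scores, which is one reason the three are commonly reported together.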