
Research Project: GrainGenes: Enabling Data Access and Sustainability for Small Grains Researchers

Location: Crop Improvement and Genetics Research

Title: Predicting tissue-specific mRNA and protein abundance in maize: A machine learning approach

Authors:
Cho, Kyoung Tak - Iowa State University
Sen, Taner
Andorf, Carson

Submitted to: Frontiers in Artificial Intelligence
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 4/26/2022
Publication Date: 5/26/2022
Citation: Cho, K., Sen, T.Z., Andorf, C.M. 2022. Predicting tissue-specific mRNA and protein abundance in maize: A machine learning approach. Frontiers in Artificial Intelligence. 5. Article 830170.

Interpretive Summary: The availability of high-quality genome assemblies and gene predictions has rapidly advanced plant genomics research. To make full use of these genomes, recent studies have shown that machine learning approaches can predict whether genes create a functional product such as mRNA (messenger ribonucleic acid) or protein. These approaches have two major limitations. First, they rely on limited experimental data, which can be costly and time-consuming to generate and is therefore not always available. Second, they do not identify in which plant tissues or under what conditions genes create functional products. This condition-specific information is important for linking genes back to agronomic traits. To address these problems, we developed a machine learning approach that makes condition-specific predictions based on sequence alone. Our approach, which was tested against experimental data, achieved high classification accuracy for predicting when genes make high levels of mRNAs and proteins across 23 different maize tissues. It allows researchers to predict when and in which tissue genes are expressed, even in the absence of experimental data. These predictions can be used to better understand the relationship between when genes are activated in a plant and the traits observed in farmers’ fields.

Technical Abstract: Background: Machine learning and modeling approaches have been used to classify protein sequences for a broad set of tasks, including predicting protein function, structure, expression, and localization. Recent studies have shown that it is possible to predict whether genes are expressed or even translated to proteins, but identifying condition-specific expression remains a challenge. Results: To address this challenge, we developed a Markov model approach that predicts tissue-specific gene expression in maize based on sequence alone. To demonstrate the utility of these classifiers, we systematically explored various methods and combinations of k-mer sequences, using both DNA promoter and protein sequences, across 23 different maize tissues. To generate class labels, we defined high and low expression levels for both mRNA and protein abundance. A two-phase approach performed best. In the first phase, we built a feature vector from the predictions for selected tissues. In the second phase, a Bayesian network used this feature vector for final classification. Conclusions: Our experimental results show that these methods can achieve high classification accuracy for predicting gene expression in individual tissues. Because it relies on sequence alone, our method can provide useful insights into the functional, evolutionary, and regulatory characteristics of genes in settings where costly experimental data are unavailable.
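
Illustrative sketch (not the authors' code): the first phase described in the abstract could look roughly like the Python below. It assumes an order-k Markov chain over DNA with add-one smoothing, one high-expression and one low-expression model per tissue, and a log-odds score per tissue as the feature. The class and function names, the value of k, and the toy sequences are all hypothetical. The second-phase Bayesian network is not implemented; the resulting feature vector would simply be handed to whatever second-stage classifier is used.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: DNA promoter sequences only


class MarkovModel:
    """Order-k Markov chain over DNA with add-one smoothing (hypothetical helper)."""

    def __init__(self, k=2):
        self.k = k
        # counts[context][next_base] = number of times next_base followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1
        return self

    def log_likelihood(self, seq):
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx_counts = self.counts.get(seq[i - self.k:i], {})
            numerator = ctx_counts.get(seq[i], 0) + 1
            denominator = sum(ctx_counts.values()) + len(ALPHABET)
            total += math.log(numerator / denominator)
        return total


def train_phase_one(training_data, k=2):
    """training_data maps tissue -> (high_expression_seqs, low_expression_seqs)."""
    return {
        tissue: (MarkovModel(k).fit(high), MarkovModel(k).fit(low))
        for tissue, (high, low) in training_data.items()
    }


def feature_vector(models, seq):
    """One log-odds score (high vs. low model) per tissue, in sorted tissue order."""
    return [
        high.log_likelihood(seq) - low.log_likelihood(seq)
        for _, (high, low) in sorted(models.items())
    ]


# Toy, made-up sequences purely for illustration.
training = {
    "leaf": (["ACGTACGTACGT", "ACGTACGGACGT"], ["TTTTTTTTTTTT", "TTTTATTTTTTT"]),
    "root": (["GGGCGGGCGGGC"], ["ATATATATATAT"]),
}
models = train_phase_one(training)
vec = feature_vector(models, "ACGTACGTAC")
```

A positive entry in `vec` means the sequence looks more like that tissue's high-expression training set than its low-expression set. In a full two-phase setup, vectors like this one would be the inputs to the final classifier.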