Location: Plant, Soil and Nutrition Research
Title: Exploring the utility of regulatory network-based machine learning for gene expression prediction in maize
Ferebee, Taylor - Cornell University
Buckler, Edward - Ed
Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 5/14/2023
Publication Date: 5/14/2023
Citation: Ferebee, T.H., Buckler IV, E.S. 2023. Exploring the utility of regulatory network-based machine learning for gene expression prediction in maize. bioRxiv. 05/14/2023. https://doi.org/10.1101/2023.05.11.540406.
Interpretive Summary: Scientists want to improve the way they predict gene expression in crops in order to quickly identify targets for genomic selection and gene editing. Current prediction models are limited because they require a lot of computing power and species-specific training data, and they do not consider multiple species. In this study, we used three different machine learning methods, ranging from basic to complex in structure, to predict gene expression within and between corn, rice, and sorghum. The most basic model gave the most accurate predictions within the same species, with a correlation of 0.75. The more complex machine learning methods did not perform as well for within-species predictions. In the most challenging prediction scenario, where we predicted gene expression in unobserved experiments, the mildly complex model averaged a correlation of around 0.65. Contrary to our initial expectations, the study found that simple models like the basic one worked well for predicting gene expression within the same species. However, when predicting gene expression between different species, more complex models that consider the regulatory network structure and information from other studies were more effective. This research helps scientists decide which kinds of models to use, ultimately leading to better crop breeding and gene editing techniques.
Technical Abstract: Genomic selection and gene editing in crops could be enhanced by multi-species, mechanistic models predicting effects of changes in gene regulation. Current expression abundance prediction models require extensive computational resources and hard-to-measure species-specific training data, and they often fail to incorporate data from multiple species. We hypothesize that gene expression prediction models that harness the regulatory network structure of Arabidopsis thaliana transcription factor-target gene interactions will improve on the present maize models. To this end, we collect 147 Oryza sativa and 99 Sorghum bicolor gene expression assays and assign them to maize family-based orthologous groups. Using three popular graph-based machine learning frameworks (a shallow graph convolutional autoencoder, a deep graph convolutional autoencoder, and the inductive GraphSage strategy), we encode an Arabidopsis thaliana integrated gene regulatory network (iGRN) structure and TF gene expression values to predict gene expression both within and between species. We then evaluate the network methods against a partial least-squares baseline. We find that the baseline gives the best predictions within species, with Spearman correlations averaging between 0.74 and 0.78. The graph autoencoder methods were more variable, with correlations between -0.1 and 0.65. In particular, the GraphSage and deep autoencoders performed the worst, and the shallow autoencoders performed the best. In the most challenging prediction context, where predictions were made in new species and on unseen genes, we found that the shallow graph autoencoder framework averaged a correlation of around 0.65. Contrary to our initial expectation that preserved network structure would improve gene expression predictions, this study shows that within-species predictions need only simple models, such as partial least squares, to capture expression variation. In cross-species predictions, the best model is often a more complex strategy that uses regulatory network structure and expression data from other studies.
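The abstract compares models by the Spearman correlation between observed and predicted expression. As an illustration only, the sketch below computes that statistic in plain Python: it ranks both vectors (averaging tied ranks) and takes the Pearson correlation of the ranks. The function names and the toy expression values are hypothetical and are not taken from the study, whose actual evaluation code is not shown here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(observed, predicted):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return pearson(average_ranks(observed), average_ranks(predicted))


if __name__ == "__main__":
    # Hypothetical expression values for five genes (not data from the study).
    observed = [0.2, 1.5, 3.1, 0.9, 2.4]
    predicted = [0.4, 1.1, 2.8, 1.3, 2.0]
    print(f"Spearman correlation: {spearman(observed, predicted):.3f}")
```

Because the statistic depends only on ranks, it rewards a model for ordering genes correctly even when the predicted magnitudes are off, which is one plausible reason to prefer it for expression data with skewed distributions.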