Research Project: Enabling Mechanistic Allele Mining to Accelerate Genomic Selection for New Agro-Ecosystems

Location: Plant, Soil and Nutrition Research

Title: Current genomic deep learning architectures generalize across grass species but not alleles

Author
item WRIGHTSMAN, TRAVIS - Cornell University
item FEREBEE, TAYLOR - Cornell University
item ROMAY, M CINTA - Cornell University
item AUBUCHON-ELDER, TAYLOR - Donald Danforth Plant Science Center
item PHILLIPS, ALYSSA - University Of California, Davis
item SYRING, MICHAEL - Iowa State University
item KELLOGG, ELIZABETH - (NCE, CECR) Networks Of Centres Of Excellence Of Canada, Centres Of Excellence For Commercialization A
item Buckler, Edward - Ed

Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 4/13/2024
Publication Date: 4/13/2024
Citation: Wrightsman, T., Ferebee, T.H., Romay, M., Aubuchon-Elder, T., Phillips, A.R., Syring, M., Kellogg, E.A., Buckler IV, E.S. 2024. Current genomic deep learning architectures generalize across grass species but not alleles. bioRxiv. https://doi.org/10.1101/2024.04.11.589024.
DOI: https://doi.org/10.1101/2024.04.11.589024

Interpretive Summary: Recent advancements in deep learning have greatly improved our understanding of non-coding regions in plant genomes, which are crucial for determining how genetic variations affect plant traits. This study benchmarks four genomic deep learning models using RNA sequencing data from 18 grass species related to maize and sorghum. These species have evolved diverse adaptations over millions of years, providing a rich dataset for training models. The researchers tested the models' ability to predict gene expression and found that while the models could generalize well across different species, they struggled to distinguish variations within individual species. Among the models, DanQ consistently performed well, but overall, all models showed similar results despite differences in complexity. The study emphasizes the need for more diverse and extensive datasets to improve model accuracy and highlights the release of their data and code for future research.

Technical Abstract: Non-coding regions of the genome are just as important as coding regions for understanding the mapping from genotype to phenotype. Interpreting deep learning models trained on RNA-seq is an emerging method to highlight functional sites within non-coding regions. Most of the work on RNA abundance models has been done within humans and mice, with little attention paid to plants. Here, we benchmark four genomic deep learning model architectures with genomes and RNA-seq data from 18 species closely related to maize and sorghum within the Andropogoneae. The Andropogoneae are a tribe of C4 grasses that have adapted to a wide range of environments worldwide since diverging 18 million years ago. Hundreds of millions of years of evolution across these species has produced a large, diverse pool of training alleles across species sharing a common physiology. As model input, we extracted 1,026 base pairs upstream of each gene’s translation start site. We held out maize as our test set and two closely related species as our validation set, training each architecture on the remaining Andropogoneae genomes. Within a panel of 26 maize lines, all architectures predict expression across genes moderately well but poorly across alleles. DanQ consistently ranked highest or second highest among all architectures yet performance was generally very similar across architectures despite orders of magnitude differences in size. This suggests that state-of-the-art supervised genomic deep learning models are able to generalize moderately well across related species but not sensitively separate alleles within species, the latter of which agrees with recent work within humans. We are releasing the preprocessed data and code for this work as a community benchmark to evaluate new architectures on our across-species and across-allele tasks.
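
The difference between the across-gene and across-allele tasks described in the abstract can be illustrated with a minimal sketch. It assumes expression is stored as a genes-by-lines matrix and that Pearson correlation is the score; the abstract does not state which metric the authors used. The function names and the toy numbers are hypothetical and are not taken from the study's data or code.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation; returns nan if either input is constant."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else float("nan")

def across_gene_scores(observed, predicted):
    """One score per line (column): are genes ranked correctly within a line?"""
    return [pearson(observed[:, j], predicted[:, j])
            for j in range(observed.shape[1])]

def across_allele_scores(observed, predicted):
    """One score per gene (row): are lines (alleles) ranked correctly within a gene?"""
    return [pearson(observed[i, :], predicted[i, :])
            for i in range(observed.shape[0])]

# Made-up toy data: 3 genes (rows) x 4 lines (columns).
observed = np.array([[1.0, 1.2, 0.9, 1.1],
                     [5.0, 5.3, 4.8, 5.1],
                     [9.0, 8.7, 9.2, 9.1]])

# Hypothetical predictions that capture each gene's overall level
# but not the small differences between lines.
predicted = np.array([[1.10, 1.00, 1.05, 0.95],
                      [5.00, 4.90, 5.20, 5.10],
                      [9.10, 9.20, 8.90, 9.00]])

print("across genes  :", np.round(across_gene_scores(observed, predicted), 3))
print("across alleles:", np.round(across_allele_scores(observed, predicted), 3))
```

In this toy case the per-line scores are high because differences between genes dominate the variance, while the per-gene scores are poor because the predictions do not track variation among lines. This is the same qualitative pattern the abstract reports, although the toy data exaggerate the across-gene performance.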