Project : USDA ARS

Research Project: Improving Crop Efficiency Using Genomic Diversity and Computational Modeling

Location: Plant, Soil and Nutrition Research

2020 Annual Report

Objectives
Objective 1: Create approaches and tools for identifying causal variants directly from genomic sequencing of diverse germplasm and species of C4 crops. [NP301, C1, PS1A]
Objective 2: Identify deleterious mutations, and model their impact on crop efficiency and heterosis in C4 crops. [NP301, C3, PS3A]
Objective 3: Identify adaptive variants for drought and temperature tolerance across C4 crops. [NP301, C1, PS1B]
Objective 4: Establish community tools for processing and integration of sequence haplotypes to estimate their breeding effects in crop productivity. [NP301, C4, PS4A]

Approach
Increasing grass crop productivity is key to feeding the world over the next 50 years. This will require removing the deleterious variants in every genome, as well as adapting the crops to highly variable and stressful environments. This project will build better breeding models for improving and adapting maize and sorghum by surveying the natural variation across their entire group of wild relative species, the Andropogoneae. With over 1,000 species, the Andropogoneae are the most productive and water-use efficient plants in the world. Yet, for applied purposes, we have only tapped the variation from a handful of species. This project will lead an effort to survey DNA-level variation across this entire clade and analyze the variation with statistical and machine learning approaches. This will allow us to develop two sets of applied models for maize and sorghum. First, we will quantitatively estimate the deleterious impact on yield for every nucleotide in the genome. Second, we will identify the genes with a high capacity for adaptation for drought, flooding, and temperature tolerance, along with their properties. These approaches and models will be deployed via integration with big data bioinformatics. This project will produce DNA-level knowledge that can be used across breeding programs and crops, and applied through either genomic selection or genome editing.

Progress Report
This year we used the pipeline developed last year to continue assembling maize genomes, starting with an additional 9 genomes from the Genomes to Fields initiative. Because sequencing technologies keep improving, the project worked closely with USDA scientists in Stoneville, Mississippi, to implement a high-throughput genome assembly pipeline for maize based on the newest high-quality long-read technology. To complement the public release of 26 very high-quality maize assemblies, we selected an additional 25 maize genomes that capture diversity missing from that set. We have assembled 15 of them with the new pipeline; the other 10 are in progress and expected to be finished in the next few months. We used all of this information to generate an improved maize practical haplotype graph database that captures most of maize diversity.

In addition, working with our collaborators, we finished collecting live clones of Andropogoneae species, the wild relatives of maize and sorghum, and started genome sequencing and assembly at scale. We now have 15 more genomes of different species available, and 5 more are expected in the next few months. Combined with the information in the maize graph, these newly assembled genomes will improve our understanding of the functionally constrained regions of the genome and our current estimates of the effects of allele changes on yield.

The prevailing hypothesis for hybrid vigor is that it is the product of having two alleles of each gene, which provides either complementation of deleterious mutations or total expression that is closer to the optimal expression level. For individual genes and pathways, we know that both mechanisms are in operation. The general question is how the dosage and dominance of a gene relate to phenotype. At the whole-genome scale, we tested a range of hypotheses relating SNP variation to the prediction of hybrid yield and other traits. These tests showed that the contributors to predicting yield were dosage, genetic variation close to genes, and mutations in conserved nucleotides.

In another series of experiments that related the dosage of RNA expression to whole-plant performance, we found that prediction accuracy improved substantially when models were trained on 5,000 different genotypic observations, whereas datasets with only a few hundred distinct genotypes provided no improvement over standard genome-wide prediction. Given the more than 30,000 genes in the genome, this suggests that large studies are needed to make progress on directly integrating dosage with whole-plant phenotype.

An alternative to building large empirical datasets that connect genotype or expression to phenotype is to develop models built on mechanistic processes that are well parameterized with specific datasets. We have made progress on training models to predict expression and transcription factor binding in maize and Arabidopsis, and on transferring expression and transcription factor models between maize and Arabidopsis. Through several collaborations, we have also just collected data on protein levels and on the expression of a control gene against nearly every maize promoter (STARR-seq). These models are starting to approach the accuracy at which they could rival measured expression within the next couple of years.

There has also been tremendous progress by other groups in using machine learning for protein structure prediction. We have tested these models extensively, but at present they do not seem sensitive enough to make actionable predictions. Machine learning methods for protein structure are evolving very rapidly in the biomedical context, and we will continue to evaluate these models as they are developed to test their applicability to crop improvement. Overall, we are seeing evidence for good transferability of models among plants, and while protein models currently have insufficient resolution, we expect this to change soon. Within the next several years, mechanistic models that work across eukaryotes seem likely.

The project originally targeted integration of its software with Spark, a platform that facilitates large-scale, parallel computation, but it became apparent that the genomics and plant breeding community was not adopting that platform. Instead, two important trends have been the widespread use of R and of software containers. R is a computing environment for statistics and graphics. Containers encapsulate complex software environments, making complex packages much easier to distribute and use. We also began development of the Practical Haplotype Graph (PHG) software for organizing pan-genomes and using them for imputation. The project has recently made publicly available R interfaces to both TASSEL and PHG, called rTASSEL and rPHG. These interfaces make it possible to run analyses, export results, and take advantage of R packages for downstream analysis and visualization. The PHG software is distributed as a Docker image, which can be downloaded from Docker Hub.

As another approach to making analysis methods in TASSEL and PHG more accessible and more performant, the project has begun investigating GraalVM, a computing environment that both makes it easier to combine programming languages and provides faster execution than the leading JVMs. A JVM is a Java Virtual Machine, the software engine that runs Java and Java-compliant programs. TASSEL and PHG are written in a combination of Java and Kotlin, both of which run on JVMs. All of these efforts reflect the project's longer-term objective to stop developing graphical user interfaces and instead take advantage of widely used notebook-style software for managing complex research workflows.

Breeding Insight (BI) is an ARS initiative to increase adoption of genomics, phenomics, and analytics tools (including data management software) in ARS specialty crop and animal breeding programs, which have lagged behind major crop and animal breeding programs. BI is currently in year 2 of a pilot phase focused on building support services for 6 ARS breeding programs (blueberry, table grape, sweet potato, alfalfa, rainbow trout, and North American Atlantic salmon), with the future goal of expanding to all ARS specialty crop, animal, and natural resource breeding programs. In year 2, we completed most of BI's hiring. The first focus was on understanding the needs of the various breeding programs, including their commonalities and differences, which was accomplished through location visits and meetings held monthly or more often. While there is a wide range of informatics platforms to assist breeders, these collaborations facilitate the development of clear, workflow-based tools with high-quality interfaces. The software team is about halfway through creating these initial workflows.

BI's first significant accomplishment is the release of open-source software code that allows seamless data transfer between the leading field data collection platform and the leading open-source database system using BrAPI (Breeding API). BI worked with the US grape breeding community to deploy this software, and breeders are using it in their day-to-day work to improve efficiency and accuracy. In the past year, BI completed genome sequencing of ARS alfalfa and blueberry to create a set of markers for breeding efforts, and provided a set of 100K markers to create a publicly available North American Atlantic salmon genotyping platform. BI has also supported genotyping and evaluating 4,000 grape varieties. Breeding Insight is off to a strong start, but the rest of the year will be key to completing the initial version.
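
The complementation explanation for hybrid vigor described in the Progress Report can be illustrated with a toy calculation. The sketch below is not the project's model. It assumes a simple multiplicative model in which a locus homozygous for a deleterious allele scales performance by (1 - s) and a heterozygous locus by (1 - h*s), where h is the dominance coefficient. The function names, the values of s and h, and the example genotypes are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def performance(genotype: Sequence[int], s: float = 0.05, h: float = 0.1) -> float:
    """Relative performance under an assumed multiplicative model.

    genotype[i] is the number of deleterious alleles (0, 1, or 2) at locus i.
    s is a hypothetical effect of a homozygous deleterious locus.
    h is the dominance coefficient (h = 0 means fully recessive,
    h = 0.5 means the heterozygote is roughly intermediate).
    """
    value = 1.0
    for count in genotype:
        if count == 2:
            value *= 1.0 - s
        elif count == 1:
            value *= 1.0 - h * s
        elif count != 0:
            raise ValueError("allele counts must be 0, 1, or 2")
    return value


def cross(parent_a: Sequence[int], parent_b: Sequence[int]) -> List[int]:
    """F1 genotype from two fully inbred parents (every locus is 0 or 2)."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same number of loci")
    for count in list(parent_a) + list(parent_b):
        if count not in (0, 2):
            raise ValueError("inbred parents must be homozygous at every locus")
    # Each inbred parent transmits one copy of whichever allele it carries.
    return [(a + b) // 2 for a, b in zip(parent_a, parent_b)]


if __name__ == "__main__":
    # Two hypothetical inbred lines carrying deleterious alleles at different loci.
    line_a = [2, 2, 0, 0, 0, 0]
    line_b = [0, 0, 2, 2, 2, 0]
    hybrid = cross(line_a, line_b)
    for name, genotype in (("line A", line_a), ("line B", line_b), ("F1", hybrid)):
        print(f"{name}: {performance(genotype):.4f}")
```

When the deleterious alleles are mostly recessive (small h), the hybrid outperforms both parents in this toy model because each parent masks the other's mutations. With h = 0.5 the advantage disappears and the hybrid falls close to the parental average, which is one reason dominance has to be modeled alongside dosage.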

Accomplishments
1. The genomic toolbox for regulating genes is shared across flowering plants and crops. Flowering plants and crops have 20,000 to 60,000 genes, but those genes are controlled by a smaller set of two thousand regulator genes called transcription factors. Are the patterns for how these regulator genes bind DNA and turn on genes consistent across plants? In two large studies, ARS researchers in Ithaca, New York, along with collaborators, have shown that the interaction between regulator genes and DNA is evolutionarily consistent across flowering plants. The tremendous diversity of plants is the product of combining these regulator gene-DNA interactions in numerous new ways. This suggests that plant scientists should work across species to develop a single model for the regulation of plant genes. In the long term, this will allow advanced genomic models to be applied to all crops.

2. Breeding Insight starts supporting ARS specialty crop and animal breeders. While specialty crops and animals are a large portion of gross US agricultural revenue, individually these small programs have not had access to the innovations that benefited major crop and animal breeding programs and thus have lagged behind. ARS specialty breeders are often the sole source of publicly available new crop varieties for farmers and growers across the US and elsewhere. Breeding Insight is currently in a pilot phase focused on building support services for 6 ARS breeding programs (blueberry, table grape, sweet potato, alfalfa, rainbow trout, and North American Atlantic salmon), with the future goal of expansion to all ARS specialty crop, animal, and natural resource breeding programs. The project has identified the key workflows common to these diverse programs and initiated the development of extensive software and genomics resources to support these efforts. A key early success was integration of the leading field data collection tool with the community’s leading database. Genomic support was delivered for all programs. Providing powerful information and genomic tools to ARS’s excellent specialty crop and animal breeders is helping to improve breeding decisions, meet public demands for more nutritious and flavorful foods, and improve food security for the US and its trade partners.
