Research Project: Database Tools for Managing and Analyzing Big Data Sets to Enhance Small Grains Breeding

Location: Plant, Soil and Nutrition Research

2019 Annual Report

Objectives
Objective 1: Develop methods and analyses on the Triticeae Toolbox (T3) database that use data stored there to assign likelihood to genome segments of carrying trait associated variants.
Sub-objective 1.A. Improve T3 upload, download, and quality control tools.
Sub-objective 1.B. Implement the Genomics and Open-source Breeding Informatics Initiative (GOBII) genotype data storage on T3.
Sub-objective 1.C. Automate imputation to high-density genotyping platforms.
Sub-objective 1.D. Automate genome-wide association study implementation.

Objective 2: Improve linkages between diversity data stored in T3 and knowledge gleaned from the literature based on biological experimentation.
Sub-objective 2.A. Develop new linkages with KNetMiner.
Sub-objective 2.B. Implement analyses to estimate between-trait genetic correlations using the whole database as the reference population.

Objective 3: Enhance T3 facilities to analyze and manage multi-omic data and data from multi-state cooperative nurseries.
Sub-objective 3.A. Functions for search and analysis of transcriptomic and metabolomic data.
Sub-objective 3.B. Clustering and prediction using multi-omic data.

Approach
ARS will develop text-file input methods and implement Mendelian error checking when both parents and an offspring have marker data. Upon upload of a high-dimensional phenotype dataset, a relationship matrix will be constructed from it and compared to the marker-based and pedigree-based relationship matrices. It will be important to scale each phenotype according to the information it carries about the genotype, namely, its heritability. The method will be developed and tested on both transcriptomic and metabolomic datasets. The Genomics and Open-source Breeding Informatics Initiative (GOBII, www.gobiiproject.org) genotype data management system will be incorporated using the Breeding application program interface (BrAPI, www.brapi.org). For imputation, Beagle4 has been tested. We will collaborate with another ARS lab in bringing the Practical Haplotype Graph (PHG) to wheat. When new lines are uploaded to T3 with genotype data of adequate density, they will be imputed. Genome-wide association study (GWAS) analyses using imputed scores will take the reliability of those scores into account. For traits assayed in multiple trials, results will be combined by meta-analysis. Genes will be sorted by cumulative evidence of association, automated links will be made to external databases, and we will populate a JBrowse track with GWAS hits. T3 users will want to access the KNetMiner network after an association analysis in T3: once a user has identified a variant associated with a trait, KNetMiner will provide access to information from the literature about it. KNetMiner has developed a beta application program interface (API) that takes a gene and a trait and displays the knowledge network connected to them; this API will be used. Traits will be linked by co-located associations. In a focal dataset, users will query all associations that pass a user-defined threshold.
Physical distances between associations in prior and focal datasets will be ranked and presented so that users can decide which traits to link. Traits will also be linked by the overall genetic correlation between them, estimated by correlating genomic predictions with traits measured in the focal dataset. Expression data for tens of thousands of genes will be stored in “materialized view” tables. JBrowse tracks will be created allowing gene expression of sets of individuals to be displayed. Clicking on a transcript will open a window with a link back to T3 enabling the selection of the transcript as a phenotype. The transcriptome sequences will be added as a T3 BLAST database. The challenge of metabolomics is that most metabolites detected in mass spectrometry (MS) experiments are of unknown chemical composition. Metabolomic databases other than T3 allow metabolite identities to be explored. Metabolomic data will be stored in formats compatible with those databases to enable sharing. Users will be able to link back to T3 from them. As with gene expression, metabolites will be searchable based on the genetic correlation of their levels with other phenotypes.
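The Mendelian error check mentioned at the start of the Approach can be illustrated with a minimal sketch. It assumes biallelic markers coded as the count of the alternate allele (0, 1, or 2), missing calls coded as None, and an offspring that is a direct F1 of the two parents. All of these are assumptions for illustration and do not describe T3's actual implementation; lines derived by repeated selfing from a cross would need a relaxed rule, because a fixed line can be homozygous where an F1 would be heterozygous.

```python
from typing import List, Optional, Sequence

# Alleles (0 = reference, 1 = alternate) that a diploid parent with a given
# alternate-allele dosage can transmit to its offspring.
GAMETES = {0: {0}, 1: {0, 1}, 2: {1}}

Dosage = Optional[int]  # 0, 1, 2, or None for a missing call


def is_mendelian_error(parent1: Dosage, parent2: Dosage, child: Dosage) -> Optional[bool]:
    """Return True if the child's dosage cannot arise from the two parents.

    Returns None when any of the three calls is missing, because the
    marker cannot be checked in that case.
    """
    if parent1 is None or parent2 is None or child is None:
        return None
    # Every dosage the child could have: one allele from each parent.
    possible = {a + b for a in GAMETES[parent1] for b in GAMETES[parent2]}
    return child not in possible


def mendelian_errors(parent1: Sequence[Dosage],
                     parent2: Sequence[Dosage],
                     offspring: Sequence[Dosage]) -> List[int]:
    """Return the indices of markers at which the trio is inconsistent."""
    return [
        i
        for i, (p1, p2, c) in enumerate(zip(parent1, parent2, offspring))
        if is_mendelian_error(p1, p2, c)  # None (uncheckable) is skipped
    ]
```

A high proportion of inconsistent markers across a trio would more likely point to a pedigree or sample-tracking error than to genotyping error at individual markers, so the returned indices are best summarized per trio before any single call is corrected.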

Progress Report
This is the first report for project 8062-21000-045-00D “Database Tools for Managing and Analyzing Big Data Sets to Enhance Small Grains Breeding.” As such, beyond our specific objectives and milestones, it is important to ask if researchers have adequately pivoted toward addressing the research problems identified in our five-year project plan. First, are researchers enabling small breeding programs without a strong information technology infrastructure to take advantage of large breeding datasets, and second, are researchers facilitating access to breeding data for the broader scientific community so that it can contribute to the testing of mechanistic hypotheses? Relative to the first problem, the most important work has involved moving the underlying code base of The Triticeae Toolbox (T3, triticeaetoolbox.org) to BreedBase software, which is at the heart of databases used for a number of crops (Cassavabase, Yambase, Musabase, Sweetpotatobase, and Solgenomics). Researchers in Ithaca, New York, are about halfway through our one-year timeline to make this transition. The benefit of the transition is two-fold. First, because BreedBase is in use by applied breeding programs, it has implemented more functionality for applied breeding. It has experimental design pages, pages for managing barcodes, communication with tablets used for data capture (through the Android Field Book), and tracking software for submission of samples for DNA marker analysis. Once existing T3 datasets transition to BreedBase, all these functions will be available to small grains breeders. More importantly, BreedBase will be good code for T3 because it has greater support for software development than T3 currently has. First, the BreedBase team itself is supported by the Bill & Melinda Gates Foundation, as it serves as breeding management software for a number of priority crops in sub-Saharan Africa.
Second, BreedBase is the code currently being adopted by the Breeding Insight project. Breeding Insight is working for USDA-ARS to improve software for the many small specialty crop breeding programs supported by ARS. While this is a new project, they are in the process of hiring a half-dozen developers to work on this underlying software so that it facilitates the workflows of ARS breeding programs. For T3, researchers see an opportunity both to leverage that investment and to contribute to it. While transitioning T3 to BreedBase is a big lift involving substantial effort from both USDA-ARS computational biologists and Cornell T3 data curators, the project's collaborations with BreedBase and Breeding Insight are already generating positive synergy. For example, researchers have been able to identify priorities to address within BreedBase and begin work on them. Researchers have improved trait ontology editing, an issue of concern for Breeding Insight as they work to develop ontologies for their five pilot specialty crop species. In turn, these groups are providing support to move the data. Other areas in which T3 has made progress as a tool facilitating access to bioinformatics tools for breeders can be listed briefly. Researchers have developed and improved pages for the design of PCR and KASP marker primers. In relation to that, researchers maintain a page that identifies and exports all loci polymorphic between a set of wheat lines. Researchers have been able to take advantage of a new API to KNetMiner (https://knetminer.rothamsted.ac.uk/KnetMiner/) to create links between results on T3 and that knowledge base resource. Researchers have created new search pages that use information from gene, protein, and biochemical pathway data websites and that facilitate the transition from analyzing marker data on T3 to understanding the physiological functions of the genes the markers are close to.
Perhaps most importantly, and in response to wheat breeding student feedback, researchers have created and improved the set of tutorials for T3. With respect to the second problem of providing the broader scientific community with better access to the data in T3, researchers continue to support the Breeding application program interface, BrAPI (www.brapi.org). This API enables other databases or computer scripts to access T3 independently of its graphical user interface. In effect, it exposes all of the data on T3 and allows that data to be joined to data from other repositories. Thus, despite the modesty of our effort, researchers contribute to a larger whole. The largest data addition T3 made over this reporting period is the variant call format (VCF) file of the 1,000 exomes sequenced and published recently in Nature Genetics. Publishing that dataset has made subsetting it and handling it much easier for researchers and students looking at diversity in specific areas of the wheat genome. In terms of the overall functioning of T3, now that the first gold-standard wheat reference sequence is published, researchers are improving T3 by consistently using that sequence as the coordinate system for all analyses that relate to the genome (in effect, anything having to do with marker polymorphisms or genes). Thus, all genetic variants on T3 now have a name synonym that specifies the marker's position. This effort has made it possible to identify many polymorphisms that were typed on different marker systems but in fact tag the same variant. This work will also be important for imputation efforts. Researchers plan to be able to impute sequence-level variants on any wheat line that is assayed with sufficient density. Researchers recently explored the “Practical Haplotype Graph” from another ARS lab intensively for this purpose. Imputation is an ongoing effort, as described in the milestones of our project plan.
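The use of position-based name synonyms to find markers that tag the same variant can be sketched as follows. This is a hypothetical illustration only: the chromosome_position naming pattern, the marker names, and the tuple layout are assumptions made for the example and are not taken from the T3 schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (marker name, chromosome, base-pair position on the reference sequence)
Marker = Tuple[str, str, int]


def position_synonym(chrom: str, pos: int) -> str:
    """Build a synonym from reference coordinates (hypothetical pattern)."""
    return f"{chrom}_{pos}"


def group_by_position(markers: Iterable[Marker]) -> Dict[str, List[str]]:
    """Map each position synonym to the marker names sharing that position.

    Only positions shared by more than one marker are returned, since those
    are the candidates for markers from different platforms that tag the
    same variant.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, chrom, pos in markers:
        groups[position_synonym(chrom, pos)].append(name)
    return {syn: names for syn, names in groups.items() if len(names) > 1}


# Made-up marker records from two hypothetical genotyping platforms.
example_markers = [
    ("array_snp_1", "1A", 1000),
    ("gbs_snp_7", "1A", 1000),
    ("array_snp_2", "2B", 500),
]
shared = group_by_position(example_markers)
```

Markers that share a position are only candidates for the same variant; a fuller check would also compare the reported alleles and strand before merging their genotype calls.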
Researchers are also conscious that the most valuable data are still phenotypic data: traits measured in the field. But for those data to maintain their value, they need to remain connected to current breeding populations, either through pedigrees or through marker data. In some sense, while a given wheat line has a finite lifespan, the alleles that it carries maintain their relevance much longer. Thus, researchers continue to ask breeders for pedigrees (which also reach into the past, making past phenotypes relevant) and to encourage the assay and deposit of marker data for as many experimental breeding lines as possible. To that end, researchers started the public oat genotyping initiative, which scored hundreds of thousands of markers on just under 2,000 oat lines. In oat, researchers have also integrated the Pedigree of Oat Lines (POOL) database into T3.

Accomplishments
1. Statistical models to identify gene interactions in wheat. As a crop, wheat evolved through the hybridization of three component species, retaining the genomes of all of those species. In the development of wheat, genes interact across genomes, which is expected to have important effects on variety performance, and which, with the proper tools, breeders can manipulate to provide better varieties for farmers. Researchers in Ithaca, New York, developed and applied new statistical models to investigate gene interactions across and within chromosomes in wheat. With these models, it was shown that while interactions between genomes are important, they are less important than other interactions between genes. The new statistical models and the detailed knowledge of interactions in wheat that they enable are available to wheat researchers through documentation in three publications in the high-impact journals Genetics and G3: Genes, Genomes, Genetics. The ability to better identify gene interactions in wheat will help public and private wheat breeders manipulate them to accelerate improvement of this important crop, to the benefit of wheat farmers and the wheat-consuming public.

2. Genomic selection optimization for disease resistance in sub-Saharan Africa. Cassava is a major source of calories for over 700 million people, primarily in sub-Saharan Africa. Cassava is beset by cassava brown streak virus (CBSV), which is now limited to East Africa but threatens to jump to West Africa, where reliance on cassava is highest. In collaboration with researchers in Uganda and Nigeria, crop geneticists in Ithaca, New York, conducted a comprehensive study of breeding methods using cutting-edge DNA marker analyses to optimize disease resistance breeding in East Africa and plan for pre-emptive breeding for resistance in West Africa. The study showed that evaluations in cassava seedlings are predictive of performance for clonally propagated cassava. This finding opens the way for progeny testing of West African cassavas in East Africa to enable pre-emptive disease resistance breeding, an approach now adopted by the Nigerian National Root Crops Research Institute. While cassava brown streak virus has not yet reached West Africa, it is not a question of if but when. Our research has increased the chance that West African farmers will be ready with resistant varieties when the time comes.

3. Multi-trait analysis of oat seed composition traits. Oats are known for their health-promoting composition, including soluble dietary fiber, antioxidants, and healthful unsaturated oils. Many of these components are linked by biochemical pathways, so analyzing them all together in a multi-trait analysis can provide greater insight and better identification of genes controlling composition. Researchers in Ithaca, New York, genotyped a large diverse panel of oats and measured their seed composition in two environments. They contrasted single- and multi-trait analyses of these data and showed that the latter identified twice as many genes affecting composition as the former. Both the new types of analyses and the genetic positions of the loci affecting composition are now available to oat researchers and breeders to make more rapid progress. This accomplishment contributes to the Ithaca location's mandate to develop better quantitative methods in service of public sector small grains breeding. By facilitating the improvement of oat seed composition, this research maintains oat as a viable crop for farmers who use it to increase the sustainability of American agriculture.
