
Research Project: Database Tools for Managing and Analyzing Big Data Sets to Enhance Small Grains Breeding

Location: Plant, Soil and Nutrition Research

2020 Annual Report

Objective 1: Develop methods and analyses on the Triticeae Toolbox (T3) database that use data stored there to assign likelihood to genome segments of carrying trait associated variants.
Sub-objective 1.A. Improve T3 upload, download, and quality control tools.
Sub-objective 1.B. Implement the Genomics and Open-source Breeding Informatics Initiative (GOBII) genotype data storage on T3.
Sub-objective 1.C. Automate imputation to high-density genotyping platforms.
Sub-objective 1.D. Automate genome-wide association study implementation.
Objective 2: Improve linkages between diversity data stored in T3 and knowledge gleaned from the literature based on biological experimentation.
Sub-objective 2.A. Develop new linkages with KNetMiner.
Sub-objective 2.B. Implement analyses to estimate between-trait genetic correlations using the whole database as the reference population.
Objective 3: Enhance T3 facilities to analyze and manage multi-omic data and data from multi-state cooperative nurseries.
Sub-objective 3.A. Functions for search and analysis of transcriptomic and metabolomic data.
Sub-objective 3.B. Clustering and prediction using multi-omic data.

ARS will develop text file input methods and will implement Mendelian error checking when both parents and an offspring have marker data. Upon upload of a high-dimensional phenotype dataset, a relationship matrix will be constructed from it and compared to the marker-based and pedigree-based relationship matrices. It will be important to scale each phenotype according to the information it carries about the genotype, namely, its heritability. The method will be developed and tested on both transcriptomic and metabolomic datasets. The Genomics and Open-source Breeding Informatics Initiative (GOBII) genotype data management system will be incorporated using the Breeding application program interface (BrAPI). For imputation, Beagle4 has been tested. We will collaborate with another ARS lab in bringing the Practical Haplotype Graph (PHG) to wheat. When new lines are uploaded to T3 with genotype data of adequate density, they will be imputed. Genome-wide association study (GWAS) analyses using imputed scores will take the reliability of those scores into account. For traits assayed in multiple trials, results will be combined by meta-analysis. Genes will be sorted by cumulative evidence of association, automated links will be made to external databases, and we will populate a JBrowse track with GWAS hits. T3 users will want to access the KNetMiner network after an association analysis in T3: once a user has identified a variant associated with a trait, KNetMiner will provide access to information from the literature about it. KNetMiner has developed a beta application program interface (API) that takes a gene and a trait and displays the knowledge network connected to them; this API will be used. Traits will be linked by co-located associations. In a focal dataset, users will query all associations that pass a user-defined threshold. Physical distances between associations in prior and focal datasets will be ranked and presented to enable the user to determine which traits to link.
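The co-location ranking described above can be illustrated with a minimal Python sketch. It is not T3 code: the `Association` record, the `rank_colocated` function, and the default threshold are hypothetical names and values chosen for illustration, and the sketch assumes that prior associations are already stored as significant hits on the same reference coordinates as the focal dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """One marker-trait association (hypothetical record layout)."""
    trait: str
    chrom: str
    pos: int        # physical position in base pairs
    p_value: float


def rank_colocated(focal, prior, p_threshold=1e-5):
    """Pair focal hits with prior hits on the same chromosome.

    Only focal associations passing the user-defined threshold are kept.
    Returns (distance_bp, focal_hit, prior_hit) tuples, closest first.
    """
    focal_hits = [a for a in focal if a.p_value <= p_threshold]
    pairs = [
        (abs(f.pos - p.pos), f, p)
        for f in focal_hits
        for p in prior
        if f.chrom == p.chrom
    ]
    # Sort on distance only, so ties keep their input order.
    return sorted(pairs, key=lambda pair: pair[0])


# Illustrative data only; these are not results from T3.
focal = [
    Association("heading date", "2B", 1000, 1e-8),
    Association("plant height", "4A", 500, 0.2),   # fails the threshold
]
prior = [
    Association("grain yield", "2B", 1500, 1e-6),
    Association("grain protein", "2B", 9000, 1e-7),
    Association("leaf rust", "4A", 510, 1e-9),
]
for distance, f, p in rank_colocated(focal, prior):
    print(f"{f.trait} <-> {p.trait}: {distance} bp apart on {f.chrom}")
```

A user would then scan the ranked list from the top and choose which prior traits are close enough to link to the focal trait.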
Traits will also be linked by the overall genetic correlation between them, estimated by correlating genomic predictions with traits measured in the focal dataset. Expression data for tens of thousands of genes will be stored in “materialized view” tables. JBrowse tracks will be created that allow gene expression of sets of individuals to be displayed. Clicking on a transcript will open a window with a link back to T3, enabling the selection of the transcript as a phenotype. The transcriptome sequences will be added as a T3 BLAST database. The challenge of metabolomics is that most metabolites detected in mass spectrometry (MS) experiments are of unknown chemical composition. Metabolomic databases other than T3 allow metabolite identities to be explored. Metabolomic data will be stored in formats compatible with those databases to enable sharing. Users will be able to link back to T3 from them. As with gene expression, metabolites will be searchable based on the genetic correlation of their levels with other phenotypes.
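The Mendelian error check proposed at the start of this approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the T3 implementation: markers are assumed to be biallelic and coded as alternate-allele dosage (0, 1, 2, or None for missing), the offspring is assumed to be a direct cross of the two parents, and the function names are hypothetical. An inbred line derived from the cross by several generations of selfing would need a looser rule.

```python
# Alleles (0 = reference, 1 = alternate) that a parent with a given
# dosage can transmit to its offspring.
GAMETES = {0: (0,), 1: (0, 1), 2: (1,)}


def is_mendelian_error(parent1, parent2, child):
    """Return True if the child's dosage is impossible given both parents.

    Any missing call (None) makes the trio uninformative, so no error
    is reported for it.
    """
    if None in (parent1, parent2, child):
        return False
    possible = {a + b for a in GAMETES[parent1] for b in GAMETES[parent2]}
    return child not in possible


def mendelian_errors(parent1_calls, parent2_calls, child_calls):
    """List the markers at which the child conflicts with its parents.

    Each argument maps marker name -> dosage; markers missing from a
    parent are treated as missing calls.
    """
    return [
        marker
        for marker, dosage in child_calls.items()
        if is_mendelian_error(
            parent1_calls.get(marker), parent2_calls.get(marker), dosage
        )
    ]


# Illustrative calls only.
p1 = {"m1": 0, "m2": 2, "m3": 1, "m4": 0}
p2 = {"m1": 0, "m2": 2, "m3": 1, "m4": None}
child = {"m1": 1, "m2": 2, "m3": 2, "m4": 2}
print(mendelian_errors(p1, p2, child))
```

A high error rate across markers would point to a mislabeled sample or an incorrect pedigree rather than to isolated genotyping errors.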

Progress Report
This year has been dominated by two efforts: the first, to transition The Triticeae Toolbox (T3) to a different code base, that of BreedBase; the second, to get a specific DNA marker imputation engine, the Practical Haplotype Graph (PHG), up and running. The year ends without fully accomplishing either effort, though important strides have been made in both cases and early evidence shows that the efforts will be valuable in the end.
Transitioning to BreedBase. The rationale is two-fold:
1. BreedBase implements tools useful for the process of breeding that the previous T3 code base did not, for example, barcode management, seed inventory management, more experimental design options, and improved connectivity with mobile phenotype recording applications. We believe that these tools will make BreedBase more attractive to small grains breeders in North America.
2. BreedBase is supported by more developers than T3’s previous code base. The development leader is a colleague at the Boyce Thompson Institute, working with three full-time and two half-time developers. Importantly, the USDA-ARS Breeding Insight project funded through Cornell is also using BreedBase as its primary code base and is bringing along five more full-time developers. BreedBase also collaborates with a project at the University of Tokyo that is developing software tools for small private Japanese breeders. Adding our own strength to this consortium is building a strong critical mass for development of this code base.
We have evidence that our transition to BreedBase will bear fruit in helping small breeding programs that lack strong IT infrastructure to benefit from breeding informatics. In particular, we are collaborating with three wheat breeding programs, located at USDA-ARS, Kansas; the University of Illinois; and Virginia Tech, that are actively using instances of T3/Wheat in their breeding programs.
We have set up independent instances of the database for each program. These independent instances will be able to submit data directly to T3 with a simple authorization from the breeders. Such an approach will greatly facilitate the accumulation and sharing of data through T3, while also providing efficient breeding data management tools to breeders.
While we believe that our approach will go a long way toward meeting the two objectives of providing informatics tools to breeders and enabling large-scale joint analyses on T3, the transition has not been without difficulties. An advantage of BreedBase is that it uses interoperable trait definitions in the form of ontologies. That is beneficial for sharing data but has required a painstaking translation of all traits present in the previous T3 code base to common ontologies. While BreedBase is well organized for breeding, the original T3 code base also hosted a number of features useful for research, and we have had to port those features over to BreedBase individually. Finally, the sheer quantity of data existing on T3 and needing to be ported over has been a challenge. Yet we are on the cusp of having wheat, barley, and oat all supported by a T3/BreedBase instance.
Connecting Historical Data to Contemporary Data - Across wheat, oat, and barley, T3 stores 5,600 trials encompassing 1,800,000 phenotypic data points on over 30,000 lines with marker data. This data represents a significant resource for discovering genomic segments affecting traits and testing genomic hypotheses. A challenge is to connect this historical data to currently relevant breeding germplasm: the wheat, oat, and barley lines represented in the database are often no longer present in the most elite breeding populations. They do, however, carry alleles that still segregate in current populations. Thus, the T3 mandate is to transform the unit of evaluation from the traditional breeding line to the allele.
To accomplish this transformation, we need marker genotypes for evaluated lines over a consistent set of markers. In turn, since different breeding programs often use different marker platforms, we need an imputation system that predicts marker scores on a uniform set of markers for all lines, regardless of the marker platform used for their genotyping. The imputation system we have chosen for this work is the Practical Haplotype Graph (PHG), under active development in another ARS laboratory. The PHG has three distinct advantages from our point of view:
1. The reference for the PHG is a gold-standard reference sequence that the community will agree on. Thus, it simplifies exporting results in a coordinate system that will be appropriate for all users.
2. The PHG identifies and groups multiple variable markers into single haplotypes. Those haplotypes, in turn, are uniquely identified. This approach compresses the data, making it possible to efficiently store whole-genome level polymorphisms for all individuals in the database.
3. The haplotypes that the PHG defines can be more easily traced through pedigrees than single markers. They therefore also explicitly help to connect historical data with current populations.
This year we worked with the other ARS laboratory to host a hackathon in February at Cornell to plan and work on creating a PHG for wheat. We discussed the work done at several labs on this effort. We worked with the developers in the Buckler Lab to add features that will speed up the PHG functions. A major challenge with small grains is their very large genome sizes (5.3, 12.3, and 17 Gbp for barley, oat, and wheat, respectively). These sizes mean that genome-wide computations of any sort in these species are slow, and efforts to accelerate them are key. The use of high-performance computing is also important, and we are exploring SCINet to host imputation functions for T3.
Other contributions - The Pedigree Of Oat Lines (POOL) database has been a long-standing, extensively curated resource for oat pedigrees. The data in that database has now been transferred over. We have also developed an accession synonym search tool. As any breeder will tell you, accessing historical data is made more difficult by the fact that there are often many small variants on germplasm names, in addition to the fact that breeding lines usually change names when they become varieties. This tool alleviates the difficulty for T3 crops.
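To illustrate the kind of matching a synonym search has to perform, here is a minimal Python sketch. It is not the T3 tool: the normalization rule (lower-casing and dropping everything except letters and digits), the function names, and the accession names in the example are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Lower-case a germplasm name and drop all but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_synonym_index(accessions):
    """Map each normalized name or synonym to its accession name(s).

    `accessions` maps a preferred accession name to a list of synonyms,
    such as earlier breeding-line designations.
    """
    index = defaultdict(set)
    for accession, synonyms in accessions.items():
        for name in (accession, *synonyms):
            index[normalize_name(name)].add(accession)
    return index


def find_accessions(query, index):
    """Return the accessions whose name or synonym matches the query."""
    return sorted(index.get(normalize_name(query), set()))


# Hypothetical accessions; a variety name with its breeding-line synonym.
accessions = {
    "Goldcrest": ["XX 1234-5"],
    "Bluefield": ["YY-77"],
}
index = build_synonym_index(accessions)
print(find_accessions("xx1234_5", index))    # spelling variant of a synonym
print(find_accessions("GOLD CREST", index))  # spelling variant of the name
```

Exact matching after normalization handles differences in spacing, punctuation, and capitalization. Typographical variants would need an additional fuzzy-matching step, which this sketch does not attempt.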

1. Historical introgressions from a wild relative of modern cassava affect multiple traits. Introgression of alleles from a wild cassava relative, made initially to provide disease resistance, has been assumed to be adaptive in modern cassava breeding. However, a complete assessment of the effects of such introgressions has not been available. In a large panel of genotyped cassava, we identified DNA markers that were diagnostic of introgression from the wild relative Manihot glaziovii. We found significant effects of introgressions on dry matter content, root number, disease resistance, and harvest index. We also found that clones homozygous for introgressions tend to be eliminated by selection, suggesting these introgressions provide heterozygous advantage. Heterozygous advantage and suppressed recombination, however, may have increased the accumulation of deleterious mutations in these introgressions. Findings from this study have increased breeders' motivation to generate and evaluate recombinations in the introgression blocks we detected.

2. Heritable temporal gene expression patterns correlate with metabolomic seed content in developing hexaploid oat seed. Oat is prized for its healthful seed composition, but breeders have few genomic tools to rapidly modify it. As a proof of concept that gene transcription levels and metabolomics can provide such tools to breeders, we quantified gene expression during seed development from 22 diverse lines across six time points and subjected their mature seeds to untargeted mass spectrometry metabolomics. We showed that transcripts could be grouped by the temporal dynamics of their expression and by the impact of oat line on expression. A majority of such groups showed high heritability of gene expression, providing a first genomic tool to breeders. We further found that metabolite levels of mature seeds correlated more strongly with gene expression levels in these groups than expected under a null model. These results pave the way for the use of transcriptomics and metabolomics to identify genomic segments that have a large influence on oat seed composition, an approach we are currently applying.

3. Improving root characterization for genomic prediction in cassava. Cassava is cultivated for its drought tolerance and its carbohydrate-rich storage roots. The lack of uniformity and the irregular shape of storage roots pose constraints on harvesting and postharvest processing. We performed image analysis of the cassava storage roots of a large breeding population at the International Institute of Tropical Agriculture in Nigeria. This population was also genotyped at high density. We identified genomic segments affecting many aspects of cassava root size and shape. Importantly, we also identified genomic segments affecting the variation within a clone for root shape. Mechanical harvest and processing of cassava roots make uniformity of size and shape more important. This proof of concept showed that image-based phenotyping and genomic-aided selection can improve these traits.

Review Publications
Yabe, S., Iwata, H., Jannink, J. 2018. Impact of mislabeling on genomic selection in cassava breeding. Crop Science. 58:1470-1480. doi: 10.2135/cropsci2017.07.0442
Lozano, R., Booth, G.T., Omar, B.Y., Li, B., Buckler IV, E.S., Lis, J.T., Jannink, J., Pino Del Carpio, D. 2018. RNA polymerase mapping in plants identifies enhancers enriched in causal variants. bioRxiv.
Clohessy, J.W., Pauli, D., Kreher, K.M., Buckler IV, E.S., Armstrong, P.R., Wu, T., Hoekenga, O.A., Jannink, J., Sorrells, M.E., Gore, M.A. 2018. A low-cost automated system for high-throughput phenotyping of single oat seeds. The Plant Phenome Journal. 1(1):1-13.
Hu, H., Gutierrez-Gonzalez, J.L., Liu, X., Yeats, T.H., Garvin, D.F., Hoekenga, O.A., Sorrells, M.E., Gore, M.A., Jannink, J. 2020. A new oat seed transcriptome identifies heritable temporal gene expression patterns in developing seeds of hexaploid oat. Plant Biotechnology Journal. 18(5):1211-1222.
Blake, V.C., Woodhouse, M.R., Lazo, G.R., Odell, S.G., Wight, C.W., Tinker, N.A., Wang, Y., Gu, Y.Q., Birkett, C.L., Jannink, J., Matthews, D.E., Hane, D.L., Michel, S.L., Yao, E., Sen, T.Z. 2019. GrainGenes: centralized small grain resources and digital platform for geneticists and breeders. Database: The Journal of Biological Databases and Curation. 2019.
Santantonio, N., Jannink, J., Sorrells, M. 2019. Prediction of subgenome additive and interaction effects in allohexaploid wheat. G3: Genes, Genomes, Genetics. 9(3):685-698.