Research Project: Database Tools for Managing and Analyzing Big Data Sets to Enhance Small Grains Breeding

Location: Plant, Soil and Nutrition Research

2021 Annual Report


Objectives
Objective 1: Develop methods and analyses on the Triticeae Toolbox (T3) database that use data stored there to assign likelihood to genome segments of carrying trait-associated variants.
Sub-objective 1.A. Improve T3 upload, download, and quality control tools.
Sub-objective 1.B. Implement the Genomics and Open-source Breeding Informatics Initiative (GOBII) genotype data storage on T3.
Sub-objective 1.C. Automate imputation to high-density genotyping platforms.
Sub-objective 1.D. Automate genome-wide association study implementation.
Objective 2: Improve linkages between diversity data stored in T3 and knowledge gleaned from the literature based on biological experimentation.
Sub-objective 2.A. Develop new linkages with KNetMiner.
Sub-objective 2.B. Implement analyses to estimate between-trait genetic correlations using the whole database as the reference population.
Objective 3: Enhance T3 facilities to analyze and manage multi-omic data and data from multi-state cooperative nurseries.
Sub-objective 3.A. Functions for search and analysis of transcriptomic and metabolomic data.
Sub-objective 3.B. Clustering and prediction using multi-omic data.


Approach
ARS will develop text file input methods and will implement Mendelian error checking when both parents and an offspring have marker data. Upon upload of a high-dimensional phenotype dataset, a relationship matrix will be constructed from it and compared to the marker-based and pedigree-based relationship matrices. It will be important to scale each phenotype according to the information it carries about the genotype, namely, its heritability. The method will be developed and tested on both transcriptomic and metabolomic datasets. The Genomics and Open-source Breeding Informatics Initiative (GOBII, www.gobiiproject.org) genotype data management system will be incorporated using the Breeding application programming interface (BrAPI, www.brapi.org). For imputation, Beagle4 has been tested. We will collaborate with another ARS lab in bringing the Practical Haplotype Graph (PHG) to wheat. When new lines are uploaded to T3 with genotype data of adequate density, they will be imputed. Genome-wide association study (GWAS) analyses using imputed scores will take the reliability of those scores into account. For traits assayed in multiple trials, results will be combined by meta-analysis. Genes will be sorted by cumulative evidence of association, automated links will be made to external databases, and we will populate a JBrowse track with GWAS hits. T3 users will want to access the KNetMiner network after an association analysis in T3: once a user has identified a variant associated with a trait, KNetMiner will provide access to information from the literature about it. KNetMiner has developed a beta application programming interface (API) that takes a gene and a trait and displays the knowledge network connected to them; this API will be used. Traits will be linked by co-located associations. In a focal dataset, users will query all associations that pass a user-defined threshold.
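The Mendelian error check mentioned at the start of this approach can be illustrated with a minimal sketch. It is not the T3 implementation. It assumes biallelic markers coded as alternate-allele dosage (0, 1, or 2) with None for a missing call, and the function names and example data are hypothetical.

```python
from typing import Dict, List, Optional

# Alternate-allele counts a parent can transmit, keyed by its genotype dosage.
GAMETES = {0: {0}, 1: {0, 1}, 2: {1}}

Genotypes = Dict[str, Optional[int]]  # marker name -> dosage, or None if missing


def is_mendelian_error(parent1: Optional[int], parent2: Optional[int],
                       offspring: Optional[int]) -> bool:
    """Return True if the offspring dosage cannot arise from the two parents.

    Any missing call makes the trio uninformative, so it is not flagged.
    """
    if parent1 is None or parent2 is None or offspring is None:
        return False
    possible = {a + b for a in GAMETES[parent1] for b in GAMETES[parent2]}
    return offspring not in possible


def trio_errors(parent1: Genotypes, parent2: Genotypes,
                offspring: Genotypes) -> List[str]:
    """List markers, in offspring order, that show a Mendelian inconsistency."""
    return [
        marker for marker, call in offspring.items()
        if marker in parent1 and marker in parent2
        and is_mendelian_error(parent1[marker], parent2[marker], call)
    ]


# Hypothetical example data for one trio.
mother = {"m1": 0, "m2": 2, "m3": 1, "m4": 0}
father = {"m1": 0, "m2": 2, "m3": 1, "m4": 2}
child = {"m1": 1, "m2": 2, "m3": 0, "m4": None}

errors = trio_errors(mother, father, child)  # only "m1" is inconsistent
```

A check of this kind only applies to markers where both parents and the offspring have calls, which is why missing data is skipped rather than counted as an error.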
Associations in prior and focal datasets will be ranked by the physical distance between them and presented to enable the user to determine which traits to link. Traits will also be linked by the overall genetic correlation between them, estimated by correlating genomic predictions with traits measured in the focal dataset. Expression data for tens of thousands of genes will be stored in “materialized view” tables. JBrowse tracks will be created allowing gene expression of sets of individuals to be displayed. Clicking on a transcript will open a window with a link back to T3 enabling the selection of the transcript as a phenotype. The transcriptome sequences will be added as a T3 BLAST database. The challenge of metabolomics is that most metabolites detected in mass spectrometry (MS) experiments are of unknown chemical composition. Metabolomic databases other than T3 allow metabolite identities to be explored. Metabolomic data will be stored in formats compatible with those databases to enable sharing. Users will be able to link back to T3 from them. As with gene expression, metabolites will be searchable based on the genetic correlation of their levels with other phenotypes.
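The linking of traits by co-located associations described above can be sketched as follows. This is an illustration under stated assumptions, not the T3 implementation: the Association record, the rank_colocated function, and the example values are hypothetical, the threshold is applied to p-values in the focal dataset only, and prior associations are assumed to have been filtered already.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Association:
    trait: str
    chrom: str
    pos: int        # physical position in base pairs
    p_value: float


def rank_colocated(focal: List[Association], prior: List[Association],
                   threshold: float) -> List[Tuple[int, str, str]]:
    """Pair focal hits with prior hits on the same chromosome.

    Returns (distance_bp, focal_trait, prior_trait) tuples, nearest first.
    """
    focal_hits = [a for a in focal if a.p_value <= threshold]
    pairs = [
        (abs(f.pos - p.pos), f.trait, p.trait)
        for f in focal_hits
        for p in prior
        if p.chrom == f.chrom
    ]
    return sorted(pairs)


# Hypothetical example data.
focal = [
    Association("heading date", "2B", 1000, 1e-8),
    Association("heading date", "5A", 500, 0.2),    # fails the threshold
]
prior = [
    Association("plant height", "2B", 1500, 1e-6),
    Association("grain yield", "2B", 9000, 1e-7),
    Association("test weight", "3D", 1000, 1e-9),   # different chromosome
]

ranked = rank_colocated(focal, prior, threshold=1e-5)
```

Presenting the ranked list, rather than applying a fixed distance cutoff, leaves the decision about which traits to link with the user, as the approach describes.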


Progress Report
The milestone summaries for this Annual Report are admittedly disappointing. Nevertheless, our team's sense is that the Triticeae Toolbox (T3) has never been more useful to public sector small grains breeders in the United States, who are our primary stakeholders. In the previous reporting period, we discussed the move of T3 from a code base that we had developed over the years to the Breedbase codebase. Breedbase is funded by the Bill and Melinda Gates Foundation, is used on multiple crops, and receives development support from a large team. Consequently, it has many features that are useful to breeders. The move is now complete and we think it was beneficial to our stakeholders. However, we did not anticipate the move when we developed the five-year Project Plan that these annual reports are anchored to, and it has not come without cost to other objectives in the Project Plan. To ensure the value of T3 to the breeding community, we have installed three separate instances of T3 that are being used directly by breeders at USDA-ARS in Manhattan, Kansas; University of Illinois in Champaign-Urbana, Illinois; and Virginia Tech in Blacksburg, Virginia. We regularly get feedback from these beta-tester users. Responding to this feedback has increased the functionality of our database, but it also takes time. One consistent subject is the need for better seedlot management within the software. We have made improvements to seedlot features within T3 and have also submitted a grant proposal to develop this functionality at scale. We have not lost track of our original objectives. We still believe the functionalities we discussed almost four years ago in writing up the Project Proposal will help breeders extract more information out of the phenotypic and genotypic data that they invest in collecting.
The functionalities include genome-wide association study (GWAS) on imputed data, estimation of correlation coefficients across high-dimensional phenotypic traits, and filtering, curating, and estimating relationships using high-dimensional phenotypes. We have laid foundations to implement these functionalities. Imputation itself has proved surprisingly challenging, but we are close to having it and being able to use it for GWAS. We have developed a new, efficient data structure for storing high-dimensional phenotypes, including transcriptomics and metabolomics. We will use this feature in downstream analyses of correlation coefficients and estimation of relationship coefficients. We feel positive about the improvements we have brought to the Breedbase codebase that T3 has adopted, and about our prospects for making it better, first for our small grains breeder stakeholders and then for breeders of the other crops that Breedbase serves. We have continued the work of curating and making available to researchers data from public-sector small grains breeding programs across the nation. Across wheat, oat, and barley, T3 stores 6,400 trials encompassing 2,300,000 phenotypic data points on over 37,000 lines with marker data, respective increases of 800 trials, 500,000 phenotypic data points, and 7,000 genotyped lines. These data represent a significant resource for discovering genomic segments affecting traits and testing genomic hypotheses.


Accomplishments


Review Publications
Veenstra, L.D., Poland, J., Jannink, J., Sorrells, M.E. 2020. Recurrent genomic selection for wheat grain fructans. Crop Science. 60(3):1499-1512. https://doi.org/10.1002/csc2.20130.
Yonis, B., Pino Del Carpio, D., Wolfe, M., Jannink, J., Kulakow, P., Ismail, R. 2020. Improving root characterisation for genomic prediction in cassava. Scientific Reports. https://doi.org/10.1038/s41598-020-64963-9.
Morais, P., Akdemir, D., Rogerio Braatz De Andrade, L., Jannink, J., Fritsche-Neto, R., Borem, A., Alvez, F.C., Lyra, D.H., Granato, I.S. 2020. Using public databases for genomic prediction of tropical maize lines. Plant Breeding. 139(4):697-707. https://doi.org/10.1111/pbr.12827.
Ikeogu, U.N., Akdemir, D., Wolfe, M.D., Okeke, U.G., Chinedozi, A., Jannink, J., Egesi, C.N. 2019. Genetic correlation, genome-wide association and genomic prediction of portable NIRS predicted carotenoids in cassava roots. Frontiers in Plant Science. https://doi.org/10.3389/fpls.2019.01570.
Kaya, H.B., Akdemir, D., Lozano, R., Cetin, O., Kaya, H.S., Sahin, M., Smith, J.L., Bahattin, Y., Jannink, J. 2019. Genome wide association study of 5 agronomic traits in olive (Olea europaea L.). Scientific Reports. 9:18764. https://doi.org/10.1038/s41598-019-55338-w.
Jordan, K., Bradbury, P., Miller, Z., Nyine, M., He, F., Guttieri, M.J., Brown Guedira, G.L., Buckler IV, E.S., Jannink, J., Akhunov, E., Ward, B.P., Bai, G., Bowden, R.L., Fiedler, J.D., Faris, J.D. 2021. Development of the Wheat Practical Haplotype Graph Database as a Resource for Genotyping Data Storage and Genotype Imputation. G3 Genes/Genomes/Genetics. https://doi.org/10.1101/2021.06.10.447944.
Somo, M., Kulembeka, H., Mtunda, K., Mrema, E., Salum, K., Wolfe, M., Rabbi, I., Egesi, C., Kawuki, R., Jannink, J., Ozimati, A., Lozano, R. 2020. Genomic prediction and quantitative trait locus discovery in a cassava training population constructed from multiple breeding stages. Crop Science. 60(2):896-913. https://doi.org/10.1002/csc2.20003.
Wolfe, M.D., Bauchet, G.J., Chan, A.W., Lozano, R., Ramu, P., Egesi, C., Kawuki, R., Kulakow, P., Rabbi, I., Jannink, J. 2019. Historical introgressions from a wild relative of modern cassava improved important traits and may be under balancing selection. Genetics. 213(4):1237-1253. https://doi.org/10.1534/genetics.119.302757.
Mao, X., Augyte, S., Huang, M., Hare, M.P., Bailey, D., Umanzor, S., Marty-Rivera, M., Robbins, K.R., Yarish, C., Lindell, S., Jannink, J. 2020. Population genetics of sugar kelp in the Northwest Atlantic region using genome-wide markers. Frontiers in Marine Science. 7:694. https://doi.org/10.3389/fmars.2020.00694.
Rabbi, I., Kayondo, S., Bauchet, G., Yusuf, M., Aghogho, C., Ogunpaimo, K., Uwugiaren, R., Smith, I., Peteri, P., Agbona, A., Parkes, E., Lydia, E., Wolfe, M., Jannink, J., Egesi, C., Kulakow, P. 2020. Genome-wide association analysis reveals new insights into the genetic architecture of defensive, agro-morphological and quality-related traits in cassava. Plant Molecular Biology. https://doi.org/10.1007/s11103-020-01038-3.