ARS Home » Northeast Area » Ithaca, New York » Robert W. Holley Center for Agriculture & Health » Plant, Soil and Nutrition Research » Research » Research Project #434608

Research Project: Database Tools for Managing and Analyzing Big Data Sets to Enhance Small Grains Breeding

Location: Plant, Soil and Nutrition Research

2022 Annual Report


Objectives
Objective 1: Develop methods and analyses on the Triticeae Toolbox (T3) database that use data stored there to assign likelihood to genome segments of carrying trait associated variants.
Sub-objective 1.A. Improve T3 upload, download, and quality control tools.
Sub-objective 1.B. Implement the Genomics and Open-source Breeding Informatics Initiative (GOBII) genotype data storage on T3.
Sub-objective 1.C. Automate imputation to high-density genotyping platforms.
Sub-objective 1.D. Automate genome-wide association study implementation.

Objective 2: Improve linkages between diversity data stored in T3 and knowledge gleaned from the literature based on biological experimentation.
Sub-objective 2.A. Develop new linkages with KNetMiner.
Sub-objective 2.B. Implement analyses to estimate between-trait genetic correlations using the whole database as the reference population.

Objective 3: Enhance T3 facilities to analyze and manage multi-omic data and data from multi-state cooperative nurseries.
Sub-objective 3.A. Functions for search and analysis of transcriptomic and metabolomic data.
Sub-objective 3.B. Clustering and prediction using multi-omic data.


Approach
ARS will develop text file input methods and will implement Mendelian error checking when both parents and an offspring have marker data. Upon upload of a high-dimensional phenotype dataset, a relationship matrix will be constructed from it and compared to the marker-based and pedigree-based relationship matrices. It will be important to scale each phenotype according to the information it carries about the genotype, namely, its heritability. The method will be developed and tested on both transcriptomic and metabolomic datasets. The Genomics and Open-source Breeding Informatics Initiative (GOBII, www.gobiiproject.org) genotype data management system will be incorporated using the Breeding application program interface (BrAPI, www.brapi.org). For imputation, Beagle4 has been tested. We will collaborate with another ARS lab in bringing the Practical Haplotype Graph (PHG) to wheat. When new lines are uploaded to T3 with genotype data of adequate density, they will be imputed. Genome-wide association study (GWAS) analyses using imputed scores will take the reliability of those scores into account. For traits assayed in multiple trials, results will be combined by meta-analysis. Genes will be sorted by cumulative evidence of association, automated links will be made to external databases, and we will populate a JBrowse track with GWAS hits. T3 users will want to access the KNetMiner network after an association analysis in T3: once a user has identified a variant associated with a trait, KNetMiner will provide access to information from the literature about it. KNetMiner has developed a beta application program interface (API) that takes a gene and a trait and displays the knowledge network connected to them; this API will be used. Traits will be linked by co-located associations. In a focal dataset, users will query all associations that pass a user-defined threshold.
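The Mendelian error check mentioned at the start of this Approach can be illustrated with a minimal sketch. It is not the T3 implementation. It assumes biallelic markers coded as the count of alternate alleles (0, 1, or 2, with None for a missing call) and assumes the offspring is the direct progeny of the two parents; lines derived through several generations of selfing would need a looser rule. All function names are hypothetical.

```python
from typing import List, Optional, Sequence, Set

# Assumed coding (illustrative, not T3's storage format): each call is the
# count of alternate alleles, 0, 1, or 2; None marks a missing call.
Call = Optional[int]


def transmissible_alleles(genotype: int) -> Set[int]:
    """Alleles (0 = reference, 1 = alternate) a diploid parent can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_error(parent1: Call, parent2: Call, offspring: Call) -> Optional[bool]:
    """Return True if the offspring call cannot arise from one allele per parent.

    Returns None when any of the three calls is missing, because the check
    only applies when both parents and the offspring have marker data.
    """
    if parent1 is None or parent2 is None or offspring is None:
        return None
    possible = {
        a + b
        for a in transmissible_alleles(parent1)
        for b in transmissible_alleles(parent2)
    }
    return offspring not in possible


def mendelian_error_markers(
    parent1_calls: Sequence[Call],
    parent2_calls: Sequence[Call],
    offspring_calls: Sequence[Call],
) -> List[int]:
    """Return the indices of markers at which the trio is inconsistent."""
    flagged = []
    for index, trio in enumerate(zip(parent1_calls, parent2_calls, offspring_calls)):
        if is_mendelian_error(*trio):  # None (missing data) is not flagged
            flagged.append(index)
    return flagged
```

For example, with parent calls `[0, 2, 1, 0, None]` and `[0, 0, 1, 2, 1]` and offspring calls `[1, 1, 2, 1, 0]`, only the first marker is flagged: two homozygous reference parents cannot produce a heterozygous offspring, and the last marker is skipped because one parent has no call.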
Physical distance between associations in prior and focal datasets will be ranked and presented to enable the user to determine which traits she wants to link to. Traits will also be linked by the overall genetic correlation between them, estimated by correlating genomic predictions with traits measured in the focal dataset. Expression data for tens of thousands of genes will be stored in “materialized view” tables. JBrowse tracks will be created allowing gene expression of sets of individuals to be displayed. Clicking on a transcript will open a window with a link back to T3 enabling the selection of the transcript as a phenotype. The transcriptome sequences will be added as a T3 BLAST database. The challenge of metabolomics is that most metabolites detected in mass spectrometry (MS) experiments are of unknown chemical composition. Metabolomic databases other than T3 allow metabolite identities to be explored. Metabolomic data will be stored in formats compatible with those databases to enable sharing. Users will be able to link back to T3 from them. As with gene expression, metabolites will be searchable based on the genetic correlation of their levels with other phenotypes.
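The linking of traits by co-located associations described in this Approach (filter focal associations by a user-defined threshold, then rank prior associations by physical distance) might be sketched as follows. This is an illustration under stated assumptions, not T3 code: the `Association` record, the `rank_colocated` function, the use of a p-value as the threshold statistic, and the optional `max_distance` cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Association:
    """One marker-trait association (hypothetical record layout)."""
    trait: str
    chromosome: str
    position: int      # physical position in base pairs
    p_value: float


def rank_colocated(
    focal: Sequence[Association],
    prior: Sequence[Association],
    p_threshold: float,
    max_distance: Optional[int] = None,
) -> List[Tuple[int, Association, Association]]:
    """Pair focal hits with prior associations on the same chromosome.

    Returns (distance_bp, focal_hit, prior_association) tuples sorted from
    the closest pair to the most distant one.
    """
    # Keep only focal associations that pass the user-defined threshold.
    focal_hits = [a for a in focal if a.p_value <= p_threshold]

    pairs = []
    for hit in focal_hits:
        for candidate in prior:
            if candidate.chromosome != hit.chromosome:
                continue  # physical distance is undefined across chromosomes
            distance = abs(candidate.position - hit.position)
            if max_distance is not None and distance > max_distance:
                continue
            pairs.append((distance, hit, candidate))

    # Sort on distance only; the stable sort keeps input order for ties.
    pairs.sort(key=lambda pair: pair[0])
    return pairs
```

A user could then scan the ranked list and choose which prior traits to link to the focal trait, for instance by taking every prior trait whose closest association lies within a chosen window.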


Progress Report
The milestone summaries for this Annual Report are disappointing. There are two primary culprits. First, in our project, there has been a mismatch between the resources we have available to work on T3 and the resources we have available through grants to work on statistical genetic and breeding projects that will contribute to T3 over the long term but are peripheral to it over the short term. Evidence of this mismatch is that the project overall has contributed a good deal over the past year (see Accomplishments below). These accomplishments, however, have not translated directly into improvements in T3 features, as was hoped when the Project Plan from which the Milestones were taken was written five years ago. The second culprit was simply overly ambitious Milestones in the Project Plan. It has become evident that research in an area and its application to a single dataset is a far cry from the robust code needed to implement a feature in a database that stores many heterogeneous datasets. We hope these lessons will contribute to a more successful Project Plan for the coming five-year period. This Report nevertheless comes with a sense from our team that The Triticeae Toolbox (T3) has never been more useful to public sector small grains breeders in the United States, our primary stakeholders. Evidence of this usefulness is the retention of T3 by the Wheat Coordinated Agricultural Project (WheatCAP) as the primary mechanism for sharing data among WheatCAP breeders and for the eventual dissemination of these data to wheat researchers globally. This collaboration with WheatCAP has the potential to move T3 forward dramatically, given the feedback we will get from breeders on T3 interfaces and the greater familiarity breeders will have with T3 functions generally. This collaboration will lead to both increased ease of use and increased use of T3 as a service. Through the greater use of T3, we now get more regular feedback.
We need to respond to this feedback quickly, given that it usually pertains to bottlenecks that impede breeder use of the service. Perhaps obviously, the feedback rarely coincides with features represented in our Project Plan from years past. Our main activities, therefore, frequently do not contribute to the achievement of Milestones. We have not lost track of our original Objectives. We still believe the functionalities we discussed five years ago in writing up the Project Proposal will help breeders extract more information out of the phenotypic and genotypic data that they invest in collecting. The functionalities include genome-wide association study (GWAS) on imputed data, and filtering, curating, and estimating relationships using high-dimensional phenotypes. We have laid foundations to implement these functionalities. Imputation itself has proved surprisingly challenging. Using software from the Buckler lab (the Practical Haplotype Graph), we are close to having imputation working and being able to use it for GWAS. We have developed a new, efficient data structure for storing high-dimensional phenotypes, including transcriptomics and metabolomics. We feel positive about the improvements we have brought to the Breedbase codebase that T3 has adopted, and about our prospects for making it better, first for our small grains breeder stakeholders and then for the breeders of a number of other crops that Breedbase serves. We have continued the work of curating and making available to researchers data from public-sector small grains breeding programs across the nation. For example, across cooperative nurseries alone (trials of elite lines that are tested in multiple environments), T3 stores sixteen different cooperative nurseries, represented by a total of 2,800 trials (nursery * year * location combinations), with an average of 14 locations per trial and 11 years per trial.
Combined with available genotypes, this data represents a significant resource for discovering genomic segments affecting traits and testing genomic hypotheses.


Accomplishments
1. Global cassava research collaborations. We continue to collaborate with cassava researchers globally, improving experimental methodology in the area of genotype-by-environment interaction for cassava breeding and providing guides to causal loci in the cassava genome. This work has been ongoing for a number of years now. A major synthesis of these efforts is due, though it is not clear who might perform this work.

2. Genomic mate selection method development. Also stemming from our cassava work, but of broader impact for multiple crops, we have developed methods to select pairs of individuals to cross in species that are outcrossing and therefore heterozygous, and we have made these methods available through well-documented software. We believe that these methods should play an important role going forward in optimizing breeding for both short- and long-term gain.

3. Simulation to optimize sugar kelp breeding. Our simulation and statistical genetic efforts continue to help the sugar kelp breeding community develop efficient breeding tools for this organism with a bi-phasic lifecycle. Kelps have free-living diploid and haploid life stages. Selection is possible at both stages. We have simulated selection in this new setting to determine the most effective strategies to leverage this lifecycle, and we have begun to develop the statistical genetic tools needed, for example, to select haploid individuals on the basis of phenotypes from their diploid relatives.

4. Quantitative genetics of metabolomics. We continue to innovate in the analysis of metabolomics data using genome-wide analysis methods. The challenge of metabolomics is that the data are high-dimensional. Using oat as a model system, we have explored two approaches to address this dimensionality. First, we have focused on specific pathways within the overall metabolomics dataset. In the case of oat, avenanthramides have been our target of choice because of the importance of these metabolites both for oat disease resistance and for their human health properties. Second, we have broken metabolites down into classes, thereby reducing dimensionality. These two approaches have borne fruit in the form of improved prediction accuracy and a better understanding of changes in oat secondary metabolism over its breeding history.

5. Equity in plant breeding: a guide. An effort originating from our lab that we are very proud of is work to consider equity in plant breeding. Since the murder of George Floyd in May 2020, our lab has held regular meetings seeking to understand our role in moving the United States towards greater equity. One outcome of this work was an original article laying out a framework and exploration guide for breeders to consider the equity impacts of their work. We hope plant breeders will find this article inspiring and thought-provoking.

6. Improved database resources for plant breeding. Finally, we remain very engaged in the “Database for plant breeding” space, collaborating actively with other groups that are advancing software technology in this area. Collaborations have led to co-authorships on publications related to GrainGenes and Breedbase.


Review Publications
Augyte, S., Jannink, J., Mao, X., Huang, M., Robbins, K., Hare, M., Umanzor, S., Marty-Rivera, M., Li, Y., Yarish, C., Lindell, S., Bailey, D. 2020. Kelp, Saccharina spp, population genetics in New England, US, for guiding a breeding program of thermally resilient strains. Bulletin of Fisheries Research Agency. 50:135-139.
Wolfe, M.D., Jannink, J., Kantar, M.B., Santantonio, N. 2021. Multi-species genomics-enabled selection for improving agroecosystems across space and time. Frontiers in Plant Science. 12:665349. https://doi.org/10.3389/fpls.2021.665349.
Ozimati, A.A., Esuma, W., Alicai, T., Jannink, J., Egesi, C., Kawuki, R. 2021. Outlook of cassava brown streak disease assessment: Perspectives of the screening methods of breeders and pathologists. Frontiers in Plant Science. 12:648436. https://doi.org/10.3389/fpls.2021.648436.
Umanzor, S., Li, Y., Bailey, D., Augyte, S., Huang, M., Marty-Rivera, M., Jannink, J., Yarish, C., Lindell, S. 2021. Comparative analysis of morphometric traits of farmed sugar kelp and skinny kelp, Saccharina spp., strains from the Northwest Atlantic. Journal of the World Aquaculture Society. 2021;1-10. https://doi.org/10.1111/jwas.12783.
Lozano, R., Booth, G.T., Omar, B., Li, B., Buckler IV, E.S., Lis, J.T., Pino Del Carpio, D., Jannink, J. 2021. RNA polymerase mapping in plants identifies intergenic regulatory elements enriched in causal variants. G3, Genes/Genomes/Genetics. jkab273. https://doi.org/10.1093/g3journal/jkab273.
Campbell, M.T., Hu, H., Yeats, T.H., Caffe-Treml, M., Gutierrez, L., Smith, K.P., Sorrells, M.E., Gore, M.A., Jannink, J. 2021. Translating insights from the seed metabolome into improved prediction for lipid-composition traits in oat (Avena sativa L.). Genetics. 217(3):iyaa043. https://doi.org/10.1093/genetics/iyaa043.
Campbell, M.T., Hu, H., Yeats, T.H., Bzozowski, L.J., Caffe-Treml, M., Gutierrez, L., Smith, K.P., Sorrells, M.E., Gore, M.A., Jannink, J. 2021. Improving genomic prediction for seed quality traits in oat (Avena sativa L.) using trait-specific relationship matrices. Frontiers in Genetics. 12:643733. https://doi.org/10.3389/fgene.2021.643733.
Hu, H., Gutierrez-Gonzalez, J.J., Liu, X., Yeats, T.H., Garvin, D.F., Hoekenga, O.A., Sorrells, M.E., Gore, M.A., Jannink, J. 2019. Heritable temporal gene expression patterns correlate with metabolomic seed content in developing hexaploid oat seed. Plant Biotechnology Journal. 18:1211-1222. https://doi.org/10.1111/pbi.13286.
Yao, E., Blake, V.C., Cooper, L., Wight, C.P., Michel, S., Cagirici, H.B., Lazo, G.R., Birkett, C., Waring, D.J., Jannink, J., Holmes, I., Waters, A.J., Eickholt, D.P., Sen, T.Z. 2022. GrainGenes: A data-rich repository for small grains genetics and genomics. Database: The Journal of Biological Databases and Curation. 2022. Article baac034. https://doi.org/10.1093/database/baac034.
Morales, N., Ogbonna, A.C., Ellerbrock, B.J., Bauchet, G.J., Tantikanjana, T., Tecle, I.Y., Powell, A.F., Lyon, D., Naama, M., Simoes, C.C., Saha, S., Hosmani, P., Flores, M., Panitz, N., Preble, R.S., Agbona, A., Rabbi, I., Kulakow, P., Peteti, P., Kawuki, R., Esuma, W., Kanaabi, M., Chelagant, D.M., Uba, E., Olojede, A., Onyeka, J., Shah, T., Karanja, M., Egesi, C., Tufan, H., Paterne, A., Asfaw, A., Jannink, J., Wolfe, M., Birkett, C.L., Hershberger, J.M., Gore, M.A., Robbins, K.R., Rife, T., Chaney, C., Poland, J., Arnaud, E., Laporte, M., Waring, D.J., Brown, A., Bayo, S., Uwimana, B., Akech, V., Yencho, C., De Boeck, B., Campos, H., Swennen, R., Edwards, J., Mueller, L.A., Kulembeka, H., Salum, K., Mrema, E. 2022. Breedbase: a digital ecosystem for modern plant breeding. G3, Genes/Genomes/Genetics. https://doi.org/10.1093/g3journal/jkac078.