Research Project: SoyBase and the Legume Information System - Information Infrastructure and Research for Legume Crop Improvement

Location: Corn Insects and Crop Genetics Research

2024 Annual Report


Objectives
Objective 1: Provide stewardship of soybean and other major legume genetic, genomic, and phenotypic datasets, to support research and crop improvement. Enable efficient curation of the high volume of data from research communities. Structure and store the data to allow open, standardized data exchange mechanisms to enhance database interoperability and collaboration and facilitate research discoveries through integration and comparison.
Objective 2: Provide analysis and visualization tools to utilize the range of available legume data, integrating genotype and phenotype information to facilitate research and breeding work for legume crops. Provide resources and tools that enable exploration of diversity, variation, and phenotype data, in the context of genomic coordinates, and across evolutionary timespans.
Objective 3: Serve as an organizing center for legume breeding information by collecting and storing phenotypic values for germplasm in USDA and state uniform variety trials.
Objective 4: Support crop improvement by collecting and curating information about genes that underlie important agronomic, nutritional, and stress-response traits. Enable translation of genetic information among different crop species.
Objective 5: Provide community support and research coordination services for the research and breeding communities for soybean and other legumes, and expand outreach activities through workshops, web-based tutorials, and other communications.


Approach
Modern crop improvement methods make extensive use of genetic and genomic information, for example by using genetic markers for marker-assisted or genomic selection, or by using predicted genes and diversity data to identify the genetic basis for important agronomic traits. The SoyBase and the Legume Information System project will collect and curate major genetic and genomic data sets for species in the legume family and prepare and store these data in structured formats to support this type of research and improvement in legume crops (Objective 1). The project will also develop and deploy analysis and visualization tools to provide useful access to the assembled genetic data. The data collected under Objective 1 will be stored to enable programmatic access for software systems maintained by the project, allowing users of the soybase.org and legumeinfo.org resources to access and investigate the information in intuitive ways (Objective 2). The project will also collect, organize, and provide soybean trait (phenotype) information from USDA and state variety trials (Objective 3). To help researchers and breeders more efficiently select for desired traits, the project will collect and incorporate published information about genes with established associations with traits of interest (Objective 4). Lastly, the project will continue to provide support to research communities for soybean and other legumes, through support of workshops, online tutorials, and other communication methods (Objective 5).


Progress Report
The SoyBase and Legume Clade Database project in Ames, Iowa, focuses on development of two online databases: SoyBase (soybase.org) and the Legume Information System (LIS) (legumeinfo.org). LIS development is a collaboration between ARS staff in Ames, Iowa, and developers at the National Center for Genome Resources (NCGR) in Santa Fe, New Mexico. Software development is shared by these database projects to improve efficiency and the user experience. SoyBase and LIS are used by plant breeders and researchers throughout the United States and worldwide, as a source of information about the genetic basis for traits important in legume crops. The legume family includes crops such as soybean, chickpea, beans of various types, lentils, peanuts, alfalfa, and many others. By identifying the genetic elements responsible for traits such as flowering time, growth habit, or disease resistance, plant breeders can more directly target and select for desired features.

The goal of Objective 1 is to provide stewardship for major genetic and genomic data sets. In support of Objective 1, SoyBase and LIS have incorporated 13 new genome assemblies and associated annotation sets for various legume species (four for soybean and nine for other legumes). These are available via the common Data Store that is used by SoyBase and LIS (https://data.legumeinfo.org). Considering data collections of all types (genetic studies, genome assemblies, diversity data sets, etc.), the number of collections in the Data Store has grown by 150 in the last year, increasing from 803 to 953. This curation work involves close evaluation of the data, often with corrections to bring the data into conformity with accepted standards. This work makes these important data sets more accessible to researchers and to other online biological projects. SoyBase holds 57 soybean genome assemblies, including those of wild relatives, and LIS holds more than 100 genome assemblies for other crop and model legumes.
These are integrated with other resources such as genetic markers, information about gene function and expression, and genotype information about many thousands of accessions in the U.S. National Plant Germplasm System. Because LIS and SoyBase have curated genetic and genomic data for Glycine max (soybean) and other legume species, these databases are heavily utilized by researchers and breeders worldwide. In the last calendar year, SoyBase was cited in 235 scientific articles and 10 U.S. patents and LIS data was cited in at least 134 articles, indicating the scientific and commercial utility of these databases and resources.

Objective 2 focuses on developing and maintaining the computational infrastructure needed to house, integrate, and provide access to genetic and genomic data. The major focus in both the SoyBase and LIS online databases over the last fiscal year has been on deployment of these websites and databases using new web and database technologies, to accommodate the rapidly increasing pace of genomic data generation. Goals of the database and website redevelopment include: greater modularity of code, to allow for rapid changes and additions to particular sections of the databases and sites; increased use of software container technologies, to facilitate portable software development environments, reproducible and secure production deployments, and easier adoption of project-developed applications by the community; updated web development languages and frameworks, to give more responsive user experiences; and updated database schemas, to allow for increased data scaling and to improve database integrity. The transition to the new web framework is mostly complete for LIS. The transition is well underway for SoyBase, with a development instance of the new website publicly available for review by collaborators.
The project team plans to replace the “legacy” SoyBase website with the development instance later in calendar year 2024, after which the legacy website will remain available in an archive state. In addition, we have added and updated several online tools, allowing sequence-based comparison searches, annotation of user-submitted sequences, comparisons of gene order and composition between related species, and comparison of marker-trait associations between legume species. To accommodate increasing numbers of genome assemblies and predicted genes, we have developed software (Pandagma, https://github.com/legumeinfo/pandagma) to identify corresponding gene sets (sometimes called pan-genes) and representative genes across any number of assemblies and collections of predicted genes. This software is in use at SoyBase, LIS, and MaizeGDB to calculate pan-gene sets for species available in these databases.

In support of Objective 3, SoyBase has served as an organizing center for legume breeding information by collecting and storing phenotypic values for germplasm in USDA and state uniform variety trials. Results of the 2023 Northern Uniform Soybean Test (NUST) were collected and added to the SoyBase database. This database is composed of the results of trials from 1989 to the present. These data document the progress of breeding activities by public breeders from Missouri and the upper Midwest. These data can also be used as training and testing data for machine learning (ML) programs to predict the performance of future testing strains. The ability to predict the performance of progeny from a genetic cross will greatly speed up the creation of soybean lines resistant to both biotic and abiotic stresses, such as insect attack and changes in precipitation. Since most Midwest growers do not irrigate their fields and depend on rain to grow their crops, drought resistance is a priority.
Additionally, 519 new strain pedigrees have been added to the SoyBase Soybean Parentage database. These data document the breeding history of many of the improved soybean strains released by Midwest public breeders since 1943 and allow researchers to easily reconstruct family relationships in the Midwestern soybean germplasm. Family relationships can confound many advanced genetic analyses of soybean, such as genome-wide association studies (GWAS), which are used to identify genes responsible for plant traits of interest. The ability to easily search for these family relationships at a central site will improve the accuracy of GWAS analyses.

Objective 4 supports crop improvement by collecting and incorporating information about genes that underlie important agronomic, nutritional, and stress-response traits. Work on this objective requires collecting information about experimentally supported gene functions in crop and model legumes. In the past fiscal year, more than 150 such genes and associated studies have been collected, and a schema for incorporating this information into the database has been designed. The data are available at the SoyBase and LIS Data Store but are not yet fully integrated into the web interfaces.

Outreach related to research and community support included responses to over 35 requests for information/data from the SoyBase database. The project group participated in several working groups of the AgBioData Research Coordination Network (RCN), including working groups for Generic Feature Format (GFF3) Specification, Ontology, Data Federation, Data Standards for Genetic Variation, scRNAseq Biocuration, Pan-Genome and Scientific Literature Biocuration. Other research community activities include membership on the Soybean Genetics Committee, Plant Cell Atlas Working Group, and the REE Data Stewards Community of Practice. SoyBase personnel presented workshops on the use of the database to help identify genes responsible for traits of interest.
These workshops will facilitate the identification of genes responsible for important crop traits and provide information to design alleles to improve resistance to biotic and abiotic stress. SoyBase personnel conducted a short survey during a meeting with stakeholders to obtain more information about future data curation. Outreach to communicate project research included three manuscripts published in FY24: one describing the use of SoyBase and LIS to identify agronomically important traits that correspond between legume crop species, one describing an important peanut variety in the United States, and one outlining challenges and opportunities in translating from genomic information to crop improvement.


Accomplishments
1. Addition of new genome assemblies and annotations into the USDA-ARS Soybean Genetics and Genomics database (SoyBase) and the Legume Information System (LIS). Genome sequences describe the order and content of the DNA in all the chromosomes of an organism and serve as a common framework for much of the work done by breeders and other researchers. The framework identifies genes, genetic markers, and traits and their chromosomal locations. SoyBase (SoyBase.org) currently holds genome assemblies and annotations for more than 50 soybean varieties and for six wild relatives of soybean. LIS (LegumeInfo.org) holds genome assemblies and annotations for more than 40 other legume species, including food and forage crops, model species, and other species of interest for timber or other uses. In the last year, ARS researchers in Ames, Iowa, have incorporated genome assemblies and annotations for four soybean accessions into SoyBase and for nine other legume species into LIS, including wild peanut (Arachis stenosperma), three tree species (Cercis, Phanera, Acacia), three forage species (Vicia, Trifolium, Medicago), and two cultivated bean species (scarlet runner bean and common bean). These data will be used by plant breeders and biologists working to understand the genes involved in adaptation to various physical environments, such as tolerance to increased heat, drought, or salinization. Breeding and research on legume crops impact people worldwide, as legumes provide protein and other nutrients for a large portion of the global population.

2. Identifying genes that affect flowering time in mung bean. Controlling flowering time is crucial to optimize plant growth in crops. Plants that flower too early or too late exhibit reduced yield. Mung bean (Vigna radiata (L.) Wilczek) is an important crop worldwide and is gaining popularity in the U.S., but relatively little is known about the genetics of this species. ARS researchers in Ames, Iowa, and Iowa State University collaborators grew 482 diverse mung bean accessions in Boone, Iowa, over two years and recorded the onset of flowering for each accession. These data were used to conduct a genome-wide association study examining days to flowering, which identified two genetic markers that account for 25% of the differences in flowering time. The corresponding genes are similar to known flowering genes from soybean and other species: the gene E3, which delays flowering in soybean, and the gene FERONIA, which regulates flowering time in Arabidopsis. Thirteen copies of FERONIA were found near one of the flowering markers in the genome. Four other genes known to be important in regulating flowering time in Arabidopsis were also found near the two markers in the genome. This information may be used by plant breeders to develop new varieties of mung bean and related crops that are better adapted to northern growing conditions.

3. Contributions of wild peanut to one of the primary peanut varieties grown in the United States. The Bailey II peanut variety, used for food such as trail mix and in-shell products, is a Virginia-type peanut, the second largest market class of peanut cultivated in the United States. Virginia-type peanut varieties were developed using wild peanut relatives as strategic sources of genetic diversity for breeding. ARS researchers in Raleigh, North Carolina, Ames, Iowa, and Stoneville, Mississippi, created the first high-quality genome assembly for Bailey II. The genome demonstrates that multiple genomic regions in Bailey II were introduced from wild peanut through genetic crosses carried out in the late 1960s. Some of the wild peanut incorporations found in this study provide resistance to root knot nematode and to early and late leaf spot diseases. This work will be used by plant breeders to more efficiently develop new varieties that provide improved food security and economic and environmental benefits.

4. Species splits and mergers contribute to the origin of the legume plant family. The legumes are the third largest plant family, with more than 20,000 species. Those species include oil seeds such as soybean and peanut, forage crops such as clover and alfalfa, and tree crops such as tamarind and carob. Determining the origins and diversification patterns of this family is important for understanding the relationships among the diverse species within the family and how agriculturally important characteristics have evolved. ARS researchers in Ames, Iowa, have described the order and timing of early divisions of the legume family into six subfamilies, as well as the relative timing of genome doubling events that occurred in the same timeframe. A key conclusion is that at least one of the subfamilies (containing species such as honey locust and Kentucky coffee tree) resulted from a merger of two early legume species that had likely diverged for several million years. This kind of merger of distinct species, called allopolyploidy, may help explain some of the unusual diversity seen in this subfamily. This basic knowledge is important for understanding other allopolyploid crop species such as soybean, peanut, cotton, canola, and wheat.


Review Publications
Newman, C.S., Andres, R.J., Youngblood, R.C., Campbell, J.D., Simpson, S.A., Cannon, S.B., Scheffler, B.E., Oakley, A.T., Hulse-Kemp, A.M., Dunne, J.C. 2023. Initiation of genomics-assisted breeding in Virginia-type peanuts through the generation of a de novo reference genome and informative markers. Frontiers in Plant Science. 13. Article 1073542. https://doi.org/10.3389/fpls.2022.1073542.
Tuggle, C.K., Clarke, J.L., Murdoch, B.M., Lyons, E., Scott, N.M., Mckay, S., Lipka, A., Fulton, J., Hess, A., Lubberstedt, T., Fragomeni, B., Rowan, T., Mccarthy, F., Guadagno, C., Goddard, E., Das Choudhury, S., Sheehan, M., Kramer, L., Feldman, M.J., Daigle, C., Steibel, J.P., Benes, B., Murray, S., Riggs, P., Thompson, A., Hagen, D., Thornton-Kurth, K., Van Tassell, C.P., Campbell, J.D., Dorea, J., Chung, H., Dekkers, J.C., Ertl, D., Lawrence-Dill, C.A., Schnable, P.S. 2024. Current challenges and future of agricultural genomes to phenomes in the USA. Genome Biology. 25:8. https://doi.org/10.1186/s13059-023-03155-w.
Chiteri, K.O., Rairdin, A., Sandu, K., Redsun, S., Farmer, A., O'Rourke, J.A., Cannon, S.B., Singh, A. 2024. Combining GWAS and comparative genomics to fine map candidate genes for days to flowering in mung bean. BMC Genomics. 25. Article 270. https://doi.org/10.1186/s12864-024-10156-x.