
Research Project: SoyBase and the Legume Clade Database

Location: Corn Insects and Crop Genetics Research

2020 Annual Report

Objective 1: Accelerate trait analyses, germplasm analyses, genetic studies, and breeding of soybean and other economically important legume crops through stewardship of genomes, genetic data, genotype data, and phenotype data.
Objective 2: Develop an infrastructure that enhances the integration of genotype and phenotype information and corresponding data sets with query and visualization tools to facilitate efficient plant breeding for soybean and select legume crops.
Objective 3: Collaborate with database developers and plant researchers to develop improved methods and mechanisms for open, standardized data and knowledge exchange to enhance database utility and interoperability.
Objective 4: Provide support and research coordination services for the soybean and other legume research and breeding communities; train new scientists and expand outreach activities through workshops, web-based tutorials, and other communications.

Incorporate revised primary reference genome sequence for soybean into SoyBase. House and provide access to genome sequences for other soybean accessions, haplotype data, and related annotations. Incorporate revised gene models and annotations into SoyBase. Install or implement web-based tools for curation and improvement of soybean gene models and gene annotations. Incorporate available legume genome sequences and annotations. Working with collaborators, collect and add genetic map and QTL data for crop legumes. Extend web-based tools for navigation among biological sequence data across the legumes. Extend and develop methods and storage capacity for accepting genomic data sets for soybean and other legume species. Develop a complete set of descriptors (ontologies) for soybean biology (anatomy, traits, and development), and for other significant crop legumes as needed. Work with the relevant ontology communities-of-practice to incorporate these descriptors into broadly accessible ontologies. Develop web tutorials for important typical uses of SoyBase and the Legume Clade Database. Present database features and provide training at relevant conferences and workshops. Regularly seek feedback from users about desired features and usability.

Progress Report
Work in the SoyBase and Legume Clade Database project in Ames, Iowa, focuses on development of two online databases: SoyBase and the Legume Information System (LIS). LIS development is a collaboration between ARS staff in Ames, Iowa, and developers at the National Center for Genome Resources (NCGR) in Santa Fe, New Mexico. Additionally, the National Science Foundation “Legume Federation” project (Federated Plant Database Initiative for the Legumes) contributed to several of the software and database development efforts, including the Genome Context Viewer (GCV) and the InterMine instances, described below. Software development is often shared by these databases to improve efficiency and the user experience of these resources. The report below first describes SoyBase and then LIS, noting software in common where appropriate.
SoyBase development in the last year included creation of the Expression Explorer tool to visualize gene expression in various soybean tissues. Four reference-quality soybean genome assemblies and their gene model annotations have been incorporated as SoyBase BLAST targets and in genome browsers. SoyBase continues to use LIS gene family annotations to link to similar gene sequences across many crop legume species, allowing users to leverage sequence data and phenotypic data from the multiple legume species available at LIS and SoyBase. The Genotype Comparison Visualization Tool (GCViT), for exploring genetic diversity data in a crop, has been integrated into SoyBase. This tool gives users graphical access to genetic variation data for the 20,000 unique soybean lines in the USDA soybean germplasm collection, as well as for other specialized collections of varieties and wild soybean accessions. A video tutorial on its use was produced and added to the SoyBase tutorial collection. SoyBase displays have incorporated links to SoyMine data at LIS to give users access to other legume homolog information.
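To make concrete the kind of genotype comparison that a tool such as GCViT presents graphically, the following is a minimal sketch that counts differing genotype calls between two accessions in fixed-size genomic windows. It is an illustration only and is not taken from the GCViT code base: the function name, the dictionary-based input format, the window size, and the toy genotype data are all assumptions made for this example.

```python
from collections import defaultdict


def windowed_differences(calls_a, calls_b, window_size=1_000_000):
    """Count differing genotype calls between two accessions per genomic window.

    calls_a, calls_b: dicts mapping (chromosome, position) -> genotype call.
    Only sites genotyped in both accessions are compared.
    Returns {(chromosome, window_start): (n_different, n_compared)}.
    This input layout is a hypothetical simplification, not a GCViT format.
    """
    counts = defaultdict(lambda: [0, 0])
    for site in calls_a.keys() & calls_b.keys():
        chrom, pos = site
        window_start = (pos // window_size) * window_size
        counts[(chrom, window_start)][1] += 1  # site compared in this window
        if calls_a[site] != calls_b[site]:
            counts[(chrom, window_start)][0] += 1  # calls differ at this site
    return {window: tuple(pair) for window, pair in counts.items()}


# Hypothetical toy genotype calls for two soybean lines.
line_a = {
    ("Gm01", 150_000): "A/A",
    ("Gm01", 900_000): "C/C",
    ("Gm01", 1_200_000): "G/G",
    ("Gm02", 50_000): "T/T",
}
line_b = {
    ("Gm01", 150_000): "A/A",
    ("Gm01", 900_000): "T/T",
    ("Gm01", 1_200_000): "G/G",
    ("Gm02", 50_000): "C/C",
    ("Gm02", 70_000): "A/A",  # not genotyped in line_a, so it is skipped
}

result = windowed_differences(line_a, line_b)
for window in sorted(result):
    print(window, result[window])
```

Windows with a high share of differing calls mark regions where the two lines diverge, which is the kind of signal a breeder would look for when tracking a genetic region of interest.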
An installation of GCV, displaying the soybean pan-gene collection for three G. max and two G. soja genomes, is being deployed at SoyBase to allow visualization of the gene complement of soybean. SoyBase maintains and improves Excel templates developed for collecting genome-wide association study (GWAS), quantitative trait locus (QTL), and gene data for entry by curators and researchers. Seven large genetic variation datasets for soybean have been collected and incorporated into GCViT for display and into the LIS datastore for distribution. SoyBase has moved forward with the adoption of the new CMap-js software (developed in this project) for the display of genetic maps. Full incorporation of genetic maps into CMap-js is planned for the coming year. Programmatic access to SoyBase data has been accomplished through the application programming interfaces (APIs) inherent in SoyMine. SoyMine implements the InterMine user interface for accessing SoyBase data.
A SoyBase tutorial session was held in conjunction with the 2020 Soybean Breeders Workshop, attended by soybean breeders in the US and Canada, highlighting the use of the new Expression Explorer tool for identifying candidate genes from GWAS and biparental QTL results. A SoyBase update was also presented at the meeting. A video tutorial explaining the Expression Explorer was produced and added to the SoyBase video tutorial collection. SoyBase and LIS are participating in designing GRIN-Global APIs to access GRIN soybean data.
SoyBase continues to be a central hub for the soybean research community, and the curator was called upon over 80 times since October 1, 2019, by users needing information or explanations of soybean genetics and genomics. In the same time frame, 289 scientific articles and 25 US patents used SoyBase data in their analyses or patent applications.
In that vein, SoyBase has formed a collaboration with the SoyGen2 project, funded by the North Central Regional Soybean Research Program, to collect the USDA-funded Northern Uniform Soybean Test data and create a searchable database of the results from the 2018 and later tests.
The GCViT tool, described above, has been implemented for common bean, chickpea, and peanut at LIS. The program enables users to visualize similarities and differences between varieties. This information can be used to track genetic regions of interest in breeding projects, or to identify accessions that may be unexpectedly divergent or similar. The program is also freely available for use by the public, so we expect it to be adopted by other projects and used by researchers and breeders working in many crops.
SoyBase and LIS have incorporated 24 major genetic variation data sets from published literature, as well as new genome sequences for soybean, pea, common bean, lupin, and cowpea, into the SoyBase and LegumeInfo Data Store. This curation work typically requires close evaluation of the data and reformatting it to conform with accepted standards. The work also requires careful description of the data, its source, and other characteristics. This work makes these important data sets more accessible to researchers and available for efficient machine access by web tools.
The project group has also worked out methods for combining, comparing, and analyzing genomic data from multiple lines within a species, or from collections of related species. This produces a pan-gene set, which is a collection of all genes for a set of related genomes, with corresponding genes (highly similar genes from similar genomic vicinities) grouped into families. These families of similar genes are useful for identifying differences that may be related to traits of interest.
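The grouping step behind a pan-gene set can be illustrated with a minimal sketch. It assumes that pairs of corresponding genes (highly similar and from similar genomic vicinities) have already been identified by some upstream comparison, and it merges those pairs into families with a union-find structure. This is a generic clustering illustration, not the project's published method; the function name, gene identifiers, and pair list are hypothetical.

```python
def build_pan_gene_families(genes, corresponding_pairs):
    """Group genes into families by merging every pair judged to correspond.

    genes: iterable of gene identifiers from all genomes being compared.
    corresponding_pairs: iterable of (gene_a, gene_b) pairs, assumed to come
        from an upstream similarity and synteny comparison (not shown here).
        Every gene in a pair must also appear in `genes`.
    Returns a sorted list of families, each a sorted list of gene identifiers.
    """
    genes = list(genes)
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Follow parent links to the family representative, shortening the
        # path along the way (path halving).
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for gene_a, gene_b in corresponding_pairs:
        root_a, root_b = find(gene_a), find(gene_b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for gene in genes:
        families.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in families.values())


# Hypothetical gene identifiers from two cultivated genomes and one wild genome.
all_genes = ["max1.g1", "max2.g1", "soja1.g1", "max1.g2", "soja1.g2", "max2.g3"]
pairs = [
    ("max1.g1", "max2.g1"),
    ("max2.g1", "soja1.g1"),
    ("max1.g2", "soja1.g2"),
]

pan_gene_families = build_pan_gene_families(all_genes, pairs)
for family in pan_gene_families:
    print(family)
```

In this toy input, a gene with no corresponding partner (max2.g3) remains in a family of its own, which is how genes present in only one genome would show up in such a grouping.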
The project has developed a method for robustly calculating pan-gene sets and updating them as new genomes become available. The project has also (through the LIS subordinate project with NCGR) developed a pan-gene viewer, the Genome Context Viewer (GCV), which is in place both at LIS, for viewing a broad collection of legume species, and at SoyBase, for viewing the genomes and genes of available soybean accessions (cultivated and wild). These methods and tools will help researchers track changes between cultivars within a species and among related species, and explore the genomic vicinities around genes of interest.
LIS development has also focused on incorporating legume genetic data, for both soybean and other legume crops, into the InterMine web software. InterMine is an open-source data warehouse built for the integration and analysis of complex biological data. Originally developed in the early 2000s to handle genetic data for a model fly species, it has since been used to house data for several dozen species. InterMine instances have been generated for common bean, chickpea, cowpea, peanut, soybean, and Medicago truncatula (an alfalfa relative). These “legume Mines” enable users to compose powerful queries to help identify genes that underlie traits important for crop improvement. The InterMine instances provide both alternative interfaces and tools for accessing the data at SoyBase and LIS, and a regularized means for other programs and websites to access the information through application programming interfaces. These advantages support their inclusion in the current Project Plan for the SoyBase and Legume Clade Database project.

1. White paper describing needs and priorities for legume genetic data. ARS researchers in Ames, Iowa, worked with a group of two dozen international collaborators to publish a white paper, "The future of legume genetic data resources: challenges, opportunities, and priorities" (LGDWG, 2019). The white paper presents the conclusions of an international working group of legume researchers about objectives and methods in legume genomic science. The working group consisted of researchers who were convened by this ARS project and by the associated National Science Foundation Legume Federation project in March 2019 to evaluate the needs for data management and methodologies, in order to best use this information for crop improvement and for basic research in this important group of species. The workshop identified the following needs and recommendations:
(a) Develop strategies to effectively store, integrate, and relate genetic resources collected in different projects.
(b) Leverage information collected across many legume species by standardizing data formats and terms, improving the state of information about datasets, and increasing use of the FAIR data principles (FAIR: data should be Findable, Accessible, Interoperable, Reusable).
(c) Advocate for the critical role that curators exercise in integrating complex datasets into databases and adding high-value information that enables downstream analytics and facilitates practical applications.
(d) Implement standardized software and database development practices to best leverage limited developer time and expertise gained from the various legume (and other) species.
(e) Develop tools and databases that can manage genetic information for the world's plant genetic resources, enabling efficient incorporation of important traits into breeding programs.
(f) Centralize information on databases, tools, and training materials and establish funding streams to support training and outreach.
This white paper is expected to help guide and inform research priorities and activities for the coming decade.

2. Genome sequence assemblies for three soybean accessions. ARS researchers in Ames, Iowa, and Beltsville, Maryland, along with other soybean researchers in the U.S. and internationally, published the genome sequences for three accessions of soybean. Plant breeders and scientists work to identify which genes are responsible for important traits (yield, nutrition, etc.) and where these genes are located within the DNA of the species of interest. Sequencing a species’ genome, down to the level of individual DNA bases, helps researchers link genes with traits. The soybean genome sequence, developed from one variety, has been available for the last ten years and has enabled many discoveries about gene function. However, more rapid progress could be made if multiple genome sequences, for distinct soybean varieties, could be examined to see how DNA changes alter particular traits. The work reported here describes the complete, high-resolution sequence of approximately one billion DNA bases for two widely used soybean cultivars and for one wild soybean (Glycine soja). These assemblies and annotations have been integrated into SoyBase, making them available to researchers for browsing and searching. Having the genome sequences for these two widely used soybean cultivars will help identify genes that control traits such as early maturity. The genome sequence for the wild soybean accession will help researchers determine how changes occurred during domestication, and find genes that may not have been transferred from wild to cultivated soybean. This work will help breeders and other scientists develop improved soybean varieties more rapidly, to benefit farmers and consumers worldwide.

Review Publications
Bauchet, G., Bett, K.E., Cameron, C.T., Campbell, J.D., Cannon, E., Cannon, S.B., Carlson, J., Chan, A., Cleary, A., Close, T., Cook, D., Cooksey, A., Coyne, C.J., Dash, S., Dickstein, R., Farmer, A., Fernandez-Baca, D., Hokin, S., Jones, E., Kang, Y., Monteros, M., Munoz-Amatriain, M., Mysore, K., Pislariu, C., Richards, C.M., Shi, A., Town, C., Udvardi, M., Wettberg, E., Young, N., Zhao, P. 2019. The future of legume genetic data resources: Challenges, opportunities, and priorities. Legume Science. 1(1):e16.
Valliyodan, B., Cannon, S.B., Bayer, P.E., Shu, S., Brown, A.V., Ren, L., Jenkins, J., Chung, C.Y.L., Chan, T.F., Daum, C.G., Plott, C., Hastie, A., Baruch, K., Barry, K.W., Huang, W., Gunvant, P., Varshney, R.K., Hu, H., Batley, J., Yuan, Y., Song, Q., Stupar, R.M., Goodstein, D.M., Stacey, G., Lam, H.M., Jackson, S.A., Schmutz, J., Grimwood, J., Edwards, D., Nguyen, H.T. 2019. Construction and comparison of three new reference-quality genome assemblies for soybean. Plant Journal. 100(5):1066-1082.