
Research Project: SoyBase and the Legume Clade Database

Location: Corn Insects and Crop Genetics Research

2021 Annual Report

Objective 1: Accelerate trait analyses, germplasm analyses, genetic studies, and breeding of soybean and other economically important legume crops through stewardship of genomes, genetic data, genotype data, and phenotype data. Objective 2: Develop an infrastructure that enhances the integration of genotype and phenotype information and corresponding data sets with query and visualization tools to facilitate efficient plant breeding for soybean and select legume crops. Objective 3: Collaborate with database developers and plant researchers to develop improved methods and mechanisms for open, standardized data and knowledge exchange to enhance database utility and interoperability. Objective 4: Provide support and research coordination services for the soybean and other legume research and breeding communities; train new scientists and expand outreach activities through workshops, web-based tutorials, and other communications.

Incorporate revised primary reference genome sequence for soybean into SoyBase. House and provide access to genome sequences for other soybean accessions, haplotype data, and related annotations. Incorporate revised gene models and annotations into SoyBase. Install or implement web-based tools for curation and improvement of soybean gene models and gene annotations. Incorporate available legume genome sequences and annotations. Working with collaborators, collect and add genetic map and QTL data for crop legumes. Extend web-based tools for navigation among biological sequence data across the legumes. Extend and develop methods and storage capacity for accepting genomic data sets for soybean and other legume species. Develop a complete set of descriptors (ontologies) for soybean biology (anatomy, traits, and development), and for other significant crop legumes as needed. Work with the relevant ontology communities-of-practice to incorporate these descriptors into broadly accessible ontologies. Develop web tutorials for important typical uses of SoyBase and the Legume Clade Database. Present and train about features at relevant conferences and workshops. Regularly seek feedback from users about desired features and usability.

Progress Report
Work in the SoyBase and Legume Clade Database project in Ames, Iowa, focuses on the development of two online databases: SoyBase and the Legume Information System (LIS). LIS development is a collaboration between ARS staff in Ames, Iowa, and developers at the National Center for Genome Resources (NCGR) in Santa Fe, New Mexico. Software development is often shared between these databases to improve efficiency and the user experience of both resources. The essence of Objective 1 is to provide stewardship for major genetic and genomic data sets, while Objective 2 focuses on developing and maintaining the computational infrastructure needed to house, integrate, and provide access to genetic and genomic data.
The major focus in both the SoyBase and LIS online databases over the last fiscal year has been on preparation for fundamental redevelopment of these websites and databases, to accommodate the rapidly increasing pace of genomic data generation. Site and database redevelopment will enable both sub-projects to make use of new technologies that offer improved capacities for handling large and numerous new genomic data sets. Goals of the database and website redevelopment include: greater modularity of code, to allow for rapid changes and additions to particular sections of the databases and sites; increased use of computational container technologies, to make the websites less dependent on particular hardware configurations; updated web development languages and frameworks, to give more responsive user experiences; and updated database schemas, to allow for increased data scaling and to improve database integrity. For both SoyBase and LIS, technologies have been selected and tested for all main components: database, software containerization, and website front-end. Development prototypes have been built and are being tested for both web/database projects.
We anticipate that full implementation will span much of the coming fiscal year. In the meantime, the project has maintained and extended the current public-facing SoyBase and LIS websites. Below, we describe additions and improvements from the current year.
In support of Objective 1, SoyBase and LIS have incorporated 27 genetic variation data sets from published literature, as well as 10 new genome assemblies for soybean and 15 new genome assemblies for eight other legume species. These are available via the common Data Store that is used by SoyBase and LIS. This curation work involves close evaluation of the data, often with corrections to bring the data into conformity with accepted standards. This work makes these important data sets more accessible to researchers and to other online biological projects. SoyBase and LIS are used by plant breeders and researchers throughout the U.S. and worldwide as a source of information about the genetic basis for traits important in legume crops. The legume family includes crops such as soybean, chickpea, beans of various types, lentils, peanuts, alfalfa, and many others. By identifying the genetic elements responsible for traits such as flowering time, growth habit, or disease resistance, plant breeders can more directly target and select for desired features. SoyBase holds 17 reference-quality soybean genome assemblies, and LIS holds 63 genome assemblies for other crop and model legumes. These are integrated with other resources such as genetic markers, information about gene function and expression, and genotype information about many thousands of accessions in the U.S. National Plant Germplasm System. Because LIS and SoyBase have curated genetic and genomic data for Glycine max (soybean) and other legume species, these databases are heavily used by researchers and breeders worldwide. In the last calendar year, SoyBase was cited in 392 scientific articles and 12 U.S. patents, and LIS data were cited in 107 articles and one U.S. patent, indicating the scientific and commercial utility of the databases' data.
In support of Objective 2, we have added a new online tool, the Genotype Comparison Visualization Tool (GCViT), to enable users to interactively explore these large variation data sets, for example to identify genomic regions that a set of varieties or accessions have in common or in which they differ. GCViT was developed in the ARS project and is described in a recent publication of the group (Wilkey et al., 2020). The tool provides users with graphical access to genetic variation data for the 20,000 unique soybean lines in the USDA soybean germplasm collection, as well as for other specialized collections of varieties and wild soybean accessions. At LIS, the GCViT tool has been implemented for common bean, chickpea, and peanut. A video tutorial on its use was produced and added to the SoyBase tutorial collection. The project has completed a new genetic map viewer (CMap-js), replacing the venerable CMap software that is now about 20 years old. A task planned for the coming year is to port genetic maps at SoyBase and LIS into CMap-js. Both SoyBase and LIS now include a sophisticated pan-genome viewer called Genome Context Viewer (GCV). GCV was developed as part of the LIS project. It permits users to indicate a gene or region of interest, and to see corresponding regions from whatever accessions or species are included in the GCV instance. At SoyBase, this includes 28 sequenced soybean genomes; at LIS, it includes 18 legume species. SoyBase development in the last year included extension of the Expression Explorer tool to visualize gene expression in various soybean tissues. The expression information spans a wide range of tissues and developmental conditions, coming from two large gene expression atlases.
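The kind of comparison that a genotype visualization tool such as GCViT supports can be illustrated with a small sketch. This is not GCViT's implementation; it is a minimal example, assuming genotype calls are stored as a mapping from (chromosome, position) to an allele string, that counts how many calls differ between two lines within fixed-size genomic windows. The function name, window size, and toy data are hypothetical.

```python
from collections import defaultdict


def window_differences(reference, comparison, window_size):
    """Count differing genotype calls per genomic window.

    Both inputs map (chromosome, position) -> allele call (or None if missing).
    Returns {(chromosome, window_start): (n_differences, n_compared)}.
    """
    counts = defaultdict(lambda: [0, 0])  # [differences, sites compared]
    for (chrom, pos), ref_call in reference.items():
        cmp_call = comparison.get((chrom, pos))
        if ref_call is None or cmp_call is None:
            continue  # skip sites with a missing call in either line
        start = (pos // window_size) * window_size
        bucket = counts[(chrom, start)]
        bucket[1] += 1
        if ref_call != cmp_call:
            bucket[0] += 1
    return {key: tuple(value) for key, value in sorted(counts.items())}


# Toy data (hypothetical positions and calls, for illustration only).
line_a = {("Gm01", 100): "A", ("Gm01", 900): "T",
          ("Gm01", 1500): "G", ("Gm02", 200): "C"}
line_b = {("Gm01", 100): "A", ("Gm01", 900): "C",
          ("Gm01", 1500): "G", ("Gm02", 200): None}

result = window_differences(line_a, line_b, window_size=1000)
for (chrom, start), (n_diff, n_sites) in result.items():
    print(f"{chrom}:{start}-{start + 999}\t{n_diff} of {n_sites} calls differ")
```

Windows with few differences suggest regions the two lines share, while windows with many differences mark regions where they diverge; a graphical tool would plot these counts along each chromosome.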
In support of Objective 3, the SoyBase and LIS projects have worked to provide and use more efficient machine access to data. Use of well-defined, stable, web-accessible "application programming interfaces" (APIs) permits websites to query one another and to access and use particular data sets held at those sites. For example, SoyBase and LIS use germplasm (variety) data from GRIN-Global (GRIN: the USDA's Germplasm Resources Information Network) to access trait and collection-location data. This information is used for displaying the collection locations of GRIN germplasm on an interactive geographic information system map. The APIs are also used internally, for efficient and stable access to data within the SoyBase and LIS projects. An example of this kind of access is the use of the InterMine instances for soybean and other legume species, maintained as part of LIS, to access particular genomic data, e.g., genes, genomic sequences from selected regions, or genomic variants from regions and accessions.
Also related to Objective 3, the SoyBase and LIS projects have increased their use of the InterMine technology. InterMine is a "data warehouse" technology and interface, initially developed to house genetic data for model organisms such as fly, nematode, and mouse, but now used by many genetic and genomic database projects worldwide. InterMine allows users to construct queries involving essentially any set of genetic or genomic features, and to quickly query and get reports about features such as genes and genetic associations. As part of LIS, InterMine instances have been developed for eight species (including soybean).
At SoyBase, results of the 2020 Northern Uniform Soybean Tests (NUST) were collected and added to the SoyBase database. A new user interface to access the database was also constructed and is in beta-testing with NUST participants.
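As an illustration of how collection-location data retrieved through an API might be prepared for display on an interactive map, the sketch below converts a list of germplasm records into a GeoJSON FeatureCollection. It is a minimal example under stated assumptions: the record fields (`accession`, `latitude`, `longitude`), the accession identifiers, and the sample payload are hypothetical and do not represent the actual GRIN-Global response format.

```python
import json


def accessions_to_geojson(records):
    """Convert germplasm records to a GeoJSON FeatureCollection of points.

    Records without coordinates are skipped. GeoJSON orders point
    coordinates as [longitude, latitude].
    """
    features = []
    for rec in records:
        lat, lon = rec.get("latitude"), rec.get("longitude")
        if lat is None or lon is None:
            continue  # no collection location recorded for this accession
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"accession": rec["accession"]},
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical API payload (placeholder identifiers and coordinates).
sample_response = json.loads(
    '[{"accession": "EXAMPLE-1", "latitude": 42.03, "longitude": -93.62},'
    ' {"accession": "EXAMPLE-2", "latitude": null, "longitude": null}]'
)

collection = accessions_to_geojson(sample_response)
print(json.dumps(collection, indent=2))
```

A FeatureCollection of this form can be loaded directly by most web mapping libraries, which is one reason a stable, documented API response format simplifies sharing data between sites.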
Objective 4, focusing on research community support and training, was met through outreach efforts throughout the year. A SoyBase tutorial session was offered in conjunction with the 2020 Soybean Breeders Workshop, which was attended by soybean breeders from the U.S. and Canada. The tutorial session highlighted the use of the new Expression Explorer tool for identifying candidate genes underlying GWAS and biparental QTLs. A video tutorial explaining the Expression Explorer was produced and added to the SoyBase video tutorial collection. Manuscripts were published that describe new features of SoyBase (Brown et al., LOG NO. 378108) and LIS (Berendzen et al., LOG NO. 379729). A manuscript was also published that describes the new online tool GCViT, available at both SoyBase and LIS, which enables users to explore genetic variant data held in both database projects (Wilkey et al., LOG NO. 376417).

1. Incorporation of legume genomes into the Legume Information System. Genome sequences describe the order and content of the DNA in all of the chromosomes of an organism. A genome sequence provides a “road map” for the organism, and serves as a common framework or backbone for much of the work done by breeders and other researchers. This map or backbone provides the coordinates for identifying genes, genetic markers, and traits that can be mapped to chromosomal locations. Thus, the task of collecting and cataloging genome assemblies is part of the critical infrastructure for modern plant breeding and biology. ARS researchers in Ames, Iowa, have collected 15 new full genome assemblies, across eight legume species (not including soybean), and incorporated these into the Legume Information System (LIS) Data Store. This provides researchers with a single location to find genomic data that is often dispersed across the internet. These genome assemblies are for a broad collection of legumes, including cowpea, pea, common bean, alfalfa, and several wild relatives of peanut. These data will be of interest both to plant breeders, and to biologists working to understand the genes involved in adaptation to the various physical environments that these species occupy. Given a dense collection of genetic markers with known locations across a genome, researchers are able to identify corresponding traits. This information may be used, for example, to identify genetic markers for traits such as tolerance to increased heat, drought, or salinization. Breeding and research on legume crops impact people worldwide, as legumes provide protein and other nutrients for a large portion of the global population.

2. Incorporation of ten new soybean genome assemblies into SoyBase. Soybean varieties have been adapted to many environments, ranging from short-season varieties suited to Canada, to long-season southern varieties suited to Brazil; and also adapted to many uses, ranging from varieties used for oilseed applications, to varieties used for fresh vegetable (edamame) production. Researchers are gradually learning the genetic basis for the many trait differences among soybean varieties. Genome sequences, which represent the sequence of DNA letters in an organism, provide an important means for determining the genetic basis of traits. ARS researchers in Ames, Iowa, have incorporated 10 new soybean genomes, and the associated predicted genes, into the SoyBase and Legume Information System (LIS) Data Store. These data sets have been incorporated into SoyBase to allow detailed analysis and comparisons. The new assemblies and gene annotations include a wild soybean variety, as well as the widely used cultivar Fiskeby. The wild soybean genome will be useful for understanding soybean domestication and improvement. The cultivar Fiskeby is an important northern-adapted soybean variety, used in breeding programs for its superior tolerance to multiple environmental stresses. Functional annotation of Fiskeby gene models and display with other SoyBase data may facilitate the identification of disease resistance genes and molecular markers that will aid breeders in developing new early-flowering, stress-tolerant soybean lines.

3. Incorporation of the 2020 soybean variety trial data (Northern Uniform Soybean Tests) into SoyBase. For all major crops, variety trials are used to determine which new varieties are most suited to a particular region or to meeting particular grower and consumer objectives. Traits that are typically assessed in soybean variety trials include yield, tolerance of adverse field conditions such as nutrient deficiencies or pathogens, seed characteristics such as protein and oil concentration and quality, and growth and harvest characteristics such as germination rate and plant architecture at harvest. ARS researchers in Ames, Iowa, have added soybean phenotypic data for 582 testing strains submitted to the Northern Uniform Soybean Tests (NUST). Additionally, parentage information for those strains was added to the SoyBase Soybean Parentage Database. Incorporating phenotypic data on these strains into SoyBase gives breeders access to performance data for testing strains from 1989 to the present. This will allow breeders to easily see the results of breeding activity across programs, to evaluate any increase in grain yield and other seed quality measurements, and to incorporate strains with superior genetics into their breeding programs.

4. Incorporation of two new datasets into the Expression Explorer Tool in SoyBase. Plant breeders and other researchers use information about gene expression to understand the functional roles of genes in plant development. Genes, which serve as the information source for a cell to make proteins (the building blocks of a cell), may be turned on or off during various life stages of a plant, or in response to particular environmental conditions or challenges. ARS researchers in Ames, Iowa, have incorporated two new gene expression atlases into a gene expression explorer tool at SoyBase. This tool provides researchers with a visual representation of each gene in soybean, showing when and at what levels each gene is “expressed” (turned on) in a plant. The tool was developed as part of a collaboration with the University of Toronto, Bio-analytic Resource for Plant Biology (UBAR) to graphically display gene atlas data along with tabular and graphical displays of other soybean gene expression studies. This allows soybean researchers to view gene expression levels in various tissues and at various developmental time points, including several time points in soybean seed development. These data can be used to evaluate candidate genes identified through other methods. This information can be used by breeders to produce improved soybean varieties.

Review Publications
Valliyodan, B., Brown, A.V., Wang, J., Patil, G., Liu, Y., Otyama, P.I., Nelson, R., Vuong, T., Song, Q., Musket, T.A., Wagner, R., Marri, P., Reddy, S., Sessions, A., Wu, X., Grant, D.M., Bayer, P., Roorkiwal, M., Varshney, R.K., Liu, X., Edwards, D., Xu, D., Joshi, T., Cannon, S.B., Nguyen, H.T. 2020. Genetic variation among 481 diverse soybean accessions, inferred from genomic re-sequencing. Scientific Data. 8. Article 50.
Stai, J.S., Von Wettberg, E.B, Smykal, P., Cannon, S.B. 2020. Which came first: the tuber or the vine? A taxonomic overview of underground storage in the legumes. Legume Perspectives. (19):5-7.
Kalberer, S.R., Belamkar, V., Singh, J., Cannon, S.B. 2020. Apios americana: natural history and ethnobotany. Legume Perspectives. (19):29-32.
Wilkey, A., Brown, A.V., Cannon, S.B., Cannon, E.K. 2020. GCViT: a method for interactive, genome-wide visualization of resequencing and SNP array data. Biomed Central (BMC) Genomics. 21. Article 822.
Brown, A.V., Connors, S., Huang, W., Wilkey, A., Grant, D.M., Weeks, N.T., Cannon, S.B., Graham, M.A., Nelson, R. 2020. A new decade and new data at SoyBase, the USDA-ARS soybean genetics and genomics database. Nucleic Acids Research. 49(D1):D1496-D1501.
Berendzen, J., Brown, A.V., Cameron, C.T., Campbell, J.D., Cleary, A.M., Dash, S., Hokin, S., Huang, W., Kalberer, S.R., Nelson, R., Redsun, S., Weeks, N.T., Wilkey, A., Farmer, A.D., Cannon, S.B. 2021. The legume information system and associated online genomic resources. Legume Science. Article e74.
Nelson, M.N., Jabbari, J.S., Turakulov, R., Pradhan, A., Pazos-Navarro, M., Stai, J.S., Cannon, S.B., Real, D. 2020. The first genetic map for a psoraleoid legume (Bituminaria bituminosa) reveals highly conserved synteny with phaseoloid legumes. Plants. 9(8). Article 973.
Yadav, A., Fernandez-Baca, D., Cannon, S.B. 2020. Family-specific gains and losses of protein domains in the legume and grass plant families. Evolutionary Bioinformatics. 16.
Singh, J., Sun, M., Cannon, S.B., Wu, J., Khan, A. 2021. An accumulation of genetic variation and selection across the disease-related genes during apple domestication. Tree Genetics and Genomes. 17. Article 29.
Cannon, S.B., Innes, R.W. 2021. A better mousetrap to guard against anthracnose disease in bean. Journal of Experimental Botany. 72(10):3487-3488.