Research Project: SoyBase and the Legume Clade Database

Location: Corn Insects and Crop Genetics Research

2022 Annual Report

Objective 1: Accelerate trait analyses, germplasm analyses, genetic studies, and breeding of soybean and other economically important legume crops through stewardship of genomes, genetic data, genotype data, and phenotype data.
Objective 2: Develop an infrastructure that enhances the integration of genotype and phenotype information and corresponding data sets with query and visualization tools to facilitate efficient plant breeding for soybean and select legume crops.
Objective 3: Collaborate with database developers and plant researchers to develop improved methods and mechanisms for open, standardized data and knowledge exchange to enhance database utility and interoperability.
Objective 4: Provide support and research coordination services for the soybean and other legume research and breeding communities; train new scientists and expand outreach activities through workshops, web-based tutorials, and other communications.

Incorporate revised primary reference genome sequence for soybean into SoyBase. House and provide access to genome sequences for other soybean accessions, haplotype data, and related annotations. Incorporate revised gene models and annotations into SoyBase. Install or implement web-based tools for curation and improvement of soybean gene models and gene annotations. Incorporate available legume genome sequences and annotations. Working with collaborators, collect and add genetic map and QTL data for crop legumes. Extend web-based tools for navigation among biological sequence data across the legumes. Extend and develop methods and storage capacity for accepting genomic data sets for soybean and other legume species. Develop a complete set of descriptors (ontologies) for soybean biology (anatomy, traits, and development), and for other significant crop legumes as needed. Work with the relevant ontology communities-of-practice to incorporate these descriptors into broadly accessible ontologies. Develop web tutorials for important typical uses of SoyBase and the Legume Clade Database. Present and train about features at relevant conferences and workshops. Regularly seek feedback from users about desired features and usability.

Progress Report
The SoyBase and Legume Clade Database project in Ames, Iowa, focuses on development of two online databases: SoyBase and the Legume Information System (LIS). LIS development is a collaboration between ARS staff in Ames, Iowa, and developers at the National Center for Genome Resources (NCGR) in Santa Fe, New Mexico. Software development is shared by these databases to improve efficiency and the user experience of these resources. The essence of Objective 1 is to provide stewardship for major genetic and genomic data sets, while Objective 2 focuses on developing and maintaining the computational infrastructure needed to house, integrate, and provide access to genetic and genomic data. The major focus in both the SoyBase and LIS online databases over the last fiscal year has been on preparation for fundamental redevelopment of these websites and databases, to accommodate the rapidly increasing pace of genomic data generation. Site and database redevelopment will enable both web/database sub-projects to make use of new technologies that offer improved capacities for handling large and numerous new genomic data sets. Goals of the database and website redevelopment include: greater modularity of code, to allow for rapid changes and additions to particular sections of the databases and sites; increased use of software container technologies, to facilitate portable software development environments, reproducible and secure production deployments, and easier adoption of project-developed applications by the community; updated web development languages and frameworks, to give more responsive user experiences; and updated database schemas, to allow for increased data scaling and to improve database integrity. The transition to the new framework and technologies is mostly complete for LIS. The transition is well underway for SoyBase and is expected to be completed over the coming fiscal year.
In support of Objective 1, SoyBase and LIS have incorporated 24 new genome assemblies and associated annotation sets for various legume species (six for soybean and 16 for other legumes). These are available via the common Data Store that is used by SoyBase and LIS. Functional information about more than 5,000 genes has also been harvested from the soybean literature and added to SoyBase. Considering data collections of all types (genetic studies, genome assemblies, diversity data sets, etc.), the number of collections in the Data Store has more than doubled in the last year, increasing from 326 to 803. This curation work involves close evaluation of the data, often with corrections to bring the data into conformity with accepted standards. This work makes these important data sets more accessible to researchers and to other online biological projects. SoyBase and LIS are used by plant breeders and researchers throughout the U.S. and worldwide, as a source of information about the genetic basis for traits important in legume crops. The legume family includes crops such as soybean, chickpea, beans of various types, lentils, peanuts, alfalfa, and many others. By identifying the genetic elements responsible for traits such as flowering time, growth habit, or disease resistance, plant breeders can more directly target and select for desired features. SoyBase holds 24 reference-quality soybean genome assemblies (including those of wild relatives), and LIS holds 71 genome assemblies for other crop and model legumes. These are integrated with other resources such as genetic markers, information about gene function and expression, and genotype information about many thousands of accessions in the U.S. National Plant Germplasm System. Because LIS and SoyBase have curated genetic and genomic data for Glycine max (soybean) and other legume species, these databases are heavily utilized by researchers and breeders worldwide. In the last calendar year, SoyBase was cited in more than 500 scientific articles and 11 U.S. patents, and LIS data was cited in at least 118 articles, indicating the scientific and commercial utility of these databases and resources.
In support of Objective 2, we have added and updated several online tools, allowing sequence-based comparison searches, annotation of user-submitted sequences, comparisons of gene order and composition between related species, and comparison of marker-trait associations between legume species.
In support of Objective 3, the SoyBase and LIS projects have worked to provide and use more efficient machine access to data. Use of well-defined, stable, web-accessible application programming interfaces (APIs) permits various websites to query one another and to access and use particular data sets at those respective sites. For example, SoyBase and LIS use the APIs of USDA’s GRIN-Global (Germplasm Resources Information Network) to access trait and collection-location data for germplasm (varieties). This information is used for displaying the collection locations of GRIN germplasm on an interactive geographic information system map. The APIs are also used internally, for efficient and stable access to data within the SoyBase and LIS projects. One example of such internal use is retrieval of particular genomic data, e.g., genes, genomic sequences from selected regions, or genomic variants from regions and accessions. Also related to Objective 3, the SoyBase and LIS projects have increased their use of the InterMine technology. InterMine is a data warehouse technology and interface, initially developed to house genetic data for model organisms, but now used by many genetic and genomic database projects worldwide. InterMine allows users to construct queries involving essentially any set of genetic or genomic features, and to quickly query and get reports about features such as genes and genetic associations. As part of LIS, InterMine instances have been developed for eight genera and more than 20 species, including soybean and its wild relatives.
At SoyBase, results of the 2020 and 2021 Northern Uniform Soybean Test (NUST) were collected and added to the SoyBase database. These comprise the results of trials from 1989 to the present. Additionally, 577 new strain pedigrees have been added to the SoyBase Soybean Parentage database. The phenotype, strain, and trial data have been loaded into SoybeanBase, which is an instance of the BreedBase/Breeding Insight platform, in collaboration with Breeding Insight personnel.
Objective 4, focusing on research community support and training, was met through outreach efforts throughout the year. Outreach related to research and community support included responses to over 55 requests for information/data from the SoyBase database. The project group participated in several working groups of the AgBioData Research Coordination Network (RCN) project, including the Generic Feature Format (GFF3) specification working group, the Ontology Working Group, the Data Federation Working Group, the Diversity Recruitment Working Group, and the Pan-Genome Working Group. The project group is also represented on the AgBioData Steering Committee. Other research community activities include membership on the Soybean Genetics Committee and the REE Data Stewards Community of Practice. Outreach to communicate project research includes six manuscripts published in FY22, including manuscripts describing new features of SoyBase and LIS, methods for predicting rare or novel (“orphan”) genes, and analyses of diverse germplasm collections in mung bean, soybean and its wild relatives, and peanut and its wild relatives.

1. Incorporated genome assemblies and annotations for peanut, bean, cowpea, and alfalfa into SoyBase and the Legume Information System (LIS). Genome sequences describe the order and content of the DNA in all of the chromosomes of an organism, and serve as a common framework or backbone for much of the work done by breeders and other researchers. This framework identifies genes, genetic markers, and traits and their chromosomal locations. ARS researchers in Ames, Iowa, have collected 24 new full genome assemblies, across 15 legume species, and incorporated these into the Data Store that is used by SoyBase and LIS and available for use by researchers. The new genome assemblies include three for peanut and its wild relatives, six for wild relatives of soybean, four for bean and its wild relatives, assemblies for diverse cowpea varieties, three for alfalfa and its relatives, and one for a savannah tree, the apple ring Acacia, that is an important component of agroforestry systems in Africa. These data will be of interest both to plant breeders and to biologists working to understand the genes involved in adaptation to the various physical environments that these species occupy. This information may be used, for example, to identify genetic markers for traits such as tolerance to increased heat, drought, or salinization. Breeding and research on legume crops impact people worldwide, as legumes provide protein and other nutrients for a large portion of the global population.

2. Published a report describing important contributions of crop wild relatives to cultivated peanut. The small number of crop species and their generally narrow genetic base are a fundamental vulnerability for food security. Wild crop relatives are strategic sources of genetic diversity for breeding resistance to pests, diseases, and environmental stresses. ARS researchers in Ames, Iowa, participated in a consortium that incorporated genetic material from a wild peanut relative, Arachis cardenasii, into domesticated peanut, Arachis hypogaea. This genetic incorporation, initiated by scientists in 1967, involved complex and challenging genetic crosses. Subsequent breeding cycles substantially obscured this contribution from the wild relative. However, the genetic legacy from this breeding work can now be seen in enhanced peanut cultivars in at least 30 countries. This work has improved food security and provided economic and environmental benefits.

3. Published a study describing the newly assembled genome sequences of six perennial relatives of soybean. Soybean, one of the most important crops globally for its protein and oil content, faces numerous challenges from insects, pathogens, and environmental stresses. A group of wild relatives of soybean from Australia may help researchers better understand how to improve soybean resilience to various stresses. ARS researchers in Ames, Iowa, participated in an international consortium that reported the genome assemblies of six soybean relatives. All six of the newly sequenced species are perennial, and all survive in challenging environments in their native ranges in Australia. This work also identified genes that are highly conserved, as well as genes that are specific to one or several species. A gene involved in the transition between the perennial and annual forms of soybean and its relatives was also described. These results provide basic information that may be used by breeders for soybean improvement, particularly in mitigating environmental challenges due to climate change.

4. Incorporated variety trial and pedigree data for soybean, spanning the last 30 years of trials from the northern U.S. For all major crops, variety trials and pedigree data are used to determine which new varieties are most suited to a particular region or to meeting grower and consumer objectives. Traits that are typically assessed in soybean variety trials include yield, tolerance to adverse field conditions such as nutrient deficiencies or pathogens, seed characteristics such as protein and oil concentration and quality, and growth and harvest characteristics such as germination rate and plant architecture at harvest. ARS researchers in Ames, Iowa, have added soybean measurements for these traits for more than 1,900 test strains submitted to the Northern Uniform Soybean Tests (NUST). Incorporating trait data on these strains into SoyBase gives breeders access to performance data of test strains from 1989 to the present. This will allow breeders to easily see the results of breeding activity across programs, evaluate any increase in grain yield and other seed quality measurements, and incorporate strains with superior genetics into their breeding programs.

Review Publications
Brown, A.V., Grant, D.M., Nelson, R. 2021. Using crop databases to explore phenotypes: from QTL to candidate genes. Plants. 10(11). Article 2494.
Chiteri, K.O., Zaki Jubery, T., Dutta, S., Ganapathysubramanian, B., Cannon, S.B., Singh, A. 2022. Dissecting the root phenotypic and genotypic variability of the Iowa mung bean diversity panel. Frontiers in Plant Science. 12:808001.
Zhuang, Y., Wang, X., Li, X., Hu, J., Fan, L., Landis, J.B., Cannon, S.B., Grimwood, J., Schmutz, J., Jackson, S.A., Doyle, J.J., Zhang, X., Zhang, D., Ma, J. 2022. Phylogenomics of the genus Glycine sheds light on polyploid evolution and life-strategy transition. Nature Plants. 8: 233-244.
Bertioli, D.J., Clevenger, J., Godoy, I., Stalker, T., Wood, S., Santos, J., Ballen-Taborda, C., Abernathy, B., Azevedo, V., Campbell, J.D., Chavarro, C., Chu, Y., Farmer, A.D., Fonceka, D., Gao, D., Grimwood, J., Halpin, N., Korani, W., Michelotto, M.D., Ozias-Akins, P., Vaughn, J.N., Youngblood, R., Moretzsohn, M.C., Wright, G.C., Jackson, S.A., Cannon, S.B., Scheffler, B.E., Leal-Bertioli, S.M. 2021. Legacy genetics of Arachis cardenasii in the peanut crop shows the profound benefits of international seed exchange. Proceedings of the National Academy of Sciences (PNAS). 118(38). Article e2104899118.
Li, J., Singh, U., Bhandary, P., Campbell, J.D., Arendsee, Z., Seetharam, A., Wurtele, E. 2021. Foster thy young: enhanced prediction of orphan genes in assembled genomes. Nucleic Acids Research. 50(7):e37.
Redsun, S., Hokin, S., Cameron, C.T., Cleary, A.M., Berendzen, J., Dash, S., Brown, A.V., Wilkey, A., Campbell, J.D., Huang, W., Kalberer, S.R., Weeks, N.T., Cannon, S.B., Farmer, A.D. 2022. Doing genetic and genomic biology using the Legume Information System and associated resources. Methods in Molecular Biology. 2443.