Research Project: Soybean Seed Quality Improvement through Translational Genomics

Location: Plant Genetics Research

Title: A versatile resource of 1500 diverse wild and cultivated soybean genomes for post-genomics research

ZHANG, HENGYOU - Danforth Plant Science Center
JIANG, HELEN - Danforth Plant Science Center
HU, ZHENBIN - Danforth Plant Science Center
Song, Qijian
An, Yong-Qiang - Charles

Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 11/17/2020
Publication Date: 11/17/2020
Citation: Zhang, H., Jiang, H., Hu, Z., Song, Q., An, Y. 2020. A versatile resource of 1500 diverse wild and cultivated soybean genomes for post-genomics research. bioRxiv.

Interpretive Summary: With advances in next-generation sequencing technologies, soybean researchers have generated more than 15 terabytes of raw data over the past decade by sequencing the whole genomes of 1,465 diverse wild and cultivated accessions. In addition, our group sequenced 91 representative wild soybean accessions. After integrating our wild soybean genome sequencing data with the public data and analyzing the combined set, we identified, validated, and characterized a total of 32 million single nucleotide polymorphisms (SNPs). We demonstrated that the 1,556 diverse soybean accessions represent almost the entire genetic diversity of the more than 20,000 wild and cultivated soybean accessions in the U.S. soybean collection. We made the detailed annotation of the 32 million SNPs (32mSNPs) available at SoyBase and Ag Data Commons for the research community. The 32mSNPs allow researchers to examine the nucleotide sequence of a given gene or deoxyribonucleic acid (DNA) region, such as a quantitative trait locus (QTL) interval, across the 1,556 diverse sequenced soybean accessions; to genotype known QTL gene alleles; to discover new QTLs and alleles; and to identify soybean germplasm containing novel trait alleles that can be used as new genetic material for soybean breeding. This SNP dataset should play a significant role in realizing the potential of the large volume of genomic data for U.S. soybean research and product development.

Technical Abstract: Background: With advances in next-generation sequencing technologies, an unprecedented number of soybean accessions have been sequenced in many individual studies and made available as raw sequencing reads for post-genomic research. Results: To develop a consolidated and user-friendly genomic resource for post-genomic research, we combined the publicly available raw resequencing data of 1,465 soybean genomes with 91 newly sequenced, highly diverse wild soybean genomes. Together these provide a collection of 1,556 sequenced genomes from 1,501 diverse accessions (1.5K). The collection comprises wild accessions, landraces, and elite cultivars of soybean grown in East Asia or in major soybean-cultivating areas around the world. Our extensive sequence analysis discovered 32 million single nucleotide polymorphisms (32mSNPs) and revealed a density of 30 SNPs/kb and 12 non-synonymous SNPs/gene, reflecting the high structural and functional genomic diversity of the new collection. Each SNP was annotated with 30 categories of structural and/or functional information. We further paired accessions in the 1.5K collection with accessions among the 20,087 (20K) in the U.S. collection as genomic "equivalent" accessions sharing the highest genomic identity, to minimize barriers to soybean germplasm exchange between countries. We also exemplified the utility of the 32mSNPs in enhancing post-genomics research through in-silico genotyping, high-resolution GWAS, discovery and/or characterization of genes and alleles/mutations, and identification of germplasm containing beneficial alleles that are potentially experiencing artificial selection. Conclusion: The comprehensive analysis of publicly available large-scale genome sequencing data from diverse cultivated accessions, together with the wild accessions newly sequenced in-house, greatly increased the resolution of genome-wide variation in soybean. This could facilitate a variety of genetic and molecular-level analyses in soybean. The 32mSNPs and the 1.5K accessions, with their comprehensive annotation, have been made available at SoyBase and Ag Data Commons. The dataset could further serve as a versatile and expandable core resource for exploring the exponentially increasing volume of genome sequencing data in a variety of post-genomic research.