Publication : USDA ARS

Publication #154045

Title: IDENTIFICATION OF PARALOGS AND SNP IN THE ALLOPOLYPLOID SOYBEAN GENOMICS FROM EST DATA.

Author

	Van Tassell, Curtis - Curt
	Cregan, Perry
	MATUKUMALLI, LAKSHMI - GEORGE MASON UNIVERSITY
	GRENFENSTETTE, JOHN - GEORGE MASON UNIVERSITY
	CHOI, IK-YOUNG - KOREA

Submitted to: Proceedings of International Meeting on Single Nucleotide Polymorphism and Complex Genome Analysis
Publication Type: Abstract Only
Publication Acceptance Date: 8/10/2003
Publication Date: 8/10/2003
Citation: Van Tassell, C.P., Cregan, P.B., Matukumalli, L.K., Grenfenstette, J.J., Choi, I. 2003. Identification of paralogs and SNP in the allopolyploid soybean genomics from EST data [abstract]. Proceedings of International Meeting on Single Nucleotide Polymorphism and Complex Genome Analysis. Abstract 64.

Technical Abstract: Expressed sequence tag (EST) data have been used for single nucleotide polymorphism (SNP) discovery in several organisms, such as humans and maize (Useche et al., 2001). Soybean is an important commercial legume crop and an ancient tetraploid. RFLP analysis showed that the soybean genome is duplicated an average of 2.5 times, with duplication of up to 6 times in some regions (Shoemaker et al., 1996). For these and other reasons, the soybean genome has not been sequenced and is not likely to be sequenced soon. EST sequencing has been performed as an economically feasible alternative for deriving more information about soybean genes. About 300,000 EST sequences are available from GenBank, and chromat data for these sequences are also available. NCBI and TIGR have clustered redundant EST into contigs, but systematic paralog analysis among the polyploids has not been attempted with limited sequence information. Many other important commercial crops (e.g., wheat, alfalfa and cotton) are polyploids with large and complex genomes. Soybean can serve as a model species for studying polyploid genomes because it has several features that make such study feasible. As a tetraploid, soybean is a comparatively simple model among the polyploids (wheat, for example, is hexaploid). About 80-90% of the alleles in North American soybean cultivars were contributed by approximately 20 soybean introductions. Zhu et al. (2003) showed that six soybean genotypes can account for up to 80% of haplotype variation. Hence, limited variation can be expected. Furthermore, a majority (65%) of the EST data are from one cultivar (Williams 82), and these sequences can be used for paralog identification. Modern soybean cultivars are extensively inbred, so heterogeneity between the pair of homologous chromosomes is very small. In addition, Medicago truncatula, a closely related diploid model legume, is currently being sequenced.
Identification of SNP in polyploid species is a more difficult problem than in diploids because paralogs are very similar, and variation among paralogs is not SNP. We are using the following bioinformatics approach to first distinguish the paralogs and then to study the variation within each paralog among different cultivars (SNP). The predictions are being experimentally tested and validated. Step 1: Process chromat data using EST-PAGE (Matukumalli et al., 2003) to remove vector and low-quality sequence at the ends and to screen for contamination from bacterial sequence, known mitochondrial and chloroplast genes, and soybean retroviral elements. Step 2: Cluster EST from the Williams 82 cultivar using megaBlast (15,861 clusters). Step 3: Further assemble clusters into contigs using CAP3 and analyze variation in each contig using PolyBayes. Step 4: Develop and compare a number of algorithmic approaches to distinguish the paralogs from the contig assemblies and clusters. Step 5: Match and compare EST from other cultivars to the Williams 82 paralogs from the previous step to identify SNP. Identification of SNP from EST is valuable for discovering genetic variants in and around genes, thereby permitting the genetic mapping of these genes and, in some instances, identifying the specific sequence variants that alter gene function.
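
The distinction drawn above between paralog variation and SNP can be illustrated with a minimal sketch. This is not the algorithm used in the study (Step 4 states that several approaches are still being compared); it only shows the underlying idea under simplifying assumptions. It assumes the reads in a contig are already aligned column by column, that Williams 82 is inbred so that disagreement among its own reads suggests paralogs, and that a base must appear in at least two reads to count as more than a sequencing error. The function names, the `min_reads` threshold, and the cultivar label "cultivar_B" are hypothetical.

```python
from collections import Counter

REFERENCE = "Williams 82"  # cultivar supplying most of the EST data, per the abstract


def supported_alleles(bases, min_reads=2):
    """Return the bases seen in at least `min_reads` reads (gaps and N are ignored)."""
    counts = Counter(b for b in bases if b in "ACGT")
    return {b for b, n in counts.items() if n >= min_reads}


def classify_columns(reads, min_reads=2):
    """Label variable alignment columns as putative paralog differences or candidate SNP.

    reads: list of (cultivar, aligned_sequence) pairs, all sequences of equal length.
    A column is labeled "paralog" when the reference cultivar's own reads support
    two or more bases, and "snp" when the reference reads agree on one base but
    another cultivar supports a different base.
    """
    length = len(reads[0][1])
    labels = {}
    for col in range(length):
        ref = supported_alleles([s[col] for c, s in reads if c == REFERENCE], min_reads)
        other = supported_alleles([s[col] for c, s in reads if c != REFERENCE], min_reads)
        if len(ref) >= 2:
            labels[col] = "paralog"
        elif len(ref) == 1 and other - ref:
            labels[col] = "snp"
    return labels


# Toy alignment (made-up sequences, for illustration only).
reads = [
    ("Williams 82", "ACGTACGT"),
    ("Williams 82", "ACGTACGT"),
    ("Williams 82", "ACCTACGT"),
    ("Williams 82", "ACCTACGT"),
    ("cultivar_B", "ACGTACAT"),  # hypothetical second cultivar
    ("cultivar_B", "ACGTACAT"),
]
print(classify_columns(reads))  # {2: 'paralog', 6: 'snp'}
```

In the approach described above, SNP are sought within each paralog after the paralogs have been separated, so a fuller version would first partition the reads by their bases at the "paralog" columns and then compare cultivars within each partition.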