Submitted to: BARC Poster Day
Publication Type: Abstract Only
Publication Acceptance Date: April 29, 2004
Publication Date: April 29, 2004
Citation: Matukumalli, L.K., Grefenstette, J.J., Van Tassell, C.P., Choii, I., Cregan, P.B. 2004. In silico prediction and validation of polymorphisms in soybean genome using EST data. BARC Poster Day.
Soybean is a major commercial crop and an important raw material in the manufacture of foods, cosmetics, and industrial products. Understanding the organization of the soybean genome and the genes responsible for specific traits can help improve productivity, disease resistance, oil and seed quality, and tolerance to extreme conditions. Whole-genome sequencing of soybean is not currently a priority, and the genome is unlikely to be sequenced soon. However, extensive expressed sequence tag (EST) sequencing has been performed as an economically feasible alternative for deriving information about soybean genes. About 330,000 soybean EST sequences are currently deposited in GenBank.
The soybean genome is an ancient tetraploid that has become diploidized over time (Hadley and Hymowitz, 1973). RFLP analysis showed that the soybean genome is duplicated 2.5 times on average, with duplication of up to 6 times in some regions (Shoemaker et al., 1996). Paralogs are duplicated copies of a gene that have diverged since the duplication. Systematic paralog analysis has not been performed on the soybean EST data; only general methodologies applicable to diploids have been used. As a tetraploid, soybean can serve as a simple model among the polyploids (wheat, for example, is a hexaploid). North American soybean cultivars have limited diversity: 35 ancestors contributed more than 95% of all alleles (Gizlice et al., 1994). Zhu et al. (2003) studied single nucleotide polymorphisms (SNPs) in 25 cultivars and showed that six soybean genotypes can account for up to 80% of haplotype variation. About 60% of the EST data come from a single cultivar (Williams). EST data from Williams can be used to distinguish paralogs. Distinguishing paralogs can provide a better estimate of the number of soybean genes, along with their paralogous relationships, and will also assist in SNP discovery.
Reliable SNP detection from ESTs is valuable for discovering genetic variants in and around genes, thereby permitting genetic mapping of these genes and, in some instances, identifying the specific sequence variants that can alter gene function. SNP discovery is more difficult in polyploid species than in diploids because paralogs are very similar and sequence differences among paralogs are not SNPs. The bioinformatics programs and tools developed are generalized as much as possible for reuse in similar projects (other polyploid commercial crops such as wheat, alfalfa, and cotton) and will be made available as open source.
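The distinction between paralog differences and true SNPs can be illustrated with a minimal sketch. It is not the authors' program; it only assumes that each cultivar is an inbred, homozygous line, so that two different bases seen within one cultivar at the same aligned position point to paralogous copies rather than alleles. The function name `classify_site`, the input layout, and the label `"OtherCultivar"` are hypothetical, and sequencing errors are ignored for simplicity.

```python
from collections import defaultdict


def classify_site(column):
    """Classify one aligned position in an EST cluster.

    column: list of (cultivar, base) pairs, one per EST read covering the site.
    Returns "invariant", "paralog_variant", or "candidate_snp".

    Assumption (not from the abstract): every cultivar is homozygous, and
    base calls are error-free.
    """
    bases_by_cultivar = defaultdict(set)
    for cultivar, base in column:
        bases_by_cultivar[cultivar].add(base.upper())

    # All distinct bases observed at this position, across cultivars.
    all_bases = set().union(*bases_by_cultivar.values())
    if len(all_bases) <= 1:
        return "invariant"

    # More than one base inside a single cultivar: under the homozygosity
    # assumption this reflects paralogous copies, not an allelic SNP.
    if any(len(bases) > 1 for bases in bases_by_cultivar.values()):
        return "paralog_variant"

    # Each cultivar is internally consistent but cultivars differ.
    return "candidate_snp"


if __name__ == "__main__":
    site = [("Williams", "A"), ("Williams", "G"), ("OtherCultivar", "A")]
    print(classify_site(site))
```

A practical pipeline would also need base-quality filtering and a minimum number of reads per variant before trusting either label, since a single sequencing error would otherwise be reported as a paralog difference.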