Submitted to: Biomed Central (BMC) Genomics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 8/9/2006
Publication Date: 8/9/2006
Citation: Nelson, R., Shoemaker, R.C. 2006. Identification and Analysis of Gene Families from the Duplicated Genome of Soybean Using EST Sequences. Biomed Central (BMC) Genomics. 7:204.
Interpretive Summary: Soybean is an important crop in the United States and in other countries around the world. Genetic improvement of this crop depends on a complete and thorough understanding of the genes that make up the genome and how they have evolved. A first step toward understanding the soybean genome is to identify all of its genes and then infer their functions, so that genes that may play a role in traits of agronomic importance can be identified. This study examined potential gene sequences identified in EST libraries of soybean and assigned molecular functions to them. This work will be used by soybean researchers, and potentially by soybean breeders, to help identify genes that may play a role in soybean physiology and traits. Breeders could exploit this information to improve soybean yield and reduce input costs for soybean farmers.
Technical Abstract: Large-scale gene analysis of most organisms is hampered by incomplete genomic sequences. In many organisms, such as soybean, the best source of sequence information is expressed sequence tag (EST) libraries. Soybean has a large (1115 Mbp) genome that has yet to be fully sequenced. However, it has the sixth-largest EST collection, composed of ESTs from a variety of soybean genotypes. Because many EST libraries were constructed from RNA extracted from various genetic backgrounds, gene identification from these sources is complicated by the presence of both gene and allele sequence differences. We used the ESTminer suite of programs to identify potential soybean gene transcripts from a single genetic background, which allowed us to observe functional classifications between gene families as well as structural differences between genes and gene paralogs within families. The identification of potential gene sequences (pHaps) from soybean allows us to begin building a picture of the genomic history of the organism and to begin observing the evolutionary fates of gene copies in this highly duplicated genome.
RESULTS: We identified approximately 45,000 potential gene sequences (pHaps) from EST sequences of Williams/Williams82, an inbred genotype of soybean (Glycine max (L.) Merr.), using a redundancy criterion to identify reproducible sequence differences between related genes within gene families. Analysis of these sequences revealed that single-base substitutions and single-base indels are the most frequently observed forms of sequence variation between genes within families in the dataset. Genomic sequencing of selected loci indicates that intron-like intervening sequences are numerous and are approximately 220 bp in length. Functional annotation of gene sequences indicates that functional classifications are not randomly distributed among gene families containing few or many genes.
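The redundancy criterion mentioned in the abstract can be illustrated with a minimal sketch. This is not the ESTminer implementation; it is an illustration under stated assumptions: the ESTs are already aligned into equal-length strings (with "-" marking gaps), a difference counts as reproducible when each of at least two alternative characters at a column is seen in `min_support` or more reads, and reads are then grouped by their characters at those columns. The function names and the `min_support` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def reproducible_sites(aligned_reads, min_support=2):
    """Return alignment columns where at least two distinct characters
    are each observed in `min_support` or more reads (assumed criterion)."""
    sites = []
    for col in range(len(aligned_reads[0])):
        counts = Counter(read[col] for read in aligned_reads)
        supported = [base for base, n in counts.items() if n >= min_support]
        if len(supported) >= 2:
            sites.append(col)
    return sites


def group_by_haplotype(aligned_reads, min_support=2):
    """Group reads by the characters they carry at the reproducible sites.
    Each group is a candidate distinct gene sequence in this toy model."""
    sites = reproducible_sites(aligned_reads, min_support)
    groups = defaultdict(list)
    for read in aligned_reads:
        key = "".join(read[c] for c in sites)
        groups[key].append(read)
    return sites, dict(groups)


# Hypothetical toy alignment: the G/C difference at column 2 is seen
# repeatedly, while the lone T at column 5 is seen only once.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACCTAC", "ACGTAT"]
sites, groups = group_by_haplotype(reads)
print(sites)
print({key: len(members) for key, members in groups.items()})
```

In this toy alignment, the single-read difference at the last column is ignored as a possible sequencing error, while the repeated difference at column 2 splits the reads into two candidate sequences. A real pipeline would also need to handle partially overlapping reads and sequencing errors that fall on reproducible sites, which this sketch does not attempt.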