Location: Crop Germplasm Research
Title: Genome sequence of the cultivated cotton Gossypium arboreum
Submitted to: Nature Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 4/24/2014
Publication Date: 6/1/2014
Citation: Li, F., Fan, G., Wang, K., Sun, F., Yuan, Y., Song, G., Ma, Z., Li, Q., Lu, C., Zou, C., Chen, W., Liang, X., Shang, H., Liu, W., Xiao, G., Gou, C., Ye, W., Xu, X., Zhang, X., Wei, H., Li, Z., Zhang, G., Wang, J., Liu, K., Kohel, R.J., Percy, R.G., Yu, J., Zhu, Y., Wang, J., Yu, S. 2014. Genome sequence of the cultivated cotton Gossypium arboreum. Nature Genetics. 46(6):567-574.
Interpretive Summary: Upland cotton originated from the hybridization of two species and therefore has a very complex genetic makeup. Sequencing Upland cotton will greatly aid researchers in characterizing and exploiting cotton germplasm for agronomic traits, but its two-species origin greatly complicates sequencing efforts. As the first step toward sequencing Upland cotton, we had to sequence both of its putative parents. Here we report the complete sequencing and successful assembly of one of those parents. Over 90 percent of the assembled sequences, covering more than 98 percent of the parent species genome, were anchored and oriented to 13 chromosomes. A total of 41,330 genes were predicted with 92 percent being confirmed. The sequencing of the parent species A-genome, along with the previously published sequence of Upland cotton's other parent D-genome, lays the foundation for fully sequencing and assembling the more genetically complex commercial Upland cotton varieties. The parent species sequence provides the research community with critical resources and information for accelerated identification and enhancement of genetic systems contributing to cotton productivity, quality and environmental stability.
Technical Abstract: Cotton is one of the most economically important natural fiber crops in the world, and the complex tetraploid nature of its genome (AADD, 2n = 52) makes genetic, genomic and functional analyses extremely challenging. Here we sequenced and assembled 98.3% of the 1.7-gigabase G. arboreum (AA, 2n = 26) genome, whose progenitor is a putative contributor of the diploid A-subgenome to tetraploid cottons. Paired-end sequencing of 10 libraries with insert sizes ranging from 180 bp to 40 kb yielded 193.6 Gb of clean sequence, covering the genome 112.6-fold. Using a set of 24,569 single-nucleotide polymorphism (SNP) markers obtained from 154 F2 restriction-site-associated DNA (RAD) lines, we were able to anchor and orient 90.4% of the assembly on 13 pseudochromosomes. The majority of the genome (68.5%) is occupied by repetitive DNA sequences, most of which are long terminal repeats (LTRs). We predicted 41,330 protein-coding genes in G. arboreum, a number similar to that of G. raimondii. One ancient (about 115-146 million years ago, MYA) and one recent (approximately 13-20 MYA) whole-genome duplication (WGD) were shared by both species before the speciation event around 2-13 MYA. The two-fold size difference between these otherwise highly co-linear genomes was the result of LTR insertions in the past five million years. Expansion and contraction of nucleotide-binding site (NBS) gene family sizes in different cotton species may be responsible for their differing resistance to Verticillium dahliae. The ethylene-central regulatory pathway may fundamentally determine the fate of cotton fiber cell development.
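The sequencing figures quoted in the abstract can be checked against one another with simple arithmetic. The sketch below is illustrative only and is not part of the published analysis: the function and variable names are hypothetical, and it assumes that fold coverage is simply clean sequence divided by genome size, and that the reported percentages apply directly to the genome size implied by that ratio.

```python
# Illustrative consistency check of the figures reported in the abstract.
# All names here are hypothetical; the inputs are the values quoted above.

def implied_genome_size_gb(clean_sequence_gb: float, fold_coverage: float) -> float:
    """Genome size implied by total clean sequence and fold coverage."""
    return clean_sequence_gb / fold_coverage

clean_sequence_gb = 193.6   # clean paired-end sequence (Gb)
fold_coverage = 112.6       # reported genome coverage
assembled_fraction = 0.983  # share of the genome that was assembled
anchored_fraction = 0.904   # share of the assembly placed on 13 pseudochromosomes
repeat_fraction = 0.685     # share of the genome occupied by repeats

genome_gb = implied_genome_size_gb(clean_sequence_gb, fold_coverage)
assembled_gb = genome_gb * assembled_fraction
anchored_gb = assembled_gb * anchored_fraction
# Assumption: the repeat percentage is applied to the implied genome size.
repeat_gb = genome_gb * repeat_fraction

print(f"Implied genome size: {genome_gb:.2f} Gb")
print(f"Assembled sequence:  {assembled_gb:.2f} Gb")
print(f"Anchored sequence:   {anchored_gb:.2f} Gb")
print(f"Repetitive sequence: {repeat_gb:.2f} Gb")
```

Under these assumptions the implied genome size comes out at roughly 1.72 Gb, which agrees with the 1.7-gigabase estimate stated in the abstract.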