Location: Plant, Soil and Nutrition Research
Title: AnchorWave: sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication
Author:
SONG, BAOXING - Cornell University
MARCO-SOLA, SANTIAGO - Autonomous University Of Barcelona
MORETO, MIQUEL - Universitat Politècnica De Catalunya (UPC)
JOHNSON, LYNN - Cornell University
Buckler, Edward - Ed
STITZER, MICHELLE - Cornell University
Submitted to: Proceedings of the National Academy of Sciences (PNAS)
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/15/2021
Publication Date: 12/21/2021
Citation: Song, B., Marco-Sola, S., Moreto, M., Johnson, L., Buckler IV, E.S., Stitzer, M.C. 2021. AnchorWave: sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication. Proceedings of the National Academy of Sciences (PNAS). 119(1). Article e2113075119. https://doi.org/10.1073/pnas.2113075119.
DOI: https://doi.org/10.1073/pnas.2113075119
Interpretive Summary: Technology to sequence and assemble genomes is rapidly improving, generating the entire DNA sequence of many species. Although genome sequencing was once limited to a handful of model species, there are now projects under way to sequence and assemble millions of genomes. Simply knowing the sequence is often not enough to interpret what it means, and alignment is the necessary first step towards that understanding. Software designed to align and compare the genomes of two species was developed based on human and animal genomes, but plant genome structure poses challenges to applying these techniques. We developed the software AnchorWave to align two plant genomes while dealing with the high repeat content, prevalence of whole-genome duplications, and chromosomal rearrangements common to plant genomes. The resulting alignment allows interpretation of which sequences have changed and which have stayed the same since the two species diverged, and forms a baseline for any investigation of these genomes. Our whole-genome alignment strategy first identifies collinear blocks of genes between two genomes, then carefully aligns within those blocks using recent advances in the software implementation of global alignment strategies. Together, this extends the range of species that can be aligned to include many crop plants that are polyploid or have high amounts of repeats in their genomes.
Technical Abstract: Millions of species are currently being sequenced, and their genomes are being compared. Many of them have more complex genomes than model systems and raise novel challenges for genome alignment. Widely used local alignment strategies often produce limited or incongruous results when applied to genomes with dispersed repeats, long indels, and highly diverse sequences. Moreover, alignment using many-to-many or reciprocal best hit approaches conflicts with well-studied patterns between species with different rounds of whole-genome duplication. Here, we introduce Anchored Wavefront alignment (AnchorWave), which performs whole-genome duplication–informed collinear anchor identification between genomes and performs base pair–resolved global alignment for collinear blocks using a two-piece affine gap cost strategy. This strategy enables AnchorWave to precisely identify multikilobase indels generated by transposable element (TE) presence/absence variants (PAVs). When aligning two maize genomes, AnchorWave successfully recalled 87% of previously reported TE PAVs. By contrast, other genome alignment tools showed low power for TE PAV recall. AnchorWave precisely aligns up to three times more of the genome as position matches or indels than the closest competitive approach when comparing diverse genomes. Moreover, AnchorWave recalls transcription factor–binding sites at a rate of 1.05- to 74.85-fold higher than other tools with significantly lower false-positive alignments. AnchorWave complements available genome alignment tools by showing obvious improvement when applied to genomes with dispersed repeats, active TEs, high sequence diversity, and whole-genome duplication variation.
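To illustrate the two-piece affine gap cost idea named in the abstract, the following is a minimal sketch and not AnchorWave's implementation. A gap of a given length is scored under two affine cost lines, and the cheaper one is used, so short gaps are priced by a low-open, high-extend piece while long gaps (such as multikilobase TE insertions) are priced by a high-open, low-extend piece. The function names and all penalty values are hypothetical and chosen only for illustration; they are not taken from the paper or from the software's defaults.

```python
def affine_gap_cost(length: int, gap_open: int, gap_extend: int) -> int:
    """Cost of one gap under a single affine model: open + length * extend."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


def two_piece_affine_gap_cost(
    length: int,
    open1: int = 4,    # hypothetical: cheap to open, expensive to extend
    extend1: int = 2,
    open2: int = 24,   # hypothetical: expensive to open, cheap to extend
    extend2: int = 1,
) -> int:
    """Cost of one gap under a two-piece affine model.

    The gap is charged the minimum of the two affine pieces, so the
    penalty grows more slowly once the gap is long enough for the
    second piece to become cheaper.
    """
    return min(
        affine_gap_cost(length, open1, extend1),
        affine_gap_cost(length, open2, extend2),
    )


if __name__ == "__main__":
    # Compare a short indel with a multikilobase one.
    for gap_length in (1, 10, 20, 1000, 10000):
        print(gap_length, two_piece_affine_gap_cost(gap_length))
```

With these illustrative values the two pieces cross at a gap length of 20: shorter gaps are charged by the first piece and longer gaps by the second, which keeps a single long indel from being penalized as heavily as many short ones.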