Location: Plant, Soil and Nutrition Research
Title: AnchorWave: sensitive alignment of genomes with high diversity, structural polymorphism and whole-genome duplication variation
|SONG, BAOXING - Cornell University - New York|
|MARCO-SOLA, SANTIAGO - University Of Barcelona|
|MORETO, MIQUEL - Universitat Politècnica De Catalunya (UPC)|
|JOHNSON, LYNN - Cornell University - New York|
|Buckler, Edward - Ed|
|STITZER, MICHELLE - Cornell University - New York|
Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 7/30/2021
Publication Date: 7/30/2021
Citation: Song, B., Marco-Sola, S., Moreto, M., Johnson, L., Buckler IV, E.S., Stitzer, M.C. 2021. AnchorWave: sensitive alignment of genomes with high diversity, structural polymorphism and whole-genome duplication variation. bioRxiv. 2021.07.29.454331. https://doi.org/10.1101/2021.07.29.454331.
Interpretive Summary: Technology to sequence and assemble genomes is rapidly improving, making the entire DNA sequence of many species available. Software for aligning and comparing the genomes of two species was developed largely for human and animal genomes, and plant genome structure poses challenges to applying these techniques. We developed the software AnchorWave to align two plant genomes while handling the high repeat content, prevalence of whole-genome duplications, and chromosomal rearrangements common to plant genomes. Such an alignment shows which sequences have changed and which have stayed the same since the two species diverged, and it forms a baseline for any investigation of these genomes. Although genome sequencing was once limited to a handful of model species, there are now projects to sequence and assemble millions of genomes. Simply knowing the sequence is often not enough to interpret what it means, and alignment is the necessary first step towards that understanding. Our whole-genome alignment strategy first identifies collinear blocks of genes between two genomes, then carefully aligns within those blocks using recent advances in the software implementation of global alignment. Together, this extends the range of species that can be aligned to include many crop plants that are polyploid or have highly repetitive genomes.
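The first step described above, finding blocks of genes that appear in the same order in both genomes, can be illustrated with a small chaining routine. This is a minimal sketch under stated assumptions, not AnchorWave's implementation: the `Anchor` type and `best_collinear_chain` function are hypothetical names, anchors are reduced to a single start coordinate per genome, and strand, inversions and whole-genome duplication are ignored.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    """A matched gene (hypothetical type): start position in each genome plus a match score."""
    ref_start: int
    qry_start: int
    score: float


def best_collinear_chain(anchors: List[Anchor]) -> List[Anchor]:
    """Return the highest-scoring set of anchors that increase in both genomes.

    Simple O(n^2) dynamic programming over anchors sorted by reference position.
    """
    if not anchors:
        return []
    ordered = sorted(anchors, key=lambda a: (a.ref_start, a.qry_start))
    n = len(ordered)
    best = [a.score for a in ordered]  # best chain score ending at anchor i
    prev = [-1] * n                    # predecessor of anchor i in that chain
    for i in range(n):
        for j in range(i):
            same_order = (ordered[j].ref_start < ordered[i].ref_start
                          and ordered[j].qry_start < ordered[i].qry_start)
            if same_order and best[j] + ordered[i].score > best[i]:
                best[i] = best[j] + ordered[i].score
                prev[i] = j
    # Walk back from the best-scoring end point to recover the chain.
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(ordered[end])
        end = prev[end]
    chain.reverse()
    return chain


# Illustrative coordinates only: the second anchor is out of order in the query genome.
example = [Anchor(100, 1000, 1.0), Anchor(200, 5000, 1.0),
           Anchor(300, 2000, 1.0), Anchor(400, 3000, 1.0)]
chain = best_collinear_chain(example)
```

In this toy input the anchor at query position 5000 breaks the shared order, so it is left out of the chain; base-pair alignment would then be carried out between consecutive anchors that remain.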
Technical Abstract: Millions of species are currently being sequenced and their genomes are being compared. Many of them have more complex genomes than model systems and raise novel challenges for genome alignment. Widely used local alignment strategies often produce limited or incongruous results when applied to genomes with dispersed repeats, long indels, and highly diverse sequences. Moreover, alignment using many-to-many or reciprocal best hit approaches conflicts with well-studied patterns between species with different rounds of whole-genome duplication or polyploidy levels. Here we introduce AnchorWave, which performs whole-genome duplication informed collinear anchor identification between genomes and performs base-pair resolution global alignments for collinear blocks using the wavefront algorithm and a 2-piece affine gap cost strategy. This strategy enables AnchorWave to precisely identify multi-kilobase indels generated by transposable element (TE) presence/absence variants (PAVs). When aligning two maize genomes, AnchorWave recalled 87% of previously reported TE PAVs between the two lines. By contrast, other genome alignment tools showed almost zero power for TE PAV recall. When comparing diverse genomes, AnchorWave precisely aligns up to three times more of the genome than the closest competitive approach. Moreover, AnchorWave recalls transcription factor binding sites (TFBSs) at a rate 1.05- to 74.85-fold higher than other tools, with significantly fewer false-positive alignments. AnchorWave shows clear improvement when applied to genomes with dispersed repeats, active transposable elements, high sequence diversity and whole-genome duplication variation.
Significance statement: One fundamental analysis needed to interpret genome assemblies is genome alignment. Yet, accurately aligning regulatory and transposon regions outside of genes remains challenging. We introduce AnchorWave, which implements a genome duplication informed longest path algorithm to identify collinear regions and performs base-pair resolved, end-to-end alignment for collinear blocks using an efficient 2-piece affine gap cost strategy. AnchorWave improves alignment of partially synthetic and real genomes under a number of scenarios: genomes with high similarity, large genomes with high TE activity, genomes with many inversions, and alignments between species with deeper evolutionary divergence and different whole-genome duplication histories. Potential use cases for the method include genome comparison for evolutionary analysis of non-genic sequences and population genetics of taxa with complex genomes.
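The 2-piece affine gap cost mentioned above can be sketched as the minimum of two affine penalties: one with a cheap opening and expensive extension, suited to short gaps, and one with an expensive opening and cheap extension, suited to long gaps. The sketch below only illustrates that general scoring idea. The function names are hypothetical and the parameter values are chosen for illustration; they are not taken from AnchorWave.

```python
def affine_gap_cost(length: int, gap_open: int, gap_extend: int) -> int:
    """Classic one-piece affine gap cost: open + extend * length."""
    return gap_open + gap_extend * length


def two_piece_gap_cost(length: int,
                       open1: int = 4, extend1: int = 2,
                       open2: int = 24, extend2: int = 1) -> int:
    """2-piece affine gap cost: the cheaper of two affine functions.

    Parameter defaults are illustrative assumptions, not AnchorWave's settings.
    Piece 1 (low open, high extend) wins for short gaps;
    piece 2 (high open, low extend) wins for long gaps.
    """
    if length <= 0:
        return 0
    return min(affine_gap_cost(length, open1, extend1),
               affine_gap_cost(length, open2, extend2))


# A 10 kb gap, roughly the size of a TE insertion, under each scheme.
short_gap = two_piece_gap_cost(1)
long_gap_two_piece = two_piece_gap_cost(10_000)
long_gap_one_piece = affine_gap_cost(10_000, 4, 2)
```

With these illustrative values a 10 kb gap costs about half as much under the 2-piece scheme as under the short-gap piece alone, which is the intuition for why a single long indel can be preferred over many small mismatched segments when a TE is present in one genome and absent in the other.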