Submitted to: Molecular Ecology Resources
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 12/20/2021
Publication Date: 1/7/2022
Citation: Crane, C.E., Nemacheck, J.A., Subramanyam, C.F., Williams, S.N., Goodwin, S.B. 2022. SLAG: A program for seeded local assembly in complex genomes. Molecular Ecology Resources. 22:596–616. https://doi.org/10.1111/1755-0998.13580.
Interpretive Summary: Decreased costs of DNA sequencing have led to an explosion of genomic sequence data, often resulting in the assembly of excellent reference genomes. However, reference genomes are still unavailable for many species. Furthermore, many scientists do not need a complete genome, but are interested in a particular gene and its genomic context, or want to analyze the same genomic region from multiple individuals. For these situations there is a great need for a simple program that builds local assemblies of genomic sequence data around one or more starting sequences. To address this need, a program was developed that can identify and assemble reads that match a related nucleotide or protein sequence. This tool provides researchers with a rapid, computational means of obtaining candidate DNA sequences flanking their gene or region of interest without performing bench work to clone sequences in the laboratory. The Perl program will be of great use to scientists who work on species that have sequence data but no reference genome, and to those who need to extract relevant data from multiple individuals.
Technical Abstract: Despite an increasing number of “finished” genomes, the number of reference genomes is still very limited. However, many projects do not need a finished genome, but instead only a few kilobases of sequence that contain and flank a gene of interest in a species or cultivar of interest. The program described here, SLAG (“Seeded Local Assembly of Genes”), fills this need for assembled genes when a finished reference genome is not yet available. SLAG iteratively orchestrates blast, sequence retrieval, depth filtering, phrap assembly, and homology filtering against a protein or DNA seed sequence to grow contigs from 200-1000-base reads in one or both directions, up to the limits of sequence coverage or of suitably nonrepetitive read depth. SLAG was used in simulations of three to 24 gene copies to investigate the situation where flanking sequences vary more than exons, which is a major problem for correct genome assembly in polyploid species. SLAG covered all the variants, but it associated the flanking regions randomly and sometimes mixed SNPs from several variants together. SLAG was also seeded with 25 arbitrarily chosen protein sequences to assemble homologous regions from pyrosequenced reads of ‘Chinese Spring’ wheat for comparison to three published assemblies. All four assemblies were about equally chimeric versus one another. Most contigs were longer than their homologs from cerealsDB and shorter than homologs from the International Wheat Genome Sequencing Project. SLAG assemblies were also seeded from three wheat DNA sequences related to Hessian fly resistance, and the contigs were bench validated by tiling PCR amplicons. SLAG is a successful proof of concept that has proven useful in exploring promoters in the polyploid, 16.5-gigabase wheat genome.
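The iterative seed-and-extend idea in the abstract can be illustrated with a toy Python sketch. This is not SLAG's implementation (SLAG is a Perl program that calls blast and phrap). As a simplifying assumption, exact suffix-prefix matching stands in for the blast search and phrap assembly, the homology filter is omitted, and growth is in one direction only. The function names `overhang` and `grow_right`, the parameters `min_overlap` and `max_depth`, and the example sequences are all hypothetical.

```python
def overhang(contig, read, min_overlap):
    """Return the bases of `read` that extend past the right end of `contig`.

    Returns "" when no prefix of the read of at least `min_overlap` bases
    matches the end of the contig exactly (a stand-in for a similarity search).
    """
    longest = min(len(contig), len(read) - 1)
    for k in range(longest, min_overlap - 1, -1):
        if contig.endswith(read[:k]):
            return read[k:]
    return ""


def grow_right(seed, reads, min_overlap=4, max_depth=3, max_rounds=100):
    """Extend `seed` rightward, one round at a time, until growth stops."""
    contig = seed
    for _ in range(max_rounds):
        # Retrieval step: collect the new bases offered by overlapping reads.
        extensions = [e for e in (overhang(contig, r, min_overlap) for r in reads) if e]
        if not extensions:
            break  # limit of sequence coverage reached
        if len(extensions) > max_depth:
            break  # too many reads pile up here; treat the region as repetitive
        # Assembly step (toy version): accept the longest extension.
        contig += max(extensions, key=len)
    return contig


# Made-up example: overlapping 10-base reads drawn from a 28-base "genome".
genome = "ACGTTGCAAGCTTAGGCATCGATCGGAT"
reads = [genome[i:i + 10] for i in (0, 4, 8, 12, 16, 18)]
seed = genome[:8]

assembled = grow_right(seed, reads)
print(assembled == genome)
```

The two stopping conditions mirror the limits named in the abstract: extension ends when no read overlaps the contig end (coverage runs out) or when more reads than `max_depth` overlap it (read depth suggests a repeat).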