Submitted to: Journal of Dairy Science
Publication Type: Abstract Only
Publication Acceptance Date: 4/21/2016
Publication Date: 7/9/2016
Citation: Van Raden, P.M., Bickhart, D.M., O'Connell, J.R. 2016. Identifying and calling insertions, deletions, and single-base mutations efficiently from sequence data. Journal of Dairy Science. 99(E-Suppl. 1)/Journal of Animal Science. 94(E-Suppl. 5):140(abstr. 0302).
Technical Abstract: Whole-genome sequencing studies can directly identify causative mutations for subsequent use in genomic evaluations, but sequence variant identification is a lengthy and sometimes inaccurate process. The speed and accuracy of identifying small insertions and deletions of sequence, collectively termed indels, can be improved by calling variants while aligning sequence reads. Those two steps are separate in current algorithms. Program findmap stores all previously known variants in memory, calls alleles for those, and outputs potential new variants. The algorithm uses a string pattern hash to store the entire reference genome in a rapidly accessed table. If both ends of a paired-end read do not align fully, the length of a potential indel within the read is calculated from the difference in map locations of the 2 partial matches. The algorithm then finds the indel location and checks whether the full read matches after accounting for the indel. Potential indels detected by findmap are checked and edited by program findvar for consistency across reads. New variants from findvar were compared to those from GATK UnifiedGenotyper and from SamTools after BWA alignment. Accuracy of detection was examined using reads simulated from cattle reference map UMD3.1 for 10 animals with 10× coverage and including 38,062,190 SNPs, 532,179 insertions, and 1,127,620 deletions from run5 of the 1000 Bull Genomes Project. Half of the variants were simulated as heterozygous, one-fourth as homozygous alternate, and one-fourth as homozygous reference. For the homozygous alternate variants, findvar found 99.8% of SNPs, 75% of insertions, and 63% of deletions; GATK found 99.4%, 67%, and 68%; and SamTools found 99.8%, 12%, and 18%, respectively. For heterozygotes, findvar found 99.1%, 66%, and 52%; GATK found 99.0%, 62%, and 66%; and SamTools found 98.2%, 9%, and 11%, respectively.
False positives as percentages of true variants for SNPs, insertions, and deletions were 18%, 4%, and 4% from findvar; 16%, 28%, and 18% from GATK; and 47%, 2%, and 1% from SamTools, respectively. Read depth was 85.9 from findmap/findvar, 96.1 from BWA/GATK, and 84.3 from BWA/SamTools. With 10 processors, clock times were 106 hours for BWA, 25 hours for GATK, 11 hours for SamTools, 3 hours for findmap, and 1 hour for findvar. The new software is freely available, with algorithms 10 to 30 times faster than current strategies for calling known variants and identifying new ones. Accuracy is improved by accounting for DNA variants while aligning sequence data.
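
Illustrative sketch (not the findmap source code): the indel step described in the abstract can be pictured with a small Python example. It assumes a plain dictionary of fixed-length substrings in place of findmap's string pattern hash, a single read rather than a read pair, and exact matching with no sequencing errors. The names build_index, find_indel, and SEED_LEN, the seed length of 6, and the toy reference are hypothetical and chosen only for illustration. The two ends of a read are looked up in the table, the indel length is taken from the difference between the observed and expected spacing of the two hits, and the indel is then placed where the left-anchored match first fails, followed by a check that the rest of the read matches.

```python
from collections import defaultdict

SEED_LEN = 6  # hypothetical seed length; the abstract does not state one


def build_index(reference, k=SEED_LEN):
    """Map every k-base substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_indel(read, reference, index, k=SEED_LEN):
    """Return (kind, reference_position, length) for one indel, else None."""
    for left in index.get(read[:k], []):
        for right in index.get(read[-k:], []):
            # Observed spacing of the two partial matches minus the spacing
            # expected for an ungapped read gives the signed indel length.
            shift = (right - left) - (len(read) - k)
            if shift == 0:
                continue  # both ends agree with an ungapped alignment
            # Extend the left match until read and reference disagree.
            i = 0
            while (i < len(read) and left + i < len(reference)
                   and read[i] == reference[left + i]):
                i += 1
            if shift > 0 and i < len(read):
                # Deletion: the read skips `shift` reference bases.
                start = left + i + shift
                if read[i:] == reference[start:start + len(read) - i]:
                    return ("deletion", left + i, shift)
            elif shift < 0 and i - shift < len(read):
                # Insertion: the read carries `ins` bases absent from the reference.
                ins = -shift
                rest = len(read) - i - ins
                if read[i + ins:] == reference[left + i:left + i + rest]:
                    return ("insertion", left + i, ins)
    return None


# Toy reference of 40 bases, invented for this example.
REFERENCE = "ACGTTGCAAG" "GCTTACCGAT" "AGGCTCAATC" "GGATTACGCA"
INDEX = build_index(REFERENCE)

deleted = REFERENCE[2:14] + REFERENCE[17:30]          # drops reference bases 14-16
inserted = REFERENCE[5:20] + "TT" + REFERENCE[20:34]  # adds TT before base 20
print(find_indel(deleted, REFERENCE, INDEX))          # ('deletion', 14, 3)
print(find_indel(inserted, REFERENCE, INDEX))         # ('insertion', 20, 2)
print(find_indel(REFERENCE[3:25], REFERENCE, INDEX))  # None
```

In this sketch an indel inside a repeat is reported at its rightmost equivalent position, because the left match is extended as far as it will go. How findmap resolves such ambiguity, handles base-call errors, or uses the second read of a pair is not described in the abstract and is not modeled here.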