FINDMAP : USDA ARS


findmap.f90	Align sequence reads to reference map, call previous variants, and identify new variants

Downloads	Version 2.2 programs, example and test outputs, and executables (released December 10, 2018)
		Findmap alignment and Findvar variant calling series:
		Programs mapsim.f90, storemap.f90, map2seq.f90, findmap.f90, findvar.f90, leftmost.f90, and depth2vcf.c are included.
		Skip program mapsim when using actual map and variant files.
		Skip simulation program map2seq when using actual fastq reads.
		Programs leftmost and depth2vcf are optional, for converting to other formats.
		To save space, the zip download does not include actual reference genome, variant list, or fastq sequence read files.
		Download findmapV2.2.zip onto a computer with the Unix operating system.
		Type unzip findmapV2.2.zip, and hit enter.
		After unzipping the package, type runsim.script to run the program series script and generate an example simulated reference map, a simulated variant file, simulated fastq files, and other program input and output files; type runmap.script to run the alignment and variant calling programs.
		Programs will execute in the Test_Output directory using the options files there.
		Options files give more detail on available options and recommended values.
		Output files in Test_Output should match those in Example_Output fairly closely.
		Each new project should be run in a new directory because standard file names are used.
		For quick testing, mapsim.options and map2seq.options are set to just 1 chromosome; for actual use, reset them to 30 chromosome pairs for cattle (or another number to match your reference.fa file). Similarly, maxhash is set to 87654321 and maxdup to 10 million in storemap.options to reduce memory in initial testing, but values of 287654321 and 300 million are recommended for a 3-Gbase genome.
		Findmap output formats .found and .lost now include an extra column for map quality score.
		findvar is not backward compatible with input data aligned by earlier findmap versions.
		Map quality scores are accurate if the detection option is set to 2, but that setting quadruples run time.
		Do not include all of the unmapped contigs in the fasta map file as separate chromosomes because those would increase memory for little gain; from the human reference hg38.fa, we used 25 chromosomes (22 autosomes plus X, Y, and mitochondrial DNA) and reformatted the file using fasta.sas (included).
		The variant list from the 1000 Bull Genomes Project (Daetwyler et al., 2014) should be concatenated across chromosomes and then reformatted to variants.prior using program variant.sas. The main reason to reformat is that indel notation and locations used in findmap differ from vcf notation.
		The variant.sas program can also reformat the 00-common_all.vcf human variant file.
		Simulation options:
		For markersim.f90, there is an optional input file, genetic.cor, in the subdirectory Example_Output.
		For storemap.f90, there is an optional input file, flank.location, in the subdirectory Example_Output; its only purpose is to output the flanking sequence for each location in the input file.
		Mapsim option newvar should be 0 when processing actual data so that all known variants will be used.
		In mapsim.options, newvar can be set to 5 (or other odd number) to exclude every 5th variant and demonstrate detection of those "new" variants.
		Program testvar.sas then tests accuracy of variant calls, both for previously known and newly discovered variants.
		Program map2seq simulates genotypes for the variant list in groups of 4, with every 4th variant homozygous for the alternate allele, every 2nd variant homozygous for the reference allele, and the 1st and 3rd variants heterozygous; optionally, it can instead read file genotypes.true to declare variant genotypes for each DNA source.
	Version 2.1 programs, example outputs, and executables (released October 1, 2018)
	Version 2 programs, example outputs, and executables (released May 31, 2018)
	Version 1 programs, example outputs, and executables (released July 19, 2016; last updated July 28, 2016)
	Version 0 programs, example outputs, and executables (beta version; released January 8, 2016)
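
The map2seq genotype pattern and the newvar hold-out described for Version 2.2 can be sketched in Python as follows. This is an illustration only, not code from the package. The function names are hypothetical, and the sketch assumes 1-based variant numbering and that "every 5th variant" means variant numbers divisible by newvar; the actual programs may count differently.

```python
def simulated_genotype(variant_number: int) -> int:
    """Return the alternate-allele count (0, 1, or 2) for a 1-based variant number.

    Assumed pattern within each group of 4 variants:
    1st heterozygous, 2nd homozygous reference,
    3rd heterozygous, 4th homozygous alternate.
    """
    position = (variant_number - 1) % 4 + 1  # position within the group, 1..4
    if position == 4:
        return 2  # homozygous alternate
    if position == 2:
        return 0  # homozygous reference
    return 1      # heterozygous (1st and 3rd)


def split_known_and_held_out(variant_numbers, newvar: int):
    """Split variants into those kept as known and those held out as "new".

    newvar == 0 keeps every variant (the setting for actual data).
    Otherwise, variant numbers divisible by newvar are held out (an assumption).
    """
    numbers = list(variant_numbers)
    if newvar == 0:
        return numbers, []
    known = [v for v in numbers if v % newvar != 0]
    held_out = [v for v in numbers if v % newvar == 0]
    return known, held_out
```

For example, simulated_genotype applied to variants 1 through 4 gives 1, 0, 1, 2, and split_known_and_held_out(range(1, 11), 5) holds out variants 5 and 10.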

Inputs	reference.fa	Standard fasta format for reference genome: > as 1st character in line for each new chromosome; 50-byte lines of ACGT (or N for unknown bases), or acgt for repeated sections. All programs in this series treat lowercase and uppercase as the same because storemap identifies, counts, stores, and links repeated k-mers to each other while hashing the reference map.
	variants.prior	Lists all previously known SNPs and indels. Insertions are reported 1 base to the left of the 1st base where they differ from the reference genome, reading left to right. Deletions are reported at their detected location, not 1 base to the left. Use variants.sas to reformat the 1000 Bull Genomes variant file. Format: chr# location vartype (SNP, INS, or DEL) variant# length alternate_allele
	fastq.filelist	List of DNA source names such as source1, source2, etc., along with numeric IDs
	source1.1.fq, source1.2.fq, source2.1.fq, source2.2.fq, etc.	Standard fastq format for paired end reads, with reads 1 and 2 of each pair at same position in 2 separate files for each DNA source
	*.options	Program control file with user-defined options
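
A minimal Python sketch of reading one variants.prior record, following the field order given for that file (chr# location vartype variant# length alternate_allele). It assumes whitespace-separated fields, numeric chromosome codes, and that the alternate allele may be absent for deletions; none of these details is confirmed here, and the names PriorVariant and parse_prior_line are hypothetical.

```python
from typing import NamedTuple


class PriorVariant(NamedTuple):
    chrom: int           # chr# (assumed numeric)
    location: int        # position on the chromosome
    vartype: str         # "SNP", "INS", or "DEL"
    variant_number: int  # variant#
    length: int          # variant length in bases
    alt_allele: str      # alternate allele, "" if the field is absent


def parse_prior_line(line: str) -> PriorVariant:
    """Parse one whitespace-separated variants.prior record (assumed layout)."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    chrom, location, vartype, number, length = fields[:5]
    vartype = vartype.upper()
    if vartype not in {"SNP", "INS", "DEL"}:
        raise ValueError(f"unknown variant type: {vartype}")
    # Uppercase the allele because the programs treat lowercase and uppercase alike.
    alt_allele = fields[5].upper() if len(fields) > 5 else ""
    return PriorVariant(int(chrom), int(location), vartype,
                        int(number), int(length), alt_allele)
```

A made-up record such as "1 12345 SNP 1 1 g" would parse to chromosome 1, location 12345, type SNP, and alternate allele G.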

Outputs	storemap.unf	Hash table, etc., for reference map
	reference.unf	Unformatted map for faster input
	variant.readdepth	Number of ref and alt alleles, 1 row/variant. Format: variant# chr# var_location ref# alt#
	individual.readdepth	Format: ID# chip# #SNPs. Read counts for A and B alleles are stored in 1-byte hexadecimal format (input format for imputation program findhap4; VanRaden et al., 2015).
	segments.found	Alignments, errors, and known variant locations for segments where paired end locations differ by <fraglen. Format: segment# pair# direction chr# segment_location num_alts num_errs (var_locations var_type) (err_locations err_base)
	segments.lost	Same format, but for segments where paired end locations do not match
	segments.newindels	Locations and properties of new indels detected (those not already in variants.txt). Format: segment# pair# direction chr# seg_location indel_size indel_location bases (inserted or deleted)
	SNPs.new	Summary of new SNPs including read depth and number of alternate alleles found. Locations can have >1 row if differing alternate alleles are observed. Format: chr# SNP_location read_depth num_alt ref_allele alt_allele
	indels.new	Summary of new indels including read depth and number of alternate alleles found. Locations can have >1 row if differing alternate alleles are observed. Format: chr# indel_location read_depth num_alt ref_allele alt_allele
	variants.all	Combined list of prior and new variants in same format as variants.prior
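
As an illustration of how the variant.readdepth output might be consumed, the Python sketch below parses one row in the listed field order (variant# chr# var_location ref# alt#) and computes the fraction of reads carrying the alternate allele. It assumes whitespace-separated integer fields; the names ReadDepth, parse_readdepth_line, and alt_fraction are hypothetical and not part of the package.

```python
from typing import NamedTuple, Optional


class ReadDepth(NamedTuple):
    variant_number: int  # variant#
    chrom: int           # chr#
    location: int        # var_location
    ref_count: int       # ref# (reads with the reference allele)
    alt_count: int       # alt# (reads with the alternate allele)


def parse_readdepth_line(line: str) -> ReadDepth:
    """Parse one whitespace-separated variant.readdepth row (assumed layout)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return ReadDepth(*(int(f) for f in fields))


def alt_fraction(record: ReadDepth) -> Optional[float]:
    """Return alt reads / total reads, or None when no reads cover the variant."""
    total = record.ref_count + record.alt_count
    if total == 0:
        return None
    return record.alt_count / total
```

A made-up row "7 1 12345 6 2" has 8 reads in total, so alt_fraction returns 0.25.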

References	2019	VanRaden, P.M., D.M. Bickhart, and J.R. O'Connell. Calling known variants and identifying new variants while rapidly aligning sequence data. J. Dairy Sci. 102:3216–3229.
	2016	VanRaden, P.M., and D.M. Bickhart. Fast single-pass alignment and variant calling using sequencing data. Plant Anim. Genome XXIV Conf., San Diego, CA, Jan. 9–13, W161.
	2016	VanRaden, P.M., D.M. Bickhart, and J.R. O'Connell. Identifying and calling insertions, deletions, and single-base mutations efficiently from sequence data. J. Dairy Sci. 99(E-Suppl. 1):140(abstr. 0302).
	2015	VanRaden, P.M., C. Sun, and J.R. O'Connell. Fast imputation using medium- or low-coverage sequence data. BMC Genet. 16:82.
	2014	Daetwyler, H.D., A. Capitan, H. Pausch, P. Stothard, R. van Binsbergen, R.F. Brøndum, X. Liao, A. Djari, S.C. Rodriguez, C. Grohs, D. Esquerré, O. Bouchez, M.N. Rossignol, C. Klopp, D. Rocha, S. Fritz, A. Eggen, P.J. Bowman, D. Coote, A.J. Chamberlain, C. Anderson, C.P. Van Tassell, I. Hulsegge, M.E. Goddard, B. Guldbrandtsen, M.S. Lund, R.F. Veerkamp, D.A. Boichard, R. Fries, and B.J. Hayes. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nature Genet. 46:858–865.
	2014	VanRaden, P.M., and C. Sun. Fast imputation using medium- or low-coverage sequence data. Proc. 10th World Congr. Genet. Appl. Livest. Prod., 179.

License	Fortran package findmap.f90 is public domain and was developed with U.S. taxpayer funding. Accurate results are not guaranteed. Please report any bugs to paul.vanraden@usda.gov. You may modify, improve, use, and redistribute the code to anyone for any purpose. Or, you can ask Paul to make changes that could benefit U.S. evaluations and other users.

Paul VanRaden
Animal Genomics and Improvement Laboratory
Agricultural Research Service, USDA

Animal Genomics and Improvement Laboratory: Beltsville, MD