
Research Project: Integrated Disease Management of Exotic and Emerging Plant Diseases of Horticultural Crops

Location: Horticultural Crops Research

Title: Inferring variation in copy number using high throughput sequencing data in R

Author
Knaus, Brian
Grunwald, Niklaus - Nik

Submitted to: Frontiers in Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 3/26/2018
Publication Date: 4/13/2018
Citation: Knaus, B.J., Grunwald, N.J. 2018. Inferring variation in copy number using high throughput sequencing data in R. Frontiers in Genetics. 9:123. https://doi.org/10.3389/fgene.2018.00123.
DOI: https://doi.org/10.3389/fgene.2018.00123

Interpretive Summary: Genes or chromosomes can occur in different copy numbers: one copy in haploid organisms, two in diploid organisms, three in triploid organisms, four in tetraploid organisms, and so on. Inferring copy number variation presents a technical challenge. Here we present a method to infer copy number that uses variant call format (VCF) data as input and is implemented in the R package vcfR. The method is based on the relative frequency of each allele (in both genic and non-genic regions) sequenced at heterozygous positions throughout a genome using high throughput sequencing technology. We validated this approach in yeast, a model system, and applied it to the oomycete pathogen Phytophthora infestans; both organisms are known to vary in ploidy. This functionality has been incorporated into the current release of the R package vcfR.
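
As a rough illustration of why allele frequencies at heterozygous positions carry information about copy number, the Python sketch below computes per-allele frequencies from read depths and lists the fractions one would expect at a heterozygous position for a given number of copies, assuming unbiased sequencing. The function names are hypothetical and are not part of vcfR; the depths stand in for values such as those an allele-depth field of a VCF record might hold.

```python
from fractions import Fraction


def allele_frequencies(depths):
    """Return the fraction of reads supporting each allele at one position.

    `depths` is a list of per-allele read counts (hypothetical input,
    for example parsed from an allele-depth field of a VCF record).
    """
    total = sum(depths)
    if total == 0:
        raise ValueError("no reads at this position")
    return [d / total for d in depths]


def expected_het_fractions(copies):
    """Return the allele fractions expected at a heterozygous position.

    With `copies` copies of a locus, a heterozygous allele can be carried
    by 1 to copies - 1 of them, giving fractions k / copies.
    Assumes no sequencing or mapping bias.
    """
    return [Fraction(k, copies) for k in range(1, copies)]


if __name__ == "__main__":
    print(allele_frequencies([12, 12]))
    for n in (2, 3, 4):
        print(n, expected_het_fractions(n))
```

Under these assumptions a diploid shows a single expected fraction of 1/2, a triploid shows 1/3 and 2/3, and a tetraploid shows 1/4, 1/2, and 3/4. Real data scatter around these values, which is what motivates summarizing many positions at once.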

Technical Abstract: Inference of copy number variation presents a technical challenge because variant callers typically require the copy number of a genome or genomic region to be known a priori. Here we present a method to infer copy number that uses variant call format (VCF) data as input and is implemented in the R package vcfR. This method is based on the relative frequency of each allele (in both genic and non-genic regions) sequenced at heterozygous positions throughout a genome. These heterozygous positions are summarized by using arbitrarily sized windows of heterozygous positions, binning the allele frequencies, and selecting the bin with the greatest abundance of positions. This provides a non-parametric summary of the frequency at which alleles were sequenced. The method is applicable to organisms whose reference genomes consist of full chromosomes or sub-chromosomal contigs. In contrast to other software designed to detect copy number variation, our method does not rely on an assumption of base ploidy, but instead infers it. We validated this approach with the model system Saccharomyces cerevisiae and applied it to the oomycete Phytophthora infestans, both known to vary in ploidy. This functionality has been incorporated into the current release of the R package vcfR to provide modular and flexible methods to investigate copy number variation in genomic projects.
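
The windowing, binning, and peak selection described in the abstract can be sketched in Python as follows. This is a minimal illustration and not the vcfR implementation: the function name, the choice to define a window as a fixed count of heterozygous positions, the default window size and bin width, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def freq_peaks(positions, freqs, win_size=4, bin_width=0.1):
    """Summarize allele frequencies by the most populated bin per window.

    positions: sorted coordinates of heterozygous positions
    freqs:     allele frequency (0 to 1) observed at each position
    win_size:  heterozygous positions per window (assumed definition)
    bin_width: width of each frequency bin (assumed default)

    Returns a list of (first_pos, last_pos, peak) tuples, where `peak`
    is the midpoint of the bin holding the most positions in the window.
    """
    if len(positions) != len(freqs):
        raise ValueError("positions and freqs must have the same length")
    n_bins = round(1 / bin_width)
    peaks = []
    for start in range(0, len(positions), win_size):
        window = freqs[start:start + win_size]
        # Map each frequency to a bin index; a frequency of 1.0 is
        # clamped into the last bin.
        counts = Counter(min(int(f / bin_width), n_bins - 1) for f in window)
        # Pick the bin with the most positions; ties go to the lower bin
        # (an arbitrary choice for this sketch).
        top_bin = max(counts, key=lambda b: (counts[b], -b))
        peak = round((top_bin + 0.5) * bin_width, 4)
        last = positions[start + len(window) - 1]
        peaks.append((positions[start], last, peak))
    return peaks


if __name__ == "__main__":
    # Made-up example data: eight heterozygous positions in two windows.
    pos = [100, 250, 400, 900, 1500, 1700, 2100, 2600]
    frq = [0.45, 0.47, 0.42, 0.31, 0.33, 0.34, 0.36, 0.68]
    for row in freq_peaks(pos, frq):
        print(row)
```

Because the summary is the mode of a histogram and not a fitted distribution, a few outlying positions in a window do not shift the reported peak, which is in keeping with the non-parametric character the abstract describes.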