2012 Annual Report
The primary responsibility of the University of Oregon group is to generate and assemble sequence data to produce a reference genome assembly for rainbow trout (sub-objective 1.a. from the Approach). We have generated the following three primary data sources for this:

1) We constructed a barcoded short-insert Illumina library from each of 1152 BAC pools. Each pool contained 12 BACs for a total of 13824 BACs, which constitutes approximately 95% of the rainbow trout physical map minimal tiling path. The libraries had an average insert size of approximately 300 bp and were sequenced with paired-end 100 bp reads to approximately 50X BAC coverage each. We have used this as the primary source of data for assembly.

2) We constructed seven mate-pair Illumina libraries from the Swanson doubled haploid line genomic DNA. These libraries had average fragment sizes ranging from 2.5 to 15 kb and were sequenced with paired-end 100 bp reads to approximately 5X genomic coverage each. We have used these data to scaffold across repetitive sequence elements that are too large to be bridged by the short-insert libraries.

3) We constructed three short-insert Illumina libraries from the Swanson doubled haploid line genomic DNA. These libraries had an average fragment size of approximately 350 bp and were sequenced with paired-end 100 bp reads to approximately 40X genomic coverage total. We have integrated these data with the BAC pool data to help distinguish between repetitive and non-repetitive sequence, which is essential for accurate assembly (see below). We also plan to use these data to assemble genomic regions that are not covered in the BAC-based assembly.

Upon initial assemblies of the BAC pool sequence data, we realized that different BAC clones within the same pool are represented at different coverage levels, presumably due to variation in the molar concentration of the BACs within the same pool.
This causes problems when using standard assembly software even though the coverage variation is small. The main problem is that standard assembly software uses coverage level to distinguish between repetitive and non-repetitive sequence (e.g., sequence contigs with higher than normal coverage are classified as repetitive). Therefore, repeat detection is not reliable when BAC clones are present at variable concentrations. We have spent a considerable amount of time developing and testing an assembly pipeline that integrates the BAC pool sequence data with that from genomic DNA to provide reliable repeat detection and produce an optimal assembly.

In an ideal assembly, each BAC clone would be completely scaffolded and 100% of sequences could be unambiguously assigned to BAC clones and physical map contigs based on overlap between neighboring clones. In our most recent assembly, approximately 70% of the sequence could be unambiguously assigned in this way. Although these results are very encouraging, we are currently modifying and testing our pipeline with the goal of continued assembly improvement.

In addition to the rainbow trout reference genome assembly work described above, the University of Oregon group is responsible for RAD library construction, genotyping, and genetic map construction from doubled haploid androgenetic and gynogenetic progeny to be provided by Washington State University (sub-objective 1.b. from the Approach). These samples are expected to be available in early to mid 2013. We will begin this aspect of the project as soon as they arrive.
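
As a rough illustration of the coverage problem described above for the BAC pool assembly, the following sketch shows how a coverage-based repeat test can fail when BACs in a pool are present at unequal molar concentrations, and how coverage from uniformly sampled genomic DNA could serve as a more reliable signal. This is not our assembly pipeline: the contig names, the coverage values, the `flag_repeats` function, and the 1.8x ratio threshold are all hypothetical and chosen only to make the effect visible.

```python
from statistics import median


def flag_repeats(coverage_by_contig, ratio=1.8):
    """Flag contigs whose coverage exceeds `ratio` times the median coverage.

    This mimics, in a very simplified form, the coverage heuristic that
    standard assemblers use to separate repetitive from unique sequence.
    The default ratio is an arbitrary illustrative value.
    """
    typical = median(coverage_by_contig.values())
    return {name for name, cov in coverage_by_contig.items() if cov > ratio * typical}


# Hypothetical read coverage within a single BAC pool. BAC A is assumed to be
# at low molar concentration, BAC B at intermediate, and BAC C at high.
pool_coverage = {
    "bacA_unique_1": 30.0,
    "bacA_unique_2": 32.0,
    "bacA_repeat": 60.0,    # two-copy repeat, but the BAC is under-represented
    "bacB_unique_1": 50.0,
    "bacB_unique_2": 48.0,
    "bacB_repeat": 150.0,   # three-copy repeat
    "bacC_unique_1": 110.0,  # unique sequence, but the BAC is over-represented
}

# Hypothetical coverage of the same contigs by whole-genome short-insert reads
# (about 40X), which does not depend on BAC concentration.
genomic_coverage = {
    "bacA_unique_1": 39.0,
    "bacA_unique_2": 41.0,
    "bacA_repeat": 82.0,
    "bacB_unique_1": 40.0,
    "bacB_unique_2": 42.0,
    "bacB_repeat": 121.0,
    "bacC_unique_1": 40.0,
}

naive = flag_repeats(pool_coverage)          # pool coverage only
integrated = flag_repeats(genomic_coverage)  # genomic coverage as the signal

print("pool-only calls:    ", sorted(naive))
print("genomic-based calls:", sorted(integrated))
```

With these made-up numbers, the pool-only test wrongly flags the unique contig from the over-represented BAC and misses the repeat in the under-represented BAC, while the genomic coverage test flags only the two repeat contigs. A real pipeline would need to handle sequencing noise, collapsed repeats, and contigs absent from one of the data sets, none of which is modeled here.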