Quality Control of High-Throughput Sequencing Data
Methods for assessing quality and detecting problems

TYPES OF QC

QC analysis takes many forms
• “Tech Dev”: understanding limitations of technology
  – Library preparation protocols
  – Sequencing technologies
  – Data processing workflows & algorithms
• Systematic pipeline QC
  – Quality -> exclude failed runs / poorly constructed libraries
  – Identity -> detect contamination and swaps

LIMITATIONS OF TECHNOLOGY

Typical limitations of technology
• Inability to capture data due to genome location and/or sequence composition (GC), reference artifacts
  – Examples: High GC content, design of exome bait sets
  -> Depth of coverage analysis
  -> Library size estimates
• Sequencing chemistry/tech error modes
  – Example: High rates of FP indels in IonTorrent PGM
  -> Indel error rate estimation
• Data processing algorithms
  – Example: Position-based callers (UG, samtools) do a poor job calling indels
  -> Compare different tools/workflows

Example of depth of coverage analysis on WGS
• Which of several library construction protocols provides the best coverage distribution for WGS?
  – “Old protocol”
  – “New protocol + varying concentrations of betaine”
• Data
  – Test sample NA12878, Illumina HiSeq WGS
  – List of gene intervals of interest
• Tools & methods
  – DepthOfCoverage -> Per-site values aggregated by gene interval
  – GCContentByInterval -> Per-gene GC content
  – IGV for sequence data visualization
  – R + ggplot2 for plotting coverage distribution

Distribution of coverage over gene intervals
Faceted by prep method and GC content
[Figure: coverage distributions per prep method, in three panels: High GC intervals (x>0.6), Med. GC intervals (0.4<x<0.6), Low GC intervals (x<0.4)]
Increasing concentrations of betaine improve coverage
GC-rich genes are badly covered by new protocol without betaine
Betaine rescues high-GC genes
Wide distribution => uneven coverage
Narrow distribution => even coverage
Normalization: Norm(X) = X / Mean(x)

Visual inspection of select intervals shows differential effects
[Figure: coverage tracks over a region with GC content = 0.69; tracks: Old protocol, New protocol (no betaine), New protocol +1M betaine, New protocol +2M betaine]
Coverage in GC-rich regions increases with betaine concentration
Coverage is similar between GC-rich and average regions

Example of depth of coverage analysis on WEx
• Which exome technology provides the best coverage over the intervals that interest us?
  – “Tech 1”
  – “Tech 2”: differs mainly by location of baits relative to exons
• Data
  – Test sample NA12878, Illumina HiSeq WEx
  – List of exome target intervals of interest
• Tools & methods
  – DiagnoseTargets -> Per-interval summary of usability metrics
  – IGV for sequence data visualization

VCF output by DiagnoseTargets
[Figure: DiagnoseTargets VCF output, annotated “Degrees of failure”]

Example of failure due to problematic sequence context
[Figure: coverage tracks for Tech 1 and Tech 2 over an exon, with the Tech 1 interval and Tech 2 interval marked]
Tech 1 coverage was very low in this area
But deeper coverage in the rest inflates the overall coverage score for the exon, allowing it to pass filters
Tech 2 performs badly in the area where Tech 1 also fails; failure is more obvious!
Tech 2 interval extends to the intron (250 bp upstream, also void of coverage)
Caveat: raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies.
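The mean normalization from the WGS coverage comparison, Norm(X) = X / Mean(x), together with the low / medium / high GC binning, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the R + ggplot2 code behind the slides: the function names and the (GC fraction, mean coverage) input layout are hypothetical, and intervals at exactly 0.4 or 0.6 GC are assigned to the medium bin because the slides leave those boundaries unspecified. The spread of normalized coverage per bin stands in for "wide distribution => uneven coverage".

```python
from statistics import mean, pstdev


def normalize(coverages):
    """Divide each per-interval coverage by the mean over all intervals."""
    m = mean(coverages)
    if m == 0:
        raise ValueError("mean coverage is zero; cannot normalize")
    return [c / m for c in coverages]


def gc_bin(gc):
    """Map a GC fraction to the bins used on the slide (x<0.4, 0.4..0.6, x>0.6).

    Assumption: values of exactly 0.4 or 0.6 fall into the medium bin.
    """
    if gc < 0.4:
        return "low"
    if gc > 0.6:
        return "high"
    return "med"


def spread_by_gc_bin(intervals):
    """Summarize evenness of coverage per GC bin.

    intervals: list of (gc_fraction, mean_coverage) pairs, one per gene
    interval (hypothetical layout). Returns {bin: population standard
    deviation of normalized coverage}; a larger value means a wider
    distribution, i.e. less even coverage.
    """
    normalized = normalize([cov for _, cov in intervals])
    bins = {}
    for (gc, _), value in zip(intervals, normalized):
        bins.setdefault(gc_bin(gc), []).append(value)
    return {name: pstdev(values) for name, values in bins.items()}


if __name__ == "__main__":
    # Made-up values for illustration only; not data from the slides.
    example = [(0.35, 30.0), (0.38, 34.0), (0.50, 32.0),
               (0.55, 28.0), (0.65, 10.0), (0.69, 26.0)]
    print(spread_by_gc_bin(example))
```

Normalizing before comparing matters because libraries sequenced to different depths would otherwise differ in absolute coverage even when their evenness is the same.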
Location of bait sets plays a dramatic role in exome usability
[Figure: coverage tracks for Tech 1 and Tech 2 across an intron/exon boundary, with the Tech 1 interval and Tech 2 interval marked]
Tech 1 provided decent coverage so sequence context is not the problem
Tech 2 produces abundant coverage in the intron region
Tech 2 produces bad coverage in area of interest
Caveat: raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies.

QC of workflows and algorithms based on benchmarking
• Choose a public, common sample
  – Human: NA12878 + parents
• Sequence with multiple technologies (machines and protocols)
  – With and without PCR, different machines, different read lengths, etc.
• Compare to a knowledgebase (e.g. NIST GIAB)
  – Make sure that results are better or at least no worse (specific comparison metrics will be discussed towards the end of the workshop)

SYSTEMATIC PIPELINE QC

Example: QC in the Broad’s production pipeline
QC (1) Fidelity of barcode matching, cluster density, number of reads, bases, etc.
QC (2) Quality of alignment, library construction, coverage, base quality, internal controls, SAM format validation + Identity through fingerprints
QC (3) Cumulative quality from (2) by sample, cross-sample contamination + Identity fingerprints, read groups cross-check
QC (4) VCF format validation, genotype concordance on control samples, variant calling quality metrics
Data that fail any step of quality or identity verification should get blacklisted

Controlling for contamination and sample swaps
• Contamination (and barcode-swapping) can be checked using VerifyBamID
• Check for contamination in tumor samples using ContEst
• Fingerprinting (currently private code)
  – Run samples on fingerprinting chip (~100 sites)
  – Gives odds ratio between a “swap” and a “no swap” situation
  – One could use GenotypeConcordance as a substitute for the private code but it would only work well on the aggregated BAM
• Cross-check results from separate lanes once aggregated per-sample (again using private code)

Tools for systematic QC of sequence and mapping quality
• Picard Metrics collection tools
  – See Collect*Metrics tools in https://broadinstitute.github.io/picard/index.html
  – Metrics are defined in https://broadinstitute.github.io/picard/picard-metric-definitions.html
• User-friendly alternatives with GUI (we don’t)
  – FastQC
    • Specializes in basic sequence quality assessment
  – QualiMap
    • Specializes in mapping quality assessment

Typical quality failures detected by Picard QC tools
• Normal amounts of raw data (in Gb) but poor target coverage
• High proportion of chimerism
• Strange insert size distribution (too big / too small)
• Shearing-based oxidation (poor OxoQ values)
• Library size too small
Exomes
• Severe unevenness in distribution of coverage (Fold80 penalty values)
• HS / reference bias based oxidation (poor cref-OxoQ values)
Whole genomes
• High proportion of unmapped reads
• High percentage of adapter/oligo

Enough data produced but low mapping rate
  % PF reads (pass filters)    % PF reads aligned
  93.031                       78.435
  93.378                       74.277
Fewer than 80% of the reads produced (that pass quality filters) were mapped; typical alignment for human exome is > 98%

High percentage of duplicate reads
  Mean coverage    % Excl dupe    % Excl overlap    % Excl Total
  32               5.2            7.9               17.3
  32               21.1           2.3               25.1
  23               20.4           1.8               25.8
Appear to reach coverage target but values are inflated by duplication
[Figure: the same data displayed twice, once showing duplicate reads and once hiding duplicate reads]

Uneven coverage in a PCR-Free whole genome
  Mean coverage    % bases at 15X
  31.6             69
Reaches overall coverage target but data is unevenly distributed: piles of reads in some places alternate with uncovered regions
[Figure: coverage of an unevenly covered WGS sample next to an evenly covered WGS sample]

Uneven coverage between read groups
[Figure: coverage tracks for Read Group 1, Read Group 2 and Read Group 3]
Not always a problem: sometimes we add an extra run for a sample to “top up” coverage (but in this case RG2 in particular looks problematic)

High percentage of chimerism
  % Chimeras    % Selected bases    % Target bases 20X
  30.272        65.438              78.478
  13.405        70.036              84.811
Reaches coverage goals but data integrity may be an issue as the number of chimeric reads is so high; could confound detection of structural rearrangements and indels.

Strange insert size distribution
[Figure: insert size distribution with an abnormal spike]

Bacterial contamination in cheek swab samples produces deep piles of partly aligned reads

Further reading
http://www.broadinstitute.org/gatk/guide/
https://broadinstitute.github.io/picard/index.html
https://broadinstitute.github.io/picard/picard-metric-definitions.html
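
The failure patterns in the Picard QC examples above, combined with the rule that data failing any quality check should get blacklisted, could be screened automatically along the following lines. This is a rough Python sketch under explicit assumptions: the metric keys are illustrative names rather than Picard output columns, and only the 98% alignment figure comes from the slides ("typical alignment for human exome is > 98%"). The duplication and chimerism cutoffs are placeholders that a real pipeline would need to calibrate per assay.

```python
# Hypothetical thresholds, keyed by illustrative metric names.
# ("min", v) fails when the value is below v; ("max", v) fails when above v.
THRESHOLDS = {
    "pct_pf_reads_aligned": ("min", 98.0),  # from the slides (human exome)
    "pct_excl_dupe": ("max", 10.0),         # assumed cutoff, not from the slides
    "pct_chimeras": ("max", 5.0),           # assumed cutoff, not from the slides
}


def qc_failures(metrics, thresholds=THRESHOLDS):
    """Return a sorted list of metric names that violate their threshold.

    Metrics absent from `metrics` are skipped rather than counted as failures.
    """
    failed = []
    for name, (kind, limit) in thresholds.items():
        if name not in metrics:
            continue
        value = metrics[name]
        if kind == "min" and value < limit:
            failed.append(name)
        elif kind == "max" and value > limit:
            failed.append(name)
    return sorted(failed)


def blacklist(samples, thresholds=THRESHOLDS):
    """Map each failing sample ID to the list of metrics it failed.

    samples: {sample_id: {metric_name: value}}. Samples that pass every
    available check are left out of the result.
    """
    result = {}
    for sample_id, metrics in samples.items():
        failed = qc_failures(metrics, thresholds)
        if failed:
            result[sample_id] = failed
    return result


if __name__ == "__main__":
    # Sample IDs are made up; the values are taken from the tables above.
    samples = {
        "sample_A": {"pct_pf_reads_aligned": 78.435},
        "sample_B": {"pct_excl_dupe": 21.1},
        "sample_C": {"pct_chimeras": 30.272},
    }
    for sample_id, failed in blacklist(samples).items():
        print(sample_id, "failed:", ", ".join(failed))
```

Keeping the thresholds in a single table makes it easy to maintain separate cutoffs for exomes and whole genomes, which the slides treat as having different typical failure modes.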