Quality Control of High-Throughput Sequencing data

Methods for assessing quality and detecting problems

TYPES OF QC

QC analysis takes many forms
•  “Tech Dev”: understanding limitations of technology
   –  Library preparation protocols
   –  Sequencing technologies
   –  Data processing workflows & algorithms
•  Systematic pipeline QC
   –  Quality -> exclude failed runs / poorly constructed libraries
   –  Identity -> detect contamination and swaps

LIMITATIONS OF TECHNOLOGY

Typical limitations of technology
•  Inability to capture data due to genome location and/or sequence composition (GC), reference artifacts
   –  Examples: High GC content, design of exome bait sets
   -> Depth of coverage analysis
   -> Library size estimates
•  Sequencing chemistry/tech error modes
   –  Example: High rates of FP indels in IonTorrent PGM
   -> Indel Error Rate estimation
•  Data processing algorithms
   –  Example: Position-based callers (UG, samtools) do a poor job calling indels
   -> Compare different tools/workflows

Example of depth of coverage analysis on WGS
•  Which of several library construction protocols provides the best coverage distribution for WGS?
   –  “Old protocol”
   –  “New protocol + varying concentrations of betaine”
•  Data
   –  Test sample NA12878, Illumina HiSeq WGS
   –  List of gene intervals of interest
•  Tools & methods
   –  DepthOfCoverage -> Per-site values aggregated by gene interval
   –  GCContentByInterval -> Per-gene GC content
   –  IGV for sequence data visualization
   –  R + ggplot2 for plotting coverage distribution

Distribution of coverage over gene intervals
Faceted by prep method and GC content
[Figure: coverage distributions per prep method. Panels: High GC intervals (x>0.6); Med. GC intervals (0.4<x<0.6); Low GC intervals (x<0.4)]
•  Increasing concentrations of betaine improve coverage
•  GC-rich genes are badly covered by new protocol without betaine
•  Betaine rescues high-GC genes
•  Wide distribution => uneven coverage
•  Narrow distribution => even coverage
•  Normalization: Norm(X) = X / Mean(x)

Visual inspection of select intervals shows differential effects
[Figure: IGV coverage tracks for an interval with GC content = 0.69. Tracks: Old protocol; New protocol (no betaine); New protocol +1M betaine; New protocol +2M betaine]
•  Coverage in GC-rich regions increases with betaine concentration
•  Coverage is similar between GC-rich and average regions

Example of depth of coverage analysis on WEx
•  Which exome technology provides the best coverage over the intervals that interest us?
   –  “Tech 1”
   –  “Tech 2”: differs mainly by location of baits relative to exons
•  Data
   –  Test sample NA12878, Illumina HiSeq WEx
   –  List of exome target intervals of interest
•  Tools & methods
   –  DiagnoseTargets -> Per-interval summary of usability metrics
   –  IGV for sequence data visualization

VCF output by DiagnoseTargets

Example: degrees of failure due to problematic sequence context
[Figure: IGV coverage tracks. Tracks: Tech 1; Tech 2; exon; Tech 1 interval; Tech 2 interval]
•  Tech 1 coverage was very low in this area, but deeper coverage in the rest inflates the overall coverage score for the exon, allowing it to pass filters
•  Tech 2 performs badly in the area where Tech 1 also fails; failure is more obvious!
•  Tech 2 interval extends to the intron (250 bp upstream, also void of coverage)
Caveat: raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies.
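
The two ideas above, mean-normalizing coverage and the way a deep-coverage stretch can hide a poorly covered one inside a single interval score, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the 20X cutoff, and the per-site depths are invented and are not taken from DepthOfCoverage or DiagnoseTargets.

```python
from statistics import mean


def normalize(depths):
    """Scale per-site depths by their mean: Norm(X) = X / Mean(x)."""
    m = mean(depths)
    if m == 0:
        # No coverage at all: return zeros rather than dividing by zero.
        return [0.0 for _ in depths]
    return [d / m for d in depths]


def summarize_interval(depths, min_depth=20):
    """Return the mean depth and the fraction of sites at or above min_depth.

    min_depth=20 is a hypothetical cutoff chosen for this example.
    """
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "mean_depth": mean(depths),
        "frac_covered": covered / len(depths),
    }


# Invented exon of 200 sites: 50 sites with almost no coverage,
# followed by 150 sites with deep coverage.
exon = [1] * 50 + [60] * 150
summary = summarize_interval(exon)

# The mean alone looks healthy, while a quarter of the exon is unusable.
print(summary)
print(normalize([10, 20, 30]))
```

Reporting the fraction of sites above a depth cutoff next to the mean is one simple way to keep an interval like this from passing on its average alone.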
Location of bait sets plays dramatic role in exome usability
[Figure: IGV coverage tracks. Tracks: Tech 1; Tech 2; intron; exon; Tech 1 interval; Tech 2 interval]
•  Tech 1 provided decent coverage so sequence context is not the problem
•  Tech 2 produces abundant coverage in the intron region
•  Tech 2 produces bad coverage in area of interest
Caveat: raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies.

QC of workflows and algorithms based on benchmarking
•  Choose a public, common sample
   –  Human: NA12878 + parents
•  Sequence with multiple technologies (machines and protocols)
   –  With and without PCR, different machines, different read lengths, etc.
•  Compare to a knowledgebase (e.g. NIST GIAB)
   –  Make sure that results are better or at least no worse (specific comparison metrics will be discussed towards the end of the workshop)

SYSTEMATIC PIPELINE QC

Example: QC in the Broad’s production pipeline
(1) Fidelity of barcode matching, cluster density, number of reads, bases, etc.
(2) Quality of alignment, library construction, coverage, base quality, internal controls, SAM format validation + Identity through fingerprints
(3) Cumulative quality from (2) by sample, cross-sample contamination + Identity fingerprints, read groups cross-check
(4) VCF format validation, genotype concordance on control samples, variant calling quality metrics
Data that fail any step of quality or identity verification should get blacklisted

Controlling for contamination and sample swaps
•  Contamination (and barcode-swapping) can be checked using VerifyBamID
•  Check for contamination in tumor samples using ContEst
•  Fingerprinting (currently private code)
   –  Run samples on fingerprinting chip (~100 sites)
   –  Gives odds ratio between a “swap” and a “no swap” situation
   –  One could use GenotypeConcordance as a substitute for the private code but it would only work well on the aggregated bam
•  Cross-check results from separate lanes once aggregated per-sample (again using private code)

Tools for systematic QC of sequence and mapping quality
•  Picard Metrics collection tools
   –  See Collect*Metrics tools in https://broadinstitute.github.io/picard/index.html
   –  Metrics are defined in https://broadinstitute.github.io/picard/picard-metric-definitions.html
•  User-friendly alternatives with GUI (we don’t)
   –  FastQC
      •  Specializes in basic sequence quality assessment
   –  QualiMap
      •  Specializes in mapping quality assessment

Typical quality failures detected by Picard QC tools
•  Normal amounts of raw data (in Gb) but poor target coverage
•  High proportion of chimerism
•  Strange insert size distribution (too big / too small)
•  Shearing-based oxidation (poor OxoQ values)
•  Library size too small
Exomes
•  Severe unevenness in distribution of coverage (Fold80 penalty values)
•  High percentage of adapter/oligo HS / reference bias based oxidation (poor cref-OxoQ values)
Whole genomes
•  High proportion of unmapped reads

Enough data produced but low mapping rate
   % PF reads (pass filters)    % PF reads aligned
   93.031                       78.435
   93.378                       74.277
Fewer than 80% of the reads produced (that pass quality filters) were mapped; typical alignment for human exome is > 98%

High percentage of duplicate reads
   Mean coverage    % Excl dupe    % Excl overlap    % Excl Total
   32               5.2            7.9               17.3
   32               21.1           2.3               25.1
   23               20.4           1.8               25.8
Appear to reach coverage target but values are inflated by duplication
[Figure: IGV views of the same region. Panels: Showing duplicate reads; Hiding duplicate reads]

Uneven coverage in a PCR-Free whole genome
   Mean coverage    % bases at 15X
   31.6             69
Reaches overall coverage target but data is unevenly distributed: piles of reads in some places alternate with uncovered regions
[Figure: IGV coverage tracks. Panels: Unevenly covered WGS sample; Evenly covered WGS sample]

Uneven coverage between read groups
[Figure: coverage per read group. Tracks: Read Group 1; Read Group 2; Read Group 3]
Not always a problem: sometimes we add an extra run for a sample to “top up” coverage (but in this case RG2 in particular looks problematic)

High percentage of chimerism
   % Chimeras    % Selected bases    % Target bases 20X
   30.272        65.438              78.478
   13.405        70.036              84.811
Reaches coverage goals but data integrity may be an issue as number of chimeric reads is so high; could confound detection of structural rearrangements and indels.

Strange insert size distribution
[Figure: insert size histogram with an abnormal spike]

Bacterial contamination in cheek swab samples produces deep piles of partly aligned reads

Further reading
http://www.broadinstitute.org/gatk/guide/
https://broadinstitute.github.io/picard/index.html
https://broadinstitute.github.io/picard/picard-metric-definitions.html
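
The failure examples in these slides (low mapping rate, high duplication, high chimerism) lend themselves to a simple automated screen. The Python sketch below shows one possible shape for such a check. The metric names are hypothetical lowercase labels, not Picard column names, and the cutoffs are assumptions chosen only so that the example values quoted on the slides fall on the expected side; they are not recommended thresholds.

```python
# Illustrative cutoffs only. These are assumptions for the example, not
# values stated on the slides or defined by Picard.
# Each entry maps a metric name to (kind of bound, cutoff in percent).
THRESHOLDS = {
    "pct_pf_reads_aligned": ("min", 80.0),
    "pct_chimeras": ("max", 20.0),
    "pct_excl_dupe": ("max", 15.0),
}


def flag_failures(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that violate their cutoff."""
    failed = []
    for name, (kind, cutoff) in thresholds.items():
        if name not in metrics:
            continue  # metric not reported for this sample
        value = metrics[name]
        if kind == "min" and value < cutoff:
            failed.append(name)
        elif kind == "max" and value > cutoff:
            failed.append(name)
    return sorted(failed)


# Values copied from the example tables; the sample labels are invented.
samples = {
    "low_mapping": {"pct_pf_reads_aligned": 78.435},
    "high_chimerism": {"pct_chimeras": 30.272},
    "high_duplication": {"pct_excl_dupe": 21.1},
    "duplication_ok": {"pct_excl_dupe": 5.2},
}

report = {name: flag_failures(m) for name, m in samples.items()}
for name, failed in report.items():
    print(name, "FAIL" if failed else "PASS", failed)
```

In a real pipeline the cutoffs would be set per data type (exome vs. whole genome) and a sample with any failed metric would be held back for review.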
