Quality control: Methods for assessing quality and detecting problems

TYPES OF QC

QC analysis takes many forms
•  Understanding limitations of technology
   –  Sample preparation protocols
   –  Sequencing technologies
   –  Data processing workflows & algorithms
•  Systematic pipeline QC
   –  Quality -> exclude failed runs / samples
   –  Identity -> detect contamination and swaps
•  Project-specific troubleshooting

LIMITATIONS OF TECHNOLOGY

Typical limitations of technology
•  Inability to capture data due to genome location and/or sequence composition (GC)
   –  Examples: high GC content, design of exome bait sets -> depth of coverage analysis
•  Sequencing chemistry/tech error modes
   –  Example: high rates of FP indels in IonTorrent PGM
•  Data processing algorithms
   –  Example: position-based callers (UG) do a bad job calling indels -> compare different tools/workflows

Example of depth of coverage analysis on WGS
•  Which of several sample prep protocols provides the best coverage distribution for WGS?
   –  “Old protocol”
   –  “New protocol + varying concentrations of betaine”
•  Data
   –  Test sample NA12878, Illumina HiSeq WGS
   –  List of gene intervals of interest
•  Tools & methods (see the sketch below)
   –  DepthOfCoverage -> per-site values aggregated by gene interval
   –  GCContentByInterval -> per-gene GC content
   –  IGV for sequence data visualization
   –  R + ggplot2 for plotting coverage distributions
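A minimal sketch of how these two coverage tools might be run for this analysis, using GATK 3 command-line syntax; the reference, BAM, interval list, and output names are placeholders, not taken from the original talk:

    # Hypothetical inputs: reference, aligned WGS BAM, and the gene intervals of interest.
    REF=ref/human_g1k_v37.fasta
    BAM=NA12878.wgs.bam
    INTERVALS=gene_intervals.list

    # Per-site depth of coverage, aggregated per gene interval
    # (writes <prefix>.sample_interval_summary among other output files).
    java -jar GenomeAnalysisTK.jar \
        -T DepthOfCoverage \
        -R $REF \
        -I $BAM \
        -L $INTERVALS \
        -o coverage/NA12878_gene_coverage

    # GC content of each gene interval, used to stratify the coverage distributions.
    java -jar GenomeAnalysisTK.jar \
        -T GCContentByInterval \
        -R $REF \
        -L $INTERVALS \
        -o coverage/NA12878_gene_gc.txt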
Distribution of coverage over gene intervals, faceted by prep method and GC content
•  Increasing concentrations of betaine improve coverage
•  GC-rich genes are badly covered by the new protocol without betaine; betaine rescues high-GC genes
•  Intervals are binned by GC content: high GC (>0.6), medium GC (0.4–0.6), low GC (<0.4)
•  Wide distribution => uneven coverage; narrow distribution => even coverage
•  Normalization: Norm(X) = X / Mean(X) (see the sketch after this list)

Visual inspection of select intervals shows differential effects
•  Protocols compared: old protocol, new protocol (no betaine), new protocol + 1M betaine, new protocol + 2M betaine
•  Coverage in GC-rich regions (e.g. GC content = 0.69) increases with betaine concentration
•  With betaine, coverage is similar between GC-rich and average regions
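The normalization above divides each interval's coverage by the mean over all intervals so that distributions can be compared across protocols. A minimal sketch, assuming a simple two-column tab-separated input of interval name and mean coverage (the real DepthOfCoverage interval summary has more columns, so the column index would need adjusting):

    # Divide each interval's mean coverage by the overall mean: Norm(X) = X / Mean(X).
    # Assumed (hypothetical) input format: <interval><TAB><mean_coverage>
    # The file is read twice: the first pass computes the overall mean,
    # the second pass prints interval, raw coverage, and normalized coverage.
    awk 'BEGIN { FS = OFS = "\t" }
         NR == FNR { sum += $2; n++; next }
         { print $1, $2, $2 / (sum / n) }' \
        interval_coverage.tsv interval_coverage.tsv > interval_coverage.norm.tsv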
Example of depth of coverage analysis on WEx
•  Which exome technology provides the best coverage over the intervals that interest us?
   –  “Tech 1”
   –  “Tech 2”, which differs mainly by location of baits relative to exons
•  Data
   –  Test sample NA12878, Illumina HiSeq WEx
   –  List of exome target intervals of interest
•  Tools & methods (see the sketch below)
   –  DiagnoseTargets -> per-interval summary of usability metrics
   –  IGV for sequence data visualization
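A minimal sketch of running DiagnoseTargets over the exome targets, again in GATK 3 syntax with placeholder file names:

    # Per-interval usability metrics for the exome sample, written as a VCF
    # whose annotations describe how (and how badly) each target interval fails.
    java -jar GenomeAnalysisTK.jar \
        -T DiagnoseTargets \
        -R ref/human_g1k_v37.fasta \
        -I NA12878.wex.bam \
        -L exome_targets.interval_list \
        -o NA12878.diagnose_targets.vcf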
VCF output by DiagnoseTargets shows degrees of failure due to problematic sequence context

Example of failure due to problematic sequence context
•  Tech 1 coverage was very low in this area, but deeper coverage in the rest of the exon inflates the overall coverage score, allowing it to pass filters
•  Tech 2 performs badly in the area where Tech 1 also fails, but the failure is more obvious
•  The Tech 2 interval extends into the intron (250 bp upstream, also void of coverage)
•  Caveat: the raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies

Location of bait sets plays a dramatic role in exome usability
•  Tech 1 provided decent coverage here, so sequence context is not the problem
•  Tech 2 produces abundant coverage in the intron region but bad coverage in the exonic area of interest
•  The same caveat applies: compare relative coverage between areas per technology, not absolute coverage between technologies

QC of workflows and algorithms based on benchmarking
•  Internally: benchmarking against a knowledge base (see the concordance sketch after the figure below)
   –  Our favorite sample: NA12878 (+ parents)
   –  Sequenced with multiple technologies
   –  Manual review over extensive stretches of genome
   –  Benchmark for testing changes to GATK tools
•  Externally: Genome in a Bottle, http://www.genomeinabottle.org
NA12878 Knowledge Base
[Figure: assigned truth status (TRUE_POSITIVE / FALSE_POSITIVE) of reviewed variants, plotted by position along chromosomes 1–22 and X (0–250,000,000 bp), with calls attributed to reviewers: ami, chartl, delangel, delangel_fosmids, depristo, ebanks, gauthier, gege, haasb, justinzook, multiple, rpoplin, thibault, valentin.]
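As an illustration of benchmarking against such a knowledge base, a hedged sketch using GATK 3's GenotypeConcordance; the callset and truth VCF names are placeholders, and flag spellings should be checked against the GATK version in use:

    # Compare a newly produced NA12878 callset against a truth set
    # (e.g. a knowledge-base or Genome in a Bottle VCF) to measure concordance.
    java -jar GenomeAnalysisTK.jar \
        -T GenotypeConcordance \
        -R ref/human_g1k_v37.fasta \
        --eval NA12878.new_pipeline.vcf \
        --comp NA12878.truth.vcf \
        -o NA12878.concordance.grp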
SYSTEMATIC PIPELINE QC

Example: QC in the Broad’s production pipeline
(1) Quality of barcode matching, cluster density, number of reads, bases, etc.
(2) Quality of alignment, library construction, coverage, base quality, internal controls, SAM format validation + identity through fingerprints
(3) Cumulative quality from (2) by sample, cross-sample contamination + identity fingerprints, read groups cross-check
(4) VCF format validation, genotype concordance on control samples, variant calling quality metrics
Data that fail any step of quality or identity verification get blacklisted

Controlling for contamination and sample swaps (see the sketch after these lists)
•  Fingerprinting
   –  Run samples on a fingerprinting chip (~100 sites)
   –  Use VerifyBamID to match BAMs to initial fingerprints (http://genome.sph.umich.edu/wiki/VerifyBamID)
   –  Use GenotypeConcordance to compare fingerprint genotypes to results in the final callset
   –  Use internal control samples
•  Cross-check results from separate lanes once aggregated per sample
•  Check for cross-sample contamination using e.g. ContEst (http://www.ncbi.nlm.nih.gov/pubmed/21803805)

Tools for systematic QC of sequence and mapping quality
•  Picard metrics collection tools
   –  See the Collect*Metrics tools in https://broadinstitute.github.io/picard/index.html
   –  Metrics are defined in https://broadinstitute.github.io/picard/picard-metric-definitions.html
•  User-friendly alternatives with a GUI
   –  FastQC: specializes in basic sequence quality assessment (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
   –  QualiMap: specializes in mapping quality assessment (http://qualimap.bioinfo.cipf.es/)
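A hedged sketch stringing together a few of these per-sample checks; the sample names, file paths, and output locations are placeholders, and flags should be checked against the tool versions in use:

    SAMPLE=NA12878
    BAM=${SAMPLE}.bam
    REF=ref/human_g1k_v37.fasta

    # Identity: compare the BAM against the sample's fingerprint genotypes.
    verifyBamID \
        --vcf fingerprints/${SAMPLE}.fingerprint.vcf \
        --bam ${BAM} \
        --out qc/${SAMPLE}.verifybamid

    # Alignment and library metrics via Picard Collect*Metrics tools.
    java -jar picard.jar CollectAlignmentSummaryMetrics \
        R=${REF} I=${BAM} O=qc/${SAMPLE}.alignment_summary_metrics.txt
    java -jar picard.jar CollectInsertSizeMetrics \
        I=${BAM} O=qc/${SAMPLE}.insert_size_metrics.txt \
        H=qc/${SAMPLE}.insert_size_histogram.pdf

    # User-friendly alternatives (both can also be run from the command line).
    fastqc -o qc/fastqc ${SAMPLE}_R1.fastq.gz ${SAMPLE}_R2.fastq.gz
    qualimap bamqc -bam ${BAM} -outdir qc/qualimap/${SAMPLE}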
Variant QC tools: see the previous talk on evaluating callsets

Additional sample QC based on phenotypic inference
•  Kinship -> degree of relation between samples (KING / PLINK)
•  Pedigree -> reconstruct family structure (trios)
•  Sex -> coverage / clustering analysis over X and Y; many projects discard samples with non-standard sex genotypes (e.g. X0, XXY)
•  Ethnicity inference -> PCA + clustering on a subset of conserved sites (S. Purcell)
These methods, developed for GWAS, can be used for QC purposes, e.g. to check identity and verify supplied metadata, as well as to adjust variant QC expectations.

Pairwise kinship inference (Monkol Lek, 2014)
[Figure: pairwise kinship estimates separating duplicates, parent-offspring pairs, and siblings across cohorts including TAT, SIGMA, T2D-GENES, GoT2D, SCZ, Ottawa, NFBC, ESP, Bipolar, BUP, ATV, and 1000G.]

Ethnicity affects many variant call metrics (Monkol Lek, 2014)
[Figure: SNP counts per sample (roughly 18K–24K) differ by population; older populations tend to display more heterogeneity.]

PROJECT TROUBLESHOOTING

The overwhelming majority of results that don’t make sense can be explained by quality issues, so go back and check all the metrics from the production pipeline QC steps above: (1) barcode matching, cluster density, number of reads and bases; (2) alignment, library construction, coverage, base quality, internal controls, SAM format validation, identity through fingerprints; (3) cumulative per-sample quality, cross-sample contamination, fingerprint and read group cross-checks; (4) VCF format validation, genotype concordance on control samples, variant calling quality metrics.

But sometimes it’s our fault (sort of*)
•  If you ever get VQSR plots that look wrong (as in the example shown in the talk) and/or where the novel Ti/Tv values are very bad, you’re probably using a post-1000G dbsnp version. If so, switch to our bundled *_129 dbsnp version.
* The underlying reason is hard-coded expectations in the VQSR plotting scripts.

Finally, what to do if individual variant calls look wrong?
•  Things to try (see the sketch at the end of this transcript)
   –  Retry that region with the latest nightly build (in case it was a bug that has been fixed)
   –  Generate and check the HC’s -bamout output to see how the reads were realigned
   –  Use the HC’s -forceActive and -disableOptimizations arguments to force a call
•  Things to look out for
   –  Homopolymers or large repeats in the sequence context
   –  Lots of soft-clips
   –  Overlapping indels in the cohort
•  Things to remember
   –  A variant can pass site-level filters even if individual genotypes are uncertain

Further reading: http://www.broadinstitute.org/gatk/guide/
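A hedged sketch of the HaplotypeCaller troubleshooting commands from the “Things to try” list above (GATK 3 syntax; the region, BAM, and output names are placeholders):

    # Re-call just the suspect region and write out the realigned reads (-bamout)
    # so they can be inspected in IGV alongside the original BAM.
    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R ref/human_g1k_v37.fasta \
        -I NA12878.bam \
        -L 20:10,000,000-10,010,000 \
        -bamout NA12878.debug.bamout.bam \
        -o NA12878.debug.vcf

    # Force the region to be treated as active and disable certain internal
    # optimizations, to see whether a call can be produced at all.
    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R ref/human_g1k_v37.fasta \
        -I NA12878.bam \
        -L 20:10,000,000-10,010,000 \
        -forceActive \
        -disableOptimizations \
        -bamout NA12878.forced.bamout.bam \
        -o NA12878.forced.vcf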