EDACC
Quality Characterization for Various
Epigenetic Assays
Cristian Coarfa
Bioinformatics Research Laboratory
Molecular and Human Genetics
Data Types Submitted To EDACC
• ChIP-Seq
• Methyl-C
• RRBS
• MRE-Seq
• MeDIP-Seq
• Chromatin Accessibility
• small RNA-Seq
• mRNA-Seq
Quality Characterization
• How to measure the quality of mapped reads?
• Note: not quality of sequencing
– statistics on this are provided by the sequencer
• Most labs do some sort of visual inspection
• Metrics for characterizing level 2 data quality
• Apply it to various data types submitted to EDACC
Enrichment Based Protocols
• ChIP-Seq, MeDIP-Seq, Chromatin Accessibility
• Methods implemented
– PTIH (percent tags in hotspots)
– iROC (integral of ROC)
– Percent tags in peaks (FindPeaks)
– Poisson enrichment metric
• Implemented in EDACC pipeline
– Metrics computed on all submitted data
PTIH (percent tags in hotspots)
• Detect enriched regions using “hotspot” algorithm
• PTIH = percentage of all tags that fall in hotspots
Hotspot algorithm
Scan statistic gauging enrichment with a z-score based on the binomial distribution.
[Figure: a small 250 bp window containing n tags, inside a large 50 kb window containing N tags]
Binomial distribution gives probability of seeing n tags in the small window given N tags total in the large window.
This adjusts for local background fluctuations (due to CNV, for instance).
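The z-score described above can be sketched as follows. This is a minimal illustration, not the hotspot program itself: it assumes that under the null each of the N tags in the 50 kb window falls into the 250 bp window with probability p = 250/50000, and it uses the normal approximation to the binomial. The function name and defaults are hypothetical.

```python
import math

def hotspot_z(n, N, small_bp=250, large_bp=50_000):
    """z-score for n tags in the small window given N tags in the large window.

    Assumption: background tags are spread uniformly over the large window,
    so each tag lands in the small window with probability small_bp / large_bp.
    """
    p = small_bp / large_bp
    expected = N * p                      # binomial mean
    sd = math.sqrt(N * p * (1 - p))       # binomial standard deviation
    if sd == 0:
        return 0.0                        # no tags in the large window
    return (n - expected) / sd

# 20 tags in the small window against 400 tags in the surrounding 50 kb
print(hotspot_z(n=20, N=400))
```

Because N is counted in the local 50 kb window rather than genome-wide, a region with elevated copy number raises both n and N, which is how the local background adjustment enters.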
PTIH values
[Figure: PTIH values shown on the slide: 0.48, 0.19, 0.72, 0.48]
Ratio of Tags in Peaks
• Determine uniquely mapping reads
• Use FindPeaks to call peaks
• Count reads mapping into peaks
– percentage of total mapped reads
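The counting step above can be sketched as follows, assuming peaks have already been called (for example by FindPeaks) and exported as intervals. The input layout, the half-open coordinate convention, and the function name are assumptions made for illustration, and peaks are assumed not to overlap one another.

```python
from bisect import bisect_right

def percent_reads_in_peaks(read_positions, peaks):
    """Percentage of reads whose position falls inside any peak.

    read_positions: list of (chrom, pos) for uniquely mapped reads.
    peaks: list of (chrom, start, end), half-open, non-overlapping (assumed).
    """
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {c: [s for s, _ in iv] for c, iv in by_chrom.items()}

    in_peaks = 0
    for chrom, pos in read_positions:
        intervals = by_chrom.get(chrom)
        if not intervals:
            continue
        # Rightmost peak starting at or before pos
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            in_peaks += 1
    if not read_positions:
        return 0.0
    return 100.0 * in_peaks / len(read_positions)
```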
Poisson Based Enrichment Method
• Determine uniquely mapping reads
• Remove duplicate reads
• Bin the reads into 1kb windows
• Infer parameters of a simple Poisson distribution
• Filter enriched windows
– p-value < 0.01
• Count reads mapping into enriched windows
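A minimal sketch of the steps above. The slide does not say how the Poisson parameter is inferred; this sketch assumes it is the mean number of reads per 1kb window across the genome. Duplicates are collapsed by (chromosome, position) only, which ignores strand. All names are hypothetical.

```python
import math
from collections import Counter

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam); adequate for the small lam used here."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)        # P(X = 0)
    cdf = 0.0
    for i in range(k):           # accumulate P(X = 0) .. P(X = k - 1)
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def poisson_enrichment(read_positions, genome_windows, window_bp=1000, alpha=0.01):
    """Fraction of deduplicated reads that fall in enriched windows.

    read_positions: iterable of (chrom, pos) for uniquely mapped reads.
    genome_windows: total number of windows in the genome, including empty ones.
    """
    unique = set(read_positions)                          # remove duplicate reads
    counts = Counter((c, p // window_bp) for c, p in unique)
    lam = len(unique) / genome_windows                    # assumed estimator: mean per window
    enriched = sum(k for k in counts.values() if poisson_sf(k, lam) < alpha)
    return enriched / len(unique) if unique else 0.0
```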
Next Step – Metrics Evaluation
• Metrics probe different features of data
• Use visual inspection to ascertain which (one or more) of the proposed methods captures useful aspects of data quality.
ChIP-Seq/Chromatin Accessibility/FindPeaks QC Metrics
• Collaborative efforts between centers
• ~330 lanes of verified ChIP-Seq, MeDIP-Seq, and Chromatin Accessibility data
• Accessible in Epigenome Atlas
Going forward
• EDACC will run continuously on all submitted data
• Option to automatically flag data that fall below specified thresholds
• Include QC metrics in metadata
• For most data types we need further experience on what thresholds make sense
• Provide downstream users with this information
• Note that we are breaking new ground
– uniform quality scoring is not being performed by other major consortia (ENCODE, modENCODE)
Pearson correlation for ChIP-Seq Histone Modification
• Using raw density maps at 10kb resolution
• Process
– Select uniquely mapping reads
– Extend 200bp in mapping strand direction
– Remove monoclonal reads
– Build density map
– Pearson correlation with other submitted marks
• Ideally: a mark correlates best with other experiments for the same assay
• How well does Pearson correlation work?
– Helped us identify 5 bad lanes; REMCs retracted the data
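The process above can be sketched for a single chromosome as follows. This is an illustration under stated assumptions, not the EDACC pipeline: reads are given as (5' position, strand), monoclonal reads are taken to mean reads sharing position and strand, and each extended read adds one count to every 10kb bin it overlaps. Function names are hypothetical.

```python
import math

def density_map(reads, chrom_len, bin_bp=10_000, extend_bp=200):
    """Binned read density for one chromosome.

    reads: iterable of (pos, strand), pos being the 5' end of a uniquely mapped read.
    """
    bins = [0] * (chrom_len // bin_bp + 1)
    for pos, strand in set(reads):           # set() drops monoclonal reads
        if strand == "+":
            start, end = pos, min(pos + extend_bp, chrom_len)
        else:
            start, end = max(pos - extend_bp, 0), pos
        if end <= start:
            continue
        for b in range(start // bin_bp, (end - 1) // bin_bp + 1):
            bins[b] += 1
    return bins

def pearson(x, y):
    """Pearson correlation of two equal-length density maps."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0                           # correlation undefined for a flat map
    return cov / math.sqrt(vx * vy)
```

Computing `pearson` between a lane's density map and the maps of every other submitted mark gives one row of a correlation matrix. A lane whose best match is not another experiment for the same assay is a candidate for closer inspection.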
PCA Analysis
• 10kb windows on chr20
• PCA using Pearson correlation metric
[Figure: PCA plot using the Pearson correlation metric; axis label "PCA 53.8%". Labeled points: Input, H3K36me3, H3K9me3, H3K79me1, H3K20me1, H3K27me3, H3K4me3, H3K9ac, H2AK5ac, H2BK120ac, H2BK12ac, H2BK15ac, H2BK20ac, H3K14ac, H3K18ac, H3K23ac, H3K27ac, H3K4ac, H3K56ac, H4K5ac, H4K8ac, H4K91ac]
MRE-Seq
• Reads are mapped onto reference genome
• Uniquely mapping reads are kept
• Build the fragment map of expected mapping locations
– based on the enzyme cocktail used
• Count reads mapping within the expected digest fragments
• 76-99% of reads map within expected fragments
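A rough sketch of building a fragment map by in-silico digestion and counting reads against it. The slide does not list the enzyme cocktail, so the recognition sequences are a parameter; CCGG appears in the usage below purely as an illustrative site. The fragment size range and the use of the site start as the cut coordinate are simplifying assumptions, and the function names are hypothetical.

```python
def digest_fragments(sequence, sites, min_bp=50, max_bp=500):
    """Expected fragments between consecutive recognition sites.

    sites: recognition sequences of the enzymes in the cocktail.
    min_bp / max_bp: assumed size-selection range, not taken from the slide.
    """
    seq = sequence.upper()
    cuts = sorted({
        i
        for site in sites
        for i in range(len(seq) - len(site) + 1)
        if seq.startswith(site, i)
    })
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if min_bp <= b - a <= max_bp]

def fraction_in_fragments(read_starts, fragments):
    """Fraction of read start positions that fall inside any expected fragment."""
    if not read_starts:
        return 0.0
    hits = sum(any(a <= p < b for a, b in fragments) for p in read_starts)
    return hits / len(read_starts)

# Toy sequence with sites at 0, 100 and 1104; only the first fragment passes the size filter
toy = "CCGG" + "A" * 96 + "CCGG" + "A" * 1000 + "CCGG"
frags = digest_fragments(toy, ["CCGG"])
print(frags, fraction_in_fragments([10, 99, 100, 600], frags))
```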
mRNA-Seq
• Reads are mapped onto reference genome
• Uniquely mapping reads are kept
• Count reads mapping within UCSC genes exons
• 70-90% of reads map within gene exons
– UCSC known genes
– Entrez genes
Small RNA-Seq
• Trim adaptors
• Reads are mapped onto reference genome
• Reads mapping up to 100 locations are kept
• Count reads overlapping with known small RNAs
– miRNAs, piRNAs, sno/scaRNAs, repeat RNAs
• At least 30% of reads overlap with known small RNAs
Bisulfite Sequencing
• Map using Pash
• Methyl-C
– Genome wide
– QC
• C->T Conversion rates; typically 99%
• RRBS
– Enzyme cocktail
– QC
• Map within expected cut sites
• Ratio varies 40%-90%
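For the Methyl-C conversion-rate check, a minimal sketch is given below. It assumes the rate is estimated at reference cytosines outside CpG context, which are presumed unmethylated. That choice is an assumption of this sketch, since the slide does not say which cytosines are used. Reads are assumed to be ungapped and aligned to the forward strand, and the function name is hypothetical.

```python
def conversion_rate(ref, reads):
    """Estimate the C->T conversion rate at non-CpG reference cytosines.

    ref: reference sequence (forward strand).
    reads: list of (start, seq), ungapped forward-strand alignments.
    """
    ref = ref.upper()
    converted = unconverted = 0
    for start, seq in reads:
        for offset, base in enumerate(seq.upper()):
            i = start + offset
            if i >= len(ref) or ref[i] != "C":
                continue
            if i + 1 < len(ref) and ref[i + 1] == "G":
                continue                  # skip CpG sites, which may be methylated
            if base == "T":
                converted += 1            # cytosine read as T: converted
            elif base == "C":
                unconverted += 1          # cytosine still read as C
    total = converted + unconverted
    return converted / total if total else float("nan")
```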
QC for MeDIP-Seq Data Using Galaxy
Exercise
• Download the input MeDIP-Seq file from the workshop wiki
• Determine the ratio of reads in peaks using FindPeaks in Galaxy