Download transcription factor binding site

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

NUMT wikipedia , lookup

Comparative genomic hybridization wikipedia , lookup

Genealogical DNA test wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

United Kingdom National DNA Database wikipedia , lookup

Nucleic acid double helix wikipedia , lookup

No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup

Extrachromosomal DNA wikipedia , lookup

Nucleosome wikipedia , lookup

History of genetic engineering wikipedia , lookup

Tag SNP wikipedia , lookup

Replisome wikipedia , lookup

Human genome wikipedia , lookup

Deoxyribozyme wikipedia , lookup

Cell-free fetal DNA wikipedia , lookup

Cre-Lox recombination wikipedia , lookup

Molecular Inversion Probe wikipedia , lookup

Non-coding DNA wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Genome evolution wikipedia , lookup

Genome editing wikipedia , lookup

Pathogenomics wikipedia , lookup

DNA sequencing wikipedia , lookup

Helitron (biology) wikipedia , lookup

Epigenomics wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Human Genome Project wikipedia , lookup

RNA-Seq wikipedia , lookup

Bisulfite sequencing wikipedia , lookup

Exome sequencing wikipedia , lookup

Whole genome sequencing wikipedia , lookup

Genomic library wikipedia , lookup

Metagenomics wikipedia , lookup

Genomics wikipedia , lookup

Transcript
ChIP-seq data analysis
Harri Lähdesmäki
Department of Computer Science
Aalto University
January 8, 2015
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Motivation: transcription factor binding site (TFBS)
prediction
!
!
!
!
Last time we studied computational methods for predicting
transcription factor (TF) binding sites
To understand transcriptional regulation in general, we would
like to analyze/predict TF binding sites in any cell
type/biological condition/. . .
A critical limitation of most of the standard TFBS prediction
methods is that they are condition independent
That is, TF binding predictions would be the same for all cell
types/biological conditions/. . .
!
!
How can one model and explain variation between hundreds of
different cell types?
Some methods do exist which make condition dependent
binding predictions but typically those methods require some
condition dependent data anyway
!
We will get back to these later
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
ChIP-seq
!
So, for any given condition, how do we find the genomic
locations where DNA binding proteins bind?
!
Chromatin immunoprecipitation followed by sequencing
(ChIP-seq) is the current state-of-the-art method
!
The basic principle:
!
!
Use a specific antibody to detect a protein of interest (or a
post-translationally modified histone protein in nucleosomes)
ChIP-seq procedure enriches DNA fragments that are bound to
these proteins or nucleosomes
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
ChIP-seq protocol
!
ChIP-seq steps:
!
Crosslink DNA-binding
proteins with DNA in vivo
Shear the chromatin into
small fragments (200-600
bp) amenable for
sequencing (sonication)
Immunoprecipitate the
DNA-protein complex
with a specific antibody
Reverse the crosslinks
Assay enriched DNA to
determine the sequences
bound by the protein of
interest
!
!
!
!
Figure from (Visel et al., 2009)
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
ChIP-seq protocol again
Biometrics, March 2011
152
REVI
Figure
from (Zhang
2011)to its in. vivo
Figure 1. Details of a ChIPseq experiment.
A DNA-binding
proteinetis al.,
cross-linked
. genomic
. . . .DNA
. .targets,
. . . and
. . . .
.
the chromatin (a complex of DNA and protein) is isolated (1). The DNA with the bound. proteins
is. extracted
from the cells,
.
.
. . . . . . . . . . . .
.
a Roche/454, Life/APG, Polonator
and is sheared by sonication into fragments of average length 500–1,000 bp (2). DNA fragments that are cross-linked to the
Emulsion PCR
protein of interest are enriched by immunoprecipitation with an antibody that specifically binds that protein (3–4). After the
One DNA
bead. Clonal
tothe
thousands
copies occurs
in microreactors
in an emulsion
immunoprecipitation
step,molecule
the DNA per
is separated
from amplification
the protein (5),
resultingof
suspension
of IP-enriched
DNA is size
selected on a gel, and only smaller fragments (e.g., 100–300 bp) are retained (6). Then, the size-selected, IP-enriched DNA is
sequenced to generate millions of short reads, each of which represents
or end (7–8). (In an
alternative
PCReither a fragment start
Break
Template
“paired end” experiment that is rarely used for ChIP-seq, a read isamplification
generated from each emulsion
end of each DNA fragment.)
After
dissociation
read sequences are aligned to a reference genome, read positions can be used to infer binding site positions. (8) shows two
binding sites, with the right-hand site more enriched in DNA fragments. Fragments that do not align with a binding site
reflect biases like nonspecific immunoprecipitation, misalignment, etc. This figure appears in color in the electronic version of
this article.
.
.
.
.
.
.
.
100–200
NGS sequencing – recap
!
immunoprecipitated “treatment”
sample, and then using an
Primer, template,
analysis method that considers
the treatment
profile reladNTPs and
polymerase
tive to the control profile (Kharchenko et al., 2008; Nix,
Courdy, and Boucher, 2008; Rozowsky et al., 2009). Control data can be used to help identify false positives, assess numerical
background models, and estimate a threshb Illumina/Solexa
Template immobilization
Solid-phase amplification
One DNA molecule per cluster
old for segmenting a read density or enrichment profile
in order to identify a subset of significantly enriched regions. Analysis methods are described as “two-sample” when
a control data set is available and as “single sample”
when only treatment data are available. As with ChIPchip, there are various ways to generate control samples,
Chemica
linked to
c Helicos BioSciences: one-pass sequ
Single molecule: primer immobilize
Cluster
growth
Sample preparation
DNA (5 µg)
Template
dNTPs
and
polymerase
100–200 million molecular clusters
Bridge amplification
Billions of primed, single-molecule t
Figure from (Metzker, 2010)
d Helicos BioSciences: two-pass sequencing
e Pacific Biosciences, Life/Visigen, LI-COR Biosciences
Single molecule: template immobilized
Single molecule: polymerase immobilized
.
Billions of primed, single-molecule templates
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Thousands of primed, single-molecule templates
Figure 1 | Template immobilization strategies. In emulsion PCR (emPCR) (a), a reaction mixture consisting
Nature Review
NGS sequencing – recap
!
Sequencing by synthesis,
R E V I E W Sresulting in from millions to billions
of reads
a Illumina/Solexa — Reversible terminators
F
F
F
G
F
C
c Helicos BioSciences — Reversible terminators
F
F
F
T
F
A
C
F
G
Incorporate
all four
nucleotides,
each label
with a
different dye
C
F
C
F
G
F
T
F
F
G
F
C
G
T
F
F
F
Wash, fourcolour imaging
Cleave dye
and terminating
groups, wash
C
G
A
C
G
C
G
T
G
A
F
G
A
F
G
A
F
T
A
Incorporate
single,
dye-labelled
nucleotides
T
C
F
C
F
F
C
F
F
C
G
G
C
F
G
F
G
Each cycle,
add a
different
dye-labelled
dNTP
T
C
F
F
F
G
G
F
C
F
F
F
C
C
C
G
C
C
G
C
F
G
Wash, onecolour imaging
T
C
Cleave dye
and inhibiting
groups, cap,
wash
Repeat cycles
G
Repeat cycles
b
d
C
A
T
G
C
T
A
G
C
T
A
G
Top: CATCGT
Bottom: CCCCCC
Top: CTAGTG
Bottom: CAGCTA
.
.
. . .
Figure from
(Metzker, 2010) . . . . . . . . . . . . . . .
Figure 2 | Four-colour and one-colour cyclic reversible termination methods. a | The four-colour cyclic reversible
23,101
. .
.
. 3`-O-azidomethyl
. . . . . reversible
. . . terminator
. . . .chemistry
.
. (BOX
. 1)
.using
termination (CRT) method uses Illumina/Solexa’s
Nature
Reviews
| Genetics
solid-phase-amplified template clusters (FIG. 1b, shown as single templates for illustrative purposes). Following
imaging, a cleavage step removes the fluorescent dyes and regenerates the 3`-OH group using the reducing agent
tris(2-carboxyethyl)phosphine (TCEP)23. b | The four-colour images highlight the sequencing data from two clonally
amplified templates. c | Unlike Illumina/Solexa’s terminators, the Helicos Virtual Terminators33 are labelled with the
same dye and dispensed individually in a predetermined order, analogous to a single-nucleotide addition method.
Following total internal reflection fluorescence imaging, a cleavage step removes the fluorescent dye and inhibitory
groups using TCEP to permit the addition of the next Cy5-2`-deoxyribonucleoside triphosphate (dNTP) analogue. The
free sulphhydryl groups are then capped with iodoacetamide before the next nucleotide addition33 (step not shown).
d | The one-colour images highlight the sequencing data from two single-molecule templates.
Short read alignment – recap
One-base-encoded probe
!
An oligonucleotide sequence in
which one interrogation base is
associated with a particular
dye (for example, A in the first
position corresponds to a green
dye). An example of a one-base
degenerate probe set is
‘1-probes’, which indicates
that the first nucleotide is the
interrogation base. The
remaining bases consist of
either degenerate (four possible
bases) or universal bases.
Caenorhabditis elegans genome. From a single HeliScope
run using only 7 of the instrument’s 50 channels, approximately 2.8 Gb of high-quality data were generated in
8 days from >25-base consensus reads with 0, 1 or 2
errors. Greater than 99% coverage of the genome was
reported, and for regions that showed >5-fold coverage,
the consensus accuracy was 99.999% (J. W. Efcavitch,
personal communication).
Key first steps in ChIP-seq data analysis include
!
!
adapter removal, quality control, etc.
mapping short reads to a reference genome
Sequencing by ligation. SBL is another cyclic method
that differs from CRT in its use of DNA ligase35 and
either one-base-encoded probes or two-base-encoded
probes. In its simplest form, a fluorescently labelled
probe hybridizes to its complementary sequence adjacent to the primed template. DNA ligase is then added
to join the dye-labelled probe to the primer. Non-ligated
probes are washed away, followed by fluorescence
36 | JANUARY 2010 | VOLUME 11
www.nature.com/reviews/genetics
!
Alignment is conceptually simple: given each of the short
reads at a time, align the read locally to a reference genome
using e.g. Smith-Waterman algorithm, BLAST or even exact
string matching
!
Given a large number reads generated in a single experiment,
S-W and BLAST would be too slow in practice
Several computationally more efficient methods have been
developed, which use e.g. methods based on
!
!
!
Hash tables
Suffix/prefix trees (Burrows-Wheeler transform)
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
REVIEWS
Poisson model
A probability distribution that
is often used to model the
number of random events in a
fixed interval. Given an average
number of events in the
interval, the probability of a
given number of occurrences
can be calculated.
alignment, as all subsequent results are based on the
aligned reads. Owing to the large number of reads, the
use of conventional alignment algorithms can take hundreds or thousands of processor hours; therefore, a new
generation of aligners has been developed57, and more
are expected soon. Every aligner is a balance between
accuracy, speed, memory and flexibility, and no aligner
can be best suited for all applications. Alignment for
ChIP–seq should allow for a small number of mismatches due to sequencing errors, SNPs and indels or
the difference between the genome of interest and the
reference genome. This is simpler than in RNA–seq,
for example, in which large gaps corresponding to
introns must be considered. Popular aligners include:
Eland, an efficient and fast aligner for short reads that
was developed by Illumina and is the default aligner on
that platform; Mapping and Assembly with Qualities
(MAQ)58, a widely used aligner with a more exhaustive
algorithm and excellent capabilities for detecting SNPs;
and Bowtie59, an extremely fast mapper that is based
on an algorithm that was originally developed for file
compression. These methods use the quality score that
accompanies each base call to indicate its reliability. For
the SOLiD di-base sequencing technology, in which two
consecutive bases are read at a time, modified aligners
have been developed60,61. Many current analysis pipelines discard non-unique tags, but studies involving the
repetitive regions of the genome27,62–64 require careful
handling of these non-unique tags.
Strand specificity and read density visualization
!
A “data view” of protein-DNA binding
Protein or
nucleosome
of interest
5
Positive strand
3
3
Negative strand
5
5 ends of
fragments are
sequenced
Identification of enriched regions. After sequenced
reads are aligned to the genome, the next step is to identify regions that are enriched in the ChIP sample relative
to the control with statistical significance.
Several ‘peak callers’ that scan along the genome
to identify the enriched regions are currently
available24,26,38,48,65–70. In early algorithms, regions were
scored by the number of tags in a window of a given
Short reads
size and then assessed by a set of criteria based on facare aligned
tors such as enrichment over the control and minimum
tag density. Subsequent algorithms take advantage of
the directionality of the reads71. As shown in FIG. 5, the
fragments are sequenced at the 5` end, and the locaDistribution of tags
Reference genome
tions of mapped reads should form two distributions,
is computed
one on the positive strand and the other on the negative
strand, with a consistent distance between the peaks of
the distributions. In these methods, a smoothed profile of each strand is constructed65,72 and the combined
profile is calculated either by shifting each distribution
Profile is generated from
combined tags
towards the centre or by extending each mapped position
For example, each mapped
Peak identification
into an appropriately oriented ‘fragment’ and then
location is extended
can be performed
adding the fragments together. The latter approach
with a fragment of
on either profile
estimated size
should result in a more accurate profile with respect to
the width of the binding, but it requires an estimate
of the fragment size as well as the assumption that
Fragments are added
fragment size is uniform.
Given a combined profile, peaks can be scored in several ways. A simple fold ratio of the signal for the ChIP
sample relative to that of the control sample around the
peak (FIG. 3B) provides important information, but it is
not adequate. A fold ratio of 5 estimated from 50 and 10
tags (from the ChIP and control experiments, respecFigure 5 | Strand-specific profiles at enriched sites. DNA fragments from a
. | Genetics
. . . tively)
. . has
. a different
. . . statistical
. . . significance
. .
. to the.same.
Nature
Figure experiment
from (Park,
2009)
chromatin immunoprecipitation
are sequenced
from
theReviews
5` end.
ratio estimated from, for example, 500 and 100 tags.
Therefore, the alignment of these tags to the genome results in two peaks
. (one
. on
.
. . A.Poisson
. . model
. . for. the. tag
. distribution
. . . is. an effective
. .
each strand) that flank the binding location of the protein or nucleosome of interest.
approach that accounts for the ratio as well as the absoThis strand-specific pattern can be used for the optimal detection of enriched
27
lute tag numbers , and it can also be modified to account
regions. To create an approximate distribution of all fragments, each tag location can
for regional bias in tag density due to the chromatin
be extended by an estimated fragment size in the appropriate orientation and the
structure, copy number variation or amplification bias67.
number of fragments can be counted at each position.
676 | O CTOBER 2009 | VOLUME 10
.
.
.
www.nature.com/reviews/genetics
Single-end vs. paired-end sequencing
Ÿ)''0DXZd`ccXeGlYc`j_\ijC`d`k\[%8cci`^_kji\j\im\[
!
Sequencing by synthesis always proceeds from 5’ to 3’
!
So-called single-end sequencing technology sequences only one
of the ends of a DNA fragment
!
Either reverse or forward strand is sequenced
!
Current used sequencing technologies can read only a limited
number of nucleotides from the 5’ end, e.g., from 36 to 200
nucleotides
!
Because DNA fragments have size variation, we do not get
information about the whole fragment
!
Paired-end sequencing chemistry, where both end of a DNA
fragment is sequenced, is becoming more popular
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Identification of binding sites from ChIP-seq data
!
Given a read density (or read densities on both strands) along
genome, the actual data analysis task involves identification of
the protein binding sites
!
Given the above information, we should expect to see two
“signal peaks” on opposite DNA strands within a proper
distance
!
But how much signal (how many reads) are considered
enough to call a protein-DNA interaction site?
What affects the signal strength?
!
!
!
!
!
!
Protein binding in the first place
Sequencing efficiency and depth
Chromatin accessibility
Fragmentation efficiency
Mappability (i.e., uniqueness) of a local genomic region
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
ChIP-seq controls
!
The best way to assess significance of a signal at putative
binding sites is to use a control for ChIP-seq
!
!
Input-DNA: sequencing data of the genomic DNA. More/Less
accessible regions are sequenced more/less
ChIP-seq experiment with an unspecific antibody which does
not detect any specific protein
!
ChIP-seq controls can be used to account for many of the
biases which affect the signal strength
!
Input-DNA is currently considered the best control
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
R E V I E WDetecting
S
A
!
binding sites from ChIP-seq data
Early methods used a single cut-off
for
signal strength or a
Ba
Not statistically significant
log-fold enrichment (for a given putative binding
Enrichment
ratio: 1.5
region/window)
15
ChIP
A
Considering all
statistically
significant peaks
Fraction of peaks recovered
Considering all
statistically
significant peaks
Fraction of peaks recovered
score = log
# ChIP-seqControl
reads 10in a window
# Input DNA
reads in a window
Bb
Statistically significant
1
Ba
Considering only peaks
with fold enrichment
0above a threshold
Enrichment
ratio: 4
Not statistically significant
Enrichment
ratio: 1.5
0
1
Considering only peaks
with fold enrichment
above a threshold
ChIP
15
Control
10
ChIP
Control
20
5
Enrichment
ratio: 1.5
150
100
Figure1 from (Park, 2009)
Fraction of reads sampled from the data
Bb
Statistically significant
Figure
! 3 | Depth of sequencing. A | To determine whether enough tags have been sequenced, a simulation can be
carried out to characterize the fraction of the peaks that would be recovered if a smaller numberNature
of tags
had been
Reviews
| Genetics
Enrichment
Enrichment
sequenced. In many cases, new statistically significant
are discovered at
a steady
rate with an increasing number
ratio:peaks
4
ratio:
1.5
of tags (solid curve) — that is, there is no saturation of binding sites. However, when a minimum threshold is imposed for
the enrichment ratio between chromatin
(ChIP)150
and input DNA peaks, the rate at which new
ChIP immunoprecipitation
20
peaks are discovered slows down (dashed curve) — that is, saturation of detected binding sites can occur when only
sufficiently prominent binding positions are considered. For a given data set, multiple curves corresponding to
. . . . . . . . . . . . . .
5 the threshold at which the curve. becomes
different thresholds can be examined to identify
sufficiently flat to meet .the .
Control
100
. .
.
. . . . . . . . . . . . .
.
.
desired
saturation
criteria
(defined
by
the
intersection
of
the
orange
lines
on
the
graph).
We
refer to such a threshold as
0
the minimum saturation1 enrichment ratio (MSER). The MSER can serve as a measure for the depth of sequencing
0
Fraction of reads sampled
from in
thea data
achieved
data set: a high MSER, for example, might indicate that the data set was undersampled, as only the more
prominent peaks were saturated (see REF. 48 for details). Ba | A peak that is not statistically significant — the
Figure 3 | Depth of sequencing.
A | To determine
whetherthe
enough
been
sequenced,isalow
simulation
enrichment
ratio between
ChIPtags
and have
control
experiments
(1.5) andcan
thebe
number of tag counts (shown under
carried out to characterize the the
fraction
of
the
peaks
that
would
be
recovered
if
a
smaller
number
of
tags
had
been
Nature
Reviews
| Genetics
peaks) is also low. Bb | Two ways in which a peak can be statistically significant.
On the left, although the number of
sequenced. In many cases, newtag
statistically
peaks are discovered
at a steady
rate
with
an increasing
number
counts issignificant
low, the enrichment
ratio between
the ChIP
and
control
experiments
is high (4). On the right, the peaks
of tags (solid curve) — that is, there
no same
saturation
of binding
sites.
However,
when
a minimum
threshold
imposed
haveisthe
enrichment
ratio
as those
in a but
have
a larger number
of is
tag
counts;for
this example shows that
the enrichment ratio between chromatin
(ChIP)
andprominent
input DNApeaks
peaks,becoming
the rate atstatistically
which newsignificant and that there might not
continuedimmunoprecipitation
sequencing might lead
to less
peaks are discovered slows down
(dashed curve)
— that is, point
saturation
of
detected
binding
sites
can
occur
when only
Figure
S6.
Workflow
chart
of
MACS.
!
necessarily
be a saturation
after
which
no further
binding
sites
are discovered.
sufficiently prominent binding positions are considered. For a given data set, multiple curves corresponding to
different thresholds can be examined to identify the threshold at which the curve becomes sufficiently flat to meet the
desired saturation criteria (defined by the intersection of the orange lines on the graph). We refer to such a threshold as
the minimum saturation enrichment
Thecontinued
MSER can serve
a measure
for the depth
sequencing
! ratio
more
and (MSER).
more sites
to beasfound
at a steady
largeofsample
size increases the statistical power and
achieved in a data set: a high MSER, for example, might indicate that the data set was undersampled, as only the more
pace with additional sequencing (FIG. 3A, lower curve). causes features that have small effect sizes to attain staprominent peaks were saturated (see REF. 48 for details).
Ba | A peak that is not statistically significant — the
In another study 38, human RNA polymerase II targets tistical significance. In the study discussed above48, we
enrichment ratio between the ChIP and control experiments is low (1.5) and the number of tag counts (shown under
wereinshown
saturate
but forsignificant.
signal transducer
that
each ChIP–seq
data set could be annothe peaks) is also low. Bb | Two ways
whichto
a peak
can quickly,
be statistically
On the left,proposed
although the
number
of
and
activator
ofthe
transcription
1 (STAT1),
the number
tated
a minimal
tag counts is low, the enrichment
ratio
between
ChIP and control
experiments
is high (4). On
thewith
right,
the peakssaturated enrichment ratio (MSER)
rise steadily.
that,example
— a shows
point that
at which saturation occurs — to give a sense
have the same enrichment ratioofastargets
those incontinued
a but haveto
a larger
number This
of tagsuggests
counts; this
continued sequencing might lead
to lessin
prominent
peaksthere
becoming
statistically
there might not
at least
some cases,
might
not be asignificant
satura- and
of that
the sequencing
depth achieved. We also found that
necessarily be a saturation point
after
which
no further
are discovered.
tion
point
that
can bebinding
used tosites
determine
the number there is a linear relationship between the number of
Current state-of-the-art methods are probabilistic in nature
.
.
.
.
.
Model-based Analysis of ChIP-Seq (MACS)
A commonly used method for detecting TF binding sites from
ChIP-seq data: MACS (Zhang et al, 2008)
Workflow:
of tags to be sequenced if peaks are found based on reads and the MSER, when properly scaled. This makes
statistical significance.
it possible to predict how many more reads are needed
However,
does exist
a fixed the
when
a particular
more and more sites continued to
be foundaatsaturation
a steady point
large sample
sizeifincreases
statistical
powerlevel
and of MSER is desired. Although
threshold
is
imposed
on
the
fold
enrichment
between
these
concepts
and
tools should be tested on more
pace with additional sequencing (FIG. 3A, lower curve). causes features that have small effect sizes to attain stathe peaks
in theIIChIP
experiment
the peaksIn
inthe
thestudy
data
sets, they
provide
48
In another study 38, human RNA
polymerase
targets
tisticaland
significance.
discussed
above
, wea framework for understanding
— that is,
saturation
when depth-of-sequencing
issues in ChIP–seq experiments.
were shown to saturate quickly,control
but for experiment
signal transducer
proposed
thatoccurs
each ChIP–seq
data set could be annoonly
prominent
peaks
(as
defined
by
minimum
fold
and activator of transcription 1 (STAT1), the number tated with a minimal saturated enrichment ratio (MSER)
enrichment)
are considered.
Whenat all
peaks
are occurs
Multiplexing.
For
small genomes, including those of
of targets continued to rise steadily.
This suggests
that, — a point
which
saturation
— to give
a sense
even
peaks with
small
enrichment
Saccharomyces
cerevisiae,
at least in some cases, thereconsidered,
might not be
a saturaof the
sequencing
depthcan
achieved.
We also found
that Caenorhabditis elegans and
become
statistically
significant
as more
tags accumulate
D. melanogaster,
theofnumber of reads generated in
tion point that can be used to
determine
the number
there
is a linear
relationship between
the number
(FIG. 3B)
therefore
the number
of significant
peaksproperly
a sequencing
unitmakes
(for example, one of eight lanes on
of tags to be sequenced if peaks
are and
found
based on
reads and
the MSER, when
scaled. This
may
continue
to
rise
with
more
sequencing.
This
is
an
Illumina
Genome
Analyzer) may be several times
statistical significance.
it possible to predict how many more reads are needed
similar
whatifhappens
genome-wide
association
greater
than the
number of reads needed to provide
However, a saturation point
doestoexist
a fixed in when
a particular
level of MSER
is desired.
Although
studies
and other
genomic these
investigations
which
sufficient
coverage
of the genome at a suitable depth
threshold is imposed on the fold
enrichment
between
concepts in
and
toolsashould
be tested
on more
the peaks in the ChIP experiment and the peaks in the data sets, they provide a framework for understanding
control experiment — that is, saturation occurs when depth-of-sequencing
in ChIP–seq
.
.
Figure issues
from (Zhang
et al.,experiments.
2008) . . . . . . . . . . . . . . .
674 |prominent
O CTOBER 2009
10 by minimum fold
. .
. www.nature.com/reviews/genetics
. . . . . . . . . . . . .
.
.
only
peaks| VOLUME
(as defined
enrichment) are considered. When all peaksŸ)''0DXZd`ccXeGlYc`j_\ijC`d`k\[%8cci`^_kji\j\im\[
are Multiplexing. For small genomes, including those of
considered, even peaks with small enrichment can Saccharomyces cerevisiae, Caenorhabditis elegans and
become statistically significant as more tags accumulate D. melanogaster, the number of reads generated in
(FIG. 3B) and therefore the number of significant peaks
a sequencing unit (for example, one of eight lanes on
may continue to rise with more sequencing. This is an Illumina Genome Analyzer) may be several times
.
.
.
.
.
Model-based Analysis of ChIP-Seq (MACS)
!
Find model peaks:
!
!
mfoldlow and mfoldhigh : high confidence fold-enrichment
parameters
MACS slides 2 × bandwidth windows across the genome to
find regions with tags (=reads) more than mfoldlow (but
smaller than mfoldhigh to avoid artefacts) enriched relative to
a random tag genome distribution
!
bandwidth = assumed sonicated fragment size
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Model-based Analysis of ChIP-Seq (MACS)
Model the shift size of ChIP-seq tags
0.5
0.4
0.3
Tag percentage (%)
0.0
0.1
0.2
0.4
0.3
0.2
0.1
100
200
300
−200
−100
0
100
2.8
.
.
.
. . . .
. . . .
. . . .
2.7
.
.
Average tag number in control / kb
500
.
. . . .
200
Location with respect to FKHR motif (bp)
(d)
600
Figure from (Zhang et al., 2008)
−300
2.6
0
2.5
−100
2.4
Tag percentage (%)
Watson tag
Crick tags
d = 122 bp
0.0
−200
Location with respect to the center of Watson and Crick peaks (bp)
(c)
Volume 9, Issue 9, Article R137
0.6
0.5
Watson tags
Crick tags
d = 126 bp
−300
400
!
300
!
Take 1000 high-quality peaks randomly
Separate Watson
and Crick tags
http://genomebiology.com/2008/9/9/R137
Genome Biology 2008,
Aligns the tags by the mid point
Find d: distance between the modes of the Watson and Crick
peaks in the alignment
(a)
(b)
200
!
100
!
Control tag number (normalized) / 10 kb
!
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
300
Model-based Analysis of ChIP-Seq (MACS)
!
Shift all the tags by d/2 toward the 3’ ends to the most likely
protein-DNA interaction sites
!
Remove redundant tags:
!
!
!
!
!
Sometimes the same tag can be sequenced repeatedly, more
than expected from a random genome-wide tag distribution
Such tags might arise from biases during ChIP-DNA
amplification and sequencing library preparation (PCR
duplicates)
These are likely to add noise to the final peak calls
MACS removes duplicate tags in excess of what is warranted by
the sequencing depth (binomial distribution p-value < 10−5 )
For example, for the 3.9 million ChIP-seq tags, MACS allows
each genomic position to contain no more than one tag and
removes all the redundancies
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Model-based Analysis of ChIP-Seq (MACS)
!
Identifying the most likely binding sites
!
!
For experiments with a control, MACS linearly scales the total
control tag count to be the same as the total ChIP tag count
Counting process: if reads were sampled independently from a
population with given, fixed fractions of genomic locations, the
read counts x in each genomic location/window would follow a
multinomial distribution (binomial for a single location), which
can be approximated by the Poisson distribution
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Multinomial and binomial distributions
!
n independent trials
!
Each trial leads to a success for exactly one of k categories
!
Each category has a fixed success probability pk
!
Multinomial gives probability of any particular combination of
numbers of successes for the k categories
!
Binomial is a special case of 2 categories; can be
approximated with a Poisson for larger n
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Model-based Analysis of ChIP-Seq (MACS)
!
Let x denote the number of sequencing reads in a window
(genomic region)
!
Analysing each genomic window independently, if all genomic
regions would be equally likely, then
λxBG −λBG
x ∼ Poisson(·|λBG ) =
e
,
x!
x = 0, 1, 2, . . .
where λBG is the rate of observing reads in the control sample
!
Because ChIP-seq data has several bias sources which vary
across the genome, it is better to model the data using a
local/dynamic Poisson
λlocal = max(λBG , [λ1K ], λ5K , λ10K )
!
λX K is estimated from the X K window of the control sample
(e.g. input-DNA) centered at a specific location
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Model-based Analysis of ChIP-Seq (MACS)
Assessing statistical significance of x reads (in a genomic
region i) using hypothesis testing
!
!
The p-value is the probability of observing x many reads or
more, assuming the null hypothesis is true:
p − value =
k=x
The location with the highest fragment pileup (summit) is
http://genomebiology.com/2008/9/9/R137
Genome Biology 2008,
predicted
as the precise binding location
The ratio between the ChIP-seq tag count and λlocal is
reported as the fold enrichment
. . . .
. . . .
. . . .
Tag percentage (%)
0.3
Tag percentage (%)
.
.
.
. bp.
d .= 122
. . .
0.0
0.0
The read count in ChIP versus control in 10 kb windows
across the genome. Each dot represents a 10 kb window; red
dots are windows
containing ChIP peaks and black (d)
dots are
(c)
windows containing control peaks used for FDR calculation.
−300
−200
−100
0
100
200
300
−300
−200
−100
2.7
Average tag number in control / kb
200
400
600
800
1,000
−20
−10
FoxA1 ChIP−Seq tag number / 10 kb
70
Spatial resolution for FoxA1 Ch
.
.
.
. . . .
. . . .
. . . .
Spatial resolution (nt)
.
.
MACS
Without local lambda
Without tag shifting
45
MACS
Without local lambda
Without tag shifting
65
60
10
(f)
Figure from (Zhang et al., 2008)
.
0
Distance to FoxA1 peak center (k
Motif occurrence in peak centers for FoxA1 ChIP-Seq
55
100
2.3
0
FKHR motif (%)
0
Location with respect to FKHR motif
2.8
600
500
400
300
200
100
0
Control tag number (normalized) / 10 kb
Location with respect to the center of Watson and Crick peaks (bp)
(e)
.
0.1
0.1
0.2
Model-based Analysis of ChIP-Seq (MACS)
. . . .
. . . .
0.5
.
0.4
.
0.3
Watson tags
. Crick
. . tags
. . .
0.4
.
.
0.2
d = 126 bp
0.6
(b)
0.5
(a)
!
Volume 9, Issue 9, Ar
2.6
!
Poisson(x|λlocal )
2.5
!
∞
!
2.4
!
H0 : the ith location is not a binding site
H1 : the ith location is a binding site
. . . .
. . . .
. . . .
40
!
.
.
.
.
.
.
.
.
.
.
ChIP-seq peak: Illustration
!
An illustration of a strong TF binding site
Figure from http://www.nature.com/nmeth/journal/v6/n4/images/nmeth.f.247-F2.jpg
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Multiple testing
!
Multiple testing problem occurs when a test needs to be
applied many times
!
Nominal p-values are valid for a single test
!
Hypothesis testing will lead to many false positives if the
p-values are not corrected for multiple testing
!
Multiple testing is an issue in many bioinformatics application
e.g.
!
!
!
!
!
Differential gene expression analysis
TF binding site prediction
Detection of protein binding sites from ChIP-seq
Detecting disease associated genomic variant
...
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Multiple testing
!
Type I error
!
!
!
Null hypothesis H0 is true but it is rejected in favour of H1
E.g.: a genomic region i is not a binding site but it is declared
as a binding site
Assume m independent tests for which the null hypothesis is
true, then the probability that any of the hypothesis will be
rejected is
α = 1 − (1 − α)m
i.e., the probability of making one or more type I errors
!
This is also called the family-wise error rate (FWER)
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Bonferroni correction
(1)
(m)
!
Let H0 , . . . , H0 be a collection of hypotheses and
p1 , . . . , pm the corresponding p-values
!
Let I0 be the subset of the m0 = |I0 | (unknown) true null
hypotheses
!
Recall the Bonferroni correction:
!
!
Given the original significance level α and the number of
statistical tests m, then Bonferroni correction will reject only
those null hypothesis i for which pi ≤ α/m
For the Bonferroni correction FWER ≤ α because
⎛
⎞
$
α⎠ ! '
α(
α
⎝
FWER = P
pi ≤
≤
P pi ≤
= m0 ≤= α
m
m
m
I0
!
I0
The Bonferroni correction is conservative
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
False discovery rate
!
False discovery rate (FDR) is defined as the proportion of
false positives among all positives
#false positives
#false positives + #true positives
Formally FDR is defined as the expectation of the above
quantity
The Benjamini-Hochberg (BH) step-up procedure is
commonly used in bio applications
Let α be given and p(1) , p(2) , . . . , p(m) be the ordered (from
smallest to largest) list of the m p-values, then the BH
procedure works as follows
FDR =
!
!
!
1. Find the largest k such that p(k) ≤
2. Then reject all H(i) for i = 1, . . . , k
!
k
mα
For BH, the probability of expected proportion of false
positives ≤ α
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Multiple correction in MACS
!
For a ChIP-seq experiment with controls, MACS empirically
estimates the false discovery rate (FDR)
!
At each p-value, MACS uses the same parameters to find
ChIP peaks over control and control peaks over ChIP (that is,
a sample swap)
!
The empirical FDR is defined as
empirical FDR =
#control peaks
#ChIP peaks
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Differential binding
!
MACS can also be applied to differential binding between two
conditions by treating one of the samples as the control
!
Differential binding analysis in MACS will only work with two
samples, i.e., one biological replicate per condition
!
Empirical FDR control will not work in such a scenario
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
Summary
!
ChIP-seq is a powerful way to detect TF binding sites (or e.g.
enriched regions of histone modifications; next lecture)
!
ChIP-seq approaches are limited in that
!
!
Only a subset of all TFs have a chip-grade antibody
A single experiment will profile a single protein
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.
References
!
!
!
!
!
Metzker ML (2010) Sequencing technologies - the next generation, Nat Rev Genet. 11(1):31-46.
Park PJ (2009) ChIP-seq: advantages and challenges of a maturing technology, Nat Rev Genet.
10(10):669-80.
Axel Visel, Edward M. Rubin & Len A. Pennacchio (2009) Genomic views of distant-acting enhancers,”
Nature 461, 199-205.
Zhang Y et al. (2008), Model-based analysis of ChIP-Seq (MACS), Genome Biol. 9(9):R137.
Zhang X et al. (2011) PICS: probabilistic inference for ChIP-seq, Biometrics, 67(1): 151-163.
.
.
.
.
.
.
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
.
.
.
.
.
.
.
.
.
.