CpG Islands and Methylation
Methylation · IFT6299 H2014 · UdeM · Miklós Csűrös
CpG
CpG: a C-G dinucleotide on the same strand (C followed by G)
The distribution of CpGs varies across the human genome (and across mammalian genomes in general):
→ just before the start of a gene (ATG), CpGs are often dense
→ elsewhere in the genome, they are very rare*
* under the iid model, the frequency of CpG, like that of GpC, is π_C π_G
chromosome   nucleotides   C        G        C, G (each)   GpC            CpG
chrX         151.10M       29.81M   29.87M   19.7%         5.94M (3.9%)   1.25M (0.8%)
chr12        130.5M        26.63M   26.61M   20.4%         5.54M (4.2%)   1.28M (1.0%)
chr5         177.7M        35.09M   35.13M   19.8%         7.11M (4.0%)   1.51M (0.8%)
chr1         225.3M        47.02M   47.02M   20.9%         9.95M (4.4%)   2.28M (1.0%)
Wikipedia
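The footnote on the iid model can be checked against the chromosome counts given on this slide. The sketch below is only an illustration: the dictionary layout and the function names are hypothetical, and the number of dinucleotide positions is approximated by the number of nucleotides.

```python
# Counts in millions, copied from the chromosome table on this slide.
TABLE = {
    "chrX":  {"n": 151.10, "C": 29.81, "G": 29.87, "GpC": 5.94, "CpG": 1.25},
    "chr12": {"n": 130.5,  "C": 26.63, "G": 26.61, "GpC": 5.54, "CpG": 1.28},
    "chr5":  {"n": 177.7,  "C": 35.09, "G": 35.13, "GpC": 7.11, "CpG": 1.51},
    "chr1":  {"n": 225.3,  "C": 47.02, "G": 47.02, "GpC": 9.95, "CpG": 2.28},
}


def iid_expected(row):
    """Expected dinucleotide frequency pi_C * pi_G under the iid model."""
    return (row["C"] / row["n"]) * (row["G"] / row["n"])


def observed(row, dinucleotide):
    """Observed frequency, approximating the number of positions by n."""
    return row[dinucleotide] / row["n"]


for name, row in TABLE.items():
    print(
        f"{name}: iid expectation {iid_expected(row):.1%}, "
        f"GpC observed {observed(row, 'GpC'):.1%}, "
        f"CpG observed {observed(row, 'CpG'):.1%}"
    )
```

With these counts, GpC comes out close to its iid expectation on all four chromosomes, while CpG is roughly a fifth to a quarter of it.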
Methylation
Typically, the cytosine of a CpG is methylated, and it can then easily turn into thymine
Wikipedia
CpG islands
CpG enrichment in a region of length $\ell$: count the occurrences $n_C$, $n_G$ and $n_{CpG}$, then compute the observed/expected ratio
$$\mathrm{CpG(o/e)} = \ell \cdot \frac{n_{CpG}}{n_C \cdot n_G}$$
CpG island: defined by a high (G + C)%, an enrichment in CpG, and a minimum length
E.g.: (G + C)% ≥ 50%, $\ell$ ≥ 200, CpG(o/e) ≥ 0.6
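The observed/expected ratio and the example thresholds can be sketched in a few lines. This is a minimal illustration, assuming an upper- or lower-case DNA string; the function names are hypothetical, and the default thresholds are the example values from this slide.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_obs_exp(seq):
    """CpG(o/e) = length * n_CpG / (n_C * n_G); 0.0 if C or G is absent."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    n_cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if n_c == 0 or n_g == 0:
        return 0.0
    return len(seq) * n_cpg / (n_c * n_g)


def is_cpg_island(seq, min_len=200, min_gc=0.5, min_oe=0.6):
    """Apply the three criteria: length, (G + C)% and CpG(o/e)."""
    return (
        len(seq) >= min_len
        and gc_fraction(seq) >= min_gc
        and cpg_obs_exp(seq) >= min_oe
    )


print(is_cpg_island("ACGT" * 50))  # 200 bases, 50% G + C, CpG-rich
print(is_cpg_island("AT" * 100))   # no G or C at all
```

The region is tested as a whole here; locating islands inside a longer sequence needs a scan on top of this test.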
CpG islands: parameters
→ different parameter choices . . .
→ CpGs are frequent in mobile elements (e.g., Alu)

Excerpts from Takai & Jones:
[...] of genes. This table shows that modifying the criteria to a %GC ≥ 55% and a length ≥ 500 bp with ObsCpG/ExpCpG ≥ 0.65 resulted in the exclusion of the vast majority of Alus and of unknown sequences, while only slightly decreasing the number of CpG islands that occur within the 5′ regions of genes. The increased stringency also substantially reduced the number of exonic CpG islands. The biological functions of these islands are not well understood, but CpG islands located in nonpromoter regions can play significant roles in gene regulation (18); they also seem to be frequent targets for de novo methylation in cancer and aging (19). Therefore, although the increased stringency preferentially locates CpG islands in the 5′ regions of genes, it may also result in the loss of smaller regions of DNA from the data set that may be functionally important in gene [...]
[...] not. Rather, there is a continuum of 500-bp regions of DNA that move between this bulk DNA and the properties of a CpG island. The human genome showed the strongest suppression of CpG. Several sequences plotted in the lower left field of the plot of %GC vs. ObsCpG/ExpCpG of the human genome (Fig. 4A) turned out to be simple repetitive sequences such as (TA)n and (TTTAA)n (data not shown). CpG suppression in the human genome is caused not only by CpG depletion through evolution but also by the high content of simple repetitive sequences and the low rate of sequence utilization for genes. A. thaliana contains 5-methylcytosine, and its genome shows a wide distribution of the occurrence for CpG (Fig. 4B). However, because of the low GC content in this organism, few fragments fulfilling our criteria for CpG island are visible in the A. thaliana genome. In this [...]
Fig. 3. The modified criteria also helped remove Alu sequences previously identified as part of 5′ region CpG islands. In this example, a 1,233-bp fragment originally extracted by the algorithm included two Alu sequences with some CpG suppression associated with the nonhistone chromosome protein 2 like 1 (NHP2L1). The modified stringent criteria reduced the size of the island to 620 bp and excluded the Alu sequences.

Excerpt from Illingworth & Bird:
2. CGI identification
CGIs were first identified by digestion of mouse genomic DNA using the methyl-CpG sensitive restriction enzyme HpaII (CCGG recognition site). A small portion of the genome, composed of very highly fragmented DNA, was found to be derived from sequences containing clusters of non-methylated CpG sites [5,6,20]. Quantification of these digestion products, combined with sequence analysis and correction for contaminating DNA indicated that these were derived from approximately 26 300 discrete CGIs [21,22]. These sequences were characterised as being at least 200 bp in length and with a G + C content of 50% and a CpG frequency (observed/expected; [o/e]) of 0.6 (Fig. 1) [7,8].
The completion of the human genome project in 2001 facilitated in silico CGI prediction [4]. Values for length and base composition similar to those identified by Gardiner-Garden and Frommer are routinely employed by the major genome browsers to annotate CGIs (Table 1). Thresholds are somewhat arbitrary however, and the effect of varying these values can profoundly alter prediction accuracy [23–25]. To reduce the extraneous inclusion of non-CGI sequences Takai and Jones investigated the effect of increasing the minimum length, CpG[o/e] and G + C composition to 500 bp, 0.65 and 55%, respectively. This increased stringency reduced the number of identified islands by approximately 90% and largely excluded contaminating Alu elements. This algorithm also reduced the number of gene promoter associated islands, suggesting that bona fide CGIs were also being discarded [24].
Repeat elements such as "young" Alus resemble the base composition of CGIs and significantly contribute to the number of false positives identified [24]. Preliminary computational analysis of the human genome sequence identified 50 267 CGIs, of which only 28 890 were unique [4]. Many of the multi copy sequences could be removed by screening against known classes of repeats identified in the Repbase database [26]. This database is subject to iterative improvements due to updating the repeat repertoire. Reanalysis of the human genome sequence in 2002 resulted in the loss of a further 1890 false positives suggesting a more conservative estimate of 27 000 CGIs [27]. The beneficial consequences of repeat masking can be illustrated by the example of a low copy repetitive element that is related to the adenovirus sequence located on human chromosomes 4 and 19 [28]. This element is identified as a single CGI or a tandem cluster of repeated CGIs by [...]

Table 1
Overview of CpG island prediction algorithms.
Database/prediction   Length   G + C   CpG[o/e]   RM(a)   Comments
ENSEMBL               ≥400     ≥50%    ≥0.6       N       Stringent length constraint
NCBI relaxed          ≥200     ≥50%    ≥0.6       N       Total CGIs = 307 193
NCBI strict           ≥500     ≥50%    ≥0.6       N       Total CGIs = 24 163
UCSC(b)               >200     ≥50%    >0.6       Y       Total CGIs = 28 226
EMBOSS                UD(c)    UD      UD         NA      Variable parameters
CpGProD               >500     >50%    >0.6       Y       Total CGIs = 76 793
CpGcluster            NA       NA      NA         N       Clustering Total = 197 727
Takai and Jones       (row values not recoverable from the extraction)
Reference column as extracted: [88], [89], [90], [23], [25] (assignment to rows not recoverable)
(a) RM, repeat masked; Y, yes; N, no; NA, non applicable.
(b) Parameters used for CGI identification for the ENCODE project although totals vary due to repeat masking differences between hg17 and hg18 builds [87].
(c) UD, user defined.
Takai & Jones PNAS 99 :3740 (2002) ; Illingworth & Bird FEBS Lett 583 :1713 (2009)
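To see how the choice of parameters changes what gets reported, the sketch below scans a sequence with fixed-size windows under two threshold sets: the classical values (200 bp, 50%, 0.6) and the stricter values attributed to Takai and Jones (500 bp, 55%, 0.65). It is a deliberately simplified scan with hypothetical names (`Criteria`, `scan`); it is not the published algorithm of either paper, and it does no repeat masking.

```python
from typing import NamedTuple


class Criteria(NamedTuple):
    min_len: int    # window length in bp
    min_gc: float   # minimum (G + C) fraction
    min_oe: float   # minimum CpG observed/expected ratio


CLASSIC = Criteria(200, 0.50, 0.60)
STRICT = Criteria(500, 0.55, 0.65)


def passes(window, crit):
    """Check one window against the (G + C) and CpG(o/e) thresholds."""
    n_c, n_g = window.count("C"), window.count("G")
    if n_c == 0 or n_g == 0:
        return False
    gc = (n_c + n_g) / len(window)
    oe = len(window) * window.count("CG") / (n_c * n_g)
    return gc >= crit.min_gc and oe >= crit.min_oe


def scan(seq, crit, step=20):
    """Return merged (start, end) intervals covered by passing windows."""
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - crit.min_len + 1, step):
        end = start + crit.min_len
        if passes(seq[start:end], crit):
            if hits and start <= hits[-1][1]:
                hits[-1] = (hits[-1][0], end)  # extend the previous interval
            else:
                hits.append((start, end))
    return hits


# Toy sequence: a 240-bp CpG-rich stretch inside AT-rich flanks.
toy = "AT" * 300 + "CG" * 120 + "AT" * 300
print(scan(toy, CLASSIC))
print(scan(toy, STRICT))
```

On this toy input the classical thresholds report one interval around the CpG-rich stretch, while the stricter thresholds report nothing because no 500-bp window reaches 55% G + C. This mirrors, in a very crude way, the trade-off described in the excerpts: stricter criteria drop short CpG-rich elements, wanted or not.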
CpG islands: distribution
→ length ∼ 1 kbp
→ sometimes located inside a gene or in an intergenic region
[Figure annotations: 4 islands in a 65 kbp region; islands at gene promoters; one island inside a gene]
Fig. 1. CpG islands located within a region of human chromosome 19. The upper panel illustrates a 65 kb portion of human chromosome 19 (17195000–17260000) which contains five annotated genes (blue bars) and four CpG islands. The promoters of OCEL1, NR2F6 and ANKLE1 overlap with CGIs (i, iii and iv) and an additional CGI (ii) localises to the third exon of NR2F6. The classical sequence parameters applied to CGI prediction are illustrated (dashed red lines) for CpG (observed/expected; CpG[o/e] = 0.6) and G + C base composition (GC% = 50%). The lower panel represents an enlarged view of four 6 kb regions (i–iv) spanning each CGI and illustrates the distribution of CpG sites (vertical black strokes) relative to the annotated genes.
Illingworth & Bird FEBS Lett 583 :1713 (2009)
Methylation and transcription
[Figure: schematic of CpG sites along a gene. Region labels: Enhancer, Insulator, TSS, Gene body, Repetitive DNA. Annotations: H3K4me3 and H2A.Z block de novo methylation; NDR limits de novo methylation; LMR, variable 5mC; TET; CTCF, 5mC blocks CTCF binding; H2A.Z anticorrelated with 5mC; 5mC alters splicing; bound DNMTs (DNMT3A, DNMT3B) maintain methylation. Legend: unmethylated CpG, methylated CpG, variable CpG methylation, H3K4me1, H3K4me3, H2A.Z.]
Figure 1 | Molecular anatomy of CpG sites in chromatin and their roles in gene expression. About 60% of human genes have CpG islands (CGIs) at their promoters and frequently have nucleosome-depleted regions (NDRs) at the transcriptional start site (TSS). The nucleosomes flanking the TSS are marked by trimethylation of histone H3 at lysine 4 (H3K4me3), which is associated with active transcription, and the histone variant H2A.Z, which is antagonistic to DNA methyltransferases (DNMTs). Downstream of the TSS, the DNA is mostly CpG-depleted and is predominantly methylated in repetitive elements and in gene bodies. CGIs, which are sometimes located in gene bodies, mostly remain unmethylated but occasionally acquire 5-methylcytosine (5mC) in a tissue-specific manner (not shown). Transcription elongation, unlike initiation, is not blocked by gene body methylation, and variable methylation may be involved in controlling splicing. Gene bodies are preferential sites of methylation in the context CHG (where H is A, C or T) in embryonic stem cells (ref. 5), but the function is not understood (not shown). DNA methylation is maintained by DNMT1 and also by DNMT3A and/or DNMT3B, which are bound to nucleosomes containing methylated DNA (ref. 99). Enhancers tend to be CpG-poor and show incomplete methylation, suggesting a dynamic process of methylation or demethylation occurs, perhaps owing to the presence of ten-eleven translocation (TET) proteins in these regions, although this remains to be shown. They also have NDRs, and the flanking nucleosomes have the signature H3K4me1 mark and also the histone variant H2A.Z (refs 32, 100). The binding of proteins such as CTCF to insulators can be blocked by methylation of their non-CGI recognition sequences, thus leading to altered regulation of gene expression, but the generality of this needs further exploration. The sites flanking the CTCF sites are strongly nucleosome-depleted, and the flanking nucleosomes show a remarkable degree of phasing. The figure does not show the structure of CpG-depleted promoters or silenced CGIs, although in both cases the silent state is associated with nucleosomes at the TSS. LMR, low-methylated region.
Excerpt from the same review:
[...] nucleosome becomes present, and then this is followed by the recruitment of DNMT3A to this nucleosome and, subsequently, de novo methylation occurs. Whether a similar sequence of events occurs in cells that are not expressing DNMT3L is not yet known.
Furthermore, Ooi et al. (ref. 12) showed that de novo methylation could not occur on a nucleosome bearing the H3K4me2 or H3K4me3 marks, which are associated with active genes. The nucleosomes flanking the nucleosome-depleted start site often contain both the histone mark H3K4me3 and the histone variant H2A.Z, both of which are strongly anti-correlated with DNA methylation (refs 46, 47). The occurrence of the H3K4me3 mark in mice is possibly maintained by the presence of CXXC finger protein 1 (CXXC1; also known as CFP1), which recruits the H3K4 methyltransferase to the region, thus ensuring that the +1 and –1 nucleosomes contain marks that are incompatible with de novo DNA methylation (ref. 48). The unmethylated state of the CpG island is also presumably ensured by the presence of the TET1 protein, which is found at a high proportion of the TSSs of high-CpG-content promoters. Presumably, TET1 converts any 5mC that might be in this region into 5-hydroxymethylcytosine (ref. 49). The molecular anatomy of active CGIs can therefore explain why they are resistant to methylation (FIG. 1).
Of course, not all CGI-promoter genes are expressed in ESCs, and many are suppressed by the Polycomb complex, so why are these not de novo methylated? The answer probably lies in the fact that they contain the antagonistic H3K4me3 (REF. 12) and H2A.Z marks (refs 46, 47) and are also bound by TET1, which would ensure that they remain 5mC-free. Interestingly, this protection seems to break down during immortalization (ref. 50), and these CGIs become highly susceptible to de novo methylation, which increases after oncogenic transformation (refs 41–43).
This model predicts that the higher the level of expression is, the less likely it is that a CGI is to become de novo methylated. Direct evidence in support of this prediction has recently come from several exciting papers that have shown that monoallelic methylation of CGIs preferentially occurs on the allele that is less highly expressed. For example, Hitchins et al. (ref. 51) showed that an allele of the MLH1 gene containing a single-nucleotide variant in the promoter, which was less active than the more common allele in transfection experiments, was more likely to become methylated in the somatic cells of cancer-affected families. In other words, the less active allele was the one that was more likely to acquire de novo methylation. An alternative scenario was shown by Boumber et al. (ref. 52), who found that an allele of RIL (also known as PDLIM4) bearing a polymorphism in the promoter that created an additional binding site for the transcription factor SP1 or SP3 was much less likely to become de novo methylated than the allele without this polymorphism. The extra SP1 site therefore confers resistance of this allele to de novo methylation, although the authors could not demonstrate that the extra transcription factor binding site increased gene expression [...]
Jones Nat Rev Genet 13:484 (2012)
Methylation and stem cells
→ cytosine methylation also occurs at CH sites (H = {A, C, T})
→ it changes during cell differentiation
(H1 = stem cells; IMR90 = lung fibroblasts)

[Figure 1 panels. a: share of methylcytosines per context; H1: mCG 75.5%, the remaining 17.3% and 7.2% in the non-CG contexts (mCHG, mCHH), mC = 6.2 × 10^7, mCG = 4.7 × 10^7; IMR90: mCG 99.98%, mC = 4.5 × 10^7, mCG = 4.5 × 10^7. b: OCT4, 1 kb scale, Chr 6: 31,246,431. c: fraction of total mC against methylation level (%). d: mC per 10 kb and normalized mC density per 100 kb along chromosome 12, Watson and Crick strands.]
Figure 1 | Global trends of human DNA methylomes. a, The percentage of methylcytosines identified for H1 and IMR90 cells in each sequence context. b, AnnoJ browser representation of OCT4. c, Distribution of the methylation level in each sequence context. The y axis indicates the fraction of all methylcytosines that display each methylation level (x axis), where methylation level is the mC/C ratio at each reference cytosine (at least 10 reads required). d, Blue dots indicate methylcytosine density in H1 cells in 10-kb windows throughout chromosome 12 (black rectangle, centromere). Smoothed lines represent the methylcytosine density in each context in H1 and IMR90 cells. Black triangles indicate various regions of contrasting trends in CG and non-CG methylation. mC, methylcytosine.

[Figure 2 panels. Rows: MethylC-Seq, Bisulphite PCR. Samples: H1, H9, BMP4 (H1), IMR90, iPS (IMR90). Loci: Chr 1: 200,015,530–200,015,725 (W); Chr 3: 100,016,095–100,016,287 (W); Chr 10: 30,837,441–30,837,664 (W).]
Figure 2 | Bisulphite-PCR validation of non-CG DNA methylation in differentiated and stem cells. DNA methylation sequence context is displayed according to the key and the percentage methylation at each position is represented by the fill of each circle (see Supplementary Table 2 for values). Non-CG methylated positions indicated by an asterisk are unique to that cell type and '+4' indicates a mCHH that is shifted 4 bases downstream in H9 cells. iPS, induced pluripotent stem cell.

Excerpt from the article:
[...] CHH methylation identified in H1 cells and absent in IMR90 cells is not simply due to genetic differences between the two cell types, but rather that the presence of non-CG methylation is characteristic of an embryonic stem-cell state. For each cell type, two biological replicates were performed with cells of different passage number (see Supplementary Information), and comparison of the methylcytosines identified independently in each replicate revealed a high concordance of cytosine methylation status between replicates (Supplementary Fig. 2). For each cell type, the final DNA methylation map presented in this study represents the composite of the two biological replicates. The OCT4 gene (also called POU5F1) exemplifies both cell-specific differential methylation and the presence of non-CG methylation (Fig. 1b), and in addition displayed a ~50-fold reduction in OCT4 transcript in IMR90 cells (data not shown). The absence of mCHG and mCHH methylation in IMR90 cells coincided with significantly lower transcript abundance of the de novo DNA [...]
Lister, Pelizzola et al. Nature 462:315 (2009)
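The three sequence contexts used on this slide (CG, CHG, CHH, with H = A, C or T) depend only on the two bases following a cytosine. The sketch below classifies cytosines on one strand; the function names are hypothetical, and cytosines too close to the end of the sequence, or followed by an ambiguous base, are left unclassified.

```python
from collections import Counter

H = set("ACT")  # H stands for any base except G


def cytosine_context(seq, i):
    """Return 'CG', 'CHG', 'CHH', or None for the cytosine at index i."""
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError(f"position {i} is not a cytosine")
    following = seq[i + 1:i + 3]
    if following[:1] == "G":
        return "CG"
    if len(following) == 2 and following[0] in H:
        if following[1] == "G":
            return "CHG"
        if following[1] in H:
            return "CHH"
    return None  # context cannot be determined


def count_contexts(seq):
    """Count the cytosines of one strand by sequence context."""
    return Counter(
        cytosine_context(seq, i)
        for i, base in enumerate(seq.upper())
        if base == "C"
    )


print(count_contexts("CGCAGCTTC"))
```

Cytosines on the opposite strand can be handled the same way by running the function on the reverse complement.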
Detecting methylation
Table 1 | Main principles of DNA methylation analysis
Each pretreatment is listed with the methods available for each analytical step.

Enzyme digestion
  Locus-specific analysis: HpaII-PCR
  Gel-based analysis: Southern blot, RLGS, MS-AP-PCR, AIMS
  Array-based analysis: DMH, MCAM, HELP, MethylScope, CHARM, MMASS
  NGS-based analysis: Methyl–seq, MCA–seq, HELP–seq, MSCC
Affinity enrichment
  Locus-specific analysis: MeDIP-PCR
  Array-based analysis: MeDIP, mDIP, mCIP, MIRA
  NGS-based analysis: MeDIP–seq, MIRA–seq
Sodium bisulphite
  Locus-specific analysis: MethyLight, EpiTYPER, Pyrosequencing
  Gel-based analysis: Sanger BS, MSP, MS-SNuPE, COBRA
  Array-based analysis: BiMP, GoldenGate, Infinium
  NGS-based analysis: RRBS, BC–seq, BSPP, WGSBS

Excerpts from the review text visible around the table:
[...]tion information is erased during amplification. Some investigators have suggested that it could be feasible to maintain the pattern of methylation during PCR if an appropriate DNA methyltransferase were present in the [...]
[...]tion enzyme has an accompanying DNA methyltransferase that protects the endogenous DNA from the restriction defence system by methylating bases in the recognition site. Some restriction enzymes are inhibited [...]
AIMS, amplification of inter-methylated sites; BC–seq, bisulphite conversion followed by capture and sequencing; BiMP, bisulphite
methylation profiling; BS, bisulphite sequencing; BSPP, bisulphite padlock probes; CHARM, comprehensive high-throughput arrays
for relative methylation; COBRA, combined bisulphite restriction analysis; DMH, differential methylation hybridization; HELP, HpaII
tiny fragment enrichment by ligation-mediated PCR; MCA, methylated CpG island amplification; MCAM, MCA with microarray
hybridization; MeDIP, mDIP and mCIP, methylated DNA immunoprecipitation; MIRA, methylated CpG island recovery assay;
MMASS, microarray-based methylation assessment of single samples; MS-AP-PCR, methylation-sensitive arbitrarily primed PCR;
MSCC, methylation-sensitive cut counting; MSP, methylation-specific PCR; MS-SNuPE, methylation-sensitive single nucleotide
primer extension; NGS, next-generation sequencing; RLGS, restriction landmark genome scanning; RRBS, reduced representation
bisulphite sequencing; –seq, followed by sequencing; WGSBS, whole-genome shotgun bisulphite sequencing.
Laird Nat Rev Genet 11 :191 (2010)
Bisulfite sequencing
BMC Bioinformatics 2009, 10:232
http://www.biomedcentral.com/1471-2105/10/232

Cm: methylated C; C: unmethylated C

1) Denaturation
Watson  >>ACmGTTCGCTTGAG>>
Crick   <<TGCmAAGCGAACTC<<
2) Bisulfite treatment
BSW     >>ACmGTTUGUTTGAG>>
BSC     <<TGCmAAGUGAAUTU<<
3) PCR amplification
BSW     >>ACmGTTTGTTTGAG>>
BSWR    <<TG CAAACAAACTC<<
BSC     <<TGCmAAGTGAATTT<<
BSCR    >>ACG TTCACTTAAA>>

Figure 1. Pipeline of bisulfite sequencing. 1) Denaturation: separating Watson and Crick strands; 2) Bisulfite treatment: converting un-methylated cytosines (blue) to uracils; methylated cytosines (red) remain unchanged; 3) PCR amplification of bisulfite-treated sequences resulting in four distinct strands: Bisulfite Watson (BSW), bisulfite Crick (BSC), reverse complement of BSW (BSWR), and reverse complement of BSC (BSCR).
Xi & Li BMC Bioinformatics 10:232 (2009)

Excerpts from the article text visible around the figure:
[...] detect the methylation pattern of every C in the genome. Nevertheless, the mapping of millions of bisulfite reads to [...]
[...] another 19% are Gs, only ~1.8% of dinucleotides are CpG dinucleotides. Because C methylation occurs almost [...]
[...] as mismatches, where a C in the BS-read is aligned to a T in the reference [2]. Although this all-inclusive C/T conversion is effective for reads derived from the C-poor strands, it is not appropriate for reads derived from the G-poor strands, where all the Cs are actually transcribed from Gs by PCR amplification and thus could not be converted to Ts during bisulfite treatment. During shotgun sequencing, however, a bisulfite read is almost equally likely to be derived from either the C-poor or the G-poor strands. There is no precise way to determine the original [...]
[...]tions have to be recorded, even the non-unique mappings. Therefore, this approach is only practical for small reference sequences, where only the C-poor strands are sequenced. For example, Meissner et al. used this mapping strategy for reduced representation bisulfite sequencing (RRBS) [2], where the genomic DNA was digested by the MspI restriction enzyme and 40–220 bp segments were selected for sequencing. The reference sequence (~27 M nt) is only about 1% of the whole mouse genome, covering 4.8% of the total CpG dinucleotides.
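The three steps of the bisulfite pipeline can be replayed on the 13-base example from the figure. The sketch below is an illustration only: the function names are hypothetical, methylated positions are passed as 0-based indices on each strand written 5'→3', and uracil is written directly as T, which is how it reads after PCR.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def bisulfite_convert(strand, methylated):
    """Turn every unmethylated C into T; methylated Cs are left unchanged."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(strand)
    )


# 1) Denaturation: the two strands are handled separately.
watson = "ACGTTCGCTTGAG"             # methylated C at index 1
crick = reverse_complement(watson)   # methylated C at index 10

# 2) Bisulfite treatment, then 3) PCR: four distinct strands.
bsw = bisulfite_convert(watson, methylated={1})
bsc = bisulfite_convert(crick, methylated={10})
bswr = reverse_complement(bsw)
bscr = reverse_complement(bsc)

for name, strand in [("BSW", bsw), ("BSWR", bswr), ("BSC", bsc), ("BSCR", bscr)]:
    print(f"{name:5s}5'-{strand}-3'")
```

All four strands are printed 5'→3' here, whereas the figure draws BSWR and BSC in the 3'→5' direction (the `<<` markers), so those two appear reversed relative to the slide.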
Bisulfite sequencing 2
Asymmetric alignment . . .

1) Multiple mapping
Bisulfite read  >>ATTTCG>>
Reference       >>ATACTTCGATGATCTCGCAAGACTCCGGC>>
(the read ATTTCG aligns at three positions of the reference)
2) Mapping asymmetry
Bisulfite read  T can be aligned with reference C; bisulfite read C cannot be aligned with reference T

Figure 2. Mapping of bisulfite reads. 1) Increased search space due to the cytosine-thymine conversion in the bisulfite treatment. 2) Mapping asymmetry: thymines in bisulfite reads can be aligned with cytosines in the reference (illustrated in blue) but not the reverse.
Xi & Li BMC Bioinformatics 10:232 (2009)
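Both effects on this slide, multiple mapping and asymmetry, follow from one matching rule: a T in the read may sit on a C or a T of the reference, while a C in the read requires a C. The brute-force sketch below applies that rule to the read and reference of the figure; the function names are hypothetical, and only reads from the converted (C to T) strand are considered, with no mismatches or gaps allowed.

```python
def bisulfite_match(read_base, ref_base):
    """Asymmetric rule: read T matches reference C or T, never the reverse."""
    return read_base == ref_base or (read_base == "T" and ref_base == "C")


def candidate_positions(read, reference):
    """0-based start positions where the read aligns under the rule above."""
    return [
        start
        for start in range(len(reference) - len(read) + 1)
        if all(
            bisulfite_match(r, g)
            for r, g in zip(read, reference[start:start + len(read)])
        )
    ]


reference = "ATACTTCGATGATCTCGCAAGACTCCGGC"
print(candidate_positions("ATTTCG", reference))  # [2, 11, 21]
print(candidate_positions("ACTTCG", "ATTTCG"))   # [] since read C cannot match reference T
```

The read has no exact occurrence in the reference, yet it aligns at three places once the conversion is tolerated, which is the enlarged search space the figure refers to.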
BSMAP
hashing + bit manipulation to detect matches/mismatches (a mask is stored for the reference)
BMC Bioinformatics 2009, 10:232
http://www.biomedcentral.com/1471-2105/10/232
Figure 3. BSMAP algorithm. A) Bisulfite seed table, using the original seed and bisulfite variants as keys and corresponding coordinates in the reference genome as values. Each read was looked up in the seed table for potential mapping positions. B) A positional specific mask of the corresponding reference sequence was generated by setting 01 to C (light blue) and 11 to A, G, T (black). The original read was masked by a bitwise AND operation with the positional specific mask. C) The reference sequence and the masked read were compared with a bitwise XOR operation. Non-zero XOR results were counted as mismatches (red). Bisulfite alignment is marked in green.
Xi & Li BMC Bioinformatics 10:232 (2009)
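Steps B and C of the caption can be sketched with plain integers. The caption gives the mask values (01 at reference C, 11 elsewhere) but not the base encoding; the sketch assumes A=00, C=01, G=10, T=11, under which masking a read T at a reference C yields the code of C. All names are hypothetical, and only C to T conversion on one strand is handled.

```python
# Assumed 2-bit encoding (not stated in the caption): with it, T & 01 == C.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def reference_mask(ref):
    """Position-specific mask: 01 where the reference has C, 11 elsewhere."""
    mask = 0
    for base in ref:
        mask = (mask << 2) | (0b01 if base == "C" else 0b11)
    return mask


def bisulfite_mismatches(read, ref):
    """Count mismatches after masking the read (AND) and comparing (XOR)."""
    if len(read) != len(ref):
        raise ValueError("read and reference segment must have equal length")
    diff = (encode(read) & reference_mask(ref)) ^ encode(ref)
    count = 0
    while diff:
        if diff & 0b11:  # a non-zero 2-bit group is one mismatching base
            count += 1
        diff >>= 2
    return count


print(bisulfite_mismatches("ATTTCG", "ACTTCG"))  # read T over reference C
print(bisulfite_mismatches("ACTTCG", "ATTTCG"))  # read C over reference T
```

The appeal of this design is that the comparison costs a handful of word-level operations per read instead of a per-base loop with a special case for C/T; the Python loop above only stands in for a population count on the XOR result.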
A scoring scheme (log-odds scores, LODS) specific to bisulfite conversion can also be derived:
Modified scoring: C:T and T:T (reference:read) both count as matches, with score 3

Excerpt from the article:
[...] are unmethylated and thus converted. Finally, we set T = 10/ln(10), which is the same scale as 'phred' scores (10). We used the score matrix in Table 1, which approximately fits these settings.

Table 1. Score matrix for aligning bisulfite-converted DNA reads to a reference genome sequence
        a     c     g     t
  a     6   −18   −18   −18
  c   −18     6   −18     3
  g   −18   −18     6   −18
  t   −18   −18   −18     3
Columns refer to bases in the read, and rows refer to bases in the genome (the reference).
Frith, Mori & Asai Nucleic Acids Res 40 :e100 (2012)
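The score matrix of Table 1 can be applied directly to a gapless alignment. The sketch below is only an illustration with hypothetical names; it scores one fixed placement of a read on the genome and does not implement the aligner of the paper.

```python
# SCORES[genome_base][read_base], copied from Table 1.
SCORES = {
    "a": {"a": 6, "c": -18, "g": -18, "t": -18},
    "c": {"a": -18, "c": 6, "g": -18, "t": 3},
    "g": {"a": -18, "c": -18, "g": 6, "t": -18},
    "t": {"a": -18, "c": -18, "g": -18, "t": 3},
}


def ungapped_score(genome_segment, read):
    """Sum the per-base scores of a read placed on a genome segment."""
    if len(genome_segment) != len(read):
        raise ValueError("segment and read must have equal length")
    return sum(
        SCORES[g][r] for g, r in zip(genome_segment.lower(), read.lower())
    )


print(ungapped_score("acgt", "acgt"))  # 6 + 6 + 6 + 3
print(ungapped_score("acgt", "atgt"))  # genome c under read t scores 3
print(ungapped_score("atgt", "acgt"))  # genome t under read c scores -18
```

One way to read the matrix is that a t in the read is weaker evidence than the other bases, since it can come from a genomic t or from a converted c; both cases receive 3 instead of 6, while a read c over a genomic t is penalized like any other mismatch.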
REVIEWS
Figure 2 | Two alternative strategies for bisulphite
alignment. a | An illustrative example of bisulphite
sequencing for a DNA fragment with known DNA
methylation levels at four CpGs and a total of eight
bisulphite-sequencing reads. For easier visualization,
the sequencing reads are four bases long (realistic
numbers would be 50 to 200 bases), and the size of the
genomic DNA sequence is just 23 bases (3 gigabases
would be a realistic number for the human genome).
b | Alignment of the bisulphite-sequencing reads
(centre) to the reference sequence (top) using a
wild-card aligner that tolerates zero mismatches and
zero gaps. The aligner replaces each C in the reference
sequence by the wild-card letter Y, which can match
both C and T in the read sequences. Reads with more
than one perfect alignment with the reference sequence
are discarded (greyed out), and for each CpG in the
genomic DNA sequence, the DNA methylation level
(bottom) is calculated as the percentage of aligning Cs
among all uniquely mapped reads. Note that the third
CpG is incorrectly assigned a DNA methylation level of
100%, which is due to the fact that the unmethylated
read was discarded as ambiguous, whereas the
methylated read could be uniquely mapped. c | The
same alignment carried out by a three-letter aligner,
which also tolerates zero mismatches and zero gaps.
The aligner replaces each C in the reference sequence
by an upper-case T and each C in the sequencing
reads by a lower-case t, with no distinction being made
between upper-case T and lower-case t during the
alignment. As a result of the reduced sequencing
complexity with only three letters remaining, a larger
number of reads align to more than one position in the
reference sequence and are discarded. The three-letter
alignment avoids incorrect results in this example, but
it fails to provide any values for the first and third CpG.
(As an alternative to discarding ambiguous reads, it is
also possible to assign them randomly to one of the
best-matching positions; in the current example,
the wild-card alignment would provide correct results
50% of the time, whereas the three-letter alignment
exhibits higher uncertainty and would be correct only
6.25% of the time.)
M values
Logistically transformed β-values. The transformation
mitigates some statistical problems of the β-value
(namely, limited value range and strongly bimodal
distribution) at the cost of reduced biological
interpretability.
Batch effects
Systematic biases in the data that are unrelated to the
research question but that arise from undesirable (and
often unrecognized) differences in sample handling.
advisable to process samples in an order that minimizes
confounding between potential sources of batch effects
(for example, processing date and microarray batch)
and the phenotype of interest (for example, cases versus
controls) and to use tools for batch effect removal,
which can substantially increase robustness and
statistical power (refs 50,52,53). Other common biases in
bisulphite microarray data include nonspecific binding of
DNA fragments to multiple probes (which has been shown
to cause false positives for sex-specific DNA methylation
on the autosomes (ref. 54)) and the presence of genetic
variants affecting probe binding or read-out.
Ambiguity
a Setup of the example
Genomic DNA sequence CCGATGATGTCGCTGACGCACGA
DNA methylation level
100%
50%
50% 0%
DNA fragmentation, selective
conversion of unmethylated
Cs into Ts, DNA sequencing
Placement is difficult: more C:T and T:T matches, regions of reduced complexity (CpG)
(asymmetric mismatch: C:T and T:T are OK, but not T:C)
The impact of these technical issues can be minimized by removing all
probes that exhibit a high sequence identity with multiple genomic
regions as well as those overlapping with common genetic variants.
Confounding
A nonrandom relationship between the phenotype of interest and
external factors (for example, batch effects or population structure)
that can give rise to spurious associations.
Processing enrichment-based data. Enrichment-based assays for DNA
methylation mapping use various methods for enriching DNA in a
methylation-specific manner.
Bisulphite-sequencing reads ACGT,ATGA,ATGA,ATGT,
TCGA,TCGA,TCGT,TTGT
b Wild-card alignment
Reference sequence
YYGATGATGTYGYTGAYGYAYGA
Read alignment
TCGA
TCGA
TCGT
TTGT
ATGT
(ambiguous placement)
DNA methylation level 100%
50%
ACGT
ATGT
ATGA
ATGA
100% 0%
c Three-letter alignment
Reference sequence
TTGATGATGTTGTTGATGTATGA
Read alignment
TtGA
TtGA
TtGA
TtGA
TtGT
TTGT
AtGT
AtGT
ATGT
ATGT
ATGA
ATGA
DNA methylation level N/A
50%
N/A
0%
Nature Reviews | Genetics
(symmetric mismatch: reduced alphabet with C converted to T everywhere)
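The two matching rules of panels b and c can be sketched in a few lines of Python. This is a minimal illustration on the toy data of the figure, not the implementation of any real bisulphite aligner; the function names `base_matches`, `hits` and `call_methylation` are hypothetical. Because the sketch searches exhaustively, it also finds the exact occurrence of ATGA at offset 3, so it treats the two ATGA reads as ambiguous and reports no value for the fourth CpG, where the figure shows 0%.

```python
GENOME = "CCGATGATGTCGCTGACGCACGA"
READS = ["ACGT", "ATGA", "ATGA", "ATGT", "TCGA", "TCGA", "TCGT", "TTGT"]

def base_matches(ref_base, read_base, mode):
    """Compare one reference base with one read base under the given rule."""
    if ref_base == read_base:
        return True
    if mode == "wildcard":
        # Asymmetric: a reference C (wild card Y) accepts a read T, never the reverse.
        return ref_base == "C" and read_base == "T"
    if mode == "three-letter":
        # Symmetric: C and T are the same letter in reference and read.
        return {ref_base, read_base} == {"C", "T"}
    raise ValueError(f"unknown mode: {mode}")

def hits(read, genome, mode):
    """All offsets where the read aligns with zero mismatches and zero gaps."""
    return [p for p in range(len(genome) - len(read) + 1)
            if all(base_matches(genome[p + i], b, mode) for i, b in enumerate(read))]

def call_methylation(reads, genome, mode):
    """Map each CpG offset to (methylated reads, covering reads), unique hits only."""
    counts = {}
    for read in reads:
        positions = hits(read, genome, mode)
        if len(positions) != 1:
            continue  # ambiguous or unaligned reads are discarded
        start = positions[0]
        for i, base in enumerate(read):
            g = start + i
            if genome[g:g + 2] == "CG":
                c = counts.setdefault(g, [0, 0])
                c[0] += base == "C"  # an unconverted C means methylated
                c[1] += 1
    return {g: tuple(c) for g, c in counts.items()}

if __name__ == "__main__":
    for mode in ("wildcard", "three-letter"):
        print(mode, call_methylation(READS, GENOME, mode))
```

On this input the wild-card rule yields 2/2, 1/2 and 1/1 methylated reads for the first three CpGs (the third being the overestimate discussed in the caption), while the three-letter rule keeps only the second CpG, at 1/2.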
Methylated DNA can be enriched using methylation-specific antibodies (in methylated DNA immunoprecipitation coupled with high-throughput sequencing
(MeDIP–seq)), methyl-CpG-binding domain (MBD)
proteins (in MBD sequencing (MBD-seq)) or a restriction enzyme that specifically cuts methylated DNA (in
methylation-dependent restriction enzyme sequencing
(McrBC-seq)). Alternatively, unmethylated DNA can be
enriched using restriction enzymes that specifically cut
unmethylated DNA (for example, in HpaII tiny fragment
enrichment by ligation-mediated PCR coupled with
sequencing (HELP–seq)). Next-generation sequencing
of the resulting DNA libraries counts the frequency of
specific DNA fragments in each library and provides
the raw data from which DNA methylation levels can be
inferred. In contrast to bisulphite sequencing, the DNA
methylation information is not contained in the read
sequence but in the enrichment or depletion of sequencing reads that map to specific regions of the genome.
As a result, enrichment-based methods require careful
Bock Nat Rev Genet 13:705 (2012)
xiv
Inference
The main goal is to detect regions whose methylation differs (differentially methylated regions, DMRs) between samples (hundreds of genomes)
a | DNA methylation levels along the genomic DNA sequence (… CG … CG …)
            CG1   CG2   CG3   CG4   CG5   CG6   CG7   CG8   CG9   CG10
Cases
Sample 1     3%    6%   80%   57%    1%    0%    1%    1%   42%   78%
Sample 2     2%    0%   50%   74%    0%    1%    0%    0%   38%   85%
Sample 3     0%    1%   95%   86%    2%    0%    0%    0%   41%   67%
Controls
Sample 4     0%    2%    8%    1%   12%    3%   15%    8%   36%   72%
Sample 5     1%    4%    5%    2%   15%    5%   33%   11%   39%   94%
Sample 6     0%    2%   13%    1%   19%    2%   24%   22%   33%   92%
b | Single-CpG analysis
                              CG1    CG2    CG3    CG4    CG5    CG6    CG7    CG8    CG9    CG10
Higher in cases (q value)     0.333  0.993  0.085  0.068  0.993  0.993  0.993  0.993  0.196  0.993
Higher in controls (q value)  0.993  0.732  0.993  0.993  0.070  0.104  0.104  0.110  0.993  0.351
c | Genome-wide tiling analysis (tiling regions 1 to 8; q values in figure order)
Higher in cases (q value)     0.549  0.048*  0.988  0.988  0.549  0.988
Higher in controls (q value)  0.768  0.993   0.067  0.067  0.299  0.299
d | Annotated genome analysis (Enhancer, Promoter region)
q-value: estimate of the false discovery rate (FDR), i.e. the expected proportion of false positives among the results called significant
Bock Nat Rev Genet 13:705 (2012)
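One common way to obtain q values from a list of p-values is the Benjamini-Hochberg step-up adjustment. The sketch below assumes that procedure and uses made-up p-values; the source does not state which test or correction produced the q values in the figure, and `bh_qvalues` is a hypothetical helper name.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q values), in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

if __name__ == "__main__":
    # Hypothetical p-values, one per tested CpG or region.
    example = [0.01, 0.04, 0.03, 0.005]
    print(bh_qvalues(example))
```

Testing fewer, larger units (tiling windows or annotated regions instead of single CpGs) lowers the number of tests m, which is one reason region-level q values can reach significance where single-CpG q values do not.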
xv
Inference 2
(stronger signal in windows or according to annotation)
[Figure 3 of Bock (2012), repeated from the previous slide, here with panel d and the caption.]
d | Annotated genome analysis: Enhancer, Promoter region, First exon. The q values recoverable for this panel include 0.024*, 0.045*, 0.353 and 0.986 (their assignment to regions and rows is not recoverable).
Figure 3 | Effective identification of differentially methylated regions in a highly annotated genome.
a | An illustrative example of differences in DNA methylation within the promoter region of a gene and at an upstream
enhancer. For easier visualization, DNA methylation data are shown for only three cases and three controls (a realistic
number would be hundreds of samples) and for ten CpGs in total (dozens to hundreds of CpGs are realistic numbers for a
typical promoter region). b | When the DNA methylation levels between cases and controls are compared at the resolution
of single CpGs, all multiple-testing-corrected q values exceed 0.05 and are therefore considered to be insignificant.
c | When combining statistical evidence from neighbouring CpGs over a fixed distance (tiling regions highlighted in green),
one region is identified as being significantly more highly methylated among the cases compared to the controls
Nature Reviews | Genetics
Bock Nat Rev Genet 13:705 (2012)
xvi
Inference with an HMM?
Figure 1. Overview of Bisulfighter. (a) mC calling. Bisulfite-converted
reads are aligned to a reference genome, and the mC level is estimated
as a ratio of C–C matches. A major feature is the utilization
of alignment probability for filtering out unreliable alignments, and for
→ 2 samples
→ emissions: restricted to CpGs only
binomial distribution (m methylated reads out of n, state-specific level θ)
\[ \binom{n}{m}\,\theta^{m}(1-\theta)^{n-m} \]
→ geometric duration of staying in a state
→ LODS score of a region:
\[ \log \frac{P\{\text{region UP}\}}{P\{\text{region NoCH}\}} \]
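The emission and duration models listed above can be written out as a small Python sketch. It assumes one methylation level θ per state and independent CpGs, which is a simplification for illustration and not Bisulfighter's actual two-sample computation; the function names and the example values of θ are hypothetical.

```python
from math import comb, log

def binomial_emission(m, n, theta):
    """P(m methylated reads out of n | state-specific level theta)."""
    return comb(n, m) * theta**m * (1 - theta)**(n - m)

def geometric_duration(d, p_stay):
    """P(a state lasts exactly d consecutive CpGs), self-transition p_stay."""
    return p_stay**(d - 1) * (1 - p_stay)

def region_log_odds(counts, theta_a, theta_b):
    """Log-odds of level theta_a against theta_b over a region.

    counts is a list of (m, n) pairs, one per CpG; CpGs are treated as
    independent (an assumption of this sketch), and both levels must lie
    strictly between 0 and 1 so that the logarithm is defined.
    """
    return sum(
        log(binomial_emission(m, n, theta_a) / binomial_emission(m, n, theta_b))
        for m, n in counts
    )

if __name__ == "__main__":
    region = [(8, 10), (9, 10)]  # hypothetical read counts at two CpGs
    print(binomial_emission(3, 4, 0.5))
    print(geometric_duration(3, 0.9))
    print(region_log_odds(region, 0.8, 0.2))
```

A positive log-odds value favours the first level; with a self-transition probability p_stay, the expected stay in a state is 1 / (1 - p_stay) CpGs.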
Saito, Tsuji & Mituyama Nucleic Acids Res (2014)
xvii