CpG ISLANDS AND METHYLATION
Methylation · IFT6299 H2014 · UdeM · Miklós Csűrös

CpG
→ a C-G dinucleotide on the same strand is written CpG
→ the distribution of CpGs varies along the human genome (and along mammalian genomes in general)
→ just upstream of the start of a gene (ATG), CpGs are often dense
→ elsewhere in the genome, CpGs are very rare*
* Under the iid model, the expected frequency of CpG, like that of GpC, is πC·πG.

chromosome   nucleotides   composition                       GpC            CpG
chrX         151.10M       C = 29.81M, G = 29.87M (19.7%)    5.94M (3.9%)   1.25M (0.8%)
chr12        130.5M        C = 26.63M, G = 26.61M (20.4%)    5.54M (4.2%)   1.28M (1.0%)
chr5         177.7M        C = 35.09M, G = 35.13M (19.8%)    7.11M (4.0%)   1.51M (0.8%)
chr1         225.3M        C = 47.02M, G = 47.02M (20.9%)    9.95M (4.4%)   2.28M (1.0%)

The GpC frequencies (about 4%) agree with πC·πG, whereas CpG is four to five times rarer.
(Image: Wikipedia)

Methylation
→ typically, the cytosine of a CpG is methylated, and it can easily turn into thymine
(Image: Wikipedia)

CpG islands
→ enrichment in a region of length ℓ: count the occurrences
$$\mathrm{CpG(o/e)} = \frac{\ell \cdot n_{\mathrm{CpG}}}{n_{\mathrm{C}} \cdot n_{\mathrm{G}}}$$
→ CpG island: defined by a high (G + C)%, CpG enrichment and a minimum length
→ e.g. (G + C)% ≥ 50%, ℓ ≥ 200, CpG(o/e) ≥ 0.6

CpG islands: parameters
→ different parameters are in use . . .
→ CpGs are frequent in mobile elements (e.g. Alu)

Excerpt shown on the slide (Takai & Jones):
[…] This table shows that modifying the criteria to a %GC ≥ 55% and a length ≥ 500 bp with ObsCpG/ExpCpG ≥ 0.65 resulted in the exclusion of the vast majority of Alus and of unknown sequences, while only slightly decreasing the number of CpG islands that occur within the 5′ regions of genes. The increased stringency also substantially reduced the number of exonic CpG islands. The biological functions of these islands are not well understood, but CpG islands located in nonpromoter regions can play significant roles in gene regulation (18); they also seem to be frequent targets for de novo methylation in cancer and aging (19). Therefore, although the increased stringency preferentially locates CpG islands in the 5′ regions of genes, it may also result in the loss of smaller regions of DNA from the data set that may be functionally important in gene […]
[…] Rather, there is a continuum of regions of DNA that move between this bulk DNA and the properties of a CpG island. The human genome showed the strongest suppression of CpG. Several sequences plotted in the lower left field of the plot of %GC vs. ObsCpG/ExpCpG of the human genome (Fig. 4A) turned out to be simple repetitive sequences such as (TA)n and (TTTAA)n (data not shown). CpG suppression in the human genome is caused not only by CpG depletion through evolution but also by the high content of simple repetitive sequences and the low rate of utilization of sequence for genes. A. thaliana contains 5-methylcytosine, and its genome shows a wide distribution of the occurrence for CpG (Fig. 4B). However, because of the low GC content in this organism, few fragments fulfilling our criteria for a CpG island are visible in the A. thaliana genome. […]

Fig. 3 (Takai & Jones). The modified criteria also helped remove Alu sequences previously identified as part of 5′ region CpG islands. In this example, a 1,233-bp fragment originally extracted by the algorithm included two Alu sequences with some CpG suppression associated with the nonhistone chromosome protein 2 like 1 (NHP2L1). The modified stringent criteria reduced the size of the island to 620 bp and excluded the Alu sequences.

Excerpt shown on the slide (Illingworth & Bird):
2. CGI identification
CGIs were first identified by digestion of mouse genomic DNA using the methyl-CpG sensitive restriction enzyme HpaII (CCGG recognition site). A small portion of the genome, composed of very highly fragmented DNA, was found to be derived from sequences containing clusters of non-methylated CpG sites [5,6,20]. […] These sequences were characterised as being at least 200 bp in length and with a G + C content of 50% and a CpG frequency (observed/expected; [o/e]) of 0.6 (Fig. 1) [7,8]. The completion of the human genome project in 2001 facilitated in silico CGI prediction [4]. Values for length and base composition similar to those identified by Gardiner-Garden and Frommer are routinely employed by the major genome browsers to annotate CGIs (Table 1). Thresholds are somewhat arbitrary however, and the effect of varying these values can profoundly alter prediction accuracy [23–25]. To reduce the extraneous inclusion of non-CGI sequences Takai and Jones investigated the effect of increasing the minimum length, CpG[o/e] and G + C composition to 500 bp, 0.65 and 55%, respectively. This increased stringency reduced the number of identified islands by approximately 90% and largely excluded contaminating Alu elements. This algorithm also reduced the number of gene promoter associated islands, suggesting that bona fide CGIs were also being discarded [24].
Repeat elements such as "young" Alus resemble the base composition of CGIs and significantly contribute to the number of false positives identified [24]. Preliminary computational analysis of the human genome sequence identified 50 267 CGIs, of which only 28 890 were unique [4]. Many of the multi copy sequences could be removed by screening against known classes of repeats identified in the Repbase database [26]. This database is subject to iterative improvements due to updating the repeat repertoire. Reanalysis of the human genome sequence in 2002 resulted in the loss of a further 1890 false positives suggesting a more conservative estimate of 27 000 CGIs [27]. The beneficial consequences of repeat masking can be illustrated by the example of a low copy repetitive element that is related to the adenovirus sequence located on human chromosomes 4 and 19 [28]. This element is identified as a single CGI or a tandem cluster of repeated CGIs by […]

Table 1 (Illingworth & Bird). Overview of CpG island prediction algorithms.
Database/prediction   Length (bp)   G + C   CpG[o/e]   RM(a)   Comments
ENSEMBL               ≥400          ≥50%    ≥0.6       N       Stringent length constraint
NCBI relaxed          ≥200          ≥50%    ≥0.6       N       Total CGIs = 307 193
NCBI strict           ≥500          ≥50%    ≥0.6       N       Total CGIs = 24 163
UCSC(b)               >200          ≥50%    >0.6       Y       Total CGIs = 28 226
EMBOSS                UD(c)         UD      UD         NA      Variable parameters
CpGProD               >500          >50%    >0.6       Y       Total CGIs = 76 793
CpGcluster            NA            NA      NA         N       Clustering; total = 197 727
References cited in the table: [88], [89], [90], [23], [25].
(a) RM, repeat masked; Y, yes; N, no; NA, not applicable.
(b) Parameters used for CGI identification for the ENCODE project, although totals vary due to repeat masking differences between hg17 and hg18 builds [87].
(c) UD, user defined.

Takai & Jones PNAS 99:3740 (2002); Illingworth & Bird FEBS Lett 583:1713 (2009)

CpG islands: distribution
→ length ∼ 1 kb
→ sometimes located inside a gene or in an intergenic region
Annotations on the figure: 4 islands in a 65 kb region; islands at gene promoters; one island inside a gene.

Fig. 1. CpG islands located within a region of human chromosome 19. The upper panel illustrates a 65 kb portion of human chromosome 19 (17195000–17260000) which contains five annotated genes (blue bars) and four CpG islands. The promoters of OCEL1, NR2F6 and ANKLE1 overlap with CGIs (i, iii and iv) and an additional CGI (ii) localises to the third exon of NR2F6. The classical sequence parameters applied to CGI prediction are illustrated (dashed red lines) for CpG (observed/expected; CpG[o/e] = 0.6) and G + C base composition (GC% = 50%).
The lower panel represents an enlarged view of four 6 kb regions (i–iv) spanning each CGI and illustrates the distribution of CpG sites (vertical black strokes) relative to the annotated genes.
Illingworth & Bird FEBS Lett 583:1713 (2009)

Methylation and transcription

[Figure: schematic of a locus with an enhancer, an insulator (CTCF), the TSS, the gene body and repetitive DNA. Labels: H3K4me3 and H2A.Z block de novo methylation (TSS); NDR limits de novo methylation; H2A.Z anticorrelated with 5mC; 5mC blocks CTCF binding; 5mC alters splicing (gene body); bound DNMTs maintain methylation; TET; LMR, variable 5mC. Key: unmethylated CpG, methylated CpG, variable CpG methylation, H3K4me1, H3K4me3, H2A.Z, DNMT3A, DNMT3B.]

Figure 1 | Molecular anatomy of CpG sites in chromatin and their roles in gene expression. About 60% of human genes have CpG islands (CGIs) at their promoters and frequently have nucleosome-depleted regions (NDRs) at the transcriptional start site (TSS). The nucleosomes flanking the TSS are marked by trimethylation of histone H3 at lysine 4 (H3K4me3), which is associated with active transcription, and the histone variant H2A.Z, which is antagonistic to DNA methyltransferases (DNMTs). Downstream of the TSS, the DNA is mostly CpG-depleted and is predominantly methylated in repetitive elements and in gene bodies. CGIs, which are sometimes located in gene bodies, mostly remain unmethylated but occasionally acquire 5-methylcytosine (5mC) in a tissue-specific manner (not shown). Transcription elongation, unlike initiation, is not blocked by gene body methylation, and variable methylation may be involved in controlling splicing. Gene bodies are preferential sites of methylation in the context CHG (where H is A, C or T) in embryonic stem cells (ref. 5), but the function is not understood (not shown). DNA methylation is maintained by DNMT1 and also by DNMT3A and/or DNMT3B, which are bound to nucleosomes containing methylated DNA (ref. 99). Enhancers tend to be CpG-poor and show incomplete methylation, suggesting a dynamic process of methylation or demethylation occurs, perhaps owing to the presence of ten-eleven translocation (TET) proteins in these regions, although this remains to be shown. They also have NDRs, and the flanking nucleosomes have the signature H3K4me1 mark and also the histone variant H2A.Z (refs 32, 100). The binding of proteins such as CTCF to insulators can be blocked by methylation of their non-CGI recognition sequences, thus leading to altered regulation of gene expression, but the generality of this needs further exploration. The sites flanking the CTCF sites are strongly nucleosome-depleted, and the flanking nucleosomes show a remarkable degree of phasing. The figure does not show the structure of CpG-depleted promoters or silenced CGIs, although in both cases the silent state is associated with nucleosomes at the TSS. LMR, low-methylated region.

Excerpt shown on the slide (Jones):
[…] nucleosome becomes present, and then this is followed by the recruitment of DNMT3A to this nucleosome and, subsequently, de novo methylation occurs. Whether a similar sequence of events occurs in cells that are not expressing DNMT3L is not yet known.
Furthermore, Ooi et al. (ref. 12) showed that de novo methylation could not occur on a nucleosome bearing the H3K4me2 or H3K4me3 marks, which are associated with active genes. The nucleosomes flanking the nucleosome-depleted start site often contain both the histone mark H3K4me3 and the histone variant H2A.Z, both of which are strongly anti-correlated with DNA methylation (refs 46, 47). The occurrence of the H3K4me3 mark in mice is possibly maintained by the presence of CXXC finger protein 1 (CXXC1; also known as CFP1), which recruits the H3K4 methyltransferase to the region, thus ensuring that the +1 and –1 nucleosomes contain marks that are incompatible with de novo DNA methylation (ref. 48). The unmethylated state of the CpG island is also presumably ensured by the presence of the TET1 protein, which is found at a high proportion of the TSSs of high-CpG-content promoters. Presumably, TET1 converts any 5mC that might be in this region into 5-hydroxymethylcytosine (ref. 49). The molecular anatomy of active CGIs can therefore explain why they are resistant to methylation (FIG. 1).
Of course, not all CGI-promoter genes are expressed in ESCs, and many are suppressed by the Polycomb complex, so why are these not de novo methylated? The answer probably lies in the fact that they contain the antagonistic H3K4me3 (REF. 12) and H2A.Z marks (refs 46, 47) and are also bound by TET1, which would ensure that they remain 5mC-free. Interestingly, this protection seems to break down during immortalization (ref. 50), and these CGIs become highly susceptible to de novo methylation, which increases after oncogenic transformation (refs 41–43).
This model predicts that the higher the level of expression is, the less likely it is that a CGI is to become de novo methylated. Direct evidence in support of this prediction has recently come from several exciting papers that have shown that monoallelic methylation of CGIs preferentially occurs on the allele that is less highly expressed. For example, Hitchins et al. (ref. 51) showed that an allele of the MLH1 gene containing a single-nucleotide variant in the promoter, which was less active than the more common allele in transfection experiments, was more likely to become methylated in the somatic cells of cancer-affected families. In other words, the less active allele was the one that was more likely to acquire de novo methylation.
An alternative scenario was shown by Boumber et al. (ref. 52), who found that an allele of RIL (also known as PDLIM4) bearing a polymorphism in the promoter that created an additional binding site for the transcription factor SP1 or SP3 was much less likely to become de novo methylated than the allele without this polymorphism. The extra SP1 site therefore confers resistance of this allele to de novo methylation, although the authors could not demonstrate that the extra transcription factor binding site increased gene expression […]

Jones Nat Rev Genet 13:484 (2012)
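Returning to the CpG(o/e) ratio and the example thresholds from the CpG islands slide ((G + C)% ≥ 50%, ℓ ≥ 200, CpG(o/e) ≥ 0.6), the definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `cpg_stats` and `is_cpg_island` are hypothetical, the test is applied to one fixed window rather than scanning a genome, and only the given strand is counted. The last lines apply the same formula to the chr1 counts from the composition table.

```python
def cpg_stats(seq):
    """Return (GC fraction, CpG observed/expected) for a DNA string.

    CpG(o/e) = (length * n_CpG) / (n_C * n_G), as defined on the slide.
    """
    seq = seq.upper()
    length = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (n_c + n_g) / length if length else 0.0
    cpg_oe = (length * n_cpg) / (n_c * n_g) if n_c and n_g else 0.0
    return gc_fraction, cpg_oe


def is_cpg_island(seq, min_len=200, min_gc=0.50, min_oe=0.6):
    """Apply the three example criteria from the slide to a single window."""
    gc_fraction, cpg_oe = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and cpg_oe >= min_oe


# Same formula applied to the chr1 totals from the table (counts in bases).
chr1_len, chr1_c, chr1_g, chr1_cpg = 225.3e6, 47.02e6, 47.02e6, 2.28e6
chr1_oe = chr1_len * chr1_cpg / (chr1_c * chr1_g)  # about 0.23, far below 0.6
```

With the table values, chromosome 1 as a whole has CpG(o/e) of about 0.23, which illustrates how strongly CpG is depleted outside islands.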
Methylation and stem cells
→ cytosine methylation also occurs at CH sites (H ∈ {A, C, T})
→ it changes during cell differentiation
(H1 = stem cells; IMR90 = lung fibroblasts)

[Figure panels: MethylC-Seq and bisulphite-PCR tracks (mCG, mCHG, mCHH) around OCT4 (chr 6: 31,246,431; scale bar 1 kb) for H1 and IMR90. Values legible in panel a: H1, 75.5% of methylcytosines in the CG context and the remaining 24.5% (7.2% + 17.3%) in non-CG contexts, mC = 6.2 × 10^7, mCG = 4.7 × 10^7; IMR90, 99.98% mCG, mC = 4.5 × 10^7, mCG = 4.5 × 10^7.]

Figure 1 | Global trends of human DNA methylomes. a, The percentage of methylcytosines identified for H1 and IMR90 cells in each sequence context. b, AnnoJ browser representation of OCT4. c, Distribution of the methylation level in each sequence context. The y axis indicates the fraction of all methylcytosines that display each methylation level (x axis), where methylation level is the mC/C ratio at each reference cytosine (at least 10 reads required). d, Blue dots indicate methylcytosine density in H1 cells in 10-kb windows throughout chromosome 12 (black rectangle, centromere). Smoothed lines represent the methylcytosine density in each context in H1 and IMR90 cells. Black triangles indicate various regions of contrasting trends in CG and non-CG methylation. mC, methylcytosine.

Figure 2 | Bisulphite-PCR validation of non-CG DNA methylation in differentiated and stem cells. DNA methylation sequence context is displayed according to the key and the percentage methylation at each position is represented by the fill of each circle (see Supplementary Table 2 for values). Non-CG methylated positions indicated by an asterisk are unique to that cell type and '+4' indicates a mCHH that is shifted 4 bases downstream in H9 cells. iPS, induced pluripotent stem cell. [Regions shown: BMP4 (H1); chr 1: 200,015,530–200,015,725 (W); chr 3: 100,016,095–100,016,287 (W); chr 10: 30,837,441–30,837,664 (W); samples H1, H9, iPS (IMR90), IMR90.]

Excerpt shown on the slide (Lister et al.):
[…] CHH methylation identified in H1 cells and absent in IMR90 cells is not simply due to genetic differences between the two cell types, but rather that the presence of non-CG methylation is characteristic of an embryonic stem-cell state. For each cell type, two biological replicates were performed with cells of different passage number (see Supplementary Information), and comparison of the methylcytosines identified independently in each replicate revealed a high concordance of cytosine methylation status between replicates (Supplementary Fig. 2). For each cell type, the final DNA methylation map presented in this study represents the composite of the two biological replicates. The OCT4 gene (also called POU5F1) exemplifies both cell-specific differential methylation and the presence of non-CG methylation (Fig. 1b), and in addition displayed a ~50-fold reduction in OCT4 transcript in IMR90 cells (data not shown). The absence of mCHG and mCHH methylation in IMR90 cells coincided with significantly lower transcript abundance of the de novo DNA […]

Lister, Pelizzola et al. Nature 462:315 (2009)
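The two notions used on this slide, the sequence context of a cytosine (CG, CHG or CHH with H ∈ {A, C, T}) and the methylation level defined as the mC/C ratio with at least 10 reads, can be illustrated with a minimal Python sketch. The helper names `cytosine_context` and `methylation_level` are hypothetical, and the sketch only looks at the forward strand of the sequence it is given.

```python
H_BASES = set("ACT")  # H stands for A, C or T


def cytosine_context(seq, i):
    """Classify the cytosine at index i as 'CG', 'CHG' or 'CHH'.

    Returns None if seq[i] is not a C, or if the context runs past the
    end of the sequence or contains a base other than A, C, G, T.
    """
    seq = seq.upper()
    if i >= len(seq) or seq[i] != "C":
        return None
    following = seq[i + 1:i + 3]
    if len(following) >= 1 and following[0] == "G":
        return "CG"
    if len(following) == 2 and following[0] in H_BASES:
        if following[1] == "G":
            return "CHG"
        if following[1] in H_BASES:
            return "CHH"
    return None


def methylation_level(n_methylated, n_total, min_reads=10):
    """mC/C ratio at one reference cytosine, or None below the read threshold."""
    if n_total < min_reads:
        return None
    return n_methylated / n_total
```

For instance, `cytosine_context("CAG", 0)` gives `'CHG'`, and a cytosine covered by 9 reads is left uncalled whatever its counts.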
Methylation detection

Table 1 | Main principles of DNA methylation analysis
Columns give the analytical step; rows give the pretreatment.

Pretreatment        Locus-specific analysis             Gel-based analysis                     Array-based analysis                          NGS-based analysis
Enzyme digestion    HpaII-PCR                           Southern blot, RLGS, MS-AP-PCR, AIMS   DMH, MCAM, HELP, MethylScope, CHARM, MMASS    Methyl–seq, MCA–seq, HELP–seq, MSCC
Affinity enrichment MeDIP-PCR                           (none)                                 MeDIP, mDIP, mCIP, MIRA                       MeDIP–seq, MIRA–seq
Sodium bisulphite   MethyLight, EpiTYPER,               Sanger BS, MSP, MS-SNuPE, COBRA        BiMP, GoldenGate, Infinium                    RRBS, BC–seq, BSPP, WGSBS
                    Pyrosequencing

AIMS, amplification of inter-methylated sites; BC–seq, bisulphite conversion followed by capture and sequencing; BiMP, bisulphite methylation profiling; BS, bisulphite sequencing; BSPP, bisulphite padlock probes; CHARM, comprehensive high-throughput arrays for relative methylation; COBRA, combined bisulphite restriction analysis; DMH, differential methylation hybridization; HELP, HpaII tiny fragment enrichment by ligation-mediated PCR; MCA, methylated CpG island amplification; MCAM, MCA with microarray hybridization; MeDIP, mDIP and mCIP, methylated DNA immunoprecipitation; MIRA, methylated CpG island recovery assay; MMASS, microarray-based methylation assessment of single samples; MS-AP-PCR, methylation-sensitive arbitrarily primed PCR; MSCC, methylation-sensitive cut counting; MSP, methylation-specific PCR; MS-SNuPE, methylation-sensitive single nucleotide primer extension; NGS, next-generation sequencing; RLGS, restriction landmark genome scanning; RRBS, reduced representation bisulphite sequencing; –seq, followed by sequencing; WGSBS, whole-genome shotgun bisulphite sequencing.

Laird Nat Rev Genet 11:191 (2010)

Bisulphite sequencing

(Cm = methylated C; C = unmethylated C)
1) Denaturation
   Watson  >>ACmGTTCGCTTGAG>>
   Crick   <<TGCmAAGCGAACTC<<
2) Bisulfite treatment
   BSW     >>ACmGTTUGUTTGAG>>
   BSC     <<TGCmAAGUGAAUTU<<
3) PCR amplification
   BSW     >>ACmGTTTGTTTGAG>>
   BSWR    <<TG CAAACAAACTC<<
   BSC     <<TGCmAAGTGAATTT<<
   BSCR    >>ACG TTCACTTAAA>>

Figure 1. Pipeline of bisulfite sequencing. 1) Denaturation: separating Watson and Crick strands; 2) Bisulfite treatment: converting un-methylated cytosines (blue) to uracils; methylated cytosines (red) remain unchanged; 3) PCR amplification of bisulfite-treated sequences resulting in four distinct strands: Bisulfite Watson (BSW), bisulfite Crick (BSC), reverse complement of BSW (BSWR), and reverse complement of BSC (BSCR).

Excerpt shown on the slide (Xi & Li):
[…] detect the methylation pattern of every C in the genome. Nevertheless, the mapping of millions of bisulfite reads to […] another 19% are Gs, only ~1.8% of dinucleotides are CpG dinucleotides. Because C methylation occurs almost […] as mismatches, where a C in the BS-read is aligned to a T in the reference [2]. Although this all-inclusive C/T conversion is effective for reads derived from the C-poor strands, it is not appropriate for reads derived from the G-poor strands, where all the Cs are actually transcribed from Gs by PCR amplification and thus could not be converted to Ts during bisulfite treatment. During shotgun sequencing, however, a bisulfite read is almost equally likely to be derived from either the C-poor or the G-poor strands. There is no precise way to determine the original […] have to be recorded, even the non-unique mappings. Therefore, this approach is only practical for small reference sequences, where only the C-poor strands are sequenced. For example, Meissner et al. used this mapping strategy for reduced representation bisulfite sequencing (RRBS) [2], where the genomic DNA was digested by the MspI restriction enzyme and 40–220 bp segments were selected for sequencing. The reference sequence (~27 M nt) is only about 1% of the whole mouse genome, covering 4.8% of the total CpG dinucleotides.

Xi & Li BMC Bioinformatics 10:232 (2009), http://www.biomedcentral.com/1471-2105/10/232

Bisulphite sequencing 2
→ asymmetric alignment . . .

1) Multiple mapping
   Bisulfite read  >>ATTTCG>>
   Reference       >>ATACTTCGATGATCTCGCAAGACTCCGGC>>
   (the read ATTTCG is shown aligned at three positions of the reference)
2) Mapping asymmetry
   Bisulfite read  T (can align with C or T in the reference)
   Reference       C (cannot be matched by the reverse, a read C against a reference T)

Figure 2. Mapping of bisulfite reads. 1) Increased search space due to the cytosine-thymine conversion in the bisulfite treatment. 2) Mapping asymmetry: thymines in bisulfite reads can be aligned with cytosines in the reference (illustrated in blue) but not the reverse.

Xi & Li BMC Bioinformatics 10:232 (2009)

BSMAP
→ hashing + bit manipulation to detect match/mismatch (a mask is stored for the reference)

Figure 3. BSMAP algorithm. A) Bisulfite seed table, using the original seed and bisulfite variants as keys and corresponding coordinates in the reference genome as values. Each read was looked up in the seed table for potential mapping positions. B) A positional specific mask of the corresponding reference sequence was generated by setting 01 to C (light blue) and 11 to A, G, T (black). The original read was masked by a bitwise AND operation with the positional specific mask. C) The reference sequence and the masked read were compared with a bitwise XOR operation. Non-zero XOR results were counted as mismatches (red).
Bisulfite alignment is marked in green. Méthylation ? IFT6299 H2014 ? UdeM ? Miklós Csűrös xii are unmethylated and thus converted. Finally, we set T & 10/ln(10), which is the same scale as ‘phred’ scores (10). We used the score matrix in Table 1, which approximately fits these settings. First, we rand cytosine in bot cytosine received (Table 2). A m this cytosine is m which the On peut aussi trouver une pénalisation propre (LODS) à la conversion bisulfite : DNA the probability Table 1. Score matrix for aligning bisulfite-converted DNA reads to depended wheth a reference genome sequence (Table 2). Second, we ra a c g t genome, by pick obtained from a 6 #18 #18 #18 c #18 6 #18 3 Genome Databa g #18 #18 6 #18 match tutions, but also C:T ou T:T t #18 #18 #18 3 sertions are larg read to come en Columns refer to bases in the read, and rows refer to bases in the genome. original genome réfé ren ce Pénalisation modifiée Frith, Mori & Asai Nucleic Acids Res 40 :e100 (2012) Méthylation ? IFT6299 H2014 ? UdeM ? Miklós Csűrös xiii ransformed transformation transformed me statistical e transformation the meβ-value statistical ted range thevalue β-value bimodal ited value range the cost of yatbimodal ogical ) at the cost of ty. logical lity. REVIEWS Figure 2 | Two alternative strategies for bisulphite alignment. a | An illustrative example of bisulphite sequencing for a DNA fragment with known DNA methylation levels at four CpGs and a total of eight bisulphite-sequencing reads. For easier visualization, the sequencing reads are four bases long (realistic numbers would be 50 to 200 bases), and the size of the genomic DNA sequence is just 23 bases (3 gigabases would be a realistic number for the human genome). b | Alignment of the bisulphite-sequencing reads (centre) to the reference sequence (top) using a wild-card aligner that tolerates zero mismatches and zero gaps. 
The aligner replaces each C in the reference sequence by the wild-card letter Y, which can match both C and T in the read sequences. Reads with more than one perfect alignment with the reference sequence are discarded (greyed out), and for each CpG in the genomic DNA sequence, the DNA methylation level (bottom) is calculated as the percentage of aligning Cs among all uniquely mapped reads. Note that the third CpG is incorrectly assigned a DNA methylation level of 100%, which is due to the fact that the unmethylated read was discarded as ambiguous, whereas the methylated read could be uniquely mapped. c | The same alignment carried out by a three-letter aligner, which also tolerates zero mismatches and zero gaps. The aligner replaces each C in the reference sequence by an upper-case T and each C in the sequencing reads by a lower-case t, with no distinction being made between upper-case T and lower-case t during the alignment. As a result of the reduced sequencing complexity with only three letters remaining, a larger number of reads align to more than one position in the reference sequence and are discarded. The three-letter alignment avoids incorrect results in this example, but it fails to provide any values for the first and third CpG. (As an alternative to discarding ambiguous reads, it is also possible to assign them randomly to one of the best-matching positions; in the current example, the wild-card alignment would provide correct results 50% of the time, whereas the three-letter alignment exhibits higher uncertainty and would be correct only 6.25% of the time.)

M values: Logistically transformed β-values. The transformation mitigates some statistical problems of the β-value (namely, limited value range and strongly bimodal distribution) at the cost of reduced biological interpretability.

Batch effects: Systematic biases in the data that are unrelated to the research question but that arise from undesirable (and often unrecognized) differences in sample handling.

[...] advisable to process samples in an order that minimizes confounding between potential sources of batch effects (for example, processing date and microarray batch) and the phenotype of interest (for example, cases versus controls) and to use tools for batch effect removal, which can substantially increase robustness and statistical power [50,52,53]. Other common biases in bisulphite microarray data include nonspecific binding of DNA fragments to multiple probes (which has been shown to cause false positives for sex-specific DNA methylation differences on the autosomes [54]) and the presence of genetic variants affecting probe binding or read-out. The impact of
these technical issues can be minimized by removing all probes that exhibit a high sequence identity with multiple genomic regions as well as those overlapping with common genetic variants.

Confounding: A nonrandom relationship between the phenotype of interest and external factors (for example, batch effects or population structure) that can give rise to spurious associations.

Processing enrichment-based data. Enrichment-based assays for DNA methylation mapping use various methods for enriching DNA in a methylation-specific manner. Methylated DNA can be enriched using methylation-specific antibodies (in methylated DNA immunoprecipitation coupled with high-throughput sequencing (MeDIP–seq)), methyl-CpG-binding domain (MBD) proteins (in MBD sequencing (MBD-seq)) or a restriction enzyme that specifically cuts methylated DNA (in methylation-dependent restriction enzyme sequencing (McrBC-seq)). Alternatively, unmethylated DNA can be enriched using restriction enzymes that specifically cut unmethylated DNA (for example, in HpaII tiny fragment enrichment by ligation-mediated PCR coupled with sequencing (HELP–seq)). Next-generation sequencing of the resulting DNA libraries counts the frequency of specific DNA fragments in each library and provides the raw data from which DNA methylation levels can be inferred.

Ambiguity

Read placement is difficult: there are more C:T and T:T matches, and regions of reduced complexity (CpG).

a Setup of the example
   Genomic DNA sequence:         CCGATGATGTCGCTGACGCACGA
   DNA methylation level:        100%, 50%, 50%, 0% (at the four CpGs)
   DNA fragmentation, selective conversion of unmethylated Cs into Ts, DNA sequencing
   Bisulphite-sequencing reads:  ACGT, ATGA, ATGA, ATGT, TCGA, TCGA, TCGT, TTGT

b Wild-card alignment (asymmetric mismatch: C:T and T:T are accepted, but not T:C)
   Reference sequence:           YYGATGATGTYGYTGAYGYAYGA
   Read alignment:               TCGA, TCGA, TCGT, TTGT, ACGT, ATGA, ATGA; ATGT has an ambiguous placement
   DNA methylation level:        100%, 50%, 100%, 0%

c Three-letter alignment (symmetric mismatch: reduced alphabet, with C converted to T everywhere)
   Reference sequence:           TTGATGATGTTGTTGATGTATGA
   Read alignment:               TtGA, TtGA, TtGT, TTGT, AtGT, ATGT, ATGA, ATGA
   DNA methylation level:        N/A, 50%, N/A, 0%
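The asymmetric (wild-card) and symmetric (three-letter) matching rules can be compared directly on the 23-base example, together with the Table 1 scores of Frith, Mori & Asai. The following is only a sketch under stated assumptions: forward strand only, no gaps, exhaustive search over every offset. The function names are hypothetical, and `SCORE` is a transcription of Table 1 (rows are genome bases, columns are read bases).

```python
# Sketch only: forward strand, no gaps, exhaustive search over all offsets.
REF = "CCGATGATGTCGCTGACGCACGA"  # genomic DNA sequence of the example

# Table 1 scores, indexed as SCORE[genome_base][read_base].
SCORE = {
    "A": {"A": 6, "C": -18, "G": -18, "T": -18},
    "C": {"A": -18, "C": 6, "G": -18, "T": 3},
    "G": {"A": -18, "C": -18, "G": 6, "T": -18},
    "T": {"A": -18, "C": -18, "G": -18, "T": 3},
}


def wildcard_hits(read, ref=REF):
    """Asymmetric rule: a genome C may face a read T, never the reverse."""
    def ok(g, r):
        return g == r or (g == "C" and r == "T")
    k = len(read)
    return [i for i in range(len(ref) - k + 1)
            if all(ok(g, r) for g, r in zip(ref[i:i + k], read))]


def three_letter_hits(read, ref=REF):
    """Symmetric rule: convert every C to T in both sequences, then match exactly."""
    ref3, read3 = ref.replace("C", "T"), read.replace("C", "T")
    k = len(read3)
    return [i for i in range(len(ref3) - k + 1) if ref3[i:i + k] == read3]


def table1_score(read, pos, ref=REF):
    """Ungapped score of `read` placed at offset `pos`, using the Table 1 matrix."""
    return sum(SCORE[g][r] for g, r in zip(ref[pos:pos + len(read)], read))


if __name__ == "__main__":
    for read in ["TCGA", "ACGT", "ATGT", "TTGT"]:
        print(read, wildcard_hits(read), three_letter_hits(read))
```

With these definitions, TCGA has a single wild-card hit (offset 0) but two three-letter hits (offsets 0 and 12), whereas ATGT has two hits (offsets 6 and 15) under both rules. Table 1 expresses the same asymmetry as a score: TCGA scores 21 at offset 0 and -3 at offset 12, where a read C would face a genome T.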
With enrichment-based methods, in contrast to bisulphite sequencing, the DNA methylation information is not contained in the read sequence but in the enrichment or depletion of sequencing reads that map to specific regions of the genome. As a result, enrichment-based methods require careful [...]

Bock, Nat Rev Genet 13:705 (2012)

Inference

The main goal is to detect differentially methylated regions (DMRs) between samples (hundreds of genomes).

q value: an estimate of the false discovery rate (FDR).

a Genomic DNA sequence: DNA methylation level at ten CpGs

                CG1   CG2   CG3   CG4   CG5   CG6   CG7   CG8   CG9  CG10
   Cases
     Sample 1    3%    6%   80%   57%    1%    0%    1%    1%   42%   78%
     Sample 2    2%    0%   50%   74%    0%    1%    0%    0%   38%   85%
     Sample 3    0%    1%   95%   86%    2%    0%    0%    0%   41%   67%
   Controls
     Sample 4    0%    2%    8%    1%   12%    3%   15%    8%   36%   72%
     Sample 5    1%    4%    5%    2%   15%    5%   33%   11%   39%   94%
     Sample 6    0%    2%   13%    1%   19%    2%   24%   22%   33%   92%

b Single-CpG analysis (q values)

                          CG1    CG2    CG3    CG4    CG5    CG6    CG7    CG8    CG9   CG10
   Higher in cases      0.333  0.993  0.085  0.068  0.993  0.993  0.993  0.993  0.196  0.993
   Higher in controls   0.993  0.732  0.993  0.993  0.070  0.104  0.104  0.110  0.993  0.351

c Genome-wide tiling analysis (tiling regions 1 to 8): q values for "higher in cases" and "higher in controls"; one tiling region reaches q = 0.048* for "higher in cases".

d Annotated genome analysis (enhancer, promoter region, first exon)

Bock, Nat Rev Genet 13:705 (2012)

Inference

(The signal is stronger within windows or when CpGs are grouped by annotation.)
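The q values on these slides are multiple-testing-corrected p values, and the tiling analysis pools neighbouring CpGs. A minimal sketch of both ideas follows, with loudly stated assumptions: the Benjamini-Hochberg adjustment is used as one common way to obtain q values (the slides do not say which procedure produced the figure), and the tiles group CpGs by index rather than by genomic distance. The function names are hypothetical; the methylation levels are those of panel a.

```python
# Methylation levels (%) from panel a: one row per sample, columns CG1..CG10.
CASES = [
    [3, 6, 80, 57, 1, 0, 1, 1, 42, 78],
    [2, 0, 50, 74, 0, 1, 0, 0, 38, 85],
    [0, 1, 95, 86, 2, 0, 0, 0, 41, 67],
]
CONTROLS = [
    [0, 2, 8, 1, 12, 3, 15, 8, 36, 72],
    [1, 4, 5, 2, 15, 5, 33, 11, 39, 94],
    [0, 2, 13, 1, 19, 2, 24, 22, 33, 92],
]


def mean(xs):
    return sum(xs) / len(xs)


def cpg_differences(cases=CASES, controls=CONTROLS):
    """Mean level in cases minus mean level in controls, for each CpG."""
    return [mean(c) - mean(k) for c, k in zip(zip(*cases), zip(*controls))]


def tile_means(values, width):
    """Average consecutive, non-overlapping groups of `width` CpGs
    (assumption: tiles defined by CpG index, not by genomic distance)."""
    return [mean(values[i:i + width]) for i in range(0, len(values), width)]


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    q = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    diffs = cpg_differences()
    print([round(d, 1) for d in diffs])
    print([round(t, 1) for t in tile_means(diffs, 2)])
    print(bh_qvalues([0.01, 0.04, 0.03, 0.005]))  # hypothetical p values
```

In this toy data the differences at CG3 and CG4 are both large and positive (about 66 and 71 percentage points), so a tile that contains both pools two consistent observations instead of testing each CpG separately.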
Figure 3 | Effective identification of differentially methylated regions in a highly annotated genome. a | An illustrative example of differences in DNA methylation within the promoter region of a gene and at an upstream enhancer. For easier visualization, DNA methylation data are shown for only three cases and three controls (a realistic number would be hundreds of samples) and for ten CpGs in total (dozens to hundreds of CpGs are realistic numbers for a typical promoter region). b | When the DNA methylation levels between cases and controls are compared at the resolution of single CpGs, all multiple-testing-corrected q values exceed 0.05 and are therefore considered to be insignificant. c | When combining statistical evidence from neighbouring CpGs over a fixed distance (tiling regions highlighted in green), one region is identified as being significantly more highly methylated among the cases compared to the controls [...]

Bock, Nat Rev Genet 13:705 (2012)

Inference with an HMM?

Figure 1. Overview of Bisulfighter. (a) mC calling. Bisulfite-converted reads are aligned to a reference genome, and the mC level is estimated as a ratio of C–C matches. A major feature is the utilization of alignment probability for filtering out unreliable alignments, and for [...]

→ 2 samples
→ emissions: restricted to CpGs only; binomial distribution (m methylated reads out of n, with a level θ specific to the state): $\binom{n}{m}\,\theta^{m}(1-\theta)^{n-m}$
→ geometric duration of the stay in a state
→ LODS score of a region: $\log \frac{P\{\text{region UP}\}}{P\{\text{region NoCH}\}}$

Saito, Tsuji & Mituyama, Nucleic Acids Res (2014)
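The binomial emission, the geometric duration and the region LODS score can be illustrated numerically. This is a simplified sketch, not the Bisulfighter implementation: it ignores transition probabilities, treats the two samples as independent given the state, and uses hypothetical state-specific levels θ. It also assumes that UP means a higher methylation level in sample 2 than in sample 1, which the slide does not spell out. All names are illustrative.

```python
from math import comb, log


def binom_pmf(m, n, theta):
    """P(m methylated reads out of n) when the methylation level is theta."""
    return comb(n, m) * theta ** m * (1 - theta) ** (n - m)


def region_lods(counts, theta_up, theta_noch):
    """Log-odds of state UP versus state NoCH for a run of CpGs.

    counts     : list of ((m1, n1), (m2, n2)) read counts per CpG, samples 1 and 2
    theta_up   : (level in sample 1, level in sample 2) under UP (hypothetical)
    theta_noch : the same pair under NoCH (hypothetical)
    Transition probabilities are deliberately left out of this sketch.
    """
    total = 0.0
    for (m1, n1), (m2, n2) in counts:
        p_up = binom_pmf(m1, n1, theta_up[0]) * binom_pmf(m2, n2, theta_up[1])
        p_no = binom_pmf(m1, n1, theta_noch[0]) * binom_pmf(m2, n2, theta_noch[1])
        total += log(p_up / p_no)
    return total


def expected_run_length(p_stay):
    """Mean number of CpGs spent in a state whose self-transition probability
    is p_stay (the duration is geometrically distributed)."""
    return 1.0 / (1.0 - p_stay)


if __name__ == "__main__":
    up, noch = (0.2, 0.8), (0.5, 0.5)       # hypothetical levels
    rising = [((1, 10), (9, 10))] * 3       # sample 2 much more methylated
    flat = [((5, 10), (5, 10))] * 3         # no difference between samples
    print(region_lods(rising, up, noch))    # positive: favours UP
    print(region_lods(flat, up, noch))      # negative: favours NoCH
    print(expected_run_length(0.9))
```

A positive score favours UP over NoCH for the region and a negative score favours NoCH; because the contributions of the CpGs add up, a long run of modest per-CpG evidence can outweigh a single noisy site.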