Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Machine Learning for Epigenetics: some initial results Guido Sanguinetti School of Informatics, University of Edinburgh ICMS Big data in Science Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 1 / 37 Talk outline 1 Background 2 Statistical testing in ChIP-Seq data (G. Schweikert) 3 MMDiff: Results 4 Transcription factors and histone modifications (with D. Sproul) 5 Shape-based testing for methylation profiles (T. Mayo) Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 2 / 37 The central dogma Where does variability come into play? What can we measure? Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 3 / 37 Transcription as a hybrid system Given the promoter state, the proteins obey the following dynamical model of transcription (linear SDE) dx(t) = (Aµ(t) + b − λx(t)) dt + σdW (1) For a long time, I’ve been interested in inference in this type of system Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 4 / 37 Transcription as a hybrid system Given the promoter state, the proteins obey the following dynamical model of transcription (linear SDE) dx(t) = (Aµ(t) + b − λx(t)) dt + σdW (1) For a long time, I’ve been interested in inference in this type of system What else is happening around this abstraction? Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 4 / 37 Epigenetics Genetics and transcription cannot be all; spatial organisation of chromosomes plays a role. This is determined by chemical modifications to DNA and histones. Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 5 / 37 A more accurate picture? Genome ....CCACCGAACGCGCGCGGGAACGGCACGAGCGGGGCGCCG... DNA sequence trans-factors eg. Cfp-1 Epigenome Transcriptome RNA-Seq / Pol-II Zhou et al., Nat Rev Genet, 2011 Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 6 / 37 The modelling cycle Informatics will provide the synthesis! Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 7 / 37 Epigenetics: what the data looks like Each row is a tiny fraction of a next-generation sequencing experiment’s data. Each row ≥1GB of data. How do we determine relationships between the rows? Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 8 / 37 What the data looks like after QC, mapping, alignment, Histone modification data Guido Sanguinetti (University of Edinburgh) DNA Methylation data ML for epigenetics ICMS, 06/05/15 9 / 37 Obvious problems Small data, with each data point being very big Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 10 / 37 Obvious problems Small data, with each data point being very big Even restricting to regions (e.g. genes), the data is high dimensional and non-trivial Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 10 / 37 Obvious problems Small data, with each data point being very big Even restricting to regions (e.g. genes), the data is high dimensional and non-trivial How can we even determine statistical differences? Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 10 / 37 Obvious problems Small data, with each data point being very big Even restricting to regions (e.g. genes), the data is high dimensional and non-trivial How can we even determine statistical differences? What is a suitable probability model for each of these high-dimensional, non-Gaussian items? Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 10 / 37 Obvious problems Small data, with each data point being very big Even restricting to regions (e.g. genes), the data is high dimensional and non-trivial How can we even determine statistical differences? What is a suitable probability model for each of these high-dimensional, non-Gaussian items? Data associated with different genes may be of intrinsically different dimensionality. How can I do even basic things like clustering? Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 10 / 37 Obvious problems Small data, with each data point being very big Even restricting to regions (e.g. genes), the data is high dimensional and non-trivial How can we even determine statistical differences? What is a suitable probability model for each of these high-dimensional, non-Gaussian items? Data associated with different genes may be of intrinsically different dimensionality. How can I do even basic things like clustering? How can we model in the presence of very strong redundancies (dimensionality reduction)? Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 10 / 37 Issues in statistical testing Statistical testing: procedure to assess the significance of differences between groups of measurements in the light of normal variation Given two sets of replicate measurements, we compute a statistic and determine the probability that a difference as large or larger in the statistic could be due to chance (p-value) Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 11 / 37 Issues in statistical testing Statistical testing: procedure to assess the significance of differences between groups of measurements in the light of normal variation Given two sets of replicate measurements, we compute a statistic and determine the probability that a difference as large or larger in the statistic could be due to chance (p-value) Boring? Misleading (see controversy in Nature earlier this year)? Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 11 / 37 Issues in statistical testing Statistical testing: procedure to assess the significance of differences between groups of measurements in the light of normal variation Given two sets of replicate measurements, we compute a statistic and determine the probability that a difference as large or larger in the statistic could be due to chance (p-value) Boring? Misleading (see controversy in Nature earlier this year)? How do we carry out (thousands of) tests when each measurement is high dimensional? Careful choice of 1-dimensional summaries (statistics) is essential. Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 11 / 37 Example: ChIP-Seq DNA - binding protein - Cross-linking- Cross-linking DNA - DNA fragmentation - DNA fragmentation - Enrichment with specific antibody (ChIP) - Enrichment with specific antibody (ChIP) - Profiling of enriched DNA Individual (Seq) sequencing read (tag) Read (tag) density - Profiling of enriched DNA (Seq) Kim and Park, 2011 Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 12 / 37 What the data looks like after QC, mapping, alignment, Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 13 / 37 What the data looks like after QC, mapping, alignment, Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 13 / 37 What the data looks like after QC, mapping, alignment, The shape of the peak seems highly conserved across replicates. Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 13 / 37 Formulate the test question Suppose for a peak i we are given n observations (i.e. reads) in data set s (e.g. WT) X s = {xs1 , ..., xsn } m observations in data set s 0 (e.g. Null), 0 0 0 X s = {xs1 , ..., xsm } 0 where xs , xs random variables drawn i.i.d. from probability distributions p and p 0 . Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 14 / 37 Formulate the test question Suppose for a peak i we are given n observations (i.e. reads) in data set s (e.g. WT) X s = {xs1 , ..., xsn } m observations in data set s 0 (e.g. Null), 0 0 0 X s = {xs1 , ..., xsm } 0 where xs , xs random variables drawn i.i.d. from probability distributions p and p 0 . Can we decide whether p 6= p 0 ? Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 14 / 37 Formulate the test question Suppose for a peak i we are given n observations (i.e. reads) in data set s (e.g. WT) X s = {xs1 , ..., xsn } m observations in data set s 0 (e.g. Null), 0 0 0 X s = {xs1 , ..., xsm } 0 where xs , xs random variables drawn i.i.d. from probability distributions p and p 0 . Can we decide whether p 6= p 0 ? Define test statistic: should summarize the data, preferably in a single number should capture higher order moments → use Kernel trick Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 14 / 37 MMD Test statistics 0 Nonlinear kernel function k(xs , xs )→ the mean embedding of a distribution p (in the RKHS F) contains the information of all higher-order moments. Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 15 / 37 MMD Test statistics 0 Nonlinear kernel function k(xs , xs )→ the mean embedding of a distribution p (in the RKHS F) contains the information of all higher-order moments. The maximum mean discrepancy, (MMD) is the distance between mean embeddings MMD[F, p, p 0 ] = supf ∈F (Ex∼p [f (x)] − Ex∼p0 [f (x 0 )]) Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 15 / 37 MMD Test statistics 0 Nonlinear kernel function k(xs , xs )→ the mean embedding of a distribution p (in the RKHS F) contains the information of all higher-order moments. The maximum mean discrepancy, (MMD) is the distance between mean embeddings MMD[F, p, p 0 ] = supf ∈F (Ex∼p [f (x)] − Ex∼p0 [f (x 0 )]) 0 Theorem: MMD p,p = 0 if and only if p = p 0 Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 15 / 37 MMD Test statistics 0 Nonlinear kernel function k(xs , xs )→ the mean embedding of a distribution p (in the RKHS F) contains the information of all higher-order moments. The maximum mean discrepancy, (MMD) is the distance between mean embeddings MMD[F, p, p 0 ] = supf ∈F (Ex∼p [f (x)] − Ex∼p0 [f (x 0 )]) 0 Theorem: MMD p,p = 0 if and only if p = p 0 Finite sample estimates of MMD will be different from zero, but their distribution can be estimated (by bootstrapping) MMD can be efficiently computed in terms of Kernel functions 12 1 2 1 (s,s 0 ) s s s s0 s0 s0 MMD = k(x , x ) − k(x , x ) + 2 k(x , x ) (n)2 n·m m Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 15 / 37 MMDiff: how we use it MMD values are computed for each peak independently Every time, we compare two sets of observations: e.g. WT vs Null, WT vs Resc etc. Each read mapping to a given peak is considered an observation The feature we use is the 5’ end of the alignment Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 16 / 37 MMDiff: how we use it MMD values are computed for each peak independently Every time, we compare two sets of observations: e.g. WT vs Null, WT vs Resc etc. Each read mapping to a given peak is considered an observation The feature we use is the 5’ end of the alignment We use RBF Kernels to capture neighbourhood information The Kernel width is chosen to be the median distance between all observations Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 16 / 37 MMDiff: how we use it MMD values are computed for each peak independently Every time, we compare two sets of observations: e.g. WT vs Null, WT vs Resc etc. Each read mapping to a given peak is considered an observation The feature we use is the 5’ end of the alignment We use RBF Kernels to capture neighbourhood information The Kernel width is chosen to be the median distance between all observations Empirical p-Values are determined on peaks with similar total counts Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 16 / 37 Experiments on ENCODE data Compare ChIP-Seq marks across different cell types Studied two different marks: broad histone mark H3K27ac and transcription factor CTCF binding Cell types: human K562 (leukaemia) vs GM12878 for H3K27ac, mouse brain cortex, cerebellum and liver for CTCF Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 17 / 37 ENCODE results 20 normalized counts 0 60 40 20 40 20 normalized counts 0 60 56318000 56322000 56326000 40 20 normalized counts 0 20 40 CTCF, Liver 0 normalized counts 0 CTCF, Cerebellum H3K27ac, Gm12878 normalized counts 40 CTCF, Cortex H3K27ac, K562 107833000 chr12:56318019−56327372 107833500 107834000 107834500 107835000 chr7:107832815−107834854 Both called by MMDiff and not DESeq. DESeq does not find any differences between cortex and cerebellum. Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 18 / 37 Main application: H3K4me3 upon deletion of Cfp1 Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 19 / 37 Example Peaks AB.1 AB.2 We use WT and Resc as replicates and treat AB1 and AB2 separately. Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 20 / 37 Reproducibility MMDiff gives reproducible results (as much as ChIP-Seq does) Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 21 / 37 Sequence enrichment Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 22 / 37 Indirect binding via E2F transcription factors? Wilson, Molecular Cell, 2007 Tyagi, Molecular Cell, 2007 Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 23 / 37 Mechanistic traces in big data? If TF binding determined the histone mark signal, it should be possible to predict histone modifications from TF binding At a simpler level, it should be possible to predict the presence/ absence of marks from TF ChIP-Seq data This does NOT provide a mechanistic proof; rather it is a necessary but not sufficient condition Isolated examples of interactions between TFs and histone modifiers are known Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 24 / 37 Testing the hypothesis: data We interrogated the ENCODE data sets in the three Tier I cell lines (GM12878, K562, H1 hESC) Outputs: five histone modifications, H3K4me1, H3K4me3, H3K9ac, H3K27ac, H3K27me3 found near transcription start sites. Genomic regions defined positive if they intersect with a histone peak Inputs: normalised read counts for ALL TF chipped in the Tier I cell lines Prediction method: logistic regression. Probabilistic predictor which computes relative importance of input features as a weight vector Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 25 / 37 TFs can predict very accurately ROC curves for predictions of histone modifications at promoters in H1 cells (left). TF-based predictions vs sequence based predictions in H1 cells (right) Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 26 / 37 TFs can predict genome wide Table: Predictions of histone modification presence in H1 cells Mark H3K4me1 H3K4me3 H3K9ac H3K27ac H3K27me3 Seq. (promoters) N.D. 0.918 (± 0.001) 0.867 (± 0.001) 0.828 (± 0.002) 0.808 (± 0.002) Guido Sanguinetti (University of Edinburgh) TF promoters N.D. 0.950 (± 0.001) 0.921 (± 0.001) 0.909 (± 0.001) 0.877 (± 0.002) ML for epigenetics DNase 0.854 ± 0.001 0.974 (± 0.001) 0.976 (± 0.001) 0.968 (± 0.001) 0.916 (± 0.001) Enhancers F5 0.842± 0.003 0.962 (± 0.001) 0.961 (± 0.001) 0.950(± 0.001) 0.918 (± 0.002) ICMS, 06/05/15 27 / 37 Methylation Data Bisulfite conversion: unmethylated Cytosine to Uracil NGS, conversion aware alignment RRBS: focus on CpG-rich regions Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 28 / 37 A look at the data Data exhibits strong spatial correlations conserved across replicates Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 29 / 37 Choice of Kernel Each mapped cytosine is an individual data point: xj = (Cj , Methj ) Composite kernel kfull (xi , xj ) = kRBF (xi , xj )kSTR (xi , xj ) kRBF (xi , xj ) = exp[−(Ci − Cj )2 /2σ 2 ] kSTR (xi , xj ) = 1 if Methi = Methj , 0 else σ is modelled from the data as σ 2 = x̄ 2 /2 where x̄ is the median observed distance in the region. Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 30 / 37 Handling Coverage The MMD tests whether samples are drawn from the same distribution. The frequency that data is drawn - the coverage - is independent of the methylation profile. We adapt the method by subtracting an appropriate ’coverage only’ metric. The MMD with an RBF kernel on genomic location only (no methylation considered) Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 31 / 37 Test-statistic M3 D test-statistic M 3 D[X , Y ] = MMD[X , Y , kfull ] − MMD[X , Y , kRBF ] The test statistic over all replicate pairs forms our testing distribution For a given region, the mean of the inter-group comparisons is tested against this distribution This gives the empirical probability of finding the cross-group difference in methylation profiles among the replicates Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 32 / 37 M 3D produces nice histograms M 3 D statistic between replicates (left) and between different conditions (K562 vs H1 cells). Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 33 / 37 M 3D is robust to low replication/ coverage M 3 D test results is robust to low coverage (left) and low replication (right). Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 34 / 37 Conclusions MMD-based statistics enable more powerful tests than currently used approaches MMDiff is complementary to count-based methods: changes that only alter counts (keeping shape fixed) cannot be captured MMD is potentially of use in other scenarios where distributions arise naturally, e.g. methylation or metagenomics Machine learning can help extract patterns from high-throughput epigenomic data which may suggest biological functions Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 35 / 37 Thanks School of Informatics Gabriele Schweikert Dan Benveniste Tom Mayo Wellcome Trust Centre for Cell Biology Adrian Bird IGMM Duncan Sproul Hans-Joachim Sonntag Funding: EU FP7 Marie Curie Actions, ERC, EPSRC. Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 36 / 37 References G. Schweikert, B. Cseke, T. Clouaire, A. Bird and G.S., MMDiff: quantitative testing for shape changes in ChIP-Seq data sets, BMC Genomics 14:826, 2013 MMDiff bioconductor package http://www.bioconductor.org/packages/release/bioc/html/MMDiff.html D. Benveniste, H.-J. Sonntag, G.S. and D. Sproul, Transcription factor binding predicts histone modifications in human cell lines, PNAS 111(37), 13367-13372, 2014 T. Mayo, G. Schweikert and G.S., M 3 D: a kernel-based test for spatially correlated changes in methylation profiles, Bioinformatics 31(6), 809-816, 2015 M3D bioconductor package http://www.bioconductor.org/packages/devel/bioc/html/M3D.html Guido Sanguinetti (University of Edinburgh) ML for epigenetics ICMS, 06/05/15 37 / 37