1. Interpreting rich epigenomic datasets

[Figure: chromatin state enrichment table. Columns: % Genome, TSS, CpG, hiCpG-TSS, loCpG-TSS, TSS, Transcribed, TES, DNase, Conservation, ZNF, Lamina, Repeats, Expression, L1 repeat, Alu repeat]

Interpreting chromatin states

How many states are meaningful: agreement between cell types
[Figure: ratio vs. background for H1-H9, H9-H1, H1/9-IMR90, IMR90-H1, IMR90-H9, and background]
• Distinctions remain recoverable between cell types, even after 40-50 chromatin states (IMR90-H1-H9)

Preferential enhancer-promoter interactions
[Figure: interaction panels for IMR90 (same chromosome), IMR90 (different chromosomes), H1 (same chromosome), H1 (different chromosomes). State labels: Transcribed 3', Transcribed 5', Transcribed strong, Transcribed weak, Transcribed enhancer, Enhancer poised, Enhancer active, Strongest enhancer, Strong enhancer, Weak enhancer, Low signal, Heterochromatin, Repressed, Bivalent promoter, Active promoter. Group labels: Transcribed, Enhancer, Off, Prom]
• Different enhancer states show different interactions
• Enhancers, transcribed regions, and promoters interact
• Inactive regions show fewer interactions overall (both to active states and to each other)
• H3K9me3 states interact between chromosomes in ES cells

2. Prioritizing experiments

Ever-expanding dimensions of epigenomics
[Figure: thousands of whole-genome datasets along the chromatin-mark and cell-type dimensions. Additional dimensions: environment, genotype, disease, gender, stage, age]
• Today: cell-type and chromatin-mark dimensions
• Next: personal epigenomes: genotype/phenotype
• Complete matrix of conditions, individuals, alleles

Prioritize experiments for additional cell types: 2 methods
• Method 1: based on unique information
• Method 2: based on chromatin state recovery
  (1) Quantify state recovery using subsets of marks
  (2) Capture additional information from mark intensity
• Beyond marks: trade-offs of more cell types vs. greater depth

Method 1 example: Rank chromatin marks for a new cell type
[Figure: IMR90 marks ranked by prediction error² using all marks, from hardest to predict ("Prioritize these marks?") to easiest to predict (redundant)]
• Hardest marks to predict using all other IMR90 marks: H3K3me3, etc.
• These match the marks usually identified as the most useful: a good metric?

Method 2 example: Rank additional marks for existing cell type
Extend IMR90 set beyond initial 22 marks
• 22 marks common with CD4T data: H2AK5ac, H3K27ac, H3K27me3, H3K9me3, H2BK120ac, H3K4ac, H3K36me3, H4K20me1, H2BK12ac, H3K9ac, H3K4me1, H2BK20ac, H4K5ac, H3K4me2, H3K14ac, H4K8ac, H3K4me3, H3K18ac, H4K91ac, H3K79me1, H3K23ac, H3K79me2
• 19 marks only in CD4T data: H2AK9ac, H2BK5me1, H3K9me2, CTCF, H2BK5ac, H3K27me1, H3R2me1, H2AZ, H3K36ac, H3K27me2, H3R2me2, PolII, H4K12ac, H3K36me1, H4K20me3, H4K16ac, H3K79me3, H4R3me2

3. Completing epigenomes computationally

Chromatin mark imputation

Predicting signal for missing marks
• Question: Can we predict the signal intensity of one mark given other sets of marks?
• Datasets used:
  – H1, IMR90 (+H9, K562, GM12878, HSMM)
• Methodological decisions:
  – Focus on common set of marks
  – Downsample one replicate to 10 million reads
  – Split reads equally between training and test data
  – Bin genome into 2kb bins
• Model/metrics:
  – Use a linear regression model for predictions
  – Use squared error loss on mark signal as objective

E.g.: Predicting H3K9ac signal
[Figure: predicted vs. true H3K9ac signal tracks]
• How good is the prediction?
• How similar to other marks?
• How does it compare to a biological replicate?
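The regression setup described above (binned signal, reads split into training and test halves, linear model, squared error) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline behind the slides: the data are synthetic Poisson counts standing in for per-bin read counts, and the names fit_mark_predictor, predict_mark, and squared_error are hypothetical.

```python
import numpy as np

def fit_mark_predictor(other_marks, target):
    """Least-squares fit of one mark's signal on the other marks plus an intercept."""
    X = np.column_stack([other_marks, np.ones(len(other_marks))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef  # last entry is the intercept

def predict_mark(other_marks, coef):
    """Apply fitted coefficients (intercept last) to a bins x marks matrix."""
    X = np.column_stack([other_marks, np.ones(len(other_marks))])
    return X @ coef

def squared_error(pred, true):
    """Mean squared error between predicted and observed signal."""
    return float(np.mean((pred - true) ** 2))

# Synthetic stand-in data (assumption): per-bin rates for 4 "other" marks,
# and a target mark whose rate is a linear mix of two of them.
rng = np.random.default_rng(0)
n_bins, n_marks = 5000, 4
rates = rng.gamma(shape=2.0, scale=5.0, size=(n_bins, n_marks))
target_rate = 0.5 * rates[:, 0] + 0.3 * rates[:, 1] + 1.0

# Splitting reads in half is mimicked by two independent Poisson draws
# from the same underlying rates over the same bins.
train_X, test_X = rng.poisson(rates), rng.poisson(rates)
train_y, test_y = rng.poisson(target_rate), rng.poisson(target_rate)

coef = fit_mark_predictor(train_X, train_y)
test_mse = squared_error(predict_mark(test_X, coef), test_y)

# Baseline: predict the training mean everywhere.
baseline_mse = squared_error(np.full(n_bins, train_y.mean()), test_y)
print("coefficients:", np.round(coef, 2))
print("test MSE:", round(test_mse, 2), "baseline MSE:", round(baseline_mse, 2))
```

Comparing the test error with a simple baseline (here the training mean; in the slides, a biological replicate or another mark) is one way to approach the "how good is the prediction?" question.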
Mark        Coeff
H3K56ac      0.32
H3K4me3      0.29
H3K4ac       0.22
H3K4me2      0.15
H3K27ac      0.14
H2AK5ac      0.14
H4K8ac       0.14
H3K23ac      0.13
H3K14ac      0.13
H3K79me2     0.12
H4K5ac       0.06
H3K36me3     0.04
H4K91ac      0.01
H3K4me1     -0.01
H3K18ac     -0.01
H3K27me3    -0.02
H4K20me1    -0.04
H2BK120ac   -0.05
H3K9me3     -0.05
Input       -0.07
H2BK15ac    -0.1
H3K79me1    -0.15
H2BK12ac    -0.15
H2BK20ac    -0.22
Intercept   -0.16

Impute missing datasets / predict new cell types
[Figure panels: Predict missing mark from many others; Predict many marks in new cell type; Prediction of K27ac, K9ac, K4me1… in GM from DNase; Prediction of H3K4me1 from DNase across cell types]
• Use mark correlations to predict missing datasets as matrices become denser
• Applications: (1) Prediction in difficult-to-access conditions. (2) Detecting failed experiments/replicates. (3) Finding unexpected prediction/raw differences

4. Allele-specific chromatin marks

Known imprinted genes confirm allele-specific methodology
Method
• Map to phased GM12878 haplotypes
• Count maternal vs. paternal reads
Validation
• Known imprinted genes are allelic
• X-inactivation: only one chromosome
• Requires sufficient SNPs and sufficient reads for significance

Discover allelic genes genome-wide
Aggregate by gene / chromatin state
Allelic activity supported by many marks, Pol2, TFs
• Includes X-inactivated paternal chromosome genes

Genome-wide correlations for pairs of marks
• Aggregate signal across chromatin states
• Active marks positively correlated
• H3K27me3 negatively correlated

Zoom in on individual examples
Active/repressive marks on paternal/maternal alleles
[Figure: Pol2 reads on paternal chromosome; active transcription of paternal chromosome; repressive marks on maternal chromosome]
• Strong repressive signal (K27me3): reads mostly maternal
• Strong active signal (K79me2 tx): reads mostly paternal

Allele-specific chromatin marks: cis-vs-trans effects
• Maternal and paternal GM12878 genomes sequenced
• Map reads to phased genome, handle SNPs and indels
• Correlate activity changes with sequence differences

5. Linking enhancers to promoters using many cell types

Power should increase with additional cell types
[Figure: chromatin state vs. gene expression across cell types; chance of spurious correlation decreases]

Power to predict links increases with more cell types
• True enhancers show excess of high correlation
• Number of non-random links increases linearly with number of cell types
• Can estimate number of non-random links at any FDR
• 30 cell types: 15,000 links

Visualizing 10,000s of predicted enhancer-gene links
• Overlapping regulatory units, both few and many
• Both upstream and downstream elements linked
• Enhancers correlate with sequence constraint

6. Disease enrichments across 1000s of enhancers

Full T1D association spectrum: 1000s of causal SNPs
• Rank all SNPs by P-value
• Find chromatin states with enrichment in high ranks
• Signal spans 1000s of SNPs
[Figure: enrichment by rank for GM12878 (lymphoblastoid) and K562 (myelogenous leukemia); GM12878 enhancer enrichment now seen]

Could bias in array design contribute to these enrichments?
Evaluate all 1000 Genomes SNPs by imputing those in LD
• Cell type specific: GM and K562 enhancers
• Chromatin state specific: enhancers/promoters
Imputing SNPs in LD → stronger cell/state separation
[Figure panels: Enhancers across cell types; Chromatin states in GM12878]

Promoters: 462 (excess 81)
Enhancers: 2049 (excess 392)
  1940 distinct loci (R²<0.8)
Transcribed: 4740 (excess 522)
Insulator: 240 (excess 23)
Repressed: 1351 (excess 76)
Other: 21k (deplete 1093)
• Excess of 30,000 SNPs → 2049 enhancers (excess 392)
• Mostly found in independent loci (1730 with R²<0.2)
Systematically measure their regulatory contributions
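The "rank all SNPs by P-value, then look for chromatin states enriched among the high ranks" idea, and the "excess" counts quoted above, can be illustrated with a small sketch. The slides do not define "excess"; here it is assumed to mean observed count minus the count expected if top-ranked SNPs were distributed across states in the same proportions as all SNPs. The function name state_excess and the toy data are hypothetical.

```python
from collections import Counter

def state_excess(snps, top_k):
    """Observed, expected, and excess counts per chromatin state among the top_k SNPs.

    snps: list of (p_value, state) tuples, one per SNP.
    Expected counts assume top-ranked SNPs follow the background state
    proportions (an assumption, not a definition taken from the slides).
    """
    ranked = sorted(snps, key=lambda s: s[0])      # smallest P-value first
    top = ranked[:top_k]
    background = Counter(state for _, state in snps)
    observed = Counter(state for _, state in top)
    total = len(snps)
    result = {}
    for state, n_bg in background.items():
        expected = top_k * n_bg / total
        obs = observed.get(state, 0)
        result[state] = (obs, expected, obs - expected)
    return result

# Toy data (made up for illustration only).
toy = [
    (0.001, "enhancer"), (0.002, "enhancer"), (0.01, "promoter"),
    (0.02, "enhancer"), (0.2, "other"), (0.3, "other"),
    (0.4, "other"), (0.5, "enhancer"), (0.6, "other"), (0.9, "other"),
]
res = state_excess(toy, top_k=4)
for state, (obs, exp, exc) in sorted(res.items()):
    print(f"{state}: observed {obs}, expected {exp:.1f}, excess {exc:+.1f}")
```

In this toy example enhancers are over-represented among the four best-ranked SNPs and "other" is depleted, mirroring the sign pattern of the counts above. A real analysis would also need to account for LD between SNPs, which this sketch ignores.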