
DETECTION OF REGULATORY MOTIFS BASED ON COEXPRESSION AND PHYLOGENETIC FOOTPRINTING
PhD presentation, Valerie Storms, March 29th, 2011
Promoters: Prof. Dr. Ir. Kathleen Marchal, Prof. Dr. Ir. Bart De Moor

Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in eukaryotes
4. De novo motif discovery in vitamin D3 regulated genes

Genetic information
All living organisms consist of one or more cells
• E.g. humans:
– Built of multiple cells like nerve cells, muscle cells, skin cells
– Every cell contains identical genetic information

Genetic information
• Stored as DNA (deoxyribonucleic acid)
• Double helix with sugar-phosphate backbone
• 4 building blocks = "bases"
– A: adenine
– C: cytosine
– G: guanine
– T: thymine / U: uracil
• Complementary base pairing (A-T, G-C) -> hydrogen bonds
• Representation: ACCTGCTAG….ATTGACGGAC
[Figure: DNA double helix with sugar-phosphate backbone and A-T / G-C base pairs]

Genetic dogma
GENE EXPRESSION
DNA contains genes = specific sequences of bases that encode instructions on how to make proteins = work units of a cell
[Figure: gene (DNA) -> TRANSCRIPTION -> mRNA -> TRANSLATION -> protein]

TRANSCRIPTIONAL REGULATION
DIFFERENT LEVELS OF REGULATION

Main players in Transcriptional regulation
1.
Recruitment of the RNA POLYMERASE COMPLEX to the promoter region of the target gene
[Figure: co-activator, TF and RNA polymerase complex at the promoter region and TSS of the TARGET GENE]
This process can be activated or repressed by:
• Transcription Factors (TFs) – activators and repressors
-> Bind DNA directly by recognizing specific regions
• Co-activators and co-repressors
-> Recruited by protein-protein interactions

Main players in Transcriptional regulation
2. Chromatin structure
Eukaryotic cells
• Nucleus
• Linear DNA molecules organized into chromosomes
• Chromatin = complex of DNA and proteins (histones)
-> Influences transcriptional regulation
[Figure: linear DNA molecule and histones; TF at heterochromatin versus euchromatin]

Main players in Transcriptional regulation
[Figure: co-activator, TFs, RNA polymerase complex and chromatin remodeling complex at the TSS of the TARGET GENE; REGULATORY MOTIF ATTGCCAT]
- Modify chromatin structure:
- DNA methylation
- Histone modifications like methylation, acetylation

TF-DNA INTERACTION
• TFs bind specific non-coding sequences in the DNA to control the expression of their target genes -> TF binding sites
• All genes regulated by the same TF contain a similar TF binding site in their promoter region
• REGULATORY MOTIF models the TF-DNA binding specificity and captures the variability of TF binding sites

Regulatory motif
Alignment of TF binding sites:
GTGACG
GTGACC
GAGACG
GTGTCG
GTCAGG
Construction of frequency matrix (one column per motif position p1 p2 p3 …. pn):
A  0.01  0.01  0.01  0.97  0.01  0.01
C  0.01  0.01  0.01  0.01  0.97  0.29
G  0.97  0.01  0.97  0.01  0.01  0.69
T  0.01  0.97  0.01  0.01  0.01  0.01
-> Motif logo

Computational motif discovery
1. Motif scanning: known motif model
-> Different algorithms to predict TF binding sites
2. De novo motif discovery: search for novel, uncharacterized motifs
-> Two different computational approaches!
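
The construction of a frequency matrix from aligned TF binding sites, and its use for motif scanning, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five example sites, the pseudocount of 0.05, the uniform background and the scanned sequence are hypothetical choices, and the function names are not taken from any of the tools discussed in this presentation.

```python
import math

BASES = "ACGT"

def frequency_matrix(sites, pseudocount=0.05):
    """Column-wise base frequencies of equally long, aligned binding sites.

    The pseudocount (an assumption of this sketch) avoids zero frequencies.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        matrix.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return matrix

def scan(sequence, matrix, background=None):
    """Log-odds score (motif versus background) of every window in `sequence`."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(math.log2(matrix[i][b] / background[b])
                    for i, b in enumerate(window))
        hits.append((start, window, score))
    return hits

# Hypothetical aligned binding sites and a hypothetical sequence to scan.
sites = ["GTGACG", "GTGACC", "GAGACG", "GTGTCG", "GTCAGG"]
wm = frequency_matrix(sites)
best = max(scan("AACCGTGACGTTAA", wm), key=lambda hit: hit[2])
print(best[0], best[1])
```

Scoring every window against the matrix is the simplest form of motif scanning; real scanners additionally choose a score threshold and usually a sequence-specific background model.
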
Algorithms classified based on the information sources they use:
- Coregulation information
- Orthology information
- Co-localization of different TF binding sites
- Chromatin structure

Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in eukaryotes
4. De novo motif discovery in vitamin D3 regulated genes

Different information spaces
1. Coregulation space
2. Orthologous space
3. Combined coregulation-orthology space
Next generation of motif discovery tools integrates orthology with coregulation information

Study
Research goal:
– Extent of information in coregulation or orthologous space
– Conditions under which complementing both spaces improves motif detection
Method:
– Synthetic and real benchmark datasets
– Select motif detection tools -> flexible enough to perform in each of the three spaces
- Phylogibbs (Siddharthan et al., 2005)
- Phylogenetic sampler (Newberg et al., 2007)
- MEME (Bailey and Elkan, 1994)

Theoretical comparison
Overview
• Phylogibbs: simulated annealing + tracking => global optimum (= MAP solution); running time: short
• Phylogenetic sampler: a Gibbs sampler => local optimum => ensemble centroid solution; running time: long (multiple re-initializations)
• MEME: Expectation Maximization => local optimum; running time: short
Phylogenetic relatedness between the orthologous sequences:
-> Phylogibbs and Phylogenetic sampler: tree-based evolutionary model -> alignment of the orthologous sequences needed
-> MEME: no evolutionary model -> unaligned sequences

Theoretical comparison
Assignment and scoring of motif sites
• Unaligned: single independent motif sites
• Prealigned: multiple orthologous motif sites, tree-based evolutionary model (F81)
– Phylogibbs: window principle -> more flexible in case of a bad prealignment
– Phylogenetic sampler: block principle -> very sensitive to bad prealignments -> leave out phylogenetically distant orthologs

Performance assessment
Construction of Synthetic datasets
1 Motif WMs with a different IC
2
Background sequences
[Figure: numbered steps 1 to 5 of the dataset construction; motif sites (TC…T, TT…T, …, TC…C); sequences Seq 1, Seq 2, …, Seq 10 of the ancestor species (step 4); REF SPECIES and SPECIES 1 to SPECIES 4; resulting Coregulation, Orthologous and Combined datasets]
Use a phylogenetic tree and an evolutionary model to create the orthologs for different species

Performance assessment
Construction of Real datasets
Biological datasets:
1. Prokaryotic data -> Gamma-proteobacteria: LexA, TyrR
2. Eukaryotic data -> yeast species: Urs1H, Rap1

Performance assessment
Results (1)
COREGULATION SPACE
-> Depends on the degeneracy of the embedded motif
-> Does adding orthologs improve the performance for the LOW IC motif?

Performance assessment
Results (2)
COMBINED SPACE
1. Evolutionary distance between the added orthologs

Performance assessment
Results (3)
2. Phylogenetic tree => Tree based on neutral evolution rate
3. The number of added orthologs and the topology of the tree => low impact
4. Noise => Orthologous direction: performance drop depends on the species distance and the algorithm characteristics

Performance assessment
Results (4)
ORTHOLOGOUS SPACE
-> Room for improvement!
- The number of added orthologs has a larger effect than in the combined space
- PS: almost no output when orthologs are prealigned (no centroid solution)

Conclusions
Phylogibbs, Phylogenetic sampler, MEME
• Quality of predicted motifs depends on correctness of prealignments
-> Challenge: accounting for phylogenetic relatedness, independent of a prealignment
• Ensemble centroid strategy
-> Useful with low signal/noise
-> Computationally limiting
• Phylogenetic tools may perform better than the more basic MEME tool, BUT
-> More parameters to tune
-> Performance strongly depends on the prealignment quality, the phylogenetic tree, the relationship between the orthologs, etc.

Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in eukaryotes
4.
De novo motif discovery in vitamin D3 regulated genes

PhyloMotifWeb
• Motif finders with different algorithmic background -> performance diversity
-> Ensemble strategy: combine results of multiple algorithms
• Progress of experimental technologies
– Growing number of sequenced genomes -> orthology information
– Epigenetic information, chromatin structure information
• Ensemble phylogenetic motif finders
• Create orthologs, alignments, phylogenetic tree
• Automatic parameter sweep
• Easy reduction of search space

PhyloMotifWeb – Ensemble strategy
• Three motif finders: Phylogibbs, Phylogenetic sampler and MEME
• Run each motif finder across multiple parameter settings (e.g. different motif numbers, motif widths etc.)
-> Large collection of output matrices
• FuzzyClustering algorithm
– Summarizes all these output matrices into a set of non-redundant ensemble motifs
– Works on the TF binding site level <-> matrix level

PhyloMotifWeb
• Epigenetic information, chromatin structure information
-> Important for motif discovery in eukaryotes!

PhyloMotifWeb - Eukaryotes
Restrict search space to regions with higher regulatory potential based on epigenetic information like chromatin structure
BUT: Tissue and condition dependent!
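
Restricting the search space to regions with higher regulatory potential can be pictured as masking everything outside a set of annotated intervals before motif discovery. The sketch below is a minimal illustration, not the PhyloMotifWeb implementation: the function name `mask_outside`, the 0-based end-exclusive interval convention, the example sequence and the example intervals are all assumptions made here.

```python
def mask_outside(sequence, regions, mask_char="N"):
    """Keep bases inside any (start, end) region and mask all other positions.

    Regions are assumed to be 0-based and end-exclusive; intervals that run
    past the sequence ends are clipped.
    """
    keep = [False] * len(sequence)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(sequence))):
            keep[i] = True
    return "".join(base if k else mask_char for base, k in zip(sequence, keep))

# Hypothetical promoter sequence and hypothetical open-chromatin intervals
# for one cell type; another tissue or condition would need its own list.
promoter = "ACGTACGTACGTACGTACGT"
open_chromatin = [(2, 6), (12, 15)]
masked = mask_outside(promoter, open_chromatin)
print(masked)
```

Because the annotation is tissue and condition dependent, the interval list is the part that changes between experiments, while the masking step itself stays the same.
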
Annotation of regulatory regions -> Regulatory build pipeline of Ensembl
• Multi-cell type:
– DNase hypersensitivity -> open chromatin
– CTCF binding sites -> enhancer/insulator marker
– Binding sites of other TFs
• Cell-type specific:
– Histone modifications

PhyloMotifWeb – Webserver
[Figures: screenshots of the webserver]

PhyloMotifWeb – Webserver
Results page
- Motif logo
- Individual binding sites of the ensemble solution
- p-value for the overrepresentation of the ensemble motif in the sequence set versus random sequence sets
- Comparison with database motifs

Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in eukaryotes
4. De novo motif discovery in vitamin D3 regulated genes

Vitamin D3 - metabolism
• Source: diet, and produced in the skin when exposed to sunlight
• Role in regulating many physiological and cellular processes:
- Bone health
- Prevention of autoimmune diseases
- Anti-proliferative effect on different cell types like cancer cells

Vitamin D3 - mode of action
1. Vitamin D3 enters the cell and binds to the vitamin D receptor (VDR), which dimerizes with RXR
2. Ligand-activated VDR/RXR binds the DNA at vitamin D response elements (VDREs)
3. Recruitment of co-activators and chromatin remodelers -> open chromatin structure
4.
Transcription of the VDR target gene
[Figure: VitD3-bound VDR/RXR at the VDRE with DRIP and the transcription machinery at the target gene]

Vitamin D3 - dataset
[Figure: mouse bone cells and human breast cancer cells, VitD3 VERSUS control (Ctr) -> ANTIPROLIFERATIVE PHENOTYPE]
GOAL: get insight into the molecular mechanism underlying the anti-proliferative effect of vitD3
- Human and mouse cell lines treated with vitD3 versus no vitD3 (Control)
- Measured the expression of all genes in the human and mouse cells using microarrays for both conditions over different time points
- Select differentially expressed genes (vitD3 versus Control) -> phenotype
- Group per species all genes with similar behavior in coexpression clusters
-> Focus on genes with a conserved coexpression behavior across human and mouse: interesting for the common anti-proliferative phenotype

Vitamin D3 - Dataset
Conserved coexpression cluster:
- 10 genes
- Upregulated after vitD3
Assume: conserved transcriptional regulation
Conserved regulatory motifs responsible for expression behavior
-> De novo strategy
-> Screening: co-localization of TF binding sites

Vitamin D3 - de novo motifs
METHOD: PhyloMotifWeb
RESULTS:
1. Very common motifs
• Low specificity for coexpressed cluster
• Match with TFs involved in cell cycle regulation
– e.g. SP1, ZF5, NRF1
– Well conserved TF binding sites, present in many genes!
• TF involved in B-cell differentiation
– EBF

Vitamin D3 - de novo motifs
2. Motifs specific for the conserved coexpression cluster
-> higher overrepresentation in the cluster compared to the genome
-> match with following TFs:
ZEB1
- Transcriptional activator of VDR protein
- Role in cancer metastasis
VDR
- Putative direct regulation by VDR
- VDRE hard to discover de novo: only one conserved half-site!
• Two conserved half sites with variable spacer
• Diverse configurations [DR, IR, ER]
• Located far up-/down-stream of the TSS
NHR-scan: specific for nuclear hormone receptor binding sites

Vitamin D3 – Cis-regulatory modules
Higher eukaryotes:
-> TFs act in cooperation to modulate gene expression
-> Find co-localized binding sites for de novo predicted motifs => CRMs

Vitamin D3 – Cis-regulatory modules
METHOD: CPModule
INPUT:
• De novo predicted motif models
• Constraint: module size ranging between 150 bp and 400 bp
RESULTS:
• 3 CRMs highly specific for the coexpressed genes (p-value < 0.001):
SP1-EBF: 7 genes
NRF1-EBF: 7 genes
VDR-ZEB1-EBF: 10 genes
• Each CRM contains the EBF motif -> degenerate -> many hits -> using a motif-specific score threshold
• Most interesting is the ZEB1-VDR module

Vitamin D3 - perspectives
• Motifs predicted for the conserved coexpression cluster -> investigate their presence for larger species-specific clusters or maybe for the full genome
• The availability of cell-type specific epigenetic information can help to retrieve the functional binding sites
• Besides a transcriptome analysis -> integrate extra omics data like ChIP-seq and protein profiling to reconstruct the regulatory network of vitD3

Acknowledgements
CMPG-Bioi ESAT-Bioi
• Prof. Dr. Kathleen Marchal
• Prof. Dr. Bart De Moor
• Dr. Pieter Monsieurs
• Prof. Dr. Yves Moreau
• Marleen Claeys
• Wouter Van Delm
• Carolina Fierro
• Aminael Sanchez
LEGENDO
• Hong Sun
• Dr. Lieve Verlinden
• Prof. Dr. Mieke Verstuyf
CMPG
• Dr. Guy Eelen
• Els Vanoirbeek
• Prof. Dr. Jan Michiels

Extra slides

Theoretical comparison
Phylogibbs Algorithm (1)
Procedure:
1. start with a random configuration C, based on prior information on the number of motif sites/TFs
2. construct the set of all possible configurations C' that differ in one single move from C (designed moveset)
3. calculate for each C' the posterior probability score
4.
sample a new configuration from this score distribution
-> This procedure is repeated for two phases:
1. Simulated annealing: iterating to configuration C* with the highest posterior probability (= MAP) (temperature parameter β)
2. Tracking: posterior probabilities are assigned to the windows in C*
-> One initialization is sufficient
-> Very short running time (minutes/hours)

Theoretical comparison
Phylogibbs Algorithm (2)
3. Calculate the posterior probability score: P(C|S)
Bayes' Theorem: P(C|S) = P(S|C) P(C) / P(S)
-> P(C|S) ~ P(S|C) = probability that the motif sites of C are drawn from the motif WM and that the background sequence is drawn from the background model
-> EVOLUTIONARY MODEL
-> The motif WM = unknown!! -> integral over all possible WMs: P(S|C) = ∫ P(S|C, WM) P(WM) dWM, with prior P(WM) modeled by a Dirichlet prior distribution Dir(γ)
The approximation to solve this integral requires that the tree topologies are reduced to collections of star topologies

Theoretical comparison
Phylogenetic sampler Algorithm (1)
Procedure:
1. start with a random positioning of blocks (based on prior information on the expected number of motif sites/TFs and the max number of motif sites per sequence)
2. update the motif model based on the current blocks (<-> PG)
3. scoring: leave out the blocks for one sequence (<-> PG) and calculate for each possible block the conditional probability score
4. first sample the number of motif sites for the sequence, then sample this number of blocks from the score distribution (3)
-> This iteration procedure is repeated for:
1. Burn-in phase: to converge to a local optimum
2. Sampling phase: keep track of all sampled blocks to construct the centroid afterwards
-> multiple initializations (seeds) recommended to avoid getting trapped in a local maximum
-> long running time (hours/days)

Theoretical comparison
Phylogenetic sampler Algorithm (2)
2.
Update the motif model
-> Sample a new motif model from a Dirichlet distribution Dir(β+c) adjusted with phylogenetically weighted counts (based on the phylogenetic tree)
-> Accept the new motif with a probability proportional to the Metropolis-Hastings ratio
3. Calculate the conditional probability score
The conditional probability => proportional to the probability that the block is drawn from the motif model (inferred) divided by the probability that the block is drawn from the background model
-> EVOLUTION MODEL
-> The Felsenstein tree-likelihood algorithm is used to handle all tree topologies (<-> PG)

Theoretical comparison
Solution
Phylogibbs -> Maximum a posteriori (MAP) solution
-> set of motif sites (configuration) with the highest posterior probability
Phylogenetic sampler -> Centroid solution
-> report all those motif sites that appear in at least half the sampling iterations
-> keeps track of all motif sites sampled during sampling iterations to calculate posterior probabilities
-> does not take into account joint occurrences of the motif sites
Figure from Newberg et al., 2007

Theoretical comparison
Evolutionary model
Adapted Felsenstein (F81) model
-> Describes the substitution process at the nucleotide level
-> Assumes that all positions evolve independently and at equal rates (u)
-> Probability that a is mutated to b is dependent on the time (t)
-> Fixation of b is dependent on its frequency in the motif WM
Phylogibbs -> proximity = q = exp(-ut) = probability that no substitution took place per site
Phylogenetic sampler -> branch length = b = ut AND a different normalization for their branch lengths (k)
Convert proximities to branch lengths: b = -(3/4) ln(q)

Introduction
Main players in Transcriptional regulation
Prokaryotic cells (bacteria):
• No nucleus, circular 'naked' DNA molecule
Eukaryotic cells:
• Linear DNA molecules organized into chromosomes
• Chromatin = complex of DNA and proteins (histones)
Chromatin function:
– Storage of long DNA molecules into
the nucleus
– Role in Transcriptional regulation: euchromatin and heterochromatin
[Figure: nucleus, chromosome, chromatin, nucleosome, histone proteins, DNA]

Main players in Transcriptional regulation
2. Chromatin structure (eukaryotes)
[Figure: co-activator, TF, RNA polymerase complex and chromatin remodeling complex at the promoter region and TSS of the TARGET GENE]

Theoretical comparison
Input format
SPACE
• COREGULATION: non-coding regions for a set of coregulated genes from one species
-> Phylogibbs, Phylogenetic sampler: unaligned
• ORTHOLOGOUS: non-coding regions for a set of orthologous genes from multiple species
-> Phylogibbs, Phylogenetic sampler: prealigned orthologs (PG => Dialign, PS => ClustalW)
• COMBINED: combination of both
Phylogibbs, Phylogenetic sampler: phylogenetic tree
MEME: unaligned

Theoretical comparison
Assignment and scoring of motif sites
• Unaligned: single independent motif sites
• Prealigned: multiple orthologous motif sites, tree-based evolutionary model (F81)
– Phylogibbs: window principle -> more flexible in case of a bad prealignment
– Phylogenetic sampler: block principle -> very sensitive to bad prealignments -> leave out phylogenetically distant orthologs

Performance assessment
Results (3)
2. Phylogenetic tree => Tree based on neutral evolution rate
3. The number of added orthologs and the topology of the tree
4. Noise => Orthologous direction: performance drop depends on the species distance and the algorithm characteristics
[Figure: Spec 3 … Spec M; Phylogibbs ↓, Phylogenetic sampler ↓]
- Weighting scheme
- Block principle

PhyloMotifWeb - webserver
PHYLO-MOTIF-WEB
STEP 1 Select the non-coding regions (ENSEMBL CORE)
STEP 2 Additional information sources (ENSEMBL COMPARA AND REGULATORY BUILD): mask repeats, multi-species alignments, DNA features like chromatin structure
STEP 3 Motif discovery by using an ensemble strategy: MEME, Phylogibbs, Phylogenetic sampler
STEP 4 Post-processing of the predicted ensemble motif matrices: Clover, MotifComparison
Other resources shown: TRANSFAC and JASPAR, UCSC GENOME BROWSER
Legend: External Database, External Software

PhyloMotifWeb - Webserver

Vitamin D3 - de novo motifs
RESULTS:
1.
Very common motifs
-> low overrepresentation in the cluster compared to the genome
-> match with following TFs:
SP1
- Involved in vitD3 response -> regulation of genes without VDRE binding site
- Regulator of TFs involved in cell cycle regulation
ZF5
- TF particularly abundant in differentiated tissues with low proliferation
- Growth suppressive activity
NRF1
- Involved in cell proliferation
EBF
- B-cell differentiation
-> SP1, ZF5 and NRF1 are cell cycle regulators -> well conserved binding sites, present in many genes!
Motif finder labels shown next to the TFs on the slide: MEME, PG, PS

PhyloMotifWeb – Ensemble strategy
• Three motif finders: Phylogibbs, Phylogenetic sampler and MEME
• Run each motif finder across multiple parameter settings (e.g. different motif numbers, motif widths etc.)
-> Large collection of output matrices
• FuzzyClustering algorithm -> summarizes all these output matrices into a set of non-redundant ensemble motifs
- Works on TF binding site level -> fine tuning sensitivity/specificity
- Integration of TF binding site scores assigned by the original motif finder
- Trace back the different motif finders that contributed to the final solution

Vitamin D3 - de novo motifs
METHOD: PhyloMotifWeb
- 4000 bp centered around TSS -> restrict to regions with regulatory potential
- Use evolutionary conservation information
-> human-mouse pairwise alignment
-> six species alignment
- Use Phylogibbs, Phylogenetic sampler and MEME => Ensemble solution
- Predicted ensemble motifs were compared to database motifs from TRANSFAC and JASPAR to retrieve TFs potentially involved in the coexpression behavior

Vitamin D3 - dataset
[Figure: mouse bone cells and human breast cancer cells, VitD3 VERSUS control (Ctr) -> ANTIPROLIFERATIVE PHENOTYPE]
GOAL: get insight into the molecular mechanism underlying the anti-proliferative effect of vitD3
- Human and mouse cell lines treated with vitD3 versus no vitD3 (Control)
- Measured the expression of all genes in the human and mouse cells using microarrays for both conditions over
different time points
- Select differentially expressed genes (vitD3 versus Control) -> phenotype
- Group per species all genes with similar behavior in coexpression clusters
-> Focus on similarity between human and mouse cells as interesting for the COMMON antiproliferative phenotype

General perspectives
Integration of multiple information sources to improve de novo motif discovery
• Orthology information
– Ortholog alignments, evolutionary models
– Evolution in how algorithms exploit this information source
• New information sources like epigenetic information become available
– How to exploit this new information?
– More knowledge on which chromatin modifications co-locate with transcriptionally active regions like promoters, enhancers or TF binding sites will improve usability
