Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA)
Workshop on Regulatory Sequence Motif Discovery, November 10th 2006. The Linnaeus Centre for Bioinformatics, SLU-UU, Sweden.
Giovanna Ambrosini, Christoph Schmid

Components of transcriptional regulation
[Figure: distal transcription-factor binding sites (enhancer) and cis-regulatory modules. Wasserman 5, 276-287 (2004)]

EPD: The Eukaryotic Promoter Database
Current Release 88 (SEPT-2006)
• founded in 1986 (Bucher and Trifonov; Nucleic Acids Res, 14, 10009-10026)
• originally based exclusively on the literature, carefully maintained and regularly updated
• in recent years has started to take mass sequencing data into account
• aims at high-precision mapping of the transcription start site (TSS) (+/- 5 bp)
• promoter sequences from 139 different species, still relatively low coverage (e.g. 1871 human entries)
• format of TSS annotation:
    DR EMBL; ZZ999999.1; HS28BP; [-19, 9].
    position ruler: -15 -10 -5 0 5
    acccgcctgcacccgattcATGTGAGAA
• one or several alternative transcription start sites per gene

EPD format
    ID   HS_RPS3 standard; multiple; VRT.
    XX
    AC   EP74176;
    XX
    DT   10-JAN-2003 (Rel. 73, created)
    DT   13-SEP-2004 (Rel. 80, Last annotation update).
    XX
    DE   Ribosomal protein S3.
    OS   Homo sapiens (human).
    XX
    HG   none.
    AP   none.
    NP   none.
    XX
    DR   GENOME; NT_033927.7; NT_033927; [-5333322, 12577805]. [ ENSEMBL; UCSC HapMap ]
    DR   CLEANEX; HS_RPS3.
    DR   EMBL; AP000744.4; [-90138, 35862]. [ EMBL; GenBank; DDBJ ]
    DR   SWISS-PROT; P23396; RS3_HUMAN.
    DR   RefSeq; NM_001005 [ DBTSS ].
    DR   MIM; 600454.

TSS determined by modelling Gaussian distributions (MADAP)
[Figure: frequency of full-length transcripts versus genomic position; scale markers 10 bp and 45 bp; marked ranges 84046905-84046987 and 84047148-84047231]

The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Schmid, C.D., Praz, V., Delorenzi, M., Perier, R. and Bucher, P. (2004) Nucleic Acids Res, 32, D82-85.
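The bracketed range in the DR-line example above can be read programmatically. The following is a minimal sketch, not part of any EPD tool: the helper names `parse_dr_range` and `split_at_tss` are hypothetical, and it assumes that the range gives the first and last sequence positions relative to the TSS with no position 0, which is consistent with the 28-bp example (19 lowercase upstream bases, 9 uppercase downstream bases).

```python
import re

# Hypothetical helpers for illustration; not part of any EPD software.
# Matches a bracketed integer pair such as "[-19, 9]".
DR_RANGE = re.compile(r"\[\s*(-?\d+)\s*,\s*(-?\d+)\s*\]")


def parse_dr_range(line):
    """Return (first, last) from the first bracketed range on a DR line."""
    match = DR_RANGE.search(line)
    if match is None:
        raise ValueError("no [first, last] range found in: " + line)
    return int(match.group(1)), int(match.group(2))


def split_at_tss(sequence, first):
    """Split a sequence into (upstream, downstream) parts.

    Assumption: positions run ..., -2, -1, +1, +2, ... (no position 0),
    so a range starting at `first` < 0 contains -first upstream bases.
    """
    n_upstream = -first
    return sequence[:n_upstream], sequence[n_upstream:]


first, last = parse_dr_range("DR EMBL; ZZ999999.1; HS28BP; [-19, 9].")
upstream, downstream = split_at_tss("acccgcctgcacccgattcATGTGAGAA", first)
print(first, last, upstream, downstream)
```

Under this reading the first downstream base is the annotated start site. If the real convention differs (for instance, if a position 0 exists), only `split_at_tss` would need to change.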
Superior precision of in silico primer extension (ISPE)

                          [-10;10]   [-400;400]
    EPD 70 36               0.83       1
    RefSeq mRNA 933         0.32       0.95
    Genome annot. 890       0.31       0.95
    DBTSSv1 (human) 933     0.13       0.68
    Eponine 494             0.12       0.46

virtual counts: (2 ** log ratio) - 1

New data sources for EPD
ChIP-chip: Kim et al. (2005) Nature, 436, 876-880
GEO: GSE2672 (remapped!)
[Figure: ENSEMBL, chromosome 12: 6.8 - 6.94 Mb]

ChIP-chip data with insufficient resolution
    FP Hs USP5 :+R EU:NC_000012.10 1+ 6831557; 74339.
[Figure: distribution of TSS; frequency versus genomic position, 6831200 to 6832000]

EPD webserver: http://www.epd.isb-sib.ch/
• find EPD entry(-ies) using gene symbols, ...
  - extraction of promoter sequences in user-defined ranges
  - direct transfer to Signal Search Analysis (SSA)
• download of complete (reference!) promoter sets: http://www.epd.isb-sib.ch/seq_download.html

SSA: Signal Search Analysis
Giovanna Ambrosini
ISREC, Swiss Institute for Experimental Cancer Research

History: Signal Search Analysis is a method developed by P. Bucher in the early eighties (Bucher, P. and Bryan B., E.N.; Nucleic Acids Res, v.12(1 Pt 1): 287-305)

Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences.

Signal search analysis programs:
1. CPR: generates a "constraint profile" for the neighborhood of a functional site
2. SList: generates lists of over- and under-represented motifs in particular regions relative to a functional site
3. OProf: generates a "signal occurrence profile" for a particular motif
4. PatOP: optimizes a weight matrix description of a locally over-represented sequence motif

Recent events: adaptation of the software to a new environment, SSA web server, application to promoters and translational start sites

Locally Over-represented Sequence Motifs

Definition of a Locally Over-represented Sequence Motif
Concept: a motif which preferentially occurs at a characteristic distance (range) from a certain type of functional position
Example: the TATA-box is a locally over-represented sequence motif of the -30 region of eukaryotic POL II transcription initiation sites
Components of the formal motif description:
1. A weight matrix or consensus sequence defining the motif
2. A cut-off value determining which subsequence constitutes a motif match
3. A preferred region of occurrence defined by 5' and 3' borders relative to a functional site, e.g. a transcription initiation site

Input Data Structure
Primary experimental data (Functional Position Set): annotated functional positions in DNA sequences stored in a database
Work data (DNA sequence matrix): a set of fixed-length sequence segments with an experimentally defined site at a fixed internal position

The Motif Search Problem Statement
For a given DNA sequence matrix, find a locally optimal combination of the following, using a given quality criterion:
• quantitative motif description
• cut-off value
• region of preferential occurrence

[Figure: TATA-box signal occurrence profile for EPD and ENSEMBL Drosophila promoters]
[Figure: CCAAT-box signal occurrence profile for vertebrate and ENSEMBL Drosophila promoters]

SSA webserver: http://www.isrec.isb-sib.ch/ssa
• provides access to precompiled functional position sets
  - collections of transcription initiation sites (promoters) from eukaryotic species
  - collections of translation initiation sites from a large variety of prokaryotic genomes
• provides access to the four signal search analysis programs

Application to a bacterial translational control signal: the Shine-Dalgarno ribosome binding-site motif
Compare the strength and location of the Shine-Dalgarno mRNA-rRNA interaction motif in E. coli and B. subtilis in a qualitative manner.
Result: the Shine-Dalgarno interaction motif is stronger in B. subtilis than in E. coli and is centered about two bases further upstream in the former species.
More than a hundred bacterial genomes are now available to perform this type of analysis.

Studying transcription regulatory processes with specialized bioinformatics resources: an example
Biological question: do genes that are generally up-regulated in cancer cells have different types of promoters?
Procedure:
• define cancer up- and down-regulated gene sets using CleanEx
• extract the corresponding promoter regions from EPD
• analyse the signal content of the two promoter sequence sets using SSA

Comparative analysis of cancer up- and down-regulated promoters
Signals considered:

    Signal       Preferred position   Approx. frequency
    Initiator    0                    25% - 50%
    TATA-box     -30 to -25           ~30%
    GC-box       -200 to 0            ~50%
    CCAAT-box    -200 to -50          ~20%

[Figure: positional distribution of the Initiator motif in cancer up- and down-regulated promoters]
[Figure: positional distribution of TATA-boxes in cancer up- and down-regulated promoters]
[Figure: positional distribution of GC-boxes in cancer up- and down-regulated promoters]
[Figure: positional distribution of CCAAT-boxes in cancer up- and down-regulated promoters]

Comparative analysis of cancer up- and down-regulated promoters: summary of results

    Signal       Frequency in cancer-up genes   Frequency in cancer-down genes
    Initiator    no change                      no change
    TATA-box     up                             down
    GC-box       no change                      no change
    CCAAT-box    up                             down

Next questions: Are TATA-box and CCAAT-box binding factors dysregulated in cancer cells? Or do cancer-specific transcription factors (binding to adjacent sites) preferentially interact with TATA-box and CCAAT-box binding factors?
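The comparison above rests on the three-part motif description given earlier: a weight matrix, a cut-off value, and a preferred region relative to the functional site. The sketch below shows how the frequency of a motif in a region could be computed for two sequence sets. It is not the SSA implementation: the matrix `TOY_MATRIX`, the function names, and the example sequences are all made up for illustration, and offsets are taken as plain index differences from the site column.

```python
# Toy weight matrix for a 4-bp "TATA" motif, one dict per motif position.
# The scores are illustrative only, not a published TATA-box matrix.
TOY_MATRIX = [
    {"A": -2, "C": -2, "G": -2, "T": 2},
    {"A": 2, "C": -2, "G": -2, "T": -2},
    {"A": -2, "C": -2, "G": -2, "T": 2},
    {"A": 2, "C": -2, "G": -2, "T": -2},
]


def window_score(matrix, segment):
    """Sum the per-position scores; unknown bases (e.g. N) score -2."""
    return sum(column.get(base, -2) for column, base in zip(matrix, segment))


def match_offsets(matrix, cutoff, sequence, site_index):
    """Offsets (relative to the site column) of windows scoring >= cutoff."""
    width = len(matrix)
    offsets = []
    for start in range(len(sequence) - width + 1):
        if window_score(matrix, sequence[start:start + width]) >= cutoff:
            offsets.append(start - site_index)
    return offsets


def region_frequency(matrix, cutoff, region, sequences, site_index):
    """Fraction of sequences with at least one match starting in the region."""
    first, last = region  # inclusive 5' and 3' borders
    hits = sum(
        1
        for seq in sequences
        if any(first <= off <= last
               for off in match_offsets(matrix, cutoff, seq.upper(), site_index))
    )
    return hits / len(sequences)


# Made-up fixed-length segments with the functional site at index 8.
set_a = ["GGTATAGGCGCC", "CCTATACCGGAA", "GGGGCCCCGGAA"]
set_b = ["GGGGCCCCGGAA", "TATAGGCCGGAA"]
freq_a = region_frequency(TOY_MATRIX, 8, (-7, -5), set_a, site_index=8)
freq_b = region_frequency(TOY_MATRIX, 8, (-7, -5), set_b, site_index=8)
print(freq_a, freq_b)
```

The second sequence of `set_b` does contain the motif, but at offset -8, outside the region, so it does not count. This is the sense in which a motif is "locally" over-represented.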
Concluding remarks
• Signal search analysis has played an instrumental role in the characterization of eukaryotic promoter elements.
• The method was originally developed for the analysis of eukaryotic promoters but has a much broader application potential (e.g. Shine-Dalgarno signal analysis).
• The rapidly growing collection of complete genomes and high-throughput methods for genomic analysis increase the statistical power to discover new motifs or to better characterize already known control signals.
• Aligning sequence sets with respect to a well-characterized motif might allow the detection of binding sites of cooperating transcription factors that are positionally correlated with the known motif.
• Commonly accepted hypotheses originally derived from small sets can be confirmed or challenged.
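The idea of aligning sequence sets with respect to a known motif can be sketched as follows. This is an illustration only: the function `realign_on_motif` is hypothetical, it uses exact string matching on the first occurrence instead of a weight matrix with a cut-off, and the sequences are made up.

```python
def realign_on_motif(sequences, motif, flank):
    """Cut a window of `flank` bases on each side of the first motif match.

    Sequences without a match, or without enough flanking bases on
    either side, are skipped so that all returned windows line up.
    """
    aligned = []
    for seq in sequences:
        start = seq.find(motif)
        if start == -1 or start < flank:
            continue
        end = start + len(motif) + flank
        if end > len(seq):
            continue
        aligned.append(seq[start - flank:end])
    return aligned


# Made-up example sequences; the motif sits at different places or is absent.
sequences = ["GGCATATAGGCC", "TTCATATACGAA", "AAAATATA", "GGGGGGGG"]
windows = realign_on_motif(sequences, "TATA", flank=3)
print(windows)
```

Once the windows share a common anchor, the same position-wise counting used for transcription start sites can be applied to the flanks to look for positionally correlated signals.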