Regulatory Sequences (Basics)

Alexander Kel
Senior Vice President of Genome Informatics, BIOBASE GmbH
Halchtersche Strasse 33, D-38304 Wolfenbuettel, Germany
www.biobase.de

[Figure: BIOBASE product overview: Pathway builder, Array analyser, TRANSPATH (mechanistic, semantic), S/MARt DB, Patho DB, TRANSFAC, Match, Patch, Catch, CMFinder, TRANSCompel, Cytomer, TRANSGenome, TRANSPLORER]

BIOBASE customers*
• TRANSFAC: Syngenta, Celera, Monsanto, Pfizer, Merck Sharp & Dohme, Amgen, Takeda, Novartis, GlaxoSmithKline
• TRANSPATH: Vertex
• Both: Aventis, Eli Lilly, Schering-Plough, Hoffmann-La Roche, Akzo Nobel
• More than 200 academic labs, including Harvard, Stanford, Tokyo University, Riken Labs, Max Planck
• More than 7000 registered users on our portal gene-regulation.com
* not complete

Same blocks, different structures: the LEGO system

Concepts of gene regulation
• DNA (information carrier 1): amplification, methylation, chromatin structure; transformed by transcription
• RNA (information carrier 2): splicing, degradation; transformed by translation
• Protein: modification, degradation

Gene structure
[Figure: a gene on a contig (5' to 3') with TRANSFAC regulatory elements; transcription gives the primary transcript; splicing gives splice variants; mRNA with alternative exon, 5'-UTR, CDS and 3'-UTR]

General schema of the modular hierarchical structure of transcription regulatory regions of eukaryotic genes.
[Figure labels: TSS, enhancer 1, enhancer 2, promoter, composite element, boxes A, A', A'', B, C, D, D', E, F, G, TATA box, initiator (Inr), cis, trans]

Sequences and positions of AP-1 binding sites in human genes
glutathione P-transferase enhancer at -2500 hemoglobin, epsilon TGACTTT -80 bp TGACATC Akt-2 IFN- -100 bp TGTCACC -89 bp Apo AII TGACTCA -792 bp TGAGTCA Melanotransferrin -2013 bp Collagenase TGAGTCA -72 bp proto-oncogene c-myc porphobilinogen deaminase TGATTTA -335 bp TGACTCA -162 bp
GM-CSF TGACTCA enhancer at -3500

What is a transcription factor?
A transcription factor is a protein that regulates transcription after nuclear translocation by specific interaction with DNA or by stoichiometric interaction with a protein that can be assembled into a sequence-specific DNA-protein complex.

Transcription factors
• Sequence-specific DNA binding
• Non-DNA binding
[Figure labels: DNA, TF1, TF2, TF3, TF4, adapter, co-activator, HAT, Layer I, Layer II, Layer III]

Structure of transcription factors
[Figure: USF-1, dimer]

Structure of transcription factors
• oligomerization domain
• ligand-binding domain
• activation domain
• protein-protein interaction domain
• DNA binding domain

Composite elements (CE): gene, schema and positions of the CE, TRANSCompel accession number
1. Scavenger receptor, Homo sapiens: Ets / AP-1, enhancer -4500/-4100 (C00080)
2. GM-CSF, Mus musculus: Ets / AP-1, -53 : -40 (C00081)
3. Collagenase, Homo sapiens: Ets -89 : -82, AP-1 -72 : -66 (C00083)
4. IgH, Mus musculus: Ets / AP-1, enhancer at 3' flank (C00133)
5. Interleukin 2, Homo sapiens
6. Interleukin 2, Homo sapiens
7. Interleukin 2, Mus musculus
8. IgH, Homo sapiens
9. Serum amyloid A1, Rattus norvegicus
10. IRF-1, Mus musculus
Cells of rows 5 to 10 in extraction order: -283 : -268, NFAT, AP-1; -167 : -142, NF-κB, AP-1; -167 : -142, AP-1, Oct-2, Ets, CBF; -117 : -73, C/EBP; -123 : -113, STAT-1, NF-κB; -49 : -40, NF-κB. Accession numbers: C00109, C00165, C00158, C00173, C00101, C00192.

[Figure: ternary complex NFATp - AP1 - DNA]

Composite elements
Minimal functional units where both protein-DNA and protein-protein interactions contribute to a highly specific pattern of gene expression and provide cross-coupling of different signal transduction pathways.
• F1 alone: low level of transcription
• F2 alone: low level of transcription
• F1 and F2 together: synergistic activation of transcription
Integration of signals.
Cross-coupling of signal transduction pathways
[Figure labels: membrane receptor, Ca2+-dependent channel, Src, Ras (GTP/GDP), SH2, SH3, adaptors, PLC, PI3-K, IP3, calcineurin, PKB/Akt, ERK, JNK, p38 MAPK, phosphorylation, NFATp, c-Fos, c-Jun, ATF-2; cytoplasm, nucleus; composite element, IL-2]

Mechanisms of functioning of synergistic composite elements
(F1, F2: transcription factors; S1, S2: their binding sites)
1) Cooperative binding to DNA and ternary complex formation
2) A new protein surface for DNA recognition could be formed
3) Simultaneous interaction of activation domains with the components of the basal complex
4) Forming a new protein surface for interaction with the basal complex
5) Relief of autoinhibition as a result of protein-protein interactions
6) DNA bending by one of the transcription factors
7) DNA wrapping around a nucleosome allows transcription factors to interact
8) Recruitment of a HAT complex by one of the transcription factors

Mechanisms of functioning of antagonistic composite elements
1) Mutually exclusive binding of factor F1 (activator) and F2 (repressor)
2) Binding of F2 (repressor) results in conformational changes of F1 (activator)
[Figure labels: HAT complex, HDAC complex]

Mouse IL-4 promoter; GM-CSF, Homo sapiens
[Figure labels: AP-1, HMG Y, HMG Y(I), STAT 6, NF-Y, c-MAF, CBF, NFAT, NFAT CE, NF-κB c-Rel/p65, NF-κB p50/p65, TATA, TATTT, CD28 response element, promoter, T-cell specific inducible enhancer at -3500 bp; positions -249, -180, -150, -114, -88, -60, -54, -28, +1]

Enhanceosome
Recruitment of CIITA to MHC-II promoters.
A prototypical MHC-II promoter (HLA-DRA) is represented schematically with the W, X, X2, and Y sequences conserved in all MHC-II, Ii, and HLA-DM promoters. RFX, X2BP, NF-Y, and an as yet undefined W-binding protein bind cooperatively to these sequences and assemble into a stable higher order nucleoprotein complex referred to here as the MHC-II enhanceosome. CIITA is tethered to the enhanceosome via multiple weak protein-protein interactions with the W, X, X2, and Y-binding factors. The octamer site found in the HLA-DRA promoter (O), and its cognate activators (Oct and OBF-1) are not required for recruitment of CIITA. CIITA is proposed to activate transcription (arrow) via its amino-terminal activation domains (AD), which contact the RNA polymerase II basal transcription machinery.
Masternak K et al., Genes Dev 2000 May 1;14(9):1156-66

[Figure labels: closed nucleosomes, site-specific TF, acetylase PCAF, co-activator p300/CBP, acetylation, TFIID, TFIIA, TFIIB, TFIIF, TFIIE, TFIIH, RNA pol II]

S/MARs
Scaffold/matrix attached regions (S/MARs) are regions of the DNA strand that are found at the base of chromatin loops. They anchor the DNA to the proteinaceous nuclear matrix. Each loop is considered to be a functional domain.
[Figure labels: S/MARs, genes, residual DNA]

S/MARs may act as border elements and thus protect gene expression from position effects.
[Figure labels: S/MARs, open chromatin (transcribed region), compact chromatin, promoter, enhancer, gene, SAR, LCR (regulated), nuclear scaffold. J. Bode / E. Wingender 1993]

Databases on gene regulation

BKL: collected information is displayed in a 'one page per protein' format = Protein Reports
• Clear identification of where you are (which species and which protein).
• Tabular presentation of controlled-vocabulary terms.
• Annotations linked to PubMed references.
• Clear paths of navigation between protein reports, within a species and between species.
• Links to 'public domain' databases.
Databases containing gene regulation information
N. Database: URL
1. EMBL Nucleotide sequence database: http://www.ebi.ac.uk/embl.html
2. GenBank: http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
3. SWISS-PROT: http://www.expasy.ch
4. PIR (Protein Information Resource): http://www-nbrf.georgetown.edu/pir
5. PDB: http://www.pdb.bnl.gov/
6. EPD (Eukaryotic promoter database): http://www.epd.isb-sib.ch
7. TRANSFAC: http://transfac.gbf.de/TRANSFAC
8. TRRD: http://www.bionet.nsc.ru/trrd/
9. COMPEL: http://compel.bionet.nsc.ru/
10. TFD (Transcription factor database): http://www.ifti.org/
11. RegulonDB: http://www.cifn.unam.mx/Computational_Biology/regulondb
12. SCPD (The Promoter Database of Saccharomyces cerevisiae): http://cgsigma.cshl.org/jian/
13. Muscle-Specific Regulation of Transcription (A Catalogue of Regulatory Elements): http://agave.humgen.upenn.edu/MTIR/HomePage.html
14. EpoDB (Database of genes that relate to vertebrate red blood cells): http://agave.hum-gen.upenn.edu/epodb/
15. GENET: http://www.iephb.ru/~spirov/genet00.html
16. PlantCARE: http://sphinx.rug.ac.be:8080/PlantCARE/
17. PLACE: http://www.dna.affrc.go.jp/htdocs/PLACE/
18. DBTSS: http://dbtss.hgc.jp/

EMBL data library
Feature: gene
Definition: region of biological interest identified as a gene and for which a name has been assigned;
Optional qualifiers: /allele="text", /citation=[number], /db_xref="<database>:<identifier>", /evidence=<evidence_value>, /function="text", /label=feature_label, /map="text", /note="text", /product="text", /pseudo, /phenotype="text", /standard_name="text", /usedin=accnum:feature_label
Comments: the gene feature describes the interval of DNA that corresponds to a genetic trait or phenotype; the feature is, by definition, not strictly bound to its positions at the ends; it is meant to represent a region where the gene is located.
EMBL data library
Feature: promoter
Definition: region on a DNA molecule involved in RNA polymerase binding to initiate transcription;
Optional qualifiers: /citation=[number], /db_xref="<database>:<identifier>", /evidence=<evidence_value>, /function="text", /gene="text", /label=feature_label, /map="text", /note="text", /phenotype="text", /pseudo, /standard_name="text", /usedin=accnum:feature_label
Molecule scope: DNA
Or look for: (start of) mRNA, or precursor_RNA, or prim_transcript, or exon /number=1, ...

EMBL data library
Feature: misc_feature
Definition: region of biological interest which cannot be described by any other feature key; a new or rare feature;
Optional qualifiers: /citation=[number], /db_xref="<database>:<identifier>", /evidence=<evidence_value>, /function="text", /gene="text", /label=feature_label, /map="text", /note="text", /number=unquoted, /phenotype="text", /product="text", /pseudo, /standard_name="text", /usedin=accnum:feature_label
Comments: this key should not be used when the need is merely to mark a region in order to comment on it or to use it in another feature's location; use the '-' pseudo-key instead.
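In an EMBL flat file, feature keys and qualifiers such as these appear as FT lines. The following is a minimal sketch in Python of reading FT lines into records. It assumes one qualifier per line with no continuation lines; the function name `parse_ft` and the sample lines are illustrative and not part of any EMBL tool.

```python
import re

def parse_ft(lines):
    """Parse EMBL 'FT' lines into a list of feature records.

    Assumption: every qualifier fits on one line (no continuation
    lines), and a line whose body starts with '/' is a qualifier of
    the most recent feature.
    """
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue  # skip FH, XX and other line types
        body = line[2:].strip()
        if not body:
            continue
        if body.startswith("/"):
            match = re.match(r"/(\w+)(?:=(.*))?$", body)
            if match and features:
                name, value = match.group(1), match.group(2)
                # Value-less qualifiers such as /pseudo are stored as True.
                features[-1]["qualifiers"][name] = (
                    value.strip('"') if value is not None else True
                )
        else:
            key, _, location = body.partition(" ")
            features.append(
                {"key": key, "location": location.strip(), "qualifiers": {}}
            )
    return features

# Illustrative input lines (hypothetical excerpt of a feature table).
sample = [
    "FH   Key             Location/Qualifiers",
    "FT   protein_bind    4495..4502",
    'FT                   /bound_moiety="E2F"',
    "FT   misc_feature    4538",
    'FT                   /note="transcription initiation site"',
    'FT                   /gene="CDC6"',
]
features = parse_ft(sample)
```

Each record keeps the feature key, the location string as written, and a dictionary of qualifiers, which is enough to filter for keys such as `promoter`, `enhancer` or `protein_bind`.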
e.g.:
FT   misc_feature    4538
FT                   /note="transcription initiation site"
FT                   /gene="CDC6"

EMBL data library
Feature: enhancer
Definition: a cis-acting sequence that increases the utilization of (some) eukaryotic promoters, and can function in either orientation and in any location (upstream or downstream) relative to the promoter;
Optional qualifiers: /citation=[number], /db_xref="<database>:<identifier>", /evidence=<evidence_value>, /label=feature_label, /gene="text", /map="text", /note="text", /standard_name="text", /usedin=accnum:feature_label
Organism scope: eukaryotes and eukaryotic viruses

EMBL data library
Feature: protein_bind
Definition: non-covalent protein binding site on nucleic acid;
Mandatory qualifiers: /bound_moiety="text"
Optional qualifiers: /citation=[number], /db_xref="<database>:<identifier>", /evidence=<evidence_value>, /function="text", /gene="text", /label=feature_label, /map="text", /note="text", /standard_name="text", /usedin=accnum:feature_label
Comments: note that RBS is used for ribosome binding sites.

EMBL data library
Qualifier: bound_moiety
Definition: moiety bound
Value format: "text"
Example: /bound_moiety="repressor"

Qualifier: usedin
Definition: indicates that the feature is used in a compound feature in another entry
Value format: Accession-number:feature-name or Database_name::Acc_number:feature_label
Example: /usedin=X10087:proteinx
Comment: database_name is an abbreviation for the name of the database in which the entry for the accession number can be found.

EMBL data library
FH   Key             Location/Qualifiers
FH
FT   source          1..4734
FT                   /db_xref="taxon:9606"
FT                   /sequenced_mol="DNA"
FT                   /organism="Homo sapiens"
FT   protein_bind    4495..4502
FT                   /bound_moiety="E2F"
FT   protein_bind    4529..4537
FT                   /bound_moiety="E2F"
FT   misc_feature    4538
FT                   /note="transcription initiation site"
FT                   /gene="CDC6"
Experimentally confirmed sites, though no /evidence qualifier is given.

EMBL data library
FH   Key             Location/Qualifiers
FH
FT   source          1..3204
FT                   /db_xref="taxon:9606"
FT                   /sequenced_mol="DNA"
FT                   /organism="Homo sapiens"
FT   promoter        1..3201
FT                   /note="melanocortin-1 receptor"
FT                   /gene="MC1R"
FT   misc_signal     570..575
FT                   /note="E-BOX"
...
FT   TATA_signal     922..941
FT   protein_bind    1343..1350
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-1"
...
FT   TATA_signal     1553..1559
...
FT   misc_binding    1957..1964
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-2"
FT   misc_binding    2060..2067
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-2"
FT   misc_binding    2069..2074
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="SP-1"
FT   misc_binding    2603..2608
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="SP-1"
Here: misc_signal "E-BOX" and TATA_signal are identified by homology and positional reasoning, AP-1 and AP-2 binding sites are suggested by homology, Sp1 sites are confirmed by gel shift analysis.

EMBL data library
Feature: TATA_signal
Definition: TATA box; Goldberg-Hogness box; a conserved AT-rich septamer found about 25 bp before the start point of each eukaryotic RNA polymerase II transcript unit which may be involved in positioning the enzyme for correct initiation; consensus=TATA(A or T)A(A or T) [1,2];
Optional qualifiers: /citation=[number], /db_xref="<database>:<identifier>", /evidence=<evidence_value>, /gene="text", /label=feature_label, /map="text", /note="text", /usedin=accnum:feature_label
Organism scope: eukaryotes and eukaryotic viruses
Molecule scope: DNA
References:
[1] Efstratiadis, A. et al. Cell 21, 653-668 (1980)
[2] Corden, J., et al. "Promoter sequences of eukaryotic protein-encoding genes" Science 209, 1406-1414 (1980)

EMBL data library
Feature: CAAT_signal
Definition: CAAT box; part of a conserved sequence located about 75 bp up-stream of the start point of eukaryotic transcription units which may be involved in RNA polymerase binding; consensus=GG(C or T)CAATCT [1,2].
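The consensus strings quoted in the TATA_signal and CAAT_signal definitions translate directly into regular expressions. The following is a minimal sketch in Python, assuming a plain DNA string and a forward-strand search only; the helper `find_consensus` and the test sequence are made up for illustration.

```python
import re

# Consensus strings from the feature definitions, written as regexes.
CONSENSUS = {
    "TATA_signal": "TATA[AT]A[AT]",  # TATA(A or T)A(A or T)
    "CAAT_signal": "GG[CT]CAATCT",   # GG(C or T)CAATCT
}

def find_consensus(seq, pattern):
    """Return 1-based start positions of all matches, overlaps included."""
    seq = seq.upper()
    # A lookahead reports overlapping occurrences as well.
    return [m.start() + 1 for m in re.finditer("(?=%s)" % pattern, seq)]

# Hypothetical sequence, not taken from any database entry.
seq = "ccGGCCAATCTgggagcTATAAAAggc"
hits = {name: find_consensus(seq, pat) for name, pat in CONSENSUS.items()}
```

A consensus match alone says nothing about function; the positional expectations in the definitions (about 25 bp and about 75 bp upstream of the start point) would have to be checked separately.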
Optional qualifiers: /citation=[number], /db_xref="<database>:<identifier>", /evidence=<evidence_value>, /gene="text", /label=feature_label, /map="text", /note="text", /usedin=accnum:feature_label
Organism scope: eukaryotes and eukaryotic viruses
Molecule scope: DNA
References:
[1] Efstratiadis, A. et al. Cell 21, 653-668 (1980)
[2] Nevins, J.R. "The pathway of eukaryotic mRNA formation" Ann Rev Biochem 52, 441-466 (1983)

Feature: GC_signal
Definition: GC box; a conserved GC-rich region located upstream of the start point of eukaryotic transcription units which may occur in multiple copies or in either orientation; consensus=GGGCGG;
Optional qualifiers: /citation=[number], /db_xref="<database>:<identifier>", /evidence=<evidence_value>, /gene="text", /label=feature_label, /map="text", /note="text", /usedin=accnum:feature_label

EMBL data library
Feature: misc_signal
Definition: any region containing a signal controlling or altering gene function or expression that cannot be described by other signal keys (promoter, CAAT_signal, TATA_signal, -35_signal, -10_signal, GC_signal, RBS, polyA_signal, enhancer, attenuator, terminator, and rep_origin).
Optional qualifiers: /citation=[number], /db_xref="<database>:<identifier>", /evidence=<evidence_value>, /function="text", /gene="text", /label=feature_label, /map="text", /note="text", /phenotype="text", /standard_name="text", /usedin=accnum:feature_label

EMBL data library
ID   MMIGHALP   standard; DNA; MUS; 17956 BP.
XX
AC   X96607;
...
FT   enhancer        4537..6107
FT                   /note="locus control region"
FT                   /note="alpha"
FT                   /gene="IgH"

ID   SSLCREG1   standard; DNA; MAM; 1190 BP.
XX
AC   X86793;
XX
SV   X86793.1
XX
DT   10-MAY-1995 (Rel. 43, Created)
DT   30-MAY-1995 (Rel. 43, Last updated, Version 3)
XX
DE   S.scrofa locus control region (1190 bp)
XX
KW   locus control region.
...
FT   source          1..1190
FT                   /chromosome="9"
FT                   /db_xref="taxon:9823"
FT                   /organism="Sus scrofa"
FT                   /clone_lib="clonetech"
FT                   /map="p2.4"
FT                   5..1190
FT                   /note="locus control region (HSI)"

Eukaryotic Promoter Database (EPD)
Praz et al., Nucleic Acids Res. 30, 322-324 (2002)
http://www.epd.isb-sib.ch

Eukaryotic Promoter Database (EPD): contents
All EPD: 4809
Vertebrate promoters: 2540
Arthropod promoters: 2000
Plant promoters: 198
Viral: 129
Nematode promoters: 26

Eukaryotic Promoter Database (EPD): example entry
ID   HS_MYC_1   standard; single; VRT.
XX
AC   EP11146;
XX
DT   ??-APR-1987 (Rel. 11, created)
DT   10-OCT-2001 (Rel. 69, Last annotation update).
XX
DE   c-myc (cellular homologue of myelocytomatosis virus 29 oncogene),
DE   promoter 1, MYC gene.
OS   Homo sapiens (human).
XX
HG   Homology group 52; Mammalian c-myc proto-oncogene, promoter 1.
AP   Alternative promoter #1 of 2; exon 1; site 1.
NP   none.
XX
DR   EPD; EP11148; HS_MYC_2; alternative promoter; [+162; +].
DR   EPDEX; HS_MYC.
DR   EMBL; X00364.2; HSMYCC; [-2327, 8669]. [ EMBL; GenBank; DDBJ ]
...
DR   SWISS-PROT; P01106; MYC_HUMAN.
DR   TRANSFAC; R01157; HS$CMYC_01; [-49,-27]; by position.
...
DR   MIM; 190080.
XX
RN   [1]
RX   MEDLINE; 84026482.
RA   Battey J., Moulding C., Taub R., Murphy W., Stewart T., Potter H.,
RA   Lenoir G., Leder P.;
RT   "The human c-myc oncogene: structural consequences of translocation
RT   into the IgH locus in Burkitt lymphoma";
RL   Cell 34:779-787(1983).
...
XX
ME   Nuclease protection [2].
ME   Nuclease protection; transfected or transformed cells [3].
ME   Primer extension [2].
XX
SE   aatctccgcccaccggccctttataatgcgagggtctggacggctgaggACCCCCGAGCT
XX
TX   6. Vertebrate promoters
TX   6.1. Chromosomal genes
TX   6.1.5. Hormones, growth factors, regulatory proteins
TX   6.1.5.16. Various cellular protooncogenes
XX
KW   Proto-oncogene, Nuclear protein, DNA-binding, Glycoprotein,
KW   Transcription regulation.
XX
FP   Hs c-myc P1       :+S  EM:X00364.2     1+   2328; 11146.052   010*1
XX
DO   Experimental evidence: 3,3#,6
DO   Expression/Regulation: +mitogen;+IL-2
RF   Cell34:779 PNAS80:6307 MCB7:1393 MCB7:2988
//

RegulonDB
Salgado et al., Nucleic Acids Res. 29, 72-74 (2001)
http://www.cifn.unam.mx/Computational_Genomics/regulondb/

SCPD
Zhu & Zhang, Bioinformatics 15, 607-611 (1999)
http://cgsigma.cshl.org/jian/

PlantCARE
Rombauts et al., Nucleic Acids Res. 27, 295-296 (1999)
http://sphinx.rug.ac.be:8080/PlantCARE/cgi/index.html

[Figure: schematic representation of the "Oligo-capping" method]

TRRD
Kolchanov et al., Nucleic Acids Res. 30, 312-317 (2002)
http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd/

TRANSFAC®
a database on gene transcription regulation
[Figure: relations among the SITE, GENE, FACTOR and MATRIX tables; edge labels: contains, binds to and regulates, is used to construct, encodes for, is an attribute of, interacts]

TRANSFAC structure
[Figure labels: CLASS, SPECIES, FEATURES, SYNONYMS, interacting factor, FACTOR, MATRIX, CELL, METHOD, gene expression, SITE, regulatory region, SEQUENCE, FUNCTIONAL ELEMENT, GENE, coding region]

Manual annotation of the databases: input client
TRANSFAC: FACTOR table, protein sequence
TRANSFAC: FACTOR table, protein domains
TRANSFAC: FACTOR table, structural and functional features
TRANSFAC: FACTOR table, links to other databases
TRANSFAC: classification of transcription factors
TRANSFAC: CLASS table

TRANSFAC 8.1 (2004-03-31): number of factor entries for different species
[Figure: bar chart, y-axis 0 to 1400; categories: human, mouse, rat, other vertebrates, fruit fly, plants, fungi, other]

TRANSFAC 8.1 (2004-03-31): distribution of experimentally known TFBS in 5' regions of genes.
[Figure: histogram of the number of sites (0 to 800) by position relative to the transcription start, from -10000 to +5000]

TRANSFAC: FACTOR table, protein-DNA and protein-protein interactions
TRANSFAC: MATRIX table

TRANSFAC®: accompanying tools
• Patch™: pattern search
• Match™: PWM-based search
Aligned sites shown for both tools (scores q1, q2):
gATTGGCGCGAAGtttt
aCAGGGCGCCAAAcgcg
aTTTCGCGCCAAActtg
aTTTCGCGCCAAActtg
aTTTCGCGCCAAActtg
GGCTGCGGCCAAAtctc
ATCTCCCGCCAGGtcag
aGTTCGCGGGCAAatgc
cTTCGGCGCGCGGtgtt
tTTTCGCGCCAAAgtca
tTTTGCCGCGAAAagac

Selection of DNA binding sites by regulatory proteins
Statistical-mechanical theory, O.G. Berg and P.H. von Hippel
[Figure labels: match, mismatch, mutational drift]
Discrimination energies $\varepsilon_{lB}$ at positions $l = 1, \ldots, s$; the cognate base has $\varepsilon_{l0} = 0$:
Position   1     2    ...
A         0.5   0.0
T         0.9   0.1
G         1.2   0.1
C         0.0   0.8

1) The binding affinity of the protein to DNA lies in some useful range.
2) The number of sequences is large.
3) All possible sequences are equiprobable.
4) $\varepsilon_{lB}$ expresses the decrease in binding energy when the cognate base pair at position $l$ is replaced by $B$.
5) Individual base-pair contributions are independent and therefore additive.

The loss in binding affinity at one position may be compensated at another position. Sites have binding affinity in a limited range $\Delta E$ around a required level $E$. In such a set of sites the local contributions from all positions must sum to $E$:
$$\sum_{l=1}^{s} \varepsilon_{lB_l} = E$$

What is the frequency $f_{lB}$ with which a certain base pair $B$ appears at a certain position $l$ in a site? The same question is asked in statistical mechanics: given $s$ independent particles in a system and a given total energy $E$, what is the probability that particle $l$ will have the energy $\varepsilon_{lB}$?
$$f_{lB}(E) = \frac{1}{4 q_l}\, e^{-\lambda \varepsilon_{lB}}$$
where $q_l$ normalizes the four frequencies at position $l$ to sum to one, and $\lambda$ is determined by the density of potential sites, i.e. by the number of possible sequence combinations that have the required discrimination energy $E$.

From observed frequencies:
$$\lambda \varepsilon_{lB}^{obs} = \ln\left( f_{l0}^{obs} / f_{lB}^{obs} \right)$$
For any sequence $X$ of length $s$ the actual discrimination energy is
$$E(X) = \sum_{l=1}^{s} \varepsilon_{lB_l} = \lambda^{-1} \sum_{l=1}^{s} \ln\left( f_{l0}^{obs} / f_{lB_l}^{obs} \right)$$

Small-sample effect:
$$f_{lB} = \frac{n_{lB} + 1}{N + 4}, \qquad \lambda \varepsilon_{lB} = \ln \frac{n_{l0} + 1}{n_{lB} + 1}$$

Problems:
1. Small sets of sites
2. Homology between sites
3. Specific function of nucleotides in certain positions
4. Correlations between positions (not additive effect)

TFS identification
$$q = \frac{\sum_{i=1}^{L} I(i)\, f_{i,b_i} - \sum_{i=1}^{L} I(i)\, f_i^{min}}{\sum_{i=1}^{L} I(i)\, f_i^{max} - \sum_{i=1}^{L} I(i)\, f_i^{min}}$$
with: $b_i$, the nucleotide found in the i-th position of the test sequence; $f_{i,b_i}$, the frequency of nucleotide $b_i$ in the i-th position of the aligned training sequences; $f_i^{min}$, the minimum frequency in position $i$; $f_i^{max}$, the maximum frequency in position $i$; and
$$I(i) = \sum_{B \in \{A,T,G,C\}} f_{i,B} \ln(4 f_{i,B}), \qquad i = 1, 2, \ldots, L$$

Calculating the Ci-values
$$C_i = \frac{100}{\ln 5} \left( \sum_{B \in \{A,C,G,T,gap\}} P_{i,B} \ln P_{i,B} + \ln 5 \right)$$

Position | counts A C G T gap | P(A) P(C) P(G) P(T) P(gap) | sum P ln P + ln 5 | Ci
1  | 1 1 1 1 1 | 0.2 0.2 0.2 0.2 0.2 | 0.00 | 0
2  | 2 2 1 0 0 | 0.4 0.4 0.2 0 0     | 0.55 | 34
3  | 0 0 0 5 0 | 0 0 0 1 0           | 1.61 | 100
4  | 0 0 5 0 0 | 0 0 1 0 0           | 1.61 | 100
5  | 4 1 0 0 0 | 0.8 0.2 0 0 0       | 1.11 | 69
6  | 0 3 2 0 0 | 0 0.6 0.4 0 0       | 0.94 | 58
7  | 0 0 0 5 0 | 0 0 0 1 0           | 1.61 | 100
8  | 1 4 0 0 0 | 0.2 0.8 0 0 0       | 1.11 | 69
9  | 5 0 0 0 0 | 1 0 0 0 0           | 1.61 | 100
10 | 2 0 3 0 0 | 0.4 0 0.6 0 0       | 0.94 | 58
11 | 2 2 1 0 0 | 0.4 0.4 0.2 0 0     | 0.55 | 34
Per-base terms P ln P: -0.32 for P = 0.2, -0.37 for P = 0.4, -0.31 for P = 0.6, -0.18 for P = 0.8, and 0 for P = 0 or P = 1.

Scoring of the match
To make it fast: preselection with the core, the five consecutive most conserved positions. In the matrix above the core covers positions 3 to 7 (Ci = 100, 100, 69, 58, 100), with core sequence T G A C T.

TRANSFAC: Match™ tool
TRANSFAC: Match™ output

Selection of optimal cut-offs
[Figure: underprediction error, overprediction error and error sum (0 to 100) plotted against the cut-off]
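The Ci table and the similarity score q can be checked with a few lines of code. The following is a minimal sketch in Python that follows the two formulas as written on the slides. It is not the Match implementation: dropping the gap column and renormalising the rows to obtain a four-base frequency matrix is an assumption made here, and the function names are illustrative.

```python
import math

BASES = "ACGT"

# Count matrix from the worked example: one row per position 1..11,
# columns are A, C, G, T, gap.
COUNTS = [
    [1, 1, 1, 1, 1], [2, 2, 1, 0, 0], [0, 0, 0, 5, 0], [0, 0, 5, 0, 0],
    [4, 1, 0, 0, 0], [0, 3, 2, 0, 0], [0, 0, 0, 5, 0], [1, 4, 0, 0, 0],
    [5, 0, 0, 0, 0], [2, 0, 3, 0, 0], [2, 2, 1, 0, 0],
]

def ci_vector(counts):
    """C_i = 100/ln 5 * (sum_B P ln P + ln 5), taking 0 ln 0 as 0."""
    values = []
    for row in counts:
        total = sum(row)
        plogp = sum((c / total) * math.log(c / total) for c in row if c > 0)
        values.append(round(100 / math.log(5) * (plogp + math.log(5))))
    return values

def information(freqs):
    """I(i) = sum_B f ln(4 f) over the four bases, taking 0 ln 0 as 0."""
    return sum(f * math.log(4 * f) for f in freqs if f > 0)

def match_score(site, freq_matrix):
    """Similarity q of a site to a four-column frequency matrix."""
    current = lowest = highest = 0.0
    for base, freqs in zip(site.upper(), freq_matrix):
        info = information(freqs)
        current += info * freqs[BASES.index(base)]
        lowest += info * min(freqs)
        highest += info * max(freqs)
    return (current - lowest) / (highest - lowest)

# Assumption: gap column dropped, remaining counts renormalised per row.
FREQS = [[c / sum(row[:4]) for c in row[:4]] for row in COUNTS]

ci = ci_vector(COUNTS)
best = match_score("AATGACTCAGA", FREQS)   # most frequent base everywhere
worst = match_score("ATAAGAAGCCT", FREQS)  # least frequent base everywhere
```

By construction q is 1 for a site that carries the most frequent nucleotide at every position and 0 for one that carries the least frequent nucleotide everywhere; positions with uniform frequencies have I(i) = 0 and do not contribute.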
Cut-offs marked on the cut-off axis (0.75 to 1): minFN, minSUM, minFP.

Example of a search using cut-offs to minimize false negative matches
In this example we searched the Homo sapiens angiotensinogen gene (5' region and exon 1) for all binding sites listed in the features of its GenBank entry. For that search we used cut-offs to minimize false negative matches, as these cut-offs are recommended to reduce the probability that Match misses a potential binding site. Corresponding hits for all of the entries in the feature table that concern a binding site could be found in the Match output.
[Figure: feature table of the GenBank entry next to the corresponding hits found by Match; output columns: Matrix-Identifier, Position, Core Similarity, Matrix Similarity, Sequence, Factor Name]

TRANSPLORER (TRANScription exPLORER) is a software package for the analysis of transcription regulatory sequences. Currently, the TRANSPLORER site prediction tool uses collections of position weight matrices (PWM). It is able to use several matrix sources: the largest and most up-to-date library of matrices, derived from the TRANSFAC® Professional database, other matrix libraries, and any user-developed matrix libraries. This means that it provides an opportunity to search for a great variety of different transcription factor binding sites. A search can be made using all or subsets of matrices from the libraries.
• Search for most probable binding sites regulating gene expression
• Search for binding sites coinciding with SNPs

TRANSCompel®
a database on composite regulatory elements
Key topics
• pairs of closely situated binding sites for TFs;
• cooperative functioning of transcription factors;
• direct protein-protein interactions;
• combinatorial regulation of gene transcription.

[Figure: individual entry; description of an evidence (experiment, cell type, two individual interactions); link to the TRANSFAC GENE table; link to the EMBL; link to the TRANSFAC FACTOR table]

TRANSCompel®
combinatorial regulation, more than 360 CEs
N. Gene: scheme of CE
1. IgH, Mus musculus: Ets / AP-1
2. IL-2, Homo sapiens: NFAT / AP-1, -283 : -268
3. IL-2, Homo sapiens
4. IL-2, Mus musculus
5. IgH, Homo sapiens
6. Serum amyloid A1, Rattus norv.
7. IRF-1, Mus musculus
Cells of rows 3 to 7 in extraction order: -167 : -142, AP-1; -167 : -142, AP-1, NF-κB; Ets, Oct-2, CBF; -117 : -73, NF-κB, C/EBP; -123 : -113, STAT-1; -49 : -40, NF-κB.

TRANSCompel®
functional classification of the composite elements
inducible/inducible
- Ca2+ and PKC response: NFAT / AP1
- IFN-gamma and TNF-alpha response: NF-kappaB / IRF
inducible/constitutive
- cholesterol level response: SREBP / Sp1
- acute-phase response: STAT-3 / Sp1
inducible/tissue-restricted
- TGF-beta response in B-cells: SMAD / AML
tissue-restricted/tissue-restricted
- pancreas islet beta-cells (insulin-producing): HNF3 / BETA2
- pituitary gonadotropes: Ptx1 / SF-1
tissue-restricted/ubiquitous
- macrophages: PU.1 / Sp1

Inducible/inducible
• 19 CEs ETS / AP-1, providing cross-coupling of Ras/Raf- and PKC-dependent signalling pathways;
• 15 CEs NFATp / AP-1, providing cross-coupling of Ca2+- and PKC-dependent signalling pathways;
• 14 CEs NF-κB / C/EBP: NF-κB is inducible by IL-1 and TNF-α; C/EBP is inducible by IL-6.
[Table: number of CEs for each pair of factor classes (F1 by F2); classes: tissue-specific, inducible, cell-cycle dependent, developmental-stage dependent, ubiquitous constitutive; cell values in extraction order: 32, 44, 119, 1, 2, 39, 2, 3, 60, 2, 12]

Inducible/constitutive
• 9 CEs ETS / Sp1: ETS factors are inducible through the Ras/Raf-dependent signalling pathway;
• 5 CEs Smad / TEF3: Smads are inducible by TGF-β signalling.

Inducible/tissue-restricted
• CEs Pit-1 / AP-1: Pit-1 is a pituitary-restricted transcription factor, whereas AP-1 and Ets are ubiquitous inducible factors.
TRANSCompel®
antagonistic type of CEs

Antagonistic composite elements
COMPEL: C00006, chicken embryonic -globin gene
• Sp1 cooperatively with NF-Y activates transcription in primitive erythroid cells.
• NF-1 represses transcription in adult cells.
GGTGGGcctccggagtgaccaatgagtgTGGACAGATGCCA

COMPEL: C00009, human c-fos protooncogene
• SRF mediates the rapid, transient induction of the c-fos protooncogene by serum growth factors.
• YY1 diminishes both basal and serum-induced expression of c-fos.
acaggaTGTCCATATTAGGacatctgcg

COMPEL: C00054, rat serum amyloid A1 gene
• C/EBP and NF-κB synergistically activate transcription in liver cells during the acute phase response.
• YY1 represses inducible transcription of this gene.
TGGTAGTCTTGCACAGGAAATGACATggtGGGACTTTCCCcaggg

Catch®
pattern-based search for potential composite elements in DNA sequences
• All CEs are used as individual search patterns;
• Several parameters are available to restrict the search: mismatches in site 1 and site 2, distance between the two sites, composite score.

TRANSCompel®
CEs of similar structure can be used to construct models
Set of CEs:
CCACCCATTTCCTC
ACAGGAATgacctggtgcCTCGCCC
TTCCTCctgtgccttag...ctgtttttctaaCCGCCC
GAAGGGCGGGGAcagtt...aagcaaaaAAAGGGAACTGA
AAAGGGAACTGAgtggctgcgaaAGGGTGGGG
GGAAgcaaccagCCCACCA
CCGGAAGCaaccagCCCACC
aaAAGGAAGTGGGCGTGGTttaaag
ACTTCCTC...GGCTCCTCCTCC
Rules:
1. matrix rule: matrices M1 and M2 with scores q_M1 > n1 and q_M2 > n2
2. distance rule
3. orientation rule

Application of CE models for promoter analysis
[Figure: bar chart, "Search for the potential CEs in 100 000 bp", y-axis 0 to 180; sequence sets: promoters -350/+50, exons_3d, h_chr_15 (whole), h_chr_15_Alu, h_chr_15_L1, h_chr_15_L2; CE types: Myb/Aml, NF-kB/Sp1, Ets/Sp1, E2F/Sp1]
Four CE types are over-represented in promoters in comparison with several biological sequences tested.

Gene expression profiling
[Figure labels: GENE ONTOLOGY™, TRANSGENOME]

TRANSGENOME
TRANSGENOME provides the hierarchical structure of the most important elements of a genome, in coding regions as well as in regulatory regions. This structure makes it possible to have a unique reference sequence and to store the location of all gene regulatory and structural elements.
[Figure labels: Genome Reference Sequence, Gene, Regulatory regions, Composite elements, TF binding sites, Repeats, S/MARs, Transcripts, Splicing variants, Polypeptides]
[Figure labels: TRANSFAC derived start of transcription (by relative site positions); RefSeq derived potential starts of transcription (first exons); DBTSS derived start of transcription; EPD derived start of transcription; site; gene; pre-mRNAs (from RefSeq); spliced mRNAs; CDS; 5'UTR; 3'UTR]

Cytomer/Content: Bronchial tree and Intrapulmonary Airways
[Figure: anatomical hierarchy: Human body, Lung, Bronchial tree, Main bronchus, Lobar bronchus, Segmental bronchus, Bronchus, Bronchiolus, Terminal bronchiolus, Respiratory bronchiolus, Alveolar duct, Alveolar sac, Alveolar septa, Pulmonary alveolus, Alveolar pore, Alveolar epithelium, Pneumocytes]

CYTOMER structure
[Figure: database schema; tables and fields: Species (ID, Name); Cell (ID, Name, Description); Organ (ID, Name, Parent); System (ID, Name); Period (ID, T1, T2); Stage (ID, T1, T2, description); Stage2Period (Stage_ID, Period_ID); HUB (ID, Cytomer_no, Organ_ID, Cell_ID, System_ID, Period_ID, Species_ID); CP (ID, TFacc, Cytomer_no, Cacc); CN (ID, TFacc, Cytomer_no, Cacc); Transfac Factor (ID, Acc)]

CYTOMER®
A database on gene expression sources
[Figure labels: UniGene, EST, TRANSGENOME, gene expression groups 1, 2 and 3]

The gene expression space
• Expression space $E_g$ of gene $g$, with axes x (spatial: systems, organs, cells), c (conditional) and t (temporal, developmental)
• Factors controlling transcription: TRANSFAC
• Conditional determinants: TRANSPATH
• Spatio-temporal coordinates: CYTOMER

Gene expression profiling
Expression matrix:
- rows representing genes
- columns representing samples (various tissues, developmental stages, ...)
$$\begin{array}{c|cccc} & E_1 & E_2 & \cdots & E_S \\ \hline g_1 & x_{1,1} & x_{2,1} & \cdots & x_{S,1} \\ g_2 & x_{1,2} & x_{2,2} & \cdots & x_{S,2} \\ \vdots & \vdots & \vdots & & \vdots \\ g_h & x_{1,h} & x_{2,h} & \cdots & x_{S,h} \end{array}$$
The row of $g_1$ is the expression pattern of gene $g_1$; the column of $E_1$ is the expression profile of state $E_1$ (e.g. in organ O at stage t).
[Figure labels: expression state, gene]
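The two readings of the expression matrix, by row and by column, can be made concrete with a small example. The following is a minimal sketch in Python using plain lists; the gene names, state names and values are hypothetical and serve only to show the indexing.

```python
# Hypothetical expression values: rows are genes, columns are states.
genes = ["g1", "g2", "g3"]
states = ["E1", "E2"]
matrix = [
    [5.0, 0.1],  # g1
    [2.0, 2.5],  # g2
    [0.0, 7.3],  # g3
]

def expression_pattern(gene):
    """Row of the matrix: one gene across all states."""
    return dict(zip(states, matrix[genes.index(gene)]))

def expression_profile(state):
    """Column of the matrix: all genes in one state."""
    column = states.index(state)
    return {gene: row[column] for gene, row in zip(genes, matrix)}

pattern_g1 = expression_pattern("g1")
profile_e1 = expression_profile("E1")
```

Comparing rows groups genes with similar expression patterns, while comparing columns groups samples with similar expression profiles.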