Comparative genomics: functional characterization of new genes and regulatory interactions using computer analysis
Mikhail Gelfand
Institute for Information Transmission Problems (The Kharkevich Institute), RAS
Workshop at the Landau Institute of Theoretical Physics, RAS
September 27-28, 2007, Moscow

The genome is deciphered! Is it?
To intercept a message does not mean to understand it
[Figure: fragment of a genome (0.1% of E. coli)]
A typical bacterial genome: several million nucleotides, ~600 through ~9,000 genes (~90% of the genome encodes proteins)

Propaganda
[Figure: log-scale plot for 1982-2000 (x axis: year; y axis: 100 to 10,000,000) comparing sequences in GenBank (~genes) with articles in PubMed (~experiments)]

More propaganda
Most genes will never be studied in experiment
Even in E. coli: only 20-30 new genes per year (hundreds are still uncharacterized)
• “Universally missing genes” – not a single known gene even for ~10% of the reactions of central metabolism. No genes for >40% of reactions overall.
• “Conserved hypothetical genes” (5-15% of any bacterial genome) – essential, but of unknown function.

The local goal: to characterize the genes
• What? – function (rather, role)
• When? – regulation (conditions)
  • gene expression
  • lifetime (mRNA, protein)
• Where? – localization
  • cellular/membrane/secreted
• How? – mechanism of action
  • specificity, regulation (biochemistry)

Propaganda-2: complete genomes
[Figure: bar chart of complete genomes published per year, 1995-2002; annotation: 2007: > 1200 bacterial genomes]

The global goal: to predict the organism’s properties given its genome (plus some additional information, e.g.
the initial state after cell division) and “to understand” the evolution of genomes/organisms

[Figures: Haemophilus influenzae, 1995; Vibrio cholerae, 2000]
[Figure: The metabolic map, the bird’s view]
[Figure: Metabolic pathways, the eagle’s view]
[Figure: A submap (metabolism of arginine and proline)]

Approaches
• Similarity => homology (common origin)
• Homology => common function
• “The Pearson Principle” (after Karl Pearson): important features are conserved
  – functional sites in proteins
  – regulatory (protein-binding) sites in DNA
  – not necessarily sequences:
    • structure of protein and RNA
    • gene localization on chromosomes
    • co-expression of genes
• Allows one to annotate 50-75% of genes in a bacterial genome
• Necessary first step, may be automated (to some extent)

… but not so simple
• Similarity ≠ homology
  – Low-complexity regions, unstructured domains, transmembrane segments and other regions with non-standard amino acid composition
    • The need for correct similarity measures
  – Does homology always follow from structural similarity?
    • What is structural similarity? How can it be measured?
    • Convergent evolution of structures? Independent emergence of folds?
• Homology ≠ same function
  – What is “the same function”?
    • Biochemical details and cellular role

“The Fermi principle” (after Enrico Fermi)
Purely homology-based annotation: boring (nothing radically new)
It turns out, one can predict something completely new
Comparative genomics

Positional clustering
• Genes that are located in immediate proximity tend to be involved in the same metabolic pathway or functional subsystem
  – caused by operon structure, but not only
    • horizontal transfer of loci containing several functionally linked operons
    • compartmentalisation of products in the cytoplasm
  – very weak evidence
    • stronger if observed in many unrelated genomes
• May be measured
  – e.g.
the STRING database/server (P. Bork, EMBL)
  – and other sources

STRING: trpB – positional clusters
[Figure]

Functionally dependent genes tend to cluster on chromosomes in many different organisms
[Figure. Vertical axis: number of gene pairs with association score exceeding a threshold. Control: same graph, random re-labeling of vertices]
More genomes (stronger links) => highly significant clustering

Fusions
• If two (or more) proteins form a single multidomain protein in some organism, they all are likely to be tightly functionally related
• Very useful for the analysis of eukaryotes
• Sometimes useful for the analysis of prokaryotes

STRING: trpB – fusions
[Figure]

Phyletic patterns
• Functionally linked genes tend to occur together
• Enzymes with the same function (isozymes) have complementary phyletic profiles

STRING: trpB – co-occurrence (phyletic patterns)
[Figure]

Phyletic patterns in the Phe/Tyr pathway
[Figure: chorismate biosynthesis pathway (E. coli); labels: shikimate kinase, archaeal shikimate kinase]

Arithmetic of phyletic patterns
Shikimate dehydrogenase (EC 1.1.1.25): AroE, COG0169
    aompkzyqvdrlbcefghsnuj-i-
5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19): AroA, COG0128
    aompkzyqvdrlbcefghsnuj-i-
Chorismate synthase (EC 4.2.3.5): AroC, COG0082
    aompkzyqvdrlbcefghsnuj-i-

Shikimate kinase (EC 2.7.1.71):
    ------yqvdrlbcefghsnuj-i-    Typical (AroK), COG0703
  + aompkz-------------------    Archaeal-type, COG1685
  = aompkzyqvdrlbcefghsnuj-i-    Two forms combined

3-dehydroquinate dehydratase (EC 4.2.1.10):
    aompkzyq---lb-e----n---i-    Class I (AroD), COG0710
  + ------y-vdr-bcefghs-uj---    Class II (AroQ), COG0757
  = aompkzyqvdrlbcefghsnuj-i-    Two forms combined

Distribution of association scores: monotonic for subunits, bimodal for isozymes

Comparative analysis of regulation
• Phylogenetic footprinting: regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions
• Consistency filtering: regulons (sets of coregulated genes) are conserved =>
  – true sites
occur upstream of orthologous genes
  – false sites are scattered at random

Riboflavin (vitamin B2) biosynthesis pathway
[Figure: pathway diagram. Precursors: GTP (from the purine biosynthesis pathway) and ribulose-5-phosphate (from the pentose-phosphate pathway). Enzymes and intermediates shown: GTP cyclohydrolase II; 2,5-diamino-6-hydroxy-4-(5'-phosphoribosylamino)pyrimidine; pyrimidine deaminase; 5-amino-6-(5'-phosphoribosylamino)uracil; pyrimidine reductase; 5-amino-6-(5'-phosphoribitylamino)uracil; 3,4-DHBP synthase; 3,4-dihydroxy-2-butanone-4-phosphate; riboflavin synthase, β-chain; 6,7-dimethyl-8-ribityllumazine; riboflavin synthase, α-chain; riboflavin. Gene labels: ribA, ribB, ribD, ribE, ribG, ribH, ypaA]

5’ UTR regions of riboflavin genes from bacteria
[Figure: multiple alignment of the RFN elements found upstream of riboflavin genes, one row per upstream region, with columns for the conserved helices 1, 2, 2’, 3, the additional stem-loop (Add.), 3’, the variable stem-loop, 4, 4’, 5, 5’ and 1’. Row labels (genome codes): BS BQ BE HD Bam CA DF SA LLX PN TM DR TQ AO DU CAU FN TFU SX BU BPS REU RSO EC TY KP HI VK VC YP AB BP AC Spu PP AU PU PY PA MLO SM BME BS BQ BE CA DF EF LLX LO PN ST MN SA AMI DHA FN GLU]

Conserved secondary structure of the RFN-element
[Figure: consensus secondary structure with helices 1-5, an additional stem-loop and a variable stem-loop]
Capitals: invariant (absolutely conserved) positions.
Lower case letters: strongly conserved positions.
Dashes and stars: obligatory and facultative base pairs.
Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U. N: any nucleotide. X: any nucleotide or deletion.

RFN: the mechanism of regulation
• Transcription attenuation
• Translation attenuation

Early observation: an uncharacterized gene (ypaA) with an upstream RFN element

Phylogenetic tree of RFN-elements (regulation of riboflavin biosynthesis)
[Figure: tree; branch annotations: no riboflavin biosynthesis; duplications]

YpaA a.k.a.
RibU: riboflavin transporter in Gram-positive bacteria
• 5 predicted transmembrane segments => a transporter
• Upstream RFN element (likely co-regulation with riboflavin genes) => transport of riboflavin or a precursor
• S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin pathway => transport of riboflavin
Prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999)
Validation:
• YpaA transports flavins (riboflavin, FMN, FAD):
  – by genetic analysis (Kreneva et al., 2000)
  – by direct measurement (Burgess et al., 2006; Vogl et al., 2007)
• ypaA is regulated by riboflavin:
  – by microarray expression study (Lee et al., 2001)
• … via attenuation of transcription (and to some extent inhibition of translation) (Winkler et al., 2003)

Conserved structures of riboswitches (circled: X-ray)
[Figure: consensus secondary structures of the RFN-element, B12-element, THI-element, LYS-element, S-box and G-box, each drawn from a 5’/3’ base stem with helices labeled P1-P7 and additional (Add) and variable (Var) stem-loops]

Mechanisms
[Figure: alternative RNA conformations with (+) and without (-) the effector. Panel A: the RNA-element, an antiterminator/antisequestor and a regulatory hairpin (terminator of transcription and/or RBS-sequestor) formed by regions 1, 2 and 3 upstream of the genes; the hairpin is followed by a UUUUUUUU run in the case of regulation of transcription and lacks it in the case of regulation of translation. Panels B and C: RNA-element and regulatory hairpin formed by regions 1 and 2]
glmS: ribozyme, cleaves its mRNA (the Breaker group)
THI-box in plants: inhibition of splicing (the Breaker and Hanamoto groups)

Characterized riboswitches (more are predicted)
• RFN. Regulates: riboflavin biosynthesis and transport. Ligand: FMN (flavin mononucleotide). Distribution: Bacillus/Clostridium group, proteobacteria, actinobacteria, other bacteria
• THI. Regulates: biosynthesis and transport of thiamin and related compounds. Ligand: TPP (thiamin pyrophosphate). Distribution: Bacillus/Clostridium group, proteobacteria, actinobacteria, cyanobacteria, other bacteria, archaea (thermoplasmas), plants, fungi
• B12. Regulates: biosynthesis of cobalamin, transport of cobalt, cobalamin-dependent enzymes. Ligand: coenzyme B12 (adenosylcobalamin). Distribution: Bacillus/Clostridium group, proteobacteria, actinobacteria, cyanobacteria, spirochaetes, other bacteria
• S-box, SAM-II, SAM-III. Regulate: metabolism of methionine and cysteine. Ligand: SAM (S-adenosylmethionine). Distribution: Bacillus/Clostridium group and some other bacteria; SAM-II (alpha), SAM-III (Streptococci)
• LYS. Regulates: lysine metabolism. Ligand: lysine. Distribution: Bacillus/Clostridium group, enterobacteria, other bacteria
• G-box. Regulates: metabolism of purines. Ligand: purines. Distribution: Bacillus/Clostridium group and some other bacteria
• glmS (ribozyme). Regulates: synthesis of glucosamine-6-phosphate. Ligand: glucosamine-6-phosphate. Distribution: Bacillus/Clostridium group
• gcvT (tandem). Regulates: catabolism of glycine. Ligand: glycine. Distribution: Bacillus/Clostridium group

Properties of riboswitches
• Direct binding of ligands
• High conservation
  – Including “unpaired” regions: tertiary interactions, ligand binding
• Same structure – different mechanisms: transcription, translation, splicing, (RNA cleavage)
• Distribution in all taxonomic groups
  – diverse bacteria
  – archaea: thermoplasmas
  – eukaryotes: plants and fungi
• Correlation of the mechanism and taxonomy:
  – attenuation of transcription
(anti-anti-terminator) – Bacillus/Clostridium group
  – attenuation of translation (anti-anti-sequestor of translation initiation) – proteobacteria
  – attenuation of translation (direct sequestor of translation initiation) – actinobacteria
• Evolution: horizontal transfer, duplications, lineage-specific loss
• Sometimes very narrow distribution: evolution from scratch?

Conserved signal upstream of nrd genes

Identification of the candidate regulator by the analysis of phyletic patterns
COG1327: the only COG with exactly the same phyletic pattern as the signal
  – “large scale” on the level of major taxa
  – “small scale” within major taxa:
    • absent in small parasites among alpha- and gamma-proteobacteria
    • absent in Desulfovibrio spp. among delta-proteobacteria
    • absent in Nostoc sp. among cyanobacteria
    • absent in Oenococcus and Leuconostoc among Firmicutes
    • present only in Treponema denticola among four spirochetes

COG1327 “Predicted transcriptional regulator, consists of a Zn-ribbon and ATP-cone domains”: regulator of the riboflavin pathway (RibX)?

Additional evidence: co-localization
nrdR is sometimes clustered with nrd genes or with replication genes dnaB, dnaI, polA

Additional evidence: co-regulated genes
In some genomes, candidate NrdR-binding sites are found upstream of other replication-related genes
  – dNTP salvage
  – topoisomerase I, replication initiator dnaA, chromosome partitioning, DNA helicase II

Multiple sites (nrd genes): FNR, DnaA, NrdR

Mode of regulation
• Repressor (overlaps with promoters)
• Co-operative binding:
  – most sites occur in tandem (> 90% of cases)
  – the distance between the copies (centers of palindromes) equals an integer number of DNA turns:
    • mainly (94%) 30-33 bp, in 84% 31-32 bp – 3 turns
    • 21 bp (2 turns) in Vibrio spp.
    • 41-42 bp (4 turns) in some Firmicutes

Experimental validations

Acknowledgements
• Dmitry Rodionov (comparative genomics)
• Andrei Mironov (software)
• Alexei Vitreschak (riboswitches)
• Funding:
  – Howard Hughes Medical Institute
  – Russian Foundation of Basic Research
  – RAS, program “Molecular and Cellular Biology”
  – INTAS
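
A minimal sketch of the phyletic-pattern arithmetic shown earlier for shikimate kinase and 3-dehydroquinate dehydratase. It assumes each pattern is a 25-character string in which a letter marks a genome that has the gene and '-' marks its absence. The pattern strings are copied from the slide; the function names combine and shared are hypothetical and are not taken from any tool named in the talk.

```python
def combine(*patterns: str) -> str:
    """Position-wise union of phyletic patterns ('-' means the gene is absent)."""
    if len({len(p) for p in patterns}) != 1:
        raise ValueError("all patterns must have the same length")
    combined = []
    for column in zip(*patterns):
        letters = {c for c in column if c != "-"}
        if len(letters) > 1:
            # Each column stands for one genome, so its letter must agree.
            raise ValueError(f"conflicting genome codes in one column: {sorted(letters)}")
        combined.append(letters.pop() if letters else "-")
    return "".join(combined)


def shared(a: str, b: str) -> str:
    """Genome codes present in both patterns (empty if strictly complementary)."""
    return "".join(x for x, y in zip(a, b) if x != "-" and y != "-")


# Patterns as read from the slide (the 25-column width is an assumption).
FULL = "aompkzyqvdrlbcefghsnuj-i-"            # AroE, AroA, AroC
AROK = "-" * 6 + "yqvdrlbcefghsnuj-i-"        # typical shikimate kinase
ARCHAEAL = "aompkz" + "-" * 19                # archaeal-type shikimate kinase
AROD = "aompkzyq---lb-e----n---i-"            # class I dehydroquinate dehydratase
AROQ = "------y-vdr-bcefghs-uj---"            # class II dehydroquinate dehydratase

print(combine(AROK, ARCHAEAL))   # aompkzyqvdrlbcefghsnuj-i-
print(combine(AROD, AROQ))       # aompkzyqvdrlbcefghsnuj-i-
print(repr(shared(AROK, ARCHAEAL)), repr(shared(AROD, AROQ)))
```

In both cases the two alternative forms add up to the pattern of the other enzymes of the pathway. The two shikimate kinases share no genome, while the two dehydratase classes co-occur in the genomes coded y, b and e, so complementarity need not be strict.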