Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Plant Physiol. (1998) 117: 1129–1133 Update on Genomics Genome Sequencing and Informatics: New Tools for Biochemical Discoveries Milton H. Saier, Jr.* Department of Biology, University of California at San Diego, La Jolla, California 92093–0116 During the past 3 years, we have experienced a major revolution in the biological sciences resulting from a tremendous flux of information generated by genomesequencing efforts. Our understanding of microorganisms, the metabolic processes they catalyze, the genetic apparatuses encoding cellular proteinaceous constituents, and the pathological conditions caused by these organisms has greatly benefited from the availability of complete microbial genomic sequences. Many research institutes around the world are now devoting their efforts solely to genome sequencing and to analysis of the data produced. Dozens of international conferences have been held with the primary purpose of keeping the scientific community abreast of recent developments. In this Update I will summarize some of the exciting information reported at a recent conference on microbial genomics.1 Compared with microbe genomics, plant genomics is still in its infancy, since the sequencing of only the Arabidopsis genome is currently under way (Bevan et al., 1998). However, within a few years we can expect that the genomic sequences of at least two plants, Arabidopsis and rice (Oryza sativa), will be available. Meanwhile, as of January 1998, 12 microbial genomes have already been sequenced and published. These include genomes of representative organisms from the three domains of life: bacteria, archaea, and eukarya. The genomes of one eukaryote (the brewers’ yeast Saccharomyces cerevisiae) as well as three archaea and eight bacteria have been completely sequenced. By examining what has been learned from sequencing of microbial genomes, we can get some idea about what kinds of information we can expect from plant genome-sequencing efforts. MICROBIAL GENOME SEQUENCING Most of the bacterial genomes that have been sequenced are small, of 2 Mbp or less. However, three large bacterial genomes have been sequenced: those of the prototypic * E-mail [email protected]; fax 1– 619 –534 –7108. 1 The conference, entitled “Microbial Genomics II,” was organized by Claire Fraser of the Institute for Genomic Research in the United States and Bart Barrell of the Sanger Center in England, and took place at Hilton Head, South Carolina, from January 31 to February 3, 1998. The program and abstracts of the meeting are presented in Microbial and Comparative Genomics (1998) 3: 1–96. gram-negative bacterium Escherichia coli (Blattner et al., 1997); the best-characterized gram-positive bacterium, Bacillus subtilis (Kunst et al., 1997); and a representative cyanobacterium, Synechocystis PCC6803 (Kaneko et al., 1996). In addition, six microbial genomes have been completely sequenced but not yet published. Pathogens with fully sequenced genomes include Mycobacterium tuberculosis, the causative agent of tuberculosis; Treponema pallidum, the spirochete that causes syphilis; and Borrelia burgdorferi, another spirochete that causes Lyme disease. Two interesting nonpathogens that have recently been sequenced are Deinococcus radiodurans, the organism reported to be most resistant to UV irradiation, and Aquifex aeolicus, a marine hyperthermophile capable of growth at 95°C. Based on 16S RNA analyses, A. aeolicus may represent the deepest lineage within the bacterial domain, and its genome may therefore provide clues about primitive prokaryotic lifeforms that existed billions of years ago. More than 50 microbial genomes are currently being sequenced. These include the genomes of virtually every major pathogen. Industrially important bacteria such as Clostridium acetobutylicum, a principal producer of organic solvents, are also being sequenced, as is the deep-sea manganese-oxidizing bacterium Shewanella putrefaciens. The sequence of the first animal genome, that of the worm Caenorhabdeitis elegans, is expected to be completed in the middle of this year (July, 1998). Although only 3% of the human genome has been sequenced, this tiny fraction of the human genome represents more DNA than that of all of the completed microbial genomes sequenced to date. Table I summarizes some of the most impressive genome-sequencing efforts completed by the end of 1997. The first organism to have its genome sequenced was Haemophilus influenzae (Fleischmann et al., 1995). It has a genome size of 1.8 Mbp encoding 1743 recognized genes. One laboratory at the Institute for Genomic Research, involving about 40 people, completed the project in 1 year. B. subtilis, with a genome of 4.2 Mbp, was sequenced by an international consortium of 46 laboratories involving about 160 people in 5 years (Kunst et al., 1997). S. cerevisiae possesses about 13 Mbp of DNA, and 12 Mbp of this DNA has been fully sequenced. The remaining 1 Mbp consists of repetitive rDNA (encoding rRNA molecules) representing more than 100 repeats. This repetitive Abbreviation: Mbp, megabase pair. 1129 Downloaded from on August 3, 2017 - Published by www.plantphysiol.org Copyright © 1998 American Society of Plant Biologists. All rights reserved. 1130 Saier Plant Physiol. Vol. 117, 1998 Table I. Representative genome-sequencing efforts Organism Genome Size Genes Laboratories People Years 1743 4100 5900 4300 1 46 96 1 40 160 640 17 1 5 6 10 Mbp H. influenzae B. subtilis S. cerevisiae E. coli 1.8 4.2 12 4.6 DNA was not sequenced as part of the yeast-sequencing effort partly for technical reasons, and partly because it was expected to yield little new information. Ninety-six laboratories and 640 people completed this project in 6 years (Goffeau et al., 1997). The prototypical bacterium E. coli (4.6 Mbp) was sequenced by a single laboratory in an effort involving 17 people, but the project took about 10 years (Blattner et al., 1997). Because of recent technological advances, it is estimated that a microbial genome of about 2 Mbp can now be fully sequenced in 1 year by a single laboratory of 6 people with an expenditure of less than $1 million. GENOME SEQUENCING LEADS TO THE IDENTIFICATION OF NEW PROTEINS Tremendous benefits have resulted from the microbial genome-sequencing efforts completed to date (Table II). For example, important new proteins have been identified. Owen White at the Institute for Genomic Research reported that D. radiodurans is the first nonphotosynthetic organism to be shown to possess the light-sensing protein phytochrome. In D. radiodurans this molecule may function to regulate the synthesis of pigments that protect the organism from irradiation, and it also has a novel type of RecA protein involved in DNA repair-related recombination. Richard Roberts of New England Biolabs analyzed various genomes for DNA restriction-modification systems, enzyme systems that protect organisms from the potentially detrimental effects of foreign DNA (such as that of viruses). It was found that although B. subtilis and E. coli possess 3 to 4 such restriction-modification systems, some pathogenic bacteria have far more. Thus, H. influenzae has 7, Neisseria gonorrhoeae has 18, and Helicobacter pylori, which is a causative agent of peptic ulcers, has 23. The physiological significance attributed to the possession of large numbers of such systems in small-genome bacteria is a point of debate. Entirely new families of protein paralogs (families of proteins arising by gene duplication within a single organism) have been revealed by genome sequencing. For examTable II. Benefits of genomics New proteins identified New protein families discovered New pathways revealed Total metabolic capabilities estimated Virulence factors and mechanisms identified Potential drug targets revealed Gene transfer established ple, Claire Fraser, Sherwood Casjens, and Steven Norris reported that the genomes of the two pathogenic spirochetes T. pallidum and B. burgdorferi both exhibit large families of species-specific paralogs of unknown function. The same has been observed for methanogenic (methaneproducing) archaea. In the case of B. subtilis, a large family of Bacillus-specific protein paralogs have been identified as “Rap” phosphatases that release phosphate from phosphorylated aspartyl residues in response to regulatory proteins. These regulatory proteins control the initiation of sporulation, a gram-positive bacterial cell-differentiation process that leads to the generation of thick-walled, metabolically inert, environmentally resistant spores (Perego et al., 1994, 1996). DISCOVERY OF NEW METABOLIC PATHWAYS Entirely new metabolic pathways have been revealed by genome sequencing. Tyrrell Conway reported the discovery of a previously unrecognized pathway in E. coli for the metabolism of the sugar acid idonate, a compound that had not been known to be a substrate for the growth of this bacterium (C. Bausch, N. Peekhaus, C. Utz, T. Blais, E. Murray, T. Lowary, and T. Conway, unpublished data). Additionally, analysis of several bacterial genomes showed that many bacteria have all of the requisite enzymes of the ribulose-monophosphate–hexulose-monophosphate pathway, a pathway for the interconversion of five- and sixcarbon sugars (Reizer et al., 1997). Such a pathway had been previously demonstrated only in methanogenic bacteria. Finally, Owen White reported evidence that D. radiodurans has a novel pathway for the efficient repair of double-stranded DNA breaks, a fact that explains the ability of this octaploid organism to repair more than 150 double-stranded DNA breaks in its genome without loss of viability (Battista, 1997). The availability of a complete genomic sequence allows one to estimate the total metabolic capability of an organism. For example, knowledge of the complete gene complement permits estimation of the nutrients that can be taken up by the cell. Such knowledge reflects the natural lifestyle of the organism, and hence, ecological deductions and inferences can be made regarding the use of the bacterium for purposes of bioremediation (Clayton et al., 1997; Paulsen et al., 1998). Based on the complement of genes encoded within microbial genomes, Peter Karp of the Pangea Corporation in Oakland, CA, and his co-workers have created “EcoCyc” for E. coli and “HinCyc” for H. influenzae, which are computerized displays of the complete metabolic pathways of Downloaded from on August 3, 2017 - Published by www.plantphysiol.org Copyright © 1998 American Society of Plant Biologists. All rights reserved. Genome Sequencing and Informatics Table III. Technological advances in genomics Chips bearing probes for analysis of complete organismal gene complements Vectors for gene expression Gene knockouts and reporter-gene fusions Novel chips for sequencing these two closely related organisms (Karp et al., 1996). “EcoCyc,” which is available on the World Wide Web, is now being expanded to include transport and regulatory information as well as metabolic reactions. Results reported at the Microbial Genomics II conference clearly suggested that many obligate prokaryotic human parasites have condensed and streamlined their genomes, with the loss of regulatory functions and biosynthetic capabilities. Moreover, although strict human pathogens such as Mycoplasma genitalium, T. pallidum, and B. burgdorferi have adapted to an anaerobic life style using glycolytic sugar metabolism for their primary source of energy, as reported by Claire Fraser, Chlamydia (a common sexually transmitted disease agent [Peeling and Brunham, 1996]) and Rickettsia (the causative agent of Rocky Mountain spotted fever [Andersson and Andersson, 1997]) have adapted to a strictly aerobic life style. These two bacteria have lost their glycolytic enzymes and satisfy their energy needs by metabolizing organic acids via the Krebs cycle and electron flow, as reported by Richard Stephens and Siv Andersson, respectively (unpublished results). COMPARATIVE GENOMICS YIELDS NEW CLUES ABOUT PATHOGENESIS AND HORIZONTAL GENE TRANSFER Virulence factors and novel mechanisms of pathogenesis have been revealed as a result of the development of the new discipline of comparative genomics. Thus, Fred Blattner, who recently sequenced a pathogenic E. coli strain and compared it with the nonpathogenic E. coli K12 strain commonly used for laboratory research, reported that the former bacterium possesses 1.2 Mbp more DNA than the latter, an approximately 20% increase in genomic size. This additional genetic material codes for virulence factorbearing prophage, a virulence plasmid, and a “pathogenicity island” that allows the bacteria to secrete toxic proteins directly from the bacterial cytoplasm into that of the host animal cell (see Groisman and Ochman, 1996). In addition, three new tRNAs, a eukaryotic-type Ser/Thr protein kinase, and a novel iron-transport system, all possibly important for pathogenicity, were revealed. Potential drug targets were identified and studied. Lynn Miesel reported on studies characterizing the targets of the antituberculosis drug isoniazid. This drug appears to inhibit the growth of M. tuberculosis by blocking the function of a biosynthetic enzyme that makes a cell-surface fatty acid called mycolic acid. Isoniazid thereby disrupts the outer protective layer of this fastidious bacterium, rendering it sensitive to host defense mechanisms. Horizontal gene transfer between related gram-negative bacteria could be established using the comparative 1131 genomics approach. Pathogenicity islands, encoding virulence factor genes and the apparatus for their transfer to host cells, have now been identified in many bacteria (Groisman and Ochman, 1996). Fred Blattner noted that 30% of the new DNA present in pathogenic E. coli, but lacking in nonpathogenic E. coli, is shared by Yersinia pestis, the causative agent of plague. GENOMICS, PROTEOMICS, AND BIOINFORMATICS Tremendous technological advances are being made in the related fields of genomics, proteomics, and bioinformatics (Tables III–V). In the area of genomics (the study of an organism’s gene complement; Table III), novel chips bearing oligonucleotide probes for analysis of the expression of the complete complement of genes encoded within the genome of an organism have already been used and the results published (DeRisi et al., 1997; Wodicka et al., 1997). Such a chip costs about $200 and is good for a single information-rich experiment. However, the machine that reads the chip costs about $150,000. Expression vectors, gene knockouts, and reporter-gene constructs for all of the genes in an organism will soon be available for at least two major experimental organisms, the yeast S. cerevisiae, and the sporulating, gram-positive bacterium B. subtilis. Our laboratory has recently taken advantage of such technology to identify and characterize a protein kinase that controls catabolite repression and carbon metabolism in B. subtilis (Reizer et al., 1998). In earlier studies we had purified the protein and determined an N-terminal amino acid sequence, but the availability of this sequence was insufficient to allow us to clone the gene. When the complete genome of B. subtilis became available, its identification became almost automatic. Thus, when expression vectors, knockouts, and fusion constructs become available commercially, the academic molecular biologist will be largely obsolete. In proteomics (the study of an organism’s proteins; Table IV), technological advances are beginning to have an effect on basic research. Tremendous advances have been made in coupling two-dimensional gel analysis of the total protein complement of an organism to analysis by MS or to N-terminal sequence analysis. Chips are being developed for pico-quantity protein purification based on hydrophobic and ion-exchange chromatography as well as adsorption chromatography. Rapid procedures for studying protein-protein interactions are being developed. For example, a yeast two-hybrid system has recently been developed for the assay of whole libraries of genes to determine which of the encoded proteins interact with high affinity with a specific test protein (Williams et al., 1998). Additionally, novel chips for studying protein-ligand interactions are being developed. Table IV. Technological advances in proteomics Two-dimensional gel analysis coupled to MS Chips for full proteome protein purification Protein-protein interaction studies Protein-ligand interaction studies Downloaded from on August 3, 2017 - Published by www.plantphysiol.org Copyright © 1998 American Society of Plant Biologists. All rights reserved. 1132 Saier Finally, in the area of bioinformatics (the computational analysis of genetic and biochemical data; Table V), novel software is being developed for more refined homology searches of the databases, for protein and nucleic acid secondary and tertiary structural predictions, and for protein and DNA motif identification. Many of the programs used routinely in academic laboratories for characterizing families of proteins have been or are now being automated. Such advances will be required to keep up with the exponentially increased volume of sequence data that will be available in the near future. Genome-sequence analyses have revealed that most of the proteins encoded within the genome of a living organism belong to families of homologous proteins that share a common evolutionary origin and are present in many dissimilar organisms. Some of these families are very large, with hundreds of currently sequenced members (Pao et al., 1998). Other families are small, with only a few currently sequenced members (Saier, 1998). By constructing dendrograms (which show approximate relationships of the proteins to each other without providing numerical values for the relative phylogenetic distances that separate them), or instead by constructing phylogenetic trees (which not only cluster the proteins according to their relative degrees of sequence similarity, but also provide quantitative measures of their degrees of relatedness), one can estimate the probability that any two members of a protein family will prove to serve the same function. Phylogenetic trees thus indicate relative degrees of sequence similarity and provide a reliable guide to biochemical function. WHERE IS GENOMICS GOING? Where is the new discipline of genomics taking us? Directly into an exciting and ever-expanding new century of scientific discovery. The current sequencing explosion will lead to the development of novel disciplines such as “comparative genomics” and “molecular archaeology” (Table VI). Because about 20% to 30% of each newly sequenced genome consists of genes encoding proteins with no recognizable homologs in the current databases (Koonin et al., 1997), a tremendous effort must be devoted to the functional identification of these “orphan” proteins. Novel regulatory constraints and virulence mechanisms will be revealed (Ewald, 1996). New kingdoms of currently unculturable microbes (Bloomfield et al., 1998) are likely to be revealed. Discoveries leading to tremendous expansion Table V. Technological advances in informatics Novel software for More refined homology searches Three-dimensional structural predictions Motif identification Repetitive sequence analysis Automation for Data searching Multiple alignment generation Phylogenetic-tree construction Averaging similarity, hydropathy, amphipathicity Plant Physiol. Vol. 117, 1998 Table VI. Microbial genomics: future directions Sequencing explosion: genome sequencing will become the primary source of biological information Comparative genomics: functional assignments in one organism will be directly applicable to many others Molecular archeology: sequence comparisons and three-dimensional structural analyses will allow deduction of evolutionary origins of most macromolecules Functional identification: many new types of genes must be characterized biochemically and physiologically Regulatory constraints: identification of new genes requires an understanding of how their expression is regulated Virulence mechanisms: novel factors and mechanisms concerned with microbial pathogenesis will be discovered Unculturable organisms: microorganisms that cannot be currently cultured in the laboratory will be characterized Industrial applications: industry will use microorganisms to an ever-increasing degree for the production of useful products Novel technological advances: technologies will be developed for understanding and using the genetic information encoded within microbes Unforeseen discoveries: many novel discoveries will be made in almost every imaginable dimension, with far-reaching consequences of industrial applications will be made, and additional novel technological advances will undoubtedly appear in the not-too-distant future. But most importantly, genomics will lead to discoveries currently unforeseen and nearly unfathomable. Even 2 years ago we did not dream that microbial genomes would exhibit the degree of plasticity that has been revealed by genome sequencing, or that the “minimal genome” would be so limited in scope. Living organisms have solved the fundamental problems of life in many diverse ways. It is an exciting time to be working in the biological sciences, and this excitement is largely attributable to developments in genomics. The advances of the future will be limited only by the imaginations of current and future generations of molecular biologists. Received April 5, 1998; accepted April 18, 1998. Copyright Clearance Center: 0032–0889/98/117/1129/05. LITERATURE CITED Andersson JO, Andersson SGE (1997) Genomic rearrangements during evolution of the obligate intracellular parasite Rickettsia prowazekii as inferred from an analysis of 52015 bp nucleotide sequence. Microbiology 143: 2783–2795 Battista JR (1997) Against the odds: the survival strategies of Deinococcus radiodurans. Annu Rev Microbiol 51: 203–224 Bevan M, Bancroft I, Bent E, Love K, Goodman H, Dean C, Bergkamp R, Dirkse W, Van Staveren M, Stiekema W, and others (1998) Analysis of 1.9 Mb of contiguous sequence from chromosome 4 of Arabidopsis thaliana. Nature 391: 485–488 Blattner FR, Plunkett G III, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayrew GF, Downloaded from on August 3, 2017 - Published by www.plantphysiol.org Copyright © 1998 American Society of Plant Biologists. All rights reserved. Genome Sequencing and Informatics and others (1997) The complete genome sequence of Escherichia coli K-12. Science 277: 1453–1474 Bloomfield SF, Stewart GSAB, Dodd CER, Booth IR, Power EGM (1998) The viable but non-culturable phenomenon explained? Microbiology 144: 1–2 Clayton RA, White O, Ketchum KA, Venter JC (1997) The first genome from the third domain of life. Nature 387: 459–462 DeRisi JL, Iyer VR, Brown PO (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680–686 Ewald PW (1996) Guarding against the most dangerous emerging pathogens: insights from evolutionary biology. Emerging Infect Dis 2: 245–257 Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, and others (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512 Goffeau A, Aert R, Agostini-Carbone ML, Ahmed A, Aigle M, Alberghina L, Albermann K, Albers M, Aldea M, Alexandraki D, and others (1997) The yeast genome directory. Nature 387(suppl): 1–105 Groisman EA, Ochman H (1996) Pathogenicity islands: bacterial evolution in quantum leaps. Cell 87: 791–794 Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y, Miyajima N, Hirosawa M, Sugiura M, Sasamoto S, and others (1996) Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res 3: 109–136 Karp PD, Riley M, Paley SM, Pelligrini-Toole A (1996) EcoCyc: an encyclopedia of Escherichia coli genes and metabolism. Nucleic Acids Res 24: 32–39 Koonin EV, Mushegian AR, Galperin MY, Walker DR (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol Microbiol 25: 619–637 Kunst F, Ogasawara N, Moszer I, Albertini AM, Alloni G, Azevedo V, Bertero MG, Bessieres P, Bolotin A, Borchert S, and 1133 others (1997) The complete genome sequence of the Grampositive bacterium Bacillus subtilis. Nature 390: 249–272 Pao SS, Paulsen IT, Saier MH Jr (1998) The major facilitator superfamily. Microbiol Mol Biol Rev 62: 1–32 Paulsen IT, Sliwinski MK, Saier MH Jr (1998) Microbial genome analyses: global comparisons of transport capabilities based on phylogenies, bioenergetics and substrate specificities. J Mol Biol 277: 573–592 Peeling RW, Brunham RC (1996) Chlamydiae as pathogens: new species and new issues. Emerging Infect Dis 2: 307–319 Perego M, Glaser P, Hoch JA (1996) Aspartyl-phosphate phosphatases deactivate the response regulator components of the sporulation signal transduction system in Bacillus subtilis. Mol Microbiol 19: 1151–1157 Perego M, Hanstein C, Welsh KM, Djavakhishvili T, Glaser P, Hoch JA (1994) Multiple protein-aspartate phosphatases provide a mechanism for the integration of diverse signals in the control of development in B. subtilis. Cell 79: 1047–1055 Reizer J, Hoischen C, Titgemeyer F, Rabus R, Stülke J, Rivolta C, Karamata D, Saier MH Jr, Hillen W (1998) A novel protein kinase that controls carbon catabolite repression in bacteria. Mol Microbiol 27: 1157–1169 Reizer J, Reizer A, Saier MH Jr (1997) Is the ribulose monophosphate pathway widely distributed in bacteria? Microbiology 143: 2519–2520 Saier MH Jr (1998) Molecular phylogeny as a basis for the classification of transport proteins from bacteria, archaea and eukarya. In RK Poole, ed, Advances in Microbial Physiology Academic Press, San Diego, CA (in press) Williams JM, Chen G-C, Zhu L, Rest RF (1998) Using the yeast two-hybrid system to identify human epithelial cell proteins that bind gonococcal Opa proteins: intracellular gonococci bind pyruvate kinase via their Opa proteins and require host pyruvate for growth. Mol Microbiol 27: 171–186 Wodicka L, Dong H, Mittmann M, Ho MH, Lockhart DJ (1997) Genome-wide expression monitoring in Saccharomyces cerevisiae. Nature Biotechnol 15: 1359–1367 Downloaded from on August 3, 2017 - Published by www.plantphysiol.org Copyright © 1998 American Society of Plant Biologists. All rights reserved.