Theoretical Population Biology 61, 367–390 (2002)
doi:10.1006/tpbi.2002.1606

Heterogeneity of Genome and Proteome Content in Bacteria, Archaea, and Eukaryotes

Samuel Karlin1, Luciano Brocchieri1, Jonathan Trent2, B. Edwin Blaisdell1, and Jan Mrázek1

1 Department of Mathematics, Stanford University, Stanford, California 94305-2125
E-mail: karlin@math.stanford.edu
2 NASA Ames Research Center, Mail Stop 239-15, Moffett Field, California 94035

Received April 7, 2002

Our analysis compares bacteria, archaea, and eukaryota with respect to a wide assortment of genome and proteome properties. These properties include ribosomal protein gene distributions, chaperone protein contrasts, major variation of transcription/translation factors, genes encoding pathways of energy metabolism, and predicted protein expression levels. Significant differences within and between the three domains of life include protein lengths, information processing procedures, many metabolic and lipid biosynthesis pathways, cellular controls, and regulatory proteins. Differences among genomes are influenced by lifestyle, habitat, physiology, energy sources, and other factors.

0040-5809/02 $35.00 © 2002 Elsevier Science (USA). All rights reserved.

1. INTRODUCTION

Based primarily on rRNA sequence criteria, life has been broadly divided into three domains referred to as Eukaryota, Bacteria, and Archaea, which are believed to reflect phylogenetic relationships (Woese et al., 1990; Brown and Doolittle, 1997). This classification system, however, has not gone uncontested, and many problems concerning inferences of evolutionary relationships from molecular sequence data have been identified (e.g., for different views, see Gupta, 1998, 2000; Poole et al., 1999; Lopez et al., 1999; Forterre and Philippe, 1999; Brocchieri, 2001; Gribaldo and Philippe, 2002). Furthermore, the identity of the three domains and the relationships between them are problematic. For example, the archaea and eukaryota putatively share many genes and proteins involved in information and cellular activity, whereas archaea and bacteria share many morphological structures and "operational" metabolic genes and proteins (Rivera et al., 1998). Other features make the phylogenetic cohesiveness of the three domains uncertain and their classification moot. To investigate the rigor and value of the current classification system, it is, therefore, of interest to catalog important genes and proteins in each domain and to show strong similarities and differences within and between domains.

In this post-genomic era we can, in principle, study the complete inventory of cellular proteins. Five eukaryotic genome sequences are now complete or nearly complete: Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, and Homo sapiens. In addition, more than 50 prokaryotic genome sequences are complete. These include human pathogens and microbes of industrial and commercial value. Our comparisons between the eukaryota, archaea, and bacteria can provide insights into the workings of living cells by illuminating protein properties associated with specific structures and functions that will not only have evolutionary consequences, but also medical and environmental benefits.

2. GENOMIC FEATURES AND ORGANIZATION

It is widely believed that in most prokaryotic organisms during exponential growth, ribosomal proteins (RP), transcription/translation processing factors (TF), and the major chaperone/degradation (CH) genes functioning in protein folding and trafficking are highly expressed. This view is based on two-dimensional polyacrylamide gel electrophoresis and mass spectrometry measurements for ECOLI (for genome species abbreviations, see Table I) (see VanBogelen et al., 1996, 1999), for BACSU (see Hecker and collaborators (e.g., Antlemann et al., 1997)), for METJA (Giometti, Argonne National Labs, pers. comm.), and for DEIRA (see Lipton et al., 2002). The three gene classes RP, CH, and TF record similar codon frequencies, which show high codon biases relative to the average gene, and the codon usage differences among these three gene classes are low (Karlin and Mrázek, 2000). Using these genes as a basis, a gene is Predicted Highly eXpressed (PHX) if its codon usage is rather similar to that of the RP, TF, and CH gene classes and deviates strongly from the average gene of the genome. Using these criteria, PHX genes in most prokaryotic genomes include genes of principal energy metabolism and often genes of amino acid, nucleotide, and fatty acid biosynthesis. In the cyanobacterium SYNY3, the primary genes of photosynthesis are PHX, and in methanogens those essential for methanogenesis are PHX.

TABLE I
List of Species and Abbreviations

Genome                                   Abbreviation (a)

Bacteria
Escherichia coli K12                     ECOLI  (Eco)
Escherichia coli O157                    ECOLI  (EcZ)
Salmonella typhimurium                   SALTY
Pasteurella multocida                    PASMU
Yersinia pestis                          YERPE
Vibrio cholerae                          VIBCH  (Vch)
Xylella fastidiosa                       XYLFA
Haemophilus influenzae                   HAEIN  (Hin)
Buchnera sp. APS                         BUCSP
Pseudomonas aeruginosa                   PSEAE  (Pae)
Neisseria meningitidis                   NEIME  (Nme)
Agrobacterium tumefaciens                AGRTU
Mesorhizobium loti                       MESLO  (Mlo)
Sinorhizobium meliloti                   SINME
Caulobacter crescentus                   CAUCR  (Ccr)
Rhodobacter sphaeroides                  RHOSP
Rickettsia prowazekii                    RICPR  (Rpr)
Helicobacter pylori                      HELPY  (Hpy)
Campylobacter jejuni                     CAMJE  (Cje)
Bacillus subtilis                        BACSU  (Bsu)
Bacillus halodurans                      BACHA
Staphylococcus aureus                    STAAU
Lactococcus lactis                       LACLA  (Lla)
Streptococcus pyogenes                   STRPY  (Spy)
Streptococcus pneumoniae                 STRPN
Listeria monocytogenes                   LISMO
Clostridium acetobutylicum               CLOAC
Mycobacterium leprae                     MYCLE  (Mle)
Mycobacterium tuberculosis               MYCTU  (Mtu)
Mycoplasma genitalium                    MYCGE  (Mge)
Mycoplasma pneumoniae                    MYCPN
Ureaplasma urealyticum                   UREUR  (Uur)
Synechocystis sp.                        SYNY3  (Syn)
Deinococcus radiodurans                  DEIRA  (Dra)
Treponema pallidum                       TREPA  (Tpa)
Borrelia burgdorferi                     BORBU  (Bbu)
Chlamydia trachomatis                    CHLTR  (Ctr)
Chlamydia pneumoniae                     CHLPN  (Cpn)
Chlamydia muridarum                      CHLMU
Aquifex aeolicus                         AQUAE  (Aae)
Thermotoga maritima                      THEMA  (Tma)

Archaea
Halobacterium sp.                        HALSP  (Hbs)
Methanococcus jannaschii                 METJA  (Mja)
Methanococcus voltae                     METVO
Methanosarcina barkeri                   METBA
Methanobacterium thermoautotrophicum     METTH  (Mth)
Archaeoglobus fulgidus                   ARCFU  (Afu)
Pyrococcus horikoshii                    PYRHO  (Pho)
Pyrococcus abyssi                        PYRAB  (Pab)
Thermoplasma acidophilum                 THEAC  (Tac)
Thermoplasma volcanium                   THEVO
Pyrobaculum aerophilum                   PYRAE
Aeropyrum pernix                         AERPE  (Ape)
Sulfolobus solfataricus                  SULSO
Sulfolobus tokodaii                      SULTO

Eukaryotes
Homo sapiens                             HUMAN
Drosophila melanogaster                  DROME
Caenorhabditis elegans                   CAEEL
Saccharomyces cerevisiae                 YEAST
Arabidopsis thaliana                     ARATH
Trichomonas vaginalis                    TRIVA
Giardia lamblia                          GIALA
Entamoeba histolytica                    ENTHI

(a) The SwissProt five-letter abbreviations are used in the text. The three-letter abbreviations are used in some tables.

2.1. Shine–Dalgarno (SD) Sequences and PHX Genes

In prokaryotes, we find a correlation between gene expression levels and the presence of a Shine–Dalgarno (SD) sequence, which plays an important role in translation initiation (see below). In bacterial cells, translation initiation is generally considered the rate-limiting step of translation (reviewed in Gold, 1988; Draper, 1996). Initiation of gene translation in many bacteria involves interactions between a conserved SD sequence upstream of the start codon in the mRNA and an equally conserved anti-SD sequence at the 3′ end of the 16S rRNA.
Not all mRNAs possess a recognizable SD sequence, however. The consensus SD sequence features at its core the purine run AGGAGG or GGAGGA, generally traversing positions −10 to −5 relative to the start codon, and the 16S rRNA gene mainly carries the anti-SD sequence CACCTCCTTTC at its 3′ end (see Ma et al., 2002 for details). We observed that the majority of prokaryotic genomes have at least one copy of the 16S rDNA gene that has the CCTCCT terminal motif. For our purposes, a strong SD sequence refers to the motif GGAG, GAGG, or sometimes AGGA, aligned within positions −10 to −5 from the gene start codon. In several genomes, the proportions of PHX genes and of average-to-low expression genes with a strong SD sequence have been investigated (Ma et al., 2002). As may be expected, the PHX genes, as compared to genes with an average or low expression level, are significantly more likely to possess a strong SD motif. This positive correlation between strong SD signal sequences and high expression genes can be found in almost all bacterial and archaeal genomes, whereas SD sequences do not exist in eukaryotes.

2.2. Unique vs Multiple Origins of Replication

The GC skew (biases in $(G - C)/(G + C)$ counts) shows a strong difference between archaea and bacteria, probably due to the existence of unique vs multiple origins of replication (Mrázek and Karlin, 1998; Frank and Lobry, 1999). There is substantial evidence for a prevalence of G in excess of C in the leading strand relative to the lagging strand in most bacterial genomes. Exceptions include the genomes of SYNY3, DEIRA, THEMA, and all of the archaeal genomes. The counts of $(G - C)/(G + C)$ show a strong difference between archaea and bacteria (e.g., Mrázek and Karlin, 1998; Frank and Lobry, 1999). Eukaryotic chromosomes, including those of YEAST, CAEEL, DROME, HUMAN, and ARATH, show no strand asymmetry.

What are the possible sources of strand composition asymmetry? Lobry (1996a, b) was the first to observe strand compositional asymmetry, which he associates with differences in replication, mutation, and repair biases in the leading vs the lagging strand. Francino and Ochman (1997) emphasize a mutational bias associated with transcription-coupled repair mechanisms and deamination events. Other possible sources of strand asymmetry include enzyme and architectural asymmetries at the replication fork, differences of replication processivity (Kunkel, 1992), intergenic differences in signal or binding sites in the two strands, differences in gene density coupled with amino acid and codon biases (Mrázek and Karlin, 1998), and dNTP-pool fluctuations during the cell cycle (Thomas et al., 1996). Rocha et al. (1999), using a statistical linear discriminant function, observed compositional asymmetries between genes on the leading vs those on the lagging strand at the level of nucleotides, codons, and amino acids. The GC skew switches sign at the origin and terminus of replication in those bacteria possessing a single origin of replication (oriC) that is subject to bidirectional replication. Other factors that may contribute to strand asymmetry include gene function or expression level, operon organization, and differences in single-base or context-dependent mutations. Strand compositional asymmetry, in general, is not apparent in the genomes of organisms known to possess multiple origins of replication distributed, on average, every 50 kb. It may, therefore, be surmised that archaeal genomes, which apparently do not show compositional asymmetry (no GC skew bias), possess multiple replication origins.

2.3. Periodic 30 bp Repeats in Archaea, Thermophiles, and Alkaliphiles

All current archaeal genomes, except Halobacterium sp., contain one or more unusual clusters of 24–32 bp repeat elements, usually in excess of 40 copies, separated by 40–60 bp of nonconserved spacers (see Table II). A similar repeat arrangement is present in the Gram-negative hyperthermophilic bacteria AQUAE and THEMA. The function of these repeats is unknown, although it is tempting to speculate that it is related to the thermophilic lifestyle. However, a corresponding array of repeats is also observed in the genome of BACHA, a mesophilic bacterium characterized as an extreme alkaliphile living optimally at pH ≥ 9.5. Two current mesophilic methanogens, Methanosarcina mazei and Methanosarcina acetivorans, contain repeats of the kind displayed in Table II.

TABLE II
Periodic 24–32 bp Repeats in Prokaryotic Genomes

Columns: Genome; Repeat sequence; Periodicity (bp); Repeat count in the genome (Exact); Repeat count in the genome (Allowing N errors); Number of clusters; Max. number of repeats in a single cluster.

Genome  Repeat sequence                    Periodicity  Exact  Allowing N errors  Clusters  Max. per cluster
SULSO   CTTTCAATTCCTTTTGGGATTAATC          61–66        151    201 (N = 5)        2         103
SULSO   CTTTCAATTCTATAAGAGATTATC           61–66        127    224 (N = 5)        4         96
SULTO   TCTTTCAATTCCTTTTGGGATTCATC         62–66        44     188 (N = 5)        2         113
SULTO   ACTTTCAATTCCATTAAGGATTATC          63–67        57     271 (N = 5)        4         96
PYRAE   GTTTCAACTATCTTTTGATTTCTGG          65–70        43     45 (N = 5)         3         18
PYRAE   GAATCTTCGAGATAGAATTGCAAG           66–69        81     83 (N = 5)         1         81
AERPE   GCATATCCCTAAAGGGAATAGAAAG          63–67        42     42 (N = 5)         1         42
AERPE   GAATCTTCGAGATAGAATTGCAAG           62–69        36     47 (N = 5)         2         25
METJA   RTTAAAATCAGACCGTTTCGGAATGGAAAY     65–73        66     134 (N = 6)        9         25
METTH   GTTAAAATCAGACCAAAATGGGATTGAAAT     65–68        171    171 (N = 6)        2         124
ARCFU   GTTGAAATCAGACCAAAATGGGATTGAAAG     66–69        107    108 (N = 6)        2         60
PYRHO   GTTTCCGTAGAACTAAATAGTGTGGAAAG      65–68        71     111 (N = 6)        3         67
PYRAB   GTTCCAATAAGACTAAAATAGAATTGAAAG     67           47     53 (N = 6)         3         27
PYRFU   GTTCCAATAAGACTAAAATAGAATTGAAAG     66–68        50     206 (N = 6)        7         50
THEAC   GTAAAATAGAACCTTAATAGGATTGAAAG      65–66        46     47 (N = 6)         1         47
THEVO   GTTTAAGATGTACTAGTTAGTATGGAAG       70           33     40 (N = 6)         2         19
THEMA   GTTTCAATAMTTCCTTAGAGGTATGGAAAC     65–67        100    113 (N = 6)        8         41
AQUAE   GTTCCTAATGTACCGTGTGGAGTTGAAAC      65–67        14     25 (N = 6)         5         5
BACHA   GTCGCACTCTTCATGGGTGCGTGGATTGAAAT   65–68        49     95 (N = 6)         5         36

2.4. Representations of Short Palindromes

Archaeal and bacterial genomes tend to underrepresent 4 and 6 bp palindromes (Rocha et al., 2001; see also Karlin et al., 1992), while eukaryal genomes have many of these sufficiently short palindromes. This observation is consistent with the presence of restriction systems in prokaryotic genomes but not in eukaryotes.

2.5. Genome Signature Profiles

Early biochemical experiments measuring nearest-neighbor frequencies established that the set of dinucleotide relative abundances (the so-called dinucleotide bias) is a remarkably stable property of the DNA of an organism (Josse et al., 1961; Russell et al., 1976). We observed that the dinucleotide bias appears to reflect species-specific properties of DNA stacking energies, DNA modification, replication, and repair mechanisms. Dinucleotide biases in a DNA sequence are assessed through the odds ratios $\rho_{XY} = f_{XY}/(f_X f_Y)$, where $f_{XY}$ is the frequency of the dinucleotide XY and $f_X$ is the frequency of the nucleotide X. For double-stranded DNA sequences, a symmetrized version $\{\rho^*_{XY}\}$ is computed from corresponding frequencies of the sequence concatenated with its inverted complementary sequence. Our recent studies have demonstrated that the dinucleotide bias profiles $\{\rho^*_{XY}\}$ evaluated for multiple disjoint 50 kb DNA contigs from the same organism are approximately constant throughout the genome and are generally more similar to each other than they are to those from 50 kb contigs of different organisms (Russell and Subak-Sharpe, 1977; Karlin and Cardon, 1994; Blaisdell et al., 1996; Karlin, 1998). This bias pervades both coding and noncoding DNA (Karlin and Mrázek, 1996).
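The odds ratio and its symmetrized form can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the authors' software: the function names `reverse_complement` and `symmetrized_signature` are hypothetical, the input is assumed to be a non-empty DNA string, positions holding characters other than A, C, G, or T are skipped, and dinucleotides are counted within each strand separately rather than across the junction of the concatenated sequence.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the inverted complementary strand of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def symmetrized_signature(seq: str) -> dict:
    """Symmetrized odds ratios f_XY / (f_X * f_Y) for all 16 dinucleotides.

    Frequencies are pooled over the sequence and its inverted complement.
    Assumption (not from the paper): dinucleotides are counted inside each
    strand separately, so no artificial junction dinucleotide is created.
    """
    seq = seq.upper()
    mono = Counter()
    di = Counter()
    for strand in (seq, reverse_complement(seq)):
        mono.update(base for base in strand if base in "ACGT")
        for i in range(len(strand) - 1):
            pair = strand[i:i + 2]
            if pair[0] in "ACGT" and pair[1] in "ACGT":
                di[pair] += 1
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    signature = {}
    for x, y in product("ACGT", repeat=2):
        f_x = mono[x] / n_mono
        f_y = mono[y] / n_mono
        f_xy = di[x + y] / n_di
        # Undefined when a base is absent from the sequence.
        signature[x + y] = f_xy / (f_x * f_y) if f_x and f_y else float("nan")
    return signature


if __name__ == "__main__":
    profile = symmetrized_signature("ATGCGCGATTAACGGCTAGCTAGGATCCA")
    for dinucleotide in ("CG", "GC", "TA", "AT"):
        print(dinucleotide, round(profile[dinucleotide], 2))
```

Because counts are pooled over both strands, the resulting profile assigns equal values to each dinucleotide and its inverted complement (for example AC and GT), which is why only ten distinct values per genome need to be tabulated.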
Furthermore, related organisms generally have more nearly equal dinucleotide biases than do distantly related organisms. These highly stable DNA doublets suggest that there may be genome-wide factors that limit the compositional and structural patterns of a genomic sequence. In the absence of strong current selection, the dinucleotide compositions should be especially conservative and likely to drift only slowly with time. Dinucleotide relative abundances capture most of the departure from randomness in genome sequences. Overall, the dinucleotide, trinucleotide, and tetranucleotide relative abundances in a genome are highly correlated, indicating that DNA conformational arrangements are principally determined by base-step configurations. On this basis we refer to the profile $\{\rho^*_{XY}\}$ of a given genome as its "genomic signature."

What causes the uniformity of genomic signatures throughout the genome of an organism? A reasonable explanation for this constancy of the genomic signature may be based on the replication and repair machinery of different species, which either preferentially generates or preferentially selects specific dinucleotides in the DNA (Karlin and Burge, 1995; Karlin, 1998). These effects might operate through context-dependent mutation rates and/or DNA modifications and through local DNA structures (base-step conformational tendencies).

2.5.1. Dinucleotide relative abundance extremes in different prokaryotic genomes. Table III presents the $\{\rho^*_{XY}\}$ profiles for a broad collection of complete prokaryotic genomes. The dinucleotide relative abundance of XY is considered underrepresented when $\rho^*_{XY} \le 0.78$ and overrepresented when $\rho^*_{XY} \ge 1.23$ (cf. Karlin, 1998). We observed that the dinucleotide TA is low in about 75% of all prokaryote genomes (only 14/53 genomes (Table III) are in the normal range and the others are low).
Possible reasons for TA underrepresentation may relate to the low thermodynamic stacking energy of TA, which is the lowest among all dinucleotides, to the high degree of degradation of UA dinucleotides by ribonucleases in mRNA (Beutler et al., 1989), and/or to the presence of TA as part of special regulatory signals. In this context, TA underrepresentation may help to avoid inappropriate binding of regulatory factors.

The dinucleotide relative abundance of CG is low in 20/53 genomes, high in 5/53, and normal in 28/53 (Table III). For example, CG is low or underrepresented in all Chlamydia species, whereas CG is high or overrepresented in α- and β-proteobacterial genomes (except CAUCR) and mostly normal in γ-proteobacteria. CG is low in 7/11 archaea, with HALSP high. The dinucleotide GC is high in many γ- and ε-proteobacterial genomes and also in the low G+C Gram-positive LISMO, STAAU, and CLOAC genomes. Underrepresentation of the dinucleotide CG is prominent in vertebrate sequences but not observed in most invertebrates (e.g., in insects and worms); CG is also significantly low in many protist genomes (e.g., Entamoeba histolytica, Dictyostelium discoideum, and Plasmodium falciparum, but normal in Trypanosoma sp. and Tetrahymena sp.) and is also low in animal mitochondrial genomes (Cardon et al., 1994) and in most small eukaryotic viral genomes (Karlin et al., 1994). CG is underrepresented in MYCGE, but not in MYCPN, and tends to be suppressed in low G+C Gram-positive genomes; CG is low in BORBU and in many thermophilic prokaryotes, including METJA, METTH, SULSO, PYRHO, and Thermus sp., but not in the thermophiles PYRAE, AQUAE, or THEMA. Connections have been drawn between CG representation and immune system stimulation; see Krieg (1996) and Krieg et al. (1998). It is clear that mechanisms underlying biased CpG representations vary.

2.5.2. Genome signature differences among sequences.
A measure of the genomic difference between two sequences f and g (from different organisms or from different regions of the same genome) is the dinucleotide average absolute relative abundance difference, calculated as

$$\delta^*(f, g) = \frac{1}{16} \sum_{XY} \left| \rho^*_{XY}(f) - \rho^*_{XY}(g) \right|,$$

where the sum extends over all dinucleotides, and usually $\delta^*$ is averaged between all pairs of 50 kb contigs of the sequences composing f and g. For convenience, we describe levels of $\delta^*$-differences for some examples (all values multiplied by 1000): "close" indicates $\delta^* \le 45$ (e.g., human vs cow, E. coli vs S. typhimurium); "moderately similar" indicates $55 \le \delta^* \le 85$ (e.g., human vs chicken, E. coli vs H. influenzae); "weakly similar" indicates $90 \le \delta^* \le 120$ (e.g., human vs sea urchin, M. genitalium vs M. pneumoniae); "distantly similar" indicates $125 \le \delta^* \le 145$ (e.g., human vs Sulfolobus, M. jannaschii vs M. thermoautotrophicum); "distant" indicates $150 \le \delta^* \le 180$ (e.g., human vs Drosophila, E. coli vs H. pylori); "very distant" indicates $\delta^* \ge 190$ (human vs E. coli, M. jannaschii vs Halobacterium sp.).

TABLE III
Symmetrized Genome Signatures ($\rho^*$ Dinucleotide Relative Abundance Values)

Genome    CG    GC    TA    AT    CC/GG  AA/TT  AC/GT  CA/TG  AG/CT  GA/TC
HUMAN     0.26  1.01  0.73  0.87  1.25   1.13   0.82   1.20   1.18   0.98
DROME     0.93  1.28  0.74  0.97  1.08   1.24   0.84   1.12   0.87   0.90
CAEEL     0.98  1.04  0.61  0.86  1.05   1.28   0.84   1.08   0.90   1.11
YEAST     0.80  1.02  0.77  0.94  1.06   1.14   0.89   1.10   0.99   1.06
ARATH     0.72  0.92  0.75  0.91  1.05   1.13   0.91   1.10   1.03   1.11

PYRAB     0.71  0.89  0.89  0.90  1.22   1.12   0.77   0.85   1.21   1.15
PYRHO     0.61  0.89  0.90  0.92  1.30   1.11   0.73   0.85   1.22   1.13
ARCFU     0.78  1.02  0.61  0.86  1.04   1.21   0.77   1.01   1.17   1.19
THEAC     0.91  1.04  0.82  1.26  1.05   0.97   0.75   1.06   0.97   1.17
THEVO     0.83  1.12  0.93  1.07  1.12   1.04   0.78   0.99   1.05   1.06
METTH     0.51  0.76  0.74  1.13  1.25   0.95   0.85   1.17   1.07   1.14
METJA     0.32  1.13  0.83  0.94  1.38   1.14   0.72   1.03   1.11   1.05
HALSP     1.36  0.94  0.54  0.98  0.78   0.92   1.26   0.92   0.79   1.33
AERPE     0.70  0.96  1.21  0.95  1.14   0.88   0.83   0.90   1.31   1.03
PYRAE     0.97  1.15  1.08  0.93  1.10   1.18   0.83   0.85   1.07   0.91
SULSO     0.67  0.95  1.00  0.95  1.24   1.04   0.85   0.88   1.17   1.05

ECOLI     1.16  1.28  0.75  1.10  0.91   1.21   0.88   1.12   0.82   0.92
HAEIN     1.09  1.43  0.75  0.95  1.01   1.25   0.85   1.12   0.82   0.87
PSEAE     1.10  1.17  0.54  1.17  0.84   1.07   0.86   1.10   1.02   1.10
BUCSP     0.87  1.25  0.85  0.95  1.22   1.14   0.81   1.02   0.94   1.02
VIBCH     1.04  1.30  0.69  0.99  0.88   1.20   0.89   1.17   0.90   0.95
PASMU     1.07  1.30  0.76  0.97  1.01   1.23   0.90   1.14   0.81   0.89
XYLFA     1.01  1.18  0.68  1.13  0.91   1.10   0.96   1.26   0.83   0.94
NEIME     1.31  1.28  0.64  1.04  0.97   1.45   0.84   1.01   0.69   0.90
AGRTU     1.24  1.21  0.47  1.37  0.86   1.26   0.75   1.04   0.82   1.14
AGRTU-L   1.22  1.20  0.48  1.39  0.87   1.24   0.76   1.05   0.82   1.14
CAUCR     1.16  1.11  0.45  1.29  0.85   1.09   0.86   1.01   0.96   1.22
MESLO     1.23  1.21  0.44  1.41  0.81   1.17   0.79   1.05   0.87   1.18
SINME     1.29  1.15  0.47  1.39  0.82   1.18   0.77   0.93   0.89   1.28
SINMEpA   1.23  1.15  0.52  1.28  0.84   1.15   0.81   1.00   0.91   1.22
RICPR     0.77  1.53  0.98  0.98  1.03   1.05   0.86   1.02   1.06   0.91
HELPY     0.93  1.56  0.73  0.86  1.17   1.37   0.67   0.97   0.97   0.87
CAMJE     0.62  1.75  0.77  0.83  1.11   1.25   0.71   1.03   1.09   0.92
CHLTR     0.79  1.12  0.77  0.89  1.01   1.16   0.76   0.96   1.18   1.15
CHLPN     0.73  1.06  0.78  0.88  1.05   1.14   0.77   0.96   1.19   1.15
CHLMU     0.75  1.12  0.75  0.88  1.08   1.19   0.74   0.96   1.15   1.12
BORBU     0.48  1.47  0.77  0.88  1.29   1.22   0.69   1.02   1.07   1.01
TREPA     1.08  1.22  0.74  0.93  0.86   1.18   0.96   1.13   0.94   0.95
DEIRA     1.07  1.16  0.49  0.89  0.87   1.25   0.93   1.12   1.00   1.01
THEMA     0.92  0.69  0.50  0.83  0.99   1.19   0.87   0.97   1.11   1.40
AQUAE     0.87  0.75  0.82  0.66  1.24   1.29   0.89   0.74   1.18   1.12
MYCTU     1.18  1.07  0.57  1.25  0.87   1.06   1.05   1.11   0.79   1.08
MYCLE     1.12  1.07  0.75  1.10  0.88   1.04   1.05   1.14   0.86   1.02
BACSU     1.04  1.27  0.65  1.02  0.97   1.24   0.75   1.08   0.91   1.06
BACHA     1.09  1.08  0.69  0.98  1.00   1.21   0.84   1.03   0.91   1.10
LACLA     0.77  1.19  0.67  0.88  1.05   1.23   0.82   1.13   0.97   1.05
STAAU     0.94  1.25  0.85  1.00  0.95   1.09   0.95   1.18   0.88   0.95
STRPN     0.69  1.05  0.72  0.89  1.03   1.15   0.85   1.10   1.08   1.09
STRPY     0.71  1.19  0.77  0.89  1.04   1.17   0.87   1.12   1.04   0.99
LISMO     1.11  1.28  0.77  0.92  0.99   1.23   0.86   1.04   0.89   0.97
CLOAC     0.45  1.28  0.93  0.95  1.22   1.08   0.81   1.02   1.13   0.97
UREUR     0.88  1.42  0.79  0.93  1.08   1.17   0.89   1.18   0.84   0.94
MYCGE     0.39  1.19  0.75  0.77  1.13   1.23   0.96   1.16   1.06   0.89
MYCPN     0.82  1.14  0.77  0.71  1.12   1.30   1.02   1.08   0.96   0.81
SYNY3     0.75  1.02  0.75  1.00  1.36   1.32   0.79   1.05   0.85   0.86

Note. Significantly underrepresented dinucleotides ($\rho^* \le 0.78$) are shown in red (bold face if $\rho^* \le 0.70$). Significantly overrepresented dinucleotides ($\rho^* \ge 1.23$) are shown in green (bold face if $\rho^* \ge 1.30$).

How do within-species and between-species average $\delta^*$-differences compare? Average within-prokaryotic group $\delta^*$-differences range from 12 to 52 (persistently small), whereas the average between-group $\delta^*$-differences range from 26 to 357 (see Fig. 1).

What are the possible mechanisms underlying the signature determinations? DNA participates in multiple activities including genome replication, repair, and segregation, and provides special sequences for encoding gene products. There are fundamental differences in replication characteristics between Drosophila and mouse (Blumenthal et al., 1974). Drosophila DNA replicates frenetically in the first hour after fertilization, with replication bubbles distributed about every 10 kb. At about 12 h, effective origins are spread to about 50 kb apart. In mouse, the rate of replication appears to be uniform throughout developmental and adult stages. Cell divisions involve DNA stacking on itself and loopouts that need to be judiciously decondensed to undergo segregation. The observed narrow limits to intragenomic heterogeneity putatively correlate with conserved features of DNA structure. The influence of the (double-stranded dinucleotide) base step on DNA conformational preferences is reflected in slide, roll, propeller twist, and helical twist parameters (Calladine and Drew, 1992; Hunter, 1993).
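As a worked illustration of the δ* measure defined at the start of Section 2.5.2, the following Python sketch evaluates it for two signature profiles. It is a simplified sketch under stated assumptions: the helper names `expand_symmetrized` and `delta_star` are hypothetical, and the calculation uses the genome-wide values from the HUMAN and DROME rows of Table III directly instead of averaging over all pairs of 50 kb contigs as the text describes.

```python
DINUCLEOTIDES = [x + y for x in "ACGT" for y in "ACGT"]


def expand_symmetrized(cg, gc, ta, at, cc_gg, aa_tt, ac_gt, ca_tg, ag_ct, ga_tc):
    """Expand the ten distinct symmetrized values into all 16 dinucleotides.

    Argument order follows the Table III columns:
    CG, GC, TA, AT, CC/GG, AA/TT, AC/GT, CA/TG, AG/CT, GA/TC.
    """
    return {
        "CG": cg, "GC": gc, "TA": ta, "AT": at,
        "CC": cc_gg, "GG": cc_gg,
        "AA": aa_tt, "TT": aa_tt,
        "AC": ac_gt, "GT": ac_gt,
        "CA": ca_tg, "TG": ca_tg,
        "AG": ag_ct, "CT": ag_ct,
        "GA": ga_tc, "TC": ga_tc,
    }


def delta_star(sig_f, sig_g):
    """Average absolute difference over the 16 dinucleotide abundances."""
    return sum(abs(sig_f[d] - sig_g[d]) for d in DINUCLEOTIDES) / 16


# Values read from the HUMAN and DROME rows of Table III.
human = expand_symmetrized(0.26, 1.01, 0.73, 0.87, 1.25, 1.13, 0.82, 1.20, 1.18, 0.98)
drome = expand_symmetrized(0.93, 1.28, 0.74, 0.97, 1.08, 1.24, 0.84, 1.12, 0.87, 0.90)

if __name__ == "__main__":
    # Scaled by 1000, following the convention used in the text.
    print(round(1000 * delta_star(human, drome)))  # prints 162
```

Under these simplifying assumptions the human vs Drosophila value comes out near 162, inside the "distant" band (150 to 180) that the text assigns to this pair.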
Calculations and experiments both indicate that the sugar–phosphate backbones are relatively flexible. However, base sequence influences flexural properties of DNA and governs its ability to wrap around histone cores. Moreover, certain base sequences are associated with intrinsic curvature, which can lead to bending and supercoiling. Inappropriate juxtaposition or distribution of purine and pyrimidine bases could engender steric clashes (Calladine and Drew, 1992; Hunter, 1993). For example, transient misalignment during replication is associated with structural alterations of the backbone in alternating purine–pyrimidine sequences. On the other hand, purine and pyrimidine tracts have fewer steric conflicts between neighbors (Kunkel, 1992). Dinucleotide relative abundance deviations putatively reflect duplex curvature, supercoiling, and other higher-order DNA structural features. Many DNA repair enzymes recognize shapes or lesions in DNA structures more than specific sequences (Echols and Goodman, 1991). DNA structures may be crucial in modulating processes of replication and repair. Nucleosome positioning, interactions with DNA-binding proteins, and ribosomal binding of mRNA appear to be strongly affected by dinucleotide arrangements (Calladine and Drew, 1992).

A central unresolved problem concerns whether archaea are monophyletic or polyphyletic. On the basis of rRNA gene comparisons, the archaea are deemed monophyletic (Woese et al., 1990). This conclusion is supported by some protein comparisons, e.g., for the archaeal RecA-like sequences of Rad 50/Dmc1/RadA (Sandler et al., 1996) and the elongation factor EF-1α and EF-G families (Creti et al., 1994). However, many protein comparisons challenge the monophyletic character of the archaea. For example, bacterial relationships based on comparisons among the HSP 70 kD sequences place the Halobacteria closer to the Streptomyces than to archaeal or eukaryotic species (Karlin and Cardon, 1994).
Other examples are glutamate dehydrogenase and glutamine synthetase (Benachenhou-Lahfa et al., 1994; Brown et al., 1994). Lake and collaborators split the prokaryotes into eubacteria, euryarchaea, and eocytes. With respect to genomic signature comparisons, the closest to HALSP are Streptomyces sequences ($\delta^*$-differences about 85), and next, but twice as distant, are M. tuberculosis and M. leprae sequences: $\delta^*$(HALSP, MYCTU) = 145; $\delta^*$(HALSP, MYCLE) = 155. The $\delta^*$-differences of HALSP to the archaeal sequences of Sulfolobus sp. and M. jannaschii are very distant: $\delta^* > 280$ and $> 340$, respectively. Sulfolobus sp. sequences are moderately similar only to Clostridium acetobutylicum, $\delta^* = 87$. All other comparisons with Sulfolobus sp. have $\delta^*$ values $> 120$ and mostly $> 180$. $\delta^*$-differences of HALSP from other archaea exceed 245. Thus, a coherent description for the archaea is not supported by $\delta^*$-difference data.

[FIG. 1]

Should rickettsial sequences be grouped with α-proteobacteria? The classical α-types consist of two major subgroups: A1, including Rhizobium sp. and Agrobacterium tumefaciens, and A2, including R. capsulatus and R. sphaeroides, found predominantly in soil and/or marine habitats, the latter capable of anoxygenic photosynthesis. A third tentative group, A3, includes the Rickettsial and Ehrlichial clades (obligate intracellular parasites). The genome signature sequence comparisons indicate pronounced discrepancies between A1 and A2 vs A3. First, the A1 and A2 genomes are pervasively of high G+C content (generally ≥ 60%), whereas A3 genomes are of low G+C content (≤ 35%). Second, the $\delta^*$-differences among A1 sequences are 35–63 and among the A2 sequences are 65–90. The $\delta^*$-differences between A1 and A2 sequences have the range 74–102.
By contrast, the RICPR genomes compared to A1 and A2 show excessive $\delta^*$-differences, generally $> 200$.

Some additional challenging observations based on signature differences are:
(i) All Chlamydia genomes are close in genome signature to the ARCFU genome.
(ii) The enterobacteria ECOLI, HAEIN, VIBCH, and PASMU are intriguingly moderately similar to the Drosophila genome.
(iii) In terms of $\delta^*$-differences, the three cyanobacterial DNA sequences are not close. The cyanobacteria Synechocystis, Synechococcus, and Anabaena do not form a coherent group and are as far from each other as general Gram-negative sequences are from Gram-positive sequences.
(iv) There is no consistent pattern of $\delta^*$-differences among thermophiles. More generally, grouping of prokaryotes by environmental criteria (e.g., habitat properties, osmolarity tolerance, chemical conditions) reveals no correlations in genomic signature.

2.5.3. Genome signature comparisons among eukaryotes. The most homogeneous genomes occur among fungi, especially for S. cerevisiae, whereas the most variable genomes are found among protists. The distribution of the $\delta^*$-differences between human and mouse sequence samples is only slightly shifted relative to $\delta^*$-differences within human sequence samples, reflecting moderate similarity of human and mouse. On the other hand, the $\delta^*$-differences between human and S. cerevisiae and between human and D. melanogaster are substantially higher than all within-species $\delta^*$-differences.

3. PROTEOMIC FEATURES

3.1. Chaperone Protein Contrasts

Molecular chaperone systems have evolved in all three domains of life or originated in a common ancestor. Chaperones are considered to play pivotal roles in protein folding, degradation of misfolded proteins, proteolysis, and translocation across membranes. Specialized complex structures in cells often need their own "dedicated" chaperones (e.g., Kuehn et al., 1993).
Among the top PHX genes in most bacterial genomes are those for the major chaperone proteins, DnaK (HSP70) and GroEL (HSP60). The HSP60 chaperonin complex is considered to assist protein folding by providing a cavity in which non-native polypeptides are enclosed and thereby protected against intermolecular aggregation (for a review, see Hartl and Hayer-Hartl, 2002). The ATP-regulated DnaK together with its cofactors DnaJ and GrpE and the ATP-independent trigger factor (Tig) are reported to act co-translationally to assist in protein folding. Tig and DnaK are proposed to cooperate in the folding of newly synthesized proteins (Teter et al., 1999). Simultaneous deletion of both Tig and DnaK is lethal under normal growth conditions (Deuerling et al., 1999). Tig is broadly PHX for bacterial genomes but is not found in archaeal or eukaryotic genomes.

HSP70 is abundant in many eukaryota and bacteria, often with multiple copies of the gene, but the gene is not present at all in some archaea. In particular, the HSP70 gene is absent from METJA, ARCFU, PYRAB, PYRHO, PYRAE, SULSO, and AERPE but present in the archaea METTH, THEAC, and HALSP, where it may have been acquired by lateral transfer. All archaea and eukaryota apparently contain the molecular chaperone prefoldin or GimC (genes involved in microtubule biogenesis), which is absent from bacteria. GimC is believed to perform HSP70-like functions although there is no sequence similarity between GimC and HSP70 (Siegert et al., 2000).

The chaperonin and its co-chaperonin (GroEL/GroES) are seen to be highly expressed in virtually all bacterial genomes (see Table IV) (Houry et al., 1999), but are absent in the mycoplasma UREUR, which lacks both GroEL and GroES. GroES is not found in archaea. The archaeal thermosome (a distant homolog of GroEL) is highly expressed in archaea (Karlin and Mrázek, 2000) and is more closely related to the eukaryotic protein TCP1, now referred to as TriC or CCT.
The HSP60s in all three domains are purified from cells as double-ring complexes. In bacteria, each ring of GroEL contains seven HSP60 subunits, while in archaea, each ring contains eight or nine HSP60 subunits. The subunits comprising the ring may be identical or up to eight different, but closely related, HSP60s. In most bacteria, the subunits are identical, except for rhizobium α-proteobacteria where there are two subunits. In some archaea rings are formed from identical subunits, while in others there are two subunits, and so far only among the Sulfolobus sp. are there three subunits. Yeast contains at least 11 distinctive TriC genes. It is believed that the eukaryotic ring structures contain six to nine different subunits, but it is not yet clear how the different protein subunits are arranged. Duplicated HSP60 sequences stand out among the classical α-proteobacteria, in contrast to no duplications of HSP60 in all other proteobacterial clades. Multiple HSP60 sequences also exist in cyanobacteria, in Chlamydia, and in high G+C Gram-positive bacteria, but not in RICPR. Many a-mitochondrial eukaryotes (TRIVA, GIALA, ENTHI) contain two or more HSP60. Plastids carry multiple copies of HSP60 that bind Rubisco.

The bacterial thioredoxin (TrxA) facilitates protein folding by catalyzing the formation or disruption of disulfide bonds (Powis and Montfort, 2001; Ritz and Beckwith, 2001). The eukaryotic functional analog of thioredoxin is protein disulfide isomerase, operating in the endoplasmic reticulum. The highest expression levels for thioredoxin occur in BACSU and then in other fast-growing bacteria in the order DEIRA, VIBCH, HAEIN, and ECOLI (data not shown).

Other chaperones are also distributed unevenly through the three domains. HSP90 exists in many bacteria and eukaryotes but is not found in archaea. Peptidyl-prolyl cis–trans isomerases (PPIases) accelerate the proper folding of proteins by promoting the cis–trans isomerization of imide bonds in proline within oligopeptides.
Tig exhibits PPIase activity in vitro. ECOLI has at least nine PPIases defined by sequence similarity. One of these, the survival protein SurA, enhances the folding of periplasmic and outer membrane proteins. As expected, SurA does not exist in Gram-positive bacteria. DegP is a chaperone folding factor that is significantly PHX and acts primarily in degrading misfolded proteins in the periplasm. Also associated with periplasmic and cytoplasmic chaperones are several PPIases and the disulfide oxidase DsbA.

GroEL (and thermosomes in archaea) is PHX (expression level E(g) ≥ 1) in almost all prokaryotic genomes, as displayed in Table IV. Many HSP60 family members are among the top PHX, with E(g) values often exceeding 2.00.

TABLE IV
Chaperonin (HSP60) Expression Levels

Proteobacteria
γ-clade: ECOLI, E(g) = 2.09; SALTY, 2.35; VIBCH, 1.31; HAEIN, 1.47; PSEAE, 1.37; YERPE, 2.08; BUCSP, 1.13
β-clade: NEIME, 1.15
α-clade: AGRTU, 2.07; SINME (5 copies), 1.66, 0.60, 1.67, 0.66, 1.13; MESLO (5 copies), 1.51, 1.96, 1.87, 1.73, 2.03; CAUCR, 1.78; RICPR, 1.01
ε-clade: HELPY, 1.17; CAMJE, 1.43
Only α-types among proteobacteria carry multiple copies of GroEL and many are very highly expressed.

Other Gram-negative: DEIRA, 2.35; cyanobacteria SYNY3 (2 paralogs), 1.51, 1.47; Nostoc sp., 1.59, 1.38
Small pathogenic: CHLTR (2 copies), 0.87, 1.16; CHLPN (3), 0.89, 0.73, 1.18; TREPA, 1.27; BORBU, 1.13; MYCGE, 0.82. UREUR has no GroEL.

Gram-positive
Low Gram+: LISMO, 1.89; LACLA, 1.23; STRPY, 0.86; STRPN, 1.08; BACSU, 1.87; BACHA, 1.79
High Gram+: MYCTU (2 copies), 1.39, 0.95; MYCLE (2), 1.60, 1.30

Thermosomes in archaea are generally structured from two rings of α and β units. All units are PHX.
HALSP, β 1.20, α 1.21; ARCFU, β 1.35, α 1.49; METJA (a single unit), 1.56; METTH, 1.33, 1.38; THEAC, α 1.07, β 1.18; PYRHO (only 1 unit), 1.27; PYRAB (only 1 unit), 1.40; PYRAE, 1.48, 1.63; AERPE, 1.14, 1.21; SULSO (3 units α, β, γ), 1.25, 1.37, 1.03

E(g) refers to the formal predicted expression level (Karlin and Mrázek, 2000).

3.2. Ribosomal Protein (RP) Gene Distribution

Many RP genes have diverged between most archaeal and bacterial genomes (see COGs database at NCBI, Bethesda; Tatusov et al., 2001). Most bacterial genomes have an RP operon or cluster, accounting for 20–40% of all RP genes, located close to the origin of replication. Many genes involved in protein synthesis, including tuf, fus, rpoA, rpoB, and rpoC, are encoded within or proximal to the large RP cluster in bacteria but not in archaea. Archaeal genomes, apparently lacking a unique origin of replication, contain approximately the same numbers of RP genes (range 50–65) as bacterial genomes but possess a less significant operon than bacterial genomes. In contrast, the RP genes of yeast (and of higher eukaryotes) are randomly dispersed throughout their genome.

A “giant” RP gene (labeled S1), commonly exceeding 500 aa in length, is essential in bacterial genomes (with the exception of Mycoplasma), where it is encoded away from the RP cluster, but is missing from archaeal and eukaryotic genomes. S1 is overall acidic and binds weakly and reversibly to the small subunit of the ribosome, whereas most other RPs bind strongly (Sengupta et al., 2001). S1 has a high affinity for interaction with mRNA chains. Protein S1 is the largest RP present in the ribosome of most bacterial cells and consists of multiple tandem structural motifs, each of about 70 aa in length. The S1 protein is reported to be necessary in many cases for translation initiation and translation elongation and is directly involved in the process of mRNA recognition and binding. S1 can facilitate binding of mRNA that lacks a strong Shine–Dalgarno sequence. S1 is not encoded near any RP operon and generally is found among the highest expression levels. S1 is also encoded by the deeply branching Gram-negative AQUAE and THEMA, the latter allowing for a frame shift.
The 820 aa S1 protein of THEMA can be recognized as a fusion of cytidylate kinase (contributing to pyrimidine biosynthesis) with a standard S1. The S1 proteins of Bacillus genomes (BACSU, BACHA) and of most low G+C Gram-positive genomes are generally of reduced size (in the range 380–407 aa).

Bacterial and archaeal genomes encode about 50–65 RPs (to date, the highest number among prokaryotic genomes is 65 in SULSO), whereas metazoan eukaryotes invariably have 78 or 79 RP components (Warner, 1999). The situation for protozoa may be different. Ribosomal proteins are generally cationic (mostly > 20% cationic residues). Three acidic RPs are found in eukaryotes, P0, P1, and P2, each containing a carboxyl hyperacidic residue run. Of these, only P0 is present in archaeal genomes. P0, P1, and P2 are considered to play an important regulatory role in the initiation step of eukaryotic protein translation. Acidic RPs are not present in bacterial genomes, except for S1 and L7/L12. L7/L12, as with P0, P1, and P2, is thought to act in adapting mRNA chains to the ribosome.

3.3. Special Transcription/Translation/Replication Factors

The special eukaryotic ancillary replication protein PCNA is extant in most archaea and eukaryotes but is not found in bacteria. Actually, there are multiple copies of the PCNA gene in the crenarchaeal genomes of SULSO, PYRAE, and AERPE. Translation elongation factors (e.g., Tuf, Fus) occur as single genes in archaea but generally appear as multiple highly expressed genes in α-, β-, and γ-proteobacteria. The ribosome release elongation factor Rrf is found in most bacteria and in yeast, but is missing from archaea. The helicase protein RecG, which helps facilitate branch migration of the Holliday junction, is widespread in bacteria but seemingly absent from archaea (Suyama and Bork, 2001).

3.4. Origin and Function of Membrane Lipids

All three domains contain polyisoprenes, but eukaryotes use significant amounts of sterols not present in either bacteria or archaea. Membranes of Gram-negative bacteria and eukaryotes are replete with phospholipid and lipid-modified proteins, whereas archaea generally emphasize prenylated ether lipids but apparently make no fatty acids (Hayes, 2000). Lipopolysaccharide biosynthesis genes of anomalous codon usage, which encode a hierarchy of surface antigen proteins (the Lps family) and often occur in clusters, are present in many bacterial and in archaeal genomes but never in Gram-positive organisms and apparently are not present in eukaryotes. The Lps biosynthesis genes generally indicate a putative alien gene cluster as characterized in Karlin (2001). The lipid A anchor (connecting sugar and lipid moieties) prominent in ECOLI and SALTY appears to be missing from Gram-positive and archaeal genomes. This phenomenon may be related to the fact that the enzymatic apparatus for lipid synthesis appears to be much reduced or nonexistent in most archaeal genomes. For example, FabB, FabD, and AcpP are not found in the archaea.

3.5. Nitrogen Fixation (Nif)

Nif genes are present in several bacteria and archaea but apparently not in eukaryotes. The glnB family of nitrogen sensory–regulatory genes is widespread in bacteria and archaea. Nif in archaea is evolutionarily related to Nif genes in bacteria and operates by the same fundamental mechanism (Leigh, 2000). It is proposed that some genes of this kind wander about via lateral transfer (e.g., as occurs in Klebsiella). Some Nif genes are found in AQUAE, ARCFU, CHLTR, CHLPN, DEIRA, HAEIN, HELPY, METTH, NEIME, SYNY3, TREPA, VIBCH, and SINME. For example, the predominant nitrogenases in methanogens seem to be molybdenum (cofactor) nitrogenases, as is the case in bacteria. The methanogens vary with respect to nitrogen fixation.
For example, neither METJA nor METVO fixes nitrogen while Methanosarcina barkeri and Methanococcus thermolithotrophicus both do (Leigh, 2000).

4. METABOLIC PATHWAYS AND SOME PROTEIN CLASSES

We describe in Tables V–VIII the status of genes of several pathways among archaeal and bacterial species, emphasizing the presence, absence, and expression levels of these genes. (E(g) signifies the formal predicted expression level; see Karlin and Mrázek (2000).)

4.1. Glycolysis (Table V)

Hexokinase (Hxk) and glucokinase (Glk) are prominent glycolysis proteins in eukaryotes, but the former is not found in most prokaryotes nor in any archaeal sequences to date. Only TREPA contains a hexokinase homolog of low expression level. In glycolysis, hexokinase converts glucose to glucose-6-phosphate. However, glucose-6-phosphate arises from other hexoses and from glucose transported into the cell via the phosphotransferase system (PTS). Glucokinase occurs in many bacteria but normally at low to moderate expression levels. Bacteria which rely on carbohydrates as a primary energy source (STRPY, LACLA, BACSU, ECOLI, VIBCH) use the PTS to transport glucose into the cytoplasm, which concomitantly phosphorylates glucose, making Hxk/Glk expendable. PTS genes are present but generally not PHX in PSEAE, HAEIN, NEIME, MESLO, CAUCR, CHLTR, CHLPN, TREPA, and MYCGE. PTS genes are absent from other currently sequenced bacteria, from all currently sequenced archaea, and from yeast. Bacteria mostly execute complete glycolysis pathways (apart from Hxk/Glk), and glycolysis enzymes tend to be PHX, with very high expression levels prevailing in yeast, LACLA, STRPY, and ECOLI.

The human obligate intracellular parasite RICPR is not able to metabolize glucose (Winkler and Daugherty, 1986). However, RICPR contains five ATP–ADP exchange translocase genes. These antiporters take ATP from host cytoplasmic sources and release ADP from the bacterial cell; the standard mitochondrial exchange is reversed.
The ATP–ADP translocase is very uncommon among bacteria and has been identified only in Chlamydia and Rickettsia and in an assortment of plant plastids.

The glycolysis genes are PHX in all fast-growing bacteria, with high expression levels for most of them (Karlin et al., 2001). Glycolysis genes in archaea are either not PHX or entirely missing. For example, glucose-phosphate isomerase is missing from the archaeon ARCFU, from Pyrococcus species, as well as from bacteria in the Mycoplasma group. Phosphofructokinase is absent from archaea and several proteobacteria (see Table V). There are two types of fructose bisphosphate aldolase proteins. The class-II Fba is present in all bacteria (except RICPR, HELPY, CAMJE, UREUR, and Chlamydia) and also in yeast but has no homologs in archaea. Chlamydia possess a class-I Fba, which also has homologs in most archaea, in ECOLI, in MESLO, and in AQUAE. Phosphoglycerate mutase (Gpm) is present in all bacteria (except RICPR) and in yeast, but only HALSP and THEAC carry Gpm homologs among archaea. Apart from Rickettsia, the glycolysis enzymes triosephosphate isomerase (Tpi), glyceraldehyde-3-phosphate dehydrogenase (Gap), phosphoglycerate kinase (Pgk), enolase (Eno), and pyruvate kinase (Pyk) are widespread in all prokaryotes. The multifunctional enzyme Gap is missing from UREUR, and pyruvate kinase (Pyk) is missing from ARCFU, METTH, AQUAE, HELPY, and TREPA.

Excluding hexokinase (Hxk) and glucokinase (Glk), there are 10 major glycolysis genes (Table V). Bacterial genomes with at least six PHX glycolysis genes include LACLA (9), STRPY (9), BACSU (7), SYNY3 (6), LISMO (8), ECOLI (10), VIBCH (8), HAEIN (7), and MESLO (6). There are no archaeal genomes with more than three (mostly one) glycolysis genes PHX.

4.2. Aminoacyl-tRNA Synthetases (Table VI, aaRS)

The picture of aaRS proteins has become rather complex (Handy and Doolittle, 1999; Wolf et al., 1999).
The existence of two classes of aaRS proteins has been firmly established during the past decade. Evidence comes from X-ray structural data, sequence comparisons, and enzymatic mechanisms. The corresponding amino acids divide into two sets of residues: Leu, Ile, Met, Val, Tyr, Gln, Glu, Arg, Cys, and Trp constitute Class I, whereas Class II amino acids are Gly, Ala, Ser, Asp, Asn, Lys, His, Pro, Thr, and Phe. Most aaRS are present in all genomes. However, glutaminyl-tRNA synthetase (GlnS) is missing from all archaea and most bacteria. In fact, it is present only in γ-proteobacteria and DEIRA. In other species, GlnS is complemented by GluS and an amidotransferase. Asparaginyl-tRNA synthetase (AsnS) is absent from many prokaryotic genomes. Among archaea, AsnS is found in both Thermoplasma and both Pyrococcus sequences. Among bacteria, AsnS occurs in several γ-proteobacteria, in several low G+C Gram-positive bacteria, in spirochetes, and in mycoplasmas. Glycyl-tRNA synthetase in many bacteria is composed of two subunits, GlyS and GlyQ, whereas the archaea have a single unit, GRS1. Interestingly, DEIRA, mycobacteria, spirochetes, and mycoplasmas also have the archaeal GRS1 instead of the bacterial type. Analogously, for lysyl-tRNA synthetase, the Class II enzyme (LysU) occurs in most bacteria and, notably, in yeast, whereas the Class I enzyme (LysS) is found in all archaea, in α-proteobacteria (including RICPR), and in spirochetes.

Among bacteria, the number of PHX aaRS varies from zero in DEIRA, PSEAE, and several other species to 19 in ECOLI. Archaea are incongruent, with no aaRS reaching PHX status in METJA but 13 PHX in AERPE. In the yeast genome, 13 aaRS are PHX. Most yeast aminoacyl-tRNA synthetases occur in two copies, with a PHX nuclear version and a mitochondrial version of relatively low expression level.

4.3. TCA Cycle Genes (Table VII)

The TCA cycle, apart from production of energy, can contribute in myriad ways to cellular needs, especially in making precursors and intermediates to macromolecules, e.g., in amino acid, vitamin, and heme biosyntheses. The order of actions in the TCA cycle is: citrate synthase (GltA), aconitate hydratase (AcnA/AcnB), isocitrate dehydrogenase (Icd), 2-oxoglutarate dehydrogenase (SucA), succinyl coenzyme A (succinyl-CoA) synthetase (SucD and SucC), succinate dehydrogenase (SdhB, SdhC, and SdhD), fumarate hydratase (FumA or FumC), and malate dehydrogenase (Mdh/CitH).

TABLE V
E(g) Values (Multiplied by 100) for Glycolysis Genes

Archaea Gene Afu Hbs Euk Mja Mth Tac Pho Pab Ape Sce
Hxk - - - - - - - - 250 174
Glk - - - - - - - - -
Bacteria Aae Tma Dra Mtu Mle Lla Spy Bsu Syn Eco EcZ Pae Vch Hin Nme Hpy Cje Mlo Ccr Rpr Ctr Cpn Tpa Bbu Uur Mge
- - - - - - - - - - - - - - - - - - - - - - 84 - - - - - - - - - - - 87 49 58 - - - 55 92 - 82 65 64 - - - - - - - 99 94 97 77 - 93 75 74 83 - -
Pgi - 80 56 83 97 - - 93 260 73 101 66 94 81 213 157 92 79 141 136 60 67 60 46 63
PfkA - - - - - - - - 213 249 89 97 96 99 105 91 176 143 89 106 59 186 198 - 101 109 - - - 93 - - 76 92 82 102 83 82 119 109 84 104
Fba - - - - - - - - 227 82 101 136 119 99 208 207 199 59 127 238 47 39 233 61 etc. 103 188 146 86 92 89 106 99 - - - 100 99 93 81
FbaB 98 85 119 107 88 65 100 - 85 77 87 - 114 - - - - - - - - ? ? 54 47 - - - - - - 74 - - 92 93 - - - -
TpiA 103 107 75 100 88 109 96 88 252 97 91 59 91 83 192 159 148 77 208 192 95 130 130 77 94 75 83 77 58 - 96 82 99 105 84 84
GapA 103 113 106 89 102 88 105 84 254 233 229 111 119 166 120 112 195 104 213 180 48 130 107 207 46 ? 213 53 49 80 62 50 193 42 etc 172 80 74 92 85 97 175 154 - 96 105 104 109 - 91
Pgk 79 90 69 82 98 75 88 73 244 80 91 67 79 80 230 218 134 133 238 243 69 177 104 65 94 85 129 146 - 84 88 98 111 72 93 - - - 105 113 227 188 - - 124 139 - - 123 80 - - 115 80 - 97 107 102 103 - - 69 63 63 58 81 66 60 - - - - 82 76 etc 66 62 - - - - - - -
GpmA - - - - - - - - 254 45 38
GpmB - - - - 95 90 89 - - - 77 66 etc 90 72 95 67 66 etc 117 100 etc 94 93 etc 58 51 etc 66 54 70 105 103 87
GpmI - 99 - - - - - - - - - - - - - - 108 94 ? 185 59 147 - - 86 75 - - - - - - - 70 81 182 - 65 - - 80 - 67 - - - - - - - -
DeoB - - - - - - - - - - 83 68 - - 82 80 61 - 168 46
Eno 88 93 76 63 102 92 93 76 87 68 82 235 222 etc 98 111 147 101 92 196 49 217 192 112 211 214 91 193 159 93 85 92 142 128 - 79 89 98 81 93 78
PykF - 80 67 - 101 73 137 112 225 43 - 88 86 61 102 90 225 204 118 104 70 258 95 etc 264 88 etc 68 57 49 175 54 49 158 63 - 102 109 81 69 53 - 78 91 - 84 70 108

PHX genes are shown in red, PA in blue. Special symbols: -, the gene is not present in the genome; ?, COG data indicate that the gene is in the genome but the name does not match any gene in the annotation; etc., more than three homologs belong to the COG. Top two E(g) values are shown.
Genes: Hxk, hexokinase; Glk, glucokinase; Pgi, glucose-6-phosphate isomerase; PfkA, 6-phosphofructokinase; Fba, fructose/tagatose bisphosphate aldolase; FbaB, DhnA-type fructose-1,6-bisphosphate aldolase and related enzymes; TpiA, triosephosphate isomerase; GapA, glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-phosphate dehydrogenase; Pgk, 3-phosphoglycerate kinase; GpmA, phosphoglycerate mutase 1; GpmB, phosphoglycerate mutase/fructose-2,6-bisphosphatase; GpmI, phosphoglyceromutase; DeoB, phosphopentomutase; Eno, enolase; PykF, pyruvate kinase.
TABLE VI
E(g) Values (Multiplied by 100) for Aminoacyl-tRNA Synthetases

Archaea Gene Euk Mja Mth Tac Pho Pab
AlaS 124 74 66 81 78 122 92 96 138 98 77
ArgS 97 ? 80 71 83 63 105 128 67 36 75
AspS 110 124 66 101 89 93 117 126 135 27 114 - 94 89 92 82 107 81 - 169 40
AsnS
CysS 74 91 - - 81 71 120
Ape Sce Bacteria Afu Hbs 140 56 Aae Tma Dra - Lla Spy Bsu Syn Eco EcZ Pae Mlo Ccr Rpr Ctr 69 97 101 79 177 73 78 115 130 62 83 63 93 83 72 113 108 77 75 80 94 104 110 71 91 79 115 119 98 105 89 64 104 113 60 80 61 64 93 73 66 85 86 85 73 83 77 84 72 99 70 59 97 104 83 70 73 104 195 191 77 100 101 92 86 84 151 137 82 95 92 76 81 74 73 83 69 117 116 - 123 110 - - - - - - - - 87 78 89 84 55 90 89 92 68 63 63 74 87 66 69 81 89 83 89 90 107 71 89 89 135 46 136 43 59 59 86 46 72 74 64 88 82 101 66 120 70 69 66 94 82 92 76 97 109 70 77 - ? - - 116 105 59 91 71 92 78 43 40 102 91 74 105 85 80 78 60 101 85
Vch Hin Nme Hpy Cje Cpn Tpa Bbu Uur Mge Mtu Mle
GltX 72 72 77 93 83 82 115 120 179 42
GlnS - - - - - - - - 64 - - 83 - - - - - - 126 136 72 75 91 68 - - - 48 - - - - - - -
GlyQ - - - - - - - - - 104 83 - - - 78 92 60 95 118 120 72 124 114 81 89 85 95 82 99 76 90 - - - - 107 75 60
GlyS - - - - - - - - - 73 60 - - - 73 58 59 65 132 126 51 88 107 72 82 90 70 90 81 76 90 - - - -
GRS1 104 80 84 77 102 85 88 148 135 29 - - 58 83 85 - - - - - - - - - - - - - - - - - 87 110 85 75
HisS 88 82 66 84 97 73 71 118 88 29 85 79 90 78 63 50 90 75 53 44 50 49 40 87 82 74 65 80 74 53 79 84 66 89 74 98 58 75 66 85 89 106 107 108 73 75
IleS 126 90 92 104 94 141 145 148 104 25 114 103 62 78 75 97 65 78 60 117 123 60 85 ? 67 78 67 88 109 87 86 97 97 83 70 74 119 83 66
LeuS 124 98 108 73 100 127 141 139 82 29 109 98 70 73 71 138 66 84 157 170 71 106 78 69 83 76 114 110 84 79 58 66 79 84
LysS 86 93 75 97 91 88 93 82 - - - - - - - - - - - - - - - - - - 90 76 79 - - 89 80 - -
LysU - - - - - - - - 174 42 89 77 85 74 64 75 70 161 32 109 109 93 184 68 183 66 78 136 105 75 86 65 - - - 85 83 - - 84 87
MetG 91 84 79 95 89 99 101 135 74 33 105 103 77 108 91 54 69 69 84 117 119 72 74 73 83 100 87 77 69 105 99 100 87 82 84 83 80 96
PheS 96
PheSA 120 105 98 87 85 77 83 128 115 30 110 85 78 90 82 63 58 70 95 130 137 78 108 90 78 96 61 95 97 84 100 101 79 96 86 - 59 92 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PheT 76 110 74 67 82 76 91 102 109 30 86 89 65 81 72 65 62 52 89 116 117 56 103 77 59 78 71 92 116 80 69 86 86 100 72 108
ProS 67 103 72 78 107 85 78 121 132 32 98 90 95 103 91 79 69 58 94 164 162 67 90 96 59 78 65 91 88 103 95 85 77 104 84 83 86 123 142 39 98 92 98 109 87 127 124 70 92 164 176 78 117 98 69 86 82 87 65 97 89 77 101 112 110 74 120 95 73 142 40 107 112 56 92 104 107 59 87 140 94 85 75 98 59 96 81 95 78 119 92 83
SerS
ThrS 118 89 93 71 79 93 102 94 84 113 108
TrpS 100 105 78 123 65 90 102 93 77 80 90 39 115 102 89 93 89 94 63
TyrS 106 97 76 93 96 98 113 89 92 35
ValS 115 96 84 87 101 120 124 127 127 117 116 87 133 88 87 90 54 54 92 91 117 57 64 69 70 85 74 118 87 96 88 83 76 101 83 79 91 89 104 89 81 82 64 115 60 69 x 140 124 97 56 72 40 97 81 86 69 116 89 79 91 88 90 95 77 77 106 83 95 74 78 89 65 82 181 181 80 118 102 71 93 90 115 103 80 89 78 92 118 69 72

PHX genes are shown in red, PA in blue. Special symbols: -, the gene is not present in the genome; ?, COG data indicate that the gene is in the genome but the name does not match any gene in the annotation; x, expression levels not evaluated (usually due to length < 80 aa); etc, more than three homologs belong to the COG. Top two E(g) values are shown.
Genes: AlaS, alanyl-tRNA synthetase; ArgS, arginyl-tRNA synthetase; AspS, aspartyl-tRNA synthetase; AsnS, aspartyl/asparaginyl-tRNA synthetases; CysS, cysteinyl-tRNA synthetase; GltX, glutamyl-tRNA synthetase; GlnS, glutaminyl-tRNA synthetase; GlyQ, glycyl-tRNA synthetase alpha subunit; GlyS, glycyl-tRNA synthetase beta subunit; GRS1, glycyl-tRNA synthetase class II; HisS, histidyl-tRNA synthetase; IleS, isoleucyl-tRNA synthetase; LeuS, leucyl-tRNA synthetase; LysS, lysyl-tRNA synthetase class I; LysU, lysyl-tRNA synthetase class II; MetG, methionyl-tRNA synthetase; PheS, phenylalanyl-tRNA synthetase alpha subunit; PheSA, phenylalanyl-tRNA synthetase alpha subunit archaeal type; PheT, phenylalanyl-tRNA synthetase beta subunit; ProS, prolyl-tRNA synthetase; SerS, seryl-tRNA synthetase; ThrS, threonyl-tRNA synthetase; TrpS, tryptophanyl-tRNA synthetase; TyrS, tyrosyl-tRNA synthetase; ValS, valyl-tRNA synthetase.

Archaeal genomes lack AcnB (but some have AcnA) and lack SucA (Table VII). The anaerobic STRPY lacks all classical TCA cycle genes except for Mdh. The spirochaetes (TREPA and BORBU) and mycoplasmas (e.g., UREUR and MYCGE) also lack TCA cycle genes except Mdh. However, Mdh is involved in several metabolic pathways, including fermentation and CO2 fixation via the serine–isocitrate lyase pathway. On the other hand, DEIRA, ECOLI, MESLO, and CAUCR have most of the TCA genes at the PHX level.

The genes isocitrate lyase (AceA) and malate synthase (AceB) of the glyoxylate shunt are restricted in bacterial genomes and wholly absent from archaeal genomes. These genes are also missing from many small pathogenic bacteria, including RICPR, CHLTR, CHLPN, TREPA, BORBU, UREUR, and MYCGE. The glyoxylate shunt is required to synthesize precursors for carbohydrates when the carbon source is a C2 compound. Among archaea the only occurrence of the glyoxylate cycle is found in Haloferax volcanii.
In ECOLI, isocitrate lyase converts isocitrate into succinate and glyoxylate, allowing carbon that entered the TCA cycle to bypass the formation of 2-oxoglutarate and succinyl-CoA. Among the current complete genomes, isocitrate lyase is strongly PHX in DEIRA, MYCTU, and PSEAE, is present but not PHX in MYCLE, ECOLI, and VIBCH and in the α-proteobacteria MESLO and CAUCR, and is absent from all other currently sequenced prokaryotic genomes.

5. PROTEIN LENGTHS AND AMINO ACID USAGES AMONG THE THREE DOMAINS

Table VIII reports the median (50th percentile) values of protein lengths, amino acid usages, and charge and hydrophobic usages for the five complete eukaryotic, 11 archaeal, and 38 currently available bacterial genome sequences. Two data sets were used for human sequences: the SWISS-PROT (SP) collection and the draft human genome public version (HGP) released in 2001. The numbers are remarkably congruent, indicating that SP provides a faithful representation of the complete genome. Several striking patterns are evident from Table VIII and we present some possible interpretations.

5.1. Protein Lengths across Proteomes

The median protein lengths (aa) of all complete genomes indicate the following orderings: archaea (median range 230–250 aa except THEAC 268, PYRAE 208, and AERPE 175) < bacteria (250–295 except NEIME 239 and STRPN 242) < eukaryotes (346–386). The same orderings hold when restricted to proteins of size ≥ 200 aa (archaea 331–340, HALSP 344, THEVO 347; bacteria 340–377, MESLO 337, LACLA 329, STAAU 338, STRPN 338, STRPY 335, UREUR 383; eukaryotes 433–473, CAEEL 401). The percentage of proteins ≥ 200 aa relative to all proteins of the genome is 52–76% in archaea (AERPE 43%), 51–74% in bacteria, and 76–80% in eukaryotes. The greater length of eukaryotic proteins may reflect their intrinsic complexity, which may be attributed to their multifunctionality (several active regions of separate function), with concomitant multiple exons and alternative splicing.
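The three length statistics used above (median length over all proteins, median length restricted to proteins of at least 200 aa, and the percentage of such proteins) are straightforward to compute once the protein lengths of a genome are in hand. The following Python sketch shows one possible way to do so. The function name `length_summary` and the toy list of lengths are illustrative assumptions and are not data or code from this study.

```python
from statistics import median


def length_summary(lengths, cutoff=200):
    """Summarize protein lengths (in aa) for one proteome.

    Returns the median over all proteins, the median over proteins
    with length >= cutoff, and the percentage of proteins >= cutoff.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    long_proteins = [n for n in lengths if n >= cutoff]
    return {
        "median_all": median(lengths),
        # None when no protein reaches the cutoff
        "median_long": median(long_proteins) if long_proteins else None,
        "percent_long": 100.0 * len(long_proteins) / len(lengths),
    }


# Hypothetical toy lengths, for illustration only.
toy_lengths = [95, 150, 210, 260, 330, 410, 520, 180]
summary = length_summary(toy_lengths)
```

Applied to the annotated protein lengths of a complete genome, the three returned values would correspond to the kinds of quantities compared across domains in this section.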
The smaller size of archaeal proteins may relate to more extreme environments and more specialized activity.

5.2. Charged Amino Acid Usages

For the three domains we compare median protein frequencies of acidic (E or D) and basic (K or R) residues. The usage of glutamate (E) pervasively exceeds the usage of aspartate (D) by at least 1% (mostly 2%), with reversals in only six prokaryotes. These reversals tend to be in high G+C genomes (HALSP 68%; XYLFA 53%; CAUCR 67%; MESLO 63%; MYCTU 66%; MYCLE 58%). However, other genomes of high G+C content (e.g., PSEAE 66%; SINME 63%; DEIRA 67%) do not show this property. Strong differences in charged amino acid usages are also observed in the halophilic archaeon HALSP, which shows unusually high levels of aspartate (D) usage (median 9.4% vs 4.3–6.2% in other organisms) and low usage of basic residues (median 8.1% compared to 9–13% in all other organisms). Acidic residues overall tend to be more used in euryarchaeal vs crenarchaeal proteomes and in the anciently diverged Gram-negative species AQUAE and THEMA (medians 14.2 and 14.3%). Usage values for acidic residues are in the range 11.4–12.2% for eukaryotes and 9.6–14.3% for archaea and bacteria.

What can account for E being used more than D? There is no real distinction on the basis of the underlying codon types (GAR vs GAY). Reasons could reflect constraints on D in terms of protein secondary structure and size. For example, poly-E can establish highly stable long α-helices whereas poly-D does not form long α-helices or long β-strands (Richardson and Richardson, 1988). However, D is more common at active sites (e.g., in proteases) and in calcium metal coordination.

TABLE VII
E(g) Values (Multiplied by 100) for TCA Cycle Genes

Archaea Gene Afu Hbs Euk Mja Mth Tac Pho Pab
GltA 101 116 - 84 76
AcnA - 127 - - 111 - -
AcnB - - - - - - -
Icd 105 111 - - 99 - -
LeuB 74 - 85 68 105 101 - 82 74 70
SucA - - - - - - -
SucD 78 78 84 73 130 84 78
SucC
SdhA 103 98 - - 110 89 78 102 - - 133 79 119 113 108 83 91 84 - - 121 45 116 123 92 93 120 95 78 - 121 100 67 69 98 - - 153 88 80 etc 82
SdhB 103 102 67 82 92 - - 136
SdhC 97 91 - - 96 - - 132
SdhD 97 - - - 103 - -
FumC - 101 - - - - -
TtdA a FumA Mdh a
Bacteria Ape Sce Aae Tma Dra 67 115 50 135 80 141 38 183 169 59 124 256 53 119 131 47 113 115 207 34 162 156 121 82 87 69 etc 56 170
Mtu Mle 119 97 114 92 Lla 118 86 37 - 108 - 29 ? 47 29 - - - - - 98 140 92 - 45 - 133 95 101 86 44 - 50 43 76 128 104 - - 39 - 37
Spy Bsu Syn Eco EcZ Pae 63 129 139 79 57 99 ? 72 70 46 - 165 108 97 - - 67 93 - 171 114 103 - - 63 99 135 104 etc 121 94 83
Vch Hin Nme Hpy Cje 96 49 63 62 143 165 Mlo Ccr 149 124 65 55 etc 48
Ctr 101 - - - - - - 155 79 - - - - - - - - - - - - - - - - 146 152 - - - - - - - - 75 130 88 84 95 - - - - - - - - 140 134 76 87 79 - - - - - 99 136 118 170 100 98 85 - - - - 116 - 85 115 99 162 103 83 76 - - - - 127 135 82 110 69 138 131 92 86 83 - - - - 109 137 93 75 74 112 100 88 92 89 - - - - - 100 80 42 - 58 - - 161 113 70 - 76 87 100 162 71 - - - 111 71 ? 67 51 72 63 74 89 130 173 ? ? 124 186 60 38 129 59 44 125 112 101 57 133 174 171 116 112 49 137 117 113 50 103 61 - 57 74 133 108 138 101 123 52 44 101 - - 63 100 81 113 55 126 57 117
Cpn Tpa Bbu Uur Mge Rpr 97 66 - - 71 96 92 - - 64 - 81 83 112 63 - 101 87 95 99 75 94 91 85 - - - - 115 77 48 - - - 110 103 99 - - - - 88 86 89 - 108 - - 94 73 91 91 - - - - - 89 84 - - 169 106 100 - - 54 75 32 40 112 73 57 41 92 65 99 92 113 122 89 90 91 - - - - 110 60 56 110 87 56 114 55 etc 114 87 etc 77 47 - 72 - - 64 - - - - - - - - 77 47 - 72 - - 64 - - - - - - - - 142 132 - 66 118 - - 88 75 130 168 92 78 90 - 119 - 84 87 - 78 82 77 - 89 79 - - 96 95 - - - - - - - 95 - 64 96 80 - 85 84 - - 107 93 90 - - - - - - - 95 102 66 91 112 - - 135 117 55 45 86 75 102 180 105 122 123 226 45 44 172 81 76 92

a Some proteins occur in both TtdA and FumA COGs.
PHX genes are shown in red, PA in blue. Special symbols: -, the gene is not present in the genome; ?, COG data indicate that the gene is in the genome but the name does not match any gene in the annotation; etc, more than three homologs belong to the COG. Top two E(g) values are shown.
Genes: GltA, citrate synthase; AcnA, aconitase A; AcnB, aconitase B; Icd, isocitrate dehydrogenases; LeuB, isocitrate/isopropylmalate dehydrogenase; SucA, pyruvate and 2-oxoglutarate dehydrogenases E1 component; SucD, succinyl-CoA synthetase alpha subunit; SucC, succinyl-CoA synthetase beta subunit; SdhA, succinate dehydrogenase/fumarate reductase flavoprotein subunits; SdhB, succinate dehydrogenase/fumarate reductase Fe-S protein; SdhC, succinate dehydrogenase/fumarate reductase cytochrome b subunit; SdhD, succinate dehydrogenase hydrophobic anchor subunit; FumC, fumarase; TtdA, tartrate dehydratase alpha subunit/fumarate hydratase class I; FumA, tartrate dehydratase beta subunit/fumarate hydratase class I; Mdh, malate/lactate dehydrogenases.

The five complete eukaryotic genomes invariably have (E+D)% > (K+R)%.
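The comparison of (E+D)% with (K+R)% can be reproduced for any set of protein sequences by computing per-protein residue percentages and then taking the median across the proteome. A minimal Python sketch follows. The function names `percent_usage` and `median_charge_usage` and the three short toy sequences are hypothetical and serve only to show the calculation; they are not taken from the genomes analyzed here.

```python
from statistics import median

ACIDIC = set("ED")  # glutamate, aspartate
BASIC = set("KR")   # lysine, arginine


def percent_usage(seq, residues):
    """Percentage of positions in seq occupied by any of the given residues."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(aa in residues for aa in seq) / len(seq)


def median_charge_usage(proteins):
    """Median per-protein acidic and basic usage (percent) over a proteome."""
    acidic = median(percent_usage(p, ACIDIC) for p in proteins)
    basic = median(percent_usage(p, BASIC) for p in proteins)
    return acidic, basic


# Hypothetical toy sequences, for illustration only.
toy_proteome = ["MEEDKLAGVA", "MKRKLDEAGS", "MDEELLKAAG"]
acidic, basic = median_charge_usage(toy_proteome)
```

On real data the input would be the translated protein sequences of a genome, optionally restricted to proteins of at least 200 aa as in the title of Table VIII. A genome would then show the eukaryotic pattern when the first returned value exceeds the second.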
As with eukaryotes, most prokaryotic proteins use more acidic residues than basic residues. But there are several examples with (K+R)% > (E+D)%, including BORBU, MYCGE, TREPA, BUCSP, RICPR, HELPY, UREUR, MYCPN, PYRAE, SULSO, and CLOAC.

5.3. Hydrophobic Residue Usages

The median usage of hydrophobic residues in eukaryotes is lower than in archaea and bacteria (36.4–38.4% vs 38.7–43.5%, except for STAAU, 38.2%). This observation is somewhat unexpected given that eukaryotic proteins tend to be longer than prokaryotic proteins and might contain proportionally larger hydrophobic cores. A possible explanation is that eukaryotic structures tend to be multi-domain rather than consisting of compact globular units. Among bacteria, the highest median hydrophobic usage is exhibited by α-proteobacteria (42.3–43.5%) and mycobacteria (43%).

5.4. Correlations of G+C Genome Content and Amino Acid Usages

We observe a positive correlation of the genome frequency of G+C with the frequency of amino acids encoded from strongly binding bases {Ala, Gly, Pro, Trp, Arg} and a negative correlation with the frequency of amino acids encoded from weak bases {Lys, Ile, Phe, Tyr, Asn}.

5.5. Amino Acid Usages in Thermophiles

Among thermophiles, optimal growth temperatures range from 50°C to 100°C, but there is no correlation of amino acid usage with optimal growth temperatures. To date, all sequenced archaeal genomes, with the exception of HALSP, are from thermophilic organisms. Two hyperthermophilic bacterial genomes, AQUAE and THEMA, are available. What is the nature of amino acid usages of thermophiles (or hyperthermophiles) compared to mesophiles? The following features stand out:

Thermophilic organisms persistently show a higher charge usage than mesophilic organisms of similar G+C content. The only exception is HALSP, whose proteins tend to be richer in Asp representations compared to other proteomes.
The greater charge in proteins of thermophiles presumably implies more salt-bridge connections in their 3D structure and concomitantly greater structural stability.

The frequencies of Ala and Gly, amino acids encoded from strong bases, are significantly correlated with G+C content. Discounting this correlation, thermophilic organisms generally show lower frequencies of Ala and higher frequencies of Gly than mesophilic organisms. This is particularly obvious for the frequency of Ala in THEMA (5.7%) and AQUAE (5.6%), compared to about 7.0–8.0% in mesophiles of similar G+C content.

β-branched residues are suggested as favorable in stabilizing thermophilic proteins (Gromiha et al., 1999). Frequencies of Val and especially Ile show some correlation (positive and negative, respectively) with genome G+C content. In particular, Ile and Val frequencies tend to be increased in most thermophilic archaeal proteomes and, in the case of Val, also in THEMA and AQUAE.

6. CONCLUDING REMARKS AND PROSPECTS

Among all prokaryotic genomes (> 54 currently available, with more than one million codons in aggregate) there are only 76 gene sequences common to all genomes, of which more than half are ribosomal protein sequences and more than a dozen are aminoacyl-tRNA synthetases. Also, several are major protein processing factors, and a few are chaperone complexes. The genomic and proteomic evaluations of the text and of Tables II–IX show substantial differences among prokaryotic genomes.

Woese et al. (1990), based on 16S rRNA comparisons for a wide variety of organisms, have argued for partitioning all independent organisms into three sets: bacteria, archaea, and eukaryotes. Woese notes that bacterial cell membranes generally are formed of glycerol diesters with long hydrocarbon chains, while archaeal membranes are formed of isoprenoid glycerol diethers. Eukaryotes generally possess membranes formed of glycolipids.
We have examined more than 20 properties, discussed in the foregoing text, of the three proposed “domains” and have found many discrepancies for different properties. Poole et al. (1999) argue for coherence of each domain but “with many exceptions.” Significant differences between eukaryotes, archaea, and bacteria are found in the distribution of protein lengths and for various classes of proteins including chaperonins, informational and metabolic proteins. Table IX summarizes presence, absence, and expression levels for a broad spectrum of genomic and proteomic attributes. Differences among genomes are sensitive to lifestyle, habitat, food sources, physiology, type of membranes, and many other factors. For example, among fermentation processes, there are many alternative end-products depending on environmental conditions and degree of resistance to an acidic milieu, e.g., acetate, lactate, ethanol.

Many theories have been proposed on domains of life, the origin and early evolution of eukaryotes, and the genesis of organelles. 16S rRNA genes give results conflicting with many protein sequence comparisons. It is increasingly appreciated that the genomes of many prokaryotes and primitive eukaryotes are “heterogeneous unions” in which lateral transfer and/or close associations have been at work (Doolittle, 1998; Ochman et al., 2000; Campbell, 2000). The current archaeal genome number is 13 and, to date, numerous others have not been sequenced. Also, Woese (1998) no longer prescribes a single progenote as the genesis of life but rather a “community” of initial life forms involving much lateral exchange among them. Along these lines, there have been proposed archaeal–bacterial partnerships preceding the origin of eukaryotes (Zillig et al., 1989; Gupta and Golding, 1996; Martin and Müller, 1998; Lopez-Garcia and Moreira, 1999; Karlin et al., 1999).

TABLE VIII
Median Protein Length and Amino Acid Usages in Eukaryotic, Archaeal, and Bacterial Proteomes

Columns are species(a); rows are properties. Rows E through f(b) are median amino acid usages (%) in proteins ≥200 aa.

Eukaryotes
Species  HUM-SP(c)  HUMAN  DROME  CAEEL  YEAST  ARATH
G+C (%)  42  42  42  36  38  36
No. of sequences, all  6194  13245  14100  17083  6066  25338
No. of sequences, ≥200 aa  4824  10651  10740  13668  4782  19784
% ≥200 aa  78  80  76  80  79  78
Median length, all  372  386  375  346  384  361
Median length, ≥200 aa  447  456  470  401  473  433
E  6.5  6.6  6.1  6.0  6.4  6.5
D  4.9  4.8  5.2  5.1  5.8  5.4
K  5.4  5.5  5.3  6.1  7.2  6.2
R  5.3  5.4  5.5  4.9  4.3  5.3
H  2.4  2.4  2.5  2.2  2.1  2.2
L  9.5  9.8  9.1  8.9  9.4  9.3
I  4.6  4.4  4.9  6.2  6.5  5.3
V  6.2  6.1  5.9  6.2  5.5  6.7
M  2.2  2.2  2.3  2.6  2.0  2.3
F  3.8  3.7  3.6  4.9  4.3  4.3
Y  3.0  2.8  3.0  3.2  3.3  2.8
W  1.2  1.2  0.9  1.0  0.9  1.2
P  5.3  5.5  4.9  4.4  4.2  4.6
G  6.7  6.4  5.9  4.8  4.9  6.1
A  6.9  6.8  7.2  5.9  5.2  6.1
S  7.4  7.6  7.6  7.7  8.3  8.6
T  5.2  5.1  5.3  5.6  5.6  5.1
N  3.6  3.6  4.5  4.8  5.8  4.3
Q  4.2  4.3  4.5  3.7  3.7  3.2
C  1.9  1.9  1.7  1.8  1.2  1.7
−  11.5  11.6  11.4  11.4  12.3  12.0
+  10.9  11.1  10.9  11.4  11.7  11.7
p  38.9  38.8  39.3  37.4  38.6  37.7
f(b)  37.6  37.4  36.9  38.4  36.4  37.9

Archaea
Species  PYRAB  PYRHO  ARCFU  THEAC  THEVO  METTH  METJA  HALSP  AERPE  PYRAE  SULSO
G+C (%)  45  42  49  46  40  50  31  68  56  51  36
No. of sequences, all  1763  2058  2409  1478  1499  1869  1773  2058  2694  2603  2982
No. of sequences, ≥200 aa  1192  1181  1438  1001  974  1123  1076  1554  1164  1351  1842
% ≥200 aa  68  57  60  68  65  60  61  76  43  52  62
Median length, all  265  232  241  268  259  242  241  250  175  208  251
Median length, ≥200 aa  339  334  334  340  347  336  336  344  331  327  332
E  8.9  8.8  9.0  6.3  6.7  8.4  8.7  6.7  7.4  7.1  7.2
D  4.7  4.6  5.0  6.2  6.0  6.1  5.7  9.4  4.3  4.4  5.0
K  7.7  8.0  6.8  5.6  7.1  4.0  10.4  1.4  3.5  5.4  8.0
R  5.6  5.4  5.6  5.6  4.7  6.9  3.7  6.4  7.6  6.4  4.6
H  1.4  1.4  1.4  1.6  1.4  1.8  1.3  2.2  1.5  1.4  1.2
L  9.9  9.9  9.2  8.1  8.3  9.1  9.2  8.2  11.0  10.2  10.2
I  8.4  8.7  7.1  8.9  9.0  7.6  10.4  3.6  5.3  6.2  9.3
V  8.0  7.8  8.5  7.1  7.0  7.7  6.7  9.1  9.1  9.4  7.4
M  2.3  2.3  2.5  3.2  2.6  3.0  2.1  1.5  2.1  1.8  2.0
F  4.2  4.3  4.4  4.4  4.4  3.5  4.0  3.0  2.7  3.5  4.2
Y  3.8  3.8  3.6  4.4  4.6  2.9  4.2  2.5  3.4  4.2  4.7
W  1.0  1.0  0.9  0.7  0.7  0.7  0.5  0.9  1.2  1.2  0.9
P  4.2  4.2  3.8  3.9  3.7  4.3  3.3  4.6  5.3  5.0  3.7
G  7.3  7.1  7.4  7.3  7.1  8.1  6.3  8.4  8.9  7.8  6.4
A  6.6  6.3  7.8  7.0  6.4  7.4  5.3  12.5  9.6  9.8  5.5
S  4.8  5.0  5.4  7.4  7.4  6.0  4.3  5.0  6.3  4.8  6.4
T  4.1  4.2  4.1  4.6  4.7  4.8  4.0  6.6  4.0  4.3  4.5
N  3.2  3.3  3.0  3.9  4.4  2.9  5.1  2.1  1.8  2.5  4.7
Q  1.5  1.5  1.6  1.9  1.9  1.7  1.3  2.5  1.6  1.9  1.9
C  0.3  0.3  0.9  0.4  0.4  0.9  1.0  0.6  0.5  0.6  0.4
−  14.1  13.7  14.3  12.9  13.0  14.8  14.6  16.5  11.8  11.8  12.5
+  13.5  13.5  12.6  11.8  12.1  11.1  14.2  8.1  11.7  12.3  12.9
p  31.1  31.5  31.0  35.7  35.8  33.2  30.6  34.6  33.8  32.4  34.1
f(b)  41.0  40.8  41.5  39.2  38.8  40.3  39.9  40.0  42.0  43.1  40.1

Bacteria
Species  ECOLI  HAEIN  PSEAE  BUCSP  VIBCH  PASMU  XYLFA  NEIME  AGRTU  AGRTU-L(d)  CAUCR  MESLO  SINME  SINMEpA(e)  RICPR  HELPY  CAMJE  CHLTR  CHLPN  CHLMU  BORBU  TREPA  DEIRA
G+C (%)  51  38  67  26  47  40  53  52  59  59  67  63  63  60  29  39  31  41  41  40  29  53  67
No. of sequences, all  4286  1707  5564  564  3835  2014  2766  2025  2721  1833  3737  7281  3341  1294  834  1575  1635  893  1052  818  850  1030  3117
No. of sequences, ≥200 aa  2922  1118  4000  395  2407  1442  1400  1189  1846  1456  2539  4826  2302  836  582  1042  1122  625  731  558  599  743  2073
% ≥200 aa  68  65  72  70  63  72  51  59  68  79  68  66  69  65  70  66  69  70  69  68  70  72  67
Median length, all  278  262  291  282  259  286  202  239  271  315  275  268  276  265  282  266  268  289  289  287  285  293  263
Median length, ≥200 aa  351  342  348  362  363  347  359  351  342  343  353  337  340  330  358  358  348  377  372  375  358  367  344
E  5.9  6.5  6.2  5.2  6.2  6.0  5.0  6.2  5.9  5.8  5.4  5.4  6.2  5.9  5.8  6.9  7.1  6.5  6.5  6.4  6.7  5.8  5.8
D  5.2  5.0  5.4  4.3  5.1  4.9  5.4  5.4  5.7  5.6  5.9  5.8  5.7  5.3  4.9  4.7  5.3  4.6  4.6  4.5  5.2  4.5  5.2
K  4.1  6.1  2.4  9.6  4.6  5.6  3.0  5.2  3.8  3.4  3.2  3.3  3.3  2.9  8.3  8.9  9.5  5.4  5.8  5.6  9.9  3.6  2.3
R  5.5  4.2  7.7  3.6  4.9  4.3  6.8  5.4  6.4  6.5  7.1  6.9  7.0  7.2  3.1  3.4  2.8  4.7  4.4  4.7  3.0  7.2  7.3
H  2.2  2.0  2.1  2.1  2.3  2.3  2.6  2.1  1.9  2.0  1.7  2.0  1.9  2.1  1.8  2.1  1.6  2.2  2.3  2.2  1.2  2.8  2.0
L  10.6  10.5  12.3  9.8  10.7  10.9  10.8  9.9  9.6  9.8  9.9  9.8  9.7  10.1  10.1  11.2  10.8  11.3  11.4  11.1  10.4  10.0  11.4
I  5.9  7.0  4.0  11.4  6.0  6.7  5.2  5.8  5.7  5.7  4.3  5.3  5.4  5.4  10.8  7.1  8.5  6.6  6.9  6.6  10.4  4.7  3.0
V  7.0  6.5  6.9  4.7  6.9  6.7  7.4  6.9  7.3  7.3  7.5  7.4  7.4  7.5  5.5  5.4  5.1  6.3  6.1  6.4  5.3  8.2  7.7
M  2.7  2.4  1.9  2.0  2.6  2.4  2.1  2.4  2.7  2.6  2.2  2.4  2.4  2.5  2.0  2.1  2.1  1.9  1.8  2.0  1.7  2.0  1.7
F  3.7  4.2  3.5  4.8  3.8  4.2  3.3  3.8  3.9  3.9  3.4  3.7  3.8  3.8  4.6  5.2  5.8  4.8  4.7  4.8  6.0  4.3  3.0
Y  2.7  3.0  2.4  3.7  2.9  3.1  2.5  2.8  2.2  2.2  2.0  2.1  2.1  2.1  4.0  3.6  3.5  3.0  3.2  3.0  4.1  3.0  2.2
W  1.4  1.0  1.4  0.8  1.2  1.0  1.3  1.0  1.1  1.2  1.3  1.2  1.1  1.2  0.6  0.5  0.5  0.8  0.9  0.8  0.4  0.9  1.2
P  4.4  3.7  5.0  3.0  4.0  3.9  5.0  4.2  4.8  4.8  5.4  5.0  4.9  5.0  3.1  3.2  2.6  4.4  4.4  4.3  2.5  4.1  5.8
G  7.3  6.7  8.3  5.3  6.6  6.5  7.5  7.8  8.3  8.2  8.8  8.5  8.5  8.2  5.2  5.4  5.4  6.2  6.0  6.2  5.1  6.9  9.1
A  9.4  8.1  11.6  4.3  9.2  8.5  10.3  10.1  11.4  11.4  13.8  12.2  12.0  11.8  6.0  6.9  6.7  7.4  6.9  7.1  4.3  10.2  12.3
S  5.6  5.7  5.3  7.3  6.1  5.5  5.6  5.4  5.7  5.8  4.9  5.5  5.5  5.7  6.6  6.7  6.3  7.8  7.7  8.1  7.3  6.4  4.9
T  5.2  5.1  4.0  4.6  5.1  5.2  5.5  5.1  5.2  5.3  5.0  5.1  5.1  5.2  5.0  4.1  3.9  4.9  5.1  4.8  3.7  5.2  5.5
N  3.7  4.7  2.4  7.1  3.7  4.1  3.0  3.7  2.8  2.9  2.1  2.6  2.6  2.6  6.4  5.5  6.0  3.2  3.5  3.3  7.0  2.3  2.2
Q  4.2  4.4  4.0  3.1  4.9  4.8  4.1  3.8  2.9  3.0  3.0  2.9  2.7  2.9  3.0  3.5  3.0  4.0  3.9  4.1  2.1  3.7  3.9
C  1.1  1.0  0.9  1.1  0.9  1.0  1.0  0.9  0.7  0.7  0.7  0.7  0.7  0.8  1.0  1.0  1.2  1.6  1.6  1.6  0.5  1.8  0.5
−  11.3  11.8  11.8  9.6  11.5  11.2  10.6  11.8  11.9  11.7  11.5  11.5  12.1  11.6  10.9  11.9  12.7  11.2  11.2  11.1  12.1  10.5  11.0
+  9.9  10.7  10.6  13.7  9.7  10.2  10.1  10.8  10.6  10.4  10.6  10.6  10.6  10.4  11.9  12.4  12.6  10.3  10.4  10.5  13.2  11.2  10.0
p  36.1  36.0  34.2  36.7  36.4  36.3  36.2  35.4  34.4  34.6  33.6  34.3  33.8  34.4  36.0  34.7  33.2  36.5  37.0  36.6  34.2  34.9  36.5
f(b)  41.7  40.7  42.7  39.4  41.3  41.5  41.9  41.1  42.3  42.6  43.5  43.0  42.7  43.0  40.5  40.0  40.7  41.0  40.3  40.8  39.9  42.6  41.5

Bacteria (continued)
Species  THEMA  AQUAE  MYCTU  MYCLE  BACSU  BACHA  LACLA  STAAU  STRPN  STRPY  CLOAC  UREUR  MYCGE  MYCPN  SYNY3
G+C (%)  46  43  66  58  44  44  35  33  40  39  31  25  32  40  48
No. of sequences, all  1846  1521  3909  1605  4097  4066  2266  2714  2094  1696  3672  611  466  677  3163
No. of sequences, ≥200 aa  1308  1072  2766  1100  2572  2620  1409  1715  1255  1127  2418  418  344  480  2086
% ≥200 aa  71  70  71  69  63  64  62  63  60  66  66  68  74  71  66
Median length, all  284  274  286  282  256  261  251  254  242  263  262  286  292  286  274
Median length, ≥200 aa  344  340  360  358  340  341  329  338  338  335  342  383  369  354  360
E  9.0  9.6  4.7  5.0  7.4  7.9  7.3  6.5  7.3  6.3  7.1  5.7  5.4  5.3  5.9
D  5.0  4.3  6.0  6.0  5.3  5.1  5.4  6.0  5.7  6.0  5.7  5.7  4.9  5.0  5.0
K  7.5  9.3  1.8  2.4  6.8  5.7  7.2  7.2  6.4  6.6  9.5  9.8  9.6  8.4  3.9
R  5.5  4.8  7.5  6.9  4.0  4.7  3.4  3.4  3.9  3.8  3.3  2.6  2.8  3.3  5.0
H  1.5  1.5  2.2  2.2  2.2  2.3  1.7  2.2  1.8  2.0  1.2  1.6  1.5  1.7  1.8
L  9.8  10.4  9.8  10.0  9.5  9.8  9.7  8.9  10.1  9.9  8.5  10.0  10.7  10.4  11.4
I  7.1  7.2  4.2  4.7  7.2  6.8  7.5  8.4  7.2  7.4  9.6  10.5  8.4  6.7  6.1
V  8.7  7.9  8.6  9.2  6.8  7.4  6.6  6.7  7.0  6.7  6.7  5.3  6.1  6.4  6.7
M  2.2  1.7  1.8  2.0  2.6  2.6  2.4  2.6  2.4  2.5  2.5  1.7  1.5  1.5  1.9
F  4.9  4.8  2.8  2.8  4.2  4.1  4.5  4.3  4.5  4.2  4.3  5.1  6.1  5.5  3.8
Y  3.4  4.0  2.0  2.1  3.3  3.3  3.4  3.8  3.6  3.5  4.1  4.3  3.2  3.2  2.9
W  0.9  0.8  1.3  1.2  0.9  1.0  0.8  0.6  0.7  0.7  0.5  0.8  0.8  1.0  1.4
P  4.0  4.0  5.7  5.2  3.7  3.8  3.3  3.2  3.3  3.4  2.7  2.5  2.9  3.3  5.1
G  7.0  6.8  8.7  8.5  7.0  7.0  6.5  6.2  6.6  6.4  6.2  4.1  4.5  5.3  7.3
A  5.7  5.6  13.1  11.8  7.6  7.3  7.3  6.1  7.2  7.6  5.4  4.8  5.4  6.6  8.5
S  5.4  4.7  5.3  5.8  6.2  5.6  6.3  5.9  6.0  6.0  6.5  5.9  6.3  5.9  5.6
T  4.4  4.2  5.8  6.0  5.3  5.5  5.5  5.6  5.3  5.8  4.7  4.8  5.2  5.7  5.3
N  3.5  3.4  2.1  2.5  3.7  3.5  5.0  5.3  4.1  4.2  6.3  8.5  7.4  6.1  3.7
Q  1.8  1.9  3.0  3.1  3.6  3.9  3.5  3.9  3.9  4.1  2.1  3.7  4.4  5.1  5.5
C  0.5  0.6  0.8  0.9  0.7  0.6  0.3  0.5  0.4  0.5  1.1  0.6  0.8  0.7  0.9
−  14.3  14.2  11.1  11.2  13.0  13.2  13.0  12.9  13.4  12.6  12.9  11.6  10.5  10.5  11.0
+  13.1  14.3  9.6  9.5  11.0  10.5  10.9  10.8  10.5  10.7  13.0  12.7  12.7  11.8  9.1
p  31.9  31.2  35.5  35.7  35.7  35.5  35.7  36.7  35.5  36.2  34.3  36.3  36.0  36.9  38.0
f(b)  40.2  39.4  43.0  43.0  39.4  39.5  39.3  38.2  39.8  39.7  38.7  39.0  40.4  39.7  40.9

a Species names generally follow the SwissProt convention: first three letters of the genus name followed by the first two letters of the species name.
b Lehninger alphabet: − is E,D; + is K,R; p (polar uncharged) is H,Y,P,G,S,T,N,Q; f (hydrophobics) is L,I,V,A,M,F,W,C.
c SwissProt collection of human proteins.
d Linear chromosome of Agrobacterium tumefaciens.
e Plasmid pSymA of Sinorhizobium meliloti.
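The residue-class rows of Table VIII (−, +, p, f) can be tallied for any set of protein sequences. The following is a minimal sketch in Python, assuming the Lehninger grouping given in footnote b of that table and the 200-residue cutoff used there. The function names `class_usage` and `median_class_usage` and the toy peptide are hypothetical, introduced only for illustration; this is not the authors' own code.

```python
from statistics import median

# Residue classes as listed in footnote b of Table VIII (Lehninger alphabet).
CLASSES = {
    "-": set("ED"),        # acidic
    "+": set("KR"),        # basic
    "p": set("HYPGSTNQ"),  # polar uncharged
    "f": set("LIVAMFWC"),  # hydrophobic
}


def class_usage(protein):
    """Percentage of residues of one protein that fall in each class."""
    n = len(protein)
    return {
        name: 100.0 * sum(aa in members for aa in protein) / n
        for name, members in CLASSES.items()
    }


def median_class_usage(proteins, min_length=200):
    """Median per-class usage over proteins with at least min_length residues.

    Assumes at least one protein passes the length filter.
    """
    usages = [class_usage(p) for p in proteins if len(p) >= min_length]
    return {name: median(u[name] for u in usages) for name in CLASSES}


# Hypothetical 10-residue peptide, not taken from any genome.
example = "EDKRHLAGSV"
print(class_usage(example))
```

Comparing the "−" and "+" entries of the result gives the (E+D)% versus (K+R)% contrast discussed in the text; a real analysis would pass a whole proteome to `median_class_usage` with the default cutoff.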
Hartman and Fedorov (2002) postulate that the eukaryotic domain was established as a union of three genome types: a bacterium, an archaeon, and a “chronocyte,” the latter contributing to several basic eukaryotic systems including spliceosomal mechanisms, capping enzymes, nuclear pore constituents, and endoskeletal apparatus allowing for the functionalities of endocytosis, signaling and control. A primitive chronocyte is apparently no longer extant. Specificities and antecedents of these three cell types are also unclear.

Conventional methods of phylogenetic reconstruction from sequence information use only similarity or dissimilarity assessments of aligned homologous genes or regions. For a detailed review of problems of inferring phylogeny, see Brocchieri (2001) and Gribaldo and Philippe (2002). Difficulties intrinsic in phylogenetic methods include the following:

(i) Alignments of distantly related long sequences (e.g., complete genomes) are generally not feasible.
(ii) Different phylogenetic reconstructions may result for the same set of organisms based on analysis of different protein, gene, or noncoding sequences. Attempts are made to overcome this by averaging over many proteins or by concatenating sequences (Daubin and Gouy, 2001), mostly restricted to classes of ribosomal protein genes. Even the numbers of RPs among prokaryotic genomes differ.
(iii) Resultant trees may be highly dependent on details of the alignment algorithm used, inadequacies of phylogenetic methods, and biases in species and sequence sampling.
(iv) Long branch attraction, mutational saturation, and different selection processes make ancient branching unreliable.
(v) Chimeric origins, recombination, inversions, transpositions, and lateral transfer between distantly related organisms can complicate analyses.
(vi) Tree construction derived from aligned sequences cannot apply to organisms for which similar gene sequences are largely unavailable (e.g., for bacteriophages, eukaryotic viruses, or deeply divergent organisms).
(vii) Unrecognized paralogy and widespread reductions and expansions of genome content pose further problems.

Translation of sequence similarities into evolutionary relatedness will always be questionable as the underlying assumptions about mutation rates, selection forces and gene transfer events are uncertain. All the models of sequence evolution used are undoubtedly far from the real evolutionary mechanisms.

TABLE IX
Various Summary Comparisons of Genomic and Proteomic Properties

Property Bacteria Shine–Dalgarno sequences contributing to translation initiation + GC skew Periodic 30 bp repeats Aggregate 4 (and 6) bp palindromes, under-represented Genome of single chromosomes Linear chromosomes Extremes of genome signature; dinucleotide relative abundances Nitrogen fixation Lps (lipopolysaccharide) family Lipid A Chaperones/degradation HSP60 HSP70 + For bacterial chromosomes with a single origin (exceptions: DEIRA, SYNSQ, THEMA, AQUAE) (exceptions BACHA, THEMA, AQUAE) + + (Exceptions: VIBCH, DEIRA, SINME, RHOSP) BORBU, AGRTU, RHOSP; Streptomyces; most genomes are circular Variable (+ and −) (+ and −) Gram-negative + Gram-positive + GroEL (not in UREUR) + (DnaK)

Archaea Eukaryotes + Many Crenarchaea have mostly leaderless transcripts Cap structure + in all thermophiles in HALSP + + + Variable Variable TA and CG mostly low in vertebrates (+ and −) (+ and −) + Thermosome Mostly missing; present in THEAC, METTH, HALSP + Hexamer structure 2α; 4β + Tcp1 Generally + Prefoldin complex (GimC) Trigger factor HSP90 Thioredoxin Peptidyl-prolyl cis–trans isomerases; 3 families: Cyclophilins Fkbp Parvulin Lon FtsZ (division control protein) + Variable + + + + + Variable + HALSP, METTH + Variable; not found in Crenarchaea + + + + Very weak similarity between archaeal FtsZ and tubulin + Separated +, mixed included in short cluster only P0 Not present + 50–63 53–65 78–79 in higher eukaryotes 72 in GIALA Ribosomal protein S1 (rpsA) (generally > 500 aa length) Cluster of RP genes S2 Ribosomal proteins P0, P1, P2 acidic, regulatory No. of RPs + 2α and 4β but different than archaea + +

Existence of introns + Phosphotransferase (PTS) system Hexokinase and/or glucokinase One copy + Many Unknown; no GC skew Multiple + + + hyperthermophiles Protein median lengths among proteins ≥80 aa Variable Variable (most bacterial organisms use the PTS system) Multiple copies in proteobacteria α, β, γ Generally one (possible exceptions: DEIRA, SYNSQ, ARCFU, THEMA) AQUAE, THEMA, not found in mesophilic or moderate thermophilic prokaryotes Range 260–295 aa; except NEIME 239 euryarchaea + Crenarchaea 346–384 aa % proteins ≥200 aa in the genome Range 51–74% 230–250 aa THEAC 268, PYRAE 208, AERPE 175 52–76% 76–80% Translation elongation factors Tuf Fus No. of origins of replications PCNA replication factor Reverse Gyrase

The three-domain hypothesis and the endosymbiont hypothesis have undergone many changes.
First, the original reason for dividing the living world into three domains was that there were, on the initial evidence, approximately three sets and that these were about equally “deviant” from one another. However, Table IX shows great variability within and among the three “domains.” Insistence on three domains leaves frozen a classification based on the limited knowledge available in the past (Karlin et al., 1997). If A (Archaea) and K (eukaryotes) are more closely related than either is to B (Bacteria) (another point of controversy), then A and K are in the same domain. Otherwise, why not define additional primary domains within these three domains?

Recent controversy has arisen regarding the problem of locating the bacterial root. In this context there are now a number of proposals placing the root on the eukaryotic branch (Poole et al., 1999; Forterre and Philippe, 1999), with later reductions producing the prokaryotes. With the range of protein sequences now available, the nuclear genome appears to be chimeric. If it arose by fusion of two (or more) genomes, each genome must have had its own 16S rRNA, one of which might then have broken off to inhabit the organelle. This leaves many possible scenarios. Our favorite compresses a multi-organismal fusion and the endosymbiont invasion into a single event (Karlin et al., 1999). The chimeric nature of the nuclear genome could then result primarily from migration into the nucleus of many genes, not just those affecting organellar function.

We consider the Sulfolobus line as a likely candidate for the endosymbiont, particularly of animal mitochondria, for reasons outlined previously (Karlin et al., 1999). These reasons include similarities in genome signature. The patterns of dinucleotide relative abundance values (genomic signatures) are about the same for every contig of at least 50 kb length from the same organism but significantly different for those from different organisms.
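The dinucleotide relative abundance comparison described above can be made concrete with a short computation. The sketch below is in Python and assumes the definition given in the cited Karlin and Burge (1995) paper, which this section does not restate: rho*_XY = f*_XY / (f*_X f*_Y), with frequencies pooled over a sequence and its reverse complement, and a distance between two sequences taken as the average absolute difference over the 16 dinucleotides. The function names and the two short toy contigs are hypothetical. Real comparisons would use contigs of at least 50 kb, as the text notes.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def genomic_signature(seq):
    """Return {XY: rho*_XY}, where rho*_XY = f*_XY / (f*_X * f*_Y).

    Counts are pooled over the sequence and its reverse complement.
    Characters other than A, C, G, T are ignored. Assumes the sequence
    contains at least two adjacent valid bases.
    """
    mono = dict.fromkeys(BASES, 0)
    di = {x + y: 0 for x, y in product(BASES, repeat=2)}
    seq = seq.upper()
    for strand in (seq, reverse_complement(seq)):
        for base in strand:
            if base in mono:
                mono[base] += 1
        for i in range(len(strand) - 1):
            pair = strand[i:i + 2]
            if pair in di:
                di[pair] += 1
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    signature = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        # The ratio is undefined when one of the two bases never occurs.
        signature[pair] = (count / n_di) / expected if expected > 0 else float("nan")
    return signature


def signature_distance(sig_a, sig_b):
    """Average absolute difference of rho* values over the 16 dinucleotides."""
    return sum(abs(sig_a[p] - sig_b[p]) for p in sig_a) / 16


# Hypothetical toy contigs, far shorter than the 50 kb used in practice.
contig_a = "ATGCGCGATATCGCGTTAACGGCCATAG"
contig_b = "TTGACCAGGTTTCAAGGCTTAACCGGTA"
print(signature_distance(genomic_signature(contig_a), genomic_signature(contig_b)))
```

Pooling both strands makes the signature independent of which strand was sequenced, so a dinucleotide and its reverse complement (for example AC and GT) always receive the same value.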
The uniformity of the signature throughout each genome suggests a recent acquisition (on an evolutionary time scale). Mechanisms for the evolution and maintenance of the genomic signature are unknown, although there are data to suggest that genome-wide processes, including DNA replication and repair, contribute intrinsically to the genomic signature (Karlin and Burge, 1995; Blaisdell et al., 1996; Karlin, 1998). The genomic signature is also useful for detecting pathogenicity islands (Karlin, 2001) in bacterial genomes and in conveying strong influences on codon usages.

ACKNOWLEDGMENTS

Supported in part by NIH Grants 5R01GM10452-40 and 5R01HG00335-15.

REFERENCES

Antelmann, H., Bernhardt, J., Schmid, R., Mach, H., Völker, U., and Hecker, M. 1997. First steps from two-dimensional protein index towards a response regulation map for Bacillus subtilis, Electrophoresis 18, 1451–1463.
Benachenhou-Lahfa, N., Labedan, B., and Forterre, P. 1994. PCR-mediated cloning and sequencing of the gene encoding glutamate dehydrogenase from the archaeon Sulfolobus shibatae: Identification of putative amino-acid signatures for extremophilic adaptation, Gene 140, 17–24.
Beutler, E., Gelbart, T., Han, J. H., Koziol, J. A., and Beutler, B. 1989. Evolution of the genome and the genetic code: Selection at the dinucleotide level by methylation and polyribonucleotide cleavage, Proc. Natl. Acad. Sci. USA 86, 192–196.
Blaisdell, B. E., Campbell, A. M., and Karlin, S. 1996. Similarities and dissimilarities of phage genomes, Proc. Natl. Acad. Sci. USA 93, 5854–5859.
Blumenthal, A. B., Kriegstein, H. J., and Hogness, D. S. 1974. The units of DNA replication in Drosophila melanogaster chromosomes, Cold Spring Harbor Symp. Quant. Biol. 38, 205–223.
Brocchieri, L. 2001. Phylogenetic inferences from molecular sequences: Review and critique, Theor. Popul. Biol. 59, 27–40.
Brown, J. R., and Doolittle, W. F. 1997. Archaea and the prokaryote-to-eukaryote transition, Microbiol. Mol. Biol. Rev. 61, 456–502.
Brown, J. R., Masuchi, Y., Robb, F. T., and Doolittle, W. F. 1994. Evolutionary relationships of bacterial and archaeal glutamine synthetase genes, J. Mol. Evol. 38, 566–576.
Calladine, C. R., and Drew, H. R. 1992. “Understanding DNA,” Academic Press, San Diego.
Campbell, A. M. 2000. Lateral gene transfer in prokaryotes, Theor. Popul. Biol. 57, 71–77.
Cardon, L. R., Burge, C., Clayton, D. A., and Karlin, S. 1994. Pervasive CpG suppression in animal mitochondrial genomes, Proc. Natl. Acad. Sci. USA 91, 3799–3803.
Creti, R., Ceccarelli, E., Bocchetta, M., Sanangelantoni, A. M., Tiboni, O., Palm, P., and Cammarano, P. 1994. Evolution of translational elongation factor (EF) sequences: Reliability of global phylogenies inferred from EF-1 alpha(Tu) and EF-2(G) proteins, Proc. Natl. Acad. Sci. USA 91, 3255–3259.
Daubin, V., and Gouy, M. 2001. Bacterial molecular phylogeny using supertree approach, Genome Inform. Ser. Workshop Genome Inform. 12, 155–164.
Deuerling, E., Schulze-Specking, A., Tomoyasu, T., Mogk, A., and Bukau, B. 1999. Trigger factor and DnaK cooperate in folding of newly synthesized proteins, Nature 400, 693–696.
Doolittle, W. F. 1998. You are what you eat: A gene transfer ratchet could account for bacterial genes in eukaryotic nuclear genomes, Trends Genet. 14, 307–311.
Draper, D. E. 1996. Translational initiation, in “Escherichia coli and Salmonella: Cellular and Molecular Biology,” 2nd ed. (F. C. Neidhardt, R. Curtiss III, J. L. Ingraham, E. C. C. Lin, K. B. Low, B. Magasanik, W. S. Reznikoff, M. Riley, M. Schaechter, and H. E. Umbarger, Eds.), pp. 902–908, ASM Press, Washington, DC.
Echols, H., and Goodman, M. F. 1991. Fidelity mechanisms in DNA replication, Annu. Rev. Biochem. 60, 477–511.
Forterre, P., and Philippe, H. 1999. Where is the root of the universal tree of life? Bioessays 21, 871–879.
Francino, M. P., and Ochman, H. 1997. Strand asymmetries in DNA evolution, Trends Genet. 13, 240–245.
Frank, A. C., and Lobry, J. R. 1999. Asymmetric substitution patterns: A review of possible underlying mutational or selective mechanisms, Gene 238, 65–77.
Gold, L. 1988. Posttranscriptional regulatory mechanisms in Escherichia coli, Annu. Rev. Biochem. 57, 199–233.
Gribaldo, S., and Philippe, H. 2002. Ancient phylogenetic relationships, in “Evolution of Genome Structures,” (A. M. Campbell and S. Karlin, Eds.) Theor. Popul. Biol.
Gromiha, M. M., Oobatake, M., and Sarai, A. 1999. Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins, Biophys. Chem. 82, 51–67.
Gupta, R. S. 1998. Protein phylogenies and signature sequences: A reappraisal of evolutionary relationships among archaebacteria, eubacteria, and eukaryotes, Microbiol. Mol. Biol. Rev. 62, 1435–1491.
Gupta, R. S. 2000. The natural evolutionary relationships among prokaryotes, Crit. Rev. Microbiol. 26, 111–131.
Gupta, R. S., and Golding, G. B. 1996. The origin of the eukaryotic cell, Trends Biochem. Sci. 21, 166–171.
Handy, J., and Doolittle, R. F. 1999. An attempt to pinpoint the phylogenetic introduction of glutaminyl-tRNA synthetase among bacteria, J. Mol. Evol. 49, 709–715.
Hartl, F. U., and Hayer-Hartl, M. 2002. Molecular chaperones in the cytosol: from nascent chain to folded protein, Science 295, 1852–1858.
Hartman, H., and Fedorov, A. 2002. The origin of the eukaryotic cell: A genomic investigation, Proc. Natl. Acad. Sci. USA 99, 1420–1425.
Hayes, J. M. 2000. Lipids as a common interest of microorganisms and geochemists, Proc. Natl. Acad. Sci. USA 97, 14033–14034.
Houry, W. A., Frishman, D., Eckerskorn, C., Lottspeich, F., and Hartl, F. U. 1999. Identification of in vivo substrates of the chaperonin GroEL, Nature 402, 147–154.
Hunter, C. A. 1993. Sequence-dependent DNA structure. The role of base stacking interactions, J. Mol. Biol. 230, 1025–1054.
Josse, J., Kaiser, A. D., and Kornberg, A. 1961. Enzymatic synthesis of deoxyribonucleic acid. VIII. Frequencies of nearest neighbor base sequences in deoxyribonucleic acid, J. Biol. Chem. 236, 864–875.
Karlin, S. 1998. Global dinucleotide signatures and analysis of genomic heterogeneity, Curr. Opin. Microbiol. 1, 598–610.
Karlin, S. 2001. Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes, Trends Microbiol. 9, 335–343.
Karlin, S., Brocchieri, L., Mrázek, J., Campbell, A. M., and Spormann, A. M. 1999. A chimeric prokaryotic ancestry of mitochondria and primitive eukaryotes, Proc. Natl. Acad. Sci. USA 96, 9190–9195.
Karlin, S., and Burge, C. 1995. Dinucleotide relative abundance extremes: A genomic signature, Trends Genet. 11, 283–290.
Karlin, S., Burge, C., and Campbell, A. M. 1992. Statistical analyses of counts and distributions of restriction sites in DNA sequences, Nucleic Acids Res. 20, 1363–1370.
Karlin, S., and Cardon, L. R. 1994. Computational DNA sequence analysis, Annu. Rev. Microbiol. 48, 619–654.
Karlin, S., Doerfler, W., and Cardon, L. R. 1994. Why is CpG suppressed in the genomes of virtually all small eukaryotic viruses but not in those of large eukaryotic viruses? J. Virol. 68, 2889–2897.
Karlin, S., and Mrázek, J. 1996. What drives codon choices in human genes? J. Mol. Biol. 262, 459–472.
Karlin, S., and Mrázek, J. 2000. Predicted highly expressed genes of diverse prokaryotic genomes, J. Bacteriol. 182, 5238–5250.
Karlin, S., Mrázek, J., and Campbell, A. M. 1997. Compositional biases of bacterial genomes and evolutionary implications, J. Bacteriol. 179, 3899–3913.
Karlin, S., Mrázek, J., Campbell, A., and Kaiser, D. 2001. Characterizations of highly expressed genes of four fast-growing bacteria, J. Bacteriol. 183, 5025–5040.
Krieg, A. M. 1996. Lymphocyte activation by CpG dinucleotide motifs in prokaryotic DNA, Trends Microbiol. 4, 73–76.
Krieg, A. M., Yi, A. K., Schorr, J., and Davis, H. L. 1998. The role of CpG dinucleotides in DNA vaccines, Trends Microbiol. 6, 23–27.
Kuehn, M. J., Ogg, D. J., Kihlberg, J., Slonim, L. N., Flemmer, K., Bergfors, T., and Hultgren, S. J. 1993. Structural basis of pilus subunit recognition by the PapD chaperone, Science 262, 1234–1241.
Kunkel, T. A. 1992. Biological asymmetries and the fidelity of eukaryotic DNA replication, Bioessays 14, 303–308.
Leigh, J. A. 2000. Nitrogen fixation in methanogens – the archaeal perspective, in “Prokaryotic Nitrogen Fixation: A Model System for Analysis of a Biological Process,” (E. W. Triplett, Ed.), Horizon Scientific Press, Wymondham, UK.
Lipton, M. S., Pasa-Tolic, L., Anderson, G. A., Anderson, D. J., Auberry, D. L., Battista, J. R., Daly, M. J., Fredrickson, J., Hixson, K. K., Konstandarithes, H., Conrads, T. P., Masselon, C., Markillie, L. M., Moore, R. J., Romine, M. F., Shen, Y., Tolic, N., Udseth, H. R., Veenstra, T. D., Venkateswaran, A., Wong, K.-K., Zhao, R., and Smith, R. D. 2002. Global analysis of Deinococcus radiodurans R1 proteomes using accurate mass tags (submitted).
Lobry, J. R. 1996a. Asymmetric substitution patterns in the two DNA strands of bacteria, Mol. Biol. Evol. 13, 660–665.
Lobry, J. R. 1996b. Origin of replication of Mycoplasma genitalium, Science 272, 745–746.
Lopez, P., Forterre, P., and Philippe, H. 1999. The root of the tree of life in the light of the covarion model, J. Mol. Evol. 49, 496–508.
Lopez-Garcia, P., and Moreira, D. 1999. Metabolic symbiosis at the origin of eukaryotes, Trends Biochem. Sci. 24, 88–93.
Ma, J., Campbell, A., and Karlin, S. 2002. Correlations between Shine–Dalgarno sequences and predicted gene expression levels and operon features, J. Bacteriol. (submitted).
Martin, W., and Müller, M. 1998. The hydrogen hypothesis for the first eukaryote, Nature 392, 37–41.
Mrázek, J., and Karlin, S. 1998. Strand compositional asymmetry in bacterial and large viral genomes, Proc. Natl. Acad. Sci. USA 95, 3720–3725.
Ochman, H., Lawrence, J. G., and Groisman, E. A. 2000. Lateral gene transfer and the nature of bacterial innovation, Nature 405, 299–304.
Poole, A., Jeffares, D., and Penny, D. 1999. Early evolution: prokaryotes, the new kids on the block, Bioessays 21, 880–889.
Powis, G., and Montfort, W. R. 2001. Properties and biological activities of thioredoxins, Annu. Rev. Biophys. Biomol. Struct. 30, 421–455.
Richardson, J. S., and Richardson, D. C. 1988. Amino acid preferences for specific locations at the ends of alpha helices, Science 240, 1648–1652.
Rivera, M. C., Jain, R., Moore, J. E., and Lake, J. A. 1998. Genomic evidence for two functionally distinct gene classes, Proc. Natl. Acad. Sci. USA 95, 6239–6244.
Ritz, D., and Beckwith, J. 2001. Roles of thiol-redox pathways in bacteria, Annu. Rev. Microbiol. 55, 21–48.
Rocha, E. P. C., Danchin, A., and Viari, A. 1999. Universal replication biases in bacteria, Mol. Microbiol. 32, 11–16.
Rocha, E. P. C., Danchin, A., and Viari, A. 2001. Evolutionary role of restriction/modification systems as revealed by comparative genome analysis, Genome Res. 11, 946–958.
Russell, G. J., McGeoch, D. J., Elton, R. A., and Subak-Sharpe, J. H. 1976. Doublet frequency analysis of bacterial DNAs, J. Mol. Evol. 2, 277–292.
Russell, G. J., and Subak-Sharpe, J. H. 1977. Similarity of the general designs of protochordates and invertebrates, Nature 266, 533–536.
Sandler, S. J., Satin, L. H., Samra, H. S., and Clark, A. J. 1996. recA-like genes from three archaean species with putative protein products similar to Rad51 and Dmc1 proteins of the yeast Saccharomyces cerevisiae, Nucleic Acids Res. 24, 2125–2132.
Sengupta, J., Agrawal, R. K., and Frank, J. 2001. Visualization of protein S1 within the 30S ribosomal subunit and its interaction with messenger RNA, Proc. Natl. Acad. Sci. USA 98, 11991–11996.
Siegert, R., Leroux, M. R., Scheufler, C., Hartl, F. U., and Moarefi, I. 2000. Structure of the molecular chaperone prefoldin: Unique interaction of multiple coiled coil tentacles with unfolded proteins, Cell 103, 621–632.
Suyama, M., and Bork, P. 2001. Evolution of prokaryotic gene order: Genome rearrangements in closely related species, Trends Genet. 17, 10–13.
Tatusov, R. L., Natale, D. A., Garkavtsev, I. V., Tatusova, T. A., Shankavaram, U. T., Rao, B. S., Kiryutin, B., Galperin, M. Y., Fedorova, N. D., and Koonin, E. V. 2001. The COG database: new developments in phylogenetic classification of proteins from complete genomes, Nucleic Acids Res. 29, 22–28.
Teter, S. A., Houry, W. A., Ang, D., Tradler, T., Rockabrand, D., Fischer, G., Blum, P., Georgopoulos, C., and Hartl, F. U. 1999. Polypeptide flux through bacterial Hsp70: DnaK cooperates with trigger factor in chaperoning nascent chains, Cell 97, 755–765.
Thomas, D. C., Svoboda, D. L., Vos, J. M. H., and Kunkel, T. A. 1996. Strand specificity of mutagenic bypass replication of DNA containing psoralen monoadducts in a human cell extract, Mol. Cell. Biol. 16, 2537–2544.
VanBogelen, R. A., Abshire, K. Z., Pertsemlidis, A., Clark, R. L., and Neidhardt, F. C. 1996. Gene-protein database of Escherichia coli K-12, edition 6, in “Escherichia coli and Salmonella: Cellular and Molecular Biology,” 2nd ed. (F. C. Neidhardt, R. Curtiss III, J. L. Ingraham, E. C. C. Lin, K. B. Low, B. Magasanik, W. S. Reznikoff, M. Riley, M. Schaechter, and H. E. Umbarger, Eds.), pp. 2067–2117, ASM Press, Washington, DC.
VanBogelen, R. A., Schiller, E. E., Thomas, J. D., and Neidhardt, F. C. 1999. Diagnosis of cellular states of microbial organisms using proteomics, Electrophoresis 20, 2149–2159.
Warner, J. R. 1999. The economics of ribosome biosynthesis in yeast, Trends Biochem. Sci. 24, 437–440.
Winkler, H. H., and Daugherty, R. M. 1986. Acquisition of glucose by Rickettsia prowazekii through the nucleotide intermediate uridine 5'-diphosphoglucose, J. Bacteriol. 167, 805–808.
Woese, C. R., Kandler, O., and Wheelis, M. L. 1990. Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya, Proc. Natl. Acad. Sci. USA 87, 4576–4579.
Woese, C. 1998. The universal ancestor, Proc. Natl. Acad. Sci. USA 95, 6854–6859.
Wolf, Y. I., Aravind, L., Grishin, N. V., and Koonin, E. V. 1999. Evolution of aminoacyl-tRNA synthetases – analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events, Genome Res. 9, 689–710.
Zillig, W., Klenk, H. P., Palm, P., Puhler, G., Gropp, F., Garrett, R. A., and Leffers, H. 1989. The phylogenetic relations of DNA-dependent RNA polymerases of archaebacteria, eukaryotes, and eubacteria, Can. J. Microbiol. 35, 73–80.