Theoretical Population Biology 61, 367–390 (2002)
doi:10.1006/tpbi.2002.1606
Heterogeneity of Genome and Proteome Content in
Bacteria, Archaea, and Eukaryotes
Samuel Karlin1, Luciano Brocchieri1, Jonathan Trent2, B. Edwin Blaisdell1, and Jan Mrázek1
1 Department of Mathematics, Stanford University, Stanford, California 94305-2125; E-mail: karlin@math.stanford.edu
2 NASA Ames Research Center, Mail Stop 239-15, Moffett Field, California 94035
Received April 7, 2002
Our analysis compares bacteria, archaea, and eukaryota with respect to a wide assortment of genome
and proteome properties. These properties include ribosomal protein gene distributions, chaperone protein
contrasts, major variation of transcription/translation factors, gene encoding pathways of energy
metabolism, and predicted protein expression levels. Significant differences within and between the three
domains of life include protein lengths, information processing procedures, many metabolic and lipid
biosynthesis pathways, cellular controls, and regulatory proteins. Differences among genomes are
influenced by lifestyle, habitat, physiology, energy sources, and other factors. © 2002 Elsevier Science (USA)
0040-5809/02 $35.00 © 2002 Elsevier Science (USA). All rights reserved.
1. INTRODUCTION
Based primarily on rRNA sequence criteria, life has been broadly divided into three domains referred to as
Eukaryota, Bacteria, and Archaea, which are believed to reflect phylogenetic relationships (Woese et al., 1990;
Brown and Doolittle, 1997). This classification system, however, has not gone uncontested and many problems
concerning inferences of evolutionary relationships from molecular sequence data have been identified (e.g., for
different views, see Gupta, 1998, 2000; Poole et al., 1999; Lopez et al., 1999; Forterre and Philippe, 1999;
Brocchieri, 2001; Gribaldo and Philippe, 2002). Furthermore, the identity of the three domains and the
relationships between them are problematic. For example, the archaea and eukaryota putatively share many genes
and proteins involved in information and cellular activity, whereas archaea and bacteria share many
morphological structures and “operational” metabolic genes and proteins (Rivera et al., 1998). Other features
make the phylogenetic cohesiveness of the three domains uncertain and their classification moot. To
investigate the rigor and value of the current classification system, it is, therefore, of interest to catalog
important genes and proteins in each domain and to show strong similarities and differences within and
between domains.
In this post-genomic era we can, in principle, study the complete inventory of cellular proteins. Five
eukaryotic genome sequences are now complete or nearly complete: Saccharomyces cerevisiae, Drosophila
melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, and Homo sapiens. In addition, more than 50
prokaryotic genome sequences are complete. These include human pathogens and microbes of industrial
and commercial value. Our comparisons between the eukaryota, archaea, and bacteria can provide insights
into the workings of living cells by illuminating protein properties associated with specific structures and
functions that will not only have evolutionary consequences, but also medical and environmental benefits.
2. GENOMIC FEATURES AND
ORGANIZATION
It is widely believed that in most prokaryotic organisms during exponential growth, ribosomal proteins (RP),
transcription/translation processing factors (TF), and the major chaperone/degradation (CH) genes
functioning in protein folding and trafficking are highly expressed. This view is based on two-dimensional
polyacrylamide gel electrophoresis and mass spectrometry measurements for ECOLI (VanBogelen et al.,
1996, 1999; for genome species abbreviations, see Table I), for BACSU (Hecker and collaborators, e.g.,
Antelmann et al., 1997), for METJA (Giometti, Argonne National Labs, pers. comm.), and for DEIRA
(Lipton et al., 2002).
The three gene classes RP, CH, and TF record similar
codon frequencies which show high codon biases
relative to the average gene, and the codon usage
differences among these three gene classes are low
(Karlin and Mrázek, 2000). Using these genes as a basis,
a gene is Predicted Highly eXpressed (PHX) if its codon
usage is rather similar to that of the RP, TF, and CH
gene classes and deviates strongly from the average
gene of the genome. Using these criteria, PHX genes
in most prokaryotic genomes include genes of principal
energy metabolism and often genes of amino acid, nucleotide, and fatty acid biosynthesis. In the
cyanobacterium SYNY3, the primary genes of photosynthesis are PHX, and in methanogens those essential for
methanogenesis are PHX.

TABLE I
List of Species and Abbreviations

Genome                                  Abbreviation(a)

Bacteria
Escherichia coli K12                    ECOLI   (Eco)
Escherichia coli O157                   ECOLI   (EcZ)
Salmonella typhimurium                  SALTY
Pasteurella multocida                   PASMU
Yersinia pestis                         YERPE
Vibrio cholerae                         VIBCH   (Vch)
Xylella fastidiosa                      XYLFA
Haemophilus influenzae                  HAEIN   (Hin)
Buchnera sp. APS                        BUCSP
Pseudomonas aeruginosa                  PSEAE   (Pae)
Neisseria meningitidis                  NEIME   (Nme)
Agrobacterium tumefaciens               AGRTU
Mesorhizobium loti                      MESLO   (Mlo)
Sinorhizobium meliloti                  SINME
Caulobacter crescentus                  CAUCR   (Ccr)
Rhodobacter sphaeroides                 RHOSP
Rickettsia prowazekii                   RICPR   (Rpr)
Helicobacter pylori                     HELPY   (Hpy)
Campylobacter jejuni                    CAMJE   (Cje)
Bacillus subtilis                       BACSU   (Bsu)
Bacillus halodurans                     BACHA
Staphylococcus aureus                   STAAU
Lactococcus lactis                      LACLA   (Lla)
Streptococcus pyogenes                  STRPY   (Spy)
Streptococcus pneumoniae                STRPN
Listeria monocytogenes                  LISMO
Clostridium acetobutylicum              CLOAC
Mycobacterium leprae                    MYCLE   (Mle)
Mycobacterium tuberculosis              MYCTU   (Mtu)
Mycoplasma genitalium                   MYCGE   (Mge)
Mycoplasma pneumoniae                   MYCPN
Ureaplasma urealyticum                  UREUR   (Uur)
Synechocystis sp.                       SYNY3   (Syn)
Deinococcus radiodurans                 DEIRA   (Dra)
Treponema pallidum                      TREPA   (Tpa)
Borrelia burgdorferi                    BORBU   (Bbu)
Chlamydia trachomatis                   CHLTR   (Ctr)
Chlamydia pneumoniae                    CHLPN   (Cpn)
Chlamydia muridarum                     CHLMU
Aquifex aeolicus                        AQUAE   (Aae)
Thermotoga maritima                     THEMA   (Tma)

Archaea
Halobacterium sp.                       HALSP   (Hbs)
Methanococcus jannaschii                METJA   (Mja)
Methanococcus voltae                    METVO
Methanosarcina barkeri                  METBA
Methanobacterium thermoautotrophicum    METTH   (Mth)
Archaeoglobus fulgidus                  ARCFU   (Afu)
Pyrococcus horikoshii                   PYRHO   (Pho)
Pyrococcus abyssi                       PYRAB   (Pab)
Thermoplasma acidophilum                THEAC   (Tac)
Thermoplasma volcanium                  THEVO
Pyrobaculum aerophilum                  PYRAE
Aeropyrum pernix                        AERPE   (Ape)
Sulfolobus solfataricus                 SULSO
Sulfolobus tokodaii                     SULTO

Eukaryotes
Homo sapiens                            HUMAN
Drosophila melanogaster                 DROME
Caenorhabditis elegans                  CAEEL
Saccharomyces cerevisiae                YEAST
Arabidopsis thaliana                    ARATH
Trichomonas vaginalis                   TRIVA
Giardia lamblia                         GIALA
Entamoeba histolytica                   ENTHI

(a) The SwissProt five-letter abbreviations are used in the text. The three-letter abbreviations are used in some tables.
2.1. Shine–Dalgarno (SD) Sequences and PHX
Genes
In prokaryotes, we find a correlation between gene
expression levels and the presence of a Shine–Dalgarno
(SD) sequence, which plays an important role in
translation initiation (see below). In bacterial cells,
translation initiation is generally considered the rate-limiting step of translation (reviewed in Gold, 1988;
Draper, 1996). Initiation of gene translation in many
bacteria involves interactions between a conserved SD
sequence upstream of the start codon in the mRNA and
an equally conserved anti-SD sequence at the 3′ end of
the 16S rRNA. Not all mRNAs possess a recognizable
SD sequence, however. The consensus SD sequence
features at its core the purine run AGGAGG or
GGAGGA, generally spanning positions −10 to −5
relative to the start codon, and the 16S rRNA gene
mainly carries the anti-SD sequence CACCTCCTTTC
at its 3′ end (see Ma et al., 2002 for details). We observed
that the majority of prokaryotic genomes have at least
one copy of the 16S rDNA gene that has the CCTCCT
terminal motif.
For our purposes, a strong SD sequence refers to the
motif GGAG, GAGG, or sometimes AGGA, aligned
within positions −10 to −5 relative to the gene start
codon. In several genomes, the proportion of PHX
genes and average to low expression level genes with
strong SD sequence has been investigated (Ma et al.,
2002). As may be expected, the PHX genes as compared
to genes with an average or low expression level are
significantly more likely to possess a strong SD motif.
This positive correlation between strong SD signal
sequences and high expression genes can be found in
almost all bacterial and archaeal genomes, whereas SD
sequences do not exist in eukaryotes.
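The strong-SD criterion described above can be illustrated with a minimal sketch. It assumes that "aligned within positions −10 to −5" means the 4-base motif lies wholly inside that six-base window, which is one possible reading of the text; the function names and the toy upstream sequences are hypothetical and are not taken from Ma et al. (2002).

```python
from typing import Iterable

# Strong SD motifs named in the text.
STRONG_SD_MOTIFS = ("GGAG", "GAGG", "AGGA")


def has_strong_sd(upstream: str, first: int = -10, last: int = -5) -> bool:
    """Return True if a strong SD motif lies wholly inside positions first..last.

    `upstream` is the sequence immediately 5' of the start codon, so its
    final character is position -1. The window limits are an assumption.
    """
    seq = upstream.upper().replace("U", "T")
    stop = last + 1
    window = seq[first:stop] if stop < 0 else seq[first:]
    return any(motif in window for motif in STRONG_SD_MOTIFS)


def strong_sd_fraction(upstreams: Iterable[str]) -> float:
    """Fraction of genes whose upstream region carries a strong SD motif."""
    seqs = list(upstreams)
    if not seqs:
        return 0.0
    return sum(has_strong_sd(s) for s in seqs) / len(seqs)


# Toy upstream regions (hypothetical, 15 bases each, ending at position -1).
phx_upstreams = ["ATTAAAGGAGGATTA", "CCTTAGGAGGTATTA"]
other_upstreams = ["ATTAATTTTTTATTA", "GGAGGATTAATTCCA"]

print(strong_sd_fraction(phx_upstreams), strong_sd_fraction(other_upstreams))
```

Comparing the two fractions for a PHX gene set and for the remaining genes of a genome gives the kind of contrast reported in the text. The last toy sequence shows that a motif outside the window is not counted.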
2.2. Unique vs Multiple Origins of Replication
The GC skew (biases in (G − C)/(G + C) counts) shows a strong difference between archaea and bacteria,
probably due to the existence of unique vs multiple origins of replication (Mrázek and Karlin, 1998; Frank
and Lobry, 1999). There is substantial evidence for a prevalence of G in excess of C in the leading strand
relative to the lagging strand in most bacterial genomes. Exceptions include the genomes of SYNY3, DEIRA,
THEMA, and all of the archaeal genomes. Eukaryotic chromosomes, including those of YEAST, CAEEL,
DROME, HUMAN, and ARATH, show no strand asymmetry.
What are the possible sources of strand composition asymmetry? Lobry (1996a, b) was the first to observe
strand compositional asymmetry, which he associates with differences in replication, mutation, and repair
biases in the leading vs the lagging strand. Francino and Ochman (1997) emphasize a mutational bias associated
with transcription-coupled repair mechanisms and deamination events. Other possible sources of strand
asymmetry include enzyme and architectural asymmetries at the replication fork, differences of replication
processivity (Kunkel, 1992), intergenic differences in signal or binding sites in the two strands, differences in
gene density coupled with amino acid and codon biases (Mrázek and Karlin, 1998), and dNTP-pool fluctuations
during the cell cycle (Thomas et al., 1996). Rocha et al. (1999), using a statistical linear discriminant function,
observed compositional asymmetries between genes on the leading vs those on the lagging strand at the level of
nucleotides, codons, and amino acids. The GC skew switches sign at the origin and terminus of replication in
those bacteria possessing a single origin of replication (oriC) that is subject to bidirectional replication. Other
factors that may contribute to strand asymmetry include gene function or expression level, operon organization,
and differences in single-base or context-dependent mutations.
Strand compositional asymmetry, in general, is not apparent in the genomes of organisms known to possess
multiple origins of replication distributed, on average, every 50 kb. It may, therefore, be surmised that archaeal
genomes, which apparently do not show compositional asymmetry (no GC skew bias), possess multiple
replication origins.
2.3. Periodic 30 bp Repeats in Archaea, Thermophiles, and Alkaliphiles
All current archaeal genomes, except Halobacterium sp., contain one or more unusual clusters of 24–32 bp
repeat elements, usually in excess of 40 copies, separated by 40–60 bp of nonconserved spacers (see Table II). A
similar repeat arrangement is present in the Gram-negative hyperthermophilic bacteria AQUAE and
THEMA. The function of these repeats is unknown, although it is tempting to speculate that it is related to
the thermophilic lifestyle. However, it is also observed in the genome of BACHA, a mesophilic bacterium
characterized as an extreme alkaliphile living optimally at pH ≥ 9.5 and containing a corresponding array of
repeats. Two mesophilic methanogens, Methanosarcina mazei and Methanosarcina acetivorans, also
contain repeats of the kind displayed in Table II.

TABLE II
Periodic 24–32 bp Repeats in Prokaryotic Genomes

                                                       Repeat count in the genome
Genome  Repeat sequence(s)                Periodicity  Exact  Allowing N     Number of  Max. number of repeats
                                          (bp)                errors         clusters   in a single cluster
SULSO   CTTTCAATTCCTTTTGGGATTAATC         61–66        151    201 (N = 5)    2          103
        CTTTCAATTCTATAAGAGATTATC          61–66        127    224 (N = 5)    4          96
SULTO   TCTTTCAATTCCTTTTGGGATTCATC        62–66        44     188 (N = 5)    2          113
        ACTTTCAATTCCATTAAGGATTATC         63–67        57     271 (N = 5)    4          96
PYRAE   GTTTCAACTATCTTTTGATTTCTGG         65–70        43     45 (N = 5)     3          18
        GAATCTTCGAGATAGAATTGCAAG          66–69        81     83 (N = 5)     1          81
AERPE   GCATATCCCTAAAGGGAATAGAAAG         63–67        42     42 (N = 5)     1          42
        GAATCTTCGAGATAGAATTGCAAG          62–69        36     47 (N = 5)     2          25
METJA   RTTAAAATCAGACCGTTTCGGAATGGAAAY    65–73        66     134 (N = 6)    9          25
METTH   GTTAAAATCAGACCAAAATGGGATTGAAAT    65–68        171    171 (N = 6)    2          124
ARCFU   GTTGAAATCAGACCAAAATGGGATTGAAAG    66–69        107    108 (N = 6)    2          60
PYRHO   GTTTCCGTAGAACTAAATAGTGTGGAAAG     65–68        71     111 (N = 6)    3          67
PYRAB   GTTCCAATAAGACTAAAATAGAATTGAAAG    67           47     53 (N = 6)     3          27
PYRFU   GTTCCAATAAGACTAAAATAGAATTGAAAG    66–68        50     206 (N = 6)    7          50
THEAC   GTAAAATAGAACCTTAATAGGATTGAAAG     65–66        46     47 (N = 6)     1          47
THEVO   GTTTAAGATGTACTAGTTAGTATGGAAG      70           33     40 (N = 6)     2          19
THEMA   GTTTCAATAMTTCCTTAGAGGTATGGAAAC    65–67        100    113 (N = 6)    8          41
AQUAE   GTTCCTAATGTACCGTGTGGAGTTGAAAC     65–67        14     25 (N = 6)     5          5
BACHA   GTCGCACTCTTCATGGGTGCGTGGATTGAAAT  65–68        49     95 (N = 6)     5          36
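The quantities tabulated in Table II (exact count, count allowing N errors, number of clusters, maximum repeats per cluster, and periodicity) can be illustrated with a minimal sketch. It assumes that "errors" means substitutions only, ignores IUPAC ambiguity codes such as R, Y, and M, and uses a hypothetical `max_period` of 80 bp to decide when two hits belong to the same cluster. The genome below is a synthetic toy sequence built around the PYRAB repeat, not real data, and the search is a plain quadratic scan rather than the method used to produce the table.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_repeat_hits(genome: str, repeat: str, max_errors: int = 0) -> list[int]:
    """Start positions where `repeat` occurs with at most `max_errors` mismatches."""
    k = len(repeat)
    return [i for i in range(len(genome) - k + 1)
            if hamming(genome[i:i + k], repeat) <= max_errors]


def cluster_hits(hits: list[int], max_period: int = 80) -> list[list[int]]:
    """Group sorted hit positions; a gap above `max_period` starts a new cluster."""
    clusters: list[list[int]] = []
    for pos in hits:
        if clusters and pos - clusters[-1][-1] <= max_period:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Synthetic toy genome (not real data): two clusters of the PYRAB repeat.
repeat = "GTTCCAATAAGACTAAAATAGAATTGAAAG"   # 30 bp, from Table II
variant = repeat[:10] + "C" + repeat[11:]   # copy with one substitution (G -> C)
spacer = "ACGT" * 9 + "A"                   # 37 bp spacer, so the period is 67 bp
filler = "ACGT" * 25                        # 100 bp of unrelated sequence
genome = (filler + (repeat + spacer) * 5 + filler * 3
          + repeat + spacer + variant + spacer + repeat + spacer + filler)

exact_hits = find_repeat_hits(genome, repeat, max_errors=0)
approx_hits = find_repeat_hits(genome, repeat, max_errors=1)
clusters = cluster_hits(approx_hits)
periods = {b - a for c in clusters for a, b in zip(c, c[1:])}

print(len(exact_hits), len(approx_hits), [len(c) for c in clusters], sorted(periods))
```

Allowing one error recovers the substituted copy, which is why the approximate count exceeds the exact count, as in most rows of Table II.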
2.4. Representations of Short Palindromes
Archaeal and bacterial genomes tend to underrepresent 4 and 6 bp palindromes (Rocha et al., 2001; see
also Karlin et al., 1992), while eukaryal genomes have
many of these short palindromes. This
observation is consistent with the presence of restriction
systems in prokaryotic genomes but not in eukaryotes.
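As a rough illustration of what underrepresentation of short palindromes means, the sketch below counts reverse-complement palindromes of length k and compares the count with an expectation based only on base composition. The zero-order (independent bases) expectation is an assumption made here for brevity; the cited studies use more careful statistical models. The sequence is assumed to contain only A, C, G, and T, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_palindrome(kmer: str) -> bool:
    """True if the k-mer equals its own reverse complement (e.g. GAATTC)."""
    return kmer == kmer.translate(COMPLEMENT)[::-1]


def palindrome_obs_exp(seq: str, k: int = 4) -> tuple[int, float]:
    """Observed and expected counts of k-bp palindromes on one strand.

    The expectation assumes independent bases with the observed base
    frequencies, which is a simplification.
    """
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    base_counts = Counter(seq)
    total = sum(base_counts[b] for b in "ACGT")
    p = {b: base_counts[b] / total for b in "ACGT"}

    observed = sum(is_palindrome(seq[i:i + k]) for i in range(n_windows))

    expected = 0.0
    for kmer in map("".join, product("ACGT", repeat=k)):
        if is_palindrome(kmer):
            prob = 1.0
            for base in kmer:
                prob *= p[base]
            expected += n_windows * prob
    return observed, expected


# Toy sequence that is deliberately rich in the palindromes GATC and TCGA.
observed, expected = palindrome_obs_exp("GATC" * 10, k=4)
print(observed, expected)
```

An observed/expected ratio well below 1 for a genome would correspond to the underrepresentation described in the text; the toy sequence gives the opposite case by construction.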
2.5. Genome Signature Profiles
Early biochemical experiments measuring nearest-neighbor frequencies established that the set of
dinucleotide relative abundances (the so-called dinucleotide bias) is a remarkably stable property
of the DNA of an organism (Josse et al., 1961; Russell et al., 1976). We observed that the dinucleotide bias
appears to reflect species-specific properties of DNA stacking energies, DNA modification, replication, and
repair mechanisms. Dinucleotide biases in a DNA sequence are assessed through the odds ratios
$\rho_{XY} = f_{XY}/(f_X f_Y)$, where $f_{XY}$ is the frequency of the dinucleotide XY and $f_X$ is the
frequency of the nucleotide X. For double-stranded DNA sequences, a symmetrized version
$\{\rho^*_{XY}\}$ is computed from corresponding frequencies of the sequence concatenated with its inverted
complementary sequence. Our recent studies have demonstrated that the dinucleotide bias profiles
$\{\rho^*_{XY}\}$ evaluated for multiple disjoint 50 kb DNA contigs from the same organism are approximately
constant throughout the genome and are generally more similar to each other than they are to those from
50 kb contigs of different organisms (Russell and Subak-Sharpe, 1977; Karlin and Cardon, 1994; Blaisdell
et al., 1996; Karlin, 1998). This bias pervades both coding and noncoding DNA (Karlin and Mrázek, 1996).
Furthermore, related organisms generally have more similar dinucleotide biases than do distantly related
organisms.
These highly stable DNA doublets suggest that there
may be genome-wide factors that limit the compositional and structural patterns of a genomic sequence. In
the absence of strong current selection, the dinucleotide
compositions should be especially conservative and
likely to drift only slowly with time. Dinucleotide
relative abundances capture most of the departure from
randomness in genome sequences. Overall, the dinucleotide, trinucleotide and tetranucleotide relative abundances in a genome are highly correlated, indicating that
DNA conformational arrangements are principally
determined by base-step configurations. On this basis
we refer to the profile $\{\rho^*_{XY}\}$ of a given genome as its
“genomic signature.”
What causes the uniformity of genomic signatures
throughout the genome of an organism? A reasonable
explanation for this constancy of the genomic signature
may be based on the replication and repair machinery of
different species, which either preferentially generates or
preferentially selects specific dinucleotides in the DNA
(Karlin and Burge, 1995; Karlin, 1998). These effects
might operate through context-dependent mutation
rates and/or DNA modifications and through local
DNA structures (base-step conformational tendencies).
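A minimal sketch of the genomic signature computation follows. It computes the symmetrized odds ratios from a sequence and its inverted complement. Counting the two strands separately, rather than literally concatenating them, is a choice made here to avoid one artificial dinucleotide at the junction; the sequence is assumed to contain only A, C, G, and T, and the function name is hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def symmetrized_signature(seq: str) -> dict[str, float]:
    """Return rho*_XY = f*_XY / (f*_X f*_Y) for all 16 dinucleotides.

    Frequencies are pooled over the sequence and its inverted complement.
    Entries involving a base that never occurs are returned as NaN.
    """
    seq = seq.upper()
    inverted_complement = seq.translate(COMPLEMENT)[::-1]

    mono: Counter = Counter()
    di: Counter = Counter()
    for strand in (seq, inverted_complement):
        mono.update(strand)
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))

    n_mono = sum(mono.values())
    n_di = sum(di.values())

    signature = {}
    for x in "ACGT":
        for y in "ACGT":
            fx, fy = mono[x] / n_mono, mono[y] / n_mono
            fxy = di[x + y] / n_di
            signature[x + y] = fxy / (fx * fy) if fx and fy else float("nan")
    return signature


# Toy sequence (hypothetical); real signatures are computed on 50 kb contigs.
sig = symmetrized_signature("ACGTTGCAAGGCTTAACCGGATCGATTACG")
print({k: round(v, 2) for k, v in sig.items()})
```

Because both strands are pooled, complementary dinucleotides such as AC and GT receive identical values, which is why Table III needs only ten columns for the sixteen dinucleotides.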
2.5.1. Dinucleotide relative abundance extremes in different prokaryotic genomes. Table III presents the
$\{\rho^*_{XY}\}$ profiles for a broad collection of complete prokaryotic genomes. The dinucleotide relative
abundance of XY is considered underrepresented when $\rho^*_{XY} \le 0.78$ and overrepresented when
$\rho^*_{XY} \ge 1.23$ (cf. Karlin, 1998). We observed that the dinucleotide TA is low in
about 75% of all prokaryote genomes (only 14 of the 53 genomes in Table III are in the normal range; the
others are low). Possible reasons for TA underrepresentation may relate to the low thermodynamic stacking
energy of TA, which is the lowest among all dinucleotides, to the high degree of degradation of UA
dinucleotides by ribonucleases in mRNA (Beutler et al.,
1989) and/or to the presence of TA as part of special
regulatory signals. In this context, TA underrepresentation may help to avoid inappropriate binding of
regulatory factors.
The dinucleotide relative abundance of CG is low in
20/53 genomes, high in 5/53 and normal in 28/53 (Table
III). For example, CG is low or underrepresented in all
Chlamydia species, whereas CG is high or overrepresented in α- and β-proteobacterial genomes (except
CAUCR) and mostly normal in γ-proteobacteria. CG is
low in 7/11 archaea with HALSP high. The dinucleotide
GC is high in many γ- and ε-proteobacterial genomes
and also in the low G+C Gram-positive LISMO,
STAAU, and CLOAC genomes. Underrepresentation
of the dinucleotide CG is prominent in vertebrate
sequences but not observed in most invertebrates (e.g.,
in insects and worms); CG is also significantly low in
many protist genomes (e.g., Entamoeba histolytica,
Dictyostelium discoideum, Plasmodium falciparum but
normal in Trypanosoma sp. and Tetrahymena sp.) and is
also low in animal mitochondrial genomes (Cardon
et al., 1994) and in most small eukaryotic viral genomes
(Karlin et al., 1994). CG is underrepresented in
MYCGE, but not in MYCPN and tends to be
suppressed in low G+C Gram-positive genomes; CG
is low in BORBU and in many thermophilic prokaryotes, including METJA, METTH, SULSO, PYRHO,
and Thermus sp., but not in the thermophiles PYRAE,
AQUAE, or THEMA. Connections have been drawn between CG
representation and immune system stimulation; see Krieg (1996) and Krieg et al. (1998). It
is clear that the mechanisms underlying biased CpG
representations vary.
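The under- and overrepresentation thresholds quoted above can be applied mechanically, as in this small sketch. The values are the HUMAN row as read from Table III; the dictionary layout and function name are hypothetical.

```python
# Thresholds quoted in the text (cf. Karlin, 1998).
LOW, HIGH = 0.78, 1.23


def classify(rho_star: float) -> str:
    """Label a symmetrized relative abundance as under, over, or normal."""
    if rho_star <= LOW:
        return "under"
    if rho_star >= HIGH:
        return "over"
    return "normal"


# HUMAN row as read from Table III (symmetrized pairs share one value).
HUMAN = {
    "CG": 0.26, "GC": 1.01, "TA": 0.73, "AT": 0.87, "CC/GG": 1.25,
    "AA/TT": 1.13, "AC/GT": 0.82, "CA/TG": 1.20, "AG/CT": 1.18, "GA/TC": 0.98,
}

extremes = {name: classify(value) for name, value in HUMAN.items()
            if classify(value) != "normal"}
print(extremes)
```

For this row the sketch flags CG and TA as underrepresented, in line with the CG suppression in vertebrate sequences noted in the text.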
2.5.2. Genome signature differences among sequences. A measure of the genomic difference between two
sequences f and g (from different organisms or from different regions of the same genome) is the dinucleotide
average absolute relative abundance difference calculated as
$$\delta^*(f, g) = \frac{1}{16} \sum_{XY} \left| \rho^*_{XY}(f) - \rho^*_{XY}(g) \right|,$$
where the sum extends over all dinucleotides and usually $\delta^*$ is averaged between all pairs of 50 kb
contigs of the sequences composing f and g. For convenience, we describe levels of $\delta^*$-differences for
some examples (all values multiplied by 1000): “close” indicates $\delta^* \le 45$
(e.g., human vs cow, E. coli vs S. typhimurium),
“moderately similar” indicates $55 \le \delta^* \le 85$ (e.g., human vs chicken, E. coli vs H. influenzae),
“weakly similar” indicates $90 \le \delta^* \le 120$ (e.g., human vs sea urchin, M. genitalium vs
M. pneumoniae), “distantly similar” indicates $125 \le \delta^* \le 145$ (e.g., human vs Sulfolobus,
M. jannaschii vs M. thermoautotrophicum), “distant” indicates $150 \le \delta^* \le 180$ (e.g., human vs
Drosophila, E. coli vs H. pylori), “very distant” indicates $\delta^* \ge 190$ (human vs E. coli,
M. jannaschii vs Halobacterium sp.).

TABLE III
Symmetrized Genome Signatures ($\rho^*$ Dinucleotide Relative Abundance Values)

Genome    CG    GC    TA    AT    CC/GG  AA/TT  AC/GT  CA/TG  AG/CT  GA/TC

HUMAN     0.26  1.01  0.73  0.87  1.25   1.13   0.82   1.20   1.18   0.98
DROME     0.93  1.28  0.74  0.97  1.08   1.24   0.84   1.12   0.87   0.90
CAEEL     0.98  1.04  0.61  0.86  1.05   1.28   0.84   1.08   0.90   1.11
YEAST     0.80  1.02  0.77  0.94  1.06   1.14   0.89   1.10   0.99   1.06
ARATH     0.72  0.92  0.75  0.91  1.05   1.13   0.91   1.10   1.03   1.11

PYRAB     0.71  0.89  0.89  0.90  1.22   1.12   0.77   0.85   1.21   1.15
PYRHO     0.61  0.89  0.90  0.92  1.30   1.11   0.73   0.85   1.22   1.13
ARCFU     0.78  1.02  0.61  0.86  1.04   1.21   0.77   1.01   1.17   1.19
THEAC     0.91  1.04  0.82  1.26  1.05   0.97   0.75   1.06   0.97   1.17
THEVO     0.83  1.12  0.93  1.07  1.12   1.04   0.78   0.99   1.05   1.06
METTH     0.51  0.76  0.74  1.13  1.25   0.95   0.85   1.17   1.07   1.14
METJA     0.32  1.13  0.83  0.94  1.38   1.14   0.72   1.03   1.11   1.05
HALSP     1.36  0.94  0.54  0.98  0.78   0.92   1.26   0.92   0.79   1.33
AERPE     0.70  0.96  1.21  0.95  1.14   0.88   0.83   0.90   1.31   1.03
PYRAE     0.97  1.15  1.08  0.93  1.10   1.18   0.83   0.85   1.07   0.91
SULSO     0.67  0.95  1.00  0.95  1.24   1.04   0.85   0.88   1.17   1.05

ECOLI     1.16  1.28  0.75  1.10  0.91   1.21   0.88   1.12   0.82   0.92
HAEIN     1.09  1.43  0.75  0.95  1.01   1.25   0.85   1.12   0.82   0.87
PSEAE     1.10  1.17  0.54  1.17  0.84   1.07   0.86   1.10   1.02   1.10
BUCSP     0.87  1.25  0.85  0.95  1.22   1.14   0.81   1.02   0.94   1.02
VIBCH     1.04  1.30  0.69  0.99  0.88   1.20   0.89   1.17   0.90   0.95
PASMU     1.07  1.30  0.76  0.97  1.01   1.23   0.90   1.14   0.81   0.89
XYLFA     1.01  1.18  0.68  1.13  0.91   1.10   0.96   1.26   0.83   0.94
NEIME     1.31  1.28  0.64  1.04  0.97   1.45   0.84   1.01   0.69   0.90
AGRTU     1.24  1.21  0.47  1.37  0.86   1.26   0.75   1.04   0.82   1.14
AGRTU-L   1.22  1.20  0.48  1.39  0.87   1.24   0.76   1.05   0.82   1.14
CAUCR     1.16  1.11  0.45  1.29  0.85   1.09   0.86   1.01   0.96   1.22
MESLO     1.23  1.21  0.44  1.41  0.81   1.17   0.79   1.05   0.87   1.18
SINME     1.29  1.15  0.47  1.39  0.82   1.18   0.77   0.93   0.89   1.28
SINMEpA   1.23  1.15  0.52  1.28  0.84   1.15   0.81   1.00   0.91   1.22
RICPR     0.77  1.53  0.98  0.98  1.03   1.05   0.86   1.02   1.06   0.91
HELPY     0.93  1.56  0.73  0.86  1.17   1.37   0.67   0.97   0.97   0.87
CAMJE     0.62  1.75  0.77  0.83  1.11   1.25   0.71   1.03   1.09   0.92
CHLTR     0.79  1.12  0.77  0.89  1.01   1.16   0.76   0.96   1.18   1.15
CHLPN     0.73  1.06  0.78  0.88  1.05   1.14   0.77   0.96   1.19   1.15
CHLMU     0.75  1.12  0.75  0.88  1.08   1.19   0.74   0.96   1.15   1.12
BORBU     0.48  1.47  0.77  0.88  1.29   1.22   0.69   1.02   1.07   1.01
TREPA     1.08  1.22  0.74  0.93  0.86   1.18   0.96   1.13   0.94   0.95
DEIRA     1.07  1.16  0.49  0.89  0.87   1.25   0.93   1.12   1.00   1.01
THEMA     0.92  0.69  0.50  0.83  0.99   1.19   0.87   0.97   1.11   1.40
AQUAE     0.87  0.75  0.82  0.66  1.24   1.29   0.89   0.74   1.18   1.12
MYCTU     1.18  1.07  0.57  1.25  0.87   1.06   1.05   1.11   0.79   1.08
MYCLE     1.12  1.07  0.75  1.10  0.88   1.04   1.05   1.14   0.86   1.02
BACSU     1.04  1.27  0.65  1.02  0.97   1.24   0.75   1.08   0.91   1.06
BACHA     1.09  1.08  0.69  0.98  1.00   1.21   0.84   1.03   0.91   1.10
LACLA     0.77  1.19  0.67  0.88  1.05   1.23   0.82   1.13   0.97   1.05
STAAU     0.94  1.25  0.85  1.00  0.95   1.09   0.95   1.18   0.88   0.95
STRPN     0.69  1.05  0.72  0.89  1.03   1.15   0.85   1.10   1.08   1.09
STRPY     0.71  1.19  0.77  0.89  1.04   1.17   0.87   1.12   1.04   0.99
LISMO     1.11  1.28  0.77  0.92  0.99   1.23   0.86   1.04   0.89   0.97
CLOAC     0.45  1.28  0.93  0.95  1.22   1.08   0.81   1.02   1.13   0.97
UREUR     0.88  1.42  0.79  0.93  1.08   1.17   0.89   1.18   0.84   0.94
MYCGE     0.39  1.19  0.75  0.77  1.13   1.23   0.96   1.16   1.06   0.89
MYCPN     0.82  1.14  0.77  0.71  1.12   1.30   1.02   1.08   0.96   0.81
SYNY3     0.75  1.02  0.75  1.00  1.36   1.32   0.79   1.05   0.85   0.86

Significantly underrepresented dinucleotides ($\rho^* \le 0.78$) are shown in red (bold face if
$\rho^* \le 0.70$). Significantly overrepresented dinucleotides ($\rho^* \ge 1.23$) are shown in green (bold
face if $\rho^* \ge 1.30$).
How do within-species average $\delta^*$-differences compare to between-species ones? Average
within-prokaryotic group $\delta^*$-differences range from 12 to 52 (persistently
small), whereas the average between-group $\delta^*$-differences range from 26 to 357 (see Fig. 1).
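The $\delta^*$ measure and the named similarity levels can be sketched as follows. This is a simplified illustration: it uses the genome-wide HUMAN and DROME values as read from Table III instead of averaging over all pairs of 50 kb contigs as the text prescribes, so the number it produces is only indicative. The dictionary layout and function names are hypothetical. Each symmetrized pair (for example CC/GG) stands for two of the sixteen dinucleotides and is therefore counted twice.

```python
SINGLE = ("CG", "GC", "TA", "AT")                                # self-complementary
PAIRED = ("CC/GG", "AA/TT", "AC/GT", "CA/TG", "AG/CT", "GA/TC")  # two dinucleotides each

# Named levels from the text, in units of 1000 * delta*.
LEVELS = [
    (0, 45, "close"),
    (55, 85, "moderately similar"),
    (90, 120, "weakly similar"),
    (125, 145, "distantly similar"),
    (150, 180, "distant"),
    (190, float("inf"), "very distant"),
]


def delta_star(f: dict[str, float], g: dict[str, float]) -> float:
    """Average absolute difference of rho* over all 16 dinucleotides."""
    total = sum(abs(f[k] - g[k]) for k in SINGLE)
    total += 2 * sum(abs(f[k] - g[k]) for k in PAIRED)
    return total / 16


def describe(delta_times_1000: float) -> str:
    """Map 1000 * delta* to the level names used in the text."""
    for low, high, name in LEVELS:
        if low <= delta_times_1000 <= high:
            return name
    return "between named levels"  # the text leaves gaps between ranges


# Rows as read from Table III.
HUMAN = {"CG": 0.26, "GC": 1.01, "TA": 0.73, "AT": 0.87, "CC/GG": 1.25,
         "AA/TT": 1.13, "AC/GT": 0.82, "CA/TG": 1.20, "AG/CT": 1.18, "GA/TC": 0.98}
DROME = {"CG": 0.93, "GC": 1.28, "TA": 0.74, "AT": 0.97, "CC/GG": 1.08,
         "AA/TT": 1.24, "AC/GT": 0.84, "CA/TG": 1.12, "AG/CT": 0.87, "GA/TC": 0.90}

d = 1000 * delta_star(HUMAN, DROME)
print(round(d), describe(d))
```

For this pair the genome-wide values fall in the "distant" range, which agrees with the human vs Drosophila example given in the text.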
What are the possible mechanisms underlying the
signature determinations? DNA participates in multiple
activities including genome replication, repair, and
segregation, and provides special sequences for encoding
gene products. There are fundamental differences in
replication characteristics between Drosophila and
mouse (Blumenthal et al., 1974). Drosophila DNA
replicates frenetically in the first hour after fertilization,
with replication bubbles distributed about every 10 kb.
At about 12 h effective origins are spread to about 50 kb
apart. In mouse, the rate of replication appears to be
uniform throughout developmental and adult stages.
Cell divisions involve DNA stacking on itself and
loopouts that need to be judiciously decondensed to
undergo segregation. The observed narrow limits to
intragenomic heterogeneity putatively correlate with
conserved features of DNA structure.
The influence of the (double-stranded dinucleotide)
base step on DNA conformational preferences is
reflected in slide, roll, propeller twist, and helical twist
parameters (Calladine and Drew, 1992; Hunter, 1993).
Calculations and experiments both indicate that the
sugar–phosphate backbones are relatively flexible. However, base sequence influences flexural properties of
DNA and governs its ability to wrap around histone
cores. Moreover, certain base sequences are associated
with intrinsic curvature, which can lead to bending and
supercoiling. Inappropriate juxtaposition or distribution
of purine and pyrimidine bases could engender steric
clashes (Calladine and Drew, 1992; Hunter, 1993). For
example, transient misalignment during replication is
associated with structural alterations of the backbone in
alternating purine–pyrimidine sequences. On the other
hand, purine and pyrimidine tracts have fewer steric
conflicts between neighbors (Kunkel, 1992). Dinucleotide relative abundance deviations putatively reflect
duplex curvature, supercoiling, and other higher-order
DNA structural features. Many DNA repair enzymes
recognize shapes or lesions in DNA structures more
than specific sequences (Echols and Goodman, 1991).
DNA structures may be crucial in modulating
processes of replication and repair. Nucleosome positioning, interactions with DNA-binding proteins, and
ribosomal binding of mRNA appear to be strongly
affected by dinucleotide arrangements (Calladine and
Drew, 1992).
A central unresolved problem concerns whether
archaea are monophyletic or polyphyletic. On the basis
of rRNA gene comparisons, the archaea are deemed
monophyletic (Woese et al., 1990). This conclusion is
supported by some protein comparisons, e.g., for the
archaeal RecA-like sequences of Rad 50/Dmc1/RadA
(Sandler et al., 1996) and the elongation factor EF-1α
and EF-G families (Creti et al., 1994). However, many
protein comparisons challenge the monophyletic character of the archaea. For example, bacterial relationships based on comparisons among the HSP 70 kD
sequences place the Halobacteria closer to the Streptomyces than to archaeal or eukaryotic species (Karlin and
Cardon, 1994). Other examples are glutamate dehydrogenase and glutamine synthetase (Benachenhou-Lahfa
et al., 1994; Brown et al., 1994). Lake and collaborators
split the prokaryotes into eubacteria, euryarchaea, and
eocytes. With respect to genomic signature comparisons,
the closest to HALSP are Streptomyces sequences, with $\delta^*$-differences of about 85, and next but
twice as distant are M. tuberculosis and M. leprae sequences: $\delta^*$(HALSP, MYCTU) = 145;
$\delta^*$(HALSP, MYCLE) = 155. The $\delta^*$-differences of HALSP to the archaeal sequences of
Sulfolobus sp. and M. jannaschii are very distant: $\delta^*$ > 280 and > 340, respectively. Sulfolobus sp.
sequences are moderately similar only to Clostridium acetobutylicum, $\delta^*$ = 87. All other comparisons
with Sulfolobus sp. have $\delta^*$ values > 120 and mostly > 180. $\delta^*$-differences of
HALSP from other archaea exceed 245. Thus, a
coherent description for the archaea is not supported
by $\delta^*$-difference data.
Should rickettsial sequences be grouped with α-proteobacteria? The classical α-types consist of two
major subgroups: $A_1$, including Rhizobium sp. and Agrobacterium tumefaciens, and $A_2$, including
R. capsulatus and R. sphaeroides, found predominantly in soil and/or marine habitats, the latter capable of
anoxygenic photosynthesis. A third tentative group $A_3$ includes the Rickettsial and Ehrlichial clades
(obligate intracellular parasites). The genome signature sequence comparisons indicate pronounced
discrepancies between $A_1$ and $A_2$ vs $A_3$. First, the $A_1$ and $A_2$ genomes are pervasively of high
G+C content (generally ≥ 60%), whereas $A_3$ genomes are of low G+C content (≤ 35%). Second, the
$\delta^*$-differences among $A_1$ sequences are 35–63 and among the $A_2$ sequences are 65–90. The
$\delta^*$-differences between $A_1$ and $A_2$ sequences have the range 74–102. By contrast, the RICPR
genomes compared to $A_1$ and $A_2$ show excessive $\delta^*$-differences, generally > 200.
FIG. 1.
Some additional challenging observations based on
signature differences are: (i) All Chlamydia genomes are
close in genome signature to the ARCFU genome. (ii)
The enterobacteria ECOLI, HAEIN, VIBCH, and
PASMU are intriguingly moderately similar to the
Drosophila genome. (iii) In terms of $\delta^*$-differences, the
three cyanobacterial DNA sequences are not close. The
cyanobacteria Synechocystis, Synechococcus, and Anabaena do not form a coherent group and are as far from
each other as general Gram-negative sequences are from
Gram-positive sequences. (iv) There is no consistent
pattern of $\delta^*$-differences among thermophiles. More
generally, grouping of prokaryotes by environmental
criteria (e.g., habitat properties, osmolarity tolerance,
chemical conditions) reveals no correlations in genomic
signature.
2.5.3. Genome signature comparisons among eukaryotes. (i) The most homogeneous genomes occur among
fungi, especially for S. cerevisiae, whereas the most
variable genomes are found among protists. The
distribution of the $\delta^*$-differences between human and
mouse sequence samples is only slightly shifted relative
to $\delta^*$-differences within human sequence samples,
reflecting moderate similarity of human and mouse.
On the other hand, the $\delta^*$-differences between human
and S. cerevisiae and between human and D. melanogaster are substantially higher than all within-species
$\delta^*$-differences.
3. PROTEOMIC FEATURES
3.1. Chaperone Protein Contrasts
Molecular chaperone systems have evolved in all three
domains of life or originated in a common ancestor.
Chaperones are considered to play pivotal roles in
protein folding, degradation of misfolded proteins,
proteolysis, and translocation across membranes. Specialized complex structures in cells often need their own
“dedicated” chaperones (e.g., Kuehn et al., 1993).
Among the top PHX genes in most bacterial genomes
are those for the major chaperone proteins, DnaK
(HSP70) and GroEL (HSP60). The HSP60 chaperonin
complex is considered to assist protein folding by
providing a cavity in which non-native polypeptides
are enclosed and thereby protected against intermolecular aggregation (for a review, see Hartl and Hayer-Hartl, 2002). The ATP-regulated DnaK together with its
cofactors DnaJ and GrpE and the ATP-independent
trigger factor (Tig) are reported to act co-translationally
to assist in protein folding. Tig and DnaK are proposed
to cooperate in the folding of newly synthesized proteins
(Teter et al., 1999). Simultaneous deletion of both Tig
and DnaK is lethal under normal growth conditions
(Deuerling et al., 1999). Tig is broadly PHX for bacterial
genomes but is not found in archaeal or eukaryotic
genomes. HSP70 is abundant in many eukaryota and
bacteria, often with multiple copies of the gene, but the
gene is not present at all in some archaea. In particular,
the HSP70 gene is absent from METJA, ARCFU,
PYRAB, PYRHO, PYRAE, SULSO, and AERPE but
present in the archaea METTH, THEAC, and HALSP,
where it may have been acquired by lateral transfer. All
archaea and eukaryota apparently contain the molecular chaperone prefoldin or GimC (genes involved in
microtubule biogenesis), which is absent from bacteria.
GimC is believed to perform HSP70-like functions
although there is no sequence similarity between GimC
and HSP70 (Siegert et al., 2000).
The chaperonin and its co-chaperonin (GroEL/
GroES) are highly expressed in virtually all
bacterial genomes (see Table IV) (Houry et al., 1999).
The exception is the Mycoplasma UREUR,
which lacks both GroEL and GroES. GroES is not
found in archaea. The archaeal thermosome (a distant
homolog of GroEL) is highly expressed in archaea
(Karlin and Mrázek, 2000) and is more closely related
to the eukaryotic protein TCP1, now referred
to as TriC or CCT. The HSP60s in all three domains are
purified from cells as double-ring complexes. In
bacteria, each ring of GroEL contains seven HSP60
subunits, while in archaea, each ring contains eight or
nine HSP60 subunits. The subunits comprising the ring
may be identical or may comprise up to eight different but closely
related HSP60s. In most bacteria, the subunits are
identical, except for rhizobial α-proteobacteria where
there are two subunits. In some archaea rings are
formed from identical subunits, while in others there are
two subunits, and so far only among the Sulfolobus sp.
there are three subunits. Yeast contains at least 11
distinctive TriC genes. It is believed that the eukaryotic
ring structures contain six to nine different subunits, but
it is not yet clear how the different protein subunits are
arranged.
Duplicated HSP60 sequences stand out among the
classical α-proteobacteria, in contrast to the absence of
HSP60 duplications in all other proteobacterial clades. Multiple
HSP60 sequences also exist in cyanobacteria, in
Chlamydia, and in high G+C Gram-positive bacteria, but not in
RICPR. Many amitochondrial eukaryotes (TRIVA,
GIALA, ENTHI) contain two or more HSP60. Plastids
carry multiple copies of HSP60 that bind Rubisco.
The bacterial Thioredoxin (TrxA) implements protein
folding by catalyzing the formation or disruption of
disulfide bonds (Powis and Montfort, 2001; Ritz and
Beckwith, 2001). The eukaryotic thioredoxin functional
analog is protein disulfide isomerase, operating in the
endoplasmic reticulum. The highest expression levels for
thioredoxin occur in BACSU and then in other fast-growing bacteria in the order DEIRA, VIBCH,
HAEIN, and ECOLI (data not shown). Other chaperones are also distributed unevenly through the three
domains. HSP90 exists in many bacteria and eukaryotes
but is not found in archaea.
Peptidyl-prolyl cis–trans isomerases (PPIases) accelerate the proper folding of proteins by promoting the
cis–trans isomerization of imide bonds in proline within
oligopeptides. Tig exhibits PPIase activity in vitro.
ECOLI has at least nine PPIases defined by sequence
similarity. One of these, the survival protein SurA,
enhances the folding of periplasmic and outer membrane proteins. As expected, SurA does not exist in
Gram-positive bacteria. DegP is a chaperone folding
factor that is significantly PHX, and acts primarily in
degrading misfolded proteins in the periplasm. Also
associated with periplasmic and cytoplasmic chaperones
are several PPIases and disulfide oxidase, DsbA.
GroEL (and thermosomes in archaea) is PHX
(expression level E(g) ≥ 1) in almost all prokaryotic
genomes as displayed in Table IV. Many HSP60 family
members are among the top PHX with E(g) values often
exceeding 2.00.
3.2. Ribosomal Protein (RP) Gene Distribution
Many RP genes have diverged between most archaeal
and bacterial genomes (see COGs database at NCBI-Bethesda, Tatusov et al., 2001). Most bacterial genomes
have an operon or cluster, accounting for 20–40% of all
RP genes, located close to the origin of replication.

TABLE IV
Chaperonin (HSP60) Expression Levels

Proteobacteria
γ-clade: ECOLI, E(g) = 2.09; SALTY, 2.35; VIBCH, 1.31; HAEIN, 1.47; PSEAE, 1.37; YERPE, 2.08; BUCSP, 1.13.
β-clade: NEIME, 1.15.
α-clade: AGRTU, 2.07; SINME (5 copies), 1.66, 0.60, 1.67, 0.66, 1.13; MESLO (5 copies), 1.51, 1.96, 1.87, 1.73, 2.03; CAUCR, 1.78; RICPR, 1.01.
ε-clade: HELPY, 1.17; CAMJE, 1.43.
Only α-types among proteobacteria carry multiple copies of GroEL and many are very highly expressed.
Other Gram-negative: DEIRA, 2.35; cyanobacteria SYNY3 (2 paralogs), 1.51, 1.47; Nostoc sp., 1.59, 1.38.
Small pathogenic: CHLTR (2 copies), 0.87, 1.16; CHLPN (3), 0.89, 0.73, 1.18; TREPA, 1.27; BORBU, 1.13; MYCGE, 0.82.
UREUR has no GroEL.
Gram-positive
Low Gram+: LISMO, 1.89; LACLA, 1.23; STRPY, 0.86; STRPN, 1.08; BACSU, 1.87; BACHA, 1.79.
High Gram+: MYCTU (2 copies), 1.39, 0.95; MYCLE (2), 1.60, 1.30.
Thermosomes in archaea are generally structured from two rings of α and β units. All units are PHX.
HALSP, β 1.20, α 1.21; ARCFU, β 1.35, α 1.49; METJA (a single unit), 1.56; METTH, 1.33, 1.38; THEAC, α 1.07, β 1.18; PYRHO (only 1 unit), 1.27; PYRAB (only 1 unit), 1.40; PYRAE, 1.48, 1.63; AERPE, 1.14, 1.21; SULSO (3 units α, β, γ), 1.25, 1.37, 1.03.
E(g) refers to the formal predicted expression level (Karlin and Mrázek, 2000).

Many genes involved in protein synthesis including tuf,
fus, rpoA, rpoB, rpoC are encoded within or proximal to
the large RP cluster in bacteria but not in archaea.
Archaeal genomes, apparently lacking a unique origin of
replication, contain approximately the same numbers of
RP genes (range 50–65) as bacterial genomes but
possess a less prominent RP operon than bacterial genomes.
In contrast, the RP genes of yeast (and of higher
eukaryotes) are randomly dispersed throughout their
genome.
A "giant" RP gene (labeled S1), commonly exceeding
500 aa in length, is essential in bacterial genomes (with the
exception of Mycoplasma), where it is encoded away
from the RP cluster, but is missing from archaeal and
eukaryote genomes. S1 is overall acidic, binds weakly
and reversibly to the small subunit of the ribosome,
whereas most other RPs bind strongly (Sengupta et al.,
2001). S1 has a high affinity for interaction with mRNA
chains. Protein S1 is the largest RP present in the
ribosome of most bacterial cells and consists of multiple
tandem structural motifs each of about 70 aa length.
The S1 protein is reported to be necessary in many cases
for translation initiation and translation elongation and
is directly involved in the process of mRNA recognition
and binding. S1 can facilitate binding of mRNA that
lacks a strong Shine–Dalgarno sequence. S1 is not
encoded near any RP operon and generally is found
among the highest expression levels. S1 is also encoded
by the deeply branching Gram-negative AQUAE and
THEMA, the latter allowing for a frame shift. The 820
aa S1 protein of THEMA can be recognized as a fusion
of cytidylate kinase (contributing to pyrimidine biosynthesis) with a standard S1. The S1 proteins of
Bacillus genomes (BACSU, BACHA) and of most low
G+C Gram-positive genomes are generally of reduced
size (in the range 380–407 aa).
Bacterial and archaeal genomes encode about 50–65
RPs (to date, the highest number among prokaryotic
genomes is 65 in SULSO), whereas metazoan eukaryotes
invariably have 78 or 79 RP components (Warner,
1999). The situation for protozoa may be different.
Ribosomal proteins are generally cationic (mostly > 20%
cationic residues). Three acidic RPs are found in
eukaryotes, P0, P1, and P2, each containing a carboxyl
hyperacidic residue run. Of these, only P0 is present in
archaeal genomes. P0, P1, and P2 are considered to play
an important regulatory role in the initiation step of
eukaryotic protein translation. Acidic RPs are not
present in bacterial genomes, except for S1 and L7/L12.
L7/L12, as with P0, P1, and P2, is thought to act in
adapting mRNA chains to the ribosome.
3.3. Special Transcription/Translation/
Replication Factors
The special eukaryotic ancillary replication protein
PCNA is extant in most archaea and eukaryotes but is
not found in bacteria. Actually, there are multiple copies
of the PCNA gene in the crenarchaeal genomes of
SULSO, PYRAE, and AERPE. Translation elongation
factors (e.g., Tuf, Fus) occur as single genes in archaea
but generally appear in multiple highly expressed genes
in α-, β-, and γ-proteobacteria. The ribosome release
elongation factor Rrf is found in most bacteria and in
yeast, but is missing from archaea. The helicase protein
RecG, which helps facilitate branch migration of the
Holliday junction, is widespread in bacteria but seemingly absent from archaea (Suyama and Bork, 2001).
3.4. Origin and Function of Membrane Lipids
All three domains contain polyisoprenes but
eukaryotes use significant amounts of sterols not present
in either bacteria or archaea. Membranes of Gram-negative bacteria and eukaryotes are replete with
phospholipid and lipid-modified proteins, whereas
archaea generally emphasize prenylated ether lipids
but apparently make no fatty acids (Hayes, 2000).
Lipopolysaccharide biosynthesis genes of anomalous
codon usage, which encode a hierarchy of surface
antigen proteins (the Lps family) and often occur in
clusters, are present in many bacterial and in archaeal
genomes but never in Gram-positive organisms and
apparently are not present in eukaryotes. The Lps
biosynthesis genes generally indicate a putative alien
gene cluster as characterized in Karlin (2001). The lipid-A anchor (connecting sugar and lipid moieties) prominent in ECOLI and SALTY appears to be missing
from Gram-positive and archaeal genomes. This phenomenon may be related to the fact that the enzymatic
apparatus for lipid synthesis appears to be much
reduced or nonexistent in most archaeal genomes. For
example, FabB, FabD, and AcpP are not found in the
archaea.
3.5. Nitrogen Fixation (Nif)
Nif genes are present in several bacteria and archaea
but apparently not in eukaryotes. The glnB family of
nitrogen sensory–regulatory genes is widespread in
bacteria and archaea. Nif in archaea is evolutionarily
related to Nif genes in bacteria and operates by the same
fundamental mechanism (Leigh, 2000). It is proposed
that some genes of this kind wander about via lateral
transfer (e.g., as occurs in Klebsiella). Some Nif genes
are found in AQUAE, ARCFU, CHLTR, CHLPN,
DEIRA, HAEIN, HELPY, METTH, NEIME,
SYNY3, TREPA, VIBCH, SINME. For example, the
predominant nitrogenases in methanogens seem to be
molybdenum (cofactor) nitrogenases as is the case in
bacteria. The methanogens vary with respect to
nitrogen fixation. For example, neither METJA nor
METVO fixes nitrogen while Methanosarcina barkeri
and Methanococcus thermolithotrophicus both do
(Leigh, 2000).
4. METABOLIC PATHWAYS AND SOME
PROTEIN CLASSES
We describe in Tables V–VIII the status of genes of
several pathways among archaeal and bacterial species
emphasizing the presence, absence, and expression levels
of these genes. (E(g) signifies the formal predicted
expression level; see Karlin and Mrázek (2000).)
4.1. Glycolysis (Table V)
Hexokinase (Hxk) and glucokinase (Glk) are prominent glycolysis proteins in eukaryotes, but the former is
not found in most prokaryotes nor in any archaeal
sequences to date. Only TREPA contains a hexokinase
homolog of low expression level. In glycolysis, hexokinase converts glucose to glucose-6-phosphate. However,
glucose-6-phosphate arises from other hexoses and from
glucose transported into the cell via the phosphotransferase system (PTS). Glucokinase occurs in many
bacteria but normally at low to moderate expression
levels. Bacteria which rely on carbohydrates as a
primary energy source (STRPY, LACLA, BACSU,
ECOLI, VIBCH) use the PTS system to transport
glucose into the cytoplasm, which concomitantly phosphorylates glucose making Hxk/Glk expendable. PTS
genes are present but generally not PHX in PSEAE,
HAEIN, NEIME, MESLO, CAUCR, CHLTR,
CHLPN, TREPA, MYCGE. PTS genes are absent
from other current bacteria, from all current archaea,
and from yeast. Bacteria mostly execute complete
glycolysis cycles (apart from Hxk/Glk) and glycolysis
enzymes tend to be PHX with very high expression
levels prevailing in yeast, LACLA, STRPY, and
ECOLI.
The human obligate intracellular parasite RICPR is
not able to metabolize glucose (Winkler and Daugherty,
1986). However, RICPR contains five ATP–ADP
exchange translocase genes. These antiporters take
ATP from host cytoplasmic sources and release ADP
from the bacterial cell; the standard mitochondrial
exchange is reversed. The ATP–ADP translocase is very
uncommon among bacteria and only identified in
Chlamydia and Rickettsia and in an assortment of plant
plastids.
The glycolysis genes are PHX in all fast-growing
bacteria with high expression levels for most of them
(Karlin et al., 2001). Glycolysis genes in archaea are
either not PHX or entirely missing. For example,
glucose-phosphate isomerase is missing from the archaeon ARCFU, from Pyrococcus species, as well as from
bacteria in the Mycoplasmas group. Phosphofructokinase is absent from archaea and several proteobacteria
(see Table V). There are two types of fructose bisphosphate aldolase proteins. The class-II Fba is present in
all bacteria (except RICPR, HELPY, CAMJE, UREUR, and Chlamydia) and also in yeast but has no
homologs in archaea. Chlamydia possess a class-I Fba
which also carries homologs in most archaea, in ECOLI,
in MESLO, and in AQUAE. Phosphoglycerate mutase
(Gpm) is present in all bacteria (except RICPR) and in
yeast but only HALSP and THEAC carry
Gpm homologs among archaea. Apart from Rickettsia,
the glycolysis enzymes triosephosphate isomerase
(Tpi), glyceraldehyde-3-phosphate dehydrogenase
(Gap), phosphoglycerate kinase (Pgk), enolase (Eno),
and pyruvate kinase (Pyk) are widespread in all
prokaryotes. The multifunctional enzyme Gap is
missing from UREUR, and pyruvate kinase (Pyk) is
missing from ARCFU, METTH, AQUAE, HELPY,
and TREPA. Excluding hexokinase (Hxk) and
glucokinase (Glk), there are 10 major glycolysis genes
(Table V). Bacterial genomes with at least six PHX
glycolysis genes include LACLA (has nine PHX
glycolysis genes), STRPY (9), BACSU (7), SYNY3 (6),
LISMO (8), ECOLI (10), VIBCH (8), HAEIN (7),
MESLO (6). There are no archaeal genomes with
more than three (mostly one) glycolysis genes
PHX.
4.2. Amino acyl-tRNA Synthetases
(Table VI, aaRS)
The picture of aaRS proteins has become rather
complex (Handy and Doolittle, 1999; Wolf et al., 1999).
The existence of two classes of aaRS proteins has been
firmly established during the past decade. Evidence
comes from X-ray structural data, sequence comparisons, and enzymatic mechanisms. The corresponding
amino acids divide into two sets of residues: Leu, Ile,
Met, Val, Tyr, Gln, Glu, Arg, Cys, Trp constitute Class
I, whereas Class II amino acids are Gly, Ala, Ser, Asp,
Asn, Lys, His, Pro, Thr, Phe. Most aaRS are present in
all genomes. However, glutaminyl-tRNA synthetase
(GlnS) is missing from all archaea and most bacteria.
In fact, it is present only in γ-proteobacteria and
DEIRA. In other species, GlnS is complemented by
GluS and an amidotransferase. Asparaginyl-tRNA
synthetase (AsnS) is absent from many prokaryotic
genomes. Among archaea, AsnS is found in both
Thermoplasma and both Pyrococcus sequences. Among
bacteria AsnS occurs in several γ-proteobacteria, in
several low G+C Gram-positive bacteria, in spirochetes
and in mycoplasmas. Glycyl-tRNA synthetase in many
bacteria is composed of two subunits GlyS and GlyQ
whereas the archaea have a single unit GRS1. Interestingly, DEIRA, mycobacteria, spirochetes, and mycoplasmas also have the archaeal GRS1 instead of the
bacterial type. Analogously for lysyl-tRNA synthetase, the Class II
enzyme (LysU) occurs in most bacteria and,
notably, in yeast, whereas the Class I enzyme (LysS)
is found in all archaea, in α-proteobacteria
(including RICPR), and in spirochetes.
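
The two-class division listed above can be captured as a small lookup. This is a minimal sketch: the set contents are copied from the text, while the names CLASS_I, CLASS_II, and aars_class are illustrative. Lysine is filed under Class II as in the list above, although the text notes that lysyl-tRNA synthetases of both classes exist.

```python
# Amino acids whose synthetases belong to each class, as listed in the text.
CLASS_I = {"Leu", "Ile", "Met", "Val", "Tyr", "Gln", "Glu", "Arg", "Cys", "Trp"}
CLASS_II = {"Gly", "Ala", "Ser", "Asp", "Asn", "Lys", "His", "Pro", "Thr", "Phe"}


def aars_class(amino_acid):
    """Return 'I' or 'II' for a three-letter amino acid code."""
    if amino_acid in CLASS_I:
        return "I"
    if amino_acid in CLASS_II:
        return "II"
    raise ValueError(f"unknown amino acid code: {amino_acid!r}")
```

The two sets are disjoint and together cover the 20 standard amino acids, ten per class.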
Among bacteria, the number of PHX aaRS varies
from zero in DEIRA, PSEAE and several other species
to 19 in ECOLI. Archaea are incongruent with no aaRS
reaching PHX status in METJA but 13 PHX in
AERPE. In the yeast genome, 13 aaRS are PHX. Most
yeast amino acyl-tRNA synthetases occur in two copies
with a PHX nuclear version and a mitochondrial version
of relatively low expression level.
4.3. TCA Cycle Genes (Table VII)
The TCA cycle, apart from production of energy, can
contribute in myriad ways to cellular needs, especially in
making precursors and intermediates to macromolecules, e.g., in amino acid, vitamin, and heme biosyntheses. The order of actions in the TCA cycle is: citrate
synthase (GltA), aconitate hydratase (AcnA/AcnB),
isocitrate dehydrogenase (Icd), 2-oxoglutarate
dehydrogenase (SucA), succinyl coenzyme A (succinyl-CoA) synthetase (SucD and SucC), succinate dehydrogenase (SdhB, SdhC, and SdhD), fumarate hydratase
(FumA or FumC), and malate dehydrogenase (Mdh/
CitH).

TABLE V
E(g) Values (Multiplied by 100) for Glycolysis Genes

Columns (genomes). Archaea: Afu, Hbs, Mja, Mth, Tac, Pho, Pab, Ape. Eukaryote: Sce. Bacteria: Aae, Tma, Dra, Mtu, Mle, Lla, Spy, Bsu, Syn, Eco, EcZ, Pae, Vch, Hin, Nme, Hpy, Cje, Mlo, Ccr, Rpr, Ctr, Cpn, Tpa, Bbu, Uur, Mge.
Rows (genes): Hxk, Glk, Pgi, PfkA, Fba, FbaB, TpiA, GapA, Pgk, GpmA, GpmB, GpmI, DeoB, Eno, PykF.
[Table body: the individual E(g) entries cannot be aligned to rows and columns in this transcript.]

PHX genes are shown in red, PA in blue. Special symbols: -, the gene is not present in the genome; ?, COG data indicate that the gene is in the genome but the name does not match any
gene in the annotation; etc., more than three homologs belong to the COG. Top two E(g) values are shown. Genes: Hxk, hexokinase; Glk, glucokinase; Pgi, glucose-6-phosphate
isomerase; PfkA, 6-phosphofructokinase; Fba, fructose/tagatose bisphosphate aldolase; FbaB, DhnA-type fructose-1,6-bisphosphate aldolase and related enzymes; TpiA, triosephosphate
isomerase; GapA, glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-phosphate dehydrogenase; Pgk, 3-phosphoglycerate kinase; GpmA, phosphoglycerate mutase 1; GpmB,
phosphoglycerate mutase/fructose-2,6-bisphosphatase; GpmI, phosphoglyceromutase; DeoB, phosphopentomutase; Eno, enolase; PykF, pyruvate kinase.

TABLE VI
E(g) Values (Multiplied by 100) for Aminoacyl-tRNA Synthetases

Columns (genomes). Archaea: Afu, Hbs, Mja, Mth, Tac, Pho, Pab, Ape. Eukaryote: Sce. Bacteria: Aae, Tma, Dra, Mtu, Mle, Lla, Spy, Bsu, Syn, Eco, EcZ, Pae, Vch, Hin, Nme, Hpy, Cje, Mlo, Ccr, Rpr, Ctr, Cpn, Tpa, Bbu, Uur, Mge.
Rows (genes): AlaS, ArgS, AspS, AsnS, CysS, GltX, GlnS, GlyQ, GlyS, GRS1, HisS, IleS, LeuS, LysS, LysU, MetG, PheS, PheSA, PheT, ProS, SerS, ThrS, TrpS, TyrS, ValS.
[Table body: the individual E(g) entries cannot be aligned to rows and columns in this transcript.]

PHX genes are shown in red, PA in blue. Special symbols: -, the gene is not present in the genome; ?, COG data indicate that the gene is in the genome
but the name does not match any gene in the annotation; x, expression levels not evaluated (usually due to length < 80 aa); etc, more than three homologs
belong to the COG. Top two E(g) values are shown. Genes: AlaS, alanyl-tRNA synthetase; ArgS, arginyl-tRNA synthetase; AspS, aspartyl-tRNA
synthetase; AsnS, aspartyl/asparaginyl-tRNA synthetases; CysS, cysteinyl-tRNA synthetase; GltX, glutamyl-tRNA synthetase; GlnS, glutaminyl-tRNA
synthetase; GlyQ, glycyl-tRNA synthetase alpha subunit; GlyS, glycyl-tRNA synthetase beta subunit; GRS1, glycyl-tRNA synthetase class II; HisS,
histidyl-tRNA synthetase; IleS, isoleucyl-tRNA synthetase; LeuS, leucyl-tRNA synthetase; LysS, lysyl-tRNA synthetase class I; LysU, lysyl-tRNA
synthetase class II; MetG, methionyl-tRNA synthetase; PheS, phenylalanyl-tRNA synthetase alpha subunit; PheSA, phenylalanyl-tRNA synthetase
alpha subunit archaeal type; PheT, phenylalanyl-tRNA synthetase beta subunit; ProS, prolyl-tRNA synthetase; SerS, seryl-tRNA synthetase; ThrS,
threonyl-tRNA synthetase; TrpS, tryptophanyl-tRNA synthetase; TyrS, tyrosyl-tRNA synthetase; ValS, valyl-tRNA synthetase.

Archaeal genomes lack AcnB (but some have AcnA)
and lack SucA (Table VIII). The anaerobic STRPY
lacks all classical TCA cycle genes except for Mdh. The
spirochaetes (TREPA and BORBU) and mycoplasmas
(e.g., UREUR and MYCGE) also lack TCA cycle genes
except Mdh. However, Mdh is involved in several
metabolic pathways including fermentation and CO2
fixation via the serine–isocitrate lyase pathway. On the
other hand, DEIRA, ECOLI, MESLO, and CAUCR
have most of the TCA genes at the PHX level.
The genes isocitrate lyase (AceA) and malate synthase
(AceB) of the glyoxylate shunt are restricted in bacterial
genomes and wholly absent from archaeal genomes.
These genes are also missing from many small pathogenic bacteria, including RICPR, CHLTR, CHLPN,
TREPA, BORBU, UREUR, MYCGE. The glyoxylate
shunt is required to synthesize precursors for carbohydrates when the carbon source is a C2 compound.
Among archaea the only occurrence of the glyoxylate
cycle is found in Haloferax volcanii. In ECOLI,
isocitrate lyase converts isocitrate into succinate and
glyoxylate, allowing carbon that entered the TCA cycle
to bypass the formation of 2-oxoglutarate and succinyl-CoA. Among the current complete genomes, isocitrate
lyase is strongly PHX in DEIRA, MYCTU and PSEAE,
is present but not PHX in MYCLE, ECOLI and
VIBCH, and in the α-proteobacteria MESLO and
CAUCR, and is absent from all other currently
sequenced prokaryotic genomes.
5. PROTEIN LENGTHS AND AMINO
ACID USAGES AMONG THE THREE
DOMAINS
Table IX reports the median (50th percentile)
variation of protein lengths, amino acid usages, charge
and hydrophobic usages for the five complete eukaryotes, 11 archaeal and 38 currently available bacterial
genome sequences. Two data sets were used for human
sequences: the SWISS-PROT (SP) collection and the
draft human genome public version (HGP) released in
2001. The numbers are remarkably congruent, indicating
that SP provides a faithful representation of the complete
genome. Several striking patterns are evident from Table
VIII and we present some possible interpretations.
5.1. Protein Lengths across Proteomes
The median protein lengths (aa) of all complete
genomes indicate the following orderings: archaea
(median range 230–250 aa except THEAC 268, PYRAE
208, and AERPE 175) < bacteria (250–295 except
NEIME 239 and STRPN 242) < eukaryotes (346–386).
The same orderings hold when restricted to proteins of
size ≥ 200 aa (archaea 331–340, HALSP 344, THEVO
347; bacteria 340–377, MESLO 337, LACLA 329,
STAAU 338, STRPN 338, STRPY 335, UREUR 383;
eukaryotes 433–473, CAEEL 401). The percent of
proteins ≥ 200 aa relative to all proteins of the genome
is 52–76% in archaea (AERPE 43%), 51–74% in
bacteria, and 76–80% in eukaryotes. The greater length
of eukaryotic proteins may reflect their intrinsic complexity which may be attributed to their multifunctionality (several active regions of separate function), with
concomitant multiple exons and alternative splicing.
The smaller size of archaeal proteins may relate to more
extreme environments and more specialized activity.
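
The three length statistics used in this subsection can be computed as in the following sketch. It assumes a proteome is available as a list of protein lengths in amino acids; the function name length_summary is hypothetical, and treating the 200 aa cutoff as inclusive is an assumption.

```python
from statistics import median


def length_summary(lengths, threshold=200):
    """Summarize protein lengths (aa) for one proteome.

    Returns the median over all proteins, the median over proteins at
    or above the threshold, and the percent of proteins at or above it.
    """
    long_proteins = [n for n in lengths if n >= threshold]
    return {
        "median_all": median(lengths),
        "median_long": median(long_proteins) if long_proteins else None,
        "percent_long": 100 * len(long_proteins) / len(lengths),
    }
```

For a toy input such as [120, 180, 250, 300, 410], the overall median is 250, the median of proteins of at least 200 aa is 300, and 60% of proteins reach the cutoff.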
5.2. Charged Amino Acid Usages
For the three domains we compare median protein
frequencies of acidic (E or D) and basic (K or R)
residues. The usage of glutamate (E) pervasively exceeds
the usage of aspartate (D) by at least 1% (mostly 2%)
with reversals in only six prokaryotes. These reversals
tend to be in high G+C genomes (HALSP 68%;
XYLFA 53%; CAUCR 67%; MESLO 63%; MYCTU
66%; MYCLE 58%). However, other genomes of high
G+C content (e.g., PSEAE 66%; SINME 63%;
DEIRA 67%) do not show this property. Strong
differences in charged amino acid usages are also
observed in the halophilic archaeon HALSP, which
shows unusually high levels of aspartate (D) usage
(median 9.4% vs 4.3–6.2% in other organisms) and
low usage of basic residues (median 8.1% compared to
9–13% in all other organisms). Acidic residues overall
tend to be more used in euryarchaeal vs crenarchaeal
proteomes and in the anciently diverged Gram-negative
species AQUAE and THEMA (medians 14.2 and
14.3%). Usage values for acidic residues are in the
range 11.4–12.2% for eukaryotes and 9.6–14.3% for
archaea and bacteria.
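
The median usages compared here can be sketched as follows, assuming each proteome is a list of protein sequences in one-letter code. The helper names are illustrative, and the paper's exact handling of ambiguous residues or very short proteins is not specified in this passage.

```python
from statistics import median


def residue_percent(protein, residues):
    """Percent of positions in one protein occupied by the given residues."""
    return 100 * sum(protein.count(r) for r in residues) / len(protein)


def median_usage(proteome, residues):
    """Median over proteins of the per-protein percent usage."""
    return median([residue_percent(p, residues) for p in proteome])


def charge_usage(proteome):
    """Median percent usage of E, D, acidic (E+D), and basic (K+R) residues."""
    return {
        "E": median_usage(proteome, "E"),
        "D": median_usage(proteome, "D"),
        "acidic": median_usage(proteome, "ED"),
        "basic": median_usage(proteome, "KR"),
    }
```

Comparing the "E" and "D" entries for a genome shows whether it follows the common pattern of glutamate exceeding aspartate or belongs to the reversals noted above.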
What can account for E being used more than D?
There is no real distinction on the basis of the
underlying codon types (GAR vs GAY). Reasons
could reflect constraints on D in terms of protein
secondary structure and size. For example, poly-E can
establish highly stable long α-helices whereas poly-D
does not form long α-helices or long β-strands
(Richardson and Richardson, 1988). However, D is
more common at active sites (e.g., in proteases) and
in calcium metal coordination. The five complete
eukaryotic genomes invariably have (E+D)% >
(K+R)%. As with eukaryotes, most prokaryotic
proteins use more acidic residues than basic residues.
But there are several examples with (K+R)% >
(E+D)%, including BORBU, MYCGE, TREPA,
BUCSP, RICPR, HELPY, UREUR, MYCPN, PYRAE, SULSO, CLOAC.

TABLE VII
E(g) Values (Multiplied by 100) for TCA Cycle Genes

Columns (genomes). Archaea: Afu, Hbs, Mja, Mth, Tac, Pho, Pab, Ape. Eukaryote: Sce. Bacteria: Aae, Tma, Dra, Mtu, Mle, Lla, Spy, Bsu, Syn, Eco, EcZ, Pae, Vch, Hin, Nme, Hpy, Cje, Mlo, Ccr, Rpr, Ctr, Cpn, Tpa, Bbu, Uur, Mge.
Rows (genes): GltA, AcnA, AcnB, Icd, LeuB, SucA, SucD, SucC, SdhA, SdhB, SdhC, SdhD, FumC, TtdA (a), FumA (a), Mdh.
[Table body: the individual E(g) entries cannot be aligned to rows and columns in this transcript.]

(a) Some proteins occur in both TtdA and FumA COGs. PHX genes are shown in red, PA in blue. Special symbols: -, the gene is not present in the genome; ?, COG data indicate that the
gene is in the genome but the name does not match any gene in the annotation; etc, more than three homologs belong to the COG. Top two E(g) values are shown. Genes: GltA, citrate
synthase; AcnA, aconitase A; AcnB, aconitase B; Icd, isocitrate dehydrogenases; LeuB, isocitrate/isopropylmalate dehydrogenase; SucA, pyruvate and 2-oxoglutarate dehydrogenases
E1 component; SucD, succinyl-CoA synthetase alpha subunit; SucC, succinyl-CoA synthetase beta subunit; SdhA, succinate dehydrogenase/fumarate reductase flavoprotein subunits;
SdhB, succinate dehydrogenase/fumarate reductase Fe-S protein; SdhC, succinate dehydrogenase/fumarate reductase cytochrome b subunit; SdhD, succinate dehydrogenase
hydrophobic anchor subunit; FumC, fumarase; TtdA, tartrate dehydratase alpha subunit/fumarate hydratase class I; FumA, tartrate dehydratase beta subunit/fumarate hydratase class
I; Mdh, malate/lactate dehydrogenases.

5.3. Hydrophobic Residue Usages
The median usage of hydrophobic residues in
eukaryotes is lower than in archaea and bacteria
(36.4–38.4% vs 38.7–43.5%, except for STAAU,
38.2%). This observation is somewhat unexpected given
that eukaryotic proteins tend to be longer than
prokaryotic proteins and might contain proportionally
larger hydrophobic cores. A possible explanation is that
eukaryotic structures tend to be multi-domain rather
than consisting of compact globular units. Among
bacteria, the highest median hydrophobic usage is
exhibited by α-proteobacteria (42.3–43.5%) and mycobacteria (43%).
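The class usages discussed in Sections 5.2 and 5.3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the Lehninger classes given in the Table VIII footnote, assumes that a "median usage" is the median of per-protein percentages over proteins of at least 200 residues, and uses hypothetical function names (`class_usage`, `median_class_usage`).

```python
from statistics import median

# Lehninger classes as listed in the Table VIII footnote.
CLASSES = {
    "acidic": set("ED"),             # "-" class
    "basic": set("KR"),              # "+" class
    "polar": set("HYPGSTNQ"),        # p, polar uncharged
    "hydrophobic": set("LIVAMFWC"),  # f, hydrophobics
}


def class_usage(seq):
    """Return the percentage of residues of one protein in each class."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return {
        name: 100.0 * sum(aa in members for aa in seq) / len(seq)
        for name, members in CLASSES.items()
    }


def median_class_usage(proteins, min_len=200):
    """Median per-class usage over proteins with at least min_len residues.

    The 200-residue cutoff mirrors the table heading; treating the table
    entries as medians of per-protein percentages is an assumption.
    """
    kept = [class_usage(p) for p in proteins if len(p) >= min_len]
    if not kept:
        raise ValueError("no protein meets the length cutoff")
    return {name: median(u[name] for u in kept) for name in CLASSES}
```

Comparing the `acidic` and `basic` entries of the result for a whole proteome gives the (E+D)% versus (K+R)% contrast of Section 5.2, and the `hydrophobic` entry corresponds to the usage discussed in Section 5.3.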
5.4. Correlations of G+C Genome Content and
Amino Acid Usages
We observe a positive correlation of the genome
frequency of G+C with the frequency of amino acids
encoded from strongly binding bases {Ala, Gly, Pro,
Trp, Arg} and a negative correlation with the frequency
of amino acids encoded from weak bases {Lys, Ile, Phe,
Tyr, Asn}.
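The direction of these correlations can be checked with a short sketch. The values below are a hand-picked subset of eight bacterial rows transcribed from Table VIII (G+C %, median Ala %, median Lys %), so this only illustrates the calculation and does not reproduce the authors' analysis; the helper name `pearson` is hypothetical.

```python
from math import sqrt

# (G+C %, median Ala %, median Lys %) transcribed from Table VIII.
# Illustrative subset only; chosen to span low and high G+C genomes.
ROWS = {
    "ECOLI": (51, 9.4, 4.1),
    "HAEIN": (38, 8.1, 6.1),
    "PSEAE": (67, 11.6, 2.4),
    "BUCSP": (26, 4.3, 9.6),
    "BORBU": (29, 4.3, 9.9),
    "DEIRA": (67, 12.3, 2.3),
    "MYCTU": (66, 13.1, 1.8),
    "UREUR": (25, 4.8, 9.8),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


gc = [row[0] for row in ROWS.values()]
ala = [row[1] for row in ROWS.values()]
lys = [row[2] for row in ROWS.values()]

r_ala = pearson(gc, ala)  # Ala is encoded by GCN codons (strong bases)
r_lys = pearson(gc, lys)  # Lys is encoded by AAA/AAG codons (weak bases)
print(f"G+C vs Ala: r = {r_ala:.2f}; G+C vs Lys: r = {r_lys:.2f}")
```

On this subset the Ala correlation is strongly positive and the Lys correlation strongly negative, in line with the statement above; a full analysis would use every proteome in the table.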
5.5. Amino Acid Usages in Thermophiles
Among thermophiles optimal growth temperatures
range from 50° to 100°C, but there is no correlation of
amino acid usage with optimal growth temperature. To
date, all sequenced archaeal genomes, with the
exception of HALSP, are from thermophilic organisms. Two
hyperthermophilic bacterial genomes, AQUAE and
THEMA, are available. What is the nature of amino
acid usages of thermophiles (or hyperthermophiles)
compared to mesophiles? The following features stand
out: Thermophilic organisms persistently show a higher
charge usage than mesophilic organisms of similar
G+C content. The only exception is HALSP, whose
proteins tend to be richer in Asp representations
compared to other proteomes. The greater charge in
proteins of thermophiles presumably implicates more
salt-bridge connections in their 3D structure and
concomitantly greater structural stability.
The frequencies of the strong amino acids Ala and Gly are
significantly correlated with G+C content. Discounting
this correlation, thermophilic organisms generally show
lower frequencies of Ala and higher frequencies of Gly
than mesophilic organisms. This is particularly obvious
for the frequency of Ala in THEMA (5.7%) and
AQUAE (5.6%), compared to about 7.0–8.0% in
mesophiles of similar G+C content.
β-branched residues are suggested as favorable in
stabilizing thermophilic proteins (Gromiha et al., 1999).
Frequencies of Val and especially Ile show some
correlation (positive and negative, respectively) with
genome G+C content. In particular, Ile and Val
frequencies tend to be increased in most thermophilic
archaeal proteomes and, in the case of Val, also in
THEMA and AQUAE.
6. CONCLUDING REMARKS AND
PROSPECTS
Among all prokaryotic genomes (> 54 currently
available, with more than one million codons in aggregate)
there are only 76 gene sequences common to all genomes,
of which more than half are ribosomal protein
sequences and more than a dozen are aminoacyl-tRNA
synthetases. Also, several are major protein processing
factors, and a few are chaperone complexes. The genomic
and proteomic evaluations of the text and of Tables II–
IX show substantial differences among prokaryotic
genomes. Woese et al. (1990), based on 16S rRNA
comparisons for a wide variety of organisms, have
argued for partitioning all independent organisms into
three sets: bacteria, archaea, and eukaryotes. Woese
notes that bacterial cell membranes generally are formed
of glycerol long diester hydrocarbon chains, while
archaeal membranes are formed of isoprenoid glycerol
diethers. Eukaryotes generally possess membranes
formed of glycolipids. We have examined more than
20 properties, discussed in the foregoing text, of the
three proposed ‘‘domains’’ and have found many
discrepancies for different properties. Poole et al.
(1999) argue for coherence of each domain but ‘‘with
many exceptions.’’ Significant differences between eukaryotes, archaea, and bacteria are found in the
distribution of protein lengths and for various classes
of proteins including chaperonins, informational and
metabolic proteins. Table IX summarizes presence,
absence, and expression levels for a broad spectrum of
genomic and proteomic attributes. Differences among
genomes are sensitive to lifestyle, habitat, food sources,
physiology, type of membranes, and many other factors.
For example, among fermentation processes, there are
many alternative end-products depending on environmental conditions and degree of resistance to an acidic
milieu, e.g., acetate, lactate, ethanol.

TABLE VIII
Median Protein Length and Amino Acid Usages in Eukaryotic, Archaeal, and Bacterial Proteomes
Column groups: G+C, % G+C; Nall, N200, %200, number of sequences (all, ≥200 aa, and % ≥200 aa); Mall, M200, median length (all, ≥200 aa); E through f, median amino acid usages (%) in proteins ≥200 aa.

Species(a)   G+C  Nall  N200 %200 Mall M200    E    D    K    R    H    L    I    V    M    F    Y    W    P    G    A    S    T    N    Q    C    -    +    p f(b)
HUM-SP(c)     42  6194  4824   78  372  447  6.5  4.9  5.4  5.3  2.4  9.5  4.6  6.2  2.2  3.8  3.0  1.2  5.3  6.7  6.9  7.4  5.2  3.6  4.2  1.9 11.5 10.9 38.9 37.6
HUMAN         42 13245 10651   80  386  456  6.6  4.8  5.5  5.4  2.4  9.8  4.4  6.1  2.2  3.7  2.8  1.2  5.5  6.4  6.8  7.6  5.1  3.6  4.3  1.9 11.6 11.1 38.8 37.4
DROME         42 14100 10740   76  375  470  6.1  5.2  5.3  5.5  2.5  9.1  4.9  5.9  2.3  3.6  3.0  0.9  4.9  5.9  7.2  7.6  5.3  4.5  4.5  1.7 11.4 10.9 39.3 36.9
CAEEL         36 17083 13668   80  346  401  6.0  5.1  6.1  4.9  2.2  8.9  6.2  6.2  2.6  4.9  3.2  1.0  4.4  4.8  5.9  7.7  5.6  4.8  3.7  1.8 11.4 11.4 37.4 38.4
YEAST         38  6066  4782   79  384  473  6.4  5.8  7.2  4.3  2.1  9.4  6.5  5.5  2.0  4.3  3.3  0.9  4.2  4.9  5.2  8.3  5.6  5.8  3.7  1.2 12.3 11.7 38.6 36.4
ARATH         36 25338 19784   78  361  433  6.5  5.4  6.2  5.3  2.2  9.3  5.3  6.7  2.3  4.3  2.8  1.2  4.6  6.1  6.1  8.6  5.1  4.3  3.2  1.7 12.0 11.7 37.7 37.9

PYRAB         45  1763  1192   68  265  339  8.9  4.7  7.7  5.6  1.4  9.9  8.4  8.0  2.3  4.2  3.8  1.0  4.2  7.3  6.6  4.8  4.1  3.2  1.5  0.3 14.1 13.5 31.1 41.0
PYRHO         42  2058  1181   57  232  334  8.8  4.6  8.0  5.4  1.4  9.9  8.7  7.8  2.3  4.3  3.8  1.0  4.2  7.1  6.3  5.0  4.2  3.3  1.5  0.3 13.7 13.5 31.5 40.8
ARCFU         49  2409  1438   60  241  334  9.0  5.0  6.8  5.6  1.4  9.2  7.1  8.5  2.5  4.4  3.6  0.9  3.8  7.4  7.8  5.4  4.1  3.0  1.6  0.9 14.3 12.6 31.0 41.5
THEAC         46  1478  1001   68  268  340  6.3  6.2  5.6  5.6  1.6  8.1  8.9  7.1  3.2  4.4  4.4  0.7  3.9  7.3  7.0  7.4  4.6  3.9  1.9  0.4 12.9 11.8 35.7 39.2
THEVO         40  1499   974   65  259  347  6.7  6.0  7.1  4.7  1.4  8.3  9.0  7.0  2.6  4.4  4.6  0.7  3.7  7.1  6.4  7.4  4.7  4.4  1.9  0.4 13.0 12.1 35.8 38.8
METTH         50  1869  1123   60  242  336  8.4  6.1  4.0  6.9  1.8  9.1  7.6  7.7  3.0  3.5  2.9  0.7  4.3  8.1  7.4  6.0  4.8  2.9  1.7  0.9 14.8 11.1 33.2 40.3
METJA         31  1773  1076   61  241  336  8.7  5.7 10.4  3.7  1.3  9.2 10.4  6.7  2.1  4.0  4.2  0.5  3.3  6.3  5.3  4.3  4.0  5.1  1.3  1.0 14.6 14.2 30.6 39.9
HALSP         68  2058  1554   76  250  344  6.7  9.4  1.4  6.4  2.2  8.2  3.6  9.1  1.5  3.0  2.5  0.9  4.6  8.4 12.5  5.0  6.6  2.1  2.5  0.6 16.5  8.1 34.6 40.0
AERPE         56  2694  1164   43  175  331  7.4  4.3  3.5  7.6  1.5 11.0  5.3  9.1  2.1  2.7  3.4  1.2  5.3  8.9  9.6  6.3  4.0  1.8  1.6  0.5 11.8 11.7 33.8 42.0
PYRAE         51  2603  1351   52  208  327  7.1  4.4  5.4  6.4  1.4 10.2  6.2  9.4  1.8  3.5  4.2  1.2  5.0  7.8  9.8  4.8  4.3  2.5  1.9  0.6 11.8 12.3 32.4 43.1
SULSO         36  2982  1842   62  251  332  7.2  5.0  8.0  4.6  1.2 10.2  9.3  7.4  2.0  4.2  4.7  0.9  3.7  6.4  5.5  6.4  4.5  4.7  1.9  0.4 12.5 12.9 34.1 40.1

ECOLI         51  4286  2922   68  278  351  5.9  5.2  4.1  5.5  2.2 10.6  5.9  7.0  2.7  3.7  2.7  1.4  4.4  7.3  9.4  5.6  5.2  3.7  4.2  1.1 11.3  9.9 36.1 41.7
HAEIN         38  1707  1118   65  262  342  6.5  5.0  6.1  4.2  2.0 10.5  7.0  6.5  2.4  4.2  3.0  1.0  3.7  6.7  8.1  5.7  5.1  4.7  4.4  1.0 11.8 10.7 36.0 40.7
PSEAE         67  5564  4000   72  291  348  6.2  5.4  2.4  7.7  2.1 12.3  4.0  6.9  1.9  3.5  2.4  1.4  5.0  8.3 11.6  5.3  4.0  2.4  4.0  0.9 11.8 10.6 34.2 42.7
BUCSP         26   564   395   70  282  362  5.2  4.3  9.6  3.6  2.1  9.8 11.4  4.7  2.0  4.8  3.7  0.8  3.0  5.3  4.3  7.3  4.6  7.1  3.1  1.1  9.6 13.7 36.7 39.4
VIBCH         47  3835  2407   63  259  363  6.2  5.1  4.6  4.9  2.3 10.7  6.0  6.9  2.6  3.8  2.9  1.2  4.0  6.6  9.2  6.1  5.1  3.7  4.9  0.9 11.5  9.7 36.4 41.3
PASMU         40  2014  1442   72  286  347  6.0  4.9  5.6  4.3  2.3 10.9  6.7  6.7  2.4  4.2  3.1  1.0  3.9  6.5  8.5  5.5  5.2  4.1  4.8  1.0 11.2 10.2 36.3 41.5
XYLFA         53  2766  1400   51  202  359  5.0  5.4  3.0  6.8  2.6 10.8  5.2  7.4  2.1  3.3  2.5  1.3  5.0  7.5 10.3  5.6  5.5  3.0  4.1  1.0 10.6 10.1 36.2 41.9
NEIME         52  2025  1189   59  239  351  6.2  5.4  5.2  5.4  2.1  9.9  5.8  6.9  2.4  3.8  2.8  1.0  4.2  7.8 10.1  5.4  5.1  3.7  3.8  0.9 11.8 10.8 35.4 41.1
AGRTU         59  2721  1846   68  271  342  5.9  5.7  3.8  6.4  1.9  9.6  5.7  7.3  2.7  3.9  2.2  1.1  4.8  8.3 11.4  5.7  5.2  2.8  2.9  0.7 11.9 10.6 34.4 42.3
AGRTU-L(d)    59  1833  1456   79  315  343  5.8  5.6  3.4  6.5  2.0  9.8  5.7  7.3  2.6  3.9  2.2  1.2  4.8  8.2 11.4  5.8  5.3  2.9  3.0  0.7 11.7 10.4 34.6 42.6
CAUCR         67  3737  2539   68  275  353  5.4  5.9  3.2  7.1  1.7  9.9  4.3  7.5  2.2  3.4  2.0  1.3  5.4  8.8 13.8  4.9  5.0  2.1  3.0  0.7 11.5 10.6 33.6 43.5
MESLO         63  7281  4826   66  268  337  5.4  5.8  3.3  6.9  2.0  9.8  5.3  7.4  2.4  3.7  2.1  1.2  5.0  8.5 12.2  5.5  5.1  2.6  2.9  0.7 11.5 10.6 34.3 43.0
SINME         63  3341  2302   69  276  340  6.2  5.7  3.3  7.0  1.9  9.7  5.4  7.4  2.4  3.8  2.1  1.1  4.9  8.5 12.0  5.5  5.1  2.6  2.7  0.7 12.1 10.6 33.8 42.7
SINME-pA(e)   60  1294   836   65  265  330  5.9  5.3  2.9  7.2  2.1 10.1  5.4  7.5  2.5  3.8  2.1  1.2  5.0  8.2 11.8  5.7  5.2  2.6  2.9  0.8 11.6 10.4 34.4 43.0
RICPR         29   834   582   70  282  358  5.8  4.9  8.3  3.1  1.8 10.1 10.8  5.5  2.0  4.6  4.0  0.6  3.1  5.2  6.0  6.6  5.0  6.4  3.0  1.0 10.9 11.9 36.0 40.5
HELPY         39  1575  1042   66  266  358  6.9  4.7  8.9  3.4  2.1 11.2  7.1  5.4  2.1  5.2  3.6  0.5  3.2  5.4  6.9  6.7  4.1  5.5  3.5  1.0 11.9 12.4 34.7 40.0
CAMJE         31  1635  1122   69  268  348  7.1  5.3  9.5  2.8  1.6 10.8  8.5  5.1  2.1  5.8  3.5  0.5  2.6  5.4  6.7  6.3  3.9  6.0  3.0  1.2 12.7 12.6 33.2 40.7
CHLTR         41   893   625   70  289  377  6.5  4.6  5.4  4.7  2.2 11.3  6.6  6.3  1.9  4.8  3.0  0.8  4.4  6.2  7.4  7.8  4.9  3.2  4.0  1.6 11.2 10.3 36.5 41.0
CHLPN         41  1052   731   69  289  372  6.5  4.6  5.8  4.4  2.3 11.4  6.9  6.1  1.8  4.7  3.2  0.9  4.4  6.0  6.9  7.7  5.1  3.5  3.9  1.6 11.2 10.4 37.0 40.3
CHLMU         40   818   558   68  287  375  6.4  4.5  5.6  4.7  2.2 11.1  6.6  6.4  2.0  4.8  3.0  0.8  4.3  6.2  7.1  8.1  4.8  3.3  4.1  1.6 11.1 10.5 36.6 40.8
BORBU         29   850   599   70  285  358  6.7  5.2  9.9  3.0  1.2 10.4 10.4  5.3  1.7  6.0  4.1  0.4  2.5  5.1  4.3  7.3  3.7  7.0  2.1  0.5 12.1 13.2 34.2 39.9
TREPA         53  1030   743   72  293  367  5.8  4.5  3.6  7.2  2.8 10.0  4.7  8.2  2.0  4.3  3.0  0.9  4.1  6.9 10.2  6.4  5.2  2.3  3.7  1.8 10.5 11.2 34.9 42.6
DEIRA         67  3117  2073   67  263  344  5.8  5.2  2.3  7.3  2.0 11.4  3.0  7.7  1.7  3.0  2.2  1.2  5.8  9.1 12.3  4.9  5.5  2.2  3.9  0.5 11.0 10.0 36.5 41.5
THEMA         46  1846  1308   71  284  344  9.0  5.0  7.5  5.5  1.5  9.8  7.1  8.7  2.2  4.9  3.4  0.9  4.0  7.0  5.7  5.4  4.4  3.5  1.8  0.5 14.3 13.1 31.9 40.2
AQUAE         43  1521  1072   70  274  340  9.6  4.3  9.3  4.8  1.5 10.4  7.2  7.9  1.7  4.8  4.0  0.8  4.0  6.8  5.6  4.7  4.2  3.4  1.9  0.6 14.2 14.3 31.2 39.4
MYCTU         66  3909  2766   71  286  360  4.7  6.0  1.8  7.5  2.2  9.8  4.2  8.6  1.8  2.8  2.0  1.3  5.7  8.7 13.1  5.3  5.8  2.1  3.0  0.8 11.1  9.6 35.5 43.0
MYCLE         58  1605  1100   69  282  358  5.0  6.0  2.4  6.9  2.2 10.0  4.7  9.2  2.0  2.8  2.1  1.2  5.2  8.5 11.8  5.8  6.0  2.5  3.1  0.9 11.2  9.5 35.7 43.0
BACSU         44  4097  2572   63  256  340  7.4  5.3  6.8  4.0  2.2  9.5  7.2  6.8  2.6  4.2  3.3  0.9  3.7  7.0  7.6  6.2  5.3  3.7  3.6  0.7 13.0 11.0 35.7 39.4
BACHA         44  4066  2620   64  261  341  7.9  5.1  5.7  4.7  2.3  9.8  6.8  7.4  2.6  4.1  3.3  1.0  3.8  7.0  7.3  5.6  5.5  3.5  3.9  0.6 13.2 10.5 35.5 39.5
LACLA         35  2266  1409   62  251  329  7.3  5.4  7.2  3.4  1.7  9.7  7.5  6.6  2.4  4.5  3.4  0.8  3.3  6.5  7.3  6.3  5.5  5.0  3.5  0.3 13.0 10.9 35.7 39.3
STAAU         33  2714  1715   63  254  338  6.5  6.0  7.2  3.4  2.2  8.9  8.4  6.7  2.6  4.3  3.8  0.6  3.2  6.2  6.1  5.9  5.6  5.3  3.9  0.5 12.9 10.8 36.7 38.2
STRPN         40  2094  1255   60  242  338  7.3  5.7  6.4  3.9  1.8 10.1  7.2  7.0  2.4  4.5  3.6  0.7  3.3  6.6  7.2  6.0  5.3  4.1  3.9  0.4 13.4 10.5 35.5 39.8
STRPY         39  1696  1127   66  263  335  6.3  6.0  6.6  3.8  2.0  9.9  7.4  6.7  2.5  4.2  3.5  0.7  3.4  6.4  7.6  6.0  5.8  4.2  4.1  0.5 12.6 10.7 36.2 39.7
CLOAC         31  3672  2418   66  262  342  7.1  5.7  9.5  3.3  1.2  8.5  9.6  6.7  2.5  4.3  4.1  0.5  2.7  6.2  5.4  6.5  4.7  6.3  2.1  1.1 12.9 13.0 34.3 38.7
UREUR         25   611   418   68  286  383  5.7  5.7  9.8  2.6  1.6 10.0 10.5  5.3  1.7  5.1  4.3  0.8  2.5  4.1  4.8  5.9  4.8  8.5  3.7  0.6 11.6 12.7 36.3 39.0
MYCGE         32   466   344   74  292  369  5.4  4.9  9.6  2.8  1.5 10.7  8.4  6.1  1.5  6.1  3.2  0.8  2.9  4.5  5.4  6.3  5.2  7.4  4.4  0.8 10.5 12.7 36.0 40.4
MYCPN         40   677   480   71  286  354  5.3  5.0  8.4  3.3  1.7 10.4  6.7  6.4  1.5  5.5  3.2  1.0  3.3  5.3  6.6  5.9  5.7  6.1  5.1  0.7 10.5 11.8 36.9 39.7
SYNY3         48  3163  2086   66  274  360  5.9  5.0  3.9  5.0  1.8 11.4  6.1  6.7  1.9  3.8  2.9  1.4  5.1  7.3  8.5  5.6  5.3  3.7  5.5  0.9 11.0  9.1 38.0 40.9

a Species names generally follow the SwissProt convention: first three letters of the genus name followed by the first two letters of the species name.
b Lehninger alphabet: - is E,D; + is K,R; p (polar uncharged) is H,Y,P,G,S,T,N,Q; f (hydrophobics) is L,I,V,A,M,F,W,C.
c SwissProt collection of human proteins.
d Linear chromosome of Agrobacterium tumefaciens.
e Plasmid pSymA of Sinorhizobium meliloti.
Many theories have been proposed on domains of life,
the origin and early evolution of eukaryotes, and the
genesis of organelles. 16S rRNA genes give results
conflicting with many protein sequence comparisons. It
is increasingly appreciated that the genomes of many
prokaryotes and primitive eukaryotes are ‘‘heterogeneous unions’’ in which lateral transfer and/
or close associations have been at work (Doolittle, 1998;
Ochman et al., 2000; Campbell, 2000). Currently 13
archaeal genomes are available, and numerous
others have yet to be sequenced. Also, Woese (1998) no
longer prescribes a single progenote as the genesis of life
but rather a ‘‘community’’ of initial life forms involving
much lateral exchange among them. Along these lines,
there have been proposed archaeal–bacterial partnerships preceding the origin of eukaryotes (Zillig et al.,
1989; Gupta and Golding, 1996; Martin and Müller,
1998; Lopez-Garcia and Moreira, 1999; Karlin et al.,
1999). Hartman and Fedorov (2002) postulate that the
eukaryotic domain was established as a union of three
genome types: a bacterium, an archaeon, and a
‘‘chronocyte,’’ the latter contributing to several basic
eukaryotic systems including spliceosomal mechanisms,
capping enzymes, nuclear pore constituents, and endoskeletal apparatus allowing for the functionalities of
endocytosis, signaling and control. A primitive
chronocyte is apparently no longer extant. Specificities
and antecedents of these three cell types are also
unclear.
Conventional methods of phylogenetic reconstruction
from sequence information use only similarity or
dissimilarity assessments of aligned homologous genes
or regions. For a detailed review of problems of
inferring phylogeny, see Brocchieri (2001) and Gribaldo
and Philippe (2002). Difficulties intrinsic in phylogenetic
methods include the following: (i) Alignments of
distantly related long sequences (e.g. complete genomes)
are generally not feasible. (ii) Different phylogenetic
reconstructions may result for the same set of organisms
based on analysis of different protein, gene, or noncoding sequences. Attempts are made to overcome this by
averaging over many proteins or by concatenating
sequences (Daubin and Gouy, 2001) mostly restricted
to classes of ribosomal protein genes. Even the numbers
of RPs among prokaryotic genomes differ. (iii) Resultant trees may be highly dependent on details of the
alignment algorithm used, inadequacies of phylogenetic
methods, and biases in species and sequence sampling.
(iv) Long branch attraction, mutational saturation, and
different selection processes make ancient branching
unreliable. (v) Chimeric origins, recombination, inversions, transpositions, and lateral transfer between
distantly related organisms can complicate analyses.
(vi) Tree construction derived from aligned sequences
cannot apply to organisms for which similar gene
sequences are largely unavailable (e.g., for bacteriophages, eukaryotic viruses, or deeply divergent organisms). (vii) Unrecognized
paralogy and widespread reductions and expansions of
genome content can confound the analysis. Translation of sequence similarities
into evolutionary relatedness will always be questionable as the underlying assumptions about mutation
rates, selection forces and gene transfer events are
uncertain. All the models of sequence evolution used are
undoubtedly far from the real evolutionary mechanisms.

TABLE IX
Various Summary Comparisons of Genomic and Proteomic Properties
[Table body largely not recoverable from the extracted text; most Bacteria, Archaea, and Eukaryotes entries can no longer be matched to their rows. Properties compared: Shine–Dalgarno sequences contributing to translation initiation; GC skew; periodic 30 bp repeats; aggregate 4 (and 6) bp palindromes under-represented; genome of single chromosomes; linear chromosomes; extremes of genome signature (dinucleotide relative abundances); nitrogen fixation; Lps (lipopolysaccharide) family; lipid A; chaperones/degradation: HSP60, HSP70, prefoldin complex (GimC), trigger factor, HSP90, thioredoxin, peptidyl-prolyl cis–trans isomerases (3 families: cyclophilins, Fkbp, parvulin), Lon; FtsZ (division control protein); ribosomal protein S1 (rpsA) (generally > 500 aa length); cluster of RP genes; S2; ribosomal proteins P0, P1, P2 (acidic, regulatory); No. of RPs; existence of introns; phosphotransferase (PTS) system; hexokinase and/or glucokinase; translation elongation factors Tuf, Fus; No. of origins of replication; PCNA replication factor; reverse gyrase; protein median lengths among proteins ≥80 aa; % proteins ≥200 aa in the genome.]
Rows whose entries remain attributable:
No. of RPs: Bacteria, 50–63; Archaea, 53–65; Eukaryotes, 78–79 in higher eukaryotes, 72 in GIALA.
Protein median lengths among proteins ≥80 aa: Bacteria, range 260–295 aa, except NEIME 239; Archaea, 230–250 aa, THEAC 268, PYRAE 208, AERPE 175; Eukaryotes, 346–384 aa.
% proteins ≥200 aa in the genome: Bacteria, range 51–74%; Archaea, 52–76%; Eukaryotes, 76–80%.
The three-domain hypothesis and the endosymbiont
hypothesis have undergone many changes. First, the
original reason for dividing the living world into three
domains was that there were, on the initial evidence,
approximately three sets and that these were about
equally ‘‘deviant’’ from one another. However, Table IX
shows great variability within and among the three
‘‘domains.’’ Insistence on three domains leaves frozen a
classification based on the limited knowledge available
in the past (Karlin et al., 1997). If A (Archaea) and K
(eukaryotes) are more closely related than either is to B
(Bacteria) (another point of controversy), then A and K
are in the same domain. Otherwise, why not define
additional primary domains within these three domains?
Recent controversy has arisen regarding the problem of
locating the bacterial root. In this context there are now
a number of proposals placing the root on the
eukaryotic branch (Poole et al., 1999; Forterre and
Philippe, 1999), with later reductions producing the
prokaryotes.
With the range of protein sequences now available,
the nuclear genome appears to be chimeric. If it arose by
fusion of two (or more) genomes, each genome must
have had its own 16S rRNA, one of which might then
have broken off to inhabit the organelle. This leaves
many possible scenarios. Our favorite compresses a
multi-organismal fusion and the endosymbiont invasion
into a single event (Karlin et al., 1999). The chimeric
nature of the nuclear genome could then result primarily
from migration into the nucleus of many genes, not just
those affecting organellar function. We consider the
Sulfolobus line as a likely candidate for the endosymbiont, particularly of animal mitochondria, for reasons
outlined previously (Karlin et al., 1999). These reasons
include similarities in genome signature.
The patterns of dinucleotide relative abundance
values (genomic signatures) are about the same for
every contig of at least 50 kb length from the same
organism but significantly different for those from
different organisms. The uniformity of the signature
throughout each genome suggests a recent acquisition
(on an evolutionary time scale). Mechanisms for the
evolution and maintenance of the genomic signature are
unknown although there are data to suggest that
genome-wide processes, including DNA replication
and repair, contribute intrinsically to the genomic
signature (Karlin and Burge, 1995; Blaisdell et al.,
1996; Karlin, 1998). The genomic signature is also useful
for detecting pathogenicity islands (Karlin, 2001) in
bacterial genomes and in conveying strong influences on
codon usages.
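The genomic signature comparison described here can be sketched as follows. This assumes the usual definition from Karlin and Burge (1995): the symmetrized dinucleotide relative abundance ρ*_XY = f*_XY / (f*_X f*_Y), with frequencies counted on the sequence together with its reverse complement, and the distance δ* between two sequences taken as the mean absolute difference over the 16 dinucleotides. Function names are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def signature(seq):
    """Symmetrized dinucleotide relative abundances rho*_XY.

    Counts are pooled over the sequence and its reverse complement;
    positions holding symbols other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    mono = dict.fromkeys(BASES, 0)
    di = {a + b: 0 for a, b in product(BASES, repeat=2)}
    for strand in (seq, reverse_complement(seq)):
        for base in strand:
            if base in mono:
                mono[base] += 1
        for i in range(len(strand) - 1):
            pair = strand[i:i + 2]
            if pair in di:
                di[pair] += 1
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    rho = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        rho[pair] = (count / n_di) / expected if expected > 0 else float("nan")
    return rho


def delta_distance(seq_a, seq_b):
    """Mean absolute difference of rho* over the 16 dinucleotides."""
    rho_a, rho_b = signature(seq_a), signature(seq_b)
    return sum(abs(rho_a[p] - rho_b[p]) for p in rho_a) / 16.0
```

Applied to contigs of at least 50 kb, small `delta_distance` values between contigs of the same genome and larger values between genomes would correspond to the pattern reported in the text; the short strings used in any quick test are far too small to be meaningful biologically.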
ACKNOWLEDGMENTS
Supported in part by NIH Grants 5R01GM10452-40 and
5R01HG00335-15.
REFERENCES
Antelmann, H., Bernhardt, J., Schmid, R., Mach, H., Völker, U., and
Hecker, M. 1997. First steps from two-dimensional protein index
towards a response regulation map for Bacillus subtilis, Electrophoresis 18, 1451–1463.
Benachenhou-Lahfa, N., Labedan, B., and Forterre, P. 1994. PCR-mediated cloning and sequencing of the gene encoding glutamate
dehydrogenase from the archaeon Sulfolobus shibatae: Identification of putative amino-acid signatures for extremophilic adaptation, Gene 140, 17–24.
Beutler, E., Gelbart, T., Han, J. H., Koziol, J. A., and Beutler B. 1989.
Evolution of the genome and the genetic code: Selection at the
dinucleotide level by methylation and polyribonucleotide cleavage,
Proc. Natl. Acad. Sci. USA 86, 192–196.
Blaisdell, B. E., Campbell, A. M., and Karlin, S. 1996. Similarities and
dissimilarities of phage genomes, Proc. Natl. Acad. Sci. USA 93,
5854–5859.
Blumenthal, A. B., Kriegstein, H. J., and Hogness, D. S. 1974. The
units of DNA replication in Drosophila melanogaster chromosomes, Cold Spring Harbor Symp. Quant. Biol. 38, 205–223.
Brocchieri, L. 2001. Phylogenetic inferences from molecular sequences:
Review and critique, Theor. Popul. Biol. 59, 27–40.
Brown, J. R., and Doolittle, W. F. 1997. Archaea and the prokaryote-to-eukaryote transition, Microbiol. Mol. Biol. Rev. 61, 456–502.
Brown, J. R., Masuchi, Y., Robb, F. T. and Doolittle, W. F. 1994.
Evolutionary relationships of bacterial and archaeal glutamine
synthetase genes, J. Mol. Evol. 38, 566–576.
Calladine, C. R., and Drew, H. R. 1992. ‘‘Understanding DNA,’’
Academic Press, San Diego.
Campbell, A. M. 2000. Lateral gene transfer in prokaryotes. Theor.
Popul. Biol. 57, 71–77.
Cardon, L. R., Burge, C., Clayton, D. A., and Karlin, S. 1994.
Pervasive CpG suppression in animal mitochondrial genomes,
Proc. Natl. Acad. Sci. USA 91, 3799–3803.
Creti, R., Ceccarelli, E., Bocchetta, M., Sanangelantoni, A. M.,
Tiboni, O., Palm, P., and Cammarano, P. 1994. Evolution of
translational elongation factor (EF) sequences: Reliability of global
phylogenies inferred from EF-1 alpha(Tu) and EF-2(G) proteins,
Proc. Natl. Acad. Sci. USA 91, 3255–3259.
Daubin, V., and Gouy, M. 2001. Bacterial molecular phylogeny using
supertree approach, Genome Inform. Ser. Workshop Genome
Inform. 12, 155–164.
Deuerling, E., Schulze-Specking, A., Tomoyasu, T., Mogk, A., and
Bukau, B. 1999. Trigger factor and DnaK cooperate in folding of
newly synthesized proteins, Nature 400, 693–696.
Doolittle, W. F. 1998. You are what you eat: A gene transfer ratchet
could account for bacterial genes in eukaryotic nuclear genomes,
Trends Genet. 14, 307–311.
Draper, D. E. 1996. Translational initiation, in ‘‘Escherichia coli
and Salmonella: Cellular and Molecular Biology,’’ 2nd ed.
(F. C. Neidhardt, R. Curtiss III, J. L. Ingraham, E. C. C. Lin,
K. B. Low, B. Magasanik, W. S. Reznikoff, M. Riley, M.
Schaechter and H. E. Umbarger, Eds.), pp. 902–908., ASM Press,
Washington, DC.
Echols, H. and Goodman, M. F. 1991. Fidelity mechanisms in DNA
replication, Annu. Rev. Biochem. 60, 477–511.
Forterre, P., and Philippe, H. 1999. Where is the root of the universal
tree of life? Bioessays 21, 871–879.
Francino, M. P., and Ochman, H. 1997. Strand asymmetries in DNA
evolution, Trends Genet. 13, 240–245.
Frank, A. C., and Lobry, J. R. 1999. Asymmetric substitution
patterns: A review of possible underlying mutational or selective
mechanisms, Gene 238, 65–77.
Gold, L. 1988. Posttranscriptional regulatory mechanisms in Escherichia coli, Annu. Rev. Biochem. 57, 199–233.
Gribaldo, S., and Philippe, H. 2002. Ancient phylogenetic relationships, in ‘‘Evolution of Genome Structures,’’ (A. M. Campbell and
S. Karlin, Eds.) Theor. Popul. Biol.
Gromiha, M. M., Oobatake, M., and Sarai, A. 1999. Important amino
acid properties for enhanced thermostability from mesophilic to
thermophilic proteins, Biophys. Chem. 82, 51–67.
Gupta, R. S. 1998. Protein phylogenies and signature sequences: A
reappraisal of evolutionary relationships among archaebacteria,
eubacteria, and eukaryotes, Microbiol. Mol. Biol. Rev. 62,
1435–1491.
Gupta, R. S. 2000. The natural evolutionary relationships among
prokaryotes, Crit. Rev. Microbiol. 26, 111–131.
Gupta, R. S., and Golding, G. B. 1996. The origin of the eukaryotic
cell, Trends Biochem. Sci. 21, 166–171.
Handy, J., and Doolittle, R. F. 1999. An attempt to pinpoint the
phylogenetic introduction of glutaminyl-tRNA synthetase among
bacteria, J. Mol. Evol. 49, 709–715.
Hartl, F. U., and Hayer-Hartl, M. 2002. Molecular chaperones in the
cytosol: from nascent chain to folded protein, Science 295,
1852–1858.
Hartman, H., and Fedorov, A. 2002. The origin of the eukaryotic cell:
A genomic investigation, Proc. Natl. Acad. Sci. USA 99,
1420–1425.
Hayes, J. M. 2000. Lipids as a common interest of microorganisms and
geochemists, Proc. Natl. Acad. Sci. USA 97, 14 033–14 034.
Houry, W. A., Frishman, D., Eckerskorn, C., Lottspeich, F., and
Hartl, F. U. 1999. Identification of in vivo substrates of the
chaperonin GroEL, Nature 402, 147–154.
Hunter, C. A. 1993. Sequence-dependent DNA structure. The role of
base stacking interactions, J. Mol. Biol. 230, 1025–1054.
Josse, J., Kaiser, A. D., and Kornberg, A. 1961. Enzymatic synthesis of
deoxyribonucleic acid. VIII. Frequencies of nearest neighbor base
sequences in deoxyribonucleic acid, J. Biol. Chem. 236, 864–875.
Karlin, S. 1998. Global dinucleotide signatures and analysis of
genomic heterogeneity, Curr. Opin. Microbiol. 1, 598–610.
Karlin, S. 2001. Detecting anomalous gene clusters and pathogenicity
islands in diverse bacterial genomes, Trends Microbiol. 9,
335–343.
Karlin, S., Brocchieri, L., Mrázek, J., Campbell, A. M., and
Spormann, A. M. 1999. A chimeric prokaryotic ancestry of
mitochondria and primitive eukaryotes, Proc. Natl. Acad. Sci.
USA 96, 9190–9195.
Karlin, S., and Burge, C. 1995. Dinucleotide relative abundance
extremes: A genomic signature, Trends Genet. 11, 283–290.
Karlin, S., Burge, C. and Campbell, A. M. 1992. Statistical analyses of
counts and distributions of restriction sites in DNA sequences,
Nucleic Acids Res. 20, 1363–1370.
Karlin, S., and Cardon, L. R. 1994. Computational DNA sequence
analysis, Annu. Rev. Microbiol. 48, 619–654.
Karlin, S., Doerfler, W., and Cardon, L. R. 1994. Why is CpG
suppressed in the genomes of virtually all small eukaryotic viruses
but not in those of large eukaryotic viruses? J. Virol. 68, 2889–2897.
Karlin, S., and Mrázek, J. 1996. What drives codon choices in human
genes? J. Mol. Biol. 262, 459–472.
Karlin, S., and Mrázek, J. 2000. Predicted highly expressed genes of
diverse prokaryotic genomes. J. Bacteriol. 182, 5238–5250.
Karlin, S., Mrázek, J., and Campbell, A. M. 1997. Compositional
biases of bacterial genomes and evolutionary implications,
J. Bacteriol. 179, 3899–3913.
Karlin, S., Mrázek, J., Campbell, A., and Kaiser, D. 2001.
Characterizations of highly expressed genes of four fast-growing
bacteria, J. Bacteriol. 183, 5025–5040.
Krieg, A. M. 1996. Lymphocyte activation by CpG dinucleotide motifs
in prokaryotic DNA, Trends Microbiol. 4, 73–76.
Krieg, A. M., Yi, A. K., Schorr J., and Davis H. L. 1998. The role of
CpG dinucleotides in DNA vaccines, Trends Microbiol. 6, 23–27.
Kuehn, M. J., Ogg, D. J., Kihlberg, J., Slonim, L. N., Flemmer, K.,
Bergfors, T., and Hultgren, S.J. 1993. Structural basis of pilus
subunit recognition by the PapD chaperone. Science 262,
1234–1241.
Kunkel, T. A. 1992. Biological asymmetries and the fidelity of
eukaryotic DNA replication, Bioessays 14, 303–308.
Leigh, J. A. 2000. Nitrogen fixation in methanogens: the archaeal
perspective, in ‘‘Prokaryotic Nitrogen Fixation: A Model System
for Analysis of a Biological Process,’’ (E. W. Triplett, Ed.),
Horizon Scientific Press, Wymondham, UK.
Lipton, M. S., Pasa-Tolic, L., Anderson, G. A., Anderson, D. J.,
Auberry, D. L., Battista, J. R., Daly, M. J., Fredrickson,
J., Hixson, K. K., Konstandarithes, H., Conrads, T. P., Masselon,
C., Markillie, L. M., Moore, R. J., Romine, M. F., Shen, Y., Tolic,
N., Udseth, H. R., Veenstra, T. D., Venkateswaran, A., Wong,
K.-K., Zhao, R., and Smith, R. D. 2002. Global analysis of
Deinococcus radiodurans R1 proteomes using accurate mass tags
(submitted).
Lobry, J. R. 1996a. Asymmetric substitution patterns in the two DNA
strands of bacteria, Mol. Biol. Evol. 13, 660–665.
Lobry, J. R. 1996b. Origin of replication of Mycoplasma genitalium,
Science 272, 745–746.
Lopez, P., Forterre, P., and Philippe, H. 1999. The root of the tree of
life in the light of the covarion model, J. Mol. Evol. 49, 496–508.
Lopez-Garcia, P., and Moreira, D. 1999. Metabolic symbiosis at the
origin of eukaryotes, Trends Biochem. Sci. 24, 88–93.
Ma, J., Campbell, A., and Karlin, S. 2002. Correlations between
Shine–Dalgarno sequences and predicted gene expression levels
and operon features, J. Bacteriol. (submitted).
Martin, W., and Müller, M. 1998. The hydrogen hypothesis for the
first eukaryote, Nature 392, 37–41.
Mrázek, J., and Karlin, S. 1998. Strand compositional asymmetry in
bacterial and large viral genomes, Proc. Natl. Acad. Sci. USA 95,
3720–3725.
Ochman, H., Lawrence, J. G., and Groisman, E. A. 2000. Lateral gene
transfer and the nature of bacterial innovation, Nature 405, 299–304.
Poole, A., Jeffares, D., and Penny, D. 1999. Early evolution:
prokaryotes, the new kids on the block, Bioessays 21, 880–889.
Powis, G., and Montfort, W. R. 2001. Properties and biological
activities of thioredoxins, Annu. Rev. Biophys. Biomol. Struct. 30,
421–455.
Richardson, J. S., and Richardson, D. C. 1988. Amino acid preferences
for specific locations at the ends of alpha helices, Science 240,
1648–1652.
Rivera, M. C., Jain, R., Moore, J. E., and Lake, J. A. 1998. Genomic
evidence for two functionally distinct gene classes, Proc. Natl.
Acad. Sci. USA 95, 6239–6244.
Ritz, D., and Beckwith, J. 2001. Roles of thiol-redox pathways in
bacteria, Annu. Rev. Microbiol. 55, 21–48.
Rocha, E. P. C., Danchin, A., and Viari, A. 1999. Universal replication
biases in bacteria, Mol. Microbiol. 32, 11–16.
Rocha, E. P. C., Danchin, A., and Viari, A. 2001. Evolutionary role of
restriction/modification systems as revealed by comparative
genome analysis, Genome Res. 11, 946–958.
Russell, G. J., McGeoch, D. J., Elton, R. A., and Subak-Sharpe, J. H.
1976. Doublet frequency analysis of bacterial DNAs, J. Mol. Evol.
2, 277–292.
Russell, G. J., and Subak-Sharpe, J. H. 1977. Similarity of the
general designs of protochordates and invertebrates, Nature 266,
533–536.
Sandler, S. J., Satin, L. H., Samra, H. S., and Clark, A. J. 1996.
recA-like genes from three archaean species with putative protein
products similar to Rad51 and Dmc1 proteins of the yeast
Saccharomyces cerevisiae, Nucleic Acids Res. 24, 2125–2132.
Sengupta, J., Agrawal, R. K., and Frank, J. 2001. Visualization of
protein S1 within the 30S ribosomal subunit and its interaction
with messenger RNA, Proc. Natl. Acad. Sci. USA 98,
11,991–11,996.
Siegert, R., Leroux, M. R., Scheufler, C., Hartl, F. U., and Moarefi, I.
2000. Structure of the molecular chaperone prefoldin: Unique
interaction of multiple coiled coil tentacles with unfolded proteins,
Cell 103, 621–632.
Suyama, M., and Bork, P. 2001. Evolution of prokaryotic gene order:
Genome rearrangements in closely related species, Trends Genet.
17, 10–13.
Tatusov, R. L., Natale, D. A., Garkavtsev, I. V., Tatusova, T. A.,
Shankavaram, U. T., Rao, B. S., Kiryutin, B., Galperin, M. Y.,
Fedorova, N. D., and Koonin, E. V. 2001. The COG database: new
developments in phylogenetic classification of proteins from
complete genomes, Nucleic Acids Res. 29, 22–28.
Teter, S. A., Houry, W. A., Ang, D., Tradler, T., Rockabrand, D.,
Fischer, G., Blum, P., Georgopoulos, C., and Hartl, F. U.
1999. Polypeptide flux through bacterial Hsp70: DnaK cooperates
with trigger factor in chaperoning nascent chains, Cell
97, 755–765.
Thomas, D. C., Svoboda, D. L., Vos, J. M. H., and Kunkel, T. A.
1996. Strand specificity of mutagenic bypass replication of DNA
containing psoralen monoadducts in a human cell extract, Mol.
Cell. Biol. 16, 2537–2544.
VanBogelen, R. A., Abshire, K. Z., Pertsemlidis, A., Clark, R. L., and
Neidhardt, F. C. 1996. Gene-protein database of Escherichia coli
K-12, edition 6, in “Escherichia coli and Salmonella: Cellular and
Molecular Biology,” 2nd ed. (F. C. Neidhardt, R. Curtiss III, J. L.
Ingraham, E. C. C. Lin, K. B. Low, B. Magasanik, W. S.
Reznikoff, M. Riley, M. Schaechter and H. E. Umbarger, Eds.),
pp. 2067–2117, ASM Press, Washington, D.C.
VanBogelen, R. A., Schiller, E. E., Thomas, J. D., and Neidhardt, F.
C. 1999. Diagnosis of cellular states of microbial organisms using
proteomics, Electrophoresis 20, 2149–2159.
Warner, J. R. 1999. The economics of ribosome biosynthesis in yeast,
Trends Biochem. Sci. 24, 437–440.
Winkler, H. H., and Daugherty, R. M. 1986. Acquisition of glucose by
Rickettsia prowazekii through the nucleotide intermediate uridine
5’-diphosphoglucose, J. Bacteriol. 167, 805–808.
Woese, C. R., Kandler, O., and Wheelis, M. L. 1990. Towards a natural
system of organisms: Proposal for the domains Archaea, Bacteria,
and Eucarya, Proc. Natl. Acad. Sci. USA 87, 4576–4579.
Woese, C. 1998. The universal ancestor, Proc. Natl. Acad. Sci. USA 95,
6854–6859.
Wolf, Y. I., Aravind, L., Grishin, N. V., and Koonin, E. V. 1999.
Evolution of aminoacyl-tRNA synthetases: analysis of unique
domain architectures and phylogenetic trees reveals a complex
history of horizontal gene transfer events, Genome Res. 9, 689–710.
Zillig, W., Klenk, H. P., Palm, P., Puhler, G., Gropp, F., Garrett, R.
A., and Leffers, H. 1989. The phylogenetic relations of
DNA-dependent RNA polymerases of archaebacteria, eukaryotes, and
eubacteria, Can. J. Microbiol. 35, 73–80.