Download A Tree of Life Based on Protein Domain Organizations

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Site-specific recombinase technology wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Human genome wikipedia , lookup

Metagenomics wikipedia , lookup

NEDD9 wikipedia , lookup

Genomic library wikipedia , lookup

History of genetic engineering wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Microevolution wikipedia , lookup

Pathogenomics wikipedia , lookup

Maximum parsimony (phylogenetics) wikipedia , lookup

Gene expression programming wikipedia , lookup

Genomics wikipedia , lookup

Minimal genome wikipedia , lookup

Genome editing wikipedia , lookup

Helitron (biology) wikipedia , lookup

Quantitative comparative linguistics wikipedia , lookup

Genome evolution wikipedia , lookup

Computational phylogenetics wikipedia , lookup

Transcript
A Tree of Life Based on Protein Domain Organizations
Kaoru Fukami-Kobayashi,* Yoshiaki Minezaki,à Yoshio Tateno,à and Ken Nishikawaà
*RIKEN BioResource Center, RIKEN Tsukuba Institute, University of Tsukuba, Tsukuba, Ibaraki, Japan; School of Library and
Information Science, University of Tsukuba, Tsukuba, Ibaraki, Japan; and àCenter for Information Biology and DNA Data Bank of
Japan, National Institute of Genetics, Research Organization of Information and Systems, Mishima, Japan
It is desirable to estimate a tree of life, a species tree including all available species in the 3 superkingdoms, Archaea, Bacteria,
and Eukaryota, using not a limited number of genes but full-scale genome information. Here, we report a new method for
constructing a tree of life based on protein domain organizations, that is, sequential order of domains in a protein, of all
proteins detected in a genome of an organism. The new method is free from the identification of orthologous gene sets and
therefore does not require the burdensome and error-prone computation. By pairwise comparisons of the repertoires of protein domain organizations of 17 archaeal, 136 bacterial, and 14 eukaryotic organisms, we computed evolutionary distances
among them and constructed a tree of life. Our tree shows monophyly in Archaea, Bacteria, and Eukaryota and then monophyly in each of eukaryotic kingdoms and in most bacterial phyla. In addition, the branching pattern of the bacterial phyla in
our tree is consistent with the widely accepted bacterial taxonomy and is very close to other genome-based trees. A couple of
inconsistent aspects between the traditional trees and the genome-based trees including ours, however, would perhaps urge to
revise the conventional view, particularly on the phylogenetic positions of hyperthermophiles.
Introduction
As a number of genome projects produced the complete genome sequences for various species, genome-wide
information has been employed to construct a genome tree
for inferring evolutionary relationship of species. Accordingly, several methods for constructing a genome tree have
been invented. Some authors compared the whole genomes
with respect to the gene contents (Fitz-Gibbon and House
1999; Snel et al. 1999; Tekaia et al. 1999; Wolf, Rogozin,
Grishin, et al. 2001; Korbel et al. 2002) to the domain contents (Gerstein 1998; Wolf et al. 1999; Lin and Gerstein
2000; Caetano-Anolles G and Caetano-Anolles D 2003;
Winstanley et al. 2005; Yang et al. 2005), to the gene order
(Wolf, Rogozin, Kondrashov, Koonin 2001; Korbel et al.
2002), or to the total evolutionary distances between a pair
of orthologs (Wolf, Rogozin, Grishin, et al. 2001; Clarke
et al. 2002), others dealt with multiple genes by concatenating the aligned sequences (Hansmann and Martin 2000;
Brown et al. 2001; Katoh et al. 2001; Wolf, Rogozin,
Grishin, et al. 2001), and still others synthesized a compound tree by uniting multiple trees constructed from
individual genes (Daubin et al. 2001).
In the pregenome era, we used to construct phylogenetic trees for individual genes or proteins, which have
greatly contributed to elucidating unknown, unclear, or
controversial phylogenetic positions of a number of species
and populations. Unfortunately, however, trees constructed
for different genes or proteins are often mutually incompatible (Wolf et al. 2002; Bapteste et al. 2004). The reasons for
this could be due to imperfectness of a tree-construction
method, stochastic nature of nucleotide substitution, polymorphism in a common ancestral population, horizontal
gene transfer (HGT), and/or nondiscrimination of paralogous from orthologous genes. The primary reason, however, lies in the fact that most of the constructed trees
are gene trees (Tateno et al. 1982) and not species trees.
Because not all genes have evolved consistently with their
host species owing to some of the biological factors mentioned above, a gene tree is sometimes inconsistent with the
way the species in question have evolved. One way to resolve the inconsistency between the gene and species trees
is to use as many genes in a genome as possible (Tateno
et al. 1982). It has been reported that the more genes we
use, the less influence we suffer from the disturbances mentioned and the closer we can reach the correct tree (Rokas
et al. 2003). It is thus desirable to use the entire and pertinent information on the whole genome, if we dare to construct a species tree as a first approximation of the
evolutionary relationship of species.
Here, we report a new method for constructing a species
tree based on protein domain organizations. We determine
domain organization of each protein by the sequential order
of the domains in the protein, and then define a repertoire of
protein domain organizations of an organism as a set of domain organizations of all proteins encoded in its genome. By
using difference of the repertoires among species, a species
tree is constructed. Proteins with the same domain organization are roughly regarded as homologous over their total
length. Therefore, our method compares the whole genomes
with respect to homologous gene contents by the medium of
protein domains. Because our method is based not only on
the contents of domains but also their order along the primary
structure in each protein, it treats more detailed aspects of
evolution than the methods based on the domain contents
alone (Winstanley et al. 2005; Yang et al. 2005). Moreover,
like the domain content methods, our method does not require an orthologous gene set of the species studied for which
numerous complicated and burdensome procedures are unavoidable. Applying our method to 17 archaeal, 136 bacterial, and 14 eukaryotic organisms, we have constructed a tree
of life that is quite consistent with the taxonomy widely accepted today. We then discuss consistent and inconsistent
aspects between our tree and the traditional tree focusing particularly on the phylogenetic positions of hyperthermophiles.
Key words: universal tree, protein domain, genome, phylogeny,
hyperthermophile.
Materials and Methods
Protein Domains
E-mail: [email protected].
Mol. Biol. Evol. 24(5):1181–1189. 2007
doi:10.1093/molbev/msm034
Advance Access publication March 1, 2007
We used the Protein families database of alignments
and HMMs (Hidden Markov Models) (Pfam) (Bateman
et al. 2004) version 11.0 to assign the Pfam domains to
Ó 2007 The Authors.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
1182 Fukami-Kobayashi et al.
the open reading frame (ORF) sequences in the Genomes
TO Protein structures and functions (GTOP) database
(Kawabata et al. 2002) as of May 2004 (ftp://spock.genes.
nig.ac.jp/pub/gtop/Old_04Jul/). GTOP contains all ORFs
of the organisms for which the whole-genome sequence
has been reported or the complete set of cDNA clones is
sequenced. The Pfam domains were assigned for the ORFs
by HMM (Durbin et al. 1998) using GA (gathering cutoffs)
scores recorded for individual protein families (Eddy 2003).
Assignment of the Structural Classification of Proteins
(SCOP) superfamily domains was also carried out by applying reverse PSI-Blast (position specific iterative Blast
[Basic Local Alignment Search Tool]) (Altschul et al.
1997) to the SCOP database (Murzin et al. 1995) version
1.65. The cutoff e value was 103.
We primarily adopted the Pfam domains for construction of a repertoire of protein domain organizations.
Although the SCOP domains have been commonly used
in the domain contents methods (Gerstein 1998; Wolf
et al. 1999; Lin and Gerstein 2000; Caetano-Anolles G
and Caetano-Anolles D 2003; Winstanley et al. 2005; Yang
et al. 2005), we supposed that the Pfam domains were more
suitable to the present study for the following 2 reasons. First,
the Pfam domains are defined by sequence conservation and
have clear evolutionary relationship, which is important to
phylogenetic analysis. Second, the Pfam domains are more
extensively assigned to the ORF regions and are expected to
bring out more information from genome sequences than the
SCOP domains: the former covered 50.5% of the ORF
regions, whereas the SCOP domains covered only 40.2%,
when measured in the residue basis in GTOP. In order to
check our expectation and robustness of our method, we
applied our method also to the SCOP superfamily domains.
Domain Organizations and Repertoires
The domain organization of a protein is defined as the
sequential order of the domains in the protein. For example,
if a protein has domains A, B, and C along its primary structure in that order, its domain organization is defined as
‘‘ABC’’ (fig. 1). It is distinguished from the domain organization ‘‘BAC,’’ which consists of the same domains but
their order is different, and is also distinguished from the
domain organization ‘‘ABCC,’’ which has an extra C domain. The domain organization could also be composed
of one domain. Regions to which the same domain is
assigned are guaranteed to reflect their common ancestry,
and therefore, proteins with the same domain organization
are roughly regarded as homologous over their total length.
The repertoire of protein domain organizations of an
organism is defined as the set of domain organizations of all
proteins encoded in the genome of the organism. Only the
presence or absence of a domain organization was considered for the construction of the repertoire of each organism.
Evolutionary Distances and a Phylogenetic Tree
Let us consider 2 organisms A and B that diverged
from a common ancestor O. Let us also denote that the numbers of domain organizations of A and B are nA and nB,
respectively, and the number of domain organizations
shared by A and B is nAB. A and B initially share the same
FIG. 1.—Protein and its domain organization. The domain organization of a protein is defined as the sequential order of the domains in the
protein. The domain organization could also be composed of one domain.
repertoires (i.e., nA 5 nAB 5 nB). As A and B diverge, nAB
decreases because they independently accumulate new domain organizations and lose extant ones. Assuming that the
gain and loss respectively follow the Poisson process, the
evolutionary distance between O and A, dOA, is given by
dOA 5 lnðnAB =nA Þ:
ð1Þ
Similarly, the evolutionary distance between O and B, dOB,
is obtained.
The values of dOA and dOB are more or less the same in
most cases. However, if one of the organisms, B for instance, has a substantially smaller number of the domain
organizations than A, dOA is much larger than dOB. The excess of dOA is mostly due to the shrinkage of the genome
size of B from the original one, which often occurs in parasites or symbionts. To reduce this adverse effect, we defined
the geometric mean of dOA and dOB as the evolutionary distance between A and B given by
1=2
dAB 5 ðdOA 3dOB Þ :
ð2Þ
Actually, we have explored other ways to define dAB, by
employing arithmetic mean and harmonic mean and by taking a smaller value between dOA and dOB and confirmed
that geometric mean showed the best performance among
them. In this sense the employment of geometric mean is
rather empirical, yet it also has some basis: because the geometric mean weighs a smaller value between the 2, the effect of the genome shrinkage on the evolutionary distance is
effectively reduced. To see this clearly, let us consider an
extreme case in which the genome shrinkage occurred in B
and no other changes occurred in A and B, then dOB is zero
by definition, and thus dAB is expectedly zero.
The evolutionary distance between every pair of the
organisms studied was computed, and a tree of life was
accordingly constructed by the Neighbor-Joining (NJ)
method (Saitou and Nei 1987).
Bootstrap Probability
We evaluated the reliability of each internal branch of
the constructed tree by the bootstrap method (Felsenstein
A Tree of Life Based on Domain Organizations
FIG. 2.—Correlation between the number of ORFs and the number
of domain organizations. Each spot represents an organism of Archaea
(triangle), Bacteria (square), or Eukaryote (diamond). The 2 numbers
are roughly proportional to one another, except for plant organisms
(marked with arrows). Both numbers for eukaryotic organisms are distinctively larger than those of archaeal or bacterial organisms.
1985) modified by us for the domain organization repertoire
data. The modification is as follows. Let us consider a table
of m rows and n columns, in which the rows and the columns stand for m organisms and n domain organizations
respectively, and presence or absence of a domain organization of an organism is recorded in each cell. In bootstrap
resampling, n columns were randomly sampled with replacement from the original table to make a resampled table. Using the resampled table, evolutionary distances and
then a tree were obtained by the same methods as those
mentioned in the above section. These processes were repeated 100 times. We then compared the topology of each
generated tree with that of the original tree and counted the
number of occurrences of each branch of the original tree
for the 100 generated trees. The percentile number of the
occurrences for each branch is the bootstrap probability
for the branch in this case.
Construction of a Network
Using the evolutionary distances used for a tree construction, we also constructed a network with NeighborNet
(Bryant and Moulton 2004) to reveal some of the conflicting signals that might harbor in the distance data.
Results
Properties of Domain Organization Repertoires
We selected 17 archaeal, 136 bacterial, and 14 eukaryotic organisms, for which the whole-genome sequence
or the complete set of cDNA clones is sequenced. Each
archaeal and eukaryotic organism represents its species,
and the bacterial organisms are derived from 105 species,
some of which contain different strains. The names, codes,
numbers of ORFs, genome sizes, and National Center for
Biotechnology Information (NCBI) taxonomy of the organisms analyzed are listed in supplementary table S1
(Supplementary Material online).
1183
FIG. 3.—Correlation between the number of domain organizations
and the proportion of shared domain organizations. The correlation was
computed only for the bacterial organisms. Each spot represents a bacterial
organism, and the proportion of a given organism was obtained by averaging all the proportions computed between the organism and the other
bacterial organisms.
The number of the domain organizations extracted
from each organism (nX) ranged between 266 and 4,762
among the 167 organisms, and 18,302 domain organizations were extracted from the organisms as a whole. As expected, nX was distinctively larger for eukaryotic organisms
than for archaeal or bacterial organisms. In prokaryotes, nX
is larger for a bacterial organism than for an archaeal organism, except for a parasitic or symbiotic bacterial organism
such as Mycoplasma, Rickettsia, Chlamydia, or Buchnera.
In general, nX is roughly proportional to the number of
ORFs in an organism (fig. 2), although plant organisms
have small nX for their number of ORFs, as compared with
other organisms. This is not due to ineffective detection of
the Pfam domains in the plant proteins because the domains
are no less extensively assigned to the ORF regions in the
plant genomes (e.g., the Pfam cover rate for A. thaliana is
40.2%) than in the other eukaryotic genomes (e.g., 38.2%
for human). The most probable explanation for the relatively small nX in the plants is the recent genome duplication in them (The Arabidopsis Genome Initiative 2000;
Goff et al. 2002; Yu et al. 2005), by which the number
of ORFs was doubled, whereas that of domain organizations was not.
We then counted nAB, the number of domain organizations shared by organism A and organism B, for every
pair of the 167 organisms. The proportion of the shared
nAB to the total nA or nB ranged between 0.03 and 1.00.
The proportions for a pair of organisms, nAB/nA and
nAB/nB, are expectedly larger as the organisms A and B
relate more closely. The proportion also tends to increase,
as the total nA deceases (fig. 3). This is consistent with the
fact that most genes in a small genome are essential and
ubiquitous.
Construction of a Domain Organization Tree
Using the nAB/nA and nAB/nB values, we computed an
evolutionary distance for every pair of the 167 organisms
and constructed a phylogenetic tree of them as shown in
figure 4 (see Supplementary fig. S1, Supplementary Material
online for an enlarged version of the tree). The organisms in
1184 Fukami-Kobayashi et al.
ecol0
ecol1
ecol2
ecol3
sfle0
sfle1
100 styp0
100
styp2
styp1
100
baph0
100
buch0
baph1
83 ypes0
100
ypes1
ypes2
bflo0
plum0
wbre0
hduc0
100
99
hinf0
pmul0
vcho0
100
vvul0
100
vvul1
vpar0
sone0
paer0
100
pput0
96
psyr0
neur0
100 bbro0
100
bpar0
99
bper0
rsol0
cvio0
100
nmen0
nmen1
cbur0
100
xaxo0
100
xcam0
100
xfas0
xfas1
100
atum0
atum1
smel0
100
mlot0
100
bmel0
92
bsui0
100
bjap0
rpal0
100
100
rcon0
rpro0
ccre0
cjej0
86
hhep0
95
100
100
hpyl0
hpyl1
wsuc0
aaeo0
gsul0
bbac0
100
lint0
lint1
}
100 100
97
100
Bacteria
γ-proteobacteria
β-proteobacteria
100
α-proteobacteria
ε-proteobacteria
δ -proteobacteria
100
ccav0
cmur0
ctra0
cpne0
100 cpne1
cpne3
cpne2
rbal0
bthe0
pgin0
92
tpal0
bbur0
Spirochaetes
100 95
100
ctep0
bsub0
bhal0
oihe0
bant0
100
bcer1
bcer0
99 saur0
100
saur1
100
saur2
sepi0
100
linn0
lmon0
100
99
87
97
97
100
93
97
100
96
100
100
97
100
87
87
100
86
100
cace0
cper0
Bacteroidetes
Bacillales
efae0
ljoh0
lpla0
llac0
100
saga0
saga1
spyo0
100
spyo1
spyo2
100
100 spyo3
100
spne0
spne1
smut0
100
81
Chlamydiae
Lactobacillales
mgal0
mgen0
mpne0
uure0
mpul0
mpen0
past0
Mollicutes
ctet0
Clostridia
tten0
tmar0
fnuc0
blon0
cdip0
100
ceff0
cglu0
100
100
98
100 mbov0
mtub0
100
mtub1
100
94
mlep0
100
100
save0
scoe0
100 twhi0
twhi1
anab0
100
telo0
syne0
96
98
pmar0
100
pmar2
100
pmar1
syne1
gvio0
100
drad0
tthe0
Deinococcus-Thermus
halo0
100
mace0
mmaz0
mjan0
100
mkan0
84
100
mthe0
mmar0
100
aful0
100
96
paby0
100
phor0
pfur0
100
taci0
tvol0
90
aper0
100
paer1
100
ssol0
stok0
hsap0
100
mmus0
100
rnor0
96
100
drer0
trub0
93
100
cint0
100
cbri0
cele0
100
dmel0
100
scer0
100
spom0
Fungi
ncra0
atha0
osat1
98
}
Proteobacteria
Firmicutes
Actinobacteria
Cyanobacteria
Euryarchaeota
Crenarchaeota
Archaea
Metazoa
100
100
Viridiplantae
Eukaryota
FIG. 4.—A phylogenetic tree based on the domain organization repertoires. The tree includes 17 archaeal, 136 bacterial, and 14 eukaryotic organisms.
The organisms in the tree are color coded according to the NCBI taxonomy (Benson et al. 2000; Wheeler et al. 2000). The Pfam (Bateman et al. 2004) domain
organizations were used for constructing the domain organization repertoire of each organism. Evolutionary distance between every pair of the 167
organisms was computed based on their domain organization repertoires, and the tree was constructed by the NJ method (Saitou and Nei 1987). The
number on an internal branch shows a percentage of bootstrap probability (Felsenstein 1985). Only bootstrap values equal to or higher than 80% are shown.
A Tree of Life Based on Domain Organizations
the phylogenetic tree are color coded following the NCBI
taxonomy (Benson et al. 2000; Wheeler et al. 2000).
The topology of the tree is quite consistent with the
NCBI taxonomy and other traditional ones (Garrity et al.
2004) in the following 5 aspects. First, the organisms are
clearly divided into the 3 superkingdoms, Archaea, Bacteria, and Eukaryota. Secondly, all the 3 eukaryote kingdoms
and most prokaryote phyla show monophyly at a high
bootstrap probability. Monophyly is also observed in many
bacterial sections, a lower rank to phylum. Although Beta-/
Gammaproteobacteria sections are not clearly separated,
this is rather due to historical misclassification of Betaproteobacteria as a separate phylum. Thirdly, most of the
branching patterns of the bacterial phyla are in agreement
with their evolutionary relationships that are widely accepted today, although bootstrap support for the branching
patterns is not significant. For example, it is considered now
that Deinococcus-Thermus phylum diverged very early in
the evolution of Bacteria (Gupta 1998; Brown 2003), which
is also shown in our tree. Our tree also shows early divergence of Cyanobacteria phylum, clustering of gram-positive
phyla Actinobacteria and Firmicutes together, and close
relationships of Chlamydiae, Bacteroides, and Spirochaetes to Proteobacteria. Fourthly, the branching patterns
of the eukaryotic organisms are in agreement with their
evolutionary relationships in the traditional taxonomy, although the branches in the animal (Metazoa) lineage and
those in the plant (Viridiplantae) lineage are respectively
longer and shorter than those in the other lineages. The
longer branches in the animal lineage probably reflect
an increase in the rate of change of domain organization
repertoires in this lineage. Finally, our tree places parasitic
and symbiotic organisms in the appropriate taxonomic
positions. Many genome trees do not show the correct
positions of those organisms and cluster them together
regardless of their taxonomic positions, whereas some genome trees resolve the problem and show the correct positions (Korbel et al. 2002; Kunin et al. 2005; Yang et al.
2005). Our method has also overcome the problem. These
5 aspects together confirm that the difference of the domain organization repertoires among organisms is a reliable measure for inferring their evolutionary lineages.
There are, however, also inconsistent aspects between
the traditional tree and ours, which are also found in genome
trees constructed by other authors. First, our tree shows that
Euryarchaeota is not monophyly in the Archaea cluster.
Whereas one of the Euryarchaeota organisms, Halobacterium
sp. NRC-1 (halo0), is placed on a separate basal branch of the
cluster, 2 Euryarchaeota organisms, Thermoplasma acidophilum DSM1728 (taci0) and Thermoplasma volcanium
GSS1 (tvol0), cluster together with Crenarchaeota rather
than the other Euryarchaeota organisms. The same relationships in the Archaea cluster have also been obtained in other
genome trees (Korbel et al. 2002; Wolf et al. 2002; Yang
et al. 2005). Secondly, the Bacillales section in our tree does
not show monophyly (Wolf, Rogozin, Grishin, et al. 2001;
Korbel et al. 2002; Wolf et al. 2002; Yang et al. 2005).
Thirdly, the Spirochaetes organisms in our tree do not make
a cluster (Yang et al. 2005). Fourthly, the Proteobacteria
cluster in our tree includes thermophilic bacterium, Aquifex
aeolicus (aaeo0) (Wolf et al. 2002; Yang et al. 2005).
1185
Finally, the Firmicutes cluster includes another thermophilic
bacterium, Thermotoga maritima (tmar0), and fusobacteria,
Fusobacterium nucleatum (fnuc0) (Yang et al. 2005).
Those inconsistent aspects are not conclusive to state
that our tree is incompatible with the traditional trees for the
following 2 reasons. First, some branches involved in the
inconsistent clusters have too low bootstrap probabilities
to be valid. Our tree cannot resolve deep phylogeny, which
is a general feature of trees of life. The branch clustering
Thermoplasmata and Crenarchaeota, for example, shows
less than 50% bootstrap probability, and so do the branches
that prevent Spirochaetes from being monophyly. Secondly,
a taxonomic unit sometimes corresponds to a paraphyletic
group such as reptiles in the traditional classification
(Graur and Li 2000). Because the Bacillales organisms make
a paraphyletic group in the Firmicutes cluster in our tree, and
so do the Euryarchaeota organisms in the Archaea cluster,
they can be considered as taxonomic units.
A SCOP Tree and a Domain Contents Tree
We also applied our method to the SCOP superfamily
domains in place of the Pfam domains (supplementary
fig. S2, Supplementary Material online). As can be seen
in figures 4 (or supplementary fig. S1, Supplementary
Material online) and supplementary figure S2 (Supplementary Material online), the 2 trees have similar topology. The
SCOP tree, however, is a little more different from the traditional taxonomy than the Pfam tree. The Cyanobacteria
phylum clusters with the Proteobacteria phylum in the
SCOP tree. The phyla Chlamydiae, Bacteroides, and a part
of Spirochaetes are more closely related to the Firmicutes
phylum in the SCOP tree than in the traditional taxonomy.
In addition, the overall bootstrap probabilities are lower in
the SCOP tree than in the Pfam tree. We considered, therefore, the Pfam tree better than the SCOP tree. The difference
in the 2 trees could be attributed to the difference in the
amount of information between the Pfam and SCOP
domains as mentioned in Materials and Methods.
We also constructed a tree using the Pfam domains
instead of the domain organizations (supplementary fig.
S3, Supplementary Material online) to see if the domain
organizations bring a better result than the domain contents.
The domain contents tree has more similar topology to the
domain organization tree than the SCOP tree: it exhibits an
early diversification of the Cyanobacteria phylum as well as
the domain organization tree. However, like the SCOP tree,
the domain contents tree shows closer phylogenetic relationship between the Firmicutes phylum and Chlamydiae,
Bacteroides, and a part of Spirochaetes than the traditional
taxonomy indicates. In addition, it does not show the Mollicutes section as monophyly, and the overall bootstrap
probabilities of the domain content tree are lower than those
of the domain organization tree. These results demonstrate
that the tree constructed from the domain organizations is
superior to the tree constructed from the domain contents
alone. This is mainly because the change of domain organization repertoires is caused not only by the gain or loss of
a domain but also by the reorganization of existing
domains, whereas the change of domain contents is related
only with the former.
1186 Fukami-Kobayashi et al.
Bacteria
α-proteobacteria
rcon0
rpro0
xfas0
atum0
xfas1
atum1
xcam0
δ-proteobacteria mlot0 bsui0 xaxo0
bmel0 cbur0
Spirochaetes bbac0 smel0
bjap0
rpal0
lint0
rbal0 lint1 ccre0
pgin0
bthe0
ctep0
0.1
Bacteroidetes
Actinobacteria
cdip0 blon0
ceff0
mbov0
cglu0
mtub0
mtub1 mlep0
scoe0
save0
twhi0
twhi1
γ-proteobacteria
-proteobacteria
β-proteobacteria
ε-proteobacteria
drad0
tthe0
Euryarchaeota
ecol3
sfle1
ecol1
ecol2 buch0
styp0 baph1
sfle0 baph0 bflo0
styp1
plum0 hduc0
wbre0 styp2 ypes1 hinf0
ypes2 pmul0
ecol0
ypes0
vcho0 vvul0
vvul1
vpar0
sone0
psyr0
pput0
paer0 nmen0
cvio0 nmen1
neur0
bper0
bbro0 bpar0
rsol0
hpyl1
hpyl0 hhep0
cjej0
wsuc0
aaeo0
gsul0
δ-proteobacteria
DeinococcusThermus
ccav0 ctra0
cmur0
cpne2 cpne1
cpne3 cpne0
mace0
mmaz0
Chlamydiae
mmar0
mkan0
mjan0 mthe0
aful0
paby0
phor0 pfur0
aper0
paer1
ssol0 taci0
halo0
stok0
tvol0
Archaea
Crenarchaeota
tten0
cace0
cper0
ctet0
fnuc0
gvio0
pmar2
anab0
pmar1
telo0
syne1
syne0 pmar0
Cyanobacteria
bsub0
oihe0
bhal0 sepi0
bant0 saur1
bcer1 saur0
bcer0 saur2
lmon0
linn0
Bacillales
ncra0
spom0
scer0
Fungi
dmel0
cele0
cbri0
cint0
trub0
rnor0 hsap0
tpal0 bbur0
tmar0
Spirochaetes
Clostridia
mpen0
mgen0
spne1
mpne0
spne0 past0
mgal0
smut0
uure0 mpul0
saga1
saga0
lpla0
llac0
efae0 spyo2
ljoh0 spyo0
spyo1
spyo3
Mollicutes
Lactobacillales
atha0
osat1
Viridiplantae
Eukaryota
mmus0
drer0
Metazoa
FIG. 5.—A network based on the domain organization repertoires. The network was constructed with NeighborNet (Bryant and Moulton 2004) using
the same evolutionary distances as used for the tree in figure 4. The network exhibits not only similar evolutionary relationship to that of the figure 4 tree
but also some conflicting signals harbored in the distance data, which are represented by many parallelograms in the network.
Construction of a Domain Organization Network
Using the evolutionary distances computed from the
Pfam domain organizations, we constructed a network as
shown in figure 5 (see Supplementary fig. S4, Supplementary Material online, for an enlarged version of the network). The network exhibits similar evolutionary
relationships among the organisms to that of the Pfam tree.
In addition, it reveals some conflicting signals, which are
represented by many parallelograms in the network.
Discussion
One of the new and strong points of our method is to
use domain organization for construction of a phylogenetic
tree. A tree of life necessarily incorporates all the available
species from the 3 superkingdoms, Archaea, Bacteria, and
Eukaryota. In addition, it is desirable to construct it from
whole genome not from a limited number of genes. Therefore, a method for constructing a tree of life should be applicable to a huge amount of genome data. In our method,
a protein domain is considered as an evolutionary unit,
instead of an amino acid residue. Such a coarse grained
way greatly compresses the genome data and enables us
to incorporate all the available species into a tree. Although
protein domains do not represent the whole genome of
an organism, our method shows one of the most feasible
ways to cover a skyrocketing amount of genome data.
Our method scales the change of a domain organization repertoire, that is, appearance and disappearance of domain organization in an organism, which is a rare event in
the course of evolution (Fukami-Kobayashi et al. 1993;
Apic et al. 2003; Fukami-Kobayashi et al. 2003). This feature is advantageous for elucidating phylogenetic relationships of remotely related species. On the other hand, our
method treats more detailed aspects of evolution than the
methods based on the domain contents because the domain
organization considers not only domain contents of a protein
but also their sequential order. The domain organization
tree demonstrated a better result than the domain contents
tree, which may suggest that domain contents does not give
sufficient information and domain organization gives us
moderately reduced information for construction of a tree
of life.
A Tree of Life Based on Domain Organizations
Our method has the following 6 features that would
perhaps make it superior to the methods using gene or protein sequences in the construction of a tree of life. The first 3
features are also inherent in other genome-based methods
using gene contents, gene order, or total evolutionary distances. 1) It can incorporate the information of a protein
even if the protein is not shared among the species in question. On the other hand, a sequence method can use only
genes or proteins that are shared among all the species studied. When constructing a tree of superkingdoms, one has to
use limited genes such as ribosomal RNA (rRNA), ribosomal protein, elongation factor Tu, and aminoacyl-tRNA
synthetase (AARS) genes, which are almost exclusively involved in information storage and processing (Ciccarelli
et al. 2006). 2) It does not require sequence alignment
for the proteins in question. When comparing among superkingdoms, it is difficult to obtain reliable alignments even
for the most conservative amino acid or nucleotide sequences. That is true of rRNAs (Hasegawa and Hashimoto 1993;
Gupta 1998). In addition, the results of tree construction
depend substantially upon which parts of the alignment
are used as ‘‘conservative’’ regions (Hansmann and Martin
2000). 3) It is resistant to mutational saturation, base (or
amino acid) composition biases, and heterogeneity of evolutionary rates among different evolutionary lineages.
Those pitfalls prevent sequence methods from constructing
the correct tree for remotely related species (Gribaldo and
Philippe 2002). 4) The computational time required is
roughly proportional to the number of genomes, because
most of the time is spared for mapping domains in each genome. If we conduct a round robin blasting of all genes of
a genome against the other, the computational time will be
roughly proportional to the square number of genomes. As
the number of genomes is increased, this problem will become more and more serious. 5) It does not require orthologous gene sets. Computation of orthologous gene sets is
accompanied with numerous complicated and burdensome
procedures, and certainly error prone. 6) It judges proteins
homologous or not by the medium of protein domain organizations. It is practically difficult to define homologous proteins using sequence similarity because the optimum cutoff
criteria, Blast score/e value and coverage of homologous
regions, for example, vary from protein to protein. Our
method takes simpler and more feasible strategy.
In addition, our method discriminates distantly related
species in appearance due to genome shrinkage from that in
reality and consequently places parasitic and symbiotic
organisms in the appropriate taxonomic positions in a tree.
Although some genome-based methods (Korbel et al. 2002;
Kunin et al. 2005; Yang et al. 2005) achieve the correct positioning by normalizing genome sizes, our method achieve
it by taking a geometric mean over the evolutionary distances (see Materials and Methods for details).
However, no method is free from drawbacks, and our
method is no exception. The factor that may most seriously
influence not only our method but also other genome-based
methods in general is a massive amount of HGTs (Kunin
and Ouzounis 2003) supposed to have occurred at the early
stage of the cellular-organism evolution (Aravind et al.
1998; Brown 2003). There is now pretty strong evidence
that there is a considerable amount of HGTs in genomes.
1187
It has also been shown that many genes in eukaryotes originated from bacteria and mitochondria (Rivera and Lake
2004; Embley and Martin 2006). Recently, it has been
reported that even vitally important genes such as the
recA/RAD51 family genes were acquired via endosymbiotic gene transfer (Lin et al. 2006). Dutilh et al. (2004) tried
to correct the HGT effect in their gene-content method, and
showed that positioning of Halobacterium at the base of
Archaea was attributed to extensive levels of directed
HGT in this organism. As shown in the network of figure
5, our distance data computed from domain organization
repertoire harbor some conflicting signals, which mainly
came from HGT. We have to keep in mind, therefore, that
even a genome-based tree is just an approximation of evolutionary relationship of species and might not accurately
describe their actual evolutionary process. This is a major
issue in molecular phylogeny and is too often overlooked.
Whereas the overall topology of our tree is in good
agreement with the conventional one, there are a number
of inconsistencies between them as already described in
Results. Among them, the phylogenetic positions of hyperthermophilic bacteria are particularly noteworthy. Contrary
to the 16S rRNA tree (Woese 1987; Brown 2003) or the
trees constructed from protein sequences (Brown et al.
2001; Katoh et al. 2001), hyperthermophilic bacteria, A.
aeolicus (aaeo0) and T. maritima (tmar0), are not the earliest branching lineages in our tree. Especially, tmar0 has
traditionally been located at the deepest branch of eubacteria (Woese 1987; Brown 2003), whereas our tree shows that
it is closely related to Clostridia in the Firmicutes cluster
with 98% bootstrap value. On the other hand, aaeo0 is located next to Deltaproteobacteria, although its bootstrap
value is low. We can reject the attribution of HGTs in this
case: if HGTs had occurred in these hyperthermophilic bacteria at the early stage of the cellular-organism evolution,
they would have moved the phylogenetic positions of these
species to earlier ones. Only a hyperthermophilic bacterium, Thermus thermophilus HB27 (tthe0), is shown to
have diverged very early in the bacterial lineage. Similar
results on the phylogenetic positioning of hyperthermophilic bacteria are reported also in other genome trees
(Korbel et al. 2002; Wolf et al. 2002; Yang et al. 2005).
Regarding this issue, it is interesting that some conflicting
signals are seen between T. thermophilus HB27 (tthe0) and
the Archaea cluster in figure 5.
Our tree together with other genome trees, therefore,
may urge to reconsider the conventional view that thermophilic organisms are the earliest cellular life-forms. The
conventional view has been derived from gene trees constructed from nucleotide or amino acid sequences and could
be suffered from nucleotide or amino acid composition
biases characteristic to hyperthermophilic DNAs/proteins
(Hasegawa and Hashimoto 1993; Cambillau and Claverie
2000; Fukuchi and Nishikawa 2001; Nakashima et al.
2003). The biases would make hyperthermophilic DNAs/
proteins of Bacteria and Archaea resemble in sequences
and bring them together toward the root of their gene tree.
Our method, as already discussed, is completely free from
such biases. Other genome trees (Korbel et al. 2002;
Wolf et al. 2002; Yang et al. 2005) are also free from such
biases and show similar results to ours. Therefore, the
1188 Fukami-Kobayashi et al.
phylogenetic positions of hyperthermophiles in the present
study are expected to be more realistic than the traditional
ones. Of course, in order to draw the conclusive lineages,
we would need more genome data of hyperthermophiles
and other evidence from various biological viewpoints.
Supplementary Materials
Supplementary table S1 and figures S1–S4 are
available at Molecular Biology and Evolution online
(http://www.mbe.oxfordjournals.org/).
Acknowledgments
We thank Mr S. Sakamoto for his help and advice
on the use of GTOP. This work was supported in part
by Grant-in-Aid for Scientific Research on Priority Areas
(C) ‘‘Genome Biology’’ from the Ministry of Education,
Culture, Sports, Science and Technology of Japan.
Funding to pay the Open Access publication charges
for this article was provided by the Japan Society for the
Promotion of Science Grant #16255006.
Literature Cited
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller
W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids
Res. 25:3389–3402.
Apic G, Huber W, Teichmann SA. 2003. Multi-domain protein
families and domain pairs: comparison with known structures
and a random model of domain recombination. J Struct Funct
Genomics. 4:67–78.
Aravind L, Tatusov RL, Wolf YI, Walker DR, Koonin EV. 1998.
Evidence for massive gene exchange between archaeal and
bacterial hyperthermophiles. Trends Genet. 14:442–444.
Bapteste E, Boucher Y, Leigh J, Doolittle WF. 2004. Phylogenetic
reconstruction and lateral gene transfer. Trends Microbiol.
12:406–411.
Bateman A, Coin L, Durbin R, et al. (13 co-authors) 2004. The Pfam
protein families database. Nucleic Acids Res. 32:D138–D141.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA,
Wheeler DL. 2000. GenBank. Nucleic Acids Res. 28:15–18.
Brown JR. 2003. Ancient horizontal gene transfer. Nat Rev Genet.
4:121–132.
Brown JR, Douady CJ, Italia MJ, Marshall WE, Stanhope MJ.
2001. Universal trees based on large combined protein sequence data sets. Nat Genet. 28:281–285.
Bryant D, Moulton V. 2004. Neighbor-net: an agglomerative
method for the construction of phylogenetic networks. Mol
Biol Evol. 21:255–265.
Caetano-Anolles G, Caetano-Anolles D. 2003. An evolutionarily
structured universe of protein architecture. Genome Res.
13:1563–1571.
Cambillau C, Claverie J-M. 2000. Structural and genomic correlates of hyperthermostability. J Biol Chem. 275:32383–32386.
Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B,
Bork P. 2006. Toward automatic reconstruction of a highly
resolved tree of life. Science. 311:1283–1287.
Clarke GD, Beiko RG, Ragan MA, Charlebois RL. 2002. Inferring
genome trees by using a filter to eliminate phylogenetically
discordant sequences and a distance matrix based on mean
normalized BLASTP scores. J Bacteriol. 184:2072–2080.
Daubin V, Gouy M, Perriere G. 2001. Bacterial molecular phylogeny using supertree approach. Genome Inform. 12:155–164.
Durbin R, Eddy S, Krogh A, Mitchison G. 1998. Biological sequence analysis: probabilistic models of proteins and nucleic
acids. Cambridge: Cambridge University Press.
Dutilh BE, Huynen MA, Bruno WJ, Snel B. 2004. The consistent
phylogenetic signal in genome trees revealed by reducing the
impact of noise. J Mol Evol. 58:527–539.
Eddy S. 2003. HMMER user’s guide version 2.3.2 [Internet] [cited
2007 March 12]. Available from: ftp://ftp.genetics.wustl.edu/
pub/eddy/hmmer/CURRENT/Userguide.pdf.
Embley TM, Martin W. 2006. Eukaryotic evolution, changes and
challenges. Nature. 440:623–630.
Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 39:783–791.
Fitz-Gibbon ST, House CH. 1999. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids
Res. 27:4218–4222.
Fukami-Kobayashi K, Tateno Y, Nishikawa K. 2003. Parallel evolution of ligand specificity between LacI/GalR family repressors and periplasmic sugar-binding proteins. Mol Biol Evol.
20:267–277.
Fukami-Kobayashi K, Tomoda S, Go M. 1993. Evolutionary clustering and functional similarity of RNA-binding proteins.
FEBS Lett. 335:289–293.
Fukuchi S, Nishikawa K. 2001. Protein surface amino acid compositions distinctively differ between thermophilic and mesophilic bacteria. J Mol Biol. 309:835–843.
Garrity GM, Bell JA, Lilburn TG. 2004. Taxonomic outline of the
prokaryotes release 5.0. Bergey’s manual of systematic bacteriology. 2nd ed. New York: Springer.
Gerstein M. 1998. Patterns of protein-fold usage in eight microbial
genomes: a comprehensive structural census. Proteins. 33:
518–534.
Goff SA, Ricke D, Lan T-H, et al. (55 co-authors). 2002. A draft
sequence of the rice genome (Oryza sativa L. ssp. japonica).
Science. 296:92–100.
Graur D, Li W-H. 2000. Fundamentals of molecular evolution.
Sunderland (MA): Sinauer Associates, Inc.
Gribaldo S, Philippe H. 2002. Ancient phylogenetic relationships.
Theor Popul Biol. 61:391–408.
Gupta RS. 1998. Protein phylogenetics and signature sequences:
a reappraisal of evolutionary relationship among archaebacteria, eubacteria, and eukaryotes. Microbiol Mol Biol Rev.
62:1435–1491.
Hansmann S, Martin W. 2000. Phylogeny of 33 ribosomal and six
other proteins encoded in an ancient gene cluster that is conserved across prokaryotic genomes: influence of excluding
poorly alignable sites from analysis. Int J Syst Evol Microbiol.
50(Pt 4):1655–1663.
Hasegawa M, Hashimoto T. 1993. Ribosomal RNA trees misleading? Nature. 361:23.
Katoh K, Kuma K, Miyata T. 2001. Genetic algorithm-based
maximum-likelihood analysis for molecular phylogeny.
J Mol Evol. 53:477–484.
Kawabata T, Fukuchi S, Homma K, Ota M, Araki J, Ito T,
Ichiyoshi N, Nishikawa K. 2002. GTOP: a database of protein
structures predicted from genome sequences. Nucleic Acids
Res. 30:294–298.
Korbel JO, Snel B, Huynen MA, Bork P. 2002. SHOT: a web
server for the construction of genome phylogenies. Trends
Genet. 18:158–162.
Kunin V, Ahren D, Goldovsky L, Janssen P, Ouzounis CA.
2005. Measuring genome conservation across taxa: divided strains and united kingdoms. Nucleic Acids Res. 33:
616–621.
Kunin V, Ouzounis CA. 2003. GeneTRACE-reconstruction of
gene content of ancestral species. Bioinformatics. 19:1412–
1416.
A Tree of Life Based on Domain Organizations
Lin J, Gerstein M. 2000. Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing
genomes on different levels. Genome Res. 10:808–818.
Lin Z, Kong H, Nei M, Ma H. 2006. Origins and evolution of
the recA/RAD51 gene family: evidence for ancient gene
duplication and endosymbiotic gene transfer. Proc Natl Acad
Sci USA. 103:10328–10333.
Murzin AG, Brenner SE, Hubbard T, Chothia C. 1995. SCOP:
a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 247:
536–540.
Nakashima H, Fukuchi S, Nishikawa K. 2003. Compositional
changes in RNA, DNA and proteins for bacterial adaptation
to higher and lower temperatures. J Biochem. (Tokyo).
133:507–513.
Rivera MC, Lake JA. 2004. The ring of life provides evidence for
a genome fusion origin of eukaryotes. Nature. 431:152–155.
Rokas A, Williams BL, King N, Carroll SB. 2003. Genome-scale
approaches to resolving incongruence in molecular phylogenies. Nature. 425:798–804.
Saitou N, Nei M. 1987. The neighbor-joining method: a new
method for reconstructing phylogenetic trees. Mol Biol Evol.
4:406–425.
Snel B, Bork P, Huynen MA. 1999. Genome phylogeny based on
gene content. Nat Genet. 21:108–110.
Tateno Y, Nei M, Tajima F. 1982. Accuracy of estimated phylogenetic trees from molecular data. I. Distantly related species.
J Mol Evol. 18:387–404.
Tekaia F, Lazcano A, Dujon B. 1999. The genomic tree as
revealed from whole proteome comparisons. Genome Res.
9:550–557.
1189
The Arabidopsis Genome Initiative. 2000. Analysis of the genome
sequence of the flowering plant Arabidopsis thaliana. 408:
796–815.
Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL,
Schuler GD, Tatusova TA, Rapp BA. 2000. Database resources of the National Center for Biotechnology Information.
Nucleic Acids Res. 28:10–14.
Winstanley HF, Abeln S, Deane CM. 2005. How old is your fold?
Bioinformatics. 21:i449–i458.
Woese CR. 1987. Bacterial evolution. Microbiol Rev. 51:221–271.
Wolf YI, Brenner SE, Bash PA, Koonin EV. 1999. Distribution
of protein folds in the three superkingdoms of life. Genome
Res. 9:17–26.
Wolf YI, Rogozin IB, Grishin NV, Koonin EV. 2002. Genome
trees and the tree of life. Trends Genet. 18:472–479.
Wolf YI, Rogozin IB, Grishin NV, Tatusov RL, Koonin EV. 2001.
Genome trees constructed using five different approaches
suggest new major bacterial clades. BMC Evol Biol. 1:8.
Wolf YI, Rogozin IB, Kondrashov AS, Koonin EV. 2001. Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context.
Genome Res. 11:356–372.
Yang S, Doolittle RF, Bourne PE. 2005. Phylogeny determined by
protein domain content. Proc Natl Acad Sci USA. 102:373–378.
Yu J, Wang J, Lin W, et al. (117 co-authors). 2005. The genomes
of Oryza sativa: a history of duplications. PLoS Biol. 3:e38.
William Martin, Associate Editor
Accepted February 19, 2007