Thesis Project Report
Improving Gene Function Prediction
Using Gene Neighborhoods
Kwangmin Choi
Bioinformatics Program, School of Informatics,
Indiana University, Bloomington, Indiana
Contents
1. Introduction
2. Scope
3. Materials & Methods
4. Results
5. Discussion
6. References
1. INTRODUCTION
An operon is a group of adjacent, co-expressed and co-regulated genes that encode
functionally linked proteins in prokaryotic genomes. It has been reported that many operons,
particularly those that encode subunits of multiprotein complexes, are conserved in
phylogenetically distant prokaryotic genomes, or even between archaea and bacteria. This is
due, in part, to the conservation of operons over long stretches of evolutionary time since
the last common ancestor and, in part, to horizontal transfer of operon components among
prokaryotes (the selfish-operon hypothesis; Ref.1 and Ref.2). Natural selection favors
horizontal transfer of whole operons much more than transfer of individual genes,
because co-expression and co-regulation can be preserved in the former case.
Operons tend to undergo multiple rearrangements during evolution. As a result, gene
order at a level above the operon is poorly conserved, and genome comparison diagonal plots appear
completely disordered even between phylogenetically very close species (Ref.4). Because
gene order is lost so readily, it was noticed very early that the gene neighborhoods that
are conserved carry biological meaning (Ref.5 and Ref.6).
Computational algorithms to locate operons have been developed previously,
primarily for Escherichia coli (Ref.8 and Ref.9). (1) Earlier methods were based on finding
signals that occur at the boundaries of operons: promoters at the 5’-end
and terminators at the 3’-end. Such approaches are useful only when
transcription signals are completely known, and even in E.coli the sequence motifs of
promoters and terminators are not completely characterized. (2) Another method uses a
combination of gene expression data, functional annotation and other experimental data,
and is primarily applicable to well-studied genomes such as E.coli. (3) Finally, some
methods rely on the intergenic distance between adjacent genes and on gene annotation. Such
methods are based on finding gene clusters in which gene order and orientation are conserved
in 2 or more genomes (Ref.8, Ref.9, Ref.10, Ref.11, Ref.12).
In this study, a new method based on the third approach was investigated to
reconstruct gene neighborhoods by using sequence homologs among 2 or more whole
bacterial genomes. The idea underlying this study is that different genomes contain
different, overlapping parts of evolutionarily and functionally connected gene
neighborhoods and the entire neighborhoods can be reconstructed by generating a tiling
path through these overlaps. A method to predict the cellular functions of
uncharacterized gene products more reliably was also investigated.
2. SCOPE
As a part of the PLATCOM (A Platform for Computational Comparative Genomics)
project led by Dr. Sun Kim (Bioinformatics, Indiana University, Bloomington), the first
goal of this project was to provide public users with a web interface to our protein-protein
and genome-genome pairwise comparison database on the BIOKDD server and to build an
integrated genome analysis system for the comparative analysis of multiple genomes.
Based on this web-based platform, several multiple-genome analysis tools were
written in Perl and embedded into the PLATCOM system. Complete gene
neighborhoods were reconstructed from gene clustering data, and this genomic
context was analyzed to predict operons or operon-like gene clusters. The web interface of the PLATCOM
system is located at http://biokdd.informatics.indiana.edu/kwchoi/platcom/
3. MATERIALS & METHODS
3.1. PLATCOM
The PLATCOM system consists of 3 components: (1) databases of biological entities (e.g.
DNA sequence files (fna), protein sequence files (faa), integrated genomic information files
(ptt, gbk)), (2) databases of relationships among entities (e.g. genome-genome and protein-protein pairwise comparison files) and (3) mining tools over these databases.
Fig 1. PLATCOM interface
3.2. Genomes
On the PLATCOM webpage, users can choose multiple genomes for a comparative
genomics study. In total, 23 genomes (15 phylogenetic groups) were chosen for this study;
they are listed in Table 1.
1. Bacteria; Proteobacteria; beta subdivision; Neisseriaceae; Neisseria
   AL157959  Neisseria meningitidis serogroup A strain Z2491 complete genome
2. Bacteria; Proteobacteria; gamma subdivision; Pseudomonas group; Pseudomonas
   AE004091  Pseudomonas aeruginosa PAO1, complete genome
   BA000003  Buchnera sp. APS complete genome
3. Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae; Escherichia
   L42023    Haemophilus influenzae Rd complete genome
   U00096    Escherichia coli K-12 MG1655 complete genome
   AL590842  Yersinia pestis CO92 complete genome
4. Bacteria; Proteobacteria; gamma subdivision; Vibrionaceae; Vibrio
   AE003852  Vibrio cholerae chromosome I, complete chromosome
   AE003853  Vibrio cholerae chromosome II, complete chromosome
5. Bacteria; Proteobacteria; epsilon subdivision; Campylobacter group; Campylobacter
   AL111168  Campylobacter jejuni complete genome
6. Bacteria; Proteobacteria; epsilon subdivision; Helicobacter group; Helicobacter
   AE000511  Helicobacter pylori 26695 complete genome
7. Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae; Salmonella
   AL513382  Salmonella typhi strain CT18, complete chromosome
   AE006468  Salmonella typhimurium LT2, complete genome
8. Bacteria; Firmicutes; Bacillus/Clostridium group; Mollicutes; Mycoplasmataceae; Mycoplasma
   L43967    Mycoplasma genitalium G37 complete genome
   U00089    Mycoplasma pneumoniae M129, complete genome
9. Bacteria; Firmicutes; Actinobacteria; Actinobacteridae; Actinomycetales; Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium tuberculosis complex
   AL123456  Mycobacterium tuberculosis complete genome
10. Bacteria; Firmicutes; Bacillus/Clostridium group; Bacillus/Staphylococcus group; Bacillus
   BA000004  Bacillus halodurans C-125, complete genome
   AL009126  Bacillus subtilis complete genome
11. Bacteria; Firmicutes; Bacillus/Clostridium group; Streptococcaceae; Streptococcus
   AE005672  Streptococcus pneumoniae TIGR complete genome
   AE004092  Streptococcus pyogenes strain SF370 serotype M1, complete genome
12. Bacteria; Firmicutes; Bacillus/Clostridium group; Bacillus/Staphylococcus group; Staphylococcus
   BA000017  Staphylococcus aureus strain Mu50, complete genome
13. Bacteria; Thermus/Deinococcus group; Deinococcales; Deinococcus
   AE000513  Deinococcus radiodurans R1 complete chromosome 1
14. Bacteria; Chlamydiales; Chlamydiaceae; Chlamydia
   AE002161  Chlamydophila pneumoniae AR39, complete genome
15. Archaea; Euryarchaeota; Methanococcales; Methanococcaceae; Methanococcus
   L77117    Methanococcus jannaschii complete genome
Table 1. A list of genomes used in this study.
http://www.infobiogen.fr/services/deambulum/english/genomes2a.html
3.3. All pairwise protein comparison data
The database of all pairwise protein comparisons was built using FASTA and
embedded into the PLATCOM system. The database of all pairwise genome-genome (DNA) sequence
comparisons was not used for this study.
3.4. Ptt files
The ptt file of each genome was used to obtain the locations of Open
Reading Frames (ORFs), gene names, Clusters of Orthologous Groups of proteins (COG)
numbers, the functions of gene products and other information. Ptt files contain both experimental and
computed/hypothetical data.
If a gene product is not yet identified, the ptt file does not provide COG and PID numbers
for that protein; instead, it suggests hypothetical function(s) of the putative gene product if
well-characterized sequence homologs exist.
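As an illustration of how such a file can be read, the following is a minimal Python sketch (not the Perl code used in PLATCOM). It assumes the tab-delimited NCBI ptt layout with nine columns (Location, Strand, Length, PID, Gene, Synonym, Code, COG, Product) and a "-" in the COG column when no COG is assigned; the `Gene` record and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    start: int
    end: int
    strand: str           # "+" or "-"
    length: int           # value of the Length column
    pid: str
    name: str
    cog: Optional[str]    # None when the COG column holds "-"
    product: str


def parse_ptt_line(line: str) -> Optional[Gene]:
    """Parse one tab-delimited record; return None for header or malformed lines."""
    fields = line.rstrip("\n").split("\t")
    # Assumed layout: Location, Strand, Length, PID, Gene, Synonym, Code, COG, Product
    if len(fields) < 9 or ".." not in fields[0]:
        return None
    start_text, end_text = fields[0].split("..", 1)
    if not (start_text.isdigit() and end_text.isdigit() and fields[2].isdigit()):
        return None
    cog = fields[7] if fields[7].startswith("COG") else None
    return Gene(int(start_text), int(end_text), fields[1], int(fields[2]),
                fields[3], fields[4], cog, fields[8])
```

Genes for which `cog` is `None` are exactly the cases where the file alone gives no functional clue and genomic context has to be used instead.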
3.5. Genome comparison diagonal plot generating tool
A Perl/CGI program to generate genome comparison diagonal plots was embedded into the
PLATCOM system. This tool helps users understand gene order conservation
between 2 chosen genomes. The tool can generate 2 different kinds of diagonal plots: one
from protein-protein pairwise comparison data and the other from whole genome-genome pairwise comparison data. This is a simple tool, but it gives a strong intuition
for genome structure.
3.6. Gene clustering data
Another program was written in Perl to generate gene clustering data for multiple
genomes. A user can select any number of genomes on the PLATCOM interface and can also
choose the minimum protein length and other cutoff values (Z-score and E-value).
Perl’s graph package was used to perform the clustering. Each homologous gene pair was used as
input to the graph (the genes as vertices, the pair as an edge). The output is a list of groups of genes that were
clustered according to sequence homology. If two given proteins are both longer than the
length cutoff, their Z-score is greater than the Z-score cutoff and their E-value
is smaller than the E-value cutoff, the 2 proteins are placed in the same
cluster. For this study, the Z-score cutoff was 750 and the E-value cutoff was 1e-20. The
minimum length of a protein product was 50. The clustering result is sent to the
user by e-mail.
As a reality check, the COG database (http://www.ncbi.nlm.nih.gov/COG/) was used to
check whether each cluster contained only orthologs. This algorithm will be replaced with Dr.
Kim’s own clustering algorithm because of its relatively slow performance.
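The clustering rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl implementation: it assumes that clusters are the connected components of the graph of gene pairs that pass all three cutoffs (single linkage), and that the minimum length is inclusive. The function name and the input format are hypothetical.

```python
from collections import defaultdict


def cluster_proteins(hits, lengths, min_len=50, min_z=750.0, max_e=1e-20):
    """Group proteins linked by pairwise hits that pass all cutoffs.

    hits:    iterable of (protein_a, protein_b, z_score, e_value)
    lengths: dict mapping protein ID to protein length
    Returns a sorted list of clusters (each a sorted list of protein IDs).
    """
    parent = {}

    def find(x):
        # Union-find root lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, z, e in hits:
        if lengths[a] < min_len or lengths[b] < min_len:
            continue  # at least one protein is too short
        if z > min_z and e < max_e:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)
    return sorted(sorted(group) for group in groups.values())
```

Proteins that never take part in a passing pair do not appear in the output; whether singletons were reported by the original tool is not stated in this report.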
3.7. Reconstructing gene clusters/ gene neighborhood/ genomic context
Another web tool to reconstruct gene clusters/gene neighborhoods was written in Perl
and embedded into PLATCOM. When a user supplies a clustering data file, a gene and its genome ID (GI
number) and an intergenic distance, this tool automatically finds all
other genes in the same clustering group by referring to the gene clustering data and then finds all
possible gene clusters/neighborhoods that contain the gene or its homologs.
Intergenic distance and DNA strand (+/-) are used to decide whether to connect
2 adjacent genes. The default intergenic distance cutoff is 300 base pairs (Ref.11). If
the intergenic distance is greater than the given value, the 2 adjacent genes are not connected and
are considered to belong to different gene clusters. A user can choose all or some of the
gene clusters to generate multiple alignments with ClustalW. Fig 2 shows one of the
reconstruction results.
Fig 2. Gene neighborhood reconstruction result
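The rule for connecting adjacent genes can be sketched in Python as below. This is a simplified illustration, not the PLATCOM tool: it assumes genes are given as (name, start, end, strand) tuples, that the intergenic distance is the number of bases between the end of one gene and the start of the next, and that only genes on the same strand are connected. Both function names are hypothetical.

```python
def split_into_clusters(genes, max_gap=300):
    """Split the genes of one genome into runs of connected neighbors.

    genes: list of (name, start, end, strand) tuples.
    Two consecutive genes are connected when they lie on the same strand
    and the gap between them is shorter than max_gap base pairs.
    """
    clusters = []
    current = []
    previous = None
    for gene in sorted(genes, key=lambda g: g[1]):
        name, start, end, strand = gene
        connected = (
            previous is not None
            and strand == previous[3]
            and start - previous[2] - 1 < max_gap
        )
        if connected:
            current.append(name)
        else:
            if current:
                clusters.append(current)
            current = [name]
        previous = gene
    if current:
        clusters.append(current)
    return clusters


def clusters_containing(clusters, query_genes):
    """Keep only the clusters that contain the query gene or one of its homologs."""
    wanted = set(query_genes)
    return [cluster for cluster in clusters if wanted.intersection(cluster)]
```

In use, `query_genes` would be the members of the homology cluster that contains the gene chosen by the user, so that the same neighborhood is collected from every selected genome.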
3.8. Prediction of gene function using genomic context
Different genomes contain different, overlapping parts of evolutionarily and
functionally connected gene neighborhoods, and the entire gene neighborhood can be
reconstructed by generating a “Tiling Path” (Ref.12).
Although over 150 genomes have been completely sequenced to date, the
biological functions of many genes are still uncharacterized, and many genomic data files
do not provide COG numbers, PID numbers or other biologically meaningful information
for such genes. In this case, it is impossible to predict gene function by referring to these
files alone. Instead, we need to use the genomic context of well-known genomes, such as E.coli. That
is to say, we can start by observing gene neighborhoods in these well-known genomes
and then compare them with gene clusters/neighborhoods in the target genomes to analyze
the genomic context of the target genomes. Genomic context here means the pattern of “runs” of COG
numbers. With this strategy, cellular functions of unknown genes and genomic “hitchhikers”
could be found. “Hitchhikers” are inserted genes that originate from different
contexts/themes.
Fig 3. Tiling Path Method. (Cited from Ref.12)
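The tiling idea can be illustrated with a small Python sketch. It is a greedy simplification written for this report, not the algorithm of Ref.12: it assumes each genome contributes an ordered run of COG identifiers and that two runs are joined only when the end of one matches the beginning of the other over at least `min_overlap` positions. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def tile(seed, runs, min_overlap=2):
    """Greedily extend a seed run with overlapping runs from other genomes."""
    path = list(seed)
    remaining = [list(run) for run in runs]
    extended = True
    while extended:
        extended = False
        for run in remaining:
            k = overlap(path, run)
            if min_overlap <= k < len(run):
                path = path + run[k:]          # extend to the right
            else:
                k = overlap(run, path)
                if min_overlap <= k < len(run):
                    path = run[:-k] + path     # extend to the left
                else:
                    continue                   # this run does not overlap yet
            remaining.remove(run)
            extended = True
            break                              # rescan the remaining runs
    return path
```

For example, a seed `[COG1869, COG1129, COG1172]` combined with runs that start at `COG1129` and at `COG1879` would be extended step by step into one longer run, provided the overlaps are exact. Real neighborhoods also contain insertions and inversions, which this sketch does not handle.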
As a reality check, the BioCyc database (http://biocyc.org:1555/ECOLI/class-subs-instances?object=Transcription-Units) and the TIGR site (http://www.tigr.org/tigr-scripts/operons/operons.cgi) were consulted to compare the reconstructed gene clusters with
experimentally proven or computationally predicted transcription units in these databases.
Unfortunately, only Escherichia coli has complete transcription unit information at
this time.
4. RESULTS
4.1. In general, the gene order is not conserved, even between
phylogenetically-close genomes.
Several genome comparison diagonal plots were generated with the embedded web tool
(see Materials & Methods) to investigate the relationship between phylogenetic distance
and the conservation of gene order. This tool finds homologs between 2 given genomes and
plots them in 2D space (Z-score > 750).
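A minimal Python sketch of how the plotted points can be derived is given below. It assumes that each homologous pair contributes one point whose coordinates are the start positions of the two genes in their respective genomes; the function name and input format are hypothetical, and the actual tool was a Perl/CGI program.

```python
def diagonal_plot_points(hits, positions_a, positions_b, min_z=750.0):
    """Return (x, y) points for homolog pairs that pass the Z-score cutoff.

    hits:        iterable of (gene_in_a, gene_in_b, z_score)
    positions_a: dict mapping gene ID to start position in genome A
    positions_b: dict mapping gene ID to start position in genome B
    """
    points = []
    for gene_a, gene_b, z in hits:
        if z > min_z and gene_a in positions_a and gene_b in positions_b:
            points.append((positions_a[gene_a], positions_b[gene_b]))
    return sorted(points)
```

If gene order were conserved, the returned points would fall along a line when drawn as a scatter plot; scattered points indicate rearrangement.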
The basic idea underlying this analysis is that if the gene order of two genomes is well
conserved, the homologs should appear as a line on the genome comparison
diagonal plot. As expected, gene order between phylogenetically distant species is
poorly conserved (Fig 4).
Fig 4. Phylogenetically-distant genomes: (a) Escherichia coli K-12 vs Sulfolobus solfataricus (b)
Campylobacter jejuni vs E.coli K-12 (c) Agrobacterium tumefaciens C58 vs E.coli K-12
But one of the most striking findings (Fig 5) is that gene order is not conserved even
between phylogenetically-close genomes (e.g. E.coli vs H.influenzae and E.coli vs
V.cholerae).
Fig 5. Phylogenetically-close genomes: (a) E.coli K-12 vs Haemophilus influenzae
(b) E.coli K-12 vs Vibrio cholerae (c) E.coli vs Salmonella enterica Typhi
Instead, short runs of genes were found when some areas were zoomed in (Fig 6).
They can be considered gene clusters if the intergenic distance is shorter than a given length
(e.g. 300 bp). The possibility that some of these short runs are conserved in many other
genomes is investigated in the next part.
Fig 6. Fragmented gene clusters : E.coli K-12 vs V.cholerae
4.2. Rbs operon (Z-score > 750, Intergenic Distance < 300)
The Rbs operon of E.coli consists of a set of genes involved in ribose transport across the cell
membrane. The cellular function of each gene is described in Table 2.
COG      Cellular function
COG1869  D-ribose high-affinity transport system; membrane-associated protein
COG1129  ATP-binding component of D-ribose high-affinity transport system
COG1172  D-ribose high-affinity transport system
COG1879  D-ribose periplasmic binding protein
COG0524  Ribokinase
COG1609  Regulator for rbs operon
Table 2. Cellular functions of Rbs operon components: E.coli
Fig 7. Reconstructed gene neighborhood of Rbs operon
From the clustering data of the 23 genomes, gene neighborhoods were reconstructed
by generating a tiling path (see Materials & Methods). 9 out of the 23 genomes contain the Rbs
operon, and their gene order appears to be well conserved (Fig 7).
AE003852 (V.cholerae chromosome I) was ignored because the other chromosome
(AE003853) of V.cholerae has a longer gene cluster containing all 8 components. 7 out of
these 9 genomes have a full set of 8 components, but 1 component (COG1879) was missing
in AE004091 and 3 (COG1172, COG0524 and COG1609) were missing in AL590842. The short
run of genes [COG1869-COG1129-COG1172-COG1879-COG0524-COG1609] can be
considered the conserved genomic context (general scheme) of the Rbs operon.
The ptt files of AL513382 and AE006468 do not provide COG numbers for these 8
operon components, but their gene neighborhoods could be reconstructed by generating a
tiling path based on gene clustering data. The cellular functions of the operon components of
these 2 genomes could be reasonably inferred when their genomic contexts were carefully
examined.
4.3. Functional Coupling (Z-score > 750, Intergenic Distance < 300)
The next example shows a more complex genomic context with 2 functional themes
(transcription control and translation control) and putative hitchhikers.
In bacterial genomes, transcription, translation and RNA modification/degradation
are coupled, and the advantages of co-regulating the corresponding genes are obvious. Table 3
shows the cellular functions of the genes within the 22 gene clusters examined.
Theme          COG      Cellular function
Transcription  COG0779  Uncharacterized conserved protein
Transcription  COG0195  Transcription elongation factor
Transcription  COG2740  Predicted nucleic-acid-binding protein (transcription termination?)
Translation    COG1358  Ribosomal protein S17E
Translation    COG0532  Translation initiation factor 2 (GTPase)
Translation    COG1550  Uncharacterized conserved protein
Translation    COG0858  Ribosome-binding factor A
Translation    COG0184  Ribosomal protein S15P/S13E
Translation    COG0130  tRNA pseudouridine synthase
Hitchhiker?    COG0196  FAD synthase (hitchhiker?)
Table 3. Cellular functions of operon components : B.subtilis
Fig 8 shows the reconstructed gene neighborhoods of 22 genomes. 6 components,
[COG0779-COG0195-COG0532-COG0858-COG0130-COG0184], were common among the
22 genomes, so this pattern can be considered the general scheme of these gene
neighborhoods.
In 5 genomes (AL123456, BA000004, AL009126, AE005672, AE000513), the
context [COG0779-COG0195-COG2740] was found. Although COG0779 is an
uncharacterized protein, it is possible that COG0779 is a functional partner of COG0195
because it is always an adjacent neighbor of COG0195.
Fig 8. Reconstructed gene neighborhood
The context of translation control seems to be [COG0532-COG0858-COG0130-COG0184]. It is reasonable to consider COG0196 a genomic hitchhiker because it
belongs to a different functional theme (energy metabolism, not translation control). Another
uncharacterized protein, COG1550, is probably a hitchhiker, but it is still possible that it belongs
to the theme of translation control.
AE004092 has 3 uncharacterized gene products. They are sequence
homologs of COG0779, COG2740 and COG1358, respectively. Sequence homology alone is sometimes not enough to conclude that 2
proteins have the same cellular function, so genomic context
can be used to support such a guess. The gene neighborhood of AE004092 is [? (Yellow)-COG0195-? (Sky blue)-? (Purple)-COG0532.....], and this scheme corresponds exactly
to the general scheme of the other 21 genomes, [COG0779-COG0195-COG2740-COG1358-
COG0532...]. So it is quite reasonable to say that these 3
uncharacterized proteins have the cellular functions of COG0779, COG2740 and COG1358.
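The positional argument used here can be written as a short Python sketch. It only checks context: an observed neighborhood with unknown positions (`None`) is slid along the general scheme, and the unknowns are filled in when every known COG matches. The function name is hypothetical, and the sequence homology evidence used in this study is not modeled.

```python
def annotate_by_context(observed, scheme):
    """Fill unknown positions of a neighborhood from a general scheme.

    observed: list of COG IDs in gene order, with None for unknown genes.
    scheme:   list of COG IDs describing the conserved genomic context.
    Returns the completed list, or None when no consistent placement exists.
    """
    n = len(observed)
    for offset in range(len(scheme) - n + 1):
        window = scheme[offset:offset + n]
        known = [(o, w) for o, w in zip(observed, window) if o is not None]
        # Require at least one known gene and agreement at every known position.
        if known and all(o == w for o, w in known):
            return [w if o is None else o for o, w in zip(observed, window)]
    return None
```

With the AE004092 neighborhood written as `[None, "COG0195", None, None, "COG0532"]` and the general scheme starting `COG0779, COG0195, COG2740, COG1358, COG0532`, the three unknown positions are assigned COG0779, COG2740 and COG1358, which agrees with the homology-based guess.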
5. DISCUSSION
Identifying operon structure is one of the most important issues in prokaryotic
genomics, because operon structure is the most prominent characteristic of bacterial
genomes. According to the classical concept, an operon is defined as a set of adjacent
genes that are regulated by one promoter. Based on this concept, there have been enormous
efforts to identify operon structure by detecting promoter signals, but most of these attempts
were unsuccessful, because it is very hard to identify the location of promoters.
Although many promoters are found just upstream of the first component of an operon, some
other promoters are found far away from the operon itself, either upstream or
downstream.
For this reason, many other approaches to identifying operon structure have been introduced in recent
years (Ref 3). These methods focus on a set of adjacent genes on the same strand
(directon), and the location of promoters is not considered (Fig 8). Generally these adjacent
genes form a gene cluster in which intergenic distances are shorter than 200-300 bp, and
there have been many interesting reports that the components of gene clusters belong to the
same or related functional themes. This is a major change in the way operon structure is sought.
Predicting unknown operons is a difficult and error-prone procedure that has never
been defined in algorithmic terms. However, detecting adjacent or closely spaced genes and
reconstructing their connections is much easier, provided that orthologous relationships
have been identified correctly. Of course, many of these connections are functionally
irrelevant, but detailed analysis can uncover their hidden biological meaning and may help
predict new functional connections (Ref.3).
Fig 9. Gene clusters and close homologs. PCBBH: Pair of Close Bidirectional Best Hits,
BBH: Bidirectional Best Hits, PCH: Pair of Close Homologs,
COG: Clusters of Orthologous Groups. Cited from Ref. 11
A systematic comparison of 23 prokaryotic genomes reveals a low level of gene-order
(and operon architecture) conservation. Nevertheless, a number of short and fragmented
gene clusters are found to be conserved. Only a few operons, primarily those that code for
physically interacting proteins, are conserved in all or most of the bacterial and archaeal
genomes.
Recently the term “Über-operon” has emerged. An Über-operon is defined as a set of
genes with close functional and regulatory contexts that tends to be conserved despite
numerous rearrangements. This concept focuses on the functional themes of operons, not on
specific genes or gene order, although much evidence has also been published that the gene
order of some gene clusters is conserved during evolution and speciation (Ref 6 and Ref
7).
For a long time, biologists have focused on the sequence homology of 2 or more proteins
to understand evolutionary distance and have been relatively uninterested in gene order, gene
neighborhoods and genomic context, in part because gene order in prokaryotes is
conserved to a much lesser extent than protein sequences. But many recently suggested
computational approaches in comparative genomics go beyond simple sequence
comparison. A new possibility is now open to mine the biological meaning of genomic
context (Ref.9 and Ref.10).
Here I suggest an improved method to predict the cellular functions of uncharacterized
gene products. This methodology is based on the elegant idea of Koonin’s group at
NCBI (Ref.12). Koonin’s group reconstructed gene neighborhoods by referring to the COG
numbers of gene products, and their methodology appears to work well in most cases. But
this method has one critical problem: in some cases, genome data files (gbk or
ptt) do not provide COG numbers at all. To compensate for this defect, I also used gene
clustering data based on the sequence homology of all proteins of all chosen genomes. Because
we have the gene clustering result, we can reconstruct a gene neighborhood even if no
COG number is provided, and the putative functions of uncharacterized proteins can be
inferred by referring to genomic context. The proteins encoded by conserved gene pairs
appear to interact physically or belong to the same functional theme. This observation can
therefore be used to predict functions of, and interactions between, prokaryotic gene
products.
It is clear that operons are physically gene clusters, but gene clusters are not always
operons. To say that a specific gene cluster is a functional operon, we need to examine whether all
components of the gene cluster are expressed together from the same promoter. If a
promoter that controls the expression of the gene cluster has not yet been identified, the gene
cluster should be considered just an “operon candidate”. Unfortunately, our knowledge
of genomic context is very limited at this time, even though over 150 genomes have been
completely sequenced. Escherichia coli is the only prokaryotic species for which the complete set of
transcription units has been reported.
In this study, I picked several gene clusters in E.coli as a framework to search for
corresponding conserved gene clusters in the rest of the 23 genomes. In many cases this worked
well, but in some cases E.coli’s gene cluster is not the longest one (Fig 7 and Fig 8).
This can make reconstructing a gene neighborhood a time-consuming job.
ACKNOWLEDGMENTS
I thank Haifeng Zhao for his work on our pairwise comparison database, Scott Martin
for his technical support and Dr. Sun Kim for his advising and helpful suggestions.
6. REFERENCES
1. Lawrence,J.G. and Roth,J.R. (1996) Selfish operons: horizontal transfer may drive
the evolution of gene clusters. Genetics, 143, 1843–1860.
2. Lawrence,J. (1999) Selfish operons: the evolutionary impact of gene clustering in
prokaryotes and eukaryotes. Curr. Opin. Genet. Dev., 9, 642–648.
3. Galperin,M.Y. and Koonin,E.V. (2000) Who’s your neighbor? New computational
approaches for functional genomics. Nat. Biotechnol., 18, 609–613.
4. Mushegian,A.R. and Koonin,E.V. (1996) Gene order is not conserved in bacterial
evolution. Trends Genet., 12, 289–290.
5. Dandekar,T., Snel,B., Huynen,M. and Bork,P. (1998) Conservation of gene order: a
fingerprint of proteins that physically interact. Trends Biochem. Sci., 23, 324–328.
6. Lathe,W.C.,III, Snel,B. and Bork,P. (2000) Gene context conservation of a higher
order than operons. Trends Biochem. Sci., 25, 474–479.
7. Huynen,M., Snel,B., Lathe,W.,III and Bork,P. (2000) Predicting protein function by
genomic context: quantitative evaluation and qualitative inferences. Genome Res.,
10, 1204–1210.
8. Salgado,H., Moreno-Hagelsieb,G., Smith,T.F. and Collado-Vides,J. (2000) Operons
in Escherichia coli: genomic analyses and predictions. Proc. Natl Acad. Sci. USA,
97, 6652–6657.
9. Ermolaeva,M.D., White,O. and Salzberg,S.L. (2001) Prediction of operons in
microbial genomes. Nucleic Acids Res., 29, 1216–1221.
10. Wolf,Y.I., Rogozin,I.B., Kondrashov,A.S. and Koonin,E.V. (2001) Genome
alignment, evolution of prokaryotic genome organization and prediction of gene
function using genomic context. Genome Res., 11, 356–372.
11. Overbeek,R., Fonstein,M., D’Souza,M., Pusch,G.D. and Maltsev,N. (1999) The use
of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA, 96, 2896–
2901.
12. Rogozin,I.B., Makarova,K.S., Murvai,J., Czabarka,E., Wolf,Y.I.,
Tatusov,R.L., Szekely,L.A. and Koonin,E.V. (2002) Connected
gene neighborhoods in prokaryotic genomes. Nucleic Acids Res., 30, 2212–2223.
13. Daubin, V. and Gouy, M. (2001) Bacterial molecular phylogeny using supertree
approach. Genome Informatics Genome Research, Vol 12, Issue 7, 1080-1090