HCG: a database for hierarchical classification of functionally equivalent genes in prokaryotes

Fenglou Mao*, Hongwei Wu*, Victor Olman, Ying Xu1

Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA

*These authors contributed equally to this paper
1 Corresponding author

Abstract

Background: Existing gene annotation schemes generally classify genes into two levels of parallel and unrelated homologous and/or orthologous gene groups, which limits our ability to predict gene function at higher resolution. While homology and orthology are useful concepts for evolutionary studies of genes, they may not be the most appropriate ones for functional classification of genes, especially at a high-resolution level.

Results: We present a new gene annotation database, the hierarchical classification system of genes (HCG), which generally provides functional annotation of prokaryotic genes at higher resolution than existing functional classification schemes. The HCG database consists of hierarchically organized clusters of functionally equivalent genes at varying levels of resolution. Gene clusters at the top of the HCG hierarchy represent homologous gene groups, and descendant clusters represent functionally equivalent genes at increasingly higher resolution going down from the top to the leaf-level clusters of the classification hierarchy. We also provide several examples to demonstrate how HCG can be used to make specific gene function annotations. For each HCG cluster, we provide a p-value assessing the statistical significance of grouping its genes together, based on the functional relationships among its genes and their relationships with genes outside the cluster.

Conclusion: The HCG database, implemented using MySQL, currently consists of 658,174 genes and 51,205 clusters organized into 21,109 trees, from 224 prokaryotic genomes.
The online database supports four search capabilities, namely (1) browsing the HCG classification by trees, (2) browsing the HCG classification by organisms, (3) querying genes against the HCG database to find their gene clusters at the highest resolution possible and their parent clusters, if any, and (4) annotating sequences provided by a user.

1. Background

With the rapid accumulation of genome sequences along with their accurately predicted genes, numerous efforts have been devoted to the computer-aided functional annotation of genes. These efforts have led to the development of a number of functional classification schemes and associated databases such as Clusters of Orthologous Groups (COG) [1], Pfam [2], and InterPro [3]. There are also databases that integrate gene annotation information with pathway information, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [4], BioCyc [5] and the subsystem annotation environment SEED [6]. While these and other functional classification schemes and databases provide highly useful information for functional annotation of genomes, they are generally limited to classifying genes into homologous and/or orthologous gene groups, although homology and orthology are defined in evolutionary terms and do not by themselves indicate a functional relationship between genes. The classification result of such schemes is generally represented as a collection of parallel and unrelated functionally “equivalent” gene groups, providing a two-level classification of functionally equivalent genes. We believe that the functional relationships between genes can be better represented using a hierarchical system, a view supported by the recent development of the Gene Ontology (GO) [7], which employs a directed acyclic graph (DAG) structure, more general than a hierarchical structure.
In general, gene function classifications fall into two classes: two-level classifications such as COG, KEGG orthologs and Pfam, and multi-level classifications such as GOA and our classification scheme HCG. Until now, the Gene Ontology Annotation (GOA) Database [8] has been the only database that employs multi-level classification of gene functions. GOA annotates genes using GO terms, so it stands on solid ground for function classification. However, most annotations in GOA are extracted from UniProt and InterPro by using three scripts (ec2go, skpw2go and InterPro2go), and the others are annotated manually with the help of annotation tools such as GOAnnotator; it is thus hard to evaluate the annotation quality.

There are other genome databases with gene annotation information, such as the integrated microbial genomes (IMG) system [9] and Integr8 [10]. While useful, the gene annotation in IMG is created using rather simple methods, namely RPS-BLAST (reverse position specific BLAST) and bidirectional best hits, which are widely thought to be inaccurate [11], to have low sensitivity [12] and to yield high false-positive rates [13]. IMG also adopts two-level classification strategies such as Pfam and COG. Integr8 likewise draws its annotation from other databases such as InterPro and Pfam.

We have developed a functional classification scheme for prokaryotic genes, based on both sequence similarity information and genomic neighborhood information [14]. A key unique feature of this classification scheme is that it classifies genes into functionally equivalent clusters at multiple resolution levels. These clusters are either parallel to each other or nested inside one another, hence giving rise to a multi-level hierarchical structure, under which genes can have “equivalent” functions measured at varying resolution.
For example, genes in any root-level cluster of this functional hierarchy are functionally equivalent in the sense that they are homologous, and genes in any lower-level cluster represent a group of functionally equivalent genes with higher specificity (or higher resolution).

The functional equivalence relationships among genes at different resolutions are derived using a two-level classification scheme [14]. The algorithm first derives the functional relationships among individual gene pairs based on their sequence similarity and their co-location information in genomes. It then derives the functional relationships among groups of genes by detecting groups with high densities of pair-wise functional relationships within each group versus (relatively) lower densities of relationships between each group and the genes outside it. For each predicted gene cluster (group), we also provide a p-value to measure how strongly the cluster stands out from the background of surrounding genes. In some sense, this value also reflects the consistency of annotation within a gene group, i.e., the annotation quality.

By applying this classification scheme to the genes of 224 prokaryotic genomes, we have established a database, HCG, of functionally equivalent gene clusters. Intuitively, the HCG system can be viewed as a “forest” of trees, where each tree consists of a root-level cluster and its descendant clusters, possibly at different levels. For each cluster in the HCG system, we provide an annotation that characterizes the common biological function of the cluster, based on the Gene Ontology (GO) annotation (GOA Proteome Sets) and the NCBI gene-product descriptions. Other information such as Pfam and COG annotation is also provided for cross-reference purposes.

2. Construction and Content

2.1 The Construction of the Database

The HCG database currently consists of the classification result for 224 complete prokaryotic genomes (NCBI release of 03/05/2005).
While a detailed description of the clustering algorithm and an analysis of the data have been published elsewhere [14], we outline here the procedure for database construction and application. The HCG system has been created using the following steps:

(a) All homologous gene pairs are identified using reciprocal BLASTP [15] with E-values < 1 for both directions of the search against all the 658,174 genes.

(b) The Smith-Waterman algorithm [16] is performed on all homologous gene pairs selected in (a) to obtain a multi-value feature vector for each homologous gene pair, representing the quality of their sequence alignment.

(c) A positive training set consisting of orthologous gene pairs and a negative training set consisting of homologous but non-orthologous gene pairs are created for the purpose of training a classifier (see [14] for details).

(d) A parameterized linear classification function is employed to discriminate orthologous genes from homologous but non-orthologous genes; its parameters are selected so that the classification function optimally discriminates the positive from the negative training data.

(e) A scoring scheme is developed to measure the functional equivalence between two genes based on the sequence similarity information derived from (d) and genomic neighborhood information derived from three operon prediction programs, namely (i) VIMSS [17], (ii) JPOP [18, 19], and (iii) GeneChords [20].

(f) A graph representation is constructed to represent all the 658,174 genes from 224 prokaryotic genomes and their functional equivalence relationships defined in (e).

(g) A graph-partition algorithm is applied to the graph representing these genes and their functional relationships to generate a collection of dense sub-graphs (and sub-sub-graphs, etc.), each of which represents a gene cluster. These gene clusters form a hierarchical structure. For each cluster, a p-value is calculated to assess its statistical significance.
(h) Each gene cluster is annotated using a set of keywords and GO terms, based on common features of the NCBI and GO annotations [10] of the individual genes of the cluster. The keywords are extracted from the NCBI description of each gene product, and the GO terms for each cluster are selected by a majority-rule vote among the GO assignments to individual genes in the cluster.

(i) All gene-classification data are integrated into a MySQL database, and a web server is created at http://csbl.bmb.uga.edu/HCG to facilitate searching and accessing the database.

The validity of the predicted gene clusters is checked by comparing the HCG classification against the genome taxonomy, the COG classification [1] and the Pfam classification [2] of genes. The detailed validation procedure and results are given in [14].

2.2 Database Tables

To store the tree structure of the HCG system in a MySQL relational database, we have designed two tables, Node and Edge (shown in Figure 1), to represent the HCG clusters and their parent-child relationships. Other information such as gene attributes, cluster annotations, and the p-value of each cluster is also stored in MySQL tables. Figure 1 shows the relationships among the tables. The table “Gene” stores the information of individual genes, such as gene attributes. The tables “GO”, “Gene_GO” and “Node_GO” store GO terms, GO annotations for individual genes and GO term-based annotations for individual clusters, respectively. The table “Gene_Node” stores the genes in each cluster, and the table “Species” stores the species information of a genome. Several additional internal tables are not shown in Figure 1 and are omitted from further discussion.
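The Node/Edge design described above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module in place of MySQL so that it runs without a server, and the column names (node_id, hcg_code, p_value, parent_id, child_id) are assumptions made for illustration; the actual HCG schema is the one shown in Figure 1.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with hypothetical Node and Edge tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Node (node_id INTEGER PRIMARY KEY,
                           hcg_code TEXT UNIQUE,
                           p_value REAL);
        CREATE TABLE Edge (parent_id INTEGER REFERENCES Node(node_id),
                           child_id  INTEGER REFERENCES Node(node_id));
    """)
    # A few clusters from the HCG-21 tree, stored by their dotted codes.
    codes = ["21", "21.2", "21.3", "21.3.0", "21.3.1"]
    conn.executemany("INSERT INTO Node (hcg_code) VALUES (?)",
                     [(c,) for c in codes])
    # Each Edge row links one parent cluster to one child cluster.
    edges = [("21", "21.2"), ("21", "21.3"),
             ("21.3", "21.3.0"), ("21.3", "21.3.1")]
    conn.executemany(
        """INSERT INTO Edge (parent_id, child_id)
           SELECT p.node_id, c.node_id FROM Node p, Node c
           WHERE p.hcg_code = ? AND c.hcg_code = ?""", edges)
    return conn

def descendants(conn, root_code):
    """Return the codes of all clusters below root_code, one query per level."""
    found, frontier = [], [root_code]
    while frontier:
        parent = frontier.pop()
        rows = conn.execute(
            """SELECT c.hcg_code FROM Edge e
               JOIN Node p ON p.node_id = e.parent_id
               JOIN Node c ON c.node_id = e.child_id
               WHERE p.hcg_code = ?""", (parent,)).fetchall()
        for (code,) in rows:
            found.append(code)
            frontier.append(code)
    return sorted(found)
```

Calling descendants(build_demo_db(), "21") walks the Edge rows downward from the root cluster. The walk is iterative, issuing one query per visited cluster, so it does not rely on recursive SQL features.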
2.3 Information Available at HCG

HCG stores and facilitates access to the basic information about each gene in its database, including a gene’s position in a genome, PID, locus tag, chain ID, COG number, gene product description, gene name, sequence, etc., all extracted from the NCBI database. In addition, we have run COGNITOR [21] to generate COG numbers for all genes, including genes both with and without a functional assignment in the NCBI database. We therefore have COG numbers for the vast majority of the genes in HCG. We have also integrated the GO annotations and Pfam accession IDs into the HCG database in a similar fashion.

In addition to the information extracted from other data sources, HCG has a large quantity of its own data. At the highest level, HCG is a forest of trees, each being a collection of gene clusters that are either parallel to or part of each other. At the top level of each tree is a cluster containing all genes in the tree, which are homologous to each other. Each lower-level cluster consists of genes that are functionally more equivalent than the genes in its parent cluster. For each cluster, we have calculated a p-value to estimate the statistical significance of the genes in this cluster forming an outstanding cluster against the background of other genes [14].

For each gene cluster, we assign its functional annotation using two methods. First, we assign GO terms to each cluster based on a majority-rule vote using the GO annotations of individual genes in the cluster [14]. For each HCG cluster in which some individual genes have been annotated by GOA, one or more consensus GO terms are generated and used to annotate the cluster. A probability value is calculated for each of the consensus GO terms, which can be used to assess the reliability of each function assignment: the higher the probability, the higher the prediction reliability.
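As a rough illustration of the majority-rule vote, the sketch below keeps every GO term carried by more than half of the annotated genes in a cluster. The exact voting rule and the probability calculation used by HCG are given in [14] and are not reproduced here; the function name, the min_fraction threshold, and the use of the supporting fraction as a stand-in for the probability value are all assumptions.

```python
from collections import Counter

def consensus_go_terms(gene_to_terms, min_fraction=0.5):
    """Return {GO term: fraction} for terms carried by more than
    min_fraction of the annotated genes in a cluster.

    gene_to_terms maps a gene ID to an iterable of GO term IDs; genes
    with no GO terms are ignored when computing the fractions.
    """
    annotated = [set(terms) for terms in gene_to_terms.values() if terms]
    if not annotated:
        return {}
    # Count, for each term, how many annotated genes carry it.
    counts = Counter(term for terms in annotated for term in terms)
    total = len(annotated)
    return {term: n / total
            for term, n in counts.items()
            if n / total > min_fraction}
```

Unannotated genes are excluded from the denominator here, so a cluster with few GOA-annotated members can still receive a consensus term; whether HCG treats such genes the same way is not stated in the text.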
We have also assigned text descriptions to each gene cluster; these are derived from the NCBI gene product descriptions of individual genes and describe the overall function of the cluster. For each cluster, we calculate a consistency score between 0 and 1 that measures the consistency among the NCBI descriptions of the individual genes of the cluster, with 1 representing the most consistent and 0 the least consistent. A detailed description of the algorithm is given in [14]. A user can use both the cluster GO annotation and the text description to infer the function of the genes assigned to each cluster.

2.4 HCG Data Statistics

The HCG database consists of 658,174 genes from 224 genomes, comprising 376 DNA chains (both chromosomes and plasmids) from NCBI (release of 03/05/2005). Among the 658,174 genes, 609,887 are assigned HCG codes. 139,495 genes have COG numbers extracted from the NCBI database, and 459,955 genes are assigned COG numbers by running COGNITOR [21]. When comparing the COGNITOR-calculated COG numbers with the NCBI-assigned ones, we noticed that only 108,620 genes have the same COG numbers, while the other 30,875 genes have different COG numbers. This inconsistency most likely comes from the multiple COG numbers returned by COGNITOR. 318,326 genes have been assigned GO terms in [10].

HCG has 51,205 clusters of genes (clusters are numbered consecutively in an arbitrary manner, as are the sub-clusters, sub-sub-clusters, etc.), organized into 21,109 HCG trees. Among these trees, 2,092 have more than 50 genes, totaling 518,703 genes. 10,716 trees are annotated with text descriptions, covering 568,717 genes. 4,877 trees are annotated with cluster GO terms, covering 500,996 genes. 4,330 trees have both cluster GO terms and text descriptions, covering 497,350 genes.
182,670 genes that are not annotated in Integr8 [10] are successfully annotated by HCG; and of the genes annotated by both HCG and Integr8, most are annotated with more specific GO terms in HCG than in Integr8. By combining the text description with the cluster GO annotation, a clear functional description of each gene can be inferred.

The HCG database is implemented using MySQL 4.0.18, running on a SuSE 9.0 Linux computer with 4 GB of memory and two 2.8 GHz Xeon processors. A web interface, hosted by an Apache 2.0.40 web server, has been developed to facilitate access to the database through the Internet. The PHP server-side scripting language is used to create dynamic web pages. The response time for browsing most pages of the HCG database server is less than one second, while the response time of the “query” page depends on the complexity of the query and is typically within a couple of seconds.

3. Utility and Discussion

3.1 Web Access

The HCG database can be accessed at http://csbl.bmb.uga.edu/HCG. A user can retrieve data using one of the following four methods. The first is to browse HCG hierarchically. The user can start from the virtual root of the “forest” to list all the trees, select a tree from this list, and then move down to its descendants. The second method is to browse the gene annotation for each species. The user can select a specific species and a chain, and browse the HCG annotation page by page. The third method is to search the HCG database for genes using keywords selected from a prepared list of fields. The user can specify the value of any gene attribute, such as words in the product description, the HCG number of the genes, or a species name. The user can also combine these conditions using “AND” and “OR”.
In the fourth method, the user can submit his/her own protein sequence to the server to find the related HCG IDs, and then annotate the sequence using the GO numbers and text descriptions associated with the returned HCG IDs. Figure 2 shows a workflow for page browsing and a few screenshots of using HCG.

3.2 Gene Annotation at Multiple Resolutions by HCG

As discussed in [14], the multi-level classification scheme provides substantially more information than one- or two-level classification schemes such as COG [1] and Pfam [2]. Figure 3 shows the structure of the HCG tree rooted at cluster “HCG-21” and its descendant clusters. Among the 1,294 genes included in cluster HCG-21, 1,089 genes are assigned GO terms; 98.3% and 97.6% of these 1,089 genes are annotated as GO:0000155 (two-component sensor activity) and GO:0005524 (ATP binding activity), respectively. Hence the biological functions of the HCG-21 genes can be summarized using GO:0000155 and GO:0005524, and those HCG-21 genes without an identified biological function are predicted to have the biological functions defined by the cluster, i.e., GO:0000155 and GO:0005524.

Compared with these GO annotations assigned to the root-level cluster, the hierarchical structure of HCG-21 provides much richer functional information for genes in the lower-level sub-clusters of this cluster. For example, a large portion of the genes in HCG-21 are further partitioned into 38 child-level clusters labeled “HCG-21.0” to “HCG-21.37”. The numbers of genes in these clusters range from 3 to 91. Almost all of these child-level sub-clusters are annotated with more specific functions, using GO terms and NCBI-based text descriptions, than their parent cluster “HCG-21”. As we demonstrate using the following examples, genes in the same child cluster do have a stronger functional relationship than genes in the parent cluster.
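The dotted cluster labels used above carry the position of a cluster in its tree: “HCG-21.3.0” sits inside “HCG-21.3”, which sits inside the root “HCG-21”. The following sketch recovers that chain from a label alone. It assumes that every label mirrors the parent-child structure in this way, as the examples in this section suggest; the helper names are hypothetical and are not part of the HCG server.

```python
def ancestors(hcg_code):
    """List the enclosing clusters of an HCG code, from parent up to root."""
    parts = hcg_code.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]

def is_within(hcg_code, cluster):
    """True if hcg_code is the cluster itself or one of its descendants."""
    # The trailing dot keeps "HCG-21.30" from matching cluster "HCG-21.3".
    return hcg_code == cluster or hcg_code.startswith(cluster + ".")
```

For a gene assigned to “HCG-21.3.0”, ancestors gives the sequence of progressively coarser annotations available for it, ending at the root-level (homology) cluster.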
Cluster “HCG-21.0” contains 91 kdpD genes, all of which are sensor genes for the high-affinity potassium transport system; and cluster “HCG-21.4” contains 46 phoR genes, which are all sensor genes in the phosphate regulons. Some of the other child-level clusters each contain genes of similar but distinct biological functions, and these are further divided into grandchild-level sub-clusters containing genes with equivalent functions at higher resolution. For example, the cluster “HCG-21.3” contains 49 genes annotated as either “cpxA” (the envelope stress sensor genes) or “envZ” (the osmolarity sensor genes). At its child level, the genes of “HCG-21.3” are further grouped into two smaller sub-clusters, “HCG-21.3.0” and “HCG-21.3.1”, which contain the “cpxA” and “envZ” genes, respectively. The fact that these “cpxA” and “envZ” genes are grouped in the same cluster “HCG-21.3” suggests that the cpxA and envZ genes are more equivalent to each other than they are to other genes. This is supported by their NCBI annotation, where both “cpxA” and “envZ” genes are annotated as sensing extracellular pressure, with “envZ” genes specifically sensing the pressure from water (i.e., osmolarity). The same can be said of another child-level cluster, “HCG-21.2”, which contains 52 genes annotated as either “vanS” or “resE”. At the grandchild level, these “vanS” and “resE” genes are further grouped into two smaller clusters, “HCG-21.2.0” and “HCG-21.2.1”, which contain the “vanS” and “resE” genes, respectively. Among the 1,294 HCG-21 genes, 689 cannot be further grouped into lower-level clusters, suggesting that these genes can only be annotated at low resolution, i.e., “two-component sensor activity” and “ATP binding activity”, because of their high functional diversity.

Interestingly, while the annotations derived from NCBI descriptions match our gene clusters well, the GO annotations we derived from the GOA database are not as specific.
For example, most genes in cluster “HCG-21” are assigned two GO terms, GO:0000155 (two-component sensor activity) and GO:0005524 (ATP binding activity), so we cannot make any more specific GO assignment for any of the descendant clusters of “HCG-21”. However, since we have used different information sources in our gene/cluster annotation, we have achieved annotations with higher specificity. This also indicates that, to obtain more specific gene function annotation, one should look at more information sources. It should be noted that although our GO-based and NCBI-based annotations do not conflict, in general the GOA-based annotations are not as specific as the NCBI-based ones.

3.3 Application Examples

We now illustrate how to use the HCG database and demonstrate the power of the HCG system for functional prediction of genes, using the following examples.

Example 1: find the function of a gene. Suppose we want to find out the function of gene “GI-16801886” of Listeria innocua Clip11262. The gene product is labeled as a “hypothetical protein” in the NCBI database. The COG number of this gene is COG0745, which represents the gene class of “response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain”. This annotation is not particularly useful, as 3,119 genes are assigned this COG number across the 224 genomes covered by HCG. The GO annotation of this gene is GO:0000156 (two-component response regulator activity) and GO:0003677 (DNA binding), which is not very specific either, as 3,866 genes in HCG are annotated with both GO terms. To use the HCG system to derive more specific functional information for this gene, a user can follow these steps.

1) Go to the HCG main page at http://csbl.bmb.uga.edu/HCG, and then click the link “Search” to bring up the “Query Builder” page.

2) Fill in the query information with “GI == 16801886”, and leave the other entries blank. Then click “Submit” to query the database.
3) The search will return the HCG code of gene 16801886 as “10.3.0” in the result page. Then click the link “10.3.0” to bring up the annotation page for this HCG cluster.

4) In the annotation page of “10.3.0”, the gene name for “10.3.0” is “kdpE”, and there are also four descriptions of the specific function of “10.3.0”: i) “kdp operon transcriptional regulatory protein kdpE”; ii) “two-component regulatory protein response regulator kdpE”; iii) “putative turgor pressure regulator”; iv) “probable transcriptional regulator”. The first two annotations indicate a more specific function, while the other two indicate a general function. HCG has extracted all four descriptions because the scores for all of them are above our threshold. Clearly HCG provides much more specific functional information about this gene than the other functional classification databases.

Example 2: find a gene which carries a specific function. Suppose we want to find out which gene encodes the protein “bioA”, an important gene in biotin synthesis, in Vibrio fischeri ES114. We know that another name of bioA is “7,8-diaminopelargonic acid synthetase”. To find “bioA” in “Vibrio fischeri ES114”, the user needs to do the following.

1) First we need to find out which HCG cluster represents “bioA” genes. To do this, go to the HCG main page at http://csbl.bmb.uga.edu/HCG, and then click the link “Search” to bring up the “Query Builder” page.

2) Set the query information to “(Gene == bioA) OR (Product include 7,8-diaminopelargonic)”, and leave the other entries blank. To construct the query, the user needs to click the checkboxes corresponding to “(” and “)” in condition1 and condition2, and the radio button corresponding to “Or” in condition2. Then click “Submit” to query the database.

3) The user should now see the returned gene cluster labeled as cluster “69”, with some of its genes further clustered into “69.1”, “69.4”, “69.5”, “69.8”, etc.
Many of these genes are annotated as “adenosylmethionine-8-amino-7-oxononanoate (KAPA) aminotransferase”. It should be noted that the reactant of the “7,8-diaminopelargonic acid” (DAPA) synthesis reaction is “7-keto-8-aminopelargonic acid” (another name for KAPA). Since these genes are from several different bacterial genomes, one needs to find the gene in the right genome. The user should click the link “69” to bring up its HCG annotation page.

4) By checking the annotation page of HCG cluster “69” and the annotation pages of some of its children, such as “69.1”, “69.4”, “69.5” and “69.8”, the user can see that the child clusters are annotated as “bioA”. By checking the genes in the child clusters, the user should be able to see why they are further clustered: the genes in the same cluster belong to more closely related species.

5) Therefore one can determine that some child clusters of “69” are related to “bioA”, and that their parent cluster “69” might include “bioA” homologs. Now the user should go back to the “Query Builder” page at http://192.168.0.3/HCG/query_builder.php, enter the query “(HCG Begin_With 69.) AND (Species_Name include Vibrio fischeri ES114)”, and submit.

6) In the result page, the user should see three genes, NCBI:59712891, NCBI:59713931 and NCBI:59714306, in cluster “69”. Their HCG codes are “69.2”, “69.1.0.0” and “69.6”, respectively. Checking the annotations of these three HCG clusters shows that only “69.1.0.0” is annotated as “bioA”, so the user can confidently conclude that gene NCBI:59713931 encodes “bioA” in Vibrio fischeri ES114, and that its enzyme name is either “7,8-diaminopelargonic acid (DAPA) synthetase” or “adenosylmethionine-8-amino-7-oxononanoate (KAPA) aminotransferase”.

Example 3: annotate the function of new genes from a newly sequenced genome. Two new cyanobacterial genomes have recently been sequenced by Grossman’s lab (personal communication), and these genomes are not included in the current release of HCG.
Here we use gene NCBI:86604767 as an example to illustrate how to use HCG to annotate the function of a new gene.

1) Go to the HCG main page at http://csbl.bmb.uga.edu/HCG, and then click the link “MyHCG” to bring up the sequence input page.

2) Enter the sequence of gene NCBI:86604767, and click “submit”.

3) HCG returns 10 genes in the database as hits, with cluster “6514.” ranked as the No. 1 hit.

4) Click the link “6514.” to open the annotation page for this cluster, which shows that the description is “photosystem I subunit XI” and the gene name is “psaL”.

5) The user can also click the link “Display Hit Genes” to display all the hit genes; the descriptions for these genes are “photosystem I subunit XI” or “photosystem I reaction center subunit XI”.

6) The function information obtained from both 4) and 5) can be used to annotate the gene NCBI:86604767.

A user can also send the sequence to COGNITOR. For this example, it returned “NO related COG”, suggesting that COG does not have an annotation for it. We have also sent the sequence to the Pfam server, which returned “PF02605”, representing “Photosystem I reaction centre subunit XI”, which is consistent with the HCG annotation. We noted that KEGG does not allow such data retrieval.

4. Conclusion

We have developed a database, HCG, for hierarchical classification of functionally equivalent genes, which can be used to annotate genes at multiple resolutions, depending on the availability of related data. The HCG system is based on a new method for predicting functional relationships by combining sequence similarity and genomic context information. The hierarchical organization of genes, grouped together with other functionally equivalent genes, facilitates functional annotation of new genes with higher accuracy compared to other functional classification schemes. We plan to extend this system to include all complete prokaryotic genomes in the very near future, and to update it on a regular (monthly) basis.
We expect that this new system for gene annotation will provide the biological community with a powerful tool for genome analysis and annotation.

Availability and requirements

The database can be accessed at http://csbl.bmb.uga.edu/HCG. Users who want to analyze the whole database can download the classification data at http://csbl.bmb.uga.edu/HCG/HCG.tar.gz. The database is freely available for academic users; non-academic users should contact the corresponding author to obtain a license. Any modern Internet browser should be capable of using the online database server.

Authors' contributions

Fenglou Mao designed the database and implemented the online server; Fenglou Mao and Hongwei Wu worked together to generate the data of HCG; Victor Olman designed the hierarchical clustering program; Ying Xu coordinated the whole procedure and provided the financial support.

Acknowledgement

This work was supported in part by the National Science Foundation (NSF/DBI-0354771, NSF/ITR-IIS-0407204, NSF/DBI-0542119) and by a “Distinguished Scholar” grant from the Georgia Cancer Coalition.

References

1. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278(5338):631-637.
2. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R et al: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(Database issue):D247-251.
3. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L et al: InterPro, progress and status in 2005. Nucleic Acids Res 2005, 33(Database issue):D201-205.
4. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32(Database issue):D277-280.
5. Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen IT, Peralta-Gil M, Karp PD: EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res 2005, 33(Database issue):D334-337.
6. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R et al: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33(17):5691-5702.
7. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C et al: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32(Database issue):D258-261.
8. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32(Database issue):D262-266.
9. Markowitz VM, Korzeniewski F, Palaniappan K, Szeto E, Werner G, Padki A, Zhao X, Dubchak I, Hugenholtz P, Anderson I et al: The integrated microbial genomes (IMG) system. Nucleic Acids Res 2006, 34(Database issue):D344-348.
10. Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I et al: Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res 2005, 33(Database issue):D297-302.
11. Fulton DL, Li YY, Laird MR, Horsman BG, Roche FM, Brinkman FS: Improving the specificity of high-throughput ortholog prediction. BMC Bioinformatics 2006, 7:270.
12. Wall DP, Fraser HB, Hirsh AE: Detecting putative orthologs. Bioinformatics 2003, 19(13):1710-1711.
13. Mao F, Su Z, Olman V, Dam P, Liu Z, Xu Y: Mapping of orthologous genes in the context of biological pathways: An application of integer programming. Proc Natl Acad Sci U S A 2006, 103(1):129-134.
14. Wu H, Mao F, Olman V, Xu Y: Hierarchical Classification of Functionally Equivalent Genes of Prokaryotes. Nucleic Acids Res 2007 (accepted).
15. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.
16. Smith TF, Waterman MS: Comparison of biosequences. Advances in Applied Mathematics 1981, 2(4):482-489.
17. Price MN, Huang KH, Alm EJ, Arkin AP: A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res 2005, 33(3):880-892.
18. Chen X, Su Z, Dam P, Palenik B, Xu Y, Jiang T: Operon prediction by comparative genomics: an application to the Synechococcus sp. WH8102 genome. Nucleic Acids Res 2004, 32(7):2147-2157.
19. Chen X, Su Z, Xu Y, Jiang T: Computational Prediction of Operons in Synechococcus sp WH8102. Proceedings of 15th International Conference on Genome Informatics 2004:211-222.
20. Zheng Y, Anton BP, Roberts RJ, Kasif S: Phylogenetic detection of conserved gene clusters in microbial genomes. BMC Bioinformatics 2005, 6:243.
21. Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000, 28(1):33-36.

Figure 1: HCG database table relationship

Figure 2: A screenshot of the HCG browser

Figure 3: The tree structure of cluster HCG-21, consisting of a group of two-component sensors. A circle represents a cluster which cannot be further divided; a rectangle represents a cluster containing only genes from the same genome; a triangle represents a cluster that does not have genes from the same genome. Colors do not have any particular meaning here.