Download Prediction of Gene Function Using Gene Clusters and Genomic

Thesis Project Report

Improving Gene Function Prediction Using Gene Neighborhoods

Kwangmin Choi
Bioinformatics Program, School of Informatics, Indiana University, Bloomington, Indiana

Contents
1. Introduction
2. Scope
3. Materials & Methods
4. Results
5. Discussion
6. References

1. INTRODUCTION

An operon is a group of adjacent, co-expressed and co-regulated genes that encode functionally linked proteins in prokaryotic genomes. It has been reported that many operons, particularly those that encode subunits of multiprotein complexes, are conserved in phylogenetically distant prokaryotic genomes, even between archaea and bacteria. This is due in part to the conservation of operons over long stretches of evolutionary time since the last common ancestor, and in part to horizontal transfer of operons among prokaryotes (selfish-operon hypothesis; Ref. 1 and Ref. 2). Natural selection favors horizontal transfer of a whole operon over transfer of individual genes, because co-expression and co-regulation are preserved in the former case.

Operons tend to undergo multiple rearrangements during evolution. As a result, gene order above the level of the operon is poorly conserved, and genome comparison diagonal plots appear completely disordered even between phylogenetically very close species (Ref. 4). It was therefore noticed early on that conservation of a gene neighborhood is biologically meaningful (Ref. 5 and Ref. 6).

Computational algorithms to locate operons have been developed previously, primarily for Escherichia coli (Ref. 8 and Ref. 9). Earlier methods fall into three groups. (1) The first group searches for signals that occur at operon boundaries: promoters at the 5' end and terminators at the 3' end. Such approaches are useful only when transcription signals are completely known, and even in E. coli the sequence motifs of promoters and terminators are not completely characterized.
(2) Another group of methods uses a combination of gene expression data, functional annotation and other experimental data, and is therefore applicable mainly to well-studied genomes such as E. coli. (3) Finally, some methods rely on the intergenic distance between adjacent genes and on gene annotation. Such methods are based on finding gene clusters in which gene order and orientation are conserved in 2 or more genomes (Ref. 8, Ref. 9, Ref. 10, Ref. 11, Ref. 12).

In this study, a new method based on the third approach was investigated: gene neighborhoods are reconstructed using sequence homologs among 2 or more complete bacterial genomes. The underlying idea is that different genomes contain different, overlapping parts of evolutionarily and functionally connected gene neighborhoods, and that the entire neighborhood can be reconstructed by generating a tiling path through these overlaps. A method to predict the cellular functions of uncharacterized gene products more reliably was also investigated.

2. SCOPE

This project is part of PLATCOM (A Platform for Computational Comparative Genomics), a project led by Dr. Sun Kim (Bioinformatics, Indiana University, Bloomington). Its first goal was to give public users a web interface to our protein-protein and genome-genome pairwise comparison databases on the BIOKDD server, and to build an integrated system for the comparative analysis of multiple genomes. On top of this web-based platform, several multiple-genome analysis tools were written in Perl and embedded into the PLATCOM system. Complete gene neighborhoods were reconstructed from gene clustering data, and this genomic context was analyzed to predict operons or operon-like gene clusters. The web interface of the PLATCOM system is located at http://biokdd.informatics.indiana.edu/kwchoi/platcom/

3. MATERIALS & METHODS

3.1. PLATCOM

The PLATCOM system consists of 3 components: (1) databases of biological entities (e.g.
DNA sequence files (fna), protein sequence files (faa) and integrated genomic information files (ptt, gbk)), (2) databases of relationships among entities (e.g. genome-genome and protein-protein pairwise comparison files) and (3) mining tools over these databases.

Fig 1. PLATCOM interface

3.2. Genomes

On the PLATCOM web page, users can choose multiple genomes for a comparative genomics study. A total of 23 genomes (15 phylogenetic groups) were chosen for this study; they are listed in Table 1.

 1  Bacteria; Proteobacteria; beta subdivision; Neisseriaceae; Neisseria
      AL157959  Neisseria meningitidis serogroup A strain Z2491, complete genome
 2  Bacteria; Proteobacteria; gamma subdivision; Pseudomonas group; Pseudomonas
      AE004091  Pseudomonas aeruginosa PAO1, complete genome
      BA000003  Buchnera sp. APS, complete genome
 3  Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae; Escherichia
      L42023    Haemophilus influenzae Rd, complete genome
      U00096    Escherichia coli K-12 MG1655, complete genome
      AL590842  Yersinia pestis CO92, complete genome
 4  Bacteria; Proteobacteria; gamma subdivision; Vibrionaceae; Vibrio
      AE003852  Vibrio cholerae chromosome I, complete chromosome
      AE003853  Vibrio cholerae chromosome II, complete chromosome
 5  Bacteria; Proteobacteria; epsilon subdivision; Campylobacter group; Campylobacter
      AL111168  Campylobacter jejuni, complete genome
 6  Bacteria; Proteobacteria; epsilon subdivision; Helicobacter group; Helicobacter
      AE000511  Helicobacter pylori 26695, complete genome
 7  Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae; Salmonella
      AL513382  Salmonella typhi strain CT18, complete chromosome
      AE006468  Salmonella typhimurium LT2, complete genome
 8  Bacteria; Firmicutes; Bacillus/Clostridium group; Mollicutes; Mycoplasmataceae; Mycoplasma
      L43967    Mycoplasma genitalium G37, complete genome
      U00089    Mycoplasma pneumoniae M129, complete genome
 9  Bacteria; Firmicutes; Actinobacteria; Actinobacteridae; Actinomycetales; Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium tuberculosis complex
      AL123456  Mycobacterium tuberculosis, complete genome
10  Bacteria; Firmicutes; Bacillus/Clostridium group; Bacillus/Staphylococcus group; Bacillus
      BA000004  Bacillus halodurans C-125, complete genome
      AL009126  Bacillus subtilis, complete genome
11  Bacteria; Firmicutes; Bacillus/Clostridium group; Streptococcaceae; Streptococcus
      AE005672  Streptococcus pneumoniae TIGR, complete genome
      AE004092  Streptococcus pyogenes strain SF370 serotype M1, complete genome
12  Bacteria; Firmicutes; Bacillus/Clostridium group; Bacillus/Staphylococcus group; Staphylococcus
      BA000017  Staphylococcus aureus strain Mu50, complete genome
13  Bacteria; Thermus/Deinococcus group; Deinococcales; Deinococcus
      AE000513  Deinococcus radiodurans R1, complete chromosome 1
14  Bacteria; Chlamydiales; Chlamydiaceae; Chlamydia
      AE002161  Chlamydophila pneumoniae AR39, complete genome
15  Archaea; Euryarchaeota; Methanococcales; Methanococcaceae; Methanococcus
      L77117    Methanococcus jannaschii, complete genome

Table 1. A list of genomes used in this study. http://www.infobiogen.fr/services/deambulum/english/genomes2a.html

3.3. All pairwise protein comparison data

The database of all pairwise protein comparisons was built using FASTA and embedded into the PLATCOM system. The all-pairwise genome-genome (DNA) sequence comparison database was not used for this study.

3.4. Ptt files

The ptt file of each genome was used to obtain the locations of open reading frames (ORFs), gene names, Clusters of Orthologous Groups of proteins (COG) numbers, the functions of gene products and other information. Ptt files contain both experimental and computed/hypothetical data. If a gene product has not yet been identified, the ptt file does not provide COG and PID numbers for that protein; instead, it suggests hypothetical function(s) for the putative gene product when well-characterized sequence homologs exist.

3.5. Genome comparison diagonal plot generating tool

A Perl/CGI program that generates genome comparison diagonal plots was embedded into the PLATCOM system. This tool helps users understand gene order conservation between 2 chosen genomes. It can generate 2 kinds of diagonal plots: one from protein-protein pairwise comparison data and the other from whole genome-genome pairwise comparison data. The tool is simple, but it gives a strong intuition for genome structure.

3.6. Gene clustering data

Another program was written in Perl to generate gene clustering data for multiple genomes. A user can select any number of genomes on the PLATCOM interface and choose the minimum protein length and other cutoff values (Z-score and E-value). Perl's Graph package was used to perform the clustering. Each gene pair was used as input (vertices) to the graph for clustering. The output is a list of groups of genes clustered according to sequence homology. Two proteins are placed in the same cluster if both are longer than the length cutoff, their Z-score is greater than the Z-score cutoff and their E-value is smaller than the E-value cutoff. For this study, the Z-score had to be greater than 750, the E-value less than 1e-20 and the protein length at least 50. The clustering result is sent to the user by e-mail. As a reality check, the COG database (http://www.ncbi.nlm.nih.gov/COG/) was used to check whether each cluster contained only orthologs. Because it is relatively slow, this algorithm will be replaced with Dr. Kim's own clustering algorithm.

3.7. Reconstructing gene clusters/gene neighborhoods/genomic context

Another web tool, written in Perl and embedded into PLATCOM, reconstructs gene clusters/gene neighborhoods. Given a clustering data file, a gene with its genome ID (GI number) and an intergenic distance, the tool looks up all other genes in the same clustering group in the gene clustering data and finds all possible gene clusters/neighborhoods that contain the gene and its homologs. Intergenic distance and DNA strand (+/-) are used to decide whether 2 adjacent genes are connected. The default intergenic distance cutoff is 300 base pairs (Ref. 11). If the intergenic distance exceeds the cutoff, the 2 adjacent genes are not connected and are considered to belong to different gene clusters. A user can choose all or some of the gene clusters to generate multiple alignments with ClustalW. Fig 2 shows one of the reconstruction results.

Fig 2. Gene neighborhood reconstruction result

3.8. Prediction of gene function using genomic context

Different genomes contain different, overlapping parts of evolutionarily and functionally connected gene neighborhoods, and the entire gene neighborhood can be reconstructed by generating a "tiling path" (Ref. 12). Although over 150 genomes have been completely sequenced to date, the biological functions of many genes are still uncharacterized, and many genomic data files do not provide COG numbers, PID numbers or other biologically meaningful information for such genes. In this case, gene function cannot be predicted from these files alone. Instead, we need to use the genomic context of well-characterized genomes such as E. coli. That is, we start by observing gene neighborhoods in these well-characterized genomes and then compare them with gene clusters/neighborhoods in the target genomes to analyze the genomic context of the targets. Here, genomic context means the pattern of "runs" of COG numbers. With this strategy, the cellular functions of unknown genes and genomic "hitchhikers" can be found. "Hitchhikers" are inserted genes that originate from different contexts/themes.

Fig 3.
Tiling Path Method. (Cited from Ref. 12)

As a reality check, the BioCyc database (http://biocyc.org:1555/ECOLI/class-subsinstances?object=Transcription-Units) and the TIGR site (http://www.tigr.org/tigrscripts/operons/operons.cgi) were used to compare the reconstructed gene clusters with experimentally proven or computationally predicted transcription units. Unfortunately, only Escherichia coli has full transcription unit information at this time.

4. RESULTS

4.1. In general, gene order is not conserved, even between phylogenetically close genomes

Several genome comparison diagonal plots were generated with the embedded web tool (see Materials & Methods) to investigate the relationship between phylogenetic distance and the conservation of gene order. The tool finds homologs between 2 given genomes and plots them in 2D space (Z-score > 750). The basic idea is that if the gene order of two genomes is well conserved, the homologs should appear as a line on the genome comparison diagonal plot. As expected, gene order between phylogenetically distant species is poorly conserved (Fig 4).

Fig 4. Phylogenetically distant genomes: (a) Escherichia coli K-12 vs Sulfolobus solfataricus (b) Campylobacter jejuni vs E. coli K-12 (c) Agrobacterium tumefaciens C58 vs E. coli K-12

One of the most striking findings (Fig 5), however, is that gene order is not conserved even between phylogenetically close genomes (e.g. E. coli vs H. influenzae and E. coli vs V. cholerae).

Fig 5. Phylogenetically close genomes: (a) E. coli K-12 vs Haemophilus influenzae (b) E. coli K-12 vs Vibrio cholerae (c) E. coli vs Salmonella enterica Typhi

Instead, short runs of genes were found when some areas were zoomed in (Fig 6). They can be considered gene clusters if the intergenic distance is shorter than a given length (e.g. 300 bp). Whether some of these short runs are conserved in many other genomes was investigated in the next part.

Fig 6.
Fragmented gene clusters: E. coli K-12 vs V. cholerae

4.2. Rbs operon (Z-score > 750, Intergenic Distance < 300)

The Rbs operon of E. coli consists of a set of genes involved in ribose transport across the cell membrane. The cellular function of each gene is described in Table 2.

COG      Cellular function
COG1869  D-ribose high-affinity transport system; membrane-associated protein
COG1129  ATP-binding component of D-ribose high-affinity transport system
COG1172  D-ribose high-affinity transport system
COG1879  D-ribose periplasmic binding protein
COG0524  Ribokinase
COG1609  Regulator for rbs operon

Table 2. Cellular functions of Rbs operon components: E. coli

Fig 7. Reconstructed gene neighborhood of Rbs operon

From the clustering data of 23 genomes, gene neighborhoods were reconstructed by generating a tiling path (see Materials & Methods). Nine of the 23 genomes contain the Rbs operon, and their gene order appears well conserved (Fig 7). AE003852 (V. cholerae chromosome I) was ignored because the other V. cholerae chromosome (AE003853) has a longer gene cluster containing all 8 components. Seven of these 9 genomes have the full set of 8 components, but 1 component (COG1879) is missing in AE004091 and 3 (COG1172, COG0524 and COG1609) are missing in AL590842. The short run of genes [COG1869-COG1129-COG1172-COG1879-COG0524-COG1609] can be considered the conserved genomic context (general scheme) of the Rbs operon. The ptt files of AL513382 and AE006468 do not provide COG numbers for these 8 operon components, but their gene neighborhoods could still be reconstructed by generating a tiling path based on gene clustering data. The cellular functions of the operon components in these 2 genomes could be reasonably inferred when their genomic contexts were carefully examined.

4.3. Functional Coupling (Z-score > 750, Intergenic Distance < 300)

The next example shows a more complex genomic context with 2 functional themes (transcription control and translation control) and putative hitchhikers.
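
Both this example and the Rbs operon above rely on tiling-path reconstruction. As a rough illustration of the idea only, the Python sketch below greedily joins ordered runs of COG numbers whenever the end of one run matches the start of another. It is not the algorithm of Ref. 12 and not the Perl tool used in this study: the function names `overlap` and `tile` are hypothetical, strand orientation is ignored, and the three input fragments are made-up pieces of the Rbs scheme from Table 2, not observed data.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def tile(seed, fragments):
    """Greedily extend `seed` with fragments that overlap either of its ends."""
    path = list(seed)
    remaining = [list(f) for f in fragments]
    extended = True
    while extended:
        extended = False
        for frag in remaining:
            right = overlap(path, frag)   # fragment continues the path to the right
            left = overlap(frag, path)    # fragment continues the path to the left
            if right and right < len(frag):
                path = path + frag[right:]
            elif left and left < len(frag):
                path = frag[:-left] + path
            else:
                continue                  # no new genes to add from this fragment
            remaining.remove(frag)
            extended = True
            break                         # restart the scan with the longer path
    return path


# Illustrative (made-up) partial runs, as if seen in three different genomes.
seed = ["COG1172", "COG1879", "COG0524"]
fragments = [
    ["COG1869", "COG1129", "COG1172"],
    ["COG0524", "COG1609"],
]
tiled = tile(seed, fragments)
print("-".join(tiled))
```

With real data, each fragment would be a gene cluster from one genome, and the overlap test would compare homology cluster membership rather than exact COG labels.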
In bacterial genomes, transcription, translation and RNA modification/degradation are coupled, and the advantages of co-regulating the corresponding genes are obvious. Table 3 shows the cellular functions of the genes within the 22 gene clusters found.

Transcription
  COG0779  Uncharacterized conserved protein
  COG0195  Transcription elongation factor
  COG2740  Predicted nucleic-acid-binding protein (transcription termination?)
Translation
  COG1358  Ribosomal protein S17E
  COG0532  Translation initiation factor 2 (GTPase)
  COG1550  Uncharacterized conserved protein
  COG0858  Ribosome-binding factor A
  COG0184  Ribosomal protein S15P/S13E
  COG0130  tRNA pseudouridine synthase
Hitchhiker?
  COG0196  FAD synthase (hitchhiker?)

Table 3. Cellular functions of operon components: B. subtilis

Fig 8 shows the reconstructed gene neighborhoods of 22 genomes. Six components, [COG0779-COG0195-COG0532-COG0858-COG0130-COG0184], were common among the 22 genomes, so this pattern can be considered the general scheme of these gene neighborhoods. In 5 genomes (AL123456, BA000004, AL009126, AE005672, AE000513), the context [COG0779-COG0195-COG2740] was found. Although COG0779 is an uncharacterized protein, it may be a functional partner of COG0195 because it is always an adjacent neighbor of COG0195.

Fig 8. Reconstructed gene neighborhood

The context of translation control seems to be [COG0532-COG0858-COG0130-COG0184]. It is reasonable to consider COG0196 a genomic hitchhiker because it belongs to a different functional theme (energy metabolism, not translation control). Another uncharacterized protein, COG1550, is probably a hitchhiker, but it may still belong to the theme of translation control.

AE004092 has 3 uncharacterized gene products. They are sequence homologs of COG0779, COG2740 and COG1358, respectively. Homology alone is sometimes not enough to conclude that 2 proteins have the same cellular function. Genomic context can be used to confirm this guess.
The gene neighborhood of AE004092 is [? (yellow)-COG0195-? (sky blue)-? (purple)-COG0532...], which corresponds exactly to the general scheme of the other 21 genomes, [COG0779-COG0195-COG2740-COG1358-COG0532...]. It is therefore much more reasonable to say that these 3 uncharacterized proteins correspond to COG0779, COG2740 and COG1358.

5. DISCUSSION

Identifying operon structure is one of the most important issues in prokaryotic genomics, because the operon is the most prominent characteristic of bacterial genomes. In the classical concept, an operon is a set of adjacent genes regulated by one promoter. Based on this concept, there have been enormous efforts to identify operon structure by detecting promoter signals, but most of these trials were unsuccessful because promoters are very hard to locate. Although many promoters are found just upstream of the first component of an operon, others are found far from the operon itself, either upstream or downstream.

Many other approaches to identifying operon structure have therefore been introduced in recent years (Ref. 3). These methods focus on a set of adjacent genes on the same strand (a directon), and the location of promoters is not considered (Fig 8). Generally, these adjacent genes form a gene cluster in which intergenic distances are shorter than 200-300 bp, and there have been many interesting reports that the components of a gene cluster belong to the same or related functional themes. This is a major change in the way operon structure is sought. Predicting unknown gene clusters or operons is a difficult and error-prone procedure that has never been defined in algorithmic terms. However, detecting adjacent or close genes and reconstructing their connections is much easier if orthologous relationships have been identified correctly.
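
The distance-and-strand rule described above is simple enough to sketch in a few lines. The Python fragment below is a minimal illustration, assuming genes are given with start/end coordinates and a strand; the `Gene` class, the function name `split_into_clusters` and the coordinates are hypothetical and are not taken from the PLATCOM tools. The 300 bp default mirrors the cutoff used in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # first base of the gene
    end: int     # last base of the gene (inclusive)
    strand: str  # "+" or "-"


def split_into_clusters(genes, max_gap=300):
    """Group genes into runs that share a strand and whose intergenic
    distance to the previous gene is below `max_gap` base pairs."""
    clusters = []
    current = []
    for gene in sorted(genes, key=lambda g: g.start):
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap < max_gap:
                current.append(gene)
                continue
            clusters.append(current)         # close the finished run
        current = [gene]
    if current:
        clusters.append(current)
    return clusters


# Made-up coordinates, for illustration only.
genes = [
    Gene("a", 100, 400, "+"),
    Gene("b", 450, 900, "+"),
    Gene("c", 1500, 2000, "+"),
    Gene("d", 2050, 2500, "-"),
    Gene("e", 2520, 3000, "-"),
]
clusters = split_into_clusters(genes)
for cluster in clusters:
    print([g.name for g in cluster])
```

A cluster found this way is only an operon candidate: the rule says nothing about promoters or co-expression.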
Of course, many of these connections are functionally irrelevant, but detailed analysis can uncover their hidden biological meaning and may help predict new functional connections (Ref. 3).

Fig 9. Gene clusters and close homologs. PCBBH: Pair of Close Bidirectional Best Hits; BBH: Bidirectional Best Hits; PCH: Pair of Close Homologs; COG: Clusters of Orthologous Groups. (Cited from Ref. 11)

A systematic comparison of 23 prokaryotic genomes reveals a low level of gene-order (and operon architecture) conservation. Nevertheless, a number of short, fragmented gene clusters are conserved. Only a few operons, primarily those that code for physically interacting proteins, are conserved in all or most of the bacterial and archaeal genomes.

Recently the term "Über-operon" has emerged. An Über-operon is a set of genes with close functional and regulatory contexts that tends to be conserved despite numerous rearrangements. This concept focuses on the functional themes of operons rather than on specific genes or gene order, but much evidence has also been published that the gene order of some gene clusters is conserved during evolution and speciation (Ref. 6 and Ref. 7).

For a long time, biologists focused on the sequence homology of 2 or more proteins to understand evolutionary distance and were relatively uninterested in gene order, gene neighborhoods and genomic context, in part because gene order in prokaryotes is conserved to a much lesser extent than protein sequences. But many recently proposed computational approaches in comparative genomics go beyond simple sequence comparison, and a new possibility has opened up to mine the biological meaning of genomic context (Ref. 9 and Ref. 10).

Here I suggest an improved method to predict the cellular functions of uncharacterized gene products. The method is based on the elegant idea of the Koonin group at NCBI (Ref. 12).
Koonin's group reconstructed gene neighborhoods by referring to the COG numbers of gene products, and their method appears to work well in most cases. But it has one critical problem: in some cases, the genome data files (gbk or ptt) do not provide COG numbers at all. To compensate for this, I also used gene clustering data based on the sequence homology of all proteins in all chosen genomes. With the gene clustering result, a gene neighborhood can be reconstructed even when no COG number is provided, and the putative functions of uncharacterized proteins can be inferred from the genomic context.

The proteins encoded by conserved gene pairs appear to interact physically or to belong to the same functional theme. This observation can therefore be used to predict the functions of, and interactions between, prokaryotic gene products.

Operons are physically gene clusters, but gene clusters are not always operons. To establish that a specific gene cluster is a functional operon, we need to examine whether all components of the cluster are expressed together from the same promoter. If a promoter that controls the expression of the gene cluster has not yet been identified, the cluster should be considered just an "operon candidate".

Unfortunately, our knowledge of genomic context is very limited at this time, even though over 150 genomes have been completely sequenced. Escherichia coli is the only prokaryotic species for which complete transcription units have been reported. In this study, I picked several gene clusters in E. coli as a framework to search for corresponding conserved gene clusters in the rest of the 23 genomes. In many cases this worked well, but in some cases the E. coli gene cluster is not the longest one (Fig 7 and Fig 8). This can make reconstructing a gene neighborhood time-consuming.

ACKNOWLEDGMENTS

I thank Haifeng Zhao for his work on our pairwise comparison database, Scott Martin for his technical support and Dr.
Sun Kim for his advice and helpful suggestions.

6. REFERENCES

1. Lawrence,J.G. and Roth,J.R. (1996) Selfish operons: horizontal transfer may drive the evolution of gene clusters. Genetics, 143, 1843–1860.
2. Lawrence,J. (1999) Selfish operons: the evolutionary impact of gene clustering in prokaryotes and eukaryotes. Curr. Opin. Genet. Dev., 9, 642–648.
3. Galperin,M.Y. and Koonin,E.V. (2000) Who's your neighbor? New computational approaches for functional genomics. Nat. Biotechnol., 18, 609–613.
4. Mushegian,A.R. and Koonin,E.V. (1996) Gene order is not conserved in bacterial evolution. Trends Genet., 12, 289–290.
5. Dandekar,T., Snel,B., Huynen,M. and Bork,P. (1998) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci., 23, 324–328.
6. Lathe,W.C.,III, Snel,B. and Bork,P. (2000) Gene context conservation of a higher order than operons. Trends Biochem. Sci., 25, 474–479.
7. Huynen,M., Snel,B., Lathe,W.,III and Bork,P. (2000) Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res., 10, 1204–1210.
8. Salgado,H., Moreno-Hagelsieb,G., Smith,T.F. and Collado-Vides,J. (2000) Operons in Escherichia coli: genomic analyses and predictions. Proc. Natl Acad. Sci. USA, 97, 6652–6657.
9. Ermolaeva,M.D., White,O. and Salzberg,S.L. (2001) Prediction of operons in microbial genomes. Nucleic Acids Res., 29, 1216–1221.
10. Wolf,Y.I., Rogozin,I.B., Kondrashov,A.S. and Koonin,E.V. (2001) Genome alignment, evolution of prokaryotic genome organization and prediction of gene function using genomic context. Genome Res., 11, 356–372.
11. Overbeek,R., Fonstein,M., D'Souza,M., Pusch,G.D. and Maltsev,N. (1999) The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA, 96, 2896–2901.
12. Rogozin,I.B., Makarova,K.S., Murvai,J., Czabarka,E., Wolf,Y.I., Tatusov,R.L., Szekely,L.A. and Koonin,E.V.
(2002) Connected gene neighborhoods in prokaryotic genomes. Nucleic Acids Res., 30, 2212–2223.
13. Daubin,V. and Gouy,M. (2001) Bacterial molecular phylogeny using supertree approach. Genome Informatics Genome Research, Vol 12, Issue 7, 1080–1090.
