Computational functional genomics

The goal of computational functional genomics is to assign the function, localization, and interactions of genes (proteins) from the genome organisation, homology to other proteins, occurrence in different species ...

- phylogenetic profiles
- assignment of function and localization
- combination with operon method, Rosetta Stone method, genome neighborhood method

10. Lecture WS 2003/04 Bioinformatics III

Assigning protein functions by comparative genome analysis: protein phylogenetic profiles

Hypothesis: functionally linked proteins evolve in a correlated fashion and therefore have homologs in the same subset of organisms.

In general, pairs of functionally linked proteins have no amino acid sequence similarity with each other and therefore cannot be linked by conventional sequence alignment techniques.

Phylogenetic profile of a particular protein: a string with n entries, each one bit, where n corresponds to the number of genomes. The presence of a homolog to a given protein in the i-th genome is indicated by an entry of 1 at the i-th position. If no homolog is found, the entry is 0.

Variation: assign 1/E-value from BLAST to distinguish levels of similarity.

Pellegrini et al. PNAS 96, 4285 (1999)

Protein phylogenetic profiles

The method is illustrated for a hypothetical case of four fully sequenced genomes (from E. coli, Saccharomyces cerevisiae, Haemophilus influenzae, and Bacillus subtilis) in which we focus on seven proteins (P1-P7). For each E. coli protein, a profile is constructed, indicating which genomes code for homologs of the protein. We next cluster the profiles to determine which proteins share the same profiles. Proteins with identical (or similar) profiles are boxed to indicate that they are likely to be functionally linked. Boxes connected by lines have phylogenetic profiles that differ by one bit and are termed neighbors.

Pellegrini et al. PNAS 96, 4285 (1999)
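The profile construction and the notion of one-bit neighbors described above can be sketched in a few lines of Python. This is only an illustration: the presence/absence data for P1-P7 are invented here (the slides do not list them), and the helper names `build_profile`, `hamming`, `group_identical` and `neighbor_pairs` are hypothetical.

```python
from itertools import combinations

# Genomes against which the E. coli proteins are compared (from the slide).
GENOMES = ["S. cerevisiae", "H. influenzae", "B. subtilis"]

# ASSUMPTION: made-up homolog occurrences for the seven proteins P1-P7.
homologs = {
    "P1": {"S. cerevisiae", "B. subtilis"},
    "P2": {"S. cerevisiae", "H. influenzae"},
    "P3": {"H. influenzae", "B. subtilis"},
    "P4": {"S. cerevisiae", "H. influenzae"},
    "P5": {"S. cerevisiae", "H. influenzae", "B. subtilis"},
    "P6": {"H. influenzae", "B. subtilis"},
    "P7": {"S. cerevisiae", "H. influenzae"},
}


def build_profile(homolog_genomes, genomes):
    """Return one bit per genome: 1 if a homolog is present, else 0."""
    return tuple(1 if g in homolog_genomes else 0 for g in genomes)


def hamming(p, q):
    """Number of positions at which two profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def group_identical(profiles):
    """Map each distinct profile to the sorted list of proteins sharing it."""
    groups = {}
    for name in sorted(profiles):
        groups.setdefault(profiles[name], []).append(name)
    return groups


def neighbor_pairs(profiles):
    """Protein pairs whose profiles differ by exactly one bit."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if hamming(profiles[a], profiles[b]) == 1
    ]


profiles = {name: build_profile(gs, GENOMES) for name, gs in homologs.items()}
groups = group_identical(profiles)
neighbors = neighbor_pairs(profiles)

for profile, names in groups.items():
    print(profile, names)
print(neighbors)
```

With this toy data, P2, P4 and P7 share the profile (1, 1, 0), and P5 is a one-bit neighbor of every other protein.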
Test the method for known case

To test whether proteins with similar phylogenetic profiles are functionally linked, examine the phylogenetic profiles of two proteins that are known to participate in structural complexes, the ribosomal protein RL7 and the flagellar structural protein FlgL, and of a protein known to participate in a metabolic pathway, the histidine biosynthetic protein His5.

Identify all other E. coli ORFs with phylogenetic profiles identical to those of these 3 proteins, and the ORFs whose profiles differ by one bit.

Pellegrini et al. PNAS 96, 4285 (1999)

3 phylogenetic profiles for E. coli proteins

Proteins with phylogenetic profiles in the neighborhood of ribosomal protein RL7 (A), flagellar structural protein FlgL (B), and histidine biosynthetic protein His5 (C). All proteins with profiles identical to the query proteins are shown in the double boxes. All the proteins with profiles that differ by one bit are shown in the single boxes. Proteins in bold participate in the same complex or pathway as the query protein. Proteins in italics participate in a different but related complex or pathway. Proteins with identical profiles are shown within the same box. Single lines between boxes represent a one-bit difference between the two profiles. Homologous proteins are connected by a dashed line or are indented. Each protein is labeled by a four-digit E. coli gene number, a SwissProt gene name, and a brief description. Note that proteins within a box or in boxes connected by a line have similar functions. Proteins in the double boxes in A, B, and C have 11, 6, and 10 ones, respectively, in their phylogenetic profiles, of a possible 16 for the 17 genomes presently sequenced.

Pellegrini et al. PNAS 96, 4285 (1999)

Results from phylogenetic profile analysis

The phylogenetic profile of a protein describes the presence or absence of homologs in organisms.

Proteins that make up multimeric structural complexes are likely to have similar profiles. Also, proteins that are known to participate in a given biochemical pathway are likely to be neighbors in the space of phylogenetic profiles.

Proteins that are functionally linked are far more likely to be neighbors in profile space than randomly selected proteins. However, only a fraction of all possible neighbors is found within a group. Therefore, not all functionally linked proteins have similar profiles. They may fall into multiple clusters in profile space.

Interestingly, hypothetical proteins are also more likely to be neighbors than random proteins, suggesting that many hypothetical proteins are part of uncharacterized pathways or complexes.

Pellegrini et al. PNAS 96, 4285 (1999)

Localizing proteins in the cell from their phylogenetic profiles

Observation: proteins localized to a given organelle by experiments tend to share a characteristic phylogenetic distribution of their homologs, the phylogenetic profile.

Marcotte et al. PNAS 97, 12115 (2000)

Phylogenetic profile of yeast proteins

(A) The mean phylogenetic profiles (horizontal bars of 31 elements) of yeast proteins experimentally localized to different cellular locations. Each profile shows the distribution among genomes of homologs of proteins from one subcellular location. Plasma Mb, plasma membrane. Colors express the average degree of sequence similarity of proteins in that organelle to their sequence homologs in the indicated genomes, with red indicating greater average similarity and blue indicating less.

(B) A tree of the observed relationships among the yeast proteins from different subcellular compartments.
Overlaid on the tree is our interpretation of the relationships, showing ellipses clustering compartments thought to be derived from the progenitor of mitochondria (orange ellipse) and of the eukaryote nucleus (yellow ellipse). A distance matrix was calculated of pairwise Euclidean distances between the mean phylogenetic profiles (A) of proteins known to be localized in each compartment. A tree was generated from this matrix by the neighbor-joining method implemented in PHYLIP 3.5C.

Marcotte et al. PNAS 97, 12115 (2000)

Classification scheme

The scheme by which proteins are classified into mitochondrial or nonmitochondrial cellular localizations. Each horizontal bar is a phylogenetic profile; that for the protein of interest x0 is compared with the mean profiles for mitochondrial and nonmitochondrial proteins to determine its localization. In this example, the protein of interest is assigned to the mitochondrion because the query protein's phylogenetic profile more closely resembles the mean profile of mitochondrial proteins than the mean profile of cytosolic proteins.

Marcotte et al. PNAS 97, 12115 (2000)

Assignment of nuclear genome-encoded proteins to mitochondria

(Left) For yeast, a jackknife test on experimentally localized yeast proteins showing the method coverage (fraction of mitochondrial proteins correctly assigned) plotted versus the method accuracy (fraction of proteins assigned to mitochondria that are known to be mitochondrial). (Inset) The (noncumulative) number of known (gray curve) and newly predicted (black curve) mitochondrial proteins for each coverage level, along with the number of known false positive predictions (white curve). One hundred jackknife trials were performed, randomly removing 10% of the proteins for each trial.
(Right) Predicted localization of experimentally localized worm proteins by using yeast proteins as the training set.

Marcotte et al. PNAS 97, 12115 (2000)

Functions of mitochondrial proteins

Functions of yeast mitochondrial proteins are plotted for known mitochondrial proteins (upper three pie charts) and for the newly predicted mitochondrial proteins (lower pie chart). Each pie chart shows the percentage of proteins with a given function.

Known mitochondrial proteins can be operationally divided into three populations:

- those with homologs in eubacteria or archaea (prokaryote-derived mitochondrial proteins),
- those with homologs only in other eukaryotes (eukaryote-derived mitochondrial proteins), and
- those without detectable homologs in the set of complete genomes (organism-specific mitochondrial proteins).

Many functional systems, such as the mitochondrial ribosome, have components from more than one category of genes. The organism-specific mitochondrial proteins may be conserved in related species; many of the yeast-specific genes are conserved in other fungi as well, although absent in the more distantly related eukaryotes listed in Fig. 1A. Functional categories are defined as in the MIPS (Munich Information Center for Protein Sequences) database (29). For this analysis, mitochondrial proteins were predicted with an accuracy of 70% as scored by the self-consistency test.

Marcotte et al. PNAS 97, 12115 (2000)

Inference of protein function and protein lineages in Mycobacterium tuberculosis based on prokaryotic genome organization

One difference between prokaryotic and eukaryotic genomes is the organization of the prokaryotic genome into multi-gene units, known as operons. Prokaryotic operon organization enables the highly controlled co-expression of multiple genes by transcribing them together onto a single transcript.
The encoded proteins of common operons often have related functions, form common complexes, or participate in shared biochemical pathways.

Strong, Mallick, Pellegrini, Thompson, Eisenberg Genome Biology 2003 4:R59

Prokaryotic operon organization

(a) Prokaryotic operon organization. Genes A, B, and C are transcribed together onto a single polycistronic transcript, which is then translated to produce three separate proteins. Proteins originating from genes of a common operon often have similar functions, interact physically through protein-protein interactions, or participate in shared biochemical pathways.

(b) Functional linkages based on the Operon method. Genes A, B and C are 'linked' if the intergenic nucleotide distance between pairs of adjacent genes is less than or equal to the specified threshold. In this case the distance between genes A and B and the distance between genes B and C are both less than the hypothetical distance threshold, thereby allowing links between all possible sets of genes.

Strong, Mallick, Pellegrini, Thompson, Eisenberg Genome Biology 2003 4:R59

Prokaryotic operon organization

Although the operon structure has been well studied at the biochemical level in microorganisms such as E. coli, genome-wide operon organization in pathogenic organisms, such as M. tuberculosis, remains largely unknown.

One can exploit the conservation of certain genetic elements present in many prokaryotic organisms, including M. tuberculosis, to learn about operon structure and gene function:

- -10 and -35 bp promoter elements
- ribosome binding sites (RBS)
- the 5' and 3' untranslated regions (UTR)

Strong, Mallick, Pellegrini, Thompson, Eisenberg Genome Biology 2003 4:R59

Independent vs. consecutive transcription

Schematic representation of the minimum genetic requirements for adjacent genes that are transcribed independently and for those transcribed together as a single operon. Cases 1, 2 and 3 depict instances where gene A and gene B are transcribed independently as distinct transcriptional units, while Case 4 depicts genes organized into a common operon. The minimum requirement for genes of a common operon is only an RBS, while Case 3 emphasizes the numerous genetic elements required if gene A and gene B are organized into separate transcription units.

Strong, Mallick, Pellegrini, Thompson, Eisenberg Genome Biology 2003 4:R59

Gene linkage based on Operon method

Strong et al., Genome Biology (2003) 4:R59

Conservation of Swissprot annotation

Swissprot-keyword recovery scores as a function of combined intergenic distances between pairs of genes in a run. All gene members of a run (bordered on each side by genes in opposite orientations) were linked and given a value equal to the combined intergenic distances between them. While the keyword recovery of genes linked by a combined intergenic distance of less than 150 bp is fairly high (34-52%), keyword recovery decreases as the total intergenic distance increases above 150 bp. At combined intergenic distances above 250 bp the keyword recovery is comparable to that of randomly linked genes.
Strong et al., Genome Biology (2003) 4:R59

Combine computational methods of functional assignment

4 methods for functional assignment were used:

- Operon method (intergenic distance criterion)
- Rosetta Stone (RS) method: genes A and B are assumed to have a common function if a fused gene AB is found in any other organism
- Phylogenetic Profile (PP) method
- Conserved Gene Neighbor (GN) method: identify genes that are in close proximity in multiple genomes

Keyword recovery scores for the Operon method alone and in combination with the RS, PP, and GN methods. Notice that combining the Operon method with any of RS, PP, or GN has a dramatic effect on the keyword recovery, with the best score resulting from a combination of the 100 bp Operon, RS and PP methods.

Strong et al., Genome Biology (2003) 4:R59

Distance profiles of adjacent genes

Distance profile of adjacent M. tuberculosis genes in the same orientation that are functionally linked by the Rosetta Stone, Phylogenetic Profile or conserved Gene Neighbor methods, compared to adjacent genes in the same orientation that are not linked by these methods.

Strong et al., Genome Biology (2003) 4:R59

Gene finding

(c) Distance profile of adjacent genes in the same orientation in experimentally documented operons in E. coli. E. coli operon data were obtained from RegulonDB.

The linked profile (a) yielded a mean intergenic distance of 27 base pairs, as compared with a mean intergenic distance of 94 base pairs for genes not linked by any of the three methods (b). This demonstrates that adjacent genes in the same orientation that have small intergenic spacing are more likely to be functionally linked than those that are separated farther apart.

Strong et al., Genome Biology (2003) 4:R59
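The Operon method described above (link adjacent genes in the same orientation whose intergenic distance does not exceed a threshold, then link all members of such a run) might be sketched as follows. This is not the authors' implementation: the `Gene` record, the function `operon_links`, the 1-based inclusive coordinates and the example genes are assumptions made for illustration. The 100 bp default is the threshold mentioned on the slides.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # ASSUMPTION: 1-based, inclusive coordinates
    end: int
    strand: str  # "+" or "-"


def operon_links(genes, threshold=100):
    """Link all pairs of genes within a run of adjacent, same-strand genes
    whose consecutive intergenic distances are all <= threshold (in bp)."""
    ordered = sorted(genes, key=lambda g: g.start)
    runs, current = [], []
    for gene in ordered:
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # bases between the two genes
            if gene.strand != prev.strand or gap > threshold:
                runs.append(current)  # close the current run
                current = []
        current.append(gene)
    if current:
        runs.append(current)

    links = set()
    for run in runs:
        for a, b in combinations(run, 2):
            links.add((a.name, b.name))
    return links


# Hypothetical genes: A, B, C are closely spaced on the same strand;
# D is far away; E lies on the opposite strand.
genes = [
    Gene("A", 1, 900, "+"),
    Gene("B", 930, 1800, "+"),
    Gene("C", 1850, 2600, "+"),
    Gene("D", 3000, 3900, "+"),
    Gene("E", 3950, 4500, "-"),
]

print(sorted(operon_links(genes, threshold=100)))
```

Raising the threshold pulls more distant genes into a run, which is the trade-off between keyword recovery and false positives that the threshold slide examines.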
Determine operon distance threshold

Keyword recovery and maximum false positive fraction scores as the Operon distance threshold increases from 0 bp to 300 bp. Notice the decrease in the keyword recovery and the increase in the maximum false positive fraction as the distance threshold increases.

Strong et al., Genome Biology (2003) 4:R59

Verify predictions on known examples

Comparison of the genomic organization of the leucine biosynthesis genes in M. tuberculosis and S. pombe.

(a) Genomic organization of the leuC and leuD genes of M. tuberculosis.

(b) S. pombe alpha-isopropylmalate isomerase, containing both the leuC and leuD coding regions in a single fusion gene.

This example illustrates the power of the Rosetta Stone, Phylogenetic Profile, Gene Neighbor and Operon methods to infer a functional linkage, in this case one that is already established.

Strong et al., Genome Biology (2003) 4:R59

Inference of protein function

Inference of M. tuberculosis protein function and operon organization based on multiple method overlap.

(a) Inference of an operon encoding members involved in thiamine biosynthesis.

(b) Operon inference for a region possibly involved in RNA degradation.

(c) Functional links and operon inference for a region likely to be involved in cell wall metabolism.

In these cases, the functions of uncharacterized genes are inferred from their functional linkages to genes of known function.

Strong et al., Genome Biology (2003) 4:R59

Identification of novel genes

Identification of two novel genes linked to the arabinogalactan biosynthesis pathway, an important target of M. tuberculosis specific drugs.
Based on the close proximity of adjacent genes (Operon method) and the functional linkage established by the Rosetta Stone method, the authors infer that Rv1503c and Rv1504c may be organized into a common operon. Both genes also have functional links to the genes rfe and rmlB, important components in the arabinogalactan biosynthesis pathway.

Strong et al., Genome Biology (2003) 4:R59

Assignment of possible function

A unique M. tuberculosis gene linked to a glutamine synthetase paralog. Few homologs of Rv1879 exist in prokaryotes, but some plants and certain fungi contain a fusion protein with domains homologous to both Rv1879 and glutamine synthetase. The Operon and Rosetta Stone linkages suggest a possible role for Rv1879 and a possible functional association with the glnA3 gene product.

Strong et al., Genome Biology (2003) 4:R59

Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages

Aim: find computational approaches for detecting gene and protein interactions that complement and extend experimental approaches such as:

- synthetic lethal and suppressor screens
- yeast two-hybrid experiments
- high-throughput mass spectrometry interaction assays

Approach followed here: phylogenetic profiles

Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Identify novel cellular systems

Top: Using computational genetics, the genome-wide protein network of an organism is reconstructed.

Middle: Suitable candidate clusters that contain three or more linked proteins, at least 50% of which are uncharacterized, are selected for further evaluation.

Bottom: Such core clusters are then extended to include operon partners and other proteins that are naturally linked with the protein cluster. Thick boxes and lines indicate proteins in the core cluster; thin boxes and lines indicate proteins extending the core cluster.
Shaded boxes represent homologs; thick gray lines represent links to operon partners.

Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Metric of phylogenetic profile similarity

The mutual information MI(A,B) measures the similarity of a pair of phylogenetic profiles A and B. MI(A,B) is maximum when there is complete covariance between the occurrences of the genes A and B, and tends to 0 as variation decreases or the gene occurrences vary independently.

\[ MI(A,B) = H(A) + H(B) - H(A,B) \]

\[ H(A) = -\sum_{a} p(a) \ln p(a) \]

H(A) represents the marginal entropy of the probability distribution p(a) of gene A occurring among the organisms in the reference database, and

\[ H(A,B) = -\sum_{a} \sum_{b} p(a,b) \ln p(a,b) \]

represents the joint entropy of the probability distribution p(a,b) of occurrences of genes A and B across the set of reference organisms.

Quality of functional linkages

The inherent information in phylogenetic profiles can be seen from the distributions of scores from comparisons of all possible protein pairs in each of seven organisms.

Pairwise comparisons of actual phylogenetic profiles (solid lines) show significantly more similar profiles (indicated by larger mutual information values) than pairwise comparisons of shuffled profiles (dashed lines).

Mutual information scores between shuffled profiles exceed 0.7 at a rate of 1 in 10^7 pairs, whereas scores between actual profiles reach values greater than 1.2, indicating that scores above 0.7 are statistically likely to indicate legitimate functional linkages between pairs of genes.

Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Quality of functional linkages

1,131 S. cerevisiae and 1,231 E. coli proteins whose functions are precisely known were used to test the quality of the phylogenetic profile linkages.
The quality of predicted functional linkages, measured as the mutual information scores between all pairs of phylogenetic profiles, is plotted versus the agreement between the proteins' experimentally known pathways, measured as the Jaccard coefficient between the proteins' pathway memberships in the KEGG database [24]. Each point represents the average values for 1,000 pairs of proteins. Shuffled profiles rarely show high mutual information values (inset).

Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Quality of functional linkages

Mutual information scores plotted versus pathway similarity on a linear scale show increasing trends. The solid and dashed lines represent analytical curves fit to the data of panel b by least squares. Scores of 0.75 indicate approximately 35–50% accurate predictions by this test; higher scores approach 100% functional accuracy.

For comparison, the percentage of proteins that share no pathways in common shows a decreasing trend as mutual information values increase (inset). The accuracies of experimentally determined protein interactions from large-scale yeast two-hybrid screens [14, 15] (14% and 44%) and from mass spectrometry experiments [16, 17] (27% and 76%) are shown as dot-dashed horizontal lines. As in panel b, each point represents the average values of 1,000 pairs of proteins.

Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Predicted genome-wide protein networks for yeast

Proteins are represented as vertices, and derived functional linkages are shown as lines connecting the corresponding proteins. All linkages with scores above a mutual information value of 0.75 are drawn, essentially by modeling the linkages as springs that pull functionally linked proteins together on the page. (Thus, the lengths of the lines are not meaningful, only the connections.)
Groups of proteins sharing functional links are seen to cluster together, representing portions of genetic or functional networks. Systems in gray circles are labeled with their corresponding functions. (For visual clarity, small protein networks, including 1 five-protein system, 2 four-protein systems and 31 two-protein systems, have been omitted.)

Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Predicted networks for pathogenic E. coli O157:H7

All linkages with scores above a mutual information value of 0.85 are included. For visual clarity, small protein networks, including 1 six-protein system, 2 four-protein systems, 9 three-protein systems and 40 two-protein systems, have been omitted.

Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Clusters representing potentially new pathways

Clusters representing potentially new pathways selected from reconstructions of genome-wide interaction networks of four different organisms. Boxes with thicker borders and bold lines denote the cluster core. Each cluster was extended to include operon partners, as well as secondarily linked proteins that are naturally grouped with the proteins in the cluster but have a mutual information value less than the selected threshold; these are represented by dotted lines and boxes with thinner borders. Thick red lines represent connections between genes in an operon, whereas colored boxes represent homologous proteins. All selected core clusters are composed of proteins, at least 50% of which lack precise functional assignments. Boxes with dashed outlines represent such uncharacterized proteins.
Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Phylogenetic profiles for new gene clusters

The genes corresponding to proteins within a cluster show similar patterns of presence and absence, indicated by red and blue squares, respectively, among the 57 genomes, labeled across the top. The intensity of red denotes the degree of homology between the protein labeled at the left and the best matching protein sequence of the corresponding genome. Deeper red indicates stronger sequence similarity, blue indicates no detectable similarity (BLAST E-value 1).

Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Why can one still find entirely new systems?

In well-characterized systems like yeast, ca. 90% of the uncharacterized proteins are linked in networks to proteins of known function. Most uncharacterized proteins therefore appear to be additional components of known systems.

The few characterized proteins of the novel cellular systems detected here seem to be strongly biased towards metabolic functions that occur commonly as more or less discrete systems within cells, which can easily be coinherited or horizontally transferred. The presented analysis seems an ideal way of discovering such systems. Of course, it cannot indicate the precise biological function of these systems.

In traditional biology, the biological knowledge extended gradually along known sets of pathways, rather than sampling all pathways evenly. Again, the presented approach allows new discoveries.

Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Summary

Computational functional annotation of genes may be based on

(a) annotation by homology to genes with known function in other organisms

(b) combination of several, relatively search techniques as presented today.

Proteins often have multiple functions! We need to detect all of them.
The search techniques under (b) are biology-driven. This area is still in the exploratory phase. Soon certain rules will emerge that allow more sophisticated computational techniques to be applied: a job for computer scientists/bioinformaticians.
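As a companion to the slide "Metric of phylogenetic profile similarity", the mutual information MI(A,B) = H(A) + H(B) - H(A,B) can be computed from two profiles by estimating p(a), p(b) and p(a,b) as relative frequencies over the reference genomes. The sketch below is only an illustration: the function names `entropy` and `mutual_information` are hypothetical, the example profiles are invented, and the profiles are assumed to be sequences of discrete states (binary here; the slides do not say how the authors discretized their profiles).

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (natural log) of a distribution given as counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def mutual_information(profile_a, profile_b):
    """MI(A,B) = H(A) + H(B) - H(A,B) for two equal-length discrete profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    n = len(profile_a)
    h_a = entropy(Counter(profile_a).values(), n)
    h_b = entropy(Counter(profile_b).values(), n)
    # Joint distribution: count co-occurring state pairs genome by genome.
    h_ab = entropy(Counter(zip(profile_a, profile_b)).values(), n)
    return h_a + h_b - h_ab


# Hypothetical binary profiles over four genomes.
a = (1, 1, 0, 0)
b = (1, 1, 0, 0)  # co-varies completely with a
c = (1, 0, 1, 0)  # occurs independently of a

print(mutual_information(a, b))
print(mutual_information(a, c))
```

For binary profiles the score cannot exceed ln 2; completely covarying profiles such as `a` and `b` reach that maximum, while independent ones such as `a` and `c` score 0.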