Download ppt - Chair of Computational Biology
V5: Functional inference from interaction networks
5. Lecture SS 2005 Cell Simulations

Aim: associating protein function at the proteomic scale
Even the best-studied model organisms contain a large number of proteins whose functions are currently unknown. E.g. about one-third of the proteins in Saccharomyces cerevisiae remain uncharacterized. Traditionally, computational methods to assign protein function have relied largely on sequence homology. The recent emergence of high-throughput experimental datasets has led to a number of alternative, non-homology-based methods for functional annotation. These methods have generally exploited the concept of guilt by association, where proteins are functionally linked through either experimental or computational means.
Nabieva et al., Bioinformatics 21, i1 (2005)

Associating protein function from proteomic data
Large-scale experiments have linked proteins that
- physically interact (Ito et al., 2001; Uetz et al., 2000; Gavin et al., 2002; Ho et al., 2002; Rain et al., 2001; Giot et al., 2003; Li et al., 2004),
- are synthetic lethals (Tong et al., 2001, 2004),
- are coexpressed (Edgar et al., 2002) or
- are coregulated (Lee et al., 2002; Harbison et al., 2004).
In addition, computational techniques linking pairs of proteins include
- phylogenetic profiles (Gaasterland and Ragan, 1998; Pellegrini et al., 1999),
- gene clusters (Overbeek et al., 1999),
- conserved gene neighbors (Dandekar et al., 1998) and
- gene fusion analysis (Enright et al., 1999; Marcotte et al., 1999a).
Perhaps not surprisingly, integrating the information from several sources provides the best method for linking proteins functionally.

Features of integrated networks
Several groups have attempted to partition interaction networks into functional modules that correspond to sets of proteins that are part of the same cellular function or take part in the same protein complex. These functional modules, or clusters, are useful for annotating uncharacterized proteins, as the most common functional annotation within a cluster can be transferred to uncharacterized proteins. Proteins in experimentally and computationally determined interaction graphs have been grouped together based on
- shared interactions,
- the similarity between shortest path vectors to all other proteins in the network, and
- shared membership within highly connected components or cliques.

Features of integrated networks
The strategy described here is more closely related to recent attempts to classify proteins according to the functional annotations of their network neighbors; these methods do not explicitly cluster proteins. Schwikowski et al. (2000) use physical interaction data for baker’s yeast, and predict the biological process for each protein by considering its neighboring interactions and taking the three most frequent annotations. Although such a simple majority vote approach, named Majority, has clear predictive value, it takes only limited advantage of the underlying graph structure of the network. For example, in the interaction network given in Figure 1, ‘Majority’ would assign functions to proteins d and f, but not to protein e, even though our intuition might indicate that protein e has the same function as proteins d and f; there are several examples in the yeast proteome similar to this one (Schwikowski et al., 2000).
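The majority vote described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Schwikowski et al.: the function name `majority_predict`, the dictionary-based graph layout and the toy network are assumptions made here, and the toy network is only loosely in the spirit of the Figure 1 example (the figure itself is not reproduced in this transcript). Ties are broken alphabetically to keep the output deterministic, whereas the original method orders ties arbitrarily.

```python
from collections import Counter


def majority_predict(graph, annotations, protein, top_k=3):
    """Return up to top_k annotations that occur most often among the
    direct neighbors of `protein` (weighted by edge weight)."""
    counts = Counter()
    for neighbor, weight in graph.get(protein, {}).items():
        for function in annotations.get(neighbor, ()):
            counts[function] += weight
    # Most frequent first; alphabetical tie-break is an assumption of this sketch.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [function for function, _ in ranked[:top_k]]


# Hypothetical toy network: a, b, c are annotated; d, e, f are not.
graph = {
    "a": {"d": 1.0},
    "b": {"d": 1.0},
    "c": {"f": 1.0},
    "d": {"a": 1.0, "b": 1.0, "e": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"e": 1.0, "c": 1.0},
}
annotations = {"a": {"F1"}, "b": {"F1"}, "c": {"F1"}}

for protein in ("d", "e", "f"):
    print(protein, majority_predict(graph, annotations, protein))
```

In this toy case d and f receive F1 from their annotated neighbors, while e receives nothing because none of its direct neighbors is annotated, which is exactly the limitation discussed above.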
Naturally, one wishes to generalize this principle to consider functional linkages beyond the immediate neighbors in the interaction graph, both to provide a systematic framework for analyzing the entirety of physical interaction data for a given proteome and to make predictions for proteins with no annotated interaction partners.

Features of integrated networks
Hishigaki et al. (2001) extended the Majority algorithm by predicting a protein’s function from all proteins within a particular radius and finding over-represented functional annotations. However, this approach, named Neighborhood, does not consider any aspect of network topology within the local neighborhood. E.g., Figure 2 shows two interaction networks that are treated equivalently when considering a radius of 2 and annotating protein a; however, in the first case, there is a single link that connects protein a to the annotated proteins, and in the second case, there are several independent paths between a and the annotated proteins, and moreover, two of these proteins are directly adjacent to a.

Method: physical interaction network
Construct the protein–protein physical interaction network using the protein interaction dataset compiled by GRID (Breitkreutz et al., 2003). The resulting network is a simple undirected graph $G = (V, E)$, where there is a vertex or node $v \in V$ for each protein, and an edge between nodes u and v if the corresponding proteins are known to interact physically (as determined by one or more experiments). Initially, we consider a graph with unit-weighted edges, and then consider weighting the edges by our ‘confidence’ in the edge. The weight of the edge between u and v is denoted by $w_{u,v}$.
For all reported results, consider only the proteins making up the largest connected component of the physical interaction map (4495 proteins and 12 531 physical interaction links).

Method: functional annotations
Several controlled vocabulary systems exist for describing biological function, including Munich Information Center for Protein Sequences (MIPS) and the Gene Ontology (GO) project. We use the MIPS functional hierarchy, and consider the 72 MIPS biological processes that comprise the second level of hierarchy. Of the 4495 proteins in the largest connected component of the yeast physical interaction map, 2946 have MIPS biological process annotations. We also experimented with GO annotations; the overall conclusions made in this paper are not affected.

Method: weighted functional linkages
The reliabilities of different data sources vary, even if they are based on the same underlying technology. In the context of network-based algorithms, it is possible to weight edges so as to model the reliability of each interaction. For physical interactions, this reliability is in turn based on the experimental sources that contribute to our knowledge about the existence of the interaction. To determine these values, separate all experimental sources of physical interaction data into several groups, placing each high-throughput dataset into a separate group, and allocating one group for the family of all specific experiments. For each group of experiments, compute what fraction of its interactions connect proteins with a known shared function. Assume that the reliabilities of different sources are independent, and thus estimate the reliability of an interaction as the noisy-OR of the reliabilities of the underlying data sources: the interaction is wrong only if every source that reports it is wrong.
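As a minimal sketch of this weighting step, assuming each source reliability is treated as an independent probability, the combination can be written as one minus the product of the per-source unreliabilities. The function names `group_reliability` and `edge_reliability` are made up for this illustration, and restricting the fraction to interactions whose two partners are both annotated is an assumption of the sketch; the slides do not state the denominator.

```python
from math import prod


def group_reliability(interactions, annotations):
    """Fraction of a group's interactions that connect two proteins
    sharing at least one known function.
    Assumption: only pairs where both proteins are annotated are counted."""
    scored = [(u, v) for u, v in interactions
              if annotations.get(u) and annotations.get(v)]
    if not scored:
        return 0.0
    shared = sum(1 for u, v in scored if annotations[u] & annotations[v])
    return shared / len(scored)


def edge_reliability(source_reliabilities):
    """Noisy-OR: 1 - prod(1 - r_i) over all sources reporting the edge."""
    return 1.0 - prod(1.0 - r for r in source_reliabilities)


# Hypothetical example data.
annotations = {"a": {"F1"}, "b": {"F1"}, "c": {"F2"}}
group = [("a", "b"), ("a", "c"), ("c", "x")]  # x is unannotated and is skipped

r = group_reliability(group, annotations)
print(r)                         # reliability of this experimental group
print(edge_reliability([r]))     # edge seen by one source
print(edge_reliability([r, r]))  # edge seen by two independent sources
```

An edge confirmed by two sources of reliability 0.5 ends up with weight 0.75, so repeated observation is rewarded without ever exceeding 1.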
If $r_i$ is the reliability of experimental group i, compute the reliability of the edge as $1 - \prod_i (1 - r_i)$, where the product runs over all experiments i in which this interaction is found. This treats each $r_i$ as a probability and assumes independence.

Algorithms: Majority
As in Schwikowski et al. (2000): consider all neighboring proteins and sum up the number of times each annotation occurs for each protein.
"The annotated functions of all neighbors of P are ordered in a list, from the most frequent to the least frequent. Functions that occur the same number of times are ordered arbitrarily. Everything after the third entry in the list is discarded, and the remaining three or fewer functions are declared as predictions for the function of P."
In the case of weighted interaction graphs, take a weighted sum.

Algorithms: Neighborhood algorithm
Adapted from Hishigaki et al. (2001).
Protein interaction map (Figure B): each node represents a protein and each edge represents the interaction between two proteins. Predict the function of each protein in the map (black circle in Figure C), based on the functions of ‘n-neighbouring proteins’, which are defined as the set of proteins reached via at most n physical interactions (n is an integer parameter). E.g. all proteins enclosed by the inside dashed circle are ‘1-neighbouring proteins’, and those enclosed by the outside circle are ‘2-neighbouring proteins’. The protein of interest is assigned the function with the highest $\chi^2$ value among the functions of all n-neighbouring proteins. For each member of the function category, the $\chi^2$ value is calculated using the following formula:

$$\chi_i^2 = \frac{(n_i - e_i)^2}{e_i}$$

where i denotes a protein function, e.g. ‘Golgi’, ‘DNA repair’ and ‘transcription factor’, $e_i$ denotes the number of occurrences of i among the n-neighbouring proteins expected from the distribution on the total map, and $n_i$ denotes the observed number of occurrences of i among the n-neighbouring proteins. Then, the function of a query protein is predicted to be the function i with the maximum $\chi^2$ value. When several functions share the largest $\chi^2$ value, all of them are assigned. The optimal n value is determined by a so-called self-consistency test, where the predicted functions of all proteins in the map are compared with their annotated functions for each n.
Here (Nabieva et al.): consider neighborhoods of radius 1, 2 and 3. This method does not extend naturally to the case of weighted interaction graphs.

Algorithms: GenMultiCut
It was suggested that functional annotations on interaction networks should be made so as to minimize the number of times different annotations are associated with neighboring proteins. Some authors used simulated annealing in an attempt to minimize this objective function and aggregated results from multiple runs, whereas others used a deterministic approximation and considered the case where edges are weighted using gene expression information. The formulation in these two studies is similar to the minimum multiway k-cut problem. In multiway k-cut, the task is to partition a graph in such a way that each of k terminal nodes belongs to a different subset of the partition and so that the (weighted) number of edges that are ‘cut’ in the process is minimized. In the more general version of the multiway k-cut problem considered here, the goal is to assign a unique function to all the unannotated nodes so as to minimize the sum of the costs of the edges joining nodes with no function in common.
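To make the GenMultiCut objective concrete, the following sketch enumerates every assignment on a tiny graph by brute force and keeps all assignments of minimum cut cost. It is only meant for toy inputs (the search is exponential) and is not how the problem is solved in practice; the function names and the example path graph are hypothetical, and each annotated node carries a single function to keep the cost test simple.

```python
from itertools import product


def cut_cost(edges, assignment):
    """Sum of weights of edges whose endpoints get different functions."""
    return sum(w for u, v, w in edges if assignment[u] != assignment[v])


def all_optimal_assignments(edges, known, functions):
    """Brute-force search over functions for the unannotated nodes.
    Returns (minimum cost, list of all assignments reaching it)."""
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    free = [n for n in nodes if n not in known]
    best_cost, best = float("inf"), []
    for choice in product(functions, repeat=len(free)):
        assignment = dict(known)
        assignment.update(zip(free, choice))
        cost = cut_cost(edges, assignment)
        if cost < best_cost:
            best_cost, best = cost, [assignment]
        elif cost == best_cost:
            best.append(assignment)
    return best_cost, best


# Hypothetical path s1 - x1 - x2 - x3 - s2 with unit weights.
edges = [("s1", "x1", 1.0), ("x1", "x2", 1.0),
         ("x2", "x3", 1.0), ("x3", "s2", 1.0)]
known = {"s1": "F1", "s2": "F2"}

best_cost, best = all_optimal_assignments(edges, known, ["F1", "F2"])
print(best_cost, len(best))
print(sum(a["x1"] == "F1" for a in best), "of", len(best), "optima give x1 the function F1")
```

On this path any single edge can be cut, so there are four optimal assignments of cost 1, and the node next to s1 receives F1 in three of them.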

Algorithms: GenMultiCut
Although minimum multiway k-cut is NP-hard (Dahlhaus et al., 1994), it was found that the particular instances of minimum multiway cut arising here can, in practice, be solved exactly when stated as an ILP. Introduce a node variable $x_{u,a}$ for each protein u and function a. Set $x_{u,a} = 1$ if protein u is predicted to have function a. If a protein u has known functional annotations, variable $x_{u,a}$ is fixed as 1 for its known annotations a and as 0 for all other annotations. We also introduce an edge variable $x_{u,v,a}$ for each function a and each pair of adjacent proteins u and v. This variable is set to 1 if both proteins u and v are annotated with function a. Minimizing the weighted number of neighboring proteins with different annotations is the same as maximizing the number with the same annotation, and so the ILP is stated as a maximization over the edge variables.

Algorithms: GenMultiCut
annot(u): set of known annotations for protein u,
$FUNC = \bigcup_u annot(u)$: set of all functional annotations.
The first constraint specifies that exactly one functional annotation is made for any protein. The second and third constraints ensure that if protein u is annotated with function a, $x_{u,a}$ is set as a constant to 1, and if protein u is annotated but not with function a, $x_{u,a}$ is set as a constant to 0. The fourth and fifth constraints ensure that a particular function is picked for an edge only if it is also chosen for the corresponding proteins.

Considering multiple GenMultiCut optimal solutions
An important consideration in this framework is the existence of multiple optimal solutions. E.g. the network in Figure 3 has seven minimum cuts of value 1, and while the GenMultiCut criterion does not favor any one cut over the other, if we find all optimal cuts for this graph, we observe that $x_2$ is in fact annotated with F1 more often than with F2 in the assignments made by these cuts. Thus, a sense of distance to annotated nodes is in fact present in the set of all optimal solutions.

Algorithms: GenMultiCut
Attempt to sample from the space of optimal solutions:
(1) add constraints to the ILP which require that each consecutive solution is different from any previous solution in the value it assigns to at least 5% of the node variables $x_{u,a}$.
(2) introduce uniform self-weights $w_{u,a}$ for each protein u and function a. These self-weights are then perturbed randomly by adding a very small offset to each. The resulting objective is then maximized a number of times.

Algorithms: FunctionalFlow
The functional flow algorithm generalizes the principle of ‘guilt by association’ to groups of proteins that may or may not interact with each other physically. Each protein of known functional annotation is treated as a ‘source’ of ‘functional flow’ for that function. After simulating the spread over time of this functional flow through the neighborhoods surrounding the sources, the ‘functional score’ is obtained for each protein in the neighborhood; this score corresponds to the amount of ‘flow’ that the protein has received for that function, over the course of the simulation. The functional flow-based model allows us to incorporate a distance effect, i.e. the effect of each annotated protein on any other protein depends on the distance separating these two proteins.
Running this process for each biological function in turn, we obtain, for each protein, the score for each function (the score may be 0 if the ‘flow’ for a function did not reach that protein during the simulation). Thereupon, for any protein, we take the functions for which the highest score was obtained as its predicted functions.

Algorithms: FunctionalFlow
More specifically, for each function in turn, we simulate the spread of functional flow by an iterative algorithm using discrete time steps. We associate with each node (protein) a ‘reservoir’ which represents the amount of flow that the node can pass on to its neighbors at the next iteration, and with each edge, a capacity constraint that dictates the amount of flow that can pass through the edge during one iteration. The capacity of an edge is taken to be its weight. Each iteration of the algorithm updates the reservoirs using simple local rules: a node pushes the flow residing in its reservoir to its neighbors proportionally to the capacities of the respective edges and subject to further constraints that the amount of flow pushed through an edge during an iteration does not exceed the capacity of the edge, and that flow only spreads ‘downhill’ (i.e. from proteins with more filled reservoirs to nodes with less filled reservoirs). Finally, at each iteration, an ‘infinite’ amount of flow is pumped into the source protein nodes; thus, the sources always have enough flow in their reservoir to fill the capacity of their outgoing edges.

Algorithms: FunctionalFlow
The functional score is the amount of flow that has entered a protein’s reservoir in the course of all iterations.
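The local rules above can be turned into a short simulation for a single function. This is a sketch under stated assumptions rather than the authors' code: the name `functional_flow` and the dictionary graph layout are invented here, the number of iterations is a parameter `d`, and the proportional push is read as min(capacity, reservoir × capacity / sum of capacities at the node), which is one interpretation of the verbal description. Flow is blocked only when the sender's reservoir is strictly lower than the receiver's.

```python
def functional_flow(graph, sources, d=6):
    """Simulate d iterations of flow for one function.
    graph:   dict node -> dict neighbor -> edge weight (symmetric)
    sources: set of nodes annotated with the function
    Returns dict node -> total flow that entered the node."""
    inf = float("inf")
    reservoir = {u: (inf if u in sources else 0.0) for u in graph}
    score = {u: 0.0 for u in graph}
    for _ in range(d):
        inflow = {u: 0.0 for u in graph}
        outflow = {u: 0.0 for u in graph}
        for u, neighbors in graph.items():
            total_capacity = sum(neighbors.values())
            if reservoir[u] == 0.0 or total_capacity == 0.0:
                continue  # nothing to push
            for v, w in neighbors.items():
                if reservoir[u] < reservoir[v]:
                    continue  # flow only spreads downhill
                if reservoir[u] == inf:
                    flow = w  # sources always fill the edge capacity
                else:
                    # Assumed proportional rule, capped by the capacity.
                    flow = min(w, reservoir[u] * w / total_capacity)
                inflow[v] += flow
                outflow[u] += flow
        for u in graph:
            if u not in sources:  # source reservoirs stay infinite
                reservoir[u] += inflow[u] - outflow[u]
            score[u] += inflow[u]
    return score


# Hypothetical path: source s - a - b, unit weights, three iterations.
path = {"s": {"a": 1.0}, "a": {"s": 1.0, "b": 1.0}, "b": {"a": 1.0}}
scores = functional_flow(path, {"s"}, d=3)
print(scores)
```

On this path the direct neighbor of the source collects flow in all three iterations, while the node two links away only starts receiving flow in the second iteration, so its score is lower.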
Since the flow is ‘pumped’ into the sources at each step, the amount of flow a node receives from each source is greater for nodes that are closer to that source than for nodes that are farther away from it. Thus, a source’s immediate neighbor in the graph receives d iterations’ worth of flow from the source, whereas a node that is two links away from the source receives d - 1 iterations’ worth of flow. Similarly, the number of iterations for which the algorithm is run determines the maximum shortest-path distance that can separate a recipient node from a source in order for the flow to propagate from the source to the recipient. In the context of protein interaction, a relatively small number of iterations is sufficient. We choose d = 6, which is half the diameter of the yeast physical interaction network.

Algorithms: FunctionalFlow
More formally, for each protein u in the interaction network, we define a variable $R_t^a(u)$ that corresponds to the amount in the reservoir for function a that node u has at time t. For each edge (u, v) in the interaction network, we define variables $g_t^a(u, v)$ and $g_t^a(v, u)$ that represent the flow of function a at time t from protein u to protein v, and from protein v to protein u. We will run the algorithm for d time steps or iterations. At time 0, we only have reservoirs of function a at annotated nodes:

$$R_0^a(u) = \begin{cases} \infty & \text{if } u \text{ is annotated with } a \\ 0 & \text{otherwise} \end{cases}$$

At each subsequent time step, we recompute the reservoir of each protein by considering the amount of flow that has entered the node and the amount that has left:

$$R_t^a(u) = R_{t-1}^a(u) + \sum_{v:(u,v) \in E} \left( g_t^a(v, u) - g_t^a(u, v) \right)$$

Algorithms: FunctionalFlow
Initially, at time 0, there is no flow on the edges, and $g_0^a(u, v) = 0$.
At each subsequent time step, the flow proceeds downhill and satisfies the capacity constraints: flow is pushed from u to v only from the fuller reservoir to the emptier one, and $g_t^a(u, v) \le w_{u,v}$. Finally, the functional score for node u and function a over d iterations is calculated as the total amount of flow that has entered the node:

$$f_a(u) = \sum_{t=1}^{d} \sum_{v:(u,v) \in E} g_t^a(v, u)$$

Why don’t we simply take the current value of the reservoir?

Comparison of the four methods
Comparison of four basic methods on the unweighted physical interaction map
We compare the performance of Majority, Neighborhood, GenMultiCut and FunctionalFlow on the unweighted yeast physical interaction map, using a 2-fold cross-validation. Figure 4 plots as a function of FP the number of TPs each method predicts (i.e. these graphs are obtained by varying the scoring threshold for each of the methods). The FunctionalFlow algorithm identifies more TPs over the entire range of FPs than either GenMultiCut or Neighborhood using radius 1, 2 or 3. FunctionalFlow performs better than Majority when proteins are not directly interacting with at least three proteins of the same function; this is evident from Figure 4 since the score for Majority counts up the most frequent neighboring annotation (e.g. the rightmost point for Majority corresponds to proteins whose highest functional scores are one). Thus, FunctionalFlow is the method of choice when considering proteins that do not interact with many annotated proteins. Even in well-characterized proteomes, such as baker’s yeast, there are ca. 1200 proteins that have fewer than three annotated neighbors. The Neighborhood algorithm performs similarly with either radius 1 or 2 in the high-confidence region (i.e. corresponding to a low FP rate, given in the left-most portion of the ROC curve). However, radius 1 (i.e. considering just direct interactions) has better overall performance than radius 2 or 3, demonstrating that Neighborhood’s strategy of ignoring topology is not optimal.
Moreover, comparing Majority with Neighborhood using radius 1 demonstrates that the $\chi^2$-test is not as effective in scoring as just summing up the number of times a particular annotation occurs in the neighboring proteins.

Reliability and data integration
Figure 5 shows that rewarding multiple experimental evidence is beneficial, but that the main advantage comes from taking into account the actual reliability values for the different experiments. To evaluate the approach for modeling physical interaction reliability as edge weights, we test the performance of FunctionalFlow using three ways of assigning physical interaction weights.
(1) assign each edge a unit weight; this corresponds to the unweighted physical interaction map.
(2) assign each experimental source a reliability score of 0.5; this rewards interactions that are found by more than one experiment.
(3) assign each experimental source the predictive value (estimated in cross-validation); here, edges obtained from multiple, more reliable experiments are given higher weights.

All methods perform better on weighted map
Figure 6 shows how Majority, GenMultiCut and FunctionalFlow perform on the yeast physical interaction map, where edges are weighted by individual experimental reliability. The baseline performance of Majority on the unweighted physical interaction graph is also shown. There is substantial improvement in predictions using all three methods when incorporating edges weighted by reliability.

Is it useful to provide more information?
We further explored whether the network analysis algorithms would perform well when other types of experimental information are added.
As a proof of principle, we explore the effect of adding genetic linkages to the graph (synthetic lethals). As is evident from the figure, adding genetic interaction data significantly improves prediction quality.

Conclusions
The network analysis algorithm FunctionalFlow provides an effective means for predicting protein function from protein interaction maps. The algorithm utilizes indirect network interactions, network topology, network distances and edges weighted by reliability estimated from multiple data sources. The simplest methods, such as Majority, perform well if there are enough direct neighbors with known function. Simple independence assumptions were made for estimating the reliability of interactions. Although these work reasonably well, it may be even more beneficial to use a more sophisticated approach for weight assignment and perform more complete data integration. Finally, although the method was applied to baker’s yeast, FunctionalFlow is likely to be especially useful when analyzing largely uncharacterized proteomes where computational methods are used to infer protein interaction maps.

Alternative scheme to combine information from various sources: Bayesian network
Knowing the correct overall structures of gene networks will be invaluable for characterizing the complex roles of individual genes and the interplay between the many systems in a cell. Deriving gene networks from heterogeneous functional genomics data is often difficult, because experiments such as microarray analyses of gene expression or systematic protein interaction mapping measure different aspects of gene or protein associations. E.g.
- Affinity purification of proteins analyzed by MS: measures the tendency for proteins to be components of the same physical complex, although not necessarily to contact each other directly.
- Yeast two-hybrid assays: often indicate direct physical interactions (stable or transient) between proteins.
- Synthetic lethal screens: measure the tendency for genes to compensate for the loss of other genes.
These analyses range considerably in accuracy, and it is not clear a priori which measurements are correct. In spite of these differences, these data sets can, in principle, be computationally integrated, primarily by the reconstruction of network models of the relations between genes.
Lee, ..., Marcotte, Science 306, 1555 (2004)

Aim: construct complete network of gene association
Network reconstructions have largely focused on physical protein interaction and so represent only a subset of biologically important relations.
Aim: construct a more accurate and extensive gene network by considering functional, rather than physical, associations, realizing that each experiment, whether genetic, biochemical, or computational, adds evidence linking pairs of genes, with associated error rates and degree of coverage. In this framework, gene-gene linkages are probabilistic summaries representing functional coupling between genes. Only some of the links represent direct protein-protein interactions; the rest are associations not mediated by physical contact, such as regulatory, genetic, or metabolic coupling, that, nonetheless, represent functional constraints satisfied by the cell during the course of the experiments. Working with probabilistic functional linkages allows many diverse classes of experiments to be integrated into a single coherent network which enables the linkages themselves to be more reliably

Method for integrating functional genomics data
Functional genomics data sets are first benchmarked for their relative accuracies; these are used as weights in a probabilistic integration of the data. Several raw data sets already have intrinsic scoring schemes, indicated in parentheses (e.g., CC, correlation coefficients; P, probabilities; and MI, mutual information scores). These data are rescored with log likelihood scores (LLS), then integrated into an initial network (IntNet). Additional linkages from the genes’ network contexts (ContextNet) are then integrated to create the final network (FinalNet), with ~34,000 linkages between 4681 genes (ConfidentNet) scoring higher than the gold standard (small-scale assays of protein interactions). Hierarchical clustering of ConfidentNet defined 627 modules of functionally linked genes spanning 3285 genes (“ModularNet”), approximating the set of cellular systems in yeast.

Scoring scheme for linkages
The unified scoring scheme for linkages is based on a Bayesian statistics approach. Each experiment is evaluated for its ability to reconstruct known gene pathways and systems by measuring the likelihood that pairs of genes are functionally linked conditioned on the evidence, calculated as a log likelihood score:

$$LLS = \ln \left( \frac{P(L|E) / P(\neg L|E)}{P(L) / P(\neg L)} \right)$$

$P(L|E)$ and $P(\neg L|E)$: frequencies of linkages (L) observed in the given experiment (E) between annotated genes operating in the same pathway and in different pathways, respectively.
$P(L)$ and $P(\neg L)$: the prior expectations (i.e., the total frequency of linkages between all annotated yeast genes operating in the same pathway and operating in different pathways).
Scores > 0 indicate that the experiment tends to link genes in the same pathway, with higher scores indicating more confident linkages.
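A small worked example may help to read the score. The sketch below computes the log likelihood score from pair counts, assuming the frequencies are estimated simply as counts of annotated gene pairs in the same or in different pathways; the function name, its argument names and the numbers are all hypothetical and only illustrate the arithmetic.

```python
from math import log


def log_likelihood_score(linked_same, linked_diff, all_same, all_diff):
    """LLS = ln( (P(L|E) / P(not L|E)) / (P(L) / P(not L)) ).
    linked_same / linked_diff: pairs linked by the experiment whose genes
        share / do not share a pathway annotation.
    all_same / all_diff: the same counts over all annotated gene pairs.
    Assumption: the odds are estimated directly from these counts."""
    if min(linked_same, linked_diff, all_same, all_diff) <= 0:
        raise ValueError("all counts must be positive in this sketch")
    evidence_odds = linked_same / linked_diff
    prior_odds = all_same / all_diff
    return log(evidence_odds / prior_odds)


# Hypothetical experiment: 80 of its 100 linked pairs share a pathway,
# while only 100 of 1000 annotated pairs overall do.
informative = log_likelihood_score(80, 20, 100, 900)
# Hypothetical experiment that links pairs no better than chance.
uninformative = log_likelihood_score(10, 90, 100, 900)
print(round(informative, 3), round(uninformative, 3))
```

The first experiment raises the odds of a shared pathway 36-fold and so scores ln 36, about 3.58, whereas the second reproduces the prior odds and scores 0.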

Benchmarks
As scoring benchmarks, the method was tested against two primary annotation references: the Kyoto-based KEGG pathway database and the experimentally observed yeast protein subcellular locations determined by genome-wide green fluorescent protein (GFP)-tagging and microscopy. KEGG scores were used for integrating linkages, with the other benchmark withheld as an independent test of linkage accuracy. Cross-validated benchmarks and benchmarks based on the Gene Ontology (GO) and KOG gene annotations provided comparable results.

Functional inference from interaction networks
Benchmarked accuracy and extent of functional genomics data sets and the integrated networks. A critical point is the comparable performance of the networks on distinct benchmarks, which assess the tendencies for linked genes to share (A) KEGG pathway annotations or (B) protein subcellular locations. Each x axis indicates the percentage of protein-encoding yeast genes provided with linkages by the plotted data; each y axis indicates relative accuracy, measured as the of the linked genes’ annotations on that benchmark. The gold standards of accuracy (red star) for calibrating the benchmarks are small-scale protein-protein interaction data from DIP. Colored markers indicate experimental linkages; gray markers, computational. The initial integrated network (lower black line), trained using only the KEGG benchmark, has measurably higher accuracy than any individual data set on the subcellular localization benchmark; adding context-inferred linkages in the final network (upper black line) further improves the size and accuracy of the network.

Features of integrated networks
Functional linkages were first inferred on the basis of genes’ mRNA coexpression across each of 12 sets of DNA microarray experiments (497 microarray experiments in total), then integrated via a rank-weighted sum of log likelihood scores to create the combined set of coexpression-derived linkages. To construct the initial integrated network (“IntNet,” Fig. 1), we combined 8 categories of data, including the physical and genetic interaction data sets, mRNA coexpression linkages, functional linkages from literature mining, and computational linkages from two comparative genomics methods, Rosetta stone (gene-fusion) linkages and phylogenetic profiles. Integrating functional genomics data also allowed discovery of additional relations between genes linked, in turn, to a common set of genes (“ContextNet”); these linkages were scored and integrated as above to construct the final gene network (“FinalNet,” see figure). The final network has 34,000 linkages at an accuracy comparable to the gold standard small-scale interaction assays, which provides linkages (“ConfidentNet”) for more than 4681 yeast genes (81% of the yeast proteome). The network is reasonably distinct from networks of physically interacting proteins (e.g., sharing only ~16% of linkages with Jansen et al.). The final network shows extensive clustering of genes into modules, evident in the “clumping” (A).

Features of integrated networks
At an intermediate degree of clustering that maximizes cluster size and functional coherence, 564 (of 627) modules are shown connected by the 950 strongest intermodule linkages.
Module colors and shapes indicate associated functions, as defined by the Munich Information Center for Protein Sequences (MIPS), with sizes proportional to the number of genes, and connections inversely proportional to the fraction of genes linking the clusters. Lee, ..., Marcotte, Science 306, 1555 (2004)

Features of integrated networks
Adding context-inferred linkages increased clustering of genes, which produced a highly modular gene network with well-defined subnetworks. We expected these gene clusters to reflect gene systems and modules. We could therefore generate a simplified view of the major trends in the network (Fig. 3B) by clustering genes of ConfidentNet according to their connectivities. Of the 4681 genes, 3285 (70.2%) were grouped into 627 clusters, reflecting the high degree of modularity. Genes' functions within each cluster are highly coherent, and with 2 to 154 genes per cluster (ca. 5 genes per cluster on average), the clusters effectively capture typical gene pathways and/or systems. Lee, ..., Marcotte, Science 306, 1555 (2004)

Features of integrated networks
Portions of the final, confident gene network are shown for (C) DNA damage response and/or repair, where modularity gives rise to gene clusters, indicated by similar colors, and (D) chromatin remodeling, with several uncharacterized genes (red labels). Networks are visualized with Large Graph Layout (LGL). Lee, ..., Marcotte, Science 306, 1555 (2004)

Features of integrated networks
A region of the modular network centered on the DNA damage response and repair systems is shown in Fig. 3C.
The network is clearly hierarchical: individual clusters represent distinct systems related to DNA damage response and/or repair; these clusters are in turn connected to modules of cell cycle regulatory genes and chromatin silencing, which are functionally linked to the DNA damage response and/or repair system. One can infer individual genes' functions on the basis of linked neighbors. (Comment: this simple scheme can be improved by combining this approach with that of Nabieva et al.) For example, seven uncharacterized genes are implicated in chromatin remodeling (Fig. 3D). All but 1 of the 18 linkages made by these genes arise from the comparative genomics analysis or from the network context methods, which represent examples of the insights that arise only after data integration. Three of the uncharacterized proteins are predicted by sequence homology to have helicase activity, which is reasonable for a relation to chromatin remodeling; four of these proteins localize to the nucleus, further supporting their association. After this network's construction, one gene, VID21, was implicated in chromatin modification as a component of the NuA4 histone acetyltransferase. Lee, ..., Marcotte, Science 306, 1555 (2004)

Experimental verification of one prediction
The function of the RNA helicase PRP43, previously thought to be involved only in pre-mRNA splicing and implicated in lariat-intron release from the spliceosome, is also clarified in the network. PRP43 is linked most strongly to genes of ribosome biogenesis and rRNA processing. The tightest links are to ERB1, RRB1, NUG1, LHP1, and PWP1, the first three of which are confirmed ribosome biogenesis factors. These links derive only from the coexpression and context methods; data integration is therefore critical.
The association of PRP43 with ribosome biogenesis has now been experimentally validated: the growth defect conferred by a PRP43 conditional lethal mutation corresponds to a rapid and major defect in rRNA processing. These data indicate that rRNA processing is the essential function of PRP43, and it joins a growing group of RNA helicases with two or more distinct functions. Lee, ..., Marcotte, Science 306, 1555 (2004)

Summary
The probabilistic gene network integrates evidence from diverse sources to reconstruct an accurate network by estimating the functional coupling among yeast genes, and it provides a view of the relations between yeast proteins that is distinct from their physical interactions. The application of this strategy to other organisms, such as human, is conceptually straightforward: (i) assemble benchmarks for measuring the accuracy of linkages between human genes based on properties shared among genes in the same systems, (ii) assemble gold standard sets of highly accurate interactions for calibrating the benchmarks, and (iii) benchmark functional genomics data for their ability to correctly link human genes, then integrate the data as described. New data can be incorporated in a simple manner, serving to reinforce the correct linkages. Thus, the gene network will ultimately converge by successive approximation to the correct structure simply by continued addition of functional genomics data in this framework. Lee, ..., Marcotte, Science 306, 1555 (2004)
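The integration step, and the way added data reinforce a linkage, can be sketched as follows. The sketch assumes the rank-weighted sum has the form WS = sum over i of L_i / D^(i-1), where L_i are the log likelihood scores that different data sets assign to one gene pair, sorted in decreasing order, and D is a free weighting parameter. The value D = 2, the gene labels g1 to g3, and all scores are invented for illustration and are not taken from Lee et al.

```python
def weighted_sum(scores, d=2.0):
    """Rank-weighted sum: the best score counts fully, the next is divided by d,
    the one after that by d**2, and so on (assumed form, see lead-in)."""
    ranked = sorted(scores, reverse=True)
    return sum(s / d ** i for i, s in enumerate(ranked))


def integrate(datasets, d=2.0):
    """Combine several data sets into one network.

    datasets: list of dicts mapping (gene, gene) -> log likelihood score.
    Returns a dict mapping frozenset({gene, gene}) -> integrated score,
    so that (a, b) and (b, a) are treated as the same linkage.
    """
    by_pair = {}
    for linkages in datasets:
        for pair, score in linkages.items():
            by_pair.setdefault(frozenset(pair), []).append(score)
    return {pair: weighted_sum(scores, d) for pair, scores in by_pair.items()}


# Hypothetical scores from two data sets.
coexpression = {("g1", "g2"): 2.0, ("g1", "g3"): 1.0}
context = {("g2", "g1"): 1.0}

before = integrate([coexpression])
after = integrate([coexpression, context])

pair = frozenset(("g1", "g2"))
print(before[pair], "->", after[pair])
```

In this toy case the g1-g2 linkage is supported by both data sets, so its integrated score rises when the second data set is added, while the g1-g3 linkage keeps its single-source score.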