Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Hierarchical Organization of Functional Modules in Weighted Protein Interaction Networks Using Clustering Coefficient Min Li1 , Jianxin Wang1, , Jianer Chen1,2 , and Yi Pan1,3 1 School of Information Science and Engineering, Central South University, Changsha 410083, P. R. China 2 Department of Computer Science, Texas A&M University, College Station, TX 77843, USA 3 Department of Computer Science, Georgia State University, Atlanta, GA 30302-4110, USA Abstract. As advances in the technologies of predicting protein interactions, huge data sets portrayed as networks have been available. Several graph clustering approaches have been proposed to detect functional modules from such networks. However, all methods of predicting protein interactions are known to yield a nonnegligible amount of false positives. Most of the graph clustering algorithms are challenging to be used in the network with high false positives. We extend the protein interaction network from unweighted graph to weighted graph and propose an algorithm for hierarchically clustering in the weighted graph. The proposed algorithm HC-Wpin is applied to the protein interaction network of S.cerevisiae and the identified modules are validated by GO annotations. Many significant functional modules are detected, most of which are corresponding to the known complexes. Moreover, our algorithm HCWpin is faster and more accurate compared to other previous algorithms. The program is available at http://bioinfo.csu.edu.cn/limin/HC-Wpin. Keywords: Protein interaction network, clustering, functional module. 1 Introduction High-throughput methods, such as yeast-two-hybrid and mass spectrometry, have led to the emergence of large protein-protein interaction datasets [1–6]. These protein-protein interactions can be naturally represented in the form of networks and provide useful insights into functional associations between proteins[7, 8]. A wide range of graph clustering algorithms have been developed for identifying functional modules from such protein interaction networks. This research was supported in part by the National Basic Research 973 Program of China No. 2008CB317107, the National Natural Science Foundation of China under Grant No. 60773111, the Program for New Century Excellent Talents in University No. NCET-05-0683, the Program for Changjiang Scholars and Innovative Research Team in University No. IRT0661. Corresponding author. I. Măndoiu, G. Narasimhan, and Y. Zhang (Eds.): ISBRA 2009, LNBI 5542, pp. 75–86, 2009. c Springer-Verlag Berlin Heidelberg 2009 76 M. Li et al. Recently, hierarchical model of modular network has been introduced and several hierarchical clustering approaches have been applied to identify functional modules[9–12]. In general, the hierarchical clustering approaches can represent the protein interaction networks in hierarchy by tree. According to the differences of constructing the tree, hierarchical clustering approaches can be classed into two groups: the top-down approach and the bottom-up approach. The topdown approaches start from one cluster with all vertices and recursively dividing it into several dissimilar sub-clusters. A typical example is the betweenness centrality-based algorithm proposed by Girvan and Newman [9]. In contrast, the bottom-up approaches start at single vertex clusters and iteratively merge similar clusters. MoNet algorithm [12] is an example of the bottom-up approaches. However, the hierarchical clustering approaches are known to be sensitive to noisy data [8]. Up to now, all methods of predicting protein-protein interactions can not avoid yielding a non-negligible amount of noisy data (false positives)[13]. Thus, the conventional hierarchial clustering approaches are challenging to be used directly in the networks with false positives. A series of density-based clustering approaches have been proposed to identify densely connected regions from protein interaction networks which are somewhat robust to noisy data. An extreme example is the maximum clique algorithm [14] which detects fully connected subnetworks. However, only mining fully connected subnetworks is too strict to be used in real biological networks. A variety of alternative density functions have been proposed to detect dense subnetworks [15–19]. However, the density-based approaches neglect many peripheral proteins that connect to the core protein clusters with few links, even though these peripheral proteins may represent true interactions that have been experimentally verified [12]. As Barabási and Oltvai have pointed out that the density-based clustering approaches are not able to partition the biological networks whose degree distributions are typically power-law [20]. In addition, biologically meaningful functional modules that do not have highly connected topologies are ignored by these approaches [12]. In this paper, we extend the protein interaction network to a weighted graph and develop an algorithm, named HC-Wpin, for hierarchical clustering in the weighted graph. Algorithm HC-Wpin is applied to the weighted protein interaction network of S.cerevisiae and the identified modules are validated by GO annotations (including Biological Process, Molecular Function, and Cellular Component ). The experimental results show that the identified modules are statistically significant in terms of three types of GO annotations. Compared to other previous competing algorithms, our algorithm HC-Wpin is faster and more accurate, which can be used in even larger protein interaction networks. 2 Methods Edge Clustering Coefficient In Weighted Protein Interaction Network A weighted protein interaction network can be represented as a weighted undirected graph G = (V, E), where V is a set of vertices and E is a set of edges Hierarchical Organization of Functional Modules 77 between the vertices. Each edge (u, v) is assigned with a weight w(u, v), which represents the probability of this interaction being a true positive. Clustering coefficient is first proposed to describe the property of a vertex in a network. Clustering coefficient of a vertex is the ratio of the number of connections in the neighborhood of the vertex and the number of connections if the neighborhood is fully connected [21]. Roughly speaking, clustering coefficient tells how well connected the neighborhood of the vertex is. Recently, Radicchi et al. [22] generalized the clustering coefficient of a vertex to an edge, and defined it as the number of triangles to which a given edge belonged, divided by the number of triangles that might potentially include it. The definition is not feasible when the network has few triangles. In our previous studies, we have redefined the clustering coefficient of an edge by calculating the common neighbors instead of the triangles [23]. However, all the definitions of clustering coefficient in the unweighted protein interaction networks do not consider the edge reliability. Here, we redefine the clustering coefficient of an edge in the weighted graph. Let Nu be the set of neighbors of vertex u and Nv be the set of neighbors of vertex v, respectively. Then the clustering coefficient of an edge (u, v) in a weighted graph G is defined as: k∈Iu,v CCu,v = w(u, k) · s∈Nu w(u, s) · k∈Iu,v w(v, k) t∈Nv w(v, t) (1) where Iu,v denotes the set of common vertices in Nu and Nv (i.e. Iu,v = Nu ∩Nv ). Two vertices of an edge with larger clustering coefficient are more likely to lie in the same module. Quantitative Definition of Modules In Weighted Protein Interaction Network Generally, functional modules in protein interaction networks are loosely referred to as highly connected subgraphs, which have more internal edges than external edges. Several module definitions have been proposed, such as strong module and weak module [22]. For an unweighted protein interaction network with high false positive, false prediction may be generated by using these definitions. To avoid the effect of false positive interactions in protein interaction networks, we define a functional module in the weighted graph. For a weighted graph G = (V, E), the degree of a vertex v is denoted as dw (v) which is the sum of weights of the edges connecting v: dw (v) = w(u, v). (2) u∈Nv ;(u,v)∈E For a vertex v in a subgraph H ⊆ G, its in-degree, denoted as din w (H, v), is the sum of weights of edges connecting vertex v to other vertices belonging to H, and its out-degree, denoted as dout w (H, v), is the sum of weights of edges connecting 78 M. Li et al. out vertex v to other vertices in the rest of the graph G. din w (H, v) and dw (H, v) can be formed as formula (3) and formula (4), respectively. din w (H, v) = w(u, v). (3) u,v∈H;(u,v)∈E dout w (H, v) = w(u, v). (4) v∈H;u∈H;(u,v)∈E / It is clearly that the degree dw (v) of a vertex v is equaled to the sum of din w (H, v) and dout (H, v). w Definition 1. Given a weighted undirected graph G = (V, E, W ) and a threshold λ, a subgraph H ⊆ G is a λ-module if v∈H din w (H, v) > λ dout w (H, v) (5) v∈H where λ is a parameter determined by user. By changing the values of parameter λ , we can get different modules in the weighted protein interaction network. Hierarchical Clustering Based on the definitions of edge clustering coefficient and λ-module in weighted protein interaction networks, we propose a novel hierarchical clustering algorithm HC-Wpin. The whole description of algorithm HC-Wpin is shown in Figure 1. In Figure 1, Ci denotes a cluster, V (Ci ) and E(Ci ) denote the set of vertices and edges in the cluster Ci , respectively. Let Din (Ci ) be the sum of the in-degree of the vertices in Ci , and Dout (Ci ) be the sum of the out-degree of the vertices in Ci . For a vertex v, L(v) indicates which cluster it is in. The input to algorithm HC-Wpin is a weighted undirected graph G(V, E). Firstly, all vertices in the graph G are initialized as singleton clusters. Then, the clustering coefficient of each edge in the graph G is calculated. We enqueue all the edges into a queue Sq in non-increasing order in terms of their clustering coefficients. The higher clustering coefficient the edge has, the more likely its two vertices are inside a module. By gradually adding edges in the queue Sq to clusters, algorithm HC-Wpin finally assembles all the singleton clusters into λ-modules. In the end, the λ-modules which consist of s or more than s proteins are outputted. Parameter s is used to control the minimum size of the output functional modules. Let n and m denote the number of vertices and edges in a weighted protein interaction network respectively and k be the average number of neighbors of all the vertices, i.e. k = n1 v∈V |Nv |. Then, the complexity of calculating all the edge clustering coefficients is O(k 2 m), the complexity of the iterative merging is O(m). Thus, the total computational complexity of algorithm HC-Wpin is O(k 2 m). In general, k is very small and can be considered as a constant. Hierarchical Organization of Functional Modules 79 Algorithm HC-Wpin input: a weighted graph G = (V, E), parameters λ and s; output: identified modules; 1. for each vertex vi ∈ V do V (Ci ) = {vi }; E(Ci ) = ∅ 2. for each edge (u, v) ∈ E do compute its clustering coefficient; 3. sort all edges to queue Sq in non-increasing order in terms of clustering coefficients; 4. while Sq = ∅ do { e(u, v) ← Sq ; if L(u) = L(v) then i = L(u); E(Ci ) = E(Ci ) ∪ {e(u, v)}; else i = L(u); j = L(v); Din (Cj ) in (Ci ) if DDout ≤ λ or Dout ≤ λ then (Ci ) (Cj ) V (Ci ) = V (Ci ) ∪ V (Cj ); E(Ci ) = E(Ci ) ∪ E(Cj ); Cj = {∅, ∅}; Sq = Sq − {e(u, v)};} 5. for i=1 to |V | do if |V (Ci )| ≥ s then output Ci ; Fig. 1. The description of algorithm HC-Wpin 3 Experiments and Results Identification of λ Modules In The Network of S.cerevisiae The original unweighted protein interaction network of S.cerevisiae, consisting of 4,726 proteins and 15,166 interactions, was downloaded from the DIP database[24]. To construct weighted protein interaction network, we assign confidence scores to these interactions using the logisticregression-based scheme employed in [25, 26]. Roughly speaking, the confidence score is computed based on the experimental evidences which include the type of experiments in which the interaction is observed, and the number of observations in each experimental type. We apply algorithm HC-Wpin to the weighted protein interaction network and achieve five output sets of modules by changing the values of parameter λ from 1.0 to 3.0 with 0.5 increment. Table 1 illustrates the effect of parameter λ on clustering. In Table 1, Max.Size represents the size of the largest module, and Avg.Size represents the average size of all the identified modules. As shown in Table 1, Table 1. The effect of λ on clustering Parameter Modules Max.size Avg.size λ = 1.0 145 79 9.24 λ = 1.5 132 125 10.54 λ = 2.0 117 263 12.25 λ = 2.5 91 982 16.89 λ = 3.0 77 1192 20.82 80 M. Li et al. #3 (982) #11 (8) #11 (5) #14 (72) #3 #16 (4) (125) #3 (263) #15 (7) #19 (19) #27 (3) #18 (73) #35 (9) #4 #69 #51 (79) (33) (8) Y L R 4 1 8 C Y B R 2 7 9 W Y O L 1 4 5 C Y G L 2 4 4 W #22 (4) Y M L 0 1 0 W Y G L 2 0 7 W #30 #33 (22) (8) #36 (3) #37 #38 (18) (7) #44 (73) #0 (26) Y M L 0 6 9 W #39 (6) #43 (4) #51 (72) #52(73) #46 (15) #20 #64 (45) (12) Y O R 1 2 3 C #26 (5) Ȝ=2.5 Y M R 1 9 7 C Y K L 1 9 6 C #35 (4) Y B L 0 5 0 W #36 #38 (4) (50) Y O R 1 0 6 W Y G L 0 9 5 C Y D R 4 6 8 C #55 (3) #19 (11) #46 #61 #94 (9) (17) (9) Y L R 0 9 3 C #54 #55 #56 (8) (255) (3) Y O L 0 1 8 C Y G L 2 1 2 W #64 (17) #65 (123) #63 (5) #68 (8) #71 (3) #80 (4) #69 (16) #72 (17) #87 (4) #94 (5) #41 #54 #76 (27) (29) (32) Y O L 0 0 4 W #89 (3) #126 (3) Ȝ=2.0 Ȝ=1.5 Ȝ=1.0 #58 (7) Y N L 3 3 0 C Y D L 0 7 6 C Y P L 1 3 9 C Y M R 2 6 3 W Y I L 0 8 4 C Y P L 1 8 1 W The identified module consisting of 982 proteins which is generated by HC-Wpin with Ȝ=2.5 The identified modules generated by HC-Wpin with Ȝ=2.0 which are included in a larger module generated by HC-Wpin with Ȝ=2.5 The identified modules generated by HC-Wpin with Ȝ=1.5 which are included in larger modules generated by HC-Wpin with Ȝ=2.0 The identified modules generated by HC-Wpin with Ȝ=1.0 which are included in larger modules generated by HC-Wpin with Ȝ=1.5 Proteins Fig. 2. An example of hierarchical modules generated by algorithm HC-Wpin with different values of parameter λ. All the identified modules listed in this figure are available from Additional file 1. the number of the identified modules is decreasing with the increase of λ. The average size of all the identified modules and the size of the biggest module are increasing as λ increases. Bigger modules are generated by HC-Wpin when larger value of λ is used. This is because the larger value of λ may lead to the merging of clusters in the agglomerative process. Figure 2 illustrates an example that small modules are iteratively merged with the increase of λ. As shown in Figure 2, the modules #19, #41, #54, #58 and #76 generated by HC-Wpin with λ=1.0 are merged into a larger module #65 which is generated by HC-Wpin with λ=1.5. When λ=2.0, the module #65 and the other nine green modules generated with λ=1.5 are merged into a larger module #55 which consists of 255 proteins. When λ=2.5, all the 24 modules generated with λ=2.0 are merged into a more larger module which consists of 982 proteins. By changing the values of parameter λ, we can obtain the hierarchial organization of functional modules in the protein interaction network. In the following subsection we will evaluate the significance of the identified modules by using the SGD GO Term Finder (http://www.yeastgenome.org/). The evaluation results show that the hierarchical organization of modules are approximatively corresponding to the hierarchical structure of GO annotations. Statistical Assessment of the Identified Modules To test whether the identified modules are significant, we validate them using all the three types of GO terms: Biological Process, Molecular Function, and Cellular Hierarchical Organization of Functional Modules 81 Component. For each identified module, the P-value from the hypergeometric distribution is calculated based on the three types of GO annotations. A cutoff parameter is used to differentiate significant groups from insignificant ones. If an identified module is associated with a P-value larger than cutoff, it is considered insignificant. We use the recommended cutoff of 0.05 for all our validations. Firstly, we evaluate the significance of the hierarchical modules shown in Figure 2. As pointed out by Gavin et al. and Krogan et al., the larger complexes are composed, sometimes transiently, from smaller subcomplexes [3, 6]. We obtain the similar results that the smaller functional modules are hierarchically organized into the larger functional modules which is approximatively corresponding to the hierarchical structure of GO annotations. For example, the five green modules (#38, #64, #65, #69 and #72) generated with λ=1.5 are merged into a larger module #55 generated with λ=2.0. Correspondingly, the Biological Process annotation for the module #55 is the common ancestor of that for the five smaller modules #38, #64, #65, #69 and #72 in the hierarchical structure of GO annotations, as shown in Additional file 2. Similar corresponding relation between the hierarchical modules and the hierarchical structure of GO annotations is obtained both for Molecular Function and for Cellular Component annotations. Next, we evaluate all the identified modules generated by HC-Wpin. Take the modules with λ=1.0 for example, 130 out of 145 identified modules are validated to be significant with Biological Process annotations. The lowest P-values of the 130 significant modules range from 4.93E-02 to 5.82E-73. For Molecular Function annotations, 115 identified modules are validated to be significant, whose lowest P-values range from 4.70E-02 to 2.50E-52. The module with the lowest P-value of 2.50E-52 is composed of 33 members. Of all the 33 proteins, more than 75% have the function of “RNA polymerase activity”. For Cellular Component annotations, the lowest P-value of all the identified modules is 1.51E-68. The module with the lowest P-value is composed of 45 proteins, in which 37 proteins belong to the known complex “small nuclear ribonucleoprotein complex”. There are a series of small identified modules matching the known complexes perfectly in the network of S.cerevisiae. For example, the module #10 consisting of 3 proteins (EFB1, TEF4, TEF1) is exactly the “eukaryotic translation elongation factor 1 complex”, the module #136 consisting of 3 proteins (POL3, POL31, POL32) is exactly the “delta DNA polymerase complex”, the module #33 consisting of 4 proteins (APL3, APL1, APS2, APM4) is exactly the “AP-2 adaptor complex”, the module #90 consisting of 4 proteins (SEC66, SEC72, SEC63, SEC62) is exactly the “endoplasmic reticulum Sec complex”, and the module #131 consisting of 5 proteins (VPS29, PEP8, VPS35, VPS5, VPS17) is exactly the “retromer complex”. There are some identified modules which have the similar P-values validated by different types of GO annotations. For the example of module #134, all the 4 proteins (SEN34, SEN2, SEN15, SEN54) in it are exactly the entire members in the network of S.cerevisiae which have the same molecular function of “endoribonuclease activity, producing 3’-phosphomonoesters”, participate in the same biological process of “RNA splicing, via endonucleolytic cleavage and ligation” and are in the same cellular component “tRNA-intron endonuclease complex”. 82 M. Li et al. There are also a number of identified modules whose lowest P-values are very different with different validation of GO terms. For example, the module #60 is composed of 4 proteins (APM3, APL6, APS3, APL5). For Biological Process annotations, it has a lowest P-value of 4.55E-09. The 4 proteins combining with other 18 proteins participate in the process of “Golgi to vacuole transport”. For Molecular Function annotations, 3 proteins (APM3, APL6, APS3) out of the 4 proteins are directly annotated to the root term “molecular function unknown”. It has no P-value result. However, it is exactly the known complex “AP-3 adaptor complex” for Cellular Component annotation with the lowest P-value of 6.70E13. The detailed annotations of one type of GO terms may give clues to study another type of GO terms. The above analyses show that our algorithm HC-Wpin not only can identify significant functional modules but also can detect significant functional modules in hierarchy. Comparison with Other Methods To evaluate the effectiveness of our algorithm HC-Wpin, we compare it with several previous state-of-the-art algorithms: the MoNet algorithm [12] and FAG-EC algorithm [23] as hierarchial clustering approaches, the MCODE algorithm and DPClus algorithm as density-based methods, and the STM algorithm [27]. The values of the parameters in each algorithm are selected from those recommended by the author. The accuracy of each algorithm is calculated and shown in Table 2. The accuracy of an algorithm indicates the average f -measure of the significant modules generated by it. f -measure of an identified module is defined as a harmonic mean of its recall and precision. f -measure = 2 ∗ recall ∗ precision recall + precision recall = |M ∩ Fi | |Fi | precision = |M ∩ Fi | |M | (6) (7) (8) where Fi is a functional category mapped to module M . The proteins in functional category Fi are considered as true predictions, the proteins in module M are considered as positive predictions, and the common proteins of Fi and M are considered as true positive predictions. Recall is the fraction of the true-positive predictions out of all the true predictions, and precision is the fraction of the true-positive predictions out of all the positive predictions[8]. As shown in Table 2, the accuracy of our algorithm HC-Wpin is much higher than that of the other four algorithms: FAG-EC, MCODE, DPClus and STM with all the validations of Biological Process (abbreviated as B.P.), Molecular Function (abbreviated as M.F.), and Cellular Component (abbreviated as C.C.). MoNet produces a giant module consisting of 3336 proteins and three small Hierarchical Organization of Functional Modules 83 Table 2. Comparison of the accuracy of algorithm HC-Wpin and other previous algorithms Algorithms Modules Average of Maximum Accuracy size ≥ 3 Size Size B.P. M.F. C.C. HC-Wpin 145 9.24 79 0.34 0.29 0.50 MoNet 4 837.50 3336 FAG-EC 326 8.52 237 0.27 0.20 0.40 MCODE 59 74.78 555 0.29 0.24 0.39 DPClus 236 4.02 13 0.26 0.19 0.35 STM 10 467.80 4647 0.21 0.10 0.01 (a) Biological Process Parameters λ=1.0 S=1 λ=1.0 fluff =0.1;VWP =0.2 CPin =0.5;Din =0.9 Merge=1.0 (b) Molecular Function 0 (c) Cellular Component 0 0 -10 -10 -10 -20 -20 log(P-value) log(P-value) -40 HC-Wpin MoNet FAG-EC MCODE DPClus STM -50 -60 -70 0 5 10 15 20 Significant modules 25 -30 HC-Wpin MoNet FAG-EC MCODE DPClus STM -40 -50 -80 30 log(P-value) -20 -30 -30 -40 HC-Wpin MoNet FAG-EC MCODE DPClus STM -50 -60 -70 -60 -80 0 5 10 15 20 Significant modules 25 30 0 5 10 15 20 25 30 Significant modules Fig. 3. Comparison of P-value distribution of significant modules generated by HC-Wpin and other algorithms. The x axis represents the number of significant modules and the y axis represents the log(P-value) for each corresponding module. modules. Only one module is validated to be significant for all the three type of GO annotations. Thus, we does not list the accuracy of MoNet. The false clustering of MoNet is mostly caused by the miscalculation of betweenness. The false positive interactions can yield the incorrect shortest paths in a network and the incorrect shortest paths cause miscalculation of betweenness. Figure 3 (a), (b), and (c) illustrate the P-value distributions of the significant modules generated by all these algorithms. Since the network has a lower probability to produce the module by chance, the module is more significant with lower P-value. As observed from Figure 3, the 30 best significant modules identified by our algorithm HC-Wpin are consistently more significant than those generated by other algorithms. The comparison results in Table 2 and Figure 3 show that our algorithm HC-Wpin outperforms the other previous algorithms. Efficiency Analysis All experiments in this paper are implemented on a Linux server with 4 Intel Xeon 3.6GHz CPU and 4GByte RAM. Table 3 illustrates a comparison of the running time of our algorithm HC-Wpin and the other five algorithms for identifying functional modules. The network of S.cerevisiae (4726 proteins and 15166 interactions) and the other four networks with different confidence level obtained from [28] are used as the test data. We call the network of S.cerevisiae 84 M. Li et al. Table 3. Comparison of the running time of algorithm HC-Wpin and other algorithms Algorithms HC-Wpin FAG-EC MCODE DPClus MoNet STM The running time (second) Y2k Y11k Y45k Y78k NSc (988, 2455) (2401, 11000) (4687, 45000) (5321, 78390) (4726, 15166) 0.1 0.6 5.6 9.8 0.7 0.1 0.6 5.6 9.8 0.7 0.2 7.2 1037.2 4129.8 480.2 0.3 32.6 1116.4 5638.6 1194.7 2.0 88.2 6593.8 9516.2 2852.4 1.0 7944.3 62360.2 140174.8 27073.2 NSc and name the other four networks Y2k, Y11k, Y45k and Y78k, respectively, according to the number of edges included in them. From Table 3, we can see the running time of MoNet, STM, MCODE and DPClus increase sharply with the size of network. It takes more than 4,000 seconds for MCODE and DPClus, about 10,000 seconds for MoNet, and more than 100,000 seconds for STM to identify modules from the network consisting of 5,321 proteins and 78,390 interactions. However, the running time of HC-Wpin and FAG-EC detecting functional modules from the same network is still small and less than 10 seconds. As one can see, algorithm HC-Wpin is extremely fast, which is hundreds of times faster than MCODE, DPClus, MoNet and STM. As the protein-protein interactions accumulating, algorithm HC-Wpin can be used in even larger protein interaction networks. 4 Conclusions In previous studies, the protein interaction networks are generally represented as unweighted graphs. As is well known, the protein interaction networks can not avoid of false positives. Thus, directly clustering from the unweighted graphs with high false positives, most of the previous graph clustering approaches may generate a number of false predictions. In this paper, we extend the protein interaction network from unweighted graph to weighted graph and develop a fast hierarchical clustering algorithm HC-Wpin to identify functional modules from the weighted graph. The reliability of interactions is measured by the logisticregression-based scheme [25, 26]. By changing the values of parameter λ, we can identify the functional modules in a hierarchy. We use all the three types of GO Terms to validate the identified modules of S.cerevisiae. Many significant functional modules are detected, most of which are exactly corresponding to the known complexes. For most cases, the value of λ is recommended to be between 1.0 and 3.0. When you want to get small modules, you should select a small value for λ. On the contrary, you should select a relative large value for λ to obtain modules consisting of more proteins. We also compare the performances of our algorithm HC-Wpin and the other five algorithms: MoNet, FAG-EC, MCODE, DPClus, and STM. Unexpected Hierarchical Organization of Functional Modules 85 giant modules are generated by MoNet and STM which are caused by their weakness to the false positives. Although MCODE and DPClus are somewhat robust to the false positives, they are not adept at identifying hierarchically distributed functional modules. The quantitative comparison of accuracy reveal that algorithm HC-Wpin outperforms the other five algorithms. Another strength of our algorithm HC-Wpin is efficiency. It is very fast and can be applied to even larger protein interaction networks of other higher-level organisms. Acknowledgments. The authors would like to thank F. Luo and his colleagues for sharing their program of MoNet, to W. Hwang and his colleagues for sharing the source code of STM. The authors are also thankful to M. Altaf-UI-Amin and his colleagues for sharing the tool of DPClus, to G.D. Bader and C.W. Hogue for their publicity of MCODE. The authors also thank T. Shlomi for providing and discussing about the data. Additional Files Additional file 1 — Example for hierarchical organization of functional modules. This file contains the hierarchical modules shown in Figure 2 which is available at http://bioinfo.csu.edu.cn/limin/HC-Wpin/Additional file 1.txt. Additional file 2 —Supplemental Figure 1. This file contains a supplemental Figure 1 which shows the reduction of hierarchical structure of biological process annotations for the hierarchical modules shown in Figure 2. This file is available at http://bioinfo.csu.edu.cn/limin/HC-Wpin/Additional file 2.pdf. References 1. Uetz, P., et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627 (2000) 2. Gavin, A.C., et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415(6868), 141–147 (2002) 3. Gavin, A.C., et al.: Proteome survey reveals modularity of the yeast cell machinery. Nature 440(7084), 631–636 (2006) 4. Ho, Y., et al.: Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature 415(6868), 180–183 (2002) 5. Krogan, N.J., et al.: High-definition macromolecular. composition of yeast RNAprocessing complexes. Molecular Cell 13, 225–239 (2004) 6. Krogan, N.J., et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084), 637–643 (2006) 7. Harwell, L.H., Hopfield, J.J., Leibler, S., Murray, A.W.: From molecular to modular cell biology. Nature 402, c47–c52 (1999) 8. Cho, Y.R., Hwang, W., Ramanmathan, M., Zhang, A.D.: Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics 8, 265 (2007) 9. Girvan, M., Newman, M.E.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99, 7821–7826 (2002) 86 M. Li et al. 10. Rives, A.W., Galitski, T.: Modular organization of cellular networks. Proc. Natl. Acad. Sci 100, 1128–1133 (2003) 11. Ravasz, E., et al.: Hierarchical organization of modularity in metaboli networks. Science 297, 1551–1555 (2002) 12. Luo, F., Yang, Y.F., Chen, C.F., Chang, R., Zhou, J.Z.: Modular organization of protein interaction networks. Bioinformatics 23(2), 207–214 (2007) 13. Brohee, S., van Helden, J.: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics, 7–488 (2006) 14. Spirin, V., Mirny, L.A.: Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci., 12123–12128 (2003) 15. Bu, D., et al.: Topological structure analysis of the protein-protein interaction networks in budding yeast. Nucleic Acid Research 31(9), 2443–2450 (2003) 16. Brun, C., et al.: Clustering proteins from interaction networks for the prediction of cellular functions. BMC Bioinformatics 7, 488 (2004) 17. Bader, G.D., Hogue, C.W.: An Automated Method for Finding Molecular Complexes in Large Protein Interaction Networks. BMC Bioinformatics 4, 2 (2003) 18. Altaf-Ul-Amin, M., Shinbo, Y., Mihara, K., Kurokawa, K., Kanaya, S.: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics, 7–207 (2006) 19. Palla, G., Derenyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 814–818 (2005) 20. Barabási, A.L., Oltvai, Z.N.: Network biology: understanding the cell’s functional organization. Nature Reviews: Genetics 5, 101–114 (2004) 21. Friedel, C., Zimmer, R.: Inferring topology from clustering coefficients in proteinprotein interaction networks. BMC Bioinformatics, 7–519 (2006) 22. Radicchi, F., et al.: Defining and identifying communities in networks. Proc. Natl. Acad. Sci. 101, 2658–2663 (2004) 23. Li, M., Wang, J.X., Chen, J.E.: A fast agglomerate algorithm for mining functional modules in protein interaction networks. In: Peng, Y., Zhang, Y. (eds.) Proceedings of the First International Conference on BioMedical Engineering and Informatics: Hainan, China, May 27-30, pp. 3–7 (2008) 24. Xenarios, I., et al.: DIP: the Database of Interaction Proteins: a research tool for studying cellular networks of protien interactions. Nucleic Acids Res. 30, 303–305 (2002) 25. Sharan, R., et al.: Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. 102(6), 1974–1979 (2005) 26. Shlomi, T., Segal, D., Ruppin, E., Sharan, R.: Qpath: a method for querying pathways in a protein-protein interaction network. BMC Bioinformatics, 7–199 (2006) 27. Hwang, W., Cho, Y.R., Zhang, A., Ramanathan, M.: A novel functional module detection algorithm for protein-protein interaction networks. Algorithms for Molecular Biology 12, 1–24 (2006) 28. von Mering, C., et al.: Comparative assessment of large-scale data sets of proteinprotein interactions. Nature 417(6887), 399–403 (2002)