Download Hierarchical Organization of Functional Modules in Weighted

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Hierarchical Organization of Functional Modules
in Weighted Protein Interaction Networks Using
Clustering Coefficient
Min Li1 , Jianxin Wang1, , Jianer Chen1,2 , and Yi Pan1,3
1
School of Information Science and Engineering,
Central South University, Changsha 410083, P. R. China
2
Department of Computer Science,
Texas A&M University, College Station, TX 77843, USA
3
Department of Computer Science,
Georgia State University, Atlanta, GA 30302-4110, USA
Abstract. As advances in the technologies of predicting protein interactions, huge data sets portrayed as networks have been available. Several graph clustering approaches have been proposed to detect functional
modules from such networks. However, all methods of predicting protein
interactions are known to yield a nonnegligible amount of false positives.
Most of the graph clustering algorithms are challenging to be used in
the network with high false positives. We extend the protein interaction network from unweighted graph to weighted graph and propose an
algorithm for hierarchically clustering in the weighted graph. The proposed algorithm HC-Wpin is applied to the protein interaction network
of S.cerevisiae and the identified modules are validated by GO annotations. Many significant functional modules are detected, most of which
are corresponding to the known complexes. Moreover, our algorithm HCWpin is faster and more accurate compared to other previous algorithms.
The program is available at http://bioinfo.csu.edu.cn/limin/HC-Wpin.
Keywords: Protein interaction network, clustering, functional module.
1
Introduction
High-throughput methods, such as yeast-two-hybrid and mass spectrometry,
have led to the emergence of large protein-protein interaction datasets [1–6].
These protein-protein interactions can be naturally represented in the form
of networks and provide useful insights into functional associations between
proteins[7, 8]. A wide range of graph clustering algorithms have been developed
for identifying functional modules from such protein interaction networks.
This research was supported in part by the National Basic Research 973 Program of
China No. 2008CB317107, the National Natural Science Foundation of China under
Grant No. 60773111, the Program for New Century Excellent Talents in University
No. NCET-05-0683, the Program for Changjiang Scholars and Innovative Research
Team in University No. IRT0661.
Corresponding author.
I. Măndoiu, G. Narasimhan, and Y. Zhang (Eds.): ISBRA 2009, LNBI 5542, pp. 75–86, 2009.
c Springer-Verlag Berlin Heidelberg 2009
76
M. Li et al.
Recently, hierarchical model of modular network has been introduced and several hierarchical clustering approaches have been applied to identify functional
modules[9–12]. In general, the hierarchical clustering approaches can represent
the protein interaction networks in hierarchy by tree. According to the differences of constructing the tree, hierarchical clustering approaches can be classed
into two groups: the top-down approach and the bottom-up approach. The topdown approaches start from one cluster with all vertices and recursively dividing it into several dissimilar sub-clusters. A typical example is the betweenness
centrality-based algorithm proposed by Girvan and Newman [9]. In contrast, the
bottom-up approaches start at single vertex clusters and iteratively merge similar clusters. MoNet algorithm [12] is an example of the bottom-up approaches.
However, the hierarchical clustering approaches are known to be sensitive to
noisy data [8]. Up to now, all methods of predicting protein-protein interactions
can not avoid yielding a non-negligible amount of noisy data (false positives)[13].
Thus, the conventional hierarchial clustering approaches are challenging to be
used directly in the networks with false positives.
A series of density-based clustering approaches have been proposed to identify
densely connected regions from protein interaction networks which are somewhat
robust to noisy data. An extreme example is the maximum clique algorithm [14]
which detects fully connected subnetworks. However, only mining fully connected
subnetworks is too strict to be used in real biological networks. A variety of alternative density functions have been proposed to detect dense subnetworks [15–19].
However, the density-based approaches neglect many peripheral proteins that
connect to the core protein clusters with few links, even though these peripheral
proteins may represent true interactions that have been experimentally verified
[12]. As Barabási and Oltvai have pointed out that the density-based clustering approaches are not able to partition the biological networks whose degree
distributions are typically power-law [20]. In addition, biologically meaningful
functional modules that do not have highly connected topologies are ignored by
these approaches [12].
In this paper, we extend the protein interaction network to a weighted graph
and develop an algorithm, named HC-Wpin, for hierarchical clustering in the
weighted graph. Algorithm HC-Wpin is applied to the weighted protein interaction network of S.cerevisiae and the identified modules are validated by
GO annotations (including Biological Process, Molecular Function, and Cellular Component ). The experimental results show that the identified modules are
statistically significant in terms of three types of GO annotations. Compared
to other previous competing algorithms, our algorithm HC-Wpin is faster and
more accurate, which can be used in even larger protein interaction networks.
2
Methods
Edge Clustering Coefficient In Weighted Protein Interaction Network
A weighted protein interaction network can be represented as a weighted undirected graph G = (V, E), where V is a set of vertices and E is a set of edges
Hierarchical Organization of Functional Modules
77
between the vertices. Each edge (u, v) is assigned with a weight w(u, v), which
represents the probability of this interaction being a true positive.
Clustering coefficient is first proposed to describe the property of a vertex
in a network. Clustering coefficient of a vertex is the ratio of the number of
connections in the neighborhood of the vertex and the number of connections if
the neighborhood is fully connected [21]. Roughly speaking, clustering coefficient
tells how well connected the neighborhood of the vertex is. Recently, Radicchi et
al. [22] generalized the clustering coefficient of a vertex to an edge, and defined it
as the number of triangles to which a given edge belonged, divided by the number
of triangles that might potentially include it. The definition is not feasible when
the network has few triangles. In our previous studies, we have redefined the
clustering coefficient of an edge by calculating the common neighbors instead
of the triangles [23]. However, all the definitions of clustering coefficient in the
unweighted protein interaction networks do not consider the edge reliability.
Here, we redefine the clustering coefficient of an edge in the weighted graph.
Let Nu be the set of neighbors of vertex u and Nv be the set of neighbors
of vertex v, respectively. Then the clustering coefficient of an edge (u, v) in a
weighted graph G is defined as:
k∈Iu,v
CCu,v = w(u, k) ·
s∈Nu w(u, s) ·
k∈Iu,v
w(v, k)
t∈Nv w(v, t)
(1)
where Iu,v denotes the set of common vertices in Nu and Nv (i.e. Iu,v = Nu ∩Nv ).
Two vertices of an edge with larger clustering coefficient are more likely to lie in
the same module.
Quantitative Definition of Modules In Weighted Protein Interaction
Network
Generally, functional modules in protein interaction networks are loosely referred
to as highly connected subgraphs, which have more internal edges than external
edges. Several module definitions have been proposed, such as strong module and
weak module [22]. For an unweighted protein interaction network with high false
positive, false prediction may be generated by using these definitions. To avoid
the effect of false positive interactions in protein interaction networks, we define
a functional module in the weighted graph. For a weighted graph G = (V, E),
the degree of a vertex v is denoted as dw (v) which is the sum of weights of the
edges connecting v:
dw (v) =
w(u, v).
(2)
u∈Nv ;(u,v)∈E
For a vertex v in a subgraph H ⊆ G, its in-degree, denoted as din
w (H, v), is the
sum of weights of edges connecting vertex v to other vertices belonging to H, and
its out-degree, denoted as dout
w (H, v), is the sum of weights of edges connecting
78
M. Li et al.
out
vertex v to other vertices in the rest of the graph G. din
w (H, v) and dw (H, v)
can be formed as formula (3) and formula (4), respectively.
din
w (H, v) =
w(u, v).
(3)
u,v∈H;(u,v)∈E
dout
w (H, v) =
w(u, v).
(4)
v∈H;u∈H;(u,v)∈E
/
It is clearly that the degree dw (v) of a vertex v is equaled to the sum of din
w (H, v)
and dout
(H,
v).
w
Definition 1. Given a weighted undirected graph G = (V, E, W ) and a threshold
λ, a subgraph H ⊆ G is a λ-module if
v∈H
din
w (H, v) > λ
dout
w (H, v)
(5)
v∈H
where λ is a parameter determined by user. By changing the values of parameter
λ , we can get different modules in the weighted protein interaction network.
Hierarchical Clustering
Based on the definitions of edge clustering coefficient and λ-module in weighted
protein interaction networks, we propose a novel hierarchical clustering algorithm
HC-Wpin. The whole description of algorithm HC-Wpin is shown in Figure 1.
In Figure 1, Ci denotes a cluster, V (Ci ) and E(Ci ) denote the set of vertices and
edges in the cluster Ci , respectively. Let Din (Ci ) be the sum of the in-degree of
the vertices in Ci , and Dout (Ci ) be the sum of the out-degree of the vertices in
Ci . For a vertex v, L(v) indicates which cluster it is in.
The input to algorithm HC-Wpin is a weighted undirected graph G(V, E).
Firstly, all vertices in the graph G are initialized as singleton clusters. Then, the
clustering coefficient of each edge in the graph G is calculated. We enqueue all
the edges into a queue Sq in non-increasing order in terms of their clustering
coefficients. The higher clustering coefficient the edge has, the more likely its
two vertices are inside a module. By gradually adding edges in the queue Sq
to clusters, algorithm HC-Wpin finally assembles all the singleton clusters into
λ-modules. In the end, the λ-modules which consist of s or more than s proteins
are outputted. Parameter s is used to control the minimum size of the output
functional modules.
Let n and m denote the number of vertices and edges in a weighted protein
interaction network respectively
and k be the average number of neighbors of all
the vertices, i.e. k = n1 v∈V |Nv |. Then, the complexity of calculating all the
edge clustering coefficients is O(k 2 m), the complexity of the iterative merging
is O(m). Thus, the total computational complexity of algorithm HC-Wpin is
O(k 2 m). In general, k is very small and can be considered as a constant.
Hierarchical Organization of Functional Modules
79
Algorithm HC-Wpin
input: a weighted graph G = (V, E), parameters λ and s;
output: identified modules;
1. for each vertex vi ∈ V do
V (Ci ) = {vi }; E(Ci ) = ∅
2. for each edge (u, v) ∈ E do
compute its clustering coefficient;
3. sort all edges to queue Sq in non-increasing order in terms of clustering coefficients;
4. while Sq = ∅ do
{ e(u, v) ← Sq ;
if L(u) = L(v) then i = L(u); E(Ci ) = E(Ci ) ∪ {e(u, v)};
else i = L(u); j = L(v);
Din (Cj )
in (Ci )
if DDout
≤ λ or Dout
≤ λ then
(Ci )
(Cj )
V (Ci ) = V (Ci ) ∪ V (Cj ); E(Ci ) = E(Ci ) ∪ E(Cj ); Cj = {∅, ∅};
Sq = Sq − {e(u, v)};}
5. for i=1 to |V | do
if |V (Ci )| ≥ s then output Ci ;
Fig. 1. The description of algorithm HC-Wpin
3
Experiments and Results
Identification of λ Modules In The Network of S.cerevisiae
The original unweighted protein interaction network of S.cerevisiae, consisting of 4,726 proteins and 15,166 interactions, was downloaded from the DIP
database[24]. To construct weighted protein interaction network, we assign confidence scores to these interactions using the logisticregression-based scheme employed in [25, 26]. Roughly speaking, the confidence score is computed based on
the experimental evidences which include the type of experiments in which the
interaction is observed, and the number of observations in each experimental
type. We apply algorithm HC-Wpin to the weighted protein interaction network
and achieve five output sets of modules by changing the values of parameter λ
from 1.0 to 3.0 with 0.5 increment. Table 1 illustrates the effect of parameter λ
on clustering.
In Table 1, Max.Size represents the size of the largest module, and Avg.Size
represents the average size of all the identified modules. As shown in Table 1,
Table 1. The effect of λ on clustering
Parameter
Modules
Max.size
Avg.size
λ = 1.0
145
79
9.24
λ = 1.5
132
125
10.54
λ = 2.0
117
263
12.25
λ = 2.5
91
982
16.89
λ = 3.0
77
1192
20.82
80
M. Li et al.
#3 (982)
#11
(8)
#11
(5)
#14
(72)
#3
#16
(4) (125)
#3
(263)
#15
(7)
#19
(19)
#27
(3)
#18
(73)
#35
(9)
#4 #69 #51
(79) (33) (8)
Y
L
R
4
1
8
C
Y
B
R
2
7
9
W
Y
O
L
1
4
5
C
Y
G
L
2
4
4
W
#22
(4)
Y
M
L
0
1
0
W
Y
G
L
2
0
7
W
#30 #33
(22) (8)
#36
(3)
#37 #38
(18) (7)
#44
(73)
#0
(26)
Y
M
L
0
6
9
W
#39
(6)
#43
(4)
#51
(72)
#52(73)
#46
(15)
#20 #64
(45) (12)
Y
O
R
1
2
3
C
#26
(5)
Ȝ=2.5
Y
M
R
1
9
7
C
Y
K
L
1
9
6
C
#35
(4)
Y
B
L
0
5
0
W
#36 #38
(4) (50)
Y
O
R
1
0
6
W
Y
G
L
0
9
5
C
Y
D
R
4
6
8
C
#55
(3)
#19
(11)
#46 #61 #94
(9) (17) (9)
Y
L
R
0
9
3
C
#54 #55 #56
(8) (255) (3)
Y
O
L
0
1
8
C
Y
G
L
2
1
2
W
#64
(17)
#65
(123)
#63
(5)
#68
(8)
#71
(3)
#80
(4)
#69
(16)
#72
(17)
#87
(4)
#94
(5)
#41 #54 #76
(27) (29) (32)
Y
O
L
0
0
4
W
#89
(3)
#126
(3)
Ȝ=2.0
Ȝ=1.5
Ȝ=1.0
#58
(7)
Y
N
L
3
3
0
C
Y
D
L
0
7
6
C
Y
P
L
1
3
9
C
Y
M
R
2
6
3
W
Y
I
L
0
8
4
C
Y
P
L
1
8
1
W
The identified module consisting of 982 proteins which is generated by HC-Wpin with Ȝ=2.5
The identified modules generated by HC-Wpin with Ȝ=2.0 which are included in a larger module generated by HC-Wpin with Ȝ=2.5
The identified modules generated by HC-Wpin with Ȝ=1.5 which are included in larger modules generated by HC-Wpin with Ȝ=2.0
The identified modules generated by HC-Wpin with Ȝ=1.0 which are included in larger modules generated by HC-Wpin with Ȝ=1.5
Proteins
Fig. 2. An example of hierarchical modules generated by algorithm HC-Wpin with
different values of parameter λ. All the identified modules listed in this figure are
available from Additional file 1.
the number of the identified modules is decreasing with the increase of λ. The
average size of all the identified modules and the size of the biggest module
are increasing as λ increases. Bigger modules are generated by HC-Wpin when
larger value of λ is used. This is because the larger value of λ may lead to the
merging of clusters in the agglomerative process. Figure 2 illustrates an example
that small modules are iteratively merged with the increase of λ.
As shown in Figure 2, the modules #19, #41, #54, #58 and #76 generated
by HC-Wpin with λ=1.0 are merged into a larger module #65 which is generated by HC-Wpin with λ=1.5. When λ=2.0, the module #65 and the other
nine green modules generated with λ=1.5 are merged into a larger module #55
which consists of 255 proteins. When λ=2.5, all the 24 modules generated with
λ=2.0 are merged into a more larger module which consists of 982 proteins. By
changing the values of parameter λ, we can obtain the hierarchial organization
of functional modules in the protein interaction network. In the following subsection we will evaluate the significance of the identified modules by using the SGD
GO Term Finder (http://www.yeastgenome.org/). The evaluation results show
that the hierarchical organization of modules are approximatively corresponding
to the hierarchical structure of GO annotations.
Statistical Assessment of the Identified Modules
To test whether the identified modules are significant, we validate them using all
the three types of GO terms: Biological Process, Molecular Function, and Cellular
Hierarchical Organization of Functional Modules
81
Component. For each identified module, the P-value from the hypergeometric
distribution is calculated based on the three types of GO annotations. A cutoff
parameter is used to differentiate significant groups from insignificant ones. If an
identified module is associated with a P-value larger than cutoff, it is considered
insignificant. We use the recommended cutoff of 0.05 for all our validations.
Firstly, we evaluate the significance of the hierarchical modules shown in Figure 2. As pointed out by Gavin et al. and Krogan et al., the larger complexes are
composed, sometimes transiently, from smaller subcomplexes [3, 6]. We obtain the
similar results that the smaller functional modules are hierarchically organized
into the larger functional modules which is approximatively corresponding to the
hierarchical structure of GO annotations. For example, the five green modules
(#38, #64, #65, #69 and #72) generated with λ=1.5 are merged into a larger
module #55 generated with λ=2.0. Correspondingly, the Biological Process annotation for the module #55 is the common ancestor of that for the five smaller
modules #38, #64, #65, #69 and #72 in the hierarchical structure of GO annotations, as shown in Additional file 2. Similar corresponding relation between the
hierarchical modules and the hierarchical structure of GO annotations is obtained
both for Molecular Function and for Cellular Component annotations.
Next, we evaluate all the identified modules generated by HC-Wpin. Take the
modules with λ=1.0 for example, 130 out of 145 identified modules are validated
to be significant with Biological Process annotations. The lowest P-values of the
130 significant modules range from 4.93E-02 to 5.82E-73. For Molecular Function
annotations, 115 identified modules are validated to be significant, whose lowest
P-values range from 4.70E-02 to 2.50E-52. The module with the lowest P-value
of 2.50E-52 is composed of 33 members. Of all the 33 proteins, more than 75%
have the function of “RNA polymerase activity”. For Cellular Component annotations, the lowest P-value of all the identified modules is 1.51E-68. The module
with the lowest P-value is composed of 45 proteins, in which 37 proteins belong to
the known complex “small nuclear ribonucleoprotein complex”. There are a series
of small identified modules matching the known complexes perfectly in the network of S.cerevisiae. For example, the module #10 consisting of 3 proteins (EFB1,
TEF4, TEF1) is exactly the “eukaryotic translation elongation factor 1 complex”,
the module #136 consisting of 3 proteins (POL3, POL31, POL32) is exactly the
“delta DNA polymerase complex”, the module #33 consisting of 4 proteins (APL3,
APL1, APS2, APM4) is exactly the “AP-2 adaptor complex”, the module #90 consisting of 4 proteins (SEC66, SEC72, SEC63, SEC62) is exactly the “endoplasmic
reticulum Sec complex”, and the module #131 consisting of 5 proteins (VPS29,
PEP8, VPS35, VPS5, VPS17) is exactly the “retromer complex”.
There are some identified modules which have the similar P-values validated
by different types of GO annotations. For the example of module #134, all the 4
proteins (SEN34, SEN2, SEN15, SEN54) in it are exactly the entire members in
the network of S.cerevisiae which have the same molecular function of “endoribonuclease activity, producing 3’-phosphomonoesters”, participate in the same
biological process of “RNA splicing, via endonucleolytic cleavage and ligation”
and are in the same cellular component “tRNA-intron endonuclease complex”.
82
M. Li et al.
There are also a number of identified modules whose lowest P-values are very
different with different validation of GO terms. For example, the module #60
is composed of 4 proteins (APM3, APL6, APS3, APL5). For Biological Process
annotations, it has a lowest P-value of 4.55E-09. The 4 proteins combining with
other 18 proteins participate in the process of “Golgi to vacuole transport”. For
Molecular Function annotations, 3 proteins (APM3, APL6, APS3) out of the 4
proteins are directly annotated to the root term “molecular function unknown”.
It has no P-value result. However, it is exactly the known complex “AP-3 adaptor
complex” for Cellular Component annotation with the lowest P-value of 6.70E13. The detailed annotations of one type of GO terms may give clues to study
another type of GO terms.
The above analyses show that our algorithm HC-Wpin not only can identify
significant functional modules but also can detect significant functional modules
in hierarchy.
Comparison with Other Methods
To evaluate the effectiveness of our algorithm HC-Wpin, we compare it with several previous state-of-the-art algorithms: the MoNet algorithm [12] and FAG-EC
algorithm [23] as hierarchial clustering approaches, the MCODE algorithm and
DPClus algorithm as density-based methods, and the STM algorithm [27]. The
values of the parameters in each algorithm are selected from those recommended
by the author. The accuracy of each algorithm is calculated and shown in Table
2. The accuracy of an algorithm indicates the average f -measure of the significant modules generated by it. f -measure of an identified module is defined as a
harmonic mean of its recall and precision.
f -measure =
2 ∗ recall ∗ precision
recall + precision
recall =
|M ∩ Fi |
|Fi |
precision =
|M ∩ Fi |
|M |
(6)
(7)
(8)
where Fi is a functional category mapped to module M . The proteins in functional category Fi are considered as true predictions, the proteins in module M
are considered as positive predictions, and the common proteins of Fi and M are
considered as true positive predictions. Recall is the fraction of the true-positive
predictions out of all the true predictions, and precision is the fraction of the
true-positive predictions out of all the positive predictions[8].
As shown in Table 2, the accuracy of our algorithm HC-Wpin is much higher
than that of the other four algorithms: FAG-EC, MCODE, DPClus and STM
with all the validations of Biological Process (abbreviated as B.P.), Molecular
Function (abbreviated as M.F.), and Cellular Component (abbreviated as C.C.).
MoNet produces a giant module consisting of 3336 proteins and three small
Hierarchical Organization of Functional Modules
83
Table 2. Comparison of the accuracy of algorithm HC-Wpin and other previous
algorithms
Algorithms Modules Average of Maximum
Accuracy
size ≥ 3
Size
Size
B.P. M.F. C.C.
HC-Wpin
145
9.24
79 0.34 0.29 0.50
MoNet
4
837.50
3336
FAG-EC
326
8.52
237
0.27
0.20
0.40
MCODE
59
74.78
555
0.29
0.24
0.39
DPClus
236
4.02
13
0.26
0.19
0.35
STM
10
467.80
4647
0.21
0.10
0.01
(a) Biological Process
Parameters
λ=1.0
S=1
λ=1.0
fluff =0.1;VWP =0.2
CPin =0.5;Din =0.9
Merge=1.0
(b) Molecular Function
0
(c) Cellular Component
0
0
-10
-10
-10
-20
-20
log(P-value)
log(P-value)
-40
HC-Wpin
MoNet
FAG-EC
MCODE
DPClus
STM
-50
-60
-70
0
5
10
15
20
Significant modules
25
-30
HC-Wpin
MoNet
FAG-EC
MCODE
DPClus
STM
-40
-50
-80
30
log(P-value)
-20
-30
-30
-40
HC-Wpin
MoNet
FAG-EC
MCODE
DPClus
STM
-50
-60
-70
-60
-80
0
5
10
15
20
Significant modules
25
30
0
5
10
15
20
25
30
Significant modules
Fig. 3. Comparison of P-value distribution of significant modules generated by
HC-Wpin and other algorithms. The x axis represents the number of significant modules and the y axis represents the log(P-value) for each corresponding module.
modules. Only one module is validated to be significant for all the three type
of GO annotations. Thus, we does not list the accuracy of MoNet. The false
clustering of MoNet is mostly caused by the miscalculation of betweenness. The
false positive interactions can yield the incorrect shortest paths in a network and
the incorrect shortest paths cause miscalculation of betweenness.
Figure 3 (a), (b), and (c) illustrate the P-value distributions of the significant
modules generated by all these algorithms. Since the network has a lower probability to produce the module by chance, the module is more significant with
lower P-value. As observed from Figure 3, the 30 best significant modules identified by our algorithm HC-Wpin are consistently more significant than those
generated by other algorithms.
The comparison results in Table 2 and Figure 3 show that our algorithm
HC-Wpin outperforms the other previous algorithms.
Efficiency Analysis
All experiments in this paper are implemented on a Linux server with 4 Intel Xeon 3.6GHz CPU and 4GByte RAM. Table 3 illustrates a comparison of
the running time of our algorithm HC-Wpin and the other five algorithms for
identifying functional modules. The network of S.cerevisiae (4726 proteins and
15166 interactions) and the other four networks with different confidence level
obtained from [28] are used as the test data. We call the network of S.cerevisiae
84
M. Li et al.
Table 3. Comparison of the running time of algorithm HC-Wpin and other algorithms
Algorithms
HC-Wpin
FAG-EC
MCODE
DPClus
MoNet
STM
The running time (second)
Y2k
Y11k
Y45k
Y78k
NSc
(988, 2455) (2401, 11000) (4687, 45000) (5321, 78390) (4726, 15166)
0.1
0.6
5.6
9.8
0.7
0.1
0.6
5.6
9.8
0.7
0.2
7.2
1037.2
4129.8
480.2
0.3
32.6
1116.4
5638.6
1194.7
2.0
88.2
6593.8
9516.2
2852.4
1.0
7944.3
62360.2
140174.8
27073.2
NSc and name the other four networks Y2k, Y11k, Y45k and Y78k, respectively,
according to the number of edges included in them.
From Table 3, we can see the running time of MoNet, STM, MCODE and
DPClus increase sharply with the size of network. It takes more than 4,000
seconds for MCODE and DPClus, about 10,000 seconds for MoNet, and more
than 100,000 seconds for STM to identify modules from the network consisting of
5,321 proteins and 78,390 interactions. However, the running time of HC-Wpin
and FAG-EC detecting functional modules from the same network is still small
and less than 10 seconds. As one can see, algorithm HC-Wpin is extremely fast,
which is hundreds of times faster than MCODE, DPClus, MoNet and STM. As
the protein-protein interactions accumulating, algorithm HC-Wpin can be used
in even larger protein interaction networks.
4
Conclusions
In previous studies, the protein interaction networks are generally represented as
unweighted graphs. As is well known, the protein interaction networks can not
avoid of false positives. Thus, directly clustering from the unweighted graphs with
high false positives, most of the previous graph clustering approaches may generate a number of false predictions. In this paper, we extend the protein interaction
network from unweighted graph to weighted graph and develop a fast hierarchical
clustering algorithm HC-Wpin to identify functional modules from the weighted
graph. The reliability of interactions is measured by the logisticregression-based
scheme [25, 26]. By changing the values of parameter λ, we can identify the functional modules in a hierarchy. We use all the three types of GO Terms to validate
the identified modules of S.cerevisiae. Many significant functional modules are
detected, most of which are exactly corresponding to the known complexes. For
most cases, the value of λ is recommended to be between 1.0 and 3.0. When you
want to get small modules, you should select a small value for λ. On the contrary, you should select a relative large value for λ to obtain modules consisting
of more proteins.
We also compare the performances of our algorithm HC-Wpin and the other
five algorithms: MoNet, FAG-EC, MCODE, DPClus, and STM. Unexpected
Hierarchical Organization of Functional Modules
85
giant modules are generated by MoNet and STM which are caused by their weakness to the false positives. Although MCODE and DPClus are somewhat robust
to the false positives, they are not adept at identifying hierarchically distributed
functional modules. The quantitative comparison of accuracy reveal that algorithm HC-Wpin outperforms the other five algorithms. Another strength of our
algorithm HC-Wpin is efficiency. It is very fast and can be applied to even larger
protein interaction networks of other higher-level organisms.
Acknowledgments. The authors would like to thank F. Luo and his colleagues
for sharing their program of MoNet, to W. Hwang and his colleagues for sharing
the source code of STM. The authors are also thankful to M. Altaf-UI-Amin and
his colleagues for sharing the tool of DPClus, to G.D. Bader and C.W. Hogue
for their publicity of MCODE. The authors also thank T. Shlomi for providing
and discussing about the data.
Additional Files
Additional file 1 — Example for hierarchical organization of functional
modules. This file contains the hierarchical modules shown in Figure 2 which
is available at http://bioinfo.csu.edu.cn/limin/HC-Wpin/Additional file 1.txt.
Additional file 2 —Supplemental Figure 1. This file contains a supplemental Figure 1 which shows the reduction of hierarchical structure of biological
process annotations for the hierarchical modules shown in Figure 2. This file is
available at http://bioinfo.csu.edu.cn/limin/HC-Wpin/Additional file 2.pdf.
References
1. Uetz, P., et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627 (2000)
2. Gavin, A.C., et al.: Functional organization of the yeast proteome by systematic
analysis of protein complexes. Nature 415(6868), 141–147 (2002)
3. Gavin, A.C., et al.: Proteome survey reveals modularity of the yeast cell machinery.
Nature 440(7084), 631–636 (2006)
4. Ho, Y., et al.: Systematic identification of protein complexes in saccharomyces
cerevisiae by mass spectrometry. Nature 415(6868), 180–183 (2002)
5. Krogan, N.J., et al.: High-definition macromolecular. composition of yeast RNAprocessing complexes. Molecular Cell 13, 225–239 (2004)
6. Krogan, N.J., et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084), 637–643 (2006)
7. Harwell, L.H., Hopfield, J.J., Leibler, S., Murray, A.W.: From molecular to modular
cell biology. Nature 402, c47–c52 (1999)
8. Cho, Y.R., Hwang, W., Ramanmathan, M., Zhang, A.D.: Semantic integration
to identify overlapping functional modules in protein interaction networks. BMC
Bioinformatics 8, 265 (2007)
9. Girvan, M., Newman, M.E.: Community structure in social and biological networks.
Proc. Natl. Acad. Sci. 99, 7821–7826 (2002)
86
M. Li et al.
10. Rives, A.W., Galitski, T.: Modular organization of cellular networks. Proc. Natl.
Acad. Sci 100, 1128–1133 (2003)
11. Ravasz, E., et al.: Hierarchical organization of modularity in metaboli networks.
Science 297, 1551–1555 (2002)
12. Luo, F., Yang, Y.F., Chen, C.F., Chang, R., Zhou, J.Z.: Modular organization of
protein interaction networks. Bioinformatics 23(2), 207–214 (2007)
13. Brohee, S., van Helden, J.: Evaluation of clustering algorithms for protein-protein
interaction networks. BMC Bioinformatics, 7–488 (2006)
14. Spirin, V., Mirny, L.A.: Protein complexes and functional modules in molecular
networks. Proc. Natl. Acad. Sci., 12123–12128 (2003)
15. Bu, D., et al.: Topological structure analysis of the protein-protein interaction networks in budding yeast. Nucleic Acid Research 31(9), 2443–2450 (2003)
16. Brun, C., et al.: Clustering proteins from interaction networks for the prediction
of cellular functions. BMC Bioinformatics 7, 488 (2004)
17. Bader, G.D., Hogue, C.W.: An Automated Method for Finding Molecular Complexes in Large Protein Interaction Networks. BMC Bioinformatics 4, 2 (2003)
18. Altaf-Ul-Amin, M., Shinbo, Y., Mihara, K., Kurokawa, K., Kanaya, S.: Development and implementation of an algorithm for detection of protein complexes in
large interaction networks. BMC Bioinformatics, 7–207 (2006)
19. Palla, G., Derenyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community
structure of complex networks in nature and society. Nature 435, 814–818 (2005)
20. Barabási, A.L., Oltvai, Z.N.: Network biology: understanding the cell’s functional
organization. Nature Reviews: Genetics 5, 101–114 (2004)
21. Friedel, C., Zimmer, R.: Inferring topology from clustering coefficients in proteinprotein interaction networks. BMC Bioinformatics, 7–519 (2006)
22. Radicchi, F., et al.: Defining and identifying communities in networks. Proc. Natl.
Acad. Sci. 101, 2658–2663 (2004)
23. Li, M., Wang, J.X., Chen, J.E.: A fast agglomerate algorithm for mining functional
modules in protein interaction networks. In: Peng, Y., Zhang, Y. (eds.) Proceedings
of the First International Conference on BioMedical Engineering and Informatics:
Hainan, China, May 27-30, pp. 3–7 (2008)
24. Xenarios, I., et al.: DIP: the Database of Interaction Proteins: a research tool for
studying cellular networks of protien interactions. Nucleic Acids Res. 30, 303–305
(2002)
25. Sharan, R., et al.: Conserved patterns of protein interaction in multiple species.
Proc. Natl. Acad. Sci. 102(6), 1974–1979 (2005)
26. Shlomi, T., Segal, D., Ruppin, E., Sharan, R.: Qpath: a method for querying pathways in a protein-protein interaction network. BMC Bioinformatics, 7–199 (2006)
27. Hwang, W., Cho, Y.R., Zhang, A., Ramanathan, M.: A novel functional module detection algorithm for protein-protein interaction networks. Algorithms for
Molecular Biology 12, 1–24 (2006)
28. von Mering, C., et al.: Comparative assessment of large-scale data sets of proteinprotein interactions. Nature 417(6887), 399–403 (2002)