Finding community structure in very large networks

Transcript
By Aaron Clauset, M. E. J. Newman, and Cristopher Moore
Talk outline
 Introduction and reminder
 The algorithm
 Example: Amazon.com
 Summary
Girvan & Newman: betweenness clustering
 Edge betweenness: the number of shortest paths between vertex pairs that go along an edge
Divisive algorithm:
1. Compute all-pairs shortest paths
2. For each edge, compute the number of such paths it belongs to
3. Remove the edge with the highest betweenness
4. Repeat from step 1 until no edges are left
Girvan & Newman: disadvantages
 Betweenness must be recalculated at each step
 Removal of an edge can impact the betweenness of other edges
 Very expensive: all-pairs shortest paths costs O(n^3); the full algorithm runs in O(m^2 n)
 Does not scale beyond a few hundred nodes
Dendrogram (hierarchical tree)
 A dendrogram (hierarchical tree) illustrates the output of hierarchical clustering algorithms
 Leaves represent graph nodes; the top represents the original graph
 As we move down the tree, larger communities are partitioned into smaller ones
(Figure: a dendrogram over nodes 0–9)
Quality functions
 Hierarchical clustering algorithms create numerous partitions
 In general, we do not know how many communities we should seek. How can we know that our clustering is “good”?
 We need a quality function
The modularity quality function
 Modularity Q is designed to measure the strength of a division of a network into clusters/communities
 It measures whether the division is a good one, in the sense that there are many edges within communities and only a few between them
 If a high value of Q represents a good community division, why not simply optimize Q over all possible divisions to find the best one?
Is there a community structure in very large networks? How can we find it?
Newman's fast algorithm (2003)
 A naive implementation runs in time O((m + n)n), or O(n^2) on a sparse graph
 Greedy optimization of modularity: starting with each vertex as the sole member of one of n communities, we repeatedly join communities together in pairs, choosing at each step the join that results in the greatest increase (or smallest decrease) in Q (an agglomerative algorithm)
 The progress of the algorithm can be represented as a “dendrogram,” a tree that shows the order of the joins
 Thank you, Shlomi, for introducing it to us!
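To make the greedy scheme concrete, here is a deliberately naive Python sketch (the helper names are hypothetical, not the authors' implementation). It re-evaluates Q = Σi (eii − ai²) for every candidate merge at every step, so it has the slow, unoptimized running time discussed above:

```python
# Naive greedy modularity agglomeration (hypothetical sketch, not the
# authors' code). Each vertex starts as its own community; at every step
# we try all pairwise merges and keep the one with the highest Q.

def modularity(edges, comm):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph."""
    m = len(edges)
    e, a = {}, {}
    for u, v in edges:
        cu, cv = comm[u], comm[v]
        if cu == cv:                      # edge inside one community
            e[cu] = e.get(cu, 0.0) + 1.0 / m
        a[cu] = a.get(cu, 0.0) + 0.5 / m  # each edge end contributes 1/(2m)
        a[cv] = a.get(cv, 0.0) + 0.5 / m
    return sum(e.get(c, 0.0) - a[c] ** 2 for c in a)

def greedy_join(edges, nodes):
    comm = {v: v for v in nodes}          # singleton communities
    best_q, best_comm = modularity(edges, comm), dict(comm)
    while len(set(comm.values())) > 1:
        trials = []
        labels = sorted(set(comm.values()))
        for x, ci in enumerate(labels):
            for cj in labels[x + 1:]:
                t = {v: (cj if c == ci else c) for v, c in comm.items()}
                trials.append((modularity(edges, t), t))
        q, comm = max(trials, key=lambda p: p[0])
        if q > best_q:                    # remember the best cut seen
            best_q, best_comm = q, dict(comm)
    return best_q, best_comm
```

On two triangles joined by a single bridge edge, this recovers the two triangles as the maximum-modularity partition.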
The algorithm
 Introduction and reminder
 The algorithm
 Example: Amazon.com
 Summary
The algorithm (2004)
 Here we propose a new algorithm that performs the same greedy optimization, using more sophisticated data structures to reduce time complexity and memory
 It runs far more quickly, in time O(md log n), where d is the depth of the “dendrogram” describing the network’s community structure
Definitions
 Avw - the adjacency matrix: Avw = 1 if vertices v and w are connected, and 0 otherwise
 cv - the community to which vertex v belongs
(Figure: a small example graph on vertices 1–4 and its adjacency matrix)
Definitions (cont.)
 δ-function: δ(cv, cw) is 1 if cv = cw and 0 otherwise
 degree: the degree kv of a vertex v is defined to be the number of edges incident upon it: kv = Σw Avw
Definitions (cont.)
 eij - the fraction of edges that join vertices in community i to vertices in community j
 ai - the fraction of ends of edges that are attached to vertices in community i: ai = Σj eij
The modularity calculation (cont.)
Q = Σi (eii − ai²)
Purpose of the algorithm
 The operation of the algorithm involves finding the changes in Q that would result from joining each pair of communities, choosing the largest of them, and performing the corresponding join
Updates on the previous algorithm
 Naively, the operation is performed explicitly on the entire matrix; but if the adjacency matrix is sparse, the operation can be carried out more efficiently using data structures for sparse matrices
(Figure: a large adjacency matrix whose entries are almost all zero)
Data structures
1. A sparse matrix containing ΔQij for each pair i, j of communities with at least one edge between them. We store each row of the matrix both as a balanced binary tree and as a max-heap.
Data structures (cont.)
2. A max-heap H containing the largest element of each row of the matrix ΔQij, along with the labels i, j of the corresponding pair of communities.
3. An ordinary vector array with elements ai.
4. Q = the maximal modularity found so far.
(Figure: per-row max-heaps whose maxima feed the global heap H)
Initialization
 We start off with each vertex being the sole member of a community of one, in which case eij = 1/(2m) if i and j are connected and zero otherwise, and ai = ki/(2m)
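Written out (following the paper's initialization, its Eqs. (8)–(9)), the stored quantities start as:

```latex
\Delta Q_{ij} =
  \begin{cases}
    \dfrac{1}{2m} - \dfrac{k_i k_j}{(2m)^2} & \text{if } i, j \text{ are connected},\\[4pt]
    0 & \text{otherwise},
  \end{cases}
\qquad
a_i = \frac{k_i}{2m}.
```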
The algorithm
1. Calculate the initial values of ΔQij and ai according to the initialization, and populate the max-heap H with the largest element of each row of the matrix ΔQ.
2. Select the largest ΔQij from H, join the corresponding communities, update the matrix ΔQ, the heap H, and ai (as described later), and increment Q by ΔQij.
3. Repeat step 2 until only one community remains.
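Step 2's "select the largest ΔQij" can be sketched with Python's `heapq` and lazy deletion (a hypothetical simplification with one global heap; the paper keeps a max-heap per row plus the heap H of row maxima). Since `heapq` is a min-heap, values are stored negated:

```python
import heapq

def push(H, dq, i, j):
    # Store the negated value so the min-heap pops the largest dQ first.
    heapq.heappush(H, (-dq, i, j))

def pop_largest(H, live):
    # Lazily discard stale entries that mention already-merged communities.
    while H:
        neg, i, j = heapq.heappop(H)
        if i in live and j in live:
            return -neg, i, j
    return None
```

After communities i and j are joined, any heap entry still naming the removed label is simply skipped the next time it surfaces.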
Update of the matrix ΔQ
 If we join communities i and j, labeling the combined community j, say, we need only update the jth row and column, and remove the ith row and column altogether
 If community k is connected to both i and j, then ΔQ'jk = ΔQik + ΔQjk  (10a)
 If k is connected to i but not to j, then ΔQ'jk = ΔQik − 2 aj ak  (10b)
 If k is connected to j but not to i, then ΔQ'jk = ΔQjk − 2 ai ak  (10c)
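As a concrete sketch, the three update rules can be applied with a dict-of-dicts standing in for the paper's balanced trees (hypothetical names: `dQ[i][k]` holds ΔQik and `a[i]` holds ai):

```python
# Merge the dQ rows of communities i and j, labeling the result j.
# Assumes dQ is kept symmetric: dQ[x][y] == dQ[y][x].
def join_rows(dQ, a, i, j):
    row_i, row_j = dQ.pop(i), dQ[j]
    merged = {}
    for k in set(row_i) | set(row_j):
        if k in (i, j):
            continue
        if k in row_i and k in row_j:    # k connected to both: Eq. (10a)
            merged[k] = row_i[k] + row_j[k]
        elif k in row_i:                 # k connected to i only: Eq. (10b)
            merged[k] = row_i[k] - 2 * a[j] * a[k]
        else:                            # k connected to j only: Eq. (10c)
            merged[k] = row_j[k] - 2 * a[i] * a[k]
    dQ[j] = merged
    for k in merged:                     # mirror updates to keep symmetry
        dQ[k].pop(i, None)
        dQ[k][j] = merged[k]
    a[j] += a[i]                         # the joined community absorbs ai
    a[i] = 0.0
```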
Reminder of how modularity can help us visualize large networks
Reminder - run time
 Insertion into a balanced binary tree - O(log n)
 Updating the max-heap for the kth row by inserting, raising, or lowering ΔQkj takes O(log |k|) ≤ O(log n) time
Binary heap operations:
  find-max: Θ(1)
  delete-max: Θ(log n)
  insert: Θ(log n)
  merge: Θ(n)
Run time
 |i| = the degree of community i, i.e., the number of neighboring communities
 Joining i and j costs O((|i| + |j|) log n):
(10a) - inserting each of the |i| elements into the jth row costs log |j|
(10b + 10c) - inserting each of the |i| + |j| elements costs log(|i| + |j|)
 Updating a single element of the kth row costs log |k|, at most O(log n), and there are at most |i| + |j| values of k for which we have to do this
 Total: O((|i| + |j|) log n)
Run time (cont.)
 The total running time is at most O(log n) times the sum, over all nodes of the dendrogram, of the degrees of the corresponding communities
 Worst case: the degree of a community is the sum of the degrees of all the vertices in the original network comprising it
 In that case, each vertex of the original network contributes its degree to all of the communities it is a part of, along the path in the dendrogram from it to the root
Run time (cont.)
 If the dendrogram has depth d, there are at most d nodes in this path, and since the total degree of all the vertices is 2m, we have a running time of O(md log n), as stated
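Compactly, the argument above chains as:

```latex
T \;\le\; O(\log n)\sum_{\text{dendrogram nodes } c} \deg(c)
  \;\le\; O(\log n)\sum_{v} d\,k_v
  \;=\; O(\log n \cdot d \cdot 2m)
  \;=\; O(md\log n).
```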
Practical situations
 It is usually unnecessary to maintain the separate max-heaps for each row
 Their maintenance takes a moderate amount of effort, and this effort is wasted if the largest element in a row does not change when two rows are joined
 If the largest element of the kth row was ΔQki or ΔQkj and is now reduced by Eq. (10b) or (10c), we simply scan the kth row to find the new largest element
 The average-case running time of this variant is often better than that of the more sophisticated algorithm
Example: Amazon.com
 Introduction and reminder
 The algorithm
 Example: Amazon.com
 Summary
The connections - Amazon
 The network we study consists of items listed on the Amazon web site. The network has 409,687 items and 2,464,630 edges
 Items can be books, music, video games, etc.
 There is an edge from A to B iff B was frequently purchased by buyers of A
Bridge - an edge that, when removed, splits off a community. Bridges can act as bottlenecks for information flow
Looking at the largest communities in the network, we
find that they tend to consist of items (books, music) in
similar genres or on similar topics
Power law
 Partitioned at the point of maximum modularity, the distribution of community sizes s appears to have a power-law form
Summary
 Introduction and reminder
 The algorithm
 Example: Amazon.com
 Summary
Summary
 Run time O(md log n), where n = vertices, m = edges, d = depth of the dendrogram
 For a balanced dendrogram (d ∼ log n) and a sparse network (m ∼ n), the run time is O(n log² n)
 The algorithm should allow researchers to analyze even larger networks, with millions of vertices and tens of millions of edges, using current computing resources
Improvements
 Unfortunately, the algorithm does not scale well, and its use is practically limited to networks of up to 500,000 nodes
 We show that this inefficiency is caused by merging communities in an unbalanced manner, and that simple heuristics that attempt to merge community structures in a balanced manner can dramatically improve community structure analysis
Improvements (cont.)
 The proposed techniques are tested using data sets obtained from an existing social networking service that hosts 5.5 million users. We have tested two variations of the heuristics
 The fastest method processes an SNS friendship network with 1 million users in 5 minutes (70 times faster than the original algorithm)
 It processes another friendship network, with 4 million users, in 35 minutes
Credits
The End