Finding community structure in very large networks

By Aaron Clauset, M. E. J. Newman, and Cristopher Moore

Talk outline
- Introduction and reminder
- The algorithm
- Example: Amazon.com
- Summary

Girvan & Newman: betweenness clustering
- Edge betweenness: the number of shortest paths between vertex pairs that run along an edge
- Divisive algorithm:
  1. Compute all-pairs shortest paths
  2. For each edge, compute the number of such paths it belongs to
  3. Remove the edge with the highest betweenness (the max-weight edge)
  4. Repeat from step 1 until no edges are left

Girvan & Newman: disadvantages
- Betweenness needs to be recalculated at each step
- Removal of an edge can impact the betweenness of another edge
- Very expensive: all-pairs shortest paths cost O(n^3), and the whole algorithm takes O(m^2 n)
- Does not scale to more than a few hundred nodes

Dendrogram (hierarchical tree)
- A dendrogram (hierarchical tree) illustrates the output of hierarchical clustering algorithms
- Leaves represent graph nodes; the top represents the original graph
- As we move down the tree, larger communities are partitioned into smaller ones
[Figure: dendrogram over ten vertices labeled 0 to 9]

Quality functions
- Hierarchical clustering algorithms create numerous partitions
- In general, we do not know how many communities we should seek
- How can we know that our clustering is “good”? We need a quality function

The modularity quality function
- Modularity Q is designed to measure the strength of a division of a network into clusters/communities
- It indicates when a division is a good one, in the sense that there are many edges within communities and only a few between them
- If a high value of Q represents a good community division, why not simply optimize Q over all possible divisions to find the best one?

Is there community structure in very large networks? How can we find it?

Newman Fast Algorithm (2003)
- A naive implementation runs in time O((m + n)n), or O(n^2) on a sparse graph
- Greedy optimization of modularity: starting with each vertex as the sole member of one of n communities, we repeatedly join communities together in pairs, choosing at each step the join that results in the greatest increase (or smallest decrease) in Q (an agglomerative algorithm)
- The progress of the algorithm can be represented as a “dendrogram,” a tree that shows the order of the joins
Thank you, Shlomi, for introducing it to us!

The algorithm

The algorithm (2004)
- Here we propose a new algorithm that performs the same greedy optimization using more sophisticated data structures, which reduce running time and memory use
- It runs far more quickly, in time O(md log n), where d is the depth of the “dendrogram” describing the network’s community structure

Definitions
- Avw: the adjacency matrix, with $A_{vw} = 1$ if vertices v and w are connected and $A_{vw} = 0$ otherwise
- Vertex v belongs to community $c_v$
[Figure: two example four-vertex graphs with their adjacency matrices]

Definitions (cont.)
- The δ-function $\delta(c_i, c_j)$ is 1 if $c_i = c_j$ and 0 otherwise
- The degree of a vertex v is defined to be the number of edges incident upon it: $k_v = \sum_w A_{vw}$
- m is the number of edges in the network: $m = \frac{1}{2} \sum_{vw} A_{vw}$

Definitions (cont.)
- $e_{ij} = \frac{1}{2m} \sum_{vw} A_{vw}\, \delta(c_v, i)\, \delta(c_w, j)$: the fraction of edges that join vertices in community i to vertices in community j
- $a_i = \frac{1}{2m} \sum_v k_v\, \delta(c_v, i)$: the fraction of ends of edges that are attached to vertices in community i

The modularity calculation (cont.)
$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w) = \sum_i \left( e_{ii} - a_i^2 \right)$

Purpose of the algorithm
- The operation of the algorithm involves finding the changes in Q that would result from the join of each pair of communities, choosing the largest of them, and performing the corresponding join

Updates on the previous algorithm
- In the previous algorithm, the operation is carried out explicitly on the entire matrix
- If the adjacency matrix is sparse, the operation can be carried out more efficiently using data structures for sparse matrices
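The definitions above can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function names (`community_stats`, `modularity`, `delta_q`), the dict-of-dicts sparse layout, and the two-triangle example graph are not from the slides. The change in Q for a join is taken as 2(e_ij − a_i a_j), the expression used in Newman's earlier greedy algorithm.

```python
from collections import defaultdict

def community_stats(edges, community):
    """Return sparse e[i][j] and a[i] for an undirected edge list."""
    m = len(edges)
    e = defaultdict(lambda: defaultdict(float))
    a = defaultdict(float)
    for v, w in edges:
        cv, cw = community[v], community[w]
        # Each undirected edge appears twice in A (as A_vw and A_wv).
        e[cv][cw] += 1.0 / (2 * m)
        e[cw][cv] += 1.0 / (2 * m)
        # Each edge has two ends, one in cv and one in cw.
        a[cv] += 1.0 / (2 * m)
        a[cw] += 1.0 / (2 * m)
    return e, a

def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2)."""
    e, a = community_stats(edges, community)
    return sum(e[i][i] - a[i] ** 2 for i in a)

def delta_q(edges, community, i, j):
    """Change in Q if communities i and j were joined."""
    e, a = community_stats(edges, community)
    return 2 * (e[i].get(j, 0.0) - a[i] * a[j])

# Hypothetical example: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # one community per triangle
one = {v: 0 for v in range(6)}               # everything joined together

q_two = modularity(edges, two)
q_one = modularity(edges, one)
gain = delta_q(edges, two, 0, 1)
```

In this example the two-triangle division has higher modularity than the single merged community, and `gain` equals the difference `q_one - q_two`, which is what the algorithm relies on when it tracks Q incrementally. Only pairs of communities that share an edge ever get an entry in `e`, which is the sparsity the slide refers to.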
[Figure: example of a sparse adjacency matrix, in which most entries are zero]

Data structures
1. A sparse matrix containing ∆Qij for each pair i, j of communities with at least one edge between them. We store each row of the matrix both as a balanced binary tree and as a max-heap.

Data structures (cont.)
2. A max-heap H containing the largest element of each row of the matrix ∆Qij, along with the labels i, j of the corresponding pair of communities.
3. An ordinary vector array with elements ai.
4. Q, the maximal modularity.
[Figure: rows i, j, and k of the matrix, with the largest element of each row placed in the max-heap H]

Initialization
- We start off with each vertex being the sole member of a community of one, in which case eij = 1/2m if i and j are connected and zero otherwise, and, for each i, ai = ki/2m.

The algorithm
1. Calculate the initial values of ∆Qij and ai according to the initialization, and populate the max-heap with the largest element of each row of the matrix ∆Q.
2. Select the largest ∆Qij from H, join the corresponding communities, update the matrix ∆Q, the heap H, and ai (as described later), and increment Q by ∆Qij.
3. Repeat step 2 until only one community remains.

Updating the matrix ∆Q
- If we join communities i and j, labeling the combined community j, say, we need only update the jth row and column, and remove the ith row and column altogether.
- If community k is connected to both i and j, then $\Delta Q'_{jk} = \Delta Q_{ik} + \Delta Q_{jk}$ (10a)
- If k is connected to i but not to j, then $\Delta Q'_{jk} = \Delta Q_{ik} - 2 a_j a_k$ (10b)
- If k is connected to j but not to i, then $\Delta Q'_{jk} = \Delta Q_{jk} - 2 a_i a_k$ (10c)

Reminder of how modularity can help us visualize large networks

Reminder: run time
- Insertion into a balanced binary tree takes O(log n)
- Updating the max-heap for the kth row by inserting, raising, or lowering ∆Qkj takes O(log |k|) ≤ O(log n) time

Operation     Binary heap
find-max      Θ(1)
delete-max    Θ(log n)
insert        Θ(log n)
merge         Θ(n)

Run time
- |i| = the degree of community i, that is, the number of neighboring communities
- Joining i and j takes O((|i| + |j|) log n):
  - (10a): each of the |i| elements of the ith row is inserted into the jth row, at a cost of O(log |j|) per insertion
  - (10b) and (10c): at most |i| + |j| elements of the jth row are updated, at a cost of O(log(|i| + |j|)) each
  - kth row: a single element is updated, at a cost of O(log |k|) ≤ O(log n), and there are at most |i| + |j| values of k for which we have to do this
- Total: O((|i| + |j|) log n)

Run time (cont.)
- The total running time is at most O(log n) times the sum, over all nodes of the dendrogram, of the degrees of the corresponding communities.
- Worst case: the degree of a community is the sum of the degrees of all the vertices in the original network comprising it.
- In that case, each vertex of the original network contributes its degree to all of the communities it is a part of, along the path in the dendrogram from it to the root.

Run time (cont.)
- If the dendrogram has depth d, there are at most d nodes in this path, and since the total degree of all the vertices is 2m, we have a running time of O(md log n), as stated.

Practical situations
- It is usually unnecessary to maintain the separate max-heaps for each row
- Their maintenance takes a moderate amount of effort, and this effort is wasted if the largest element in a row does not change when two rows are joined
- If the largest element of the kth row was ∆Qki or ∆Qkj and is now reduced by Eq. (10b) or (10c), we simply scan the kth row to find the new largest element
- The average-case running time is often better than that of the more sophisticated algorithm

Example: Amazon.com

The connections: Amazon
- The network we study consists of items listed on the Amazon web site. The network has 409,687 items and 2,464,630 edges.
- Items can be books, music, video games, etc.
- There is an edge from A to B iff B was frequently purchased by buyers of A

Bridge: an edge that, when removed, splits off a community. Bridges can act as bottlenecks for information flow.

Looking at the largest communities in the network, we find that they tend to consist of items (books, music) in similar genres or on similar topics.

Power law
- Partitioned at the point of maximum modularity, the distribution of community sizes s appears to have a power-law form

Summary

Summary
- Run time: O(md log n), where n is the number of vertices, m the number of edges, and d the depth of the dendrogram
- For a balanced dendrogram (d ∼ log n) and a sparse network (m ∼ n), the run time is O(n log^2 n)
- The algorithm should allow researchers to analyze even larger networks, with millions of vertices and tens of millions of edges, using current computing resources

Improvements
- Unfortunately, the algorithm does not scale well, and its use is practically limited to networks of up to 500,000 nodes.
- We show that this inefficiency is caused by merging communities in an unbalanced manner, and that simple heuristics that attempt to merge community structures in a balanced manner can dramatically improve community structure analysis.

Improvements (cont.)
- The proposed techniques are tested using data sets obtained from an existing social networking service that hosts 5.5 million users. We have tested two variations of the heuristics.
- The fastest method processes an SNS friendship network with 1 million users in 5 minutes (70 times faster than the algorithm presented here)
- It processes another friendship network, with 4 million users, in 35 minutes

Credits

The End
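As a recap of the merge loop from the algorithm slides, here is a minimal sketch of the greedy agglomeration with the sparse update rules (10a) to (10c). It rests on several assumptions that are not from the slides: the function name `greedy_merge` is hypothetical, rows are plain dicts rather than balanced trees, the maximum is found by scanning every row instead of using the max-heap H (so it does not achieve O(md log n)), the initial ∆Qij is taken as 2(eij − ai aj), and the input is a simple undirected graph with no self-loops or duplicate edges.

```python
def greedy_merge(edges):
    """Greedy modularity agglomeration. Returns (best_q, merges)."""
    m = len(edges)
    nodes = sorted({v for edge in edges for v in edge})
    deg = {v: 0 for v in nodes}
    for v, w in edges:
        deg[v] += 1
        deg[w] += 1
    a = {v: deg[v] / (2 * m) for v in nodes}

    # Sparse rows: dq[i][j] exists only if communities i and j share an edge.
    dq = {v: {} for v in nodes}
    for v, w in edges:
        dq[v][w] = dq[w][v] = 2 * (1 / (2 * m) - a[v] * a[w])

    q = -sum(x * x for x in a.values())  # singletons: e_ii = 0
    best_q, merges = q, []
    while True:
        # Assumption: a plain scan replaces the max-heap H.
        pairs = [(val, i, j) for i, row in dq.items() for j, val in row.items()]
        if not pairs:
            break  # no connected pair of communities is left
        gain, i, j = max(pairs)

        # Join community i into community j.
        row_i, row_j = dq.pop(i), dq[j]
        del row_i[j]
        del row_j[i]
        new_row = {}
        for k in set(row_i) | set(row_j):
            if k in row_i and k in row_j:
                new_row[k] = row_i[k] + row_j[k]          # Eq. (10a)
            elif k in row_i:
                new_row[k] = row_i[k] - 2 * a[j] * a[k]   # Eq. (10b)
            else:
                new_row[k] = row_j[k] - 2 * a[i] * a[k]   # Eq. (10c)
        for k in row_i:
            del dq[k][i]          # remove the ith column
        for k, val in new_row.items():
            dq[k][j] = val        # update the jth column
        dq[j] = new_row           # update the jth row
        a[j] += a[i]
        del a[i]

        q += gain
        merges.append((i, j, q))
        best_q = max(best_q, q)
    return best_q, merges

# Hypothetical example: two triangles joined by a single bridge edge.
example_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
best_q, merges = greedy_merge(example_edges)
```

The list `merges` records the order of the joins together with the running value of Q, which is the information a dendrogram displays. The maximum of Q along that sequence marks the division the slides call the point of maximum modularity.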
