Download Network - Mining of Massive Datasets

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Can we identify
node groups?
(communities,
modules, clusters)
Nodes: Football Teams
Edges: Games played
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
2
NCAA conferences
Nodes: Football Teams
Edges: Games played
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
3
Can we identify
functional modules?
Nodes: Proteins
Edges: Physical interactions
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
4
Functional modules
Nodes: Proteins
Edges: Physical interactions
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
5
Can we identify
social communities?
Nodes: Facebook Users
Edges: Friendships
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
6
Social communities
Summer
internship
High school
Stanford (Basketball)
Stanford (Squash)
Nodes: Facebook Users
Edges: Friendships
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
7
Non-overlapping vs. overlapping communities
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
8
Nodes
Nodes
Network
Adjacency matrix
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
9
What is the structure of community overlaps:
Edge density in the overlaps is higher!
Communities as “tiles”
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
10
Communities
in a network
This is what we want!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
11
1) Given a model, we generate the network:
B
Generative
model for
networks
F
A
D
E
G
C
H
2) Given a network, find the “best” model
B
F
A
D
E
G
C
H
Generative
model for
networks
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
12
Goal: Define a model that can generate
networks
The model will have a set of “parameters” that we
will later want to estimate (and detect communities)
Generative
model for
networks
B
F
A
D
E
G
C
H
Q: Given a set of nodes, how do communities
“generate” edges of the network?
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
13
Communities, C
pA
pB
Model
Memberships, M
Nodes, V
Model
Network
Generative model B(V, C, M, {pc}) for graphs:
Nodes V, Communities C, Memberships M
Each community c has a single probability pc
Later we fit the model to networks to detect
communities
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
14
Communities, C
pA
pB
Model
Memberships, M
Nodes, V
Community Affiliations
Network
AGM generates the links: For each
For each pair of nodes in community ,
we connect them with prob. The overall edge probability is:
P(u, v) = 1 −
∏ (1 − p )
c
c∈M u ∩ M v
If , share no communities: , … set of communities
node belongs to
Think of this as an “OR” function: If at least 1 community says “YES” we create an edge
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
15
Model
Network
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
16
AGM can express a
variety of community
structures:
Non-overlapping,
Overlapping, Nested
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
17
Detecting communities with AGM:
B
F
A
D
E
G
C
H
Given a Graph
, find the Model
1) Affiliation graph M
2) Number of communities C
3) Parameters pc
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
19
Maximum Likelihood Principle (MLE):
Given: Data Assumption: Data is generated by some model … model
… model parameters
Want to estimate :
The probability that our model (with parameters )
generated the data
Now let’s find the most likely model that could have
generated the data: argmax J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
20
Imagine we are given a set of coin flips
Task: Figure out the bias of a coin!
Data: Sequence of coin flips: , , , , , , , Model: return 1 with prob. Θ, else return 0
What is ? Assuming coin flips are independent
So, ∗ ∗ … ∗ What is ? Simple, Then, For example:
.
$
. "#
. %"
What did we learn? Our data was most
likely generated by coin with bias /$
∗ /$
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
21
How do we do MLE for graphs?
Model generates a probabilistic adjacency matrix
We then flip all the entries of the probabilistic
matrix to obtain the binary adjacency matrix Flip
For every pair
of nodes , AGM gives the
prob. of
them being
linked
0
0.10
0.10
0.04
0.10
0
0.02
0.10
0.02
0.04
0.06
biased
coins
0
1
0
0
0.06
1
0
1
1
0
0.06
0
1
0
1
0.06
0
0
1
1
0
The likelihood of AGM generating graph G:
P(G | Θ) = Π P(u , v) Π (1 − P(u, v))
( u ,v )∈E
( u ,v )∉E
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
22
A
Given graph G(V,E) and Θ, we calculate
likelihood that Θ generated G: P(G|Θ)
B
Θ=B(V, C, M, {pc})
G
0
0.9
0.9
0
0
1
1
0
0.9
0
0.9
0
1
0
1
0
0.9
0.9
0
0.9
1
1
0
1
0
0
0.9
0
0
0
1
0
G
P(G|Θ)
P(G | Θ) = Π P(u , v) Π (1 − P(u , v))
( u ,v )∈E
( u ,v )∉E
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
23
Our goal: Find '(, ), , ) such
that:
arg max
Θ
P(
|
AGM
Θ)
How do we find '(, ), , ) that
maximizes the likelihood?
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
24
Our goal is to find ' (, ), , )
argmax
+ , + , *(,),, ) ,∈-
such that:
∉-
Problem: Finding B means finding the
bipartite affiliation network.
There is no nice way to do this.
Fitting '(, ), , ) is too hard,
let’s change the model (so it is easier to fit)!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
25
Relaxation: Memberships have strengths
/
u
v
/ : The membership strength of node to community (/ : no membership)
Each community links nodes independently:
, 123/ ⋅ / J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
26
Community membership strength matrix /
Nodes
Communities
Probability of connection is
proportional to the product of
strengths
/
j
/ …
, 123/ ⋅ / strength of ’s
membership to Notice: If one node doesn’t belong to the
community (567 0) then , Prob. that at least one common
community ) links the nodes:
, ∏) ) , / … vector of
community
membership
strengths of J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
27
/ :
/ :
/? :
Community links nodes , independently:
, 123/ ⋅ / Then prob. at least one common ) links them:
, ∏) ) , 123 ∑) /) ⋅ /) 123/ ⋅ /; Example / matrix:
0
1.2
0
0.2
0.5
0
0
0.8
0 1.8 1
0
Node community
membership strengths
Then: / ⋅ /; . #
And: , <= . # . >
But: , ? . $$
, ? J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
28
Task: Given a network @(, -, estimate /
Find / that maximizes the likelihood:
ABCDA=/ + , + , ,∈-
, ∉-
where: , 123/ ⋅ /; Many times we take the logarithm of the likelihood,
and call it log-likelihood: E / FGH@|/
Goal: Find / that maximizes E/:
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
29
Compute gradient of a single row / of /:
Coordinate gradient ascent:
Iterate over the rows of /:
P.. Set out
outgoing neighbors
Compute gradient JE / of row (while keeping others fixed)
Update the row / : / ← / L MNE/ Project / back to a non-negative vector: If /) O : /) This is slow! Computing JE / takes linear time!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
30
However, we notice:
We cache ∑ /
So, computing ∑∉P / now takes linear time
in the degree |P | of In networks degree of a node is much smaller to the total
number of nodes in the network, so this is a significant
speedup!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
31
10000
Time (Sec.)
8000
6000
4000
Link Clustering
Clique Percolation
MMSB
BigCLAM
Parallel BigCLAM
2000
0
0
100
200
300
Number of nodes (× 103)
BigCLAM takes 5 minutes for 300k node nets
Other methods take 10 days
Can process networks with 100M edges!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
32
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
34
Extension:
Make community membership edges directed!
Outgoing membership: Nodes “sends” edges
Incoming membership: Node “receives” edges
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
35
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
36
Everything is almost the same except now
we have 2 matrices: / and Q
/… out-going community memberships
Q… in-coming community memberships
Edge prob.: , <=/ Q; Network log-likelihood:
/
Q
which we optimize the same way as before
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
37
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
38
Overlapping Community Detection at Scale: A Nonnegative Matrix
Factorization Approach by J. Yang, J. Leskovec. ACM International
Conference on Web Search and Data Mining (WSDM), 2013.
Detecting Cohesive and 2-mode Communities in Directed and
Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM
International Conference on Web Search and Data Mining (WSDM),
2014.
Community Detection in Networks with Node Attributes by J. Yang,
J. McAuley, J. Leskovec. IEEE International Conference On Data
Mining (ICDM), 2013.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
39