Inferring Networks of Diffusion and Influence
Manuel Gomez Rodriguez (1,2)
Jure Leskovec (1)
Andreas Krause (3)
(1) Stanford University
(2) MPI for Biological Cybernetics
(3) California Institute of Technology
1
Hidden and implicit networks
• Many social or information networks are implicit or hard to observe:
  - Hidden/hard-to-reach populations: network of needle sharing between injection drug users
  - Implicit connections: network of information propagation in online news media
• But we can observe the results of the processes taking place on such (invisible) networks:
  - Virus propagation: drug users get sick, and we observe when they see the doctor
  - Information networks: we observe when media sites mention information
2
Information Diffusion Network
• Information diffuses through the network over time
• We only see who mentions the information, but not where they got it from
• Question: Can we infer the hidden networks?
3
Examples and Applications
Virus propagation:
  - Process: viruses propagate through the network
  - We observe: only when people get sick
  - It's hidden: who infected whom
Word of mouth & viral marketing:
  - Process: recommendations and influence propagate
  - We observe: only when people buy products
  - It's hidden: who influenced whom
Can we infer the underlying network?
4
Inferring the Network
• There is a directed social network over which diffusions take place
  [Figure: directed network over nodes a, b, c, d, e]
• But we do not observe the edges of the network
• We only see the time when a node gets infected:
  Cascade c1: (a, 1), (c, 2), (b, 6), (e, 9)
  Cascade c2: (c, 1), (a, 4), (b, 5), (d, 8)
• Task: infer the underlying network
5
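The cascades on this slide are just lists of (node, time) pairs. As a minimal sketch (the names `cascades`, `candidate_edges` and `edge_support` are illustrative and not from the talk), the only edges a cascade can possibly support are those that point forward in time:

```python
# The two cascades from the slide, as (node, infection time) pairs.
cascades = {
    "c1": [("a", 1), ("c", 2), ("b", 6), ("e", 9)],
    "c2": [("c", 1), ("a", 4), ("b", 5), ("d", 8)],
}


def candidate_edges(cascade):
    """All pairs (u, v) where u was infected strictly before v."""
    return {(u, v) for u, tu in cascade for v, tv in cascade if tu < tv}


def edge_support(all_cascades):
    """Count, for each candidate edge, how many cascades are consistent with it."""
    counts = {}
    for cascade in all_cascades.values():
        for edge in candidate_edges(cascade):
            counts[edge] = counts.get(edge, 0) + 1
    return counts


support = edge_support(cascades)
# Edges that are time-consistent with both cascades.
print(sorted(edge for edge, n in support.items() if n == 2))
```

Counting consistent cascades is only a way to look at the data; it is not the inference method of the talk, which scores edges by likelihood.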
Our Problem Formulation
• Plan for the talk:
  1. Define a continuous-time model of diffusion
  2. Define the likelihood of the observed cascades given a network
  3. Show how to efficiently compute the likelihood of cascades
  4. Show how to efficiently find a graph G that maximizes the likelihood
• Note:
  - There is a super-exponential number of graphs, O(N^(N*N))
  - Our method finds a near-optimal graph in O(N^2)!
6
Cascade Generation Model
• Continuous-time cascade diffusion model:
  - Cascade c reaches node u at time tu and spreads to u's neighbors
  - With probability β the cascade propagates along edge (u, v), and we determine the infection time of node v:
    tv = tu + Δ, e.g. Δ ~ Exponential or Power-law
  [Figure: example cascade over nodes a to f, with infection times ta, tb, tc, te, tf and delays Δ1 to Δ4]
• We assume each node v has only one parent!
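The generative model can be sketched as a small event-driven simulation. This is an illustration under stated assumptions, not the authors' code: the function name `simulate_cascade`, the adjacency-dict graph format, the example graph and the choice of exponential delays with parameter `rate` are all hypothetical. When several infected neighbors reach a node, the earliest arrival is kept as its single parent.

```python
import heapq
import itertools
import random


def simulate_cascade(graph, source, beta, rate, rng):
    """Simulate one cascade; returns (infection times, parent of each node).

    graph maps a node to its out-neighbours. Delays are Exponential(rate).
    """
    times, parent = {}, {}
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    events = [(0.0, next(tie), source, None)]
    while events:
        t, _, v, u = heapq.heappop(events)
        if v in times:
            continue  # v was already infected earlier: it keeps its single parent
        times[v], parent[v] = t, u
        for w in graph.get(v, ()):
            # With probability beta the cascade propagates along edge (v, w).
            if w not in times and rng.random() < beta:
                delay = rng.expovariate(rate)
                heapq.heappush(events, (t + delay, next(tie), w, v))
    return times, parent


# Hypothetical example network over nodes a to f.
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["e"], "e": ["f"]}
times, parent = simulate_cascade(graph, "a", beta=0.5, rate=1.0, rng=random.Random(0))
```

Only `times` would be observed in practice; `parent` is exactly the hidden information the talk sets out to recover.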
7
Likelihood of a Single Cascade
• Probability that cascade c propagates from node u to node v is:
  Pc(u, v) ∝ P(tv - tu), with tv > tu
  - Since not all nodes get infected by the diffusion process, we introduce the external influence node m: Pc(m, v) = ε
• Probability that cascade c propagates in a tree pattern T:
  [Figure: external node m connected by ε-edges to the nodes a, b, c, d, e]
Tree pattern T on cascade c:
(a, 1), (b, 2), (c, 4), (e, 8)
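A minimal sketch of scoring one tree pattern, assuming the tree probability factorizes into a product of per-edge terms Pc(u, v) and dropping any β or normalization factors. The exponential density with parameter `alpha`, the value of `eps`, and the two example trees `chain` and `star` are assumptions made for illustration; the transcript does not say which tree the slide drew.

```python
import math


def edge_density(tu, tv, alpha=1.0):
    """Exponential transmission model P(tv - tu); zero unless v is infected after u."""
    delta = tv - tu
    return math.exp(-delta / alpha) / alpha if delta > 0 else 0.0


def tree_log_likelihood(times, parent, eps=1e-3, alpha=1.0):
    """log P(c|T) as a sum of log edge terms; parent None means the external node m."""
    total = 0.0
    for v, u in parent.items():
        p = eps if u is None else edge_density(times[u], times[v], alpha)
        if p == 0.0:
            return -math.inf  # the tree uses an edge that points backward in time
        total += math.log(p)
    return total


times = {"a": 1, "b": 2, "c": 4, "e": 8}
chain = {"a": None, "b": "a", "c": "b", "e": "c"}   # m -> a -> b -> c -> e
star = {"a": None, "b": "a", "c": "a", "e": "a"}    # m -> a, then a infects everyone
print(tree_log_likelihood(times, chain), tree_log_likelihood(times, star))
```

Under this model the chain scores higher than the star, because shorter delays are more likely.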
8
Finding the Diffusion Network
• There are many possible propagation trees that are consistent with the data:
  c: (a, 1), (c, 2), (b, 3), (e, 4)
  [Figure: alternative propagation trees for cascade c over nodes a, b, c, d, e]
• Need to consider all possible propagation trees T supported by the graph G:
  - Good news: computing P(c|G) is tractable. Even though there are O(n^n) possible propagation trees, the Matrix Tree Theorem can compute this in O(n^3)!
• Likelihood of a set of cascades C:
• Want to find a graph:
  - Bad news: we actually want to search over graphs, and there is a super-exponential number of graphs!
9
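Why the sum over all propagation trees is tractable can be seen in a small case. The sketch below assumes a complete graph over the cascade's nodes, an exponential transmission model, and an external-node weight `EPS` (all illustrative choices). Because every edge points forward in time, each node picks its parent independently, so the sum over trees equals a product over nodes of summed in-edge weights. This is the special case in which the matrix from the Matrix Tree Theorem is triangular, so its determinant is the product of its diagonal.

```python
import itertools
import math

EPS = 1e-3  # assumed weight of an edge from the external node m


def in_edge_weights(cascade, alpha=1.0):
    """For each node, the weights of all possible parent edges (m plus every earlier node)."""
    weights = {}
    for v, tv in cascade:
        earlier = [math.exp(-(tv - tu) / alpha) / alpha for u, tu in cascade if tu < tv]
        weights[v] = [EPS] + earlier
    return weights


def sum_over_trees_bruteforce(weights):
    """Enumerate every propagation tree (one parent choice per node) and add up the products."""
    return sum(math.prod(choice) for choice in itertools.product(*weights.values()))


def sum_over_trees_factorized(weights):
    """Same quantity as a product over nodes of summed in-edge weights."""
    return math.prod(sum(ws) for ws in weights.values())


cascade = [("a", 1), ("c", 2), ("b", 3), ("e", 4)]  # cascade c from the slide
weights = in_edge_weights(cascade)
n_trees = math.prod(len(ws) for ws in weights.values())
print(n_trees, sum_over_trees_bruteforce(weights), sum_over_trees_factorized(weights))
```

The brute-force version is only there to check the factorized one; it enumerates every tree and would not scale.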
An Alternative Formulation
• We consider only the most likely tree
• Maximum log-likelihood for a cascade c under a graph G:
• Log-likelihood of G given a set of cascades C:
• The problem is still intractable (NP-hard), but we present an algorithm that finds near-optimal networks in O(N^2)
10
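In symbols, the two quantities named on this slide can be written as follows. This is a sketch: the set $\mathcal{T}_c(G)$ of propagation trees of cascade c supported by G is notation assumed here, and any additive normalization used in the talk is omitted.

```latex
F_c(G) = \max_{T \in \mathcal{T}_c(G)} \log P(c \mid T),
\qquad
F_C(G) = \sum_{c \in C} F_c(G)
```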
Max Directed Spanning Tree
Given a cascade c and a network G,
• What is the most likely propagation tree?
  - A maximum directed spanning tree (MDST)
• The sub-graph of G induced by the nodes in the cascade c is a DAG
  - Because edges point forward in time
• For each node, just pick an in-edge of maximum weight:
Greedy parent selection for each node gives a globally optimal tree!
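The greedy parent selection can be sketched in a few lines. The function name `most_likely_tree`, the example edge set and the weight function (the log of an exponential transmission density with unit scale, i.e. minus the delay) are assumptions for illustration only.

```python
def most_likely_tree(times, edges, weight):
    """Pick, for every node in the cascade, its max-weight in-edge that points forward in time.

    times:  {node: infection time} for one cascade
    edges:  set of directed edges (u, v) of the network G
    weight: function (u, v) -> edge weight, e.g. a log transmission probability
    Returns {node: parent}; parent is None when no valid in-edge exists.
    """
    parent = {}
    for v, tv in times.items():
        candidates = [u for u, tu in times.items() if tu < tv and (u, v) in edges]
        parent[v] = max(candidates, key=lambda u: weight(u, v), default=None)
    return parent


times = {"a": 1, "c": 2, "b": 6, "e": 9}  # cascade c1 from the earlier slide
# Hypothetical network G; the edge (e, a) points backward in time for this cascade.
edges = {("a", "c"), ("a", "b"), ("c", "b"), ("b", "e"), ("c", "e"), ("e", "a")}
tree = most_likely_tree(times, edges, lambda u, v: -(times[v] - times[u]))
print(tree)
```

No cycle can form because every chosen edge goes from an earlier node to a later one, which is why the per-node choice is globally optimal.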
11
Objective function is Submodular
Given a set of cascades C,
• How do we find the network G that maximizes FC(G)?
Theorem:
The log-likelihood FC(G) of a set of cascades C is monotonic and submodular in the edges of the graph G:
FC(A ∪ {e}) - FC(A) ≥ FC(B ∪ {e}) - FC(B), for A ⊆ B ⊆ V×V
  - Left-hand side: gain of adding an edge to a "small" graph
  - Right-hand side: gain of adding an edge to a "large" graph
Proof:
Fc(G) of a single cascade c is monotonic and submodular ⇒ FC(G) of a set of cascades C is monotonic and submodular (a sum of submodular functions is submodular)
12
Objective function is Submodular
Proof:
Fc(A ∪ {e}) - Fc(A) ≥ Fc(B ∪ {e}) - Fc(B), for A ⊆ B ⊆ V×V
(gain of adding an edge to a "small" graph ≥ gain of adding it to a "large" graph)
• Single cascade c, edge e into node s with weight x
• Let w be the weight of the max-weight in-edge of s in A
• Let w' be the weight of the max-weight in-edge of s in B
• We know: w ≤ w'
• Now: Fc(A ∪ {e}) - Fc(A) = max(w, x) - w ≥ max(w', x) - w' = Fc(B ∪ {e}) - Fc(B)
[Figure: graphs A and B around node s, with in-edge weights w, w' and x]
13
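The diminishing-returns inequality can be checked exhaustively on a toy instance. In the sketch below, `f_c` is a stand-in for the single-cascade objective (each node keeps its best in-edge, or an external edge of weight `base`), and the edge weights over nodes i, j, k, s are invented for illustration.

```python
import itertools


def f_c(edge_set, weights, base=0.0):
    """Single-cascade objective: every node keeps its best in-edge (or the external edge of weight `base`)."""
    targets = {v for _, v in weights}
    return sum(
        max([base] + [weights[e] for e in edge_set if e[1] == s]) for s in targets
    )


def check_diminishing_returns(weights):
    """Exhaustively test Fc(A+e) - Fc(A) >= Fc(B+e) - Fc(B) for all A <= B and e not in B."""
    all_edges = list(weights)
    for r in range(len(all_edges) + 1):
        for b in itertools.combinations(all_edges, r):
            for q in range(len(b) + 1):
                for a in itertools.combinations(b, q):
                    for e in all_edges:
                        if e in b:
                            continue
                        gain_a = f_c(set(a) | {e}, weights) - f_c(set(a), weights)
                        gain_b = f_c(set(b) | {e}, weights) - f_c(set(b), weights)
                        if gain_a < gain_b:
                            return False
    return True


# Hypothetical edge weights; three of the edges compete to be the parent of s.
weights = {("i", "s"): 3.0, ("j", "s"): 5.0, ("k", "s"): 1.0, ("i", "j"): 2.0}
print(check_diminishing_returns(weights))
```

For example, with A = {(k, s)} and B = {(k, s), (j, s)}, adding the edge (i, s) of weight 3 gains 2 in A but nothing in B, matching max(w, x) - w ≥ max(w', x) - w'.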
Finding the Diffusion Graph
• Use greedy hill-climbing to maximize FC(G):
  - For i = 1…k: at every step, pick the edge that maximizes the marginal improvement
  [Figure: marginal gains of the candidate edges at successive greedy steps]
Benefits:
1. Approximation guarantee (≈ 0.63 of OPT)
2. Tight on-line bounds on the solution quality
3. Speed-ups:
  - Lazy evaluation (by submodularity)
  - Localized update (by the structure of the problem)
14
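A minimal sketch of greedy hill-climbing with lazy evaluation. It is an illustration and not the authors' implementation: `objective` is a simplified stand-in for FC(G) built from the two example cascades, `BASE` is an assumed log-weight for the external edge, and the localized-update speed-up is not shown. Lazy evaluation relies on submodularity: a stale marginal gain is an upper bound on the current one, so an edge whose freshly computed gain is still at the top of the heap can be picked without re-evaluating the rest.

```python
import heapq

BASE = -10.0  # assumed log-weight of an edge from the external node

cascades = [
    {"a": 1, "c": 2, "b": 6, "e": 9},
    {"c": 1, "a": 4, "b": 5, "d": 8},
]


def objective(edges):
    """FC(G) sketch: per cascade and node, best in-edge log-weight -(tv - tu), measured above BASE."""
    total = 0.0
    for times in cascades:
        for v, tv in times.items():
            best = BASE
            for u, w in edges:
                if w == v and u in times and times[u] < tv:
                    best = max(best, -(tv - times[u]))
            total += best - BASE
    return total


def lazy_greedy(candidates, k):
    """Select up to k edges greedily, re-evaluating a marginal gain only when it is stale."""
    selected, current = [], 0.0
    # Heap entries: (-gain, tie-breaker, edge, size of `selected` when the gain was computed).
    heap = [(-objective([e]), i, e, 0) for i, e in enumerate(candidates)]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, i, e, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            if neg_gain >= 0:
                break  # no remaining edge improves the objective
            selected.append(e)
            current += -neg_gain
        else:
            gain = objective(selected + [e]) - current
            heapq.heappush(heap, (-gain, i, e, len(selected)))
    return selected


nodes = "abcde"
candidates = [(u, v) for u in nodes for v in nodes if u != v]
chosen = lazy_greedy(candidates, k=3)
print(chosen, objective(chosen))
```

The ≈ 0.63 factor on the slide is 1 - 1/e, the standard guarantee for greedy maximization of a monotone submodular function under a cardinality constraint.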
Experimental Setup
• We validate our method on:
  Synthetic data:
  - Generate a graph G on k edges
  - Generate cascades
  - Record node infection times
  - Reconstruct G
  Real data:
  - MemeTracker: 172m news articles (Aug ’08 – Sept ’09), 343m textual phrases (quotes)
  - Flickr
• Questions:
  - How many edges of G can we find? (Precision-Recall, Break-even point)
  - How well do we optimize the likelihood Fc(G)?
  - How fast is the algorithm?
  - How many cascades do we need?
15
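The evaluation metrics can be sketched as follows, assuming the inferred edges are ranked in the order the algorithm added them. The function names and the toy edge lists are illustrative. The break-even point is the point where precision equals recall, which happens when exactly as many edges are inferred as the true network contains.

```python
def precision_recall_curve(ranked_edges, true_edges):
    """Precision and recall after each of the inferred edges, in the order they were added."""
    curve, hits = [], 0
    for k, edge in enumerate(ranked_edges, start=1):
        hits += edge in true_edges
        curve.append((hits / k, hits / len(true_edges)))
    return curve


def break_even_point(ranked_edges, true_edges):
    """Precision (= recall) when exactly as many edges are inferred as the true network has."""
    k = len(true_edges)
    hits = sum(edge in true_edges for edge in ranked_edges[:k])
    return hits / k


# Toy example: four true edges, five inferred edges, one of them wrong.
true_edges = {("a", "b"), ("a", "c"), ("c", "b"), ("b", "e")}
ranked = [("a", "b"), ("a", "c"), ("x", "d"), ("c", "b"), ("b", "e")]
print(break_even_point(ranked, true_edges))
```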
Small Synthetic Example
• Small synthetic network:
  [Figure: true network, baseline network (pick k strongest edges), and the network inferred by our method]
16
Synthetic Networks
[Figure panels: 1024-node hierarchical Kronecker network, exponential transmission model; 1000-node Forest Fire network (α = 1.1), power-law transmission model]
• Performance does not depend on the network structure:
  - Synthetic networks: Forest Fire, Kronecker, etc.
  - Transmission time distribution: Exponential, Power Law
• Break-even point of > 90%
17
How good is our graph?
• We achieve ≈ 90% of the best possible network!
18
How many cascades do we need?
• With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!
19
Running Time
• Lazy evaluation and localized updates give a speed-up of 2 orders of magnitude!
• Can infer a network of 10k nodes in several hours
20
Real Data: Information diffusion
• MemeTracker dataset:
  - 172m news articles from Aug ’08 – Sept ’09
  - 343m textual phrases (quotes)
• Want to infer the network of information diffusion
• We use the hyperlinks between sites to generate the edges of a ground truth G
• From the MemeTracker dataset, we have the timestamps of:
  1. cascades of hyperlinks: time when a site creates a link
  2. cascades of (MemeTracker) textual phrases: time when a site mentions the information
  [Figure: example cascades over sites a, c, e, f]
21
Real Network: Performance
[Figure panels: 500-node hyperlink network using hyperlink cascades; 500-node hyperlink network using MemeTracker cascades]
• Break-even points of 50% for hyperlink cascades and 30% for MemeTracker cascades!
22
Information Diffusion Network
• 5,000 news sites:
  [Figure: inferred network; legend: blogs, mainstream media]
23
Information Diffusion Network (small part)
Blogs
Mainstream media
24
Real Data: Trips reconstruction
• Flickr dataset:
  - 60k Flickr users
  - 6M time-stamped geo-localized photos
• For every user we have the time and place where each photo was taken:
  20425816@N05;Argentina;Ciudad de Buenos Aires;Cafayate;2008-04-02
  9603517@N06;Spain;Andalucia;Granada;2008-04-09
  9603517@N06;Belgium;Oost-Vlaanderen;Ghent;2006-05-20
  95311862@N00;Italy;Piedmont;San Pietro Mosezzo;2005-03-10
  …
• Want to infer the network of frequent trips
25
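One plausible way to turn such records into per-user, time-ordered sequences of places is sketched below. The field order (user id; country; region; city; date) is read off the sample lines and is an assumption, as are the names `SAMPLE` and `visits_by_user`; the talk does not describe its preprocessing.

```python
from collections import defaultdict
from datetime import date

# The four sample records shown on the slide.
SAMPLE = """\
20425816@N05;Argentina;Ciudad de Buenos Aires;Cafayate;2008-04-02
9603517@N06;Spain;Andalucia;Granada;2008-04-09
9603517@N06;Belgium;Oost-Vlaanderen;Ghent;2006-05-20
95311862@N00;Italy;Piedmont;San Pietro Mosezzo;2005-03-10
"""


def visits_by_user(text):
    """Group records by user and sort each user's visits by date."""
    visits = defaultdict(list)
    for line in text.splitlines():
        user, country, region, city, day = line.split(";")
        visits[user].append((date.fromisoformat(day), city, country))
    return {user: sorted(v) for user, v in visits.items()}


trips = visits_by_user(SAMPLE)
print([city for _, city, _ in trips["9603517@N06"]])
```

Each user's sorted list then plays the role of a cascade: an ordered set of (place, time) observations.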
Trips Network
26
Conclusions
• We infer hidden networks based on diffusion data (timestamps)
• Problem formulation in a maximum likelihood framework:
  - NP-hard problem to solve exactly
  - We develop an approximation algorithm that:
    - is efficient: it runs in O(N^2)
    - is invariant to the structure of the underlying network
    - gives a sub-optimal network with a tight bound on solution quality
• Future work:
  - Learn both the network and the diffusion model
  - Applications to other domains: biology, neuroscience, etc.
27
Thanks!
For more (Code & Data):
http://snap.stanford.edu/netinf
28