Inferring Networks of Diffusion and Influence

Manuel Gomez Rodriguez (1,2), Jure Leskovec (1), Andreas Krause (3)
(1) Stanford University
(2) MPI for Biological Cybernetics
(3) California Institute of Technology

Hidden and implicit networks

Many social or information networks are implicit or hard to observe:
  Hidden/hard-to-reach populations: network of needle sharing between drug injection users
  Implicit connections: network of information propagation in online news media
But we can observe results of the processes taking place on such (invisible) networks:
  Virus propagation: drug users get sick, and we observe when they see the doctor
  Information networks: we observe when media sites mention information

Information Diffusion Network

Information diffuses through the network over time.
We only see who mentions the information, but not where they got it from.
Question: Can we infer the hidden networks?

Examples and Applications

Virus propagation
  Process: viruses propagate through the network
  We observe: we only observe when people get sick
  It’s hidden: but NOT who infected whom
Word of mouth & viral marketing
  Process: recommendations and influence propagate
  We observe: we only observe when people buy products
  It’s hidden: but NOT who influenced whom
Can we infer the underlying network?

Inferring the Network

There is a directed social network over which diffusions take place:
  (figure: directed graph on nodes a, b, c, d, e)
But we do not observe the edges of the network.
We only see the time when a node gets infected:
  Cascade c1: (a, 1), (c, 2), (b, 6), (e, 9)
  Cascade c2: (c, 1), (a, 4), (b, 5), (d, 8)
Task: inferring the underlying network

Our Problem Formulation

Plan for the talk:
  1. Define a continuous time model of diffusion
  2. Define the likelihood of the observed cascades given a network
  3. Show how to efficiently compute the likelihood of cascades
  4. Show how to efficiently find a graph G that maximizes the likelihood
Note: There is a super-exponential number of graphs, $O(N^{N*N})$.
Our method finds a near-optimal graph in $O(N^2)$!

Cascade Generation Model

Continuous time cascade diffusion model:
  Cascade c reaches node u at $t_u$ and spreads to u’s neighbors:
    With probability β the cascade propagates along edge (u, v), and we determine the infection time of node v:
      $t_v = t_u + \Delta$, e.g. Δ ~ Exponential or Power-law
  (figure: example cascade over nodes a to f with infection times $t_a$, $t_b$, $t_c$, $t_e$, $t_f$ and delays Δ1 to Δ4)
We assume each node v has only one parent!

Likelihood of a Single Cascade

Probability that cascade c propagates from node u to node v:
  $P_c(u, v) \propto P(t_v - t_u)$ with $t_v > t_u$
Since not all nodes get infected by the diffusion process, we introduce the external influence node m:
  $P_c(m, v) = \varepsilon$
Prob. that cascade c propagates in a tree pattern T:
  (equation not preserved in the transcript)
  (figure: external node m linked with probability ε to nodes of the graph on a, b, c, d, e)
Tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8)

Finding the Diffusion Network

There are many possible propagation trees that are consistent with the observed data:
  c: (a, 1), (c, 2), (b, 3), (e, 4)
  (figure: several candidate propagation trees over nodes a, b, c, d, e)
Need to consider all possible propagation trees T supported by the graph G:
  (equation not preserved in the transcript)
Likelihood of a set of cascades C:
  (equation not preserved in the transcript)
Want to find a graph:
  (equation not preserved in the transcript)
Good news: Computing P(c|G) is tractable. Even though there are $O(n^n)$ possible propagation trees, the Matrix Tree Theorem can compute this in $O(n^3)$!
Bad news: We actually want to search over graphs, and there is a super-exponential number of graphs!

An Alternative Formulation

We consider only the most likely tree.
Maximum log-likelihood for a cascade c under a graph G:
  (equation not preserved in the transcript)
Log-likelihood of G given a set of cascades C:
  (equation not preserved in the transcript)
The problem is still intractable (NP-hard).
But we present an algorithm that finds near-optimal networks in $O(N^2)$.

Max Directed Spanning Tree

Given a cascade c and a network G, what is the most likely propagation tree?
  (equation not preserved in the transcript)
A maximum directed spanning tree (MDST):
  The sub-graph of G induced by the nodes in the cascade c is a DAG, because edges point forward in time.
  For each node, just pick an in-edge of max weight:
    (equation not preserved in the transcript)
Greedy parent selection for each node gives the globally optimal tree!
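
The greedy parent selection on the Max Directed Spanning Tree slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `most_likely_tree`, the example edge set, and the log-weight `-(t_v - t_u)` (an exponential delay with rate 1, up to a constant) are all hypothetical choices made for the example.

```python
import math


def most_likely_tree(cascade, edges, log_weight):
    """Pick, for every infected node, its max-weight in-edge.

    cascade    : dict node -> infection time
    edges      : set of directed edges (u, v) of the candidate graph G
    log_weight : function mapping a delay t_v - t_u to an edge log-weight
                 (assumed form; the slides do not fix it)
    Returns a dict node -> chosen parent (None if no admissible in-edge).
    """
    parents = {}
    for v, t_v in cascade.items():
        best_parent, best_w = None, -math.inf
        for u, t_u in cascade.items():
            # Only edges of G that point forward in time are admissible,
            # which is why the induced sub-graph is a DAG.
            if t_u < t_v and (u, v) in edges:
                w = log_weight(t_v - t_u)
                if w > best_w:
                    best_parent, best_w = u, w
        parents[v] = best_parent
    return parents


# Cascade c1 from the "Inferring the Network" slide.
c1 = {"a": 1, "c": 2, "b": 6, "e": 9}
# Hypothetical candidate graph; the slide's actual edges are not in the transcript.
G = {("a", "b"), ("a", "c"), ("c", "b"), ("b", "e"), ("c", "e")}

tree = most_likely_tree(c1, G, lambda dt: -dt)
print(tree)  # {'a': None, 'c': 'a', 'b': 'c', 'e': 'b'}
```

Because each node chooses its parent independently of the others and every admissible edge points forward in time, the per-node choices can never form a cycle, so the local maxima assemble into a valid tree.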

Objective function is Submodular

Given a set of cascades C, how do we find the network G that maximizes $F_C(G)$?
Theorem: The log-likelihood $F_C(G)$ of a set of cascades C is monotonic and submodular in the edges of the graph G:
  $F_C(A \cup \{e\}) - F_C(A) \ge F_C(B \cup \{e\}) - F_C(B)$ for $A \subseteq B \subseteq V \times V$
  Left-hand side: gain of adding an edge to a “small” graph
  Right-hand side: gain of adding an edge to a “large” graph
Proof: $F_c(G)$ of a single cascade c is monotonic and submodular, hence $F_C(G)$ of a set of cascades C is monotonic and submodular.

Objective function is Submodular

Proof of $F_c(A \cup \{e\}) - F_c(A) \ge F_c(B \cup \{e\}) - F_c(B)$ for $A \subseteq B \subseteq V \times V$:
  Single cascade c, edge e with weight x
  Let w be the max-weight in-edge of s in A
  Let w’ be the max-weight in-edge of s in B
  We know: $w \le w'$
  Now: $F_c(A \cup \{e\}) - F_c(A) = \max(w, x) - w \ge \max(w', x) - w' = F_c(B \cup \{e\}) - F_c(B)$
  (figure: node s with in-edges of weights w, x and w’ from neighboring nodes)

Finding the Diffusion Graph

Use greedy hill-climbing to maximize $F_C(G)$:
  For i = 1…k: at every step, pick the edge that maximizes the marginal improvement
  (figure: example graph on nodes a to e with a table of marginal gains per candidate edge, updated after each step)
Benefits:
  1. Approximation guarantee (≈ 0.63 of OPT)
  2. Tight on-line bounds on the solution quality
  3. Speed-ups:
     Lazy evaluation (by submodularity)
     Localized update (by the structure of the problem)

Experimental Setup

We validate our method on:
  Synthetic data:
    Generate a graph G on k edges
    Generate cascades
    Record node infection times
    Reconstruct G
  Real data:
    MemeTracker: 172m news articles, Aug ’08 – Sept ’09, 343m textual phrases (quotes)
    Flickr
Questions:
  How many edges of G can we find? (Precision-Recall, Break-even point)
  How well do we optimize the likelihood $F_C(G)$?
  How fast is the algorithm?
  How many cascades do we need?
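
The "generate cascades, record node infection times" step of the synthetic setup follows the continuous time model from the Cascade Generation Model slide. A minimal sketch, assuming exponential delays and a graph stored as an adjacency dict; the function name `generate_cascade` and its parameters are hypothetical, not taken from the authors' code:

```python
import heapq
import random


def generate_cascade(graph, source, beta, rate, rng):
    """Simulate one cascade and return a dict node -> infection time.

    graph  : dict node -> list of out-neighbours (assumed representation)
    source : node where the cascade starts, at time 0
    beta   : probability that the cascade propagates along an edge
    rate   : rate of the exponential delay (assumed transmission model)
    rng    : random.Random instance, passed in for reproducibility
    """
    times = {}
    queue = [(0.0, source)]  # pending infections, ordered by time
    while queue:
        t_u, u = heapq.heappop(queue)
        if u in times:
            continue  # keep only the earliest infection: one parent per node
        times[u] = t_u
        for v in graph.get(u, []):
            # Edge (u, v) transmits with probability beta, after a delay.
            if v not in times and rng.random() < beta:
                heapq.heappush(queue, (t_u + rng.expovariate(rate), v))
    return times


# Hypothetical toy graph; only the infection times would be given to the
# inference algorithm, never the edges.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cascade = generate_cascade(toy_graph, "a", beta=0.5, rate=1.0, rng=random.Random(7))
print(sorted(cascade.items(), key=lambda kv: kv[1]))
```

Swapping `rng.expovariate(rate)` for a power-law sampler would give the other transmission model mentioned on the slides.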

Small Synthetic Example

Small synthetic network:
  (figure panels: True network, Baseline network, Our method)
  Baseline network: pick k strongest edges

Synthetic Networks

  (plot: 1024 node hierarchical Kronecker, exponential transmission model)
  (plot: 1000 node Forest Fire (α = 1.1), power law transmission model)
Performance does not depend on the network structure:
  Synthetic networks: Forest Fire, Kronecker, etc.
  Transmission time distribution: Exponential, Power Law
Break-even point of > 90%

How good is our graph?

We achieve ≈ 90% of the best possible network!

How many cascades do we need?

With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!

Running Time

Lazy evaluation and localized updates give a speed-up of 2 orders of magnitude!
Can infer a network of 10k nodes in several hours.

Real Data: Information diffusion

MemeTracker dataset:
  172m news articles from Aug ’08 – Sept ’09
  343m textual phrases (quotes)
Want to infer the network of information diffusion.
We use the hyperlinks between sites to generate the edges of a ground truth G.
From the MemeTracker dataset, we have the timestamps of:
  1. cascades of hyperlinks: time when a site creates a link
  2. cascades of (MemeTracker) textual phrases: time when a site mentions the information

Real Network: Performance

  (plot: 500 node hyperlink network using hyperlinks cascades)
  (plot: 500 node hyperlink network using MemeTracker cascades)
Break-even points of 50% for hyperlinks cascades and 30% for MemeTracker cascades!
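
The break-even point quoted on these slides is the value at which precision equals recall. A minimal sketch of how it could be computed from a ranked list of inferred edges, assuming the ranking contains at least as many edges as the ground truth; the function name `break_even_point` and the example edges are hypothetical:

```python
def break_even_point(ranked_edges, true_edges):
    """Precision (= recall) when as many edges are kept as the truth has.

    ranked_edges : inferred edges, best first (e.g. in the order picked)
    true_edges   : set of ground-truth edges
    With k = len(true_edges) edges kept, precision = hits / k and
    recall = hits / len(true_edges) coincide, which is the break-even point.
    """
    k = len(true_edges)
    if k == 0 or len(ranked_edges) < k:
        raise ValueError("need a non-empty truth and at least k ranked edges")
    hits = sum(1 for edge in ranked_edges[:k] if edge in true_edges)
    return hits / k


# Hypothetical example: 3 of the top 4 inferred edges are in the truth.
truth = {("a", "b"), ("c", "b"), ("b", "e"), ("c", "e")}
inferred = [("a", "b"), ("a", "c"), ("c", "b"), ("b", "e"), ("c", "e")]
print(break_even_point(inferred, truth))  # 0.75
```

Under this reading, a break-even point of 0.9 means that 90% of the true edges appear among the top-ranked inferred edges when the inferred graph is cut to the size of the true one.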

Information Diffusion Network

5,000 news sites:
  (figure: inferred network; node types: Blogs, Mainstream media)

Information Diffusion Network (small part)

  (figure: close-up of the inferred network; node types: Blogs, Mainstream media)

Real Data: Trips reconstruction

Flickr dataset:
  60k Flickr users
  6M time-stamped geo-localized photos
For every user we have the time and place where a photo was taken:
  20425816@N05;Argentina;Ciudad de Buenos Aires;Cafayate;2008-04-02
  9603517@N06;Spain;Andalucia;Granada;2008-04-09
  9603517@N06;Belgium;Oost-Vlaanderen;Ghent;2006-05-20
  95311862@N00;Italy;Piedmont;San Pietro Mosezzo;2005-03-10
  …
Want to infer the network of frequent trips.

Trips Network

  (figure: inferred network of frequent trips)

Conclusions

We infer hidden networks based on diffusion data (timestamps).
Problem formulation in a maximum likelihood framework:
  NP-hard problem to solve exactly
  We develop an approximation algorithm that:
    is efficient: it runs in $O(N^2)$
    is invariant to the structure of the underlying network
    gives a near-optimal network with a tight bound on solution quality
Future work:
  Learn both the network and the diffusion model
  Applications to other domains: biology, neuroscience, etc.

Thanks!

For more (Code & Data): http://snap.stanford.edu/netinf