Leveraging Propagation for Data Mining: Models, Algorithms, Applications
B. Aditya Prakash
Department of Computer Science
Social Computing Workshop, ARL, Sept 28, 2016

Dynamical Processes over networks are also everywhere!
Prakash 2016 2

Why do we care?
• Social collaboration
• Information Diffusion
• Viral Marketing
• Epidemiology and Public Health
• Cyber Security
• Human mobility
• Games and Virtual Worlds
• Ecology
• ........

Why do we care? (1: Epidemiology)
• Dynamical Processes over networks
• Diseases over contact networks [AJPH 2007]
[Figure: CDC data: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts]

Why do we care? (1: Epidemiology)
• Dynamical Processes over networks
• Each circle is a hospital
• ~3000 hospitals
• More than 30,000 patients transferred
[Figure: US-MEDICARE NETWORK 2005]
Problem: Given k units of disinfectant, whom to immunize?

Why do we care? (1: Epidemiology)
[Figure: US-MEDICARE NETWORK 2005; panels: CURRENT PRACTICE vs. OUR METHOD (~6x fewer!)]
Hospital-acquired infections took 99K+ lives and cost $5B+ (all per year)

Why do we care? (2: Online Diffusion)
• > 800m users, ~$1B revenue [WSJ 2010]
• ~100m active users
• > 50m users

Why do we care? (2: Online Diffusion)
• Dynamical Processes over networks
[Figure: Social Media Marketing: a celebrity tells followers "Buy Versace™!"]

Why do we care? (3: To change the world?)
• Dynamical Processes over networks
• Social networks and Collaborative Action

High Impact – Multiple Settings
• Settings: epidemic out-breaks; products/viruses; transmit s/w patches
• Q. How to squash rumors faster?
• Q. How do opinions spread?
• Q. How to market better?

Research Theme
• DATA: Large real-world networks & processes
• ANALYSIS: Understanding
• POLICY/ACTION: Managing

Research Theme – Social Media
• DATA: Modeling Tweets spreading
• ANALYSIS: # cascades in future?
• POLICY/ACTION: How to market better?

Research Theme – Public Health
• DATA: Modeling # patient transfers
• ANALYSIS: Will an epidemic happen?
• POLICY/ACTION: How to control out-breaks?

In this talk
Using propagation for _________
• Q1: Syndromic Surveillance
• Q2: Memes, Tweets, Blogs
• Q3: Summarization & Communities
Applications: Large real-world networks & processes

Applications
Using propagation for _________
• Q1: Syndromic Surveillance
• Q2: Memes, Tweets, Blogs
• Q3: General Graph Mining

Surveillance [Chen et al. ICDM 2014]
• How to estimate and predict flu trends?
[Figure: data sources: population survey, lab survey, hospital record, surveillance report]

GFT & Twitter
• Estimate flu trends using online electronic sources

Flu forecasting
• Twitter – a surrogate for flu forecasting?
• Google Flu Trends: using keywords to track the flu season
• Can we get more specific?
• Consider:

“Propagation” ideas
• Can we develop better disease surveillance tools by leveraging
  – How flu-related information propagates on Twitter
  – Epidemiological models

Observation 1: States
• There are different states in an infection cycle.
• SEIR model:
  1. Susceptible
  2. Exposed
  3. Infected
  4. Recovered

Observation 2: Ep. & So. Gap
• Infection cases drop exponentially in epidemiology (Hethcote 2000)
• Keyword mentions drop in a power-law pattern in social media (Matsubara 2012)

Flu Forecasting
• Using a combination of propagation patterns, develop a hidden flu-state topic model
• Learn “flu” vocabulary and transition probabilities

HFSTM Model (Details)
• Hidden Flu-State from Tweet Model (HFSTM)
  – Each word (w) in a tweet ($O_i$) can be generated by:
    • A background topic
    • Non-flu related topics
    • State related topics
[Figure: graphical model; labels: initial prob., transit. switch, binary non-flu related switch, latent state, transit. prob., binary background switch, word distribution]

HFSTM Model (Details)
• Generating tweets
  – Generate the state for a tweet. State: [S, E, I]
  – Generate the topic for a word. Topic: [Background, Non-flu, State]
• Example tweets by state:
  – S: This restaurant is really good
  – E: The movie was good but it was freezing
  – I: I think I have flu

Inference (Details)
• EM-based algorithm: HFSTM-FIT
  – E-step:
    • $A_t(i) = P(O_1, O_2, \ldots, O_t, S_t = i)$
    • $B_t(i) = P(O_{t+1}, \ldots, O_{T_u} \mid S_t = i)$
    • $\gamma_t(i) = P(S_t = i \mid O_u)$
  – M-step:
    • Other parameters such as state transition probabilities, topic distributions, etc.
  – Parameters learned:

A possible issue with HFSTM
• Suffers from large, noisy vocabulary.
• Semi-supervision for improvement
  – Introduce weak supervision into HFSTM.

HFSTM-A (Details) [Chen et al. DAMI 2015]
• HFSTM-A(spect)
  – Introduce an aspect variable y, expressing our belief on whether a word is flu-related or not.
  – The value of y biases the switch variables s.t. flu-related words are more likely to be explained by state topics.
When the aspect value (y) is introduced, the switching probabilities are updated accordingly.

Vocabulary & Dataset
• Vocabulary (230 words):
  – Flu-related keyword list by Chakraborty SDM 2014
  – Extra state-related keyword list
• Dataset (34,000 tweets):
  – Identify infected users and collect their tweets
  – Train on data from Jun 20, 2013-Aug 06, 2013
  – Test on two time periods:
    • Dec 01, 2012-July 08, 2013
    • Nov 10, 2013-Jan 26, 2014

Learned word distributions
• The most probable words learned in each state
[Figure: word clouds per state: S (probably healthy), E (having symptoms), I (definitely sick)]

Learned state transition
[Figure: transition probabilities; transitions in real tweets]
Learned by HFSTM: Not directly flu-related, yet correctly identified

Flu trend fitting
• Ground-truth:
  – The Pan American Health Organization (PAHO)
• Algorithms:
  – Baseline:
    • Count the number of keywords weekly as features, and regress to the ground-truth curve.
  – Google flu trend:
    • Take the Google flu trend data as input, regress to the PAHO curve.
  – HFSTM:
    • Distinguish the different states of keywords, and only use the number of keywords in the I state. Again regress to PAHO.

Flu trend fitting
• Linear regression to the case count reported by PAHO (the ground-truth)

HFSTM-A
• Results are qualitatively similar to HFSTM, when the vocabulary is 10 times larger.

Applications
Using propagation for _________
• Q1: Syndromic Surveillance
• Q2: Memes, Tweets, Blogs
• Q3: General Graph Mining

Memetracking
• Memes – a virally transmitted cultural symbol or social idea (first coined by Richard Dawkins in 1976)
• Usually text (a phrase) and/or an image
[Figure: a viral meme from the 2012 Olympics, all the way to the White House]

Patterns
• Imputation
• Anomaly
• Compression
• Extrapolation

Google Search Volume
[Figure: search volume over time, marking (1) first spike, (2) release date, (3) two weeks before release; future values marked "?"]
e.g., given (1) first spike, (2) release date of two sequel movies, (3) access volume before the release date

Rise and fall patterns in social media
• Meme (# of mentions in blogs)
  – short phrases sourced from U.S. politics in 2008
  – “you can put lipstick on a pig”
  – “yes we can”

Rise and fall patterns in social media
• Can we find a unifying model, which includes these patterns?
  • four classes on YouTube [Crane et al. ’08]
  • six classes on Meme [Yang et al. ’11]
[Figure: example rise-and-fall curves for these classes]

Rise and fall patterns in social media
• Answer: YES!
[Figure: six panels, each comparing an Original curve with its SpikeM fit; axes: Time vs. Value]
• We can represent all patterns by a single model, in Matsubara+ SIGKDD 2012

Main idea - SpikeM
1. Un-informed bloggers (uninformed about rumor) [Time n=0]
2. External shock at time $n_b$ (e.g., breaking news) [Time n=$n_b$]
3. Infection (word-of-mouth) [Time n=$n_b$+1]
Infectiveness of a blog-post at age n: $f(n) = \beta \cdot n^{-1.5}$
  – $\beta$: strength of infection (quality of news)
  – $f(n)$: decay function (how infective a blog posting is), a power law

Power Law
[Figure: power law with -1.5 slope]
J. G. Oliveira et al. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005) (also in Leskovec, McGlohon+, SDM 2007)

SpikeM - with periodicity
• Full equation of SpikeM:
$$\Delta B(n+1) = p(n+1) \cdot \left[ U(n) \cdot \sum_{t=n_b}^{n} \big(\Delta B(t) + S(t)\big) \cdot f(n+1-t) + \varepsilon \right]$$
• Periodicity $p(n)$: bloggers change their activity over time (e.g., daily, weekly, yearly)
[Figure: activity vs. time n; peak activity at 12pm, low activity at 3am]

Tail-part forecasts
• SpikeM can capture the tail part

“What-if” forecasting
[Figure: search volume over time, marking (1) first spike, (2) release date, (3) two weeks before release; future values marked "?"]
e.g., given (1) first spike, (2) release date of two sequel movies, (3) access volume before the release date

“What-if” forecasting
– SpikeM can forecast not only the tail-part, but also the rise-part!
[Figure: forecast with (1) first spike, (2) release date, (3) two weeks before release]
• SpikeM can forecast upcoming spikes

Bonus: Protest Predictions [Sundereisan et al. ASONAM 2014] [Jin et al. SIGKDD 2014]
• Can Twitter provide a lead time?
• South American Twitter dataset
  – Language: Spanish/Portuguese
  – Idea
    1. Look for trending keywords.
    2. Predict event type for protest using SpikeM parameters!
[Figure: a political tweet; Violent Protest (VP) vs. Non Violent Protest (P)]

Propagation and Cyber-Security: Temporal Patterns [Papalexakis et al. ASONAM 2013]
Looks familiar?

Propagation and Cyber-Security: Ensemble Models [Chan et al. WSDM 2016]

Applications
Using propagation for _________
• Q1: Syndromic Surveillance
• Q2: Memes, Tweets, Blogs
• Q3: General Graph Mining

Example 1: Missing data correction

Real data is noisy!
We don’t know exactly who is infected
• Epidemiology
  – Public-health surveillance
[Figure: Surveillance Pyramid [Nishiura+, PLoS ONE 2011]; levels: CDC, Lab, Hospital, CNN headlines; "Not sure" / "?" marks unobserved cases]
Each level has a certain probability to miss some truly infected people

Real data is noisy!
Correcting missing data is by itself very important
• Social Media
  – Twitter: due to the uniform samples [Morstatter+, ICWSM 2013], the relevant ‘infected’ tweets may be missed
[Figure: Tweets → Sampling → Sampled Tweets; "?" marks missing tweets]

The Problem [Sudareisan, Vreeken, Prakash SDM 2015] [Rozenshtein et al. SIGKDD 2016]
• GIVEN:
  – Graph G(V, E) from historical data
  – Infected set $D \subset V$, sampled (p%) and incomplete
  – Infectivity β of the virus
• FIND:
  – Seed set, i.e. patient zeros/culprits
  – Set $C^-$ (the missing infected nodes)
  – Ripple R (the order of infections)

Visualizing Performance (Grid connected)
[Figure: four panels (NetSleuth, Simulation, Frontier, NetFill), each showing seeds and missing nodes; legend: Correct, FP, FN, Seeds, Missing nodes, Infected]

Meme-Tracker – case study
• 96,000 node graph for the meme “State of the economy”
• Found missing websites like “www.nbcbayarea.com”, “chicagotribune.com” and some blog posts.
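
The inputs listed on "The Problem" slide (a graph G, an infectivity β, and an infected set D that is only a p% sample) can be illustrated with a toy generator. This is a minimal sketch, not the NetFill method: it assumes a simple SI process on a grid graph and uniform sampling of infected nodes, and the names `grid_graph`, `simulate_si`, and `sample_observed` are hypothetical helpers introduced only for this illustration.

```python
import random


def grid_graph(rows, cols):
    """Adjacency lists of a rows x cols grid (assumed toy topology)."""
    adj = {(r, c): [] for r in range(rows) for c in range(cols)}
    for r, c in adj:
        for dr, dc in ((0, 1), (1, 0)):  # link to right and lower neighbour
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].append(nb)
                adj[nb].append((r, c))
    return adj


def simulate_si(adj, seeds, beta, steps, rng):
    """Assumed SI process: each infected node infects each susceptible
    neighbour with probability beta per step. Returns node -> infection time,
    which plays the role of the ripple (order of infections)."""
    infected_at = {s: 0 for s in seeds}
    for t in range(1, steps + 1):
        newly = []
        for u in infected_at:
            for v in adj[u]:
                if v not in infected_at and rng.random() < beta:
                    newly.append(v)
        for v in newly:
            infected_at.setdefault(v, t)
    return infected_at


def sample_observed(infected, p, rng):
    """Keep each truly infected node with probability p (uniform sampling)."""
    return {v for v in sorted(infected) if rng.random() < p}


rng = random.Random(0)
adj = grid_graph(10, 10)
truth = simulate_si(adj, seeds=[(5, 5)], beta=0.4, steps=6, rng=rng)
observed = sample_observed(truth, p=0.7, rng=rng)  # the incomplete set D
missing = set(truth) - observed                    # what a method must recover
print(len(truth), "infected;", len(observed), "observed;", len(missing), "missing")
```

A recovery method would receive only `adj`, `observed`, and β, and would be scored against `truth` and the seed list.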

Example 2: “Zoom-out” of the network
• “Zoom-out” of the cascade graph to get a quick picture (= summarization)
[Figure: big graph (nodes A-F) → zoom-out → smaller representation of the network]
Coarsening [Purohit, Prakash, et al. SIGKDD 2014]

CoarseNet: algorithm
• Step 1: Compute scores for all edge pairs
• Step 2: Merge nodes with smallest score
• Step 3: Go to Step 1 until αn nodes are left
[Figure: assigning scores and merging edges; Original Network (weight=0.5) → Coarsened Network]

Application 1: Influence Maximization
• Methodology:
  – Step 1: Coarsen the large social network using CoarseNet
  – Step 2: Solve influence maximization on the coarsened network
  – Step 3: Randomly select one node from each selected “supernode”
• We call it CSPIN
[Figure: Step 1: coarsen; Step 2: solve influence maximization; Step 3: randomly select one node from the selected supernode]

Application 2: Diffusion Characterization
• Goal: use Graph Coarsening to understand information cascades
• Dataset: Flixster
  – a friendship network with movie ratings
  – Cascade: the same movie rating from friends
• Methodology
  – coarsen the network using CoarseNet with the reduction factor α=0.5
  – study the formed groups (supernodes)
  – Can get non-network surrogates

Diffusion observation
Stats:
• 1891 groups
• mean group size: 16.6
• the largest group: 22061 nodes (roughly 40% of nodes)
Observation 1: a very large fraction of movies propagate in a small number of groups
Observation 2: a multi-modal distribution

Things I won’t talk about
• Theory
  – Fundamental Models
  – Understanding

Main questions
1. When will a virus take-off on a network? [ICDM 2012]
2. What happens if the networks vary with time? [PKDD 2010]
3. What happens if multiple viruses compete? (‘winner-takes-all’) [WWW 2012]

More…
3. Interacting viruses: phase transition for co-existence vs. extinction [SIGKDD 2012]
4. Composite Networks (e.g. communication vs power-grid networks): depends on the networks [IEEE J. on Selected Areas in Comm. (JSAC) 2013]

Algorithms
  – Policy/Action
  – Managing/Manipulating

Alg 1: Immunization (= Interventions)
• Different Flavors:
  – Pre-emptive
  – Data-aware

Immunizations as Network manipulation
• Node based [Tong, P., + ICDM 2010]
• Edge-based [Tong, P., + CIKM 2012, Best Paper Award]
• Edge-Manipulation [P., Adamic+ SDM 2013]

Latest results
• First (provable) approximation algorithms for the edge-based problem [Saha, Adiga, P., Vullikanti SDM 2015]
  – O(log^2 n)-factor (can be improved to O(log n))
• Based on the idea of removing closed walks
  – Semi-Definite Programming rounding-based O(1) factor

Data-aware Immunization [Zhang and Prakash, SDM 2014] [Zhang and Prakash, TKDD 2015]
Given: Graph and Infected nodes
Find: ‘best’ nodes for immunization
• Complexity
  – NP-hard
  – Hard to approximate within an absolute error
• DAVA-tree
  – Optimal solution on the tree
• DAVA and DAVA-fast
  – Merging infected nodes
  – Build a “dominator tree”, and run DAVA-tree
• Running time: subquadratic
  – DAVA: O(k(|E| + |V| log |V|))
  – DAVA-fast: O(|E| + |V| log |V|)
[Figure: graph with infected nodes; dominator tree]

Extensions
• Can be extended to uncertain and noisy initial data as well! [Zhang and Prakash, CIKM 2014]
[Figure: Twitter Firehose API, 1% sample]

Group-based Immunization [Zhang, Adiga, Vullikanti, Prakash, 2015]
How to select groups to minimize the epidemic?
• Epidemiology
  • People are grouped by ages, demographics, occupations …
• Social Media
  • Friends are grouped by the same interests
  • E.g., Facebook pages
Results: First approximation algorithms for the problem

Conclusion: Theme
• DATA: Large real-world networks & processes
• ANALYSIS: Understanding
• POLICY/ACTION: Managing

Scalability – Big Data
• Datasets of unprecedented scale
  – High dimensionality and sample size!
• Need scalable algorithms for
  – Learning Models
  – Developing Policy
• Leverage parallel systems
  – Map-Reduce clusters (like Hadoop) for data-intensive jobs (more than 6000 machines)
  – Parallelized compute-intensive simulations (like Condor)

Effect on Community Structure
• Example: Twitter network where tweets diffuse over the followee-follower network
  – users who boost the diffusion ("bridges/media nodes")
  – influential users (“kernels")
[Figure: Original Network; communities detected by NEWMAN’s algorithm (cannot capture different roles in diffusion!); ideal communities and roles of nodes: kernel, media, ordinary nodes]

Summarization and Segmentation
• Automatic segmentation?
• Segment cascades?
• …….
[Figure: clipped paper excerpt; recoverable caption: "Fig. 6: M DSA S segmentation result for Peru: word clouds ... the three segments detected."]

Extensions
• Temporal graphs
• Noisy data
• Incorporating Richer Attributed graphs
• Heterogeneous graphs
• ….

Propagation on Networks
[Figure: Propagation on Networks at the center, connected to Theory & Algo., Biology, Physics, Comp. Systems, ML & Stats., Social Science, Econ.]

Acknowledgements
Collaborators: Deepayan Chakrabarti, Hanghang Tong, Kunal Punera, Ashwin Sridharan, Sridhar Machiraju, Mukund Seshadri, Alice Zheng, Lei Li, Polo Chau, Nicholas Valler, Alex Beutel, Xuetao Wei, Christos Faloutsos, Roni Rosenfeld, Michalis Faloutsos, Lada Adamic, Theodore Iwashyna (M.D.), Dave Andersen, Tina Eliassi-Rad, Iulian Neamtiu, Varun Gupta, Jilles Vreeken, V. S. Subrahmanian, John Brownstein (M.D.)

Acknowledgements
• Students: Liangzhe Chen, Shashidhar Sundereisan, Benjamin Wang, Yao Zhang, Sorour Amiri, Bijaya Adhikari

Acknowledgements
Funding

Propagation for Data Mining
B. Aditya Prakash
http://www.cs.vt.edu/~badityap
Data, Analysis, Policy/Action
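
As a supplement to the "SpikeM - with periodicity" slide, the update rule for ΔB(n+1) with decay f(n) = β · n^(-1.5) can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' implementation: the slides do not spell out how U(n), S(t), or p(n) are defined, so the sketch assumes that U(n) shrinks by the newly informed bloggers at each step, that S(t) is a single shock of size `s_b` at time `n_b`, and that p(n) defaults to 1 (no periodicity). The function name `spikem` and all parameter names are hypothetical.

```python
def spikem(total, beta, n_b, s_b, eps, steps, p=lambda n: 1.0):
    """Iterate the SpikeM update and return the list delta_b[0..steps].

    Assumptions (not stated on the slides): U(0) = total, U decreases by the
    newly informed count, and S(t) = s_b only at t == n_b.
    """
    delta_b = [0.0] * (steps + 1)  # newly informed bloggers per time tick
    uninformed = float(total)      # U(n)

    def shock(t):
        return s_b if t == n_b else 0.0

    def decay(age):
        return beta * age ** -1.5  # f(n) = beta * n^(-1.5)

    for n in range(steps):
        # Sum over t = n_b .. n of (dB(t) + S(t)) * f(n + 1 - t)
        pressure = sum(
            (delta_b[t] + shock(t)) * decay(n + 1 - t)
            for t in range(n_b, n + 1)
        )
        new = p(n + 1) * (uninformed * pressure + eps)
        new = min(max(new, 0.0), uninformed)  # cannot inform more than remain
        delta_b[n + 1] = new
        uninformed -= new
    return delta_b


curve = spikem(total=1000, beta=0.001, n_b=5, s_b=10, eps=0.0, steps=60)
peak = max(range(len(curve)), key=curve.__getitem__)
print("peak at tick", peak, "with", round(curve[peak], 2), "new mentions")
```

Passing a periodic function as `p`, for example one that dips during night-time ticks, would modulate the curve in the way the periodicity bullet describes; the exact form of that function is left open here because the slides do not give it.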