Leveraging Propagation for Data Mining
Models, Algorithms, Applications
B. Aditya Prakash
Department of Computer Science
Social Computing Workshop, ARL, Sept 28, 2016
Dynamical Processes over networks are also everywhere!
Prakash 2016
2
Why do we care?
• Social collaboration
• Information Diffusion
• Viral Marketing
• Epidemiology and Public Health
• Cyber Security
• Human mobility
• Games and Virtual Worlds
• Ecology
• ........
Why do we care? (1: Epidemiology)
• Dynamical Processes over networks
Diseases over contact networks
[Figure: CDC data: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts]
[AJPH 2007]
Why do we care? (1: Epidemiology)
• Dynamical Processes over networks
• Each circle is a hospital
• ~3000 hospitals
• More than 30,000 patients transferred
[US-MEDICARE NETWORK 2005]
Problem: Given k units of disinfectant, whom to immunize?
Why do we care? (1: Epidemiology)
[Figure: CURRENT PRACTICE vs. OUR METHOD: ~6x fewer!]
[US-MEDICARE NETWORK 2005]
Hospital-acquired infections took 99K+ lives and cost $5B+ (all per year)
Why do we care? (2: Online Diffusion)
> 800m users, ~$1B revenue [WSJ 2010]
~100m active users
> 50m users
Why do we care? (2: Online Diffusion)
• Dynamical Processes over networks
Social Media Marketing
[Figure: Celebrity and Followers; "Buy Versace™!"]
Why do we care?
(3: To change the world?)
• Dynamical Processes over networks
Social networks and Collaborative Action
High Impact – Multiple Settings
epidemic out-breaks
products/viruses
transmit s/w patches
Q. How to squash rumors faster?
Q. How do opinions spread?
Q. How to market better?
Research Theme
DATA: Large real-world networks & processes
ANALYSIS: Understanding
POLICY/ACTION: Managing
Research Theme – Social Media
DATA: Modeling Tweets spreading
ANALYSIS: # cascades in future?
POLICY/ACTION: How to market better?
Research Theme – Public Health
DATA: Modeling # patient transfers
ANALYSIS: Will an epidemic happen?
POLICY/ACTION: How to control out-breaks?
In this talk
Using propagation for _________
Applications (large real-world networks & processes)
Q1: Syndromic Surveillance
Q2: Memes, Tweets, Blogs
Q3: Summarization & Communities
Applications
Using propagation for _________
• Q1: Syndromic Surveillance
• Q2: Memes, Tweets, Blogs
• Q3: General Graph Mining
Surveillance
[Chen et al. ICDM 2014]
• How to estimate and predict flu trends?
Hospital record
Lab survey
Population survey
Surveillance Report
GFT & Twitter
• Estimate flu trends using online electronic sources
Flu forecasting
• Twitter – a surrogate for flu forecasting?
• Google Flu Trends: using keywords to
track the flu season
• Can we get more specific?
• Consider:
“Propagation” ideas
• Can we develop better disease surveillance tools by leveraging
– How flu-related information propagates on Twitter
– Epidemiological models
Observation 1: States
• There are different states in an infection cycle.
• SEIR model:
1. Susceptible
2. Exposed
3. Infected
4. Recovered
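The four SEIR states can be illustrated with a small discrete-time simulation. This is only a sketch under assumptions: the rates `beta`, `sigma`, and `gamma`, the population size, and the well-mixed (no network) setting are illustrative choices, not values from the talk.

```python
def simulate_seir(n_people, beta, sigma, gamma, e0, i0, steps):
    """Discrete-time SEIR sketch: returns a list of (S, E, I, R) tuples."""
    s, e, i, r = float(n_people - e0 - i0), float(e0), float(i0), 0.0
    history = [(s, e, i, r)]
    for _ in range(steps):
        new_exposed = beta * s * i / n_people   # S -> E: contact with infected
        new_infected = sigma * e                # E -> I: incubation ends
        new_recovered = gamma * i               # I -> R: recovery
        s -= new_exposed
        e += new_exposed - new_infected
        i += new_infected - new_recovered
        r += new_recovered
        history.append((s, e, i, r))
    return history


# Hypothetical parameters, chosen only to produce a visible outbreak.
history = simulate_seir(n_people=1000, beta=0.5, sigma=0.2, gamma=0.1,
                        e0=0, i0=10, steps=200)
peak_step = max(range(len(history)), key=lambda t: history[t][2])
print("infections peak at step", peak_step)
```

Each person moves through the states in order (S, then E, then I, then R), so the four counts always sum to the population size.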
Observation 2: Ep. & So. Gap
• Infection cases drop exponentially in epidemiology (Hethcote 2000)
• Keyword mentions drop in a power-law pattern in social media (Matsubara 2012)
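The gap between the two decay shapes can be seen numerically. The sketch below compares an exponential decay with a power-law decay; the exponential `rate` of 0.5 is an assumed value, and the exponent 1.5 is borrowed from the SpikeM decay function that appears later in the deck.

```python
import math


def exponential_decay(t, rate=0.5):
    """Exponential drop, as in classical epidemic models (rate is assumed)."""
    return math.exp(-rate * t)


def power_law_decay(t, exponent=1.5):
    """Power-law drop, as observed for keyword mentions in social media."""
    return t ** (-exponent)


def loglog_slope(f, t1, t2):
    """Slope of f between t1 and t2 on log-log axes."""
    return (math.log(f(t2)) - math.log(f(t1))) / (math.log(t2) - math.log(t1))


for t in [1, 2, 4, 8, 16, 32]:
    print(t, exponential_decay(t), power_law_decay(t))

# A power law is a straight line on log-log axes; an exponential keeps steepening.
print(loglog_slope(power_law_decay, 2, 32))
print(loglog_slope(exponential_decay, 2, 4), loglog_slope(exponential_decay, 16, 32))
```

The power law has a much heavier tail: long after the peak it is still far above the exponential, which is why a purely epidemiological model underestimates late mentions.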
Flu Forecasting
• Using a combination of propagation patterns, develop a hidden flu-state topic model
• Learn “flu” vocabulary and transition probabilities
HFSTM Model (Details)
• Hidden Flu-State from Tweet Model (HFSTM)
– Each word (w) in a tweet (Oi) can be generated by:
• A background topic
• Non-flu related topics
• State related topics
[Figure: model diagram with labels: initial prob., transit. prob., transit. switch, latent state, binary non-flu related switch, binary background switch, word distribution]
HFSTM Model (Details)
• Generating tweets
– Generate the state for a tweet. State: [S,E,I]
– Generate the topic for a word. Topic: [Background, Non-flu, State]
• Example tweets by state
– S: This restaurant is really good
– E: The movie was good but it was freezing
– I: I think I have flu
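The two generation steps can be sketched as a toy sampler. This is a simplified illustration, not the HFSTM implementation: it draws one state per tweet from a Markov chain and then picks a topic per word with two binary switches. The transition switch from the model diagram is omitted, and every probability and word list in `params` is hypothetical.

```python
import random

STATES = ["S", "E", "I"]


def sample(dist, rng):
    """Draw one key from a {item: probability} dictionary."""
    items = list(dist)
    return rng.choices(items, weights=[dist[k] for k in items], k=1)[0]


def generate_tweets(n_tweets, words_per_tweet, params, rng):
    """Return a list of (state, words) pairs for one user's tweet sequence."""
    tweets = []
    state = sample(params["initial"], rng)
    for _ in range(n_tweets):
        words = []
        for _ in range(words_per_tweet):
            if rng.random() < params["p_background"]:    # binary background switch
                topic = params["background"]
            elif rng.random() < params["p_nonflu"]:      # binary non-flu switch
                topic = params["nonflu"]
            else:                                        # state-related topic
                topic = params["state_topics"][state]
            words.append(sample(topic, rng))
        tweets.append((state, words))
        state = sample(params["transition"][state], rng)  # next tweet's state
    return tweets


# All numbers and words below are made up for illustration.
params = {
    "initial": {"S": 0.8, "E": 0.15, "I": 0.05},
    "transition": {
        "S": {"S": 0.8, "E": 0.15, "I": 0.05},
        "E": {"S": 0.2, "E": 0.5, "I": 0.3},
        "I": {"S": 0.3, "E": 0.1, "I": 0.6},
    },
    "p_background": 0.3,
    "p_nonflu": 0.4,
    "background": {"the": 0.5, "is": 0.5},
    "nonflu": {"movie": 0.5, "restaurant": 0.5},
    "state_topics": {
        "S": {"good": 1.0},
        "E": {"freezing": 0.6, "tired": 0.4},
        "I": {"flu": 0.7, "fever": 0.3},
    },
}
tweets = generate_tweets(5, 4, params, random.Random(0))
for state, words in tweets:
    print(state, " ".join(words))
```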
Inference (Details)
• EM-based algorithm: HFSTM-FIT
– E-step:
• $A_t(i) = P(O_1, O_2, \ldots, O_t, S_t = i)$
• $B_t(i) = P(O_{t+1}, \ldots, O_{T_u} \mid S_t = i)$
• $\gamma_t(i) = P(S_t = i \mid O_u)$
– M-step:
• Other parameters such as state transition probabilities, topic distributions, etc.
– Parameters learned:
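The three E-step quantities have the same form as the forward, backward, and posterior terms of a standard hidden Markov model. The sketch below computes them with the textbook forward-backward recursion; it assumes the per-tweet likelihoods `emit[t][i]` = P(O_t | S_t = i) are already available, which in HFSTM would come from the topic model. The toy numbers are hypothetical, and this is not the HFSTM-FIT code itself.

```python
def forward_backward(initial, transition, emit):
    """Return (A, B, gamma) for an HMM with precomputed emission likelihoods."""
    n_steps, n_states = len(emit), len(initial)
    A = [[0.0] * n_states for _ in range(n_steps)]
    B = [[1.0] * n_states for _ in range(n_steps)]
    # Forward pass: A[t][i] = P(O_1..O_t, S_t = i)
    for i in range(n_states):
        A[0][i] = initial[i] * emit[0][i]
    for t in range(1, n_steps):
        for j in range(n_states):
            A[t][j] = emit[t][j] * sum(
                A[t - 1][i] * transition[i][j] for i in range(n_states))
    # Backward pass: B[t][i] = P(O_{t+1}..O_T | S_t = i)
    for t in range(n_steps - 2, -1, -1):
        for i in range(n_states):
            B[t][i] = sum(
                transition[i][j] * emit[t + 1][j] * B[t + 1][j]
                for j in range(n_states))
    # Posterior: gamma[t][i] = P(S_t = i | all observations)
    likelihood = sum(A[-1])
    gamma = [[A[t][i] * B[t][i] / likelihood for i in range(n_states)]
             for t in range(n_steps)]
    return A, B, gamma


# Toy inputs for states [S, E, I] and four tweets (made-up values).
initial = [0.8, 0.15, 0.05]
transition = [[0.8, 0.15, 0.05], [0.2, 0.5, 0.3], [0.3, 0.1, 0.6]]
emit = [[0.6, 0.3, 0.1], [0.3, 0.5, 0.2], [0.1, 0.3, 0.6], [0.1, 0.2, 0.7]]
A, B, gamma = forward_backward(initial, transition, emit)
print([max(range(3), key=lambda i: row[i]) for row in gamma])
```

For long tweet sequences the raw products underflow, so a practical version would rescale each step or work in log space.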
A possible issue with HFSTM
• Suffers from large, noisy vocabulary.
• Semi-supervision for improvement
– Introduce weak supervision into HFSTM.
HFSTM-A (Details)
[Chen et al. DAMI 2015]
• HFSTM-A(spect)
– Introduce an aspect variable y, expressing our belief on whether a word is flu-related or not.
– The value of y biases the switch variables s.t. flu-related words are more likely to be explained by state topics.
When the aspect value (y) is introduced, the switching probabilities are updated accordingly.
Vocabulary & Dataset
• Vocabulary (230 words):
– Flu-related keyword list by Chakraborty SDM 2014
– Extra state-related keyword list
• Dataset (34,000 tweets):
– Identify infected users and collect their tweets
– Train on data from Jun 20, 2013-Aug 06, 2013
– Test on two time periods:
• Dec 01, 2012-July 08, 2013
• Nov 10, 2013-Jan 26, 2014
Learned word distributions
• The most probable words learned in each state
Probably healthy: S
Having symptoms: E
Definitely sick: I
Learned state transition
Transition probabilities
Transition in real tweets
Learned by HFSTM:
Not directly flu-related,
yet correctly identified
Flu trend fitting
• Ground-truth:
– The Pan American Health Organization (PAHO)
• Algorithms:
– Baseline:
• Count the number of keywords weekly as features, and
regress to the ground-truth curve.
– Google Flu Trends:
• Take the Google Flu Trends data as input, regress to the PAHO curve.
– HFSTM:
• Distinguish different states of keywords, and only use the number of keywords in I state. Again regress to PAHO.
Flu trend fitting
• Linear regression to the case count reported by PAHO (the ground-truth)
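The fitting step amounts to ordinary least squares from a weekly keyword count to the weekly case count. A minimal sketch follows, assuming a single feature; the weekly numbers are invented for illustration and are not PAHO or Twitter data.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical weekly counts of I-state keywords and reported cases.
weekly_i_state_keywords = [12, 30, 45, 80, 60, 25]
weekly_reported_cases = [150, 320, 480, 830, 640, 270]

slope, intercept = fit_line(weekly_i_state_keywords, weekly_reported_cases)
fitted = [slope * k + intercept for k in weekly_i_state_keywords]
print(slope, intercept)
```

The baseline, Google Flu Trends, and HFSTM variants differ only in which series is used as `x`; the regression itself is the same.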
HFSTM-A
• Results are qualitatively similar to HFSTM when the vocabulary is 10 times larger.
Applications
Using propagation for _________
• Q1: Syndromic Surveillance
• Q2: Memes, Tweets, Blogs
• Q3: General Graph Mining
Memetracking
• Memes – a virally transmitted cultural symbol or social idea (first coined by Richard Dawkins in 1976)
• Usually text (a phrase) and/or an image
[Figure: a viral meme from 2012 Olympics, all the way to the White House]
Patterns
Imputation
Anomaly
Compression
Extrapolation
Google Search Volume
[Figure: search volume over time, annotated with (1) first spike, (2) release date, (3) two weeks before release; later values marked "?"]
e.g., given (1) first spike, (2) release date of two sequel movies, (3) access volume before the release date
Rise and fall patterns in social media
• Meme (# of mentions in blogs)
– short phrases sourced from U.S. politics in 2008
“you can put lipstick on a pig”
“yes we can”
Rise and fall patterns in social media
• Can we find a unifying model, which includes these patterns?
• four classes on YouTube [Crane et al. ’08]
• six classes on Meme [Yang et al. ’11]
[Figure: example rise-and-fall pattern classes; axis ticks from 0 to 100]
Rise and fall patterns in social media
• Answer: YES!
[Figure: six panels of Value vs. Time (0 to 120), each comparing Original and SpikeM]
• We can represent all patterns by a single model
In Matsubara+ SIGKDD 2012
Main idea - SpikeM
- 1. Un-informed bloggers (uninformed about rumor)
- 2. External shock at time $n_b$ (e.g., breaking news)
- 3. Infection (word-of-mouth)
[Figure: snapshots at Time n=0, Time n=n_b, Time n=n_b+1]
Infectiveness of a blog-post at age n:
$f(n) = \beta \cdot n^{-1.5}$
- $\beta$: Strength of infection (quality of news)
- $f(n)$: Decay function (how infective a blog posting is), a power law
-1.5 slope
J. G. Oliveira et al. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005).
(also in Leskovec, McGlohon+, SDM 2007)
SpikeM - with periodicity
• Full equation of SpikeM
$$\Delta B(n+1) = p(n+1) \cdot \left[ U(n) \cdot \sum_{t=n_b}^{n} \big( \Delta B(t) + S(t) \big) \cdot f(n+1-t) + \epsilon \right]$$
• Periodicity $p(n)$: bloggers change their activity over time (e.g., daily, weekly, yearly)
[Figure: activity p(n) over Time n; peak activity at 12pm, low activity at 3am]
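The SpikeM recursion can be simulated step by step. The sketch below is a hedged illustration rather than the authors' code: it assumes that U(n+1) = U(n) - ΔB(n+1) (informed bloggers leave the un-informed pool), that the shock S(t) equals `s_b` only at t = n_b, and it clamps each step to the remaining un-informed count. The sinusoidal `daily` function is a hypothetical choice for p(n); the slides do not give its form.

```python
import math


def spikem(n_steps, n_bloggers, beta, n_b, s_b, eps=0.0,
           periodicity=lambda n: 1.0):
    """Return delta_b, where delta_b[n] is the count of newly informed bloggers."""
    delta_b = [0.0] * n_steps
    uninformed = float(n_bloggers)                 # U(n)

    def decay(age):                                # f(age) = beta * age^-1.5
        return beta * age ** -1.5

    def shock(t):                                  # S(t): external shock at n_b
        return s_b if t == n_b else 0.0

    for n in range(n_steps - 1):
        pressure = sum((delta_b[t] + shock(t)) * decay(n + 1 - t)
                       for t in range(n_b, n + 1))
        new_posts = periodicity(n + 1) * (uninformed * pressure + eps)
        # Guard (an assumption of this sketch): never exceed the remaining pool.
        new_posts = min(max(new_posts, 0.0), uninformed)
        delta_b[n + 1] = new_posts
        uninformed -= new_posts                    # assumed update of U
    return delta_b


# Hypothetical parameters.
flat = spikem(n_steps=100, n_bloggers=1000, beta=0.001, n_b=10, s_b=10)


def daily(n):
    """Made-up 24-step cycle that damps activity by up to 40%."""
    return 1.0 - 0.5 * 0.4 * (math.sin(2 * math.pi * n / 24) + 1.0)


periodic = spikem(n_steps=100, n_bloggers=1000, beta=0.001, n_b=10, s_b=10,
                  periodicity=daily)
print(max(flat), max(periodic))
```

Nothing happens before the shock at `n_b` when `eps` is zero; afterwards activity rises by word-of-mouth and falls as the un-informed pool empties and old posts lose infectiveness.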
Tail-part forecasts
• SpikeM can capture the tail part
“What-if” forecasting
[Figure: time series annotated with (1) first spike, (2) release date, (3) two weeks before release; later values marked "?"]
e.g., given (1) first spike, (2) release date of two sequel movies, (3) access volume before the release date
“What-if” forecasting
• SpikeM can forecast not only tail-part, but also rise-part!
• SpikeM can forecast upcoming spikes
[Figure: forecast annotated with (1) first spike, (2) release date, (3) two weeks before release]
Bonus: Protest Predictions
[Sundareisan et al. ASONAM 2014]
[Jin et al. SIGKDD 2014]
• Can Twitter provide a lead time?
• South American Twitter dataset
– Language: Spanish/Portuguese
– Idea
1. Look for trending keywords.
2. Predict event type for protest using SpikeM parameters!
[Figure: a political tweet; classes: Violent Protest (VP) and Non Violent Protest (P)]
Propagation and Cyber-Security: Temporal Patterns
[Papalexakis et al. ASONAM 2013]
Looks familiar?
Propagation and Cyber-Security: Ensemble Models
[Chan et al. WSDM 2016]
Applications
Using propagation for _________
• Q1: Syndromic Surveillance
• Q2: Memes, Tweets, Blogs
• Q3: General Graph Mining
Example 1: Missing data correction
Real data is noisy!
We don’t know exactly who is infected
• Epidemiology
– Public-health surveillance
[Figure: Surveillance Pyramid [Nishiura+, PLoS ONE 2011] with levels CDC, Lab, Hospital, and CNN headlines; some levels marked "Not sure" or "?"]
Each level has a certain probability to miss some truly infected people
Real data is noisy!
Correcting missing data is by itself very important
• Social Media
– Twitter: due to the uniform samples [Morstatter+, ICWSM 2013], the relevant ‘infected’ tweets may be missed
[Figure: Tweets pass through Sampling to give Sampled Tweets; some tweets are Missing ("?")]
The Problem
[Sundareisan, Vreeken, Prakash SDM 2015]
[Rozenshtein et al. SIGKDD 2016]
• GIVEN:
– Graph G(V, E) from historical data
– Infected set $D \subset V$, sampled (p%) and incomplete
– Infectivity β of the virus
• FIND:
– Seed set i.e. patient zeros/culprits
– Set $C^-$ (the missing infected nodes)
– Ripple R (the order of infections)
Visualizing Performance (Grid connected)
[Figure: four panels (Simulation, NetSleuth, Frontier, NetFill), each marking Seeds and Missing nodes. Legend: Correct, FP, FN, Infected]
Meme-Tracker – case study
• 96,000 node graph for the meme “State of the economy”
• Found missing websites like “www.nbcbayarea.com”, “chicagotribune.com” and some blog posts.
Example 2: “Zoom-out” of the network
• “Zoom-out” of the cascade graph to get a quick picture (= summarization)
[Figure: Coarsening (“Zoom-out”) of a big graph with nodes A to F into a smaller representation of the network]
[Purohit, Prakash, et al. SIGKDD 2014]
CoarseNet: algorithm
• Steps
1. Compute scores for all edge pairs
2. Merge nodes with smallest score
3. Go to step 1 until αn nodes left
[Figure: assigning scores and merging edges; Original Network (weight=0.5) and Coarsened Network]
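The three-step loop can be sketched on a toy graph. The slides do not define the score or the merge rule, so this sketch makes two loud assumptions: an edge's score is the change in the leading eigenvalue of the weighted adjacency matrix when its endpoints are merged, and a merged node keeps the larger weight when both endpoints shared a neighbor. It recomputes every score in each round, which is only feasible for tiny graphs and is not the actual CoarseNet procedure.

```python
def leading_eigenvalue(graph, iterations=200):
    """Power iteration on (A + I) for an undirected weighted graph."""
    if not graph:
        return 0.0
    vec = {v: 1.0 for v in graph}
    value = 0.0
    for _ in range(iterations):
        nxt = {v: vec[v] + sum(w * vec[u] for u, w in graph[v].items())
               for v in graph}
        norm = max(nxt.values())
        vec = {v: x / norm for v, x in nxt.items()}
        value = norm - 1.0          # undo the +I shift
    return value


def merge(graph, a, b):
    """Contract edge (a, b) into one supernode (merge rule is an assumption)."""
    merged = {v: {u: w for u, w in nbrs.items() if u not in (a, b)}
              for v, nbrs in graph.items() if v not in (a, b)}
    name = a + "+" + b
    merged[name] = {}
    for old in (a, b):
        for u, w in graph[old].items():
            if u in (a, b):
                continue
            w_new = max(w, merged[name].get(u, 0.0))
            merged[name][u] = w_new
            merged[u][name] = w_new
    return merged


def coarsen(graph, alpha):
    """Greedily merge the lowest-score edge until alpha * n nodes are left."""
    target = max(1, int(alpha * len(graph)))
    while len(graph) > target:
        edges = {tuple(sorted((u, v))) for u in graph for v in graph[u]}
        if not edges:
            break
        base = leading_eigenvalue(graph)
        best = min(edges, key=lambda e: (
            abs(base - leading_eigenvalue(merge(graph, *e))), e))
        graph = merge(graph, *best)
    return graph


# Toy graph with nodes A to F and weight 0.5 on every edge.
toy = {
    "A": {"B": 0.5, "C": 0.5},
    "B": {"A": 0.5, "C": 0.5},
    "C": {"A": 0.5, "B": 0.5, "D": 0.5},
    "D": {"C": 0.5, "E": 0.5, "F": 0.5},
    "E": {"D": 0.5, "F": 0.5},
    "F": {"D": 0.5, "E": 0.5},
}
coarse = coarsen(toy, alpha=0.5)
print(sorted(coarse))
```

Scoring by eigenvalue change reflects the idea that the coarsened graph should keep the diffusion behavior of the original one.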
Application 1: Influence Maximization
• Methodology:
Step 1: Coarsen the large social network using CoarseNet
Step 2: Solve influence maximization on the coarsened network
Step 3: Randomly select one node from each selected “supernode”
We call it CSPIN
[Figure: example on a graph with nodes A to F. Step 1: Coarsen. Step 2: Solve influence maximization. Step 3: Randomly select one node from C]
Application 2: Diffusion Characterization
• Goal: use Graph Coarsening to understand information cascades
• Dataset: Flixster
– a friendship network with movie ratings
– Cascade: the same movie rating from friends
• Methodology
– coarsen the network using CoarseNet with the reduction factor α=0.5
– study the formed groups (supernodes)
– Can get non-network surrogates
Diffusion observation
Stats:
• 1891 groups
• mean group size: 16.6
• the largest group: 22061 nodes (roughly 40% of nodes)
Observation 1: a very large fraction of movies propagate in a small number of groups
Observation 2: a multi-modal distribution
Things I won’t talk about
Theory
Fundamental Models
Understanding
Main questions
1. When will a virus take-off on a network? [ICDM 2012]
2. What happens if the networks vary with time? [PKDD 2010]
3. What happens if multiple viruses compete? (‘winner-takes-all’) [WWW 2012]
More…
4. Interacting viruses → Phase Transition for co-existence vs extinction
[SIGKDD 2012]
5. Composite Networks (e.g. communication vs power-grid networks) → depends on the networks
[IEEE J. on Selected Areas in Comm. (JSAC) 2013]
Algorithms
Policy/Action
Managing/Manipulating
Alg 1: Immunization (= Interventions)
• Different Flavors:
– Pre-emptive
– Data-aware
Immunizations as Network manipulation
• Node based [Tong, P., + ICDM 2010]
• Edge-based [Tong, P., + CIKM 2012, Best Paper Award]
• Edge-Manipulation [P., Adamic+ SDM 2013]
Latest results
• First (provable) approximation algorithms for edge-based problem [Saha, Adiga, P., Vullikanti SDM 2015]
– O(log^2 n)-factor (can be improved to O(log n))
• Based on the idea of removing closed walks
– Semi-Definite Programming Rounding-based O(1) factor
Data-aware Immunization
[Zhang and Prakash, SDM 2014]
[Zhang and Prakash, TKDD 2015]
Given: Graph and Infected nodes
Find: ‘best’ nodes for immunization
• Complexity
– NP-hard
– Hard to approximate within an absolute error
• DAVA-tree
– Optimal solution on the tree
• DAVA and DAVA-fast
– Merging infected nodes
– Build a “dominator tree”, and run DAVA-tree
• Running time: subquadratic
– DAVA: O(k(|E|+ |V|log|V|))
– DAVA-fast: O(|E|+|V|log|V|)
[Figure: graph with infected nodes and its dominator tree]
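The two preprocessing steps named above (merge the infected nodes, then build a dominator tree) can be sketched as follows. This is an illustration under assumptions, not the DAVA code: it uses the simple iterative set-based dominator computation rather than a fast algorithm, treats each undirected edge as two directed edges, and the super-node name `MERGED` is hypothetical. The DAVA-tree step that would run on the result is not shown.

```python
MERGED = "infected*"   # hypothetical name for the merged infected super node


def dominator_tree(graph, infected):
    """Return {node: immediate dominator} for nodes reachable from MERGED."""
    # Step 1: merge all infected nodes into a single root.
    adj = {MERGED: set()}
    for v, nbrs in graph.items():
        src = MERGED if v in infected else v
        adj.setdefault(src, set())
        for u in nbrs:
            dst = MERGED if u in infected else u
            adj.setdefault(dst, set())
            if dst != src:
                adj[src].add(dst)
    # Keep only nodes the infection can reach.
    reachable, stack = {MERGED}, [MERGED]
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u not in reachable:
                reachable.add(u)
                stack.append(u)
    preds = {v: {p for p in reachable if v in adj[p]} for v in reachable}
    # Step 2: iterate dom(v) = {v} | intersection of dom(p) over predecessors p.
    dom = {v: set(reachable) for v in reachable}
    dom[MERGED] = {MERGED}
    changed = True
    while changed:
        changed = False
        for v in reachable - {MERGED}:
            new = {v} | set.intersection(*(dom[p] for p in preds[v]))
            if new != dom[v]:
                dom[v] = new
                changed = True
    # The immediate dominator is the strict dominator closest to v.
    return {v: max(dom[v] - {v}, key=lambda d: len(dom[d]))
            for v in reachable - {MERGED}}


# Toy graph: node "a" is infected; "e" can be reached through "c" or "d".
toy_graph = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "e"},
    "d": {"b", "e"},
    "e": {"c", "d"},
}
idom = dominator_tree(toy_graph, infected={"a"})
print(idom)
```

In the toy graph every path from the infection to `c`, `d`, and `e` passes through `b`, so `b` is their immediate dominator, which marks it as a natural candidate for immunization.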
Extensions
• Can be extended to uncertain and noisy initial data as well! [Zhang and Prakash, CIKM 2014]
[Figure: Twitter Firehose, API, 1% sample]
Group-based Immunization
[Zhang, Adiga, Vullikanti, Prakash, 2015]
How to select groups to minimize the epidemic?
• Epidemiology
– People are grouped by ages, demographics, occupations …
• Social Media
– Friends are grouped by the same interests
– E.g., Facebook pages
[Figure: example graph with nodes A to F]
Results: First approximation algorithms for the problem
Conclusion: Theme
DATA: Large real-world networks & processes
ANALYSIS: Understanding
POLICY/ACTION: Managing
Scalability – Big Data
• Datasets of unprecedented scale
– High dimensionality and sample size!
• Need scalable algorithms for
– Learning Models
– Developing Policy
• Leverage parallel systems
– Map-Reduce clusters (like Hadoop) for data-intensive jobs (more than 6000 machines)
– Parallelized compute-intensive simulations (like Condor)
Effect on Community Structure
• Example: Twitter network where tweets diffuse over followee-follower network
[Figure: three panels. Original Network: influential users (“kernels") and users who boost the diffusion ("bridges/media nodes"). Communities detected by NEWMAN’s algorithm: cannot capture different roles in diffusion! Ideal communities and roles of nodes: kernel, media, ordinary nodes]
Summarization and Segmentation
• Automatic segmentation?
• Segment cascades?
[Figure: clipped paper excerpt, including a caption that begins "Fig. 6: M DSA S segmentation result for Peru: word clouds"]
Extensions
• Temporal graphs
• Noisy data
• Incorporating Richer Attributed graphs
• Heterogeneous graphs
• ….
Propagation on Networks
[Diagram: connected fields: Theory & Algo., Biology, Physics, Comp. Systems, ML & Stats., Social Science, Econ.]
Acknowledgements
Collaborators
Deepayan Chakrabarti,
Hanghang Tong,
Kunal Punera,
Ashwin Sridharan,
Sridhar Machiraju,
Mukund Seshadri,
Alice Zheng,
Lei Li,
Polo Chau,
Nicholas Valler,
Alex Beutel,
Xuetao Wei
Christos Faloutsos
Roni Rosenfeld,
Michalis Faloutsos,
Lada Adamic,
Theodore Iwashyna (M.D.),
Dave Andersen,
Tina Eliassi-Rad,
Iulian Neamtiu,
Varun Gupta,
Jilles Vreeken,
V. S. Subrahmanian
John Brownstein (M.D.)
Acknowledgements
• Students
Liangzhe Chen
Shashidhar Sundareisan
Benjamin Wang
Yao Zhang
Sorour Amiri
Bijaya Adhikari
Acknowledgements
Funding
Propagation for Data Mining
B. Aditya Prakash
http://www.cs.vt.edu/~badityap
Data
Analysis
Policy/Action