Mining Billion Node Graphs
Christos Faloutsos
CMU

CONGRATULATIONS!

Outline
• Q+A
• Problem definition / Motivation
• Graphs and power laws
• Streams, environment, data center monitoring
• Conclusions

Q+A
• Are you recruiting? How many?
• How many do you have?
• How frequently do you meet them?
• What is your advising style?
• How do you feel about summer internships?

Q+A
• Are you recruiting? How many? -> Yes, 1-2
• How many do you have? -> 5+2
• How frequently do you meet them? -> 1/week
• What is your advising style?
• How do you feel about summer internships? -> Yes/Maybe (Y!, G, MSR, IBM, ++)

Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns and anomalies
  – Scalability and ‘hadoop’
  – Influence / virus propagation
• Streams, environment, data center monitoring
• Conclusions

Motivation
• Data mining: ~ find patterns (rules, outliers)
• What do real graphs look like? Anomalies?
  – Virus/influence propagation
• Time series / env. monitoring
[figure: temperature in a datacenter]

Graphs - why should we care?

Graphs - why should we care?
Friendship Network [Moody ’01]

Graphs - why should we care?
Food Web [Martinez ’91]
Friendship Network [Moody ’01]
Internet Map [lumeta.com]

Problem #1 - network and graph mining
• What does the Internet look like?
• What does Facebook look like?
• What is ‘normal’/‘abnormal’?
• Which patterns/laws hold?
  – To spot anomalies (rarities), we have to discover patterns
  – Large datasets reveal patterns/anomalies that may be invisible otherwise…

Graph mining
• Are real graphs random?

Laws and patterns
NO!!
• Diameter
• in- and out-degree distributions
• other (surprising) patterns

Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns and anomalies
  – Scalability and ‘hadoop’
  – Influence / virus propagation
• Streams, environment, data center monitoring
• Conclusions

S1 – degree distributions
• Q: avg degree is ~3 - what is the most probable degree?
[plot: count vs degree, with ‘??’ marked near degree 3]

S1 – degree distributions
• Q: avg degree is ~3 - what is the most probable degree?
[two plots: count vs degree, with ‘??’ and ‘3’ marked]

Solution:
[log-log plot: frequency vs out-degree (Nov ’97); the exponent is the slope, -2.15]
The plot is linear in log-log scale [FFF’99]
freq = degree^(-2.15)

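A minimal sketch (not part of the original foils) of the check behind this slide: histogram the degrees, fit a line in log-log scale, and read the exponent off the slope (about -2.15 for the Nov ’97 Internet out-degrees). networkx and a toy preferential-attachment graph are stand-ins for the real data.

import numpy as np
import networkx as nx
from collections import Counter

G = nx.barabasi_albert_graph(100_000, 2, seed=0)        # heavy-tailed toy graph
hist = Counter(d for _, d in G.degree())                # degree -> count

deg = np.array(sorted(hist), dtype=float)
cnt = np.array([hist[int(d)] for d in deg], dtype=float)

# least-squares line in log-log space: log10(count) ~ slope * log10(degree) + b
slope, _ = np.polyfit(np.log10(deg), np.log10(cnt), 1)
print(f"estimated power-law exponent: {slope:.2f}")     # the foils report -2.15
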
Solution# S.2: Triangle ‘Laws’
• Real social networks have a lot of triangles

Solution# S.2: Triangle ‘Laws’
• Real social networks have a lot of triangles
  – Friends of friends are friends
• Any patterns?

Triangle Law: #S.2 [Tsourakakis ICDM 2008]
[plot: Reuters graph; X-axis: degree, Y-axis: mean # triangles]
n friends -> ???? triangles

Triangle Law: #S.2 [Tsourakakis ICDM 2008]
[plots: Reuters, SN, Epinions; X-axis: degree, Y-axis: mean # triangles]
n friends -> ~n^1.6 triangles

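A minimal sketch (not part of the original foils) of how such a triangle-law plot is made: average the per-node triangle counts for each degree value and fit the slope in log-log scale; the foils report roughly n^1.6 on real social networks. networkx and a toy clustered power-law graph are assumptions, not the Reuters/SN/Epinions data.

import numpy as np
import networkx as nx
from collections import defaultdict

G = nx.powerlaw_cluster_graph(30_000, 3, 0.3, seed=0)
tri = nx.triangles(G)                                   # node -> #triangles through it

by_deg = defaultdict(list)
for v, d in G.degree():
    by_deg[d].append(tri[v])

deg = np.array([d for d in sorted(by_deg) if np.mean(by_deg[d]) > 0], dtype=float)
mean_tri = np.array([np.mean(by_deg[int(d)]) for d in deg])

slope, _ = np.polyfit(np.log10(deg), np.log10(mean_tri), 1)
print(f"mean #triangles ~ degree^{slope:.2f}")          # foils: ~1.6
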
Triangle counting for large graphs?
Anomalous nodes in Twitter (~3 billion edges)
[U Kang, Brendan Meeder, +, PAKDD’11]

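For scale, the eigenvalue trick of [Tsourakakis ICDM 2008] (cited two slides up) counts triangles from the top adjacency eigenvalues: the total is (1/6) * sum(lambda_i^3), and a few eigenvalues already give a good approximation. Below is a minimal sketch of that idea on a toy graph; the networkx/scipy calls and the graph are assumptions, not the Twitter pipeline of [Kang, Meeder, +, PAKDD’11].

import networkx as nx
from scipy.sparse.linalg import eigsh

G = nx.powerlaw_cluster_graph(20_000, 5, 0.2, seed=0)
A = nx.to_scipy_sparse_array(G, dtype=float)            # symmetric adjacency matrix

vals = eigsh(A, k=20, return_eigenvectors=False)        # top-20 eigenvalues by magnitude
approx = float((vals ** 3).sum()) / 6.0                 # trace(A^3)/6 ~ total #triangles

exact = sum(nx.triangles(G).values()) / 3.0             # each triangle counted at 3 nodes
print(f"approx: {approx:,.0f}   exact: {exact:,.0f}")
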
But:
• Q1: How about graphs from other domains?
• Q2: How about temporal evolution?

Time evolution
• with Jure Leskovec (CMU -> Stanford)
• and Jon Kleinberg (Cornell)
(‘best paper’ KDD05)

T1 - Evolution of the Diameter
• Prior work on power-law graphs hints at slowly growing diameter:
  – diameter ~ O(log N)
  – diameter ~ O(log log N)
• What is happening in real data?

T1 - Evolution of the Diameter
• Prior work on power-law graphs hints at slowly growing diameter:
  – diameter ~ O(log N)
  – diameter ~ O(log log N)
• What is happening in real data?
• Diameter shrinks over time
  – As the network grows, the distances between nodes slowly decrease

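A minimal sketch (not part of the original foils) of the measurement behind the shrinking-diameter plots on the next slides: build one growing snapshot per year and estimate the effective diameter (90th-percentile hop distance, sampled) on each. The yearly edge lists (edges_by_year) are a hypothetical input, not the ArXiv or patent data; networkx is an assumption.

import random
import networkx as nx

def effective_diameter(G, samples=500, q=0.9):
    """Approximate q-quantile of pairwise hop distances via BFS from sampled sources."""
    nodes = list(G.nodes())
    dists = []
    for s in random.sample(nodes, min(samples, len(nodes))):
        dists.extend(nx.single_source_shortest_path_length(G, s).values())
    dists = sorted(d for d in dists if d > 0)
    return dists[int(q * (len(dists) - 1))] if dists else 0

G = nx.Graph()
for year, edges in sorted(edges_by_year.items()):       # hypothetical {year: [(u, v), ...]}
    G.add_edges_from(edges)                             # the network grows over time
    print(year, effective_diameter(G))                  # the foils: this value shrinks
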
Diameter – ArXiv citation graph
• Citations among physics papers
• 1992–2003
• One graph per year
[plot: diameter vs time [years]]

Diameter – “Patents”
• Patent citation network
• 25 years of data
[plot: diameter vs time [years]]

And many more patterns…
• #nodes vs #edges (power law (!)) (see the sketch below)
• # connected components (power law, too)
• contact / phone-call duration (log-logistic)
• total node weight vs #edges (superlinear / power law)
• …

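A minimal sketch (not part of the original foils) of the first bullet, the ‘#nodes vs #edges’ (densification) power law: over growing snapshots, fit the slope of log(#edges) vs log(#nodes); a slope above 1 means the graph densifies. The snapshots list (networkx graphs ordered by time) is a hypothetical input.

import numpy as np

n = np.array([g.number_of_nodes() for g in snapshots], dtype=float)   # hypothetical snapshots
e = np.array([g.number_of_edges() for g in snapshots], dtype=float)

slope, _ = np.polyfit(np.log10(n), np.log10(e), 1)
print(f"E ~ N^{slope:.2f}")                                           # densification exponent
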
Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns and anomalies
  – Scalability and ‘hadoop’
  – Influence / virus propagation
• Streams, environment, data center monitoring
• Conclusions

E-bay Fraud detection
w/ Polo Chau & Shashank Pandit, CMU [WWW’07]

E-bay Fraud detection - NetProbe

Popular press
And less desirable attention:
• E-mail from ‘Belgium police’ (‘copy of your code?’)

Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns and anomalies
  – Scalability and ‘hadoop’
  – Influence / virus propagation
• Streams, environment, data center monitoring
• Conclusions

Scalability
• Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture”, IEEE Micro 2003]
• Yahoo: 5Pb of data [Fayyad, KDD’07]
• Problem: machine failures, on a daily basis
• How to parallelize data mining tasks, then?
• A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/

details
[diagram: map/reduce data flow. The user program forks a Master; the Master assigns map and reduce tasks; Mappers read input splits from HDFS and write intermediate results locally; Reducers do a remote read and sort, then write the output files]
• By default: 3-way replication
• Late/dead machines: ignored, transparently (!)

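A minimal sketch (not part of the original foils) of the map/reduce flow above, in Hadoop Streaming style: the mapper emits (src, 1) for every edge, Hadoop sorts by key between the two phases, and the reducer sums runs of equal keys to get out-degrees. The two scripts, their names, and the tab-separated edge-list format are assumptions; they would be handed to the hadoop-streaming jar via its -mapper/-reducer/-input/-output options.

# mapper.py  (first file)
import sys
for line in sys.stdin:
    src, dst = line.split()
    print(f"{src}\t1")                     # emit (source node, 1) per edge

# reducer.py  (second file; input arrives sorted by key)
import sys
cur, total = None, 0
for line in sys.stdin:
    key, val = line.rstrip("\n").split("\t")
    if key != cur:
        if cur is not None:
            print(f"{cur}\t{total}")       # out-degree of the previous node
        cur, total = key, 0
    total += int(val)
if cur is not None:
    print(f"{cur}\t{total}")
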
HADI for diameter estimation
• “Radius Plots for Mining Tera-byte Scale Graphs”, U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10
• Naively: diameter needs O(N^2) space and up to O(N^3) time – prohibitive (N ~ 1B)

HADI for diameter estimation
• “Radius Plots for Mining Tera-byte Scale Graphs”, U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10
• Naively: diameter needs O(N^2) space and up to O(N^3) time – prohibitive (N ~ 1B)
• Our HADI: linear on E (~10B)
  – Near-linear scalability wrt # machines
  – Several optimizations -> 5x faster

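HADI distributes a Flajolet-Martin style iteration with Hadoop; below is a minimal single-machine sketch (not the actual HADI code) of that idea: each node keeps K small bitstring sketches, and OR-ing sketches along edges for h rounds estimates |N(v, h)|, the number of nodes within h hops, without O(N^2) space. The constants (K, the 0.77351 correction) follow the classic FM estimator; the adjacency-dict input is an assumption.

import random

K, BITS = 16, 32

def geom_pos():
    """Bit position with P(pos = i) = 2^-(i+1)."""
    pos = 0
    while random.random() < 0.5 and pos < BITS - 1:
        pos += 1
    return pos

def lowest_zero(x):
    i = 0
    while (x >> i) & 1:
        i += 1
    return i

def estimate(sketches):
    """FM estimate of the number of distinct nodes folded into the sketches."""
    b = sum(lowest_zero(s) for s in sketches) / len(sketches)
    return (2 ** b) / 0.77351

def neighborhood_function(adj, max_h=10):
    """adj: dict node -> set of neighbors; returns {h: sum over v of est. |N(v, h)|}."""
    sk = {v: [1 << geom_pos() for _ in range(K)] for v in adj}   # each node starts as one element
    nf = {}
    for h in range(1, max_h + 1):
        new = {}
        for v, nbrs in adj.items():
            cur = list(sk[v])
            for u in nbrs:
                for i in range(K):
                    cur[i] |= sk[u][i]                           # fold in the neighbor's sketch
            new[v] = cur
        sk = new
        nf[h] = sum(estimate(s) for s in sk.values())
    return nf
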
[radius plot: count vs radius; earlier web estimate: diameter 19+ [Barabasi+], ~1999, ~1M nodes; ‘??’ for the full graph]
YahooWeb graph (120GB, 1.4B nodes, 6.6B edges)
• Largest publicly available graph ever studied.

[radius plot of the YahooWeb graph: 14 (directed), ~7 (undirected), vs 19+? [Barabasi+]]
YahooWeb graph (120GB, 1.4B nodes, 6.6B edges)
• Largest publicly available graph ever studied.
• 7 degrees of separation (!)
• Diameter: shrunk

YahooWeb graph (120GB, 1.4B nodes, 6.6B edges)
Q: Shape?
[radius plot: count vs radius, ~7 (undirected)]
• effective diameter: surprisingly small
• Multi-modality (?!)

Radius Plot of GCC of YahooWeb
YahooWeb graph (120GB, 1.4B nodes, 6.6B edges)
• effective diameter: surprisingly small
• Multi-modality: probably a mixture of cores

Conjecture:
[radius plot, annotated: conjectured separate cores such as EN, DE, BR; ~7]
YahooWeb graph (120GB, 1.4B nodes, 6.6B edges)
• effective diameter: surprisingly small
• Multi-modality: probably a mixture of cores

Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns and anomalies
  – Scalability and ‘hadoop’
  – Influence / virus propagation
• Streams, environment, data center monitoring
• Conclusions

Immunization and epidemic thresholds
• Q1: which nodes to immunize?
• Q2: will a virus vanish, or will it create an epidemic?

Q1: Immunization:
• Given a network,
• k vaccines, and
• the virus details
• Which nodes to immunize?
(with Aditya Prakash)
[figure: example network, candidate nodes marked with ‘?’]

Q1: Immunization:
• Given a network,
• k vaccines, and
• the virus details
• Which nodes to immunize?
A: immunize the ones that maximally raise the ‘epidemic threshold’ (which depends on λ1, the top adjacency eigenvalue) [Tong+, ICDM’10]

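A minimal sketch (not the actual NetShield algorithm of [Tong+, ICDM’10]) of the λ1 criterion on this slide: the epidemic threshold rises as λ1, the top adjacency eigenvalue, drops, so a brute-force greedy baseline picks, k times, the node whose removal lowers λ1 the most. networkx/scipy usage and the toy graph are assumptions.

import networkx as nx
from scipy.sparse.linalg import eigsh

def lambda1(G):
    """Largest-magnitude adjacency eigenvalue (the Perron eigenvalue for connected graphs)."""
    A = nx.to_scipy_sparse_array(G, dtype=float)
    return eigsh(A, k=1, return_eigenvectors=False)[0]

def greedy_immunize(G, k):
    """Pick k nodes, each time the one whose removal lowers lambda_1 the most."""
    G = G.copy()
    chosen = []
    for _ in range(k):
        best = min(G.nodes(), key=lambda v: lambda1(nx.restricted_view(G, [v], [])))
        chosen.append(best)
        G.remove_node(best)
    return chosen

G = nx.barabasi_albert_graph(500, 3, seed=0)
print(greedy_immunize(G, k=3))                      # the k vaccine targets
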
Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns and anomalies
  – Scalability and ‘hadoop’
  – Influence / virus propagation
• Streams, environment, data center monitoring
• Conclusions

Datacenter Monitoring & Management
• Goal: save energy in data centers
  – US alone: $7.4B power consumption (2011)
• Challenge:
  – 1TB per day
  – complex cyber-physical systems
(with Lei Li)
[figure: temperature in a datacenter]

OVERALL CONCLUSIONS – high level
[diagram connecting: graphs / social nets; databases, map/reduce; data center monitoring; big data / analytics; environmental data monitoring; cyber-security; fraud detection; health DBs]

All these projects require all three:
• Theory (e.g., eigenvalues, tensors, Kalman filters, wavelets)
• Practice (e.g., PIG, hadoop 0.20, >120GB of data, often TB)
• Domain knowledge (e.g., Navier-Stokes, Volterra-Lotka, etc.)

Project info
www.cs.cmu.edu/~pegasus
Koutra, Danae; Chau, Polo; Akoglu, Leman; Kang, U; Prakash, Aditya; (McGlohon, Mary); (Tong, Hanghang)
Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab

Contact info
• www.cs.cmu.edu/~christos
• GHC 8019
• Ph#: x8.1457
• Course: 15-826, Tu-Th 12-1:20
• and, again: WELCOME!