CMU SCS
Big (graph) data analytics
Christos Faloutsos
CMU
CONGRATULATIONS!
IC '13
C. Faloutsos
Outline
• Q+A
• Problem definition / Motivation
• Graphs and power laws
• Anomaly/fraud detection
• Conclusions
Q+A
• Are you recruiting? How many? A: Maybe, ~1
• How many do you have? A: 4 (+5 postdocs)
• How frequently do you meet them? A: 1/week
• What is your advising style? A: results
• How do you feel about summer internships? A: Yes/Maybe (FB, MSR, IBM, ++)
Outline
• Problem definition / Motivation
• Graphs and power laws
– Patterns
– Scalability and ‘hadoop’
• Anomaly detection
• Conclusions
Motivation
• Data mining: ~ find patterns (rules, outliers)
• What do real graphs look like? Anomalies?
– Virus/influence propagation
• Time series / environmental monitoring
[Figure: temperature in datacenter]
Graphs - why should we care?
Food Web [Martinez ’91]
~1B users, $10-$100B revenue
Internet Map [lumeta.com]
Tensors: Graphs on steroids
• Tensors (=multi-dimensional arrays)
– Predicates (subject, verb, object) in knowledge base
– Examples: “Eric Clapton plays guitar”, “Barack Obama is the president of U.S.”
– NELL (Never Ending Language Learner) data: tensor modes of size (48M), (26M), (26M); nonzeros = 144M
Vagelis Papalexakis (CMU-CS), Tom Mitchell (CMU/CS-MLD)
Open House, 2013
C. Faloutsos (CMU)
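To make the "tensor = multi-dimensional array" idea concrete, here is a minimal sketch of how (subject, verb, object) predicates could be stored as a sparse 3-mode tensor in coordinate form. The function name `build_sparse_tensor` and the dictionary-based layout are illustrative assumptions, not the format used for the NELL data.

```python
from collections import defaultdict

def build_sparse_tensor(triples):
    """Map (subject, verb, object) strings to integer coordinates.

    Returns (index, nonzeros): `index` holds one vocabulary per mode,
    `nonzeros` maps a coordinate (i, j, k) to how often the triple occurred.
    Hypothetical helper for illustration only.
    """
    index = [dict(), dict(), dict()]  # one term -> id map per mode
    nonzeros = defaultdict(int)
    for triple in triples:
        # setdefault assigns the next free id the first time a term is seen
        coord = tuple(
            index[mode].setdefault(term, len(index[mode]))
            for mode, term in enumerate(triple)
        )
        nonzeros[coord] += 1
    return index, dict(nonzeros)

triples = [
    ("Eric Clapton", "plays", "guitar"),
    ("Barack Obama", "is the president of", "U.S."),
    ("Eric Clapton", "plays", "guitar"),  # repeated observation
]
index, nonzeros = build_sparse_tensor(triples)
shape = tuple(len(vocab) for vocab in index)
print(shape, nonzeros)
```

Only the nonzero cells are stored, which is what makes a tensor with tens of millions of entries per mode and ~144M nonzeros representable at all.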
Concept Discovery
• Concept Discovery in Knowledge Base
NP1: Internet, file, data
NP2: Protocol, software, suite
‘NeuroSemantics’
>200GB total
Experiments
• GigaTensor solves 100x larger problem
[Figure: scalability plot over tensor dimensions (I), (J), (K), with number of nonzeros = I / 50; labels: GigaTensor, 100x, Out of Memory]
Problem #1 - network and graph mining
• What does the Internet look like?
• What does Facebook look like?
• What is ‘normal’/‘abnormal’?
• Which patterns/laws hold?
– To spot anomalies (rarities), we have to discover patterns
– Large datasets reveal patterns/anomalies that may be invisible otherwise…
Graph mining
• Are real graphs random?
Laws and patterns
NO!!
• Diameter
• in- and out- degree distributions
• other (surprising) patterns
Outline
• Problem definition / Motivation
• Graphs and power laws
– Patterns
– Scalability and ‘hadoop’
• Anomaly/Fraud detection
• Conclusions
S1 – degree distributions
• Q: avg degree is ~3 - what is the most probable degree?
[Figure: two sketches of count vs. degree, each with degree 3 marked on the X-axis; one is labeled ‘??’]
Solution:
[Figure: log-log plot of frequency vs. outdegree, Nov’97; exponent = slope, O = -2.15]
The plot is linear in log-log scale [FFF’99]
freq = degree^{-2.15}
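As a rough illustration of "exponent = slope", the sketch below fits a straight line to the degree histogram in log-log scale. Least squares on the log-log histogram is just one simple way to estimate the slope and is assumed here for illustration; the function name `loglog_slope` and the synthetic data (built with exponent -2, not the -2.15 reported above) are hypothetical.

```python
import numpy as np

def loglog_slope(degrees):
    """Fit log10(count) = slope * log10(degree) + intercept."""
    values, counts = np.unique(degrees, return_counts=True)
    mask = values > 0  # log is undefined for degree 0
    slope, intercept = np.polyfit(
        np.log10(values[mask]), np.log10(counts[mask]), 1
    )
    return slope, intercept

# Synthetic histogram that follows count = 400 * degree^-2 exactly.
degree_values = [1, 2, 4, 5, 10, 20]
degree_counts = [400, 100, 25, 16, 4, 1]
degrees = np.repeat(degree_values, degree_counts)

slope, intercept = loglog_slope(degrees)
print(f"estimated exponent: {slope:.2f}")
```

Note that under such a skewed distribution the most probable degree is the smallest one, far from the average, which is the point of the quiz question on the previous slide.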
Solution# S.2: Triangle ‘Laws’
• Real social networks have a lot of triangles
– Friends of friends are friends
• Any patterns?
Triangle Law: #S.2
[Tsourakakis ICDM 2008]
[Figure: plots for Reuters, SN, and Epinions; X-axis: degree, Y-axis: mean # triangles]
n friends -> ~n^1.6 triangles
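The two axes of the triangle plot can be computed directly on a small graph. The sketch below counts, for every node, the triangles it participates in, then averages by degree. It is a brute-force, in-memory illustration (function names are hypothetical) and not the method used in the cited work, which targets much larger graphs.

```python
from collections import defaultdict
from itertools import combinations

def degree_and_triangles(edges):
    """Return {node: (degree, # triangles the node participates in)}."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    stats = {}
    for node, nbrs in adj.items():
        # a triangle at `node` is a pair of its neighbors that are themselves linked
        tri = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        stats[node] = (len(nbrs), tri)
    return stats

def mean_triangles_by_degree(stats):
    """Average the triangle counts of all nodes that share a degree."""
    buckets = defaultdict(list)
    for degree, tri in stats.values():
        buckets[degree].append(tri)
    return {d: sum(ts) / len(ts) for d, ts in buckets.items()}

# Toy graph: a 4-clique on nodes 0..3 plus a pendant node 4 attached to node 0.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
stats = degree_and_triangles(edges)
means = mean_triangles_by_degree(stats)
print(means)
```

Plotting `means` in log-log scale for a real social network is what would reveal a slope such as the ~1.6 quoted above.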
Triangle counting for large graphs?
Anomalous nodes in Twitter (~3 billion edges)
[U Kang, Brendan Meeder, +, PAKDD’11]
And many more patterns…
• Diameter: SHRINKS with size!
• #nodes vs #edges (power law(!))
• # conn. components (power law, too)
• Contact/phone-call duration (log-logistic)
• Total node weight vs # edges (superlinear/power law)
• ….
Outline
• Problem definition / Motivation
• Graphs and power laws
– Patterns and anomalies
– Scalability and ‘hadoop’
• Anomaly/fraud detection
• Conclusions
Scalability
• Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003]
• Yahoo: 5Pb of data [Fayyad, KDD’07]
• Problem: machine failures, on a daily basis
• How to parallelize data mining tasks, then?
• A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/
details
[Figure: map/reduce execution. The user program forks the master, mappers, and reducers; the master assigns map and reduce tasks. Mappers read the input splits (Split 0, Split 1, Split 2) of the input data on HDFS and do a local write; reducers do a remote read and sort, then write Output File 0 and Output File 1.]
By default: 3-way replication;
Late/dead machines: ignored, transparently (!)
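The map, shuffle/sort, and reduce phases in the diagram can be mimicked in a few lines of plain Python. The sketch below computes node degrees from edge splits; it runs in a single process and does not use the Hadoop API, and the names `map_edges`, `reduce_degree`, and `run_job` are made up for illustration.

```python
from collections import defaultdict

def map_edges(split):
    """Mapper: emit (node, 1) for both endpoints of every edge in one split."""
    for u, v in split:
        yield u, 1
        yield v, 1

def reduce_degree(node, ones):
    """Reducer: sum the 1s emitted for a node to obtain its degree."""
    return node, sum(ones)

def run_job(splits):
    """Run map, then shuffle/sort by key, then reduce, all in one process."""
    grouped = defaultdict(list)
    for split in splits:                 # each split would go to one mapper
        for key, value in map_edges(split):
            grouped[key].append(value)   # shuffle: group values by key
    return dict(reduce_degree(k, vs) for k, vs in sorted(grouped.items()))

splits = [
    [("a", "b"), ("a", "c")],  # Split 0
    [("b", "c"), ("a", "d")],  # Split 1
]
degrees = run_job(splits)
print(degrees)
```

Because each mapper only sees its own split and each reducer only sees one key's values, a failed or slow task can simply be re-run elsewhere, which is what makes the fault tolerance noted above possible.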
Outline
• Problem definition / Motivation
• Graphs and power laws
– Patterns
– Scalability and ‘hadoop’
• Anomaly/Fraud detection
• Conclusions
E-bay Fraud detection
w/ Polo Chau & Shashank Pandit, CMU [WWW’07]
E-bay Fraud detection - NetProbe
App-store fraud
“Opinion Fraud Detection in Online Reviews using Network Effects”
Leman Akoglu, Rishi Chandy, CF, ICWSM’13
Problem
• Given
– user-product review network
– review sign (+/-)
• Classify
– objects into type-specific classes:
users: ‘honest’ / ‘fraudster’
products: ‘good’ / ‘bad’
reviews: ‘genuine’ / ‘fake’
No side data! (e.g., timestamp, review text)
Formulation: BP
[Figure: a User node (honest) linked to a Product node (bad) by a signed review edge (–/+), shown Before and After]
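A minimal sketch of belief propagation (BP) on a signed user-product graph follows, to show the mechanics of the formulation. Everything specific in it is an assumption made for illustration: the compatibility tables `psi`, the value of `eps`, uniform priors, at most one review per (user, product) pair, and the function name `fraud_bp`. It is not claimed to reproduce the exact setup of the cited work.

```python
import numpy as np
from collections import defaultdict

def fraud_bp(reviews, eps=0.1, n_iter=10):
    """reviews: list of (user, product, sign) with sign in {+1, -1}.

    Returns (user_belief, product_belief); each maps an id to
    [P(state 0), P(state 1)], where state 0 = honest/good, 1 = fraudster/bad.
    """
    # psi[sign][user_state, product_state]: assumed compatibilities, e.g.
    # an honest user rarely gives '+' to a bad product.
    psi = {
        +1: np.array([[1 - eps, eps], [2 * eps, 1 - 2 * eps]]),
        -1: np.array([[eps, 1 - eps], [1 - 2 * eps, 2 * eps]]),
    }
    sign = {(u, p): s for u, p, s in reviews}
    prods_of, users_of = defaultdict(list), defaultdict(list)
    for u, p in sign:
        prods_of[u].append(p)
        users_of[p].append(u)

    def normalize(v):
        return v / v.sum()

    to_prod = {e: np.full(2, 0.5) for e in sign}  # messages user -> product
    to_user = {e: np.full(2, 0.5) for e in sign}  # messages product -> user
    for _ in range(n_iter):
        new_to_prod, new_to_user = {}, {}
        for (u, p), s in sign.items():
            # evidence about u from all of its other reviews
            inc_u = np.ones(2)
            for q in prods_of[u]:
                if q != p:
                    inc_u = inc_u * to_user[(u, q)]
            new_to_prod[(u, p)] = normalize(inc_u @ psi[s])  # sum over user states
            # evidence about p from all of its other reviewers
            inc_p = np.ones(2)
            for v in users_of[p]:
                if v != u:
                    inc_p = inc_p * to_prod[(v, p)]
            new_to_user[(u, p)] = normalize(psi[s] @ inc_p)  # sum over product states
        to_prod, to_user = new_to_prod, new_to_user

    user_belief = {
        u: normalize(np.prod([to_user[(u, p)] for p in ps], axis=0))
        for u, ps in prods_of.items()
    }
    product_belief = {
        p: normalize(np.prod([to_prod[(u, p)] for u in us], axis=0))
        for p, us in users_of.items()
    }
    return user_belief, product_belief

# Toy data: three users rate product A positively, one rates it negatively.
reviews = [("u1", "A", +1), ("u2", "A", +1), ("u3", "A", +1), ("u4", "A", -1)]
users, products = fraud_bp(reviews)
print(users["u4"], users["u1"], products["A"])
```

In this toy case the lone dissenting reviewer ends up with a higher fraudster belief than the majority, using only the review signs and the network structure, in line with the "no side data" setting of the problem.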
Users
Top scorers
Products
+ positive (4-5) rating
o negative (1-2) rating
‘Fraud-bot’ member reviews
Same developer!
Duplicated text!
Same day activity!
Outline
• Problem definition / Motivation
• Graphs and power laws
– Patterns and anomalies
– Scalability and ‘hadoop’
• Anomaly/fraud detection
• Streams, spikes, environment, data center monitoring
• Conclusions
Datacenter Monitoring & Management
Lei Li
• Goal: save energy in data centers
– US alone, $7.4B power consumption (2011)
• Challenge:
– 1TB per day
– Complex cyber physical systems
[Figure: temperature in datacenter]
Spike forecasting
Yasuko Matsubara
– Forecast not only tail-part, but also rise-part!
[Figure: three forecasting scenarios: (1) First spike, (2) Release date, (3) Two weeks before release]
Environmental data
Temp. and pressure over time
Temperatures, April
Sao Paulo, Brazil
Open research questions
• Patterns/anomalies for time-evolving graphs (Call graph, 3M people x 6mo)
• Patterns/anomalies given node attributes
• Graph understanding / attribution
…..
• How is the human brain wired?
Contact info
• www.cs.cmu.edu/~christos
• GHC 8019
• Ph#: x8.1457
• www.cs.cmu.edu/~christos/TALKS/13ic/
• FYI: Course: 15-826, Tu-Th 1:30-3:00