CMU SCS
Big (graph) data analytics
Christos Faloutsos
CMU
IC '13 C. Faloutsos

CONGRATULATIONS!

Outline
• Q+A
• Problem definition / Motivation
• Graphs and power laws
• Anomaly/fraud detection
• Conclusions

Q+A
• Are you recruiting? How many? A: Maybe, ~1
• How many do you have? A: 4 (+5 postdocs)
• How frequently do you meet them? A: 1/week
• What is your advising style? A: results
• How do you feel about summer internships? A: Yes/Maybe (FB, MSR, IBM, ++)

Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns
  – Scalability and ‘hadoop’
• Anomaly detection
• Conclusions

Motivation
• Data mining: ~ find patterns (rules, outliers)
• What do real graphs look like? Anomalies?
  – Virus/influence propagation
• Time series / environmental monitoring
[Figure: Temperature in datacenter]

Graphs - why should we care?
[Figures: Food Web [Martinez ’91]; Internet Map [lumeta.com]; caption: ~1B users, $10-$100B revenue]

Tensors: Graphs on steroids
• Tensors (= multi-dimensional arrays)
  – Predicates (subject, verb, object) in knowledge base
  – “Eric Clapton plays guitar”
  – “Barack Obama is the president of U.S.”
• NELL (Never Ending Language Learner) data: modes of size 26M, 48M, and 26M; nonzeros = 144M
Vagelis Papalexakis, CMU-CS; Tom Mitchell, CMU/CS-MLD
Open House, 2013 C. Faloutsos (CMU)

Concept Discovery
• Concept Discovery in Knowledge Base
  – NP1: Internet, file, data
  – NP2: Protocol, software, suite

‘NeuroSemantics’
• >200GB total
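The (subject, verb, object) view of a knowledge base from the tensor slides can be sketched as a sparse three-mode array in coordinate form. This is only an illustration under stated assumptions: the function names build_sparse_tensor and tensor_shape are hypothetical, the repeated third triple is made up to show a count above 1, and nothing here reflects how GigaTensor or NELL actually store their data.

```python
from collections import Counter


def build_sparse_tensor(triples):
    """Encode (subject, verb, object) triples as a sparse 3-mode tensor.

    Returns (entries, index_maps): entries maps a coordinate (i, j, k) to
    how many times that triple occurred; index_maps holds one
    label -> integer id dictionary per mode (subject, verb, object).
    """
    index_maps = ({}, {}, {})
    entries = Counter()
    for triple in triples:
        # Assign each new label the next free integer id in its own mode.
        coords = tuple(
            mode_index.setdefault(label, len(mode_index))
            for mode_index, label in zip(index_maps, triple)
        )
        entries[coords] += 1
    return entries, index_maps


def tensor_shape(index_maps):
    """Size of each mode = number of distinct labels seen in that mode."""
    return tuple(len(mode_index) for mode_index in index_maps)


# The two example predicates from the slide, plus one hypothetical repeat.
triples = [
    ("Eric Clapton", "plays", "guitar"),
    ("Barack Obama", "is the president of", "U.S."),
    ("Eric Clapton", "plays", "guitar"),
]
entries, index_maps = build_sparse_tensor(triples)
shape = tensor_shape(index_maps)
nonzeros = len(entries)
```

Only the nonzero cells are stored, which is why a tensor with modes in the tens of millions can still be described by its nonzero count (144M in the NELL data above).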
Experiments
• GigaTensor solves 100x larger problem
[Figure: axes (I), (J), (K); labels: GigaTensor, 100x, Out of Memory; number of nonzeros = I / 50]

Problem #1 - network and graph mining
• What does the Internet look like?
• What does FaceBook look like?
• What is ‘normal’/‘abnormal’?
• Which patterns/laws hold?
  – To spot anomalies (rarities), we have to discover patterns
  – Large datasets reveal patterns/anomalies that may be invisible otherwise…

Graph mining
• Are real graphs random?

Laws and patterns
NO!!
• Diameter
• In- and out-degree distributions
• Other (surprising) patterns

Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns
  – Scalability and ‘hadoop’
• Anomaly/Fraud detection
• Conclusions

S1 – degree distributions
• Q: avg degree is ~3 - what is the most probable degree?
[Figure: count vs. degree, with "??" marked at degree 3]

Solution
[Figure: frequency vs. outdegree, Nov ’97; exponent = slope O = -2.15]
• The plot is linear in log-log scale [FFF’99]
• freq = degree^(-2.15)

Solution# S.2: Triangle ‘Laws’
• Real social networks have a lot of triangles
  – Friends of friends are friends
• Any patterns?

Triangle Law: #S.2 [Tsourakakis ICDM 2008]
[Figure panels: Reuters, SN, Epinions; X-axis: degree; Y-axis: mean # triangles]
• n friends -> ~n^1.6 triangles
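Both laws above are straight lines in log-log scale, so the exponent can be read off as a slope. The following is a minimal in-memory sketch, not the method of the cited papers (which target far larger graphs): count_triangles, loglog_slope, the toy edge list, and the synthetic frequency data are all assumptions made for illustration.

```python
import math
from itertools import combinations


def count_triangles(adj):
    """Return {node: number of triangles through that node}.

    adj maps each node of an undirected graph to the set of its neighbours.
    """
    triangles = {}
    for node, nbrs in adj.items():
        # A triangle through `node` is a pair of its neighbours
        # that are themselves connected by an edge.
        triangles[node] = sum(
            1 for u, v in combinations(sorted(nbrs), 2) if v in adj[u]
        )
    return triangles


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x) (all values > 0)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# Toy graph (hypothetical): one triangle a-b-c plus a pendant node d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
tri = count_triangles(adj)

# Synthetic data that follows freq = degree^(-2.15) exactly,
# used only to show that the fit recovers the exponent.
degrees = list(range(1, 101))
freqs = [d ** -2.15 for d in degrees]
slope = loglog_slope(degrees, freqs)
```

On real data the same slope would be fitted to the observed (degree, frequency) or (degree, mean # triangles) pairs; a plain least-squares fit in log-log space is a simple choice here, not a claim about how the exponents on the slides were estimated.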
Triangle counting for large graphs?
• Anomalous nodes in Twitter (~3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

And many more patterns…
• Diameter: SHRINKS with size!
• #nodes vs #edges (power law(!))
• # connected components (power law, too)
• Contact/phone-call duration (log-logistic)
• Total node weight vs # edges (superlinear/power law)
• …

Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns and anomalies
  – Scalability and ‘hadoop’
• Anomaly/fraud detection
• Conclusions

Scalability
• Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003]
• Yahoo: 5Pb of data [Fayyad, KDD’07]
• Problem: machine failures, on a daily basis
• How to parallelize data mining tasks, then?
• A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/

details
[Figure: map/reduce data flow. The User Program forks the Master, Mappers, and Reducers; the Master assigns map and reduce tasks; Mappers read the input splits (Split 0, Split 1, Split 2) of the input data on HDFS and write locally; Reducers do a remote read and sort, then write Output File 0 and Output File 1]
• By default: 3-way replication
• Late/dead machines: ignored, transparently (!)

Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns
  – Scalability and ‘hadoop’
• Anomaly/Fraud detection
• Conclusions

E-bay Fraud detection
• w/ Polo Chau & Shashank Pandit, CMU [WWW’07]
E-bay Fraud detection - NetProbe

App-store fraud
• Opinion Fraud Detection in Online Reviews using Network Effects
• Leman Akoglu, Rishi Chandy, CF, ICWSM’13

Problem
• Given
  – user-product review network
  – review sign (+/-)
• Classify
  – objects into type-specific classes:
    users: ‘honest’ / ‘fraudster’
    products: ‘good’ / ‘bad’
    reviews: ‘genuine’ / ‘fake’
• No side data! (e.g., timestamp, review text)

Formulation: BP
[Figure: user and product nodes linked by + and – reviews; node labels ‘honest’ and ‘bad’; panels: Before, After]

Top scorers
[Figure: users vs. products; + positive (4-5) rating, o negative (1-2) rating]

‘Fraud-bot’ member reviews
• Same developer!
• Duplicated text!
• Same day activity!

Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns and anomalies
  – Scalability and ‘hadoop’
• Anomaly/fraud detection
• Streams, spikes, environment, data center monitoring
• Conclusions

Datacenter Monitoring & Management
Lei Li
• Goal: save energy in data centers
  – US alone, $7.4B power consumption (2011)
• Challenge:
  – 1TB per day
  – Complex cyber physical systems
[Figure: Temperature in datacenter]

Spike forecasting
Yasuko Matsubara
  – Forecast not only tail-part, but also rise-part!
[Figure: forecasts marked "?" at (1) First spike, (2) Release date, (3) Two weeks before release]

Environmental data
• Temp. and pressure over time
[Figure: Temperatures, April, Sao Paulo, Brazil]
Open research questions
• Patterns/anomalies for time-evolving graphs (Call graph, 3M people x 6mo)
• Patterns/anomalies given node attributes
• Graph understanding / attribution …
• How is the human brain wired?

Contact info
• www.cs.cmu.edu/~christos
• GHC 8019
• Ph#: x8.1457
• www.cs.cmu.edu/~christos/TALKS/13ic/
• FYI: Course: 15-826, Tu-Th 1:30-3:00