CMU SCS
Mining Billion Node Graphs
Christos Faloutsos
CMU
IC '11

CONGRATULATIONS!

Outline
• Q+A
• Problem definition / Motivation
• Graphs and power laws
• Streams, environment, data center monitoring
• Conclusions

Q+A
• Are you recruiting? How many?
• How many do you have?
• How frequently do you meet them?
• What is your advising style?
• How do you feel about summer internships?
Answers, in slide order: Yes, 1-2; 5+2; 1/week; Yes/Maybe (Y!, G, MSR, IBM, ++)

Outline
• Problem definition / Motivation
• Graphs and power laws
  – Patterns and anomalies
  – Scalability and ‘hadoop’
  – Influence/virus propagation
• Streams, environment, data center monitoring
• Conclusions

Motivation
• Data mining: ~ find patterns (rules, outliers)
• What do real graphs look like? Anomalies?
  – Virus/influence propagation
• Time series / env. monitoring
[Figure: temperature in datacenter]

Graphs - why should we care?
[Figures: Friendship Network [Moody ’01]; Food Web [Martinez ’91]; Internet Map [lumeta.com]]

Problem #1 - network and graph mining
• What does the Internet look like?
• What does FaceBook look like?
• What is ‘normal’/‘abnormal’?
• Which patterns/laws hold?
  – To spot anomalies (rarities), we have to discover patterns
  – Large datasets reveal patterns/anomalies that may be invisible otherwise…

Graph mining
• Are real graphs random?

Laws and patterns
NO!!
• Diameter
• in- and out-degree distributions
• other (surprising) patterns
S1 – degree distributions
• Q: avg degree is ~3 - what is the most probable degree?
[Plots: count vs. degree, with degree 3 marked on the x-axis]

Solution:
[Plot: frequency vs. outdegree, log-log scale, Nov ’97; exponent = slope, O = -2.15]
The plot is linear in log-log scale [FFF’99]
freq = degree^(-2.15)

Solution# S.2: Triangle ‘Laws’
• Real social networks have a lot of triangles
  – Friends of friends are friends
• Any patterns?

Triangle Law: #S.2 [Tsourakakis ICDM 2008]
[Plots: Reuters, SN, Epinions; X-axis: degree, Y-axis: mean # triangles]
n friends -> ~n^1.6 triangles

Triangle counting for large graphs?
Anomalous nodes in Twitter (~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

But:
• Q1: How about graphs from other domains?
• Q2: How about temporal evolution?

Time evolution
• with Jure Leskovec (CMU -> Stanford)
• and Jon Kleinberg (Cornell) (‘best paper’ KDD05)
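The S1 slides read the power-law exponent off a log-log plot as its slope. A minimal sketch of that reading follows, assuming an undirected edge list as input and an ordinary least-squares fit on the log-log points. The function names and the synthetic table are illustrative and not from the talk. A least-squares fit on raw log-log counts is a rough estimate only and can be biased on noisy real data.

```python
import math
from collections import Counter


def degree_counts(edges):
    """Return {degree: number of nodes with that degree} for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(degree)."""
    points = [(math.log(d), math.log(c)) for d, c in counts.items() if d > 0 and c > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical frequency table that follows count = 10000 * degree^(-2) exactly.
synthetic = {d: 10000.0 * d ** -2.0 for d in (1, 2, 4, 8, 16, 32)}
print(f"fitted exponent: {loglog_slope(synthetic):.2f}")

# A tiny star graph: one node of degree 3 and three nodes of degree 1.
print(degree_counts([(1, 2), (1, 3), (1, 4)]))
```

On the synthetic table the fitted exponent should come out close to -2, because the points lie exactly on a line in log-log space.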
T1 - Evolution of the Diameter
• Prior work on Power Law graphs hints at slowly growing diameter:
  – diameter ~ O(log N)
  – diameter ~ O(log log N)
• What is happening in real data?
• Diameter shrinks over time
  – As the network grows the distances between nodes slowly decrease

Diameter – ArXiv citation graph
• Citations among physics papers
• 1992–2003
• One graph per year
[Plot: diameter vs. time [years]]

Diameter – “Patents”
• Patent citation network
• 25 years of data
[Plot: diameter vs. time [years]]

And many more patterns…
• #nodes vs #edges (power law(!))
• # conn. components (power law, too)
• Contact/phone-call duration (log-logistic)
• Total node weight vs # edges (superlinear/power law)
• ….

E-bay Fraud detection
w/ Polo Chau & Shashank Pandit, CMU [WWW’07]

E-bay Fraud detection - NetProbe

Popular press
And less desirable attention:
• E-mail from ‘Belgium police’ (‘copy of your code?’)
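The diameter measurements on the T1 slides can be tried on toy graphs with plain breadth-first search. The sketch below is only that: it runs one BFS per node, so it suits small graphs and is not the method used for the large graphs in the talk. The 90th-percentile definition of "effective diameter" is an assumption (the slides do not define the term), and the two snapshots are made-up examples.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def pairwise_distances(edges):
    """Sorted hop counts over all ordered pairs of distinct, connected nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    hops = []
    for s in adj:
        hops.extend(d for t, d in bfs_distances(adj, s).items() if t != s)
    return sorted(hops)


def diameter(edges):
    """Longest shortest path, in hops."""
    return pairwise_distances(edges)[-1]


def effective_diameter(edges, q=0.9):
    """Smallest h such that at least a fraction q of connected pairs
    are within h hops (assumed definition)."""
    hops = pairwise_distances(edges)
    index = max(0, math.ceil(q * len(hops)) - 1)
    return hops[index]


# Hypothetical snapshots: a 6-node path, then the same path plus a new hub node.
early = [(i, i + 1) for i in range(5)]
later = early + [(6, i) for i in range(6)]
for name, snapshot in (("early", early), ("later", later)):
    print(name, diameter(snapshot), effective_diameter(snapshot))
```

In this toy case the later snapshot has more nodes and edges yet a smaller diameter, which is the qualitative behavior the slides report for the citation graphs.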
Scalability
• Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003]
• Yahoo: 5Pb of data [Fayyad, KDD’07]
• Problem: machine failures, on a daily basis
• How to parallelize data mining tasks, then?
• A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/
WIN - NYU 2009

details
[Diagram: map/reduce data flow. Labels: User Program; Master; Mapper (x3); Reducer (x2); Input Data (on HDFS): Split 0, Split 1, Split 2; Output: File 0, File 1. Arrows: fork; assign map; assign reduce; read; local write; remote read, sort; write]
By default: 3-way replication; late/dead machines: ignored, transparently (!)

HADI for diameter estimation
• Radius Plots for Mining Tera-byte Scale Graphs. U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10
• Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B)
• Our HADI: linear on E (~10B)
  – Near-linear scalability wrt # machines
  – Several optimizations -> 5x faster

[Plot: count vs. radius; 19+ [Barabasi+], ~1999, ~1M nodes]

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)
• Largest publicly available graph ever studied.
[Plot: count vs. radius; 14 (dir.), ~7 (undir.), compared with 19+? [Barabasi+]]
YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)
• 7 degrees of separation (!)
• Diameter: shrunk

[Plot: count vs. radius; ~7 (undir.)]
Q: Shape?

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)
• effective diameter: surprisingly small.
• Multi-modality (?!)

Radius Plot of GCC of YahooWeb.

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)
• effective diameter: surprisingly small.
• Multi-modality: probably mixture of cores.
Conjecture: [Plot labels: EN, DE, BR; ~7]

Immunization and epidemic thresholds
• Q1: which nodes to immunize?
• Q2: will a virus vanish, or will it create an epidemic?

Q1: Immunization:
Aditya Prakash
• Given a network,
• k vaccines, and
• the virus details
• Which nodes to immunize?
A: immunize the ones that maximally raise the ‘epidemic threshold’
[Tong+, ICDM’10] ~ λ1

Datacenter Monitoring & Management
Lei Li
• Goal: save energy in data centers
  – US alone, $7.4B power consumption (2011)
• Challenge:
  – 1TB per day
  – Complex cyber physical systems
[Figure: temperature in datacenter]

OVERALL CONCLUSIONS – high level
[Diagram labels: Graphs / Social net; Databases, Map/reduce; Data center monitoring; Big data / analytics; Environmental data monitoring; Cyber-security; Fraud detection; Health db]

All these projects:
Require all three:
• Theory (e.g., eigenvalues, tensors, Kalman filters, wavelets)
• Practice (e.g., PIG, hadoop 0.20, >120GB of data, often TB)
• Domain knowledge (e.g., Navier Stokes, Volterra-Lotka, etc)

Project info
www.cs.cmu.edu/~pegasus
Koutra, Danae; Chau, Polo; Akoglu, Leman; Kang, U; Prakash, Aditya; (McGlohon, Mary); (Tong, Hanghang)
Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab

Contact info
• www.cs.cmu.edu/~christos
• GHC 8019
• Ph#: x8.1457
• Course: 15-826, Tu-Th 12-1:20
• and, again
WELCOME!
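Worked sketch for the immunization slide (Q1): the slide ties the epidemic threshold to the first eigenvalue of the adjacency matrix. Reading that as "a lower largest eigenvalue means a higher threshold" (an assumption about the intended relation), a brute-force way to pick one node on a toy graph is to remove each candidate in turn and keep the one that lowers the eigenvalue most. This is not the algorithm of [Tong+, ICDM’10]; the function names and the star graph are illustrative only.

```python
import math


def largest_eigenvalue(nodes, edges, iterations=200):
    """Estimate the largest adjacency eigenvalue of an undirected graph
    by power iteration on A + I (the shift avoids oscillation on bipartite graphs)."""
    nodes = list(nodes)
    if not nodes:
        return 0.0
    adj = {u: [] for u in nodes}
    for u, v in edges:
        if u in adj and v in adj:  # edges touching removed nodes are ignored
            adj[u].append(v)
            adj[v].append(u)
    x = {u: 1.0 for u in nodes}
    estimate = 0.0
    for _ in range(iterations):
        y = {u: x[u] + sum(x[v] for v in adj[u]) for u in nodes}  # y = (A + I) x
        rayleigh = sum(x[u] * y[u] for u in nodes) / sum(x[u] * x[u] for u in nodes)
        estimate = rayleigh - 1.0  # undo the identity shift
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {u: val / norm for u, val in y.items()}
    return estimate


def best_node_to_immunize(nodes, edges):
    """Brute force: the single node whose removal lowers the largest eigenvalue most."""
    best, best_value = None, float("inf")
    for candidate in nodes:
        remaining = [u for u in nodes if u != candidate]
        value = largest_eigenvalue(remaining, edges)
        if value < best_value:
            best, best_value = candidate, value
    return best, best_value


# Hypothetical star graph: a center node "c" with four leaves.
star_nodes = ["c", "a1", "a2", "a3", "a4"]
star_edges = [("c", leaf) for leaf in star_nodes[1:]]
print(largest_eigenvalue(star_nodes, star_edges))
print(best_node_to_immunize(star_nodes, star_edges))
```

For k vaccines, one simple extension is to repeat the selection k times on the shrinking graph. Each selection costs one eigenvalue estimate per candidate, so this only illustrates the criterion and would not scale to the graph sizes discussed in the talk.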