Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
School of Computer Science Carnegie Mellon Data Mining using Fractals and Power laws Christos Faloutsos Carnegie Mellon University School of Computer Science Carnegie Mellon Thank you! • Prof. Hsing-Kuo Kenneth PAO • Prof. Yuh-Jye LEE • Hsin Yeh ICAST, June'09 C. Faloutsos 2 School of Computer Science Carnegie Mellon And also thanks to Ching-Hao (Eric) MAO Lei LI Ming-Kung (Morgan) SUN Leman AKOGLU Yi-Ren (Ian) YEH Ian ROLEWICZ ICAST, June'09 C. Faloutsos 3 School of Computer Science Carnegie Mellon Overview • Goals/ motivation: find patterns in large datasets: – (A) Sensor data – (B) network intrusion data • Solutions: self-similarity and power laws • Discussion ICAST, June'09 C. Faloutsos 4 School of Computer Science Carnegie Mellon Applications of sensors/streams • network monitoring # alerts time ICAST, June'09 C. Faloutsos 5 School of Computer Science Carnegie Mellon Applications of sensors/streams • Financial, sales, economic series • Medical: ECGs +; blood pressure etc monitoring ICAST, June'09 C. Faloutsos 6 School of Computer Science Carnegie Mellon Motivation - Applications • Scientific data: seismological; astronomical; environment / antipollution; meteorological Sunspot Data 300 250 200 150 100 50 0 ICAST, June'09 C. Faloutsos 7 School of Computer Science Carnegie Mellon Motivation - Applications (cont’d) • Computer systems – web servers (buffering, prefetching) – ... http://repository.cs.vt.edu/lbl-conn-7.tar.Z ICAST, June'09 C. Faloutsos 8 School of Computer Science Carnegie Mellon Web traffic • [Crovella Bestavros, SIGMETRICS’96] ICAST, June'09 C. Faloutsos 9 School of Computer Science Carnegie Mellon Self-* Storage (Ganger+) “self-*” = self-managing, self-tuning, self-healing, … survivable, self-managing storage infrastructure a storage brick (0.5–5 TB) ~1 PB ICAST, June'09 C. Faloutsos 10 School of Computer Science Carnegie Mellon Self-* Storage (Ganger+) “self-*” = self-managing, self-tuning, self-healing, … Goal: 1 petabyte (PB) www.pdl.cmu.edu/SelfStar survivable, self-managing storage infrastructure a storage brick (0.5–5 TB) ~1 PB ICAST, June'09 C. Faloutsos 11 School of Computer Science Carnegie Mellon Problem definition • Given: one or more sequences x1 , x2 , … , xt , …; (y1, y2, … , yt, …) • Find – patterns; clusters; outliers; forecasts; ICAST, June'09 C. Faloutsos 12 School of Computer Science Carnegie Mellon Problem # bytes • Find patterns, in large datasets time ICAST, June'09 C. Faloutsos 13 School of Computer Science Carnegie Mellon Problem # bytes • Find patterns, in large datasets time Poisson indep., ident. distr ICAST, June'09 C. Faloutsos 14 School of Computer Science Carnegie Mellon Problem # bytes • Find patterns, in large datasets time Poisson indep., ident. distr ICAST, June'09 C. Faloutsos 15 School of Computer Science Carnegie Mellon Problem # bytes • Find patterns, in large datasets time Poisson indep., ident. distr ICAST, June'09 Q: Then, how to generate such bursty traffic? C. Faloutsos 16 School of Computer Science Carnegie Mellon Solutions • New tools: power laws, self-similarity and ‘fractals’ work, where traditional assumptions fail • Let’s see the details: ICAST, June'09 C. Faloutsos 17 School of Computer Science Carnegie Mellon Overview • Goals/ motivation: find patterns in large datasets: – (A) Sensor data – (B) network data • Solutions: self-similarity and power laws • Discussion ICAST, June'09 C. Faloutsos 18 School of Computer Science Carnegie Mellon What is a fractal? = self-similar point set, e.g., Sierpinski triangle: ... zero area: (3/4)^inf infinite length! (4/3)^inf Q: What is its dimensionality?? ICAST, June'09 C. Faloutsos 19 School of Computer Science Carnegie Mellon What is a fractal? = self-similar point set, e.g., Sierpinski triangle: ... zero area: (3/4)^inf infinite length! (4/3)^inf Q: What is its dimensionality?? A: log3 / log2 = 1.58 (!?!) ICAST, June'09 C. Faloutsos 20 School of Computer Science Carnegie Mellon Intrinsic (‘fractal’) dimension • Q: fractal dimension of a line? ICAST, June'09 • Q: fd of a plane? C. Faloutsos 21 School of Computer Science Carnegie Mellon Intrinsic (‘fractal’) dimension • Q: fractal dimension of a line? • A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a) ICAST, June'09 • Q: fd of a plane? • A: nn ( <= r ) ~ r^2 fd== slope of (log(nn) vs.. log(r) ) C. Faloutsos 22 School of Computer Science Carnegie Mellon Sierpinsky triangle == ‘correlation integral’ log(#pairs within <=r ) = CDF of pairwise distances 1.58 log( r ) ICAST, June'09 C. Faloutsos 23 School of Computer Science Carnegie Mellon Observations: Fractals <-> power laws Closely related: • fractals <=> • self-similarity <=> • scale-free <=> • power laws ( y= xa ; F=K r-2) 1.58 • (vs y=e-ax or y=xa+b) ICAST, June'09 log(#pairs within <=r ) C. Faloutsos log( r ) 24 School of Computer Science Carnegie Mellon Outline • • • • Problems Self-similarity and power laws Solutions to posed problems Discussion ICAST, June'09 C. Faloutsos 25 School of Computer Science Carnegie Mellon Solution #1: traffic • disk traces: self-similar: (also: [Leland+94]) • How to generate such traffic? #bytes time ICAST, June'09 C. Faloutsos 26 School of Computer Science Carnegie Mellon Solution #1: traffic • disk traces (80-20 ‘law’) – ‘multifractals’ 20% 80% #bytes time ICAST, June'09 C. Faloutsos 27 School of Computer Science Carnegie Mellon 80-20 / multifractals 20 ICAST, June'09 80 C. Faloutsos 28 School of Computer Science Carnegie Mellon 80-20 / multifractals 20 80 • p ; (1-p) in general • yes, there are dependencies ICAST, June'09 C. Faloutsos 29 School of Computer Science Carnegie Mellon More on 80/20: PQRS • Part of ‘self-* storage’ project time ICAST, June'09 cylinder# C. Faloutsos 30 School of Computer Science Carnegie Mellon More on 80/20: PQRS • Part of ‘self-* storage’ project ICAST, June'09 p q r s C. Faloutsos q r s 31 School of Computer Science Carnegie Mellon Overview • Goals/ motivation: find patterns in large datasets: – (A) Sensor data – (B) network data • Solutions: self-similarity and power laws – sensor/traffic data – network data • Discussion ICAST, June'09 C. Faloutsos 32 School of Computer Science Carnegie Mellon Problem dfn <source-ip, target-ip, timestamp, alert-type> eg., <192.168.2.5; 128.2.220.159; 3am june 6; ICMP-redirect-host> goal: find patterns / anomalies ICAST, June'09 C. Faloutsos 33 School of Computer Science Carnegie Mellon Power laws in intrusion data count rank ICAST, June'09 C. Faloutsos 34 School of Computer Science Carnegie Mellon ICAST, June'09 C. Faloutsos 35 School of Computer Science Carnegie Mellon human-like robot-like Q: Can we visually summarize / classify our sequences? robot-like (bursty) ICAST, June'09 C. Faloutsos 36 School of Computer Science Carnegie Mellon Answer: yes! two features: • F1: how periodic (24h-cycle) is a sequence • F2: how bursty it is Q: how to measure burstiness? A: Fractal dimension! ICAST, June'09 C. Faloutsos 37 School of Computer Science Carnegie Mellon Burstiness & f.d. uniform: fd = 1 @same time-tick: fd = 0 ICAST, June'09 C. Faloutsos 38 School of Computer Science Carnegie Mellon Burstiness & f.d. uniform: fd = 1 bursts within bursts within bursts: 0<fd<1 @same time-tick: fd = 0 ICAST, June'09 C. Faloutsos 39 School of Computer Science Carnegie Mellon 6. Proposed Methods: The FDP Plot ICAST, June'09clustering wrt C. Faloutsos • Notice: alert types! 40 School of Computer Science Carnegie Mellon human-like robot-like can we visually summarize / classify our sequences? robot-like (bursty) ICAST, June'09 C. Faloutsos 41 School of Computer Science Carnegie Mellon Examples human-like behavior ICAST, June'09 C. Faloutsos 42 School of Computer Science Carnegie Mellon Examples human-like behavior ICAST, June'09 C. Faloutsos 43 School of Computer Science Carnegie Mellon Examples human-like behavior ICAST, June'09 C. Faloutsos 44 School of Computer Science Carnegie Mellon Examples human-like behavior ICAST, June'09 C. Faloutsos 45 School of Computer Science Carnegie Mellon Outline • • • • problems Fractals Solutions Discussion – what else can they solve? – how frequent are fractals? ICAST, June'09 C. Faloutsos 46 School of Computer Science Carnegie Mellon What else can they solve? • • • • • • separability [KDD’02] forecasting [CIKM’02] dimensionality reduction [SBBD’00] non-linear axis scaling [KDD’02] disk trace modeling [PEVA’02] selectivity of spatial/multimedia queries [PODS’94, VLDB’95, ICDE’00] • ... ICAST, June'09 C. Faloutsos 47 School of Computer Science Carnegie Mellon Problem #3 - spatial d.m. Galaxies (Sloan Digital Sky Survey w/ B. - ‘spiral’ and ‘elliptical’ Nichol) galaxies - patterns? (not Gaussian; not uniform) -attraction/repulsion? - separability?? ICAST, June'09 C. Faloutsos 48 School of Computer Science Carnegie Mellon Solution#3: spatial d.m. log(#pairs within <=r ) CORRELATION INTEGRAL! - 1.8 slope - plateau! ell-ell - repulsion! spi-spi spi-ell log(r) ICAST, June'09 C. Faloutsos 49 School of Computer Science Carnegie Mellon Solution#3: spatial d.m. log(#pairs within <=r ) [w/ Seeger, Traina, Traina, SIGMOD00] - 1.8 slope - plateau! ell-ell - repulsion! spi-spi spi-ell log(r) ICAST, June'09 C. Faloutsos 50 School of Computer Science Carnegie Mellon Solution#3: spatial d.m. r1 r2 Heuristic on choosing # of clusters r2 r1 ICAST, June'09 C. Faloutsos 51 School of Computer Science Carnegie Mellon Solution#3: spatial d.m. log(#pairs within <=r ) - 1.8 slope - plateau! ell-ell - repulsion! spi-spi spi-ell log(r) ICAST, June'09 C. Faloutsos 52 School of Computer Science Carnegie Mellon Outline • • • • problems Fractals Solutions Discussion – what else can they solve? – how frequent are fractals? ICAST, June'09 C. Faloutsos 53 School of Computer Science Carnegie Mellon Fractals & power laws: appear in numerous settings: • medical • geographical / geological • social • computer-system related • <and many-many more! see [Mandelbrot]> ICAST, June'09 C. Faloutsos 54 School of Computer Science Carnegie Mellon Fractals: Brain scans • brain-scans Log(#octants) 2.63 = fd ICAST, June'09 C. Faloutsos octree levels 55 School of Computer Science Carnegie Mellon More fractals • periphery of malignant tumors: ~1.5 • benign: ~1.3 • [Burdet+] ICAST, June'09 C. Faloutsos 56 School of Computer Science Carnegie Mellon More fractals: • cardiovascular system: 3 (!) lungs: ~2.9 ICAST, June'09 C. Faloutsos 57 School of Computer Science Carnegie Mellon Fractals & power laws: appear in numerous settings: • medical • geographical / geological • social • computer-system related ICAST, June'09 C. Faloutsos 58 School of Computer Science Carnegie Mellon More fractals: • Coastlines: 1.2-1.58 1 1.1 1.3 ICAST, June'09 C. Faloutsos 59 School of Computer Science Carnegie Mellon ICAST, June'09 C. Faloutsos 60 School of Computer Science Carnegie Mellon More fractals: • the fractal dimension for the Amazon river is 1.85 (Nile: 1.4) [ems.gphys.unc.edu/nonlinear/fractals/examples.html] ICAST, June'09 C. Faloutsos 61 School of Computer Science Carnegie Mellon More fractals: • the fractal dimension for the Amazon river is 1.85 (Nile: 1.4) [ems.gphys.unc.edu/nonlinear/fractals/examples.html] ICAST, June'09 C. Faloutsos 62 School of Computer Science Carnegie Mellon GIS points Cross-roads of Montgomery county: •any rules? ICAST, June'09 C. Faloutsos 63 School of Computer Science Carnegie Mellon GIS log(#pairs(within <= r)) A: self-similarity: • intrinsic dim. = 1.51 1.51 log( r ) ICAST, June'09 C. Faloutsos 64 School of Computer Science Carnegie Mellon Examples:LB county • Long Beach county of CA (road end-points) log(#pairs) 1.7 log(r) ICAST, June'09 C. Faloutsos 65 School of Computer Science Carnegie Mellon More power laws: areas – Korcak’s law Scandinavian lakes Any pattern? ICAST, June'09 C. Faloutsos 66 School of Computer Science Carnegie Mellon More power laws: areas – Korcak’s law log(count( >= area)) Scandinavian lakes area vs complementary cumulative count (log-log axes) ICAST, June'09 log(area) C. Faloutsos 67 School of Computer Science Carnegie Mellon More power laws: Korcak log(count( >= area)) Japan islands; area vs cumulative count (log-log axes) ICAST, June'09 log(area) C. Faloutsos 68 School of Computer Science Carnegie Mellon More power laws • Energy of earthquakes (Gutenberg-Richter law) [simscience.org] Energy released log(count) day ICAST, June'09 Magnitude = log(energy) C. Faloutsos 69 School of Computer Science Carnegie Mellon Fractals & power laws: appear in numerous settings: • medical • geographical / geological • social • computer-system related ICAST, June'09 C. Faloutsos 70 School of Computer Science Carnegie Mellon A famous power law: Zipf’s law log(freq) “a” • Bible - rank vs. frequency (log-log) “the” “Rank/frequency plot” log(rank) ICAST, June'09 C. Faloutsos 71 School of Computer Science Carnegie Mellon TELCO data count of customers ‘best customer’ # of service units ICAST, June'09 C. Faloutsos 72 School of Computer Science Carnegie Mellon SALES data – store#96 count of products “aspirin” # units sold ICAST, June'09 C. Faloutsos 73 School of Computer Science Carnegie Mellon Olympic medals (Sidney’00, Athens’04): log(#medals) 2.5 2 1.5 athens 1 sidney 0.5 0 0 0.5 1 1.5 2 log( rank) ICAST, June'09 C. Faloutsos 74 School of Computer Science Carnegie Mellon Olympic medals (Sidney’00, Athens’04): log(#medals) 2.5 2 1.5 athens 1 sidney 0.5 0 0 0.5 1 1.5 2 log( rank) ICAST, June'09 C. Faloutsos 75 School of Computer Science Carnegie Mellon Even more power laws: • Income distribution (Pareto’s law) • size of firms • publication counts (Lotka’s law) ICAST, June'09 C. Faloutsos 76 School of Computer Science Carnegie Mellon Even more power laws: library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001) log(count) Ullman log(#citations) ICAST, June'09 C. Faloutsos 77 School of Computer Science Carnegie Mellon Even more power laws: • web hit counts [w/ A. Montgomery] Web Site Traffic log(count) Zipf “yahoo.com” log(freq) ICAST, June'09 C. Faloutsos 78 School of Computer Science Carnegie Mellon Autonomous systems • Power law in the degree distribution [SIGCOMM99] internet domains att.com log(degree) ibm.com -0.82 log(rank) ICAST, June'09 C. Faloutsos 79 School of Computer Science Carnegie Mellon The Peer-to-Peer Topology count [Jovanovic+] degree • Number of immediate peers (= degree), follows a power-law ICAST, June'09 C. Faloutsos 80 School of Computer Science Carnegie Mellon epinions.com • who-trusts-whom [Richardson + Domingos, KDD 2001] count (out) degree ICAST, June'09 C. Faloutsos 81 School of Computer Science Carnegie Mellon Fractals & power laws: appear in numerous settings: • medical • geographical / geological • social • computer-system related ICAST, June'09 C. Faloutsos 82 School of Computer Science Carnegie Mellon Power laws, cont’d • In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER] log indegree from [Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins ] ICAST, June'09 - log(freq) C. Faloutsos 83 School of Computer Science Carnegie Mellon Power laws, cont’d • In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER] log(freq) from [Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins ] ICAST, June'09 log indegree C. Faloutsos 84 School of Computer Science Carnegie Mellon Power laws, cont’d • In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER] log(freq) Q: ‘how can we use these power laws?’ log indegree ICAST, June'09 C. Faloutsos 85 School of Computer Science Carnegie Mellon “Foiled by power law” • [Broder+, WWW’00] (log) count (log) in-degree ICAST, June'09 C. Faloutsos 86 School of Computer Science Carnegie Mellon “Foiled by power law” • [Broder+, WWW’00] (log) count “The anomalous bump at 120 on the x-axis is due a large clique formed by a single spammer” (log) in-degree ICAST, June'09 C. Faloutsos 87 School of Computer Science Carnegie Mellon Power laws, cont’d • In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER] • length of file transfers [Crovella+Bestavros ‘96] • duration of UNIX jobs ICAST, June'09 C. Faloutsos 88 School of Computer Science Carnegie Mellon Conclusions • Fascinating problems in Data Mining: find patterns in streams & network data ICAST, June'09 C. Faloutsos 89 School of Computer Science Carnegie Mellon Conclusions - cont’d New tools for Data Mining: self-similarity & power laws: appear in many cases Bad news: lead to skewed distributions (no Gaussian, Poisson, uniformity, independence, mean, variance) ICAST, June'09 C. Faloutsos Good news: • ‘correlation integral’ for separability • rank/frequency plots • 80-20 (multifractals) • • • • (Hurst exponent, strange attractors, renormalization theory, ++) 90 School of Computer Science Carnegie Mellon Resources • Manfred Schroeder “Chaos, Fractals and Power Laws”, 1991 ICAST, June'09 C. Faloutsos 91 School of Computer Science Carnegie Mellon References • [vldb95] Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension Proc. of VLDB, p. 299310, 1995 • [Broder+’00] Andrei Broder, Ravi Kumar , Farzin Maghoul1, Prabhakar Raghavan , Sridhar Rajagopalan , Raymie Stata, Andrew Tomkins , Janet Wiener, Graph structure in the web , WWW’00 • M. Crovella and A. Bestavros, Self similarity in World wide web traffic: Evidence and possible causes , SIGMETRICS ’96. ICAST, June'09 C. Faloutsos 92 School of Computer Science Carnegie Mellon References • J. Considine, F. Li, G. Kollios and J. Byers, Approximate Aggregation Techniques for Sensor Databases (ICDE’04, best paper award). • [pods94] Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, PODS, Minneapolis, MN, May 24-26, 1994, pp. 4-13 ICAST, June'09 C. Faloutsos 93 School of Computer Science Carnegie Mellon References • [vldb96] Christos Faloutsos, Yossi Matias and Avi Silberschatz, Modeling Skewed Distributions Using Multifractals and the `80-20 Law’ Conf. on Very Large Data Bases (VLDB), Bombay, India, Sept. 1996. • [sigmod2000] Christos Faloutsos, Bernhard Seeger, Agma J. M. Traina and Caetano Traina Jr., Spatial Join Selectivity Using Power Laws, SIGMOD 2000 ICAST, June'09 C. Faloutsos 94 School of Computer Science Carnegie Mellon References • [vldb96] Christos Faloutsos and Volker Gaede Analysis of the Z-Ordering Method Using the Hausdorff Fractal Dimension VLD, Bombay, India, Sept. 1996 • [sigcomm99] Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, What does the Internet look like? Empirical Laws of the Internet Topology, SIGCOMM 1999 ICAST, June'09 C. Faloutsos 95 School of Computer Science Carnegie Mellon References • [Leskovec 05] Jure Leskovec, Jon M. Kleinberg, Christos Faloutsos: Graphs over time: densification laws, shrinking diameters and possible explanations. KDD 2005: 177-187 ICAST, June'09 C. Faloutsos 96 School of Computer Science Carnegie Mellon References • [ieeeTN94] W. E. Leland, M.S. Taqqu, W. Willinger, D.V. Wilson, On the Self-Similar Nature of Ethernet Traffic, IEEE Transactions on Networking, 2, 1, pp 1-15, Feb. 1994. • [brite] Alberto Medina, Anukool Lakhina, Ibrahim Matta, and John Byers. BRITE: An Approach to Universal Topology Generation. MASCOTS '01 ICAST, June'09 C. Faloutsos 97 School of Computer Science Carnegie Mellon References • [icde99] Guido Proietti and Christos Faloutsos, I/O complexity for range queries on region data stored using an R-tree (ICDE’99) • Stan Sclaroff, Leonid Taycher and Marco La Cascia , "ImageRover: A content-based image browser for the world wide web" Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, pp 2-9, 1997. ICAST, June'09 C. Faloutsos 98 School of Computer Science Carnegie Mellon References • [kdd2001] Agma J. M. Traina, Caetano Traina Jr., Spiros Papadimitriou and Christos Faloutsos: Triplots: Scalable Tools for Multidimensional Data Mining, KDD 2001, San Francisco, CA. ICAST, June'09 C. Faloutsos 99 School of Computer Science Carnegie Mellon Thank you! Contact info: christos <at> cs.cmu.edu www. cs.cmu.edu /~christos (w/ papers, datasets, code for fractal dimension estimation, etc) ICAST, June'09 C. Faloutsos 100