Indexing and Data Mining in Multimedia Databases
Christos Faloutsos, CMU
www.cs.cmu.edu/~christos
USC 2001

Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources

Problem
Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
• Allow fast, approximate queries, and
• Find rules/patterns

Sample queries
• Similarity search
  – Find pairs of branches with similar sales patterns
  – Find medical cases similar to Smith's
  – Find pairs of sensor series that move in sync
  – Find shapes like a spark-plug

Sample queries – cont’d
• Rule discovery
  – Clusters (of branches; of sensor data; ...)
  – Forecasting (total sales for next year?)
  – Outliers (e.g., unexpected part failures; fraud detection)

Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Related projects @ CMU and resources

Indexing - Multimedia
Problem:
• given a set of (multimedia) objects,
• find the ones similar to a desirable query object

[Figure: three $price time series plotted over days 1 to 365; distance function: by expert]

‘GEMINI’ - Pictorially
[Figure: each time series S1 ... Sn (days 1 to 365) is mapped to a feature point F(S1) ... F(Sn); example features: avg and std]

Remaining issues
• how to extract features automatically?
• how to merge similarity scores from different media?

Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
  – Visualization: FastMap
  – Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions

FastMap
Pairwise distances between five objects:

      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0

[Figure: point layout to be recovered from the distances; labels: ~1, ~100, ??]

FastMap
• Multi-dimensional scaling (MDS) can do that, but in O(N**2) time
• We want a linear algorithm: FastMap [SIGMOD95]

Applications: time sequences
• given n co-evolving time sequences
• visualize them + find rules [ICDE00]
[Figure: exchange rate vs. time for DEM, JPY, HKD]

Applications - financial
• currency exchange rates [ICDE00]
[Figure: currencies FRF, GBP, JPY, HKD, with USD(t) and USD(t-5)]
[Figure: currencies FRF, DEM, HKD, JPY, USD, GBP, with USD(t) and USD(t-5)]

Application: VideoTrails [ACM MM97]

VideoTrails - usage
• scene-cut detection (about 10% errors)
• scene classification (e.g., dialogue vs action)

Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
  – Visualization: FastMap
  – Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions

Merging similarity scores
• e.g., video: text, color, motion, audio
  – weights change with the query!
• solution 1: user specifies weights
• solution 2: user gives examples
  – and we ‘learn’ what he/she wants: rel. feedback (Rocchio, MARS, MindReader)
  – but: how about disjunctive queries?

‘FALCON’
Trader wants only ‘unstable’ stocks
[Figure: stock price shapes labeled ‘Vs’ and ‘Inverted Vs’]

“Single query point” methods
[Figure: positive examples (+) and the single query point (x) chosen by Rocchio, MARS and MindReader]
The averaging effect in action...

Main idea: FALCON Contours [Wu+, vldb2000]
[Figure: positive examples (+) in the plane of feature1 (e.g., temperature) vs. feature2 (e.g., frequency)]

Conclusions for indexing + visualization
• GEMINI: fast indexing, exploiting off-the-shelf SAMs
• FastMap: automatic feature extraction in O(N) time
• FALCON: relevance feedback for disjunctive queries
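
Speaker note (illustrative, not from the original deck): the FastMap step can be sketched as below, using the five-object distance matrix shown on the FastMap slide. This is a minimal sketch of the pivot-and-project idea from [SIGMOD95]; the function name `fastmap`, the pivot heuristic starting at object 0, and the clamping of negative residuals are assumptions made for the example, not the authors' code.

```python
import math

def fastmap(dist, n, k):
    """Map n objects to k-d points, given only a distance function dist(i, j)."""
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, col):
        # Squared distance left over after the first `col` axes are accounted for.
        base = dist(i, j) ** 2
        for c in range(col):
            base -= (coords[i][c] - coords[j][c]) ** 2
        return max(base, 0.0)  # clamp: the input need not be exactly Euclidean

    for col in range(k):
        # Pivot heuristic: start at object 0, jump to the farthest object twice.
        a = 0
        b = max(range(n), key=lambda j: d2(a, j, col))
        a = max(range(n), key=lambda j: d2(b, j, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # nothing left to explain; remaining axes stay at 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection of object i on the line through pivots a and b.
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * dab)
    return coords

# Distance matrix from the FastMap slide (objects O1..O5).
D = [
    [0, 1, 1, 100, 100],
    [1, 0, 1, 100, 100],
    [1, 1, 0, 100, 100],
    [100, 100, 100, 0, 1],
    [100, 100, 100, 1, 0],
]
coords = fastmap(lambda i, j: D[i][j], n=5, k=2)
for name, point in zip(["O1", "O2", "O3", "O4", "O5"], coords):
    print(name, [round(v, 3) for v in point])
```

Each axis needs only the distances from every object to the two pivots, which is why the cost per axis grows linearly with the number of objects.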

Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources

Data mining & fractals – Road map
• Motivation – problems / case study
• Definition of fractals and power laws
• Solutions to posed problems
• More examples

Problem #1 - spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol)
- ‘spiral’ and ‘elliptical’ galaxies
  (stores & households; mpg & MTBF...)
- patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability??

Problem #2: dim. reduction
• given attributes x1, ... xn
  – possibly, non-linearly correlated
• drop the useless ones
(Q: why? A: to avoid the ‘dimensionality curse’)

Answer:
• Fractals / self-similarities / power laws

What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
[Figure: Sierpinski triangle construction]
... zero area; infinite length!

Definitions (cont’d)
• Paradox: Infinite perimeter; Zero area!
• ‘dimensionality’: between 1 and 2
• actually: Log(3)/Log(2) = 1.58… (long story)

Intrinsic (‘fractal’) dimension
E.g.: #cylinders; miles / gallon

  x   y
  5   1
  4   2
  3   3
  2   4

• Q: fractal dimension of a line?
• A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a)
• Q: fd of a plane?
• A: nn ( <= r ) ~ r^2
fd == slope of (log(nn) vs log(r))

Sierpinski triangle
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
== ‘correlation integral’

Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
• Conclusions
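
Speaker note (illustrative, not from the original deck): to make the ‘fd == slope of log(nn) vs log(r)’ definition concrete, the sketch below counts the pairs within distance r for a few radii and fits the slope by least squares. It is a brute-force O(N^2) illustration rather than the fast algorithm a real tool would use; the names `correlation_dimension` and `sierpinski`, the sample sizes and the radii are assumptions made for the example.

```python
import bisect
import math
import random

def correlation_dimension(points, radii):
    """Slope of log(#pairs within <= r) versus log(r), by least squares."""
    dists = sorted(
        math.hypot(p[0] - q[0], p[1] - q[1])
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )
    xs, ys = [], []
    for r in radii:
        pairs = bisect.bisect_right(dists, r)  # number of pairs with distance <= r
        if pairs > 0:
            xs.append(math.log(r))
            ys.append(math.log(pairs))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

def sierpinski(n, rng):
    """n points of the Sierpinski triangle, generated with the 'chaos game'."""
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.25, 0.25
    pts = []
    for step in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway to a random corner
        if step >= 20:  # discard the first few steps, which may lie off the set
            pts.append((x, y))
    return pts

rng = random.Random(0)
radii = [0.01 * 2 ** k for k in range(5)]  # 0.01 ... 0.16
line = [(t, t) for t in (rng.random() for _ in range(500))]
fd_line = correlation_dimension(line, radii)
fd_tri = correlation_dimension(sierpinski(1000, rng), radii)
print("line:", round(fd_line, 2), "Sierpinski:", round(fd_tri, 2))
```

With samples this small the estimates are only approximate; the line should come out near 1 and the Sierpinski triangle near Log(3)/Log(2).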

Solution#1: spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])
• clusters?
• separable?
• attraction/repulsion?
• data ‘scrubbing’ – duplicates?

Solution#1: spatial d.m.
[w/ Seeger, Traina, Traina, SIGMOD00]
[Figure: log(#pairs within <= r) vs. log(r), one curve each for ell-ell, spi-spi and spi-ell pairs]
- 1.8 slope
- plateau!
- repulsion!

spatial d.m.
Heuristic on choosing # of clusters
[Figure: radii r1 and r2 marked]

Solution#1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r), one curve each for ell-ell, spi-spi and spi-ell pairs]
- 1.8 slope
- plateau!
- repulsion!!
- duplicates

Problem #2: Dim. reduction
[Figure: three point sets in the unit square, axes x and y from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike]

Solution:
• drop the attributes that don’t increase the ‘partial f.d.’ PFD
• dfn: PFD of attribute set A is the f.d. of the projected cloud of points
[w/ Traina, Traina, Wu, SBBD00]

Problem #2: dim. reduction
[Figure: panels (a) Quarter-circle, (b) Line, (c) Spike, with global FD=1; PFD labels along x: (a) PFD~1, (b) PFD=1, (c) PFD=0; PFD labels along y: PFD=1, PFD~1]
Notice: ‘max variance’ would fail here
Notice: SVD would fail here

Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
  – fractals
  – power laws
• Conclusions
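
Speaker note (illustrative, not from the original deck): the PFD idea, dropping attributes that do not increase the fractal dimension, can be sketched as a greedy backward elimination. This is a simplified stand-in, not the exact algorithm of [SBBD00]; the names `fractal_dim` and `select_attributes`, the `tolerance` threshold and the synthetic data (x, y = x^2, and an independent z) are assumptions made for the example.

```python
import bisect
import math
import random

def fractal_dim(points, attrs, radii):
    """Correlation dimension of `points` projected onto the attribute indices `attrs`."""
    proj = [[p[a] for a in attrs] for p in points]
    dists = sorted(
        math.sqrt(sum((u - v) ** 2 for u, v in zip(p, q)))
        for i, p in enumerate(proj)
        for q in proj[i + 1:]
    )
    xs = [math.log(r) for r in radii]
    # max(..., 1) guards against log(0) when no pair falls within a tiny radius.
    ys = [math.log(max(bisect.bisect_right(dists, r), 1)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

def select_attributes(points, radii, tolerance=0.3):
    """Greedy backward elimination: drop attributes whose removal barely lowers the f.d."""
    kept = list(range(len(points[0])))
    while len(kept) > 1:
        current = fractal_dim(points, kept, radii)
        # PFD of every candidate attribute set that omits exactly one attribute.
        trials = [
            (fractal_dim(points, [a for a in kept if a != drop], radii), drop)
            for drop in kept
        ]
        best_pfd, drop = max(trials)
        if current - best_pfd > tolerance:
            break  # every remaining attribute contributes to the f.d.
        kept.remove(drop)
    return kept

rng = random.Random(1)
# Attribute 1 is a non-linear function of attribute 0; attribute 2 is independent.
data = [(x, x * x, z) for x, z in ((rng.random(), rng.random()) for _ in range(500))]
kept = select_attributes(data, radii=[0.02, 0.04, 0.08, 0.16])
print("attributes kept:", kept)
```

Because y is fully determined by x, one of the two is redundant even though their correlation is non-linear; the independent attribute z has to survive.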

disk traffic
• Not Poisson, not(?) iid - BUT: self-similar
• How to model it?
[Figure: #bytes vs. time]

traffic
• disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02])
[Figure: #bytes vs. time, split 20% / 80%]

Traffic
Many other time-sequences are bursty/clustered: (such as?)

Tape accesses
# tapes needed, to retrieve n records?
(# days down, due to failures / hurricanes / communication noise...)
[Figure: accesses to Tape#1 ... Tape#N over time]

Tape accesses
[Figure: # tapes retrieved vs. # qual. records; curves: 50-50 = Poisson, real]

More apps: Brain scans
• Oct-trees; brain-scans
[Figure: Log(#octants) vs. octree levels; slope 2.63 = fd]

GIS points
Cross-roads of Montgomery county:
• any rules?

GIS
A: self-similarity:
• intrinsic dim. = 1.51
• avg#neighbors(<= r ) = r^D
[Figure: log(#pairs(within <= r)) vs. log(r); slope 1.51]

Examples: LB county
• Long Beach county of CA (road end-points)

More fractals:
• cardiovascular system: 3 (!)
• stock prices (LYCOS) - random walks: 1.5
[Figure: stock price over 1 year and 2 years]
• Coastlines: 1.2-1.58 (?)

Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
  – fractals
  – power laws
• Conclusions

Fractals <-> Power laws
self-similarity ->
• <=> fractals
• <=> scale-free
• <=> power-laws (y=x^a, F=C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]

Zipf’s law
Bible RANK-FREQUENCY plot (in log-log scales)
[Figure: log(freq) vs. log(rank); the top-ranked words are “the” and “and”]
Zipf’s (first) Law:

Zipf’s law
• similarly for first names (slope ~-1)
• last names (~ -0.7)
• etc

More power laws
• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Figure: amplitude vs. day; log(count) vs. magnitude]

Clickstream data
<url, u-id, ....>
Web Site Traffic
[Figure: log(count) vs. log(freq); Zipf]

Lotka’s law
• library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001)
[Figure: log(count) vs. log(#citations); J. Ullman marked]

Korcak’s law
Scandinavian lakes: area vs complementary cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]

More power laws: Korcak
Japan islands: area vs cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]

(Korcak’s law: Aegean islands)

Olympic medals:
[Figure: log(# medals) vs. log rank; Russia, China and USA marked; linear fit y = -0.9676x + 2.3054, R2 = 0.9458]

SALES data – store#96
[Figure: count of products vs. # units sold]

TELCO data
[Figure: count of customers vs. # of service units]

More power laws on the Internet
degree vs rank, for Internet domains (log-log) [sigcomm99]
[Figure: log(degree) vs. log(rank); slope -0.82]

Even more power laws:
• Income distribution (Pareto’s law)
• duration of UNIX jobs [Harchol-Balter]
• Distribution of UNIX file sizes
• Web graph [CLEVER-IBM; Barabasi]

Overall Conclusions:
‘Find similar/interesting things’ in multimedia databases
• Indexing: feature extraction (‘GEMINI’)
  – automatic feature extraction: FastMap
  – Relevance feedback: FALCON

Conclusions - cont’d
• New tools for Data Mining: Fractals/power laws:
  – appear everywhere
  – lead to skewed distributions (so the usual Gaussian, Poisson, uniformity and independence assumptions fail)
  – ‘correlation integral’ for separability/cluster detection
  – PFD for dimensionality reduction

Resources:
• Software and papers:
  – www.cs.cmu.edu/~christos
  – Fractal dimension (FracDim)
  – Separability (sigmod 2000, kdd2001)
  – Relevance feedback for query by content (FALCON – vldb 2000)

Resources
• Manfred Schroeder, “Fractals, Chaos, Power Laws”
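
Speaker note (illustrative, not from the original deck): the rank-frequency plots used for Zipf’s law in the power-law slides can be reproduced with a few lines. The sketch below ranks items by frequency and fits the slope of log(freq) against log(rank); the function name `rank_frequency_slope` and the synthetic corpus, in which the item of rank r occurs about 1000/r times, are assumptions made for the example and stand in for real text.

```python
import math
from collections import Counter

def rank_frequency_slope(tokens):
    """Least-squares slope of log(frequency) versus log(rank)."""
    counts = [c for _, c in Counter(tokens).most_common()]  # sorted, most frequent first
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

# Synthetic corpus: the word of rank r occurs about 1000 / r times.
tokens = [f"w{r}" for r in range(1, 51) for _ in range(round(1000 / r))]
slope = rank_frequency_slope(tokens)
print("rank-frequency slope:", round(slope, 3))
```

A slope close to -1 on log-log axes is the pattern the Zipf slides report for words and first names; real data would be noisier, especially in the low-frequency tail.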