Indexing and Data Mining in Multimedia Databases
Christos Faloutsos, CMU
www.cs.cmu.edu/~christos

Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources

USC 2001 C. Faloutsos

Problem
Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
• Allow fast, approximate queries, and
• Find rules/patterns

Sample queries
• Similarity search
  – Find pairs of branches with similar sales patterns
  – Find medical cases similar to Smith's
  – Find pairs of sensor series that move in sync
  – Find shapes like a spark-plug

Sample queries – cont’d
• Rule discovery
  – Clusters (of branches; of sensor data; ...)
  – Forecasting (total sales for next year?)
  – Outliers (e.g., unexpected part failures; fraud detection)

Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Related projects @ CMU and resources

Indexing - Multimedia
Problem:
• given a set of (multimedia) objects,
• find the ones similar to a desired query object

[Figure: three stock-price series, $price vs. day (1 to 365)]
distance function: by expert

‘GEMINI’ - Pictorially
[Figure: each series S1 ... Sn ($price vs. day, 1 to 365) is mapped to a feature point F(S1) ... F(Sn); feature axes: e.g., avg and e.g., std]

Remaining issues
• how to extract features automatically?
• how to merge similarity scores from different media?

Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
  – Visualization: FastMap
  – Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions

FastMap
Distance matrix:
       O1   O2   O3   O4   O5
O1      0    1    1  100  100
O2      1    0    1  100  100
O3      1    1    0  100  100
O4    100  100  100    0    1
O5    100  100  100    1    0
[Figure: how to place O1 ... O5 as points (??), with distances labeled ~1 and ~100]
FastMap
• Multi-dimensional scaling (MDS) can do that, but in O(N**2) time
• We want a linear algorithm: FastMap [SIGMOD95]

Applications: time sequences
• given n co-evolving time sequences
• visualize them + find rules [ICDE00]
[Figure: exchange rate vs. time for DEM, JPY, HKD]

Applications - financial
• currency exchange rates [ICDE00]
[Figure: plot of currencies; labels: FRF, DEM, GBP, JPY, HKD, USD, USD(t), USD(t-5)]

Application: VideoTrails [ACM MM97]

VideoTrails - usage
• scene-cut detection (about 10% errors)
• scene classification (e.g., dialogue vs action)

Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
  – Visualization: FastMap
  – Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions

Merging similarity scores
• e.g., video: text, color, motion, audio
  – weights change with the query!
• solution 1: user specifies weights
• solution 2: user gives examples
  – and we ‘learn’ what he/she wants: rel. feedback (Rocchio, MARS, MindReader)
  – but: how about disjunctive queries?

‘FALCON’
[Figure: stock curves shaped like ‘Vs’ and ‘inverted Vs’]
Trader wants only ‘unstable’ stocks

“Single query point” methods
[Figure: three panels (Rocchio, MARS, MindReader), each showing a single query point x among the positive examples +]
The averaging effect in action...

Main idea: FALCON Contours [Wu+, vldb2000]
[Figure: positive examples (+) with contours; axes: feature1 (e.g., temperature), feature2 (e.g., frequency)]

Conclusions for indexing + visualization
• GEMINI: fast indexing, exploiting off-the-shelf SAMs
• FastMap: automatic feature extraction in O(N) time
• FALCON: relevance feedback for disjunctive queries
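
The slides name FastMap without spelling out its steps. As a rough illustration only, the sketch below follows the scheme as it is commonly described: pick two far-apart pivot objects, project every object onto the line through them with the cosine law, then repeat on the residual distances. The function name `fastmap`, the pivot heuristic details, and the variable names are assumptions for illustration, not taken from the talk; the input is the five-object distance matrix from the FastMap slide.

```python
import math

def fastmap(dist, k):
    """Map n objects to k-d points using only the distance matrix `dist`.

    Illustrative sketch; names and pivot choice are assumptions.
    """
    n = len(dist)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, axis):
        # Squared distance left over after the first `axis` coordinates.
        res = dist[i][j] ** 2
        for h in range(axis):
            res -= (coords[i][h] - coords[j][h]) ** 2
        return max(res, 0.0)

    for axis in range(k):
        # Pivot heuristic: start at object 0, hop to the farthest object twice.
        a = 0
        b = max(range(n), key=lambda j: d2(a, j, axis))
        a = max(range(n), key=lambda j: d2(b, j, axis))
        dab2 = d2(a, b, axis)
        if dab2 == 0.0:
            break  # nothing left to separate
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection of object i onto the line through a and b.
            coords[i][axis] = (d2(a, i, axis) + dab2 - d2(b, i, axis)) / (2 * dab)
    return coords

# Distance matrix from the FastMap slide (objects O1 ... O5).
D = [
    [0, 1, 1, 100, 100],
    [1, 0, 1, 100, 100],
    [1, 1, 0, 100, 100],
    [100, 100, 100, 0, 1],
    [100, 100, 100, 1, 0],
]
points = fastmap(D, 2)
for name, p in zip(["O1", "O2", "O3", "O4", "O5"], points):
    print(name, [round(c, 3) for c in p])
```

Each coordinate needs one pass over the objects, so the cost grows linearly with N for a fixed target dimension k. In this example the first coordinate alone already places O1, O2, O3 close together and far from O4, O5.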

Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources

Data mining & fractals – Road map
• Motivation – problems / case study
• Definition of fractals and power laws
• Solutions to posed problems
• More examples

Problem #1 - spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol)
– ‘spiral’ and ‘elliptical’ galaxies (stores & households; mpg & MTBF...)
– patterns? (not Gaussian; not uniform)
– attraction/repulsion?
– separability??

Problem #2: dim. reduction
• given attributes x1, ... xn
  – possibly, non-linearly correlated
• drop the useless ones
(Q: why? A: to avoid the ‘dimensionality curse’)

Answer:
• Fractals / self-similarities / power laws

What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
[Figure: Sierpinski triangle]
zero area; infinite length!

Definitions (cont’d)
• Paradox: Infinite perimeter; Zero area!
• ‘dimensionality’: between 1 and 2
• actually: Log(3)/Log(2) = 1.58… (long story)

Intrinsic (‘fractal’) dimension
E.g.: #cylinders; miles / gallon
x   y
5   1
4   2
3   3
2   4
• Q: fractal dimension of a line?
• A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a)
• Q: fd of a plane?
• A: nn ( <= r ) ~ r^2
fd == slope of (log(nn) vs log(r))

Sierpinski triangle
[Figure: log(#pairs within <= r) vs. log(r), == ‘correlation integral’; slope 1.58]

Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
• Conclusions
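
The definition above (fd == slope of log(#pairs within <= r) vs. log(r)) can be tried out numerically. The following is a minimal sketch under stated assumptions: the helper names `pairs_within` and `loglog_slope`, the brute-force pair counting, and the choice of radii are illustrative and not from the talk. It estimates the slope for points laid out on a straight line in 2-d, where the slides predict a value near 1.

```python
import math

def pairs_within(points, r):
    """Count unordered pairs of points at distance <= r (brute force)."""
    return sum(
        1
        for i, p in enumerate(points)
        for q in points[i + 1:]
        if math.dist(p, q) <= r
    )

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# 400 points on the diagonal of the unit square: a line embedded in 2-d.
n = 400
line = [(i / n, i / n) for i in range(n)]

# Radii well inside the data extent, doubling each time (an arbitrary choice).
radii = [0.02, 0.04, 0.08, 0.16]
counts = [pairs_within(line, r) for r in radii]
fd_line = loglog_slope(radii, counts)
print("estimated fractal dimension of a line:", round(fd_line, 2))
```

The brute-force count is quadratic in the number of points, which is fine for a demonstration; the estimate also depends on the range of radii, since the power law only holds between the point spacing and the overall size of the data set.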

Solution#1: spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])
• clusters?
• separable?
• attraction/repulsion?
• data ‘scrubbing’ – duplicates?

Solution#1: spatial d.m. [w/ Seeger, Traina, Traina, SIGMOD00]
[Figure: log(#pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]
– 1.8 slope
– plateau!
– repulsion!!
– duplicates

spatial d.m.
Heuristic on choosing # of clusters
[Figure: characteristic scales r1 and r2]

Problem #2: Dim. reduction
[Figure: three point sets in the x-y plane: (a) Quarter-circle, (b) Line, (c) Spike]

Solution:
• drop the attributes that don’t increase the ‘partial f.d.’ PFD
• dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]

Problem #2: dim. reduction
[Figure: (a) Quarter-circle, (b) Line, (c) Spike, each annotated with global FD=1 and per-attribute PFD values (PFD~1, PFD=1, PFD=0)]
Notice: ‘max variance’ would fail here
Notice: SVD would fail here

Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
  – fractals
  – power laws
• Conclusions
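
To make the ‘partial f.d.’ idea concrete, here is a small sketch that keeps an attribute only if adding it raises the fractal dimension of the projected point cloud. It is an illustration under assumptions, not the procedure of the cited SBBD00 paper: the greedy forward pass, the threshold `tol`, the radii, and all function names are hypothetical choices. The synthetic rows have four attributes: x, two attributes fully determined by x (2x and x*x, the second one non-linearly), and an independent attribute w.

```python
import bisect
import math
import random

def fractal_dim(points, radii):
    """Slope of log(#pairs within r) vs. log(r), by least squares."""
    dists = sorted(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )
    lx = [math.log(r) for r in radii]
    # bisect_right gives the number of pair distances <= r.
    ly = [math.log(bisect.bisect_right(dists, r)) for r in radii]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

def select_attributes(rows, radii, tol=0.3):
    """Greedy pass: keep an attribute only if it raises the PFD by > tol."""
    kept, current = [], 0.0
    for a in range(len(rows[0])):
        trial = kept + [a]
        projected = [tuple(row[c] for c in trial) for row in rows]
        pfd = fractal_dim(projected, radii)
        if pfd > current + tol:
            kept, current = trial, pfd
    return kept

# Synthetic data (assumption): attributes 1 and 2 are functions of attribute 0.
rng = random.Random(0)
rows = []
for _ in range(300):
    x, w = rng.random(), rng.random()
    rows.append((x, 2 * x, x * x, w))

radii = [0.02, 0.04, 0.08, 0.16]
kept = select_attributes(rows, radii)
print("attributes kept:", kept)
```

Because x*x is a non-linear function of x, the point cloud over (x, x*x) is still a curve with dimension near 1, so the attribute adds nothing even though it is not linearly redundant. The threshold is a judgment call; it has to be larger than the estimation noise of the slope.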

disk traffic
• Not Poisson, not(?) iid - BUT: self-similar
• How to model it?
[Figure: #bytes vs. time]

traffic
• disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02])
[Figure: #bytes vs. time, split 20% / 80%]

Traffic
Many other time-sequences are bursty/clustered: (such as?)

Tape accesses
# tapes needed, to retrieve n records?
[Figure: Tape#1 ... Tape#N along a time axis]
(# days down, due to failures / hurricanes / communication noise...)

Tape accesses
[Figure: # tapes retrieved vs. # qual. records; curves: ‘50-50 = Poisson’ and ‘real’]

More apps: Brain scans
• Oct-trees; brain-scans
[Figure: Log(#octants) vs. octree levels; 2.63 = fd]

GIS points
Cross-roads of Montgomery county:
• any rules?

GIS
A: self-similarity:
• intrinsic dim. = 1.51
• avg#neighbors(<= r ) = r^D
[Figure: log(#pairs(within <= r)) vs. log(r); slope 1.51]

Examples: LB county
• Long Beach county of CA (road end-points)

More fractals:
• cardiovascular system: 3 (!)
• stock prices (LYCOS) - random walks: 1.5
• Coastlines: 1.2-1.58 (?)
[Figure: LYCOS stock price over 1 year and 2 years]

Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
  – fractals
  – power laws
• Conclusions

Fractals <-> Power laws
self-similarity ->
• <=> fractals
• <=> scale-free
• <=> power-laws (y=x^a, F=C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]

Zipf’s law
Bible RANK-FREQUENCY plot (in log-log scales)
[Figure: log(freq) vs. log(rank); top-ranked words “the” and “and”]
Zipf’s (first) Law:

Zipf’s law
• similarly for first names (slope ~ -1)
• last names (~ -0.7)
• etc.

More power laws
• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Figure: amplitude vs. day; log(count) vs. magnitude]
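
A rank-frequency plot like the ones on the Zipf slides can be reduced to a single number, the slope of log(freq) vs. log(rank). The sketch below is an illustration under assumptions: the function name `rank_frequency_slope` and the synthetic token list (word k occurs about 1000/k times) are made up for the example and are not data from the talk.

```python
import math
from collections import Counter

def rank_frequency_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    # most_common() returns (token, count) pairs sorted by descending count.
    freqs = [count for _, count in Counter(tokens).most_common()]
    lx = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ly = [math.log(f) for f in freqs]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Synthetic vocabulary (assumption): word w_k appears 1000 // k times.
tokens = [f"w{k}" for k in range(1, 51) for _ in range(1000 // k)]
slope = rank_frequency_slope(tokens)
print("rank-frequency slope:", round(slope, 2))
```

On real text the same function would take a list of words; a slope near -1 corresponds to the behavior the slides report for first names, and a flatter slope (around -0.7) to the one reported for last names.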

Clickstream data
<url, u-id, ....>
Web Site Traffic
[Figure: log(count) vs. log(freq); Zipf]

Lotka’s law
• library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001)
[Figure: log(count) vs. log(#citations); point labeled J. Ullman]

Korcak’s law
Scandinavian lakes: area vs complementary cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]

More power laws: Korcak
Japan islands; area vs cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]

(Korcak’s law: Aegean islands)

Olympic medals:
[Figure: log(# medals) vs. log(rank); points labeled USA, Russia, China; linear fit y = -0.9676x + 2.3054, R² = 0.9458]

SALES data – store#96
[Figure: count of products vs. # units sold]

TELCO data
[Figure: count of customers vs. # of service units]

More power laws on the Internet
degree vs rank, for Internet domains (log-log) [sigcomm99]
[Figure: log(degree) vs. log(rank); slope -0.82]

Even more power laws:
• Income distribution (Pareto’s law)
• duration of UNIX jobs [Harchol-Balter]
• Distribution of UNIX file sizes
• Web graph [CLEVER-IBM; Barabasi]

Overall Conclusions:
‘Find similar/interesting things’ in multimedia databases
• Indexing: feature extraction (‘GEMINI’)
  – automatic feature extraction: FastMap
  – Relevance feedback: FALCON

Conclusions - cont’d
• New tools for Data Mining: Fractals/power laws:
  – appear everywhere
  – lead to skewed distributions (rather than Gaussian, Poisson, uniformity, independence)
  – ‘correlation integral’ for separability/cluster detection
  – PFD for dimensionality reduction

Resources:
• Software and papers:
  – www.cs.cmu.edu/~christos
  – Fractal dimension (FracDim)
  – Separability (sigmod 2000, kdd2001)
  – Relevance feedback for query by content (FALCON – vldb 2000)

Resources
• Manfred Schroeder, “Fractals, Chaos, Power Laws”
