When Is "Nearest Neighbor" Meaningful? (Multimedia and Time-Series Data)

Group members: Terry Chan, Edward Chu, Dominic Leung, David Mak, Henry Yeung, Jason Yeung
CSIS 7101 Advanced Database

Contributions of the paper
- As dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor.
- The distinction between nearest and farthest neighbors may blur with as few as 15 dimensions.
- A simple linear scan almost always out-performs nearest-neighbor indexing techniques.

Part One: Introduction to Nearest Neighbor

What is the nearest neighbor (NN) problem?
- Given a collection of data points and a query point in an m-dimensional metric space, find the data point that is closest to the query point.

Nearest neighbor algorithm
1. Make two sets of nodes, set A and set B, and put all nodes into set B.
2. Put your starting node into set A.
3. Pick the node that is closest to the last node placed in set A and is not yet in set A; put this closest neighboring node into set A.
4. Repeat step 3 until all nodes are in set A.

[Figure: a query point and its nearest neighbor.]

Practical applications of NN search
- Medical imaging
- Molecular biology
- Spatial and multimedia databases

Adaptable similarity search
- In a multimedia database, given an image database, one may want to retrieve all images that are similar to a given query image; the data domain is high-dimensional.
- Color-oriented similarity of images, on an image database.
- Shape-oriented similarity of images, aimed at the level of individual pixels.
- Shape-oriented similarity of 3-D objects, on a 3-D protein database.
- Approximation-based shape similarity of 3-D surface segments: measures the similarity of 3-D segments using geometric approximations.
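The four steps above can be sketched in Python. This is a minimal illustration, not code from the paper; the function name and the choice of Euclidean distance are my own.

```python
import math

def nearest_neighbor_order(points, start=0):
    """Greedy nearest-neighbor ordering, following steps 1-4 above.

    points: list of m-dimensional tuples; start: index of the starting node.
    `order` plays the role of set A, `remaining` the role of set B.
    """
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    remaining = set(range(len(points)))   # step 1: put all nodes into set B
    order = [start]                       # step 2: starting node goes into set A
    remaining.remove(start)
    while remaining:                      # step 4: repeat until all nodes are in set A
        last = points[order[-1]]
        # step 3: closest node to the last node placed in set A
        nxt = min(remaining, key=lambda i: dist(points[i], last))
        order.append(nxt)
        remaining.remove(nxt)
    return order

order = nearest_neighbor_order([(0, 0), (5, 5), (1, 0), (6, 5)])
```

Starting from (0, 0), the greedy chain visits (1, 0), then (5, 5), then (6, 5), i.e. the order `[0, 2, 1, 3]`.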
Exception case
- When the distance between the nearest neighbor and every other point in the data set is small, we call the query unstable.

Part Two: Nearest Neighbor in High Dimensions

Unstable query
- A nearest neighbor query is unstable for a given ε if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor.

NN in high-dimensional space
- The paper proves that the concept of NN becomes meaningless as the dimensionality m increases, provided a pre-condition holds: as m increases, the difference in distance between the query point and all data points becomes negligible (i.e., the query becomes unstable).
- Assumption for the pre-condition to hold: the data distribution and the query distribution are IID in all dimensions (the result does not cover, e.g., unique dimensions with correlation between all dimensions; see below).

What is IID?
- Independent and identically distributed: the distribution of values in each dimension is identical (e.g., all values are uniformly distributed, or every dimension has the same skew) and the dimensions are mutually independent.

When high-dimensional indexing can be meaningful
- When the dimensions of both the query point and the data points follow identical distributions but are completely dependent (i.e., the value in D1 = the value in D2 = ...), the data points and the query point all lie on a diagonal line, and the query can be converted to a 1-D NN problem.

[Figure: every dimension has the same value, so all data points lie on the diagonal of the X-Y-Z cube.]

When high-dimensional indexing can be meaningful (cont'd)
- When the underlying dimensionality is much lower than the actual dimensionality, e.g., a 3-D data set in which the Z coordinate is the same for all points.
- When the query point is within some small distance of a data point (instead of being required to be identical to a data point); the result of the query is then all points within the closest cluster, not just the nearest point.

NN queries on clustered data
- E.g., data falls into discrete classes or clusters in some potentially high-dimensional feature space.

[Figure: a query point and its nearest cluster.]

Distribution of distances in clustered data
- Points in the query's own cluster are close, so NN is meaningful; points in the other clusters are all far away.

Experimental studies of NN
- Goal: find the rate of convergence of nearest to farthest distance.
- Based on three synthetic workloads and one real data set.
- NN can become unstable with as few as 10-20 dimensions; the convergence curve is exponential.
- In reality, the dimensionality might be 1000.

Correlated distributions
- Recursive and uniform workloads: NN not meaningful.
- Two-degrees-of-freedom workload: NN meaningful.

Part Three: Linear Scan Is Powerful

NN indexing vs. linear scan
- A linear scan can handily beat NN indexing.
- NN indexing is meaningful only when the data consists of small, well-formed clusters, and the query is guaranteed to land in or very near one of these clusters.

Why linear scan wins
- Reading a set of sequentially arranged disk pages is much faster than unordered retrieval of the same pages.
- Fetching a large number of data pages through a multi-dimensional index usually results in unordered retrieval.

Linear scan outperforms
- Both the SS-tree and the R*-tree at 10 dimensions, in all cases.
- The SR-tree in all cases on a 16-dimensional synthetic data set.

Justification
- All of the reported performance studies examined situations in which the distance from the query point to the NN differed little from the distances to the other data points.
- In reality, the situation might be different.

Other related work
- Dimensionality curse
- Fractal dimensions

Dimensionality curse
- A vague indication that high dimensionality causes problems in some situations; the NN problem is one example.
- "Boundary effects" are not taken into account for NN queries in the high-dimensional case.

Fractal dimensions
- A measure of how "complicated" a self-similar figure (data set) is.
- NN queries become stable when the fractal dimensionality is low.
- In reality, real data sets do not exhibit fractal behavior.

Conclusion
- The effect of dimensionality on NN queries
- High
dimensional indexing can be meaningful
- How to evaluate NN workloads
- Linear scan outperforms NN processing techniques on some meaningful workloads

References
- Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft. When Is "Nearest Neighbor" Meaningful?
- Thomas Seidl. Adaptable Similarity Search in 3-D Spatial Database Systems. http://www.dbs.informatik.unimuenchen.de/Forschung/Similarity/
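The ε-instability definition from Part Two can be checked numerically. The sketch below is my own illustration, not code from the paper: the function name `is_unstable` and the 95% threshold standing in for "most data points" are assumptions. With IID uniform data, a query that is stable in 2 dimensions becomes unstable in 100 dimensions because all distances concentrate around the same value.

```python
import math
import random

def is_unstable(query, data, eps, frac=0.95):
    """Epsilon-instability check: the query is unstable if at least `frac`
    of the data points (a stand-in for "most") lie within (1 + eps) times
    the distance from the query to its nearest neighbor."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    dists = [dist(query, p) for p in data]
    d_min = min(dists)
    near = sum(d <= (1 + eps) * d_min for d in dists)
    return near / len(dists) >= frac

random.seed(0)  # fixed seed so the demonstration is reproducible

def iid_uniform(n, dim):
    """n points drawn IID uniformly from the unit hypercube [0, 1]^dim."""
    return [tuple(random.random() for _ in range(dim)) for _ in range(n)]

low  = is_unstable(iid_uniform(1, 2)[0],   iid_uniform(500, 2),   eps=1.0)
high = is_unstable(iid_uniform(1, 100)[0], iid_uniform(500, 100), eps=1.0)
print(low, high)
```

In 2 dimensions only a handful of points fall within twice the nearest-neighbor distance, so the query is stable; in 100 dimensions essentially every point does, matching the paper's claim that IID high-dimensional queries become unstable.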
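The linear-scan baseline from Part Three is just one sequential pass over the data, which is exactly the access pattern that favors sequential disk reads. A minimal sketch (the function name is my own):

```python
import math

def linear_scan_nn(query, data):
    """Brute-force nearest neighbor: a single sequential pass over all
    points, returning (index of nearest point, its distance)."""
    best_idx, best_d2 = -1, float("inf")
    for i, p in enumerate(data):
        # compare squared distances; take the square root only once at the end
        d2 = sum((a - b) ** 2 for a, b in zip(query, p))
        if d2 < best_d2:
            best_idx, best_d2 = i, d2
    return best_idx, math.sqrt(best_d2)

idx, d = linear_scan_nn((0.9, 1.2), [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)])
```

Here the nearest point is (1.0, 1.0) at index 2. Unlike a multi-dimensional index, this scan touches every point, but each access is sequential, which is why it wins once distances concentrate in high dimensions.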