Similarity Search in High
Dimensions via Hashing
Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presented by:
Fatih Uzun
Outline
• Introduction
• Problem Description
• Key Idea
• Experiments and Results
• Conclusions
Introduction
• Similarity Search over High-Dimensional Data
– Image databases, document collections, etc.
• Curse of Dimensionality
– All space-partitioning techniques degrade to linear search in high dimensions
• Exact vs. Approximate Answer
– An approximate answer may be good enough and much faster
– Time-quality trade-off
Problem Description
• ε-Nearest Neighbor Search (ε-NNS)
– Given a set P of points in a normed space, preprocess P so as to efficiently return a point p ∈ P for any given query point q, such that
• dist(q,p) ≤ (1 + ε) · min_{r ∈ P} dist(q,r)
• Generalizes to K-nearest-neighbor search (K > 1)
Key Idea
• Locality-Sensitive Hashing (LSH) to get sub-linear dependence on the data size for high-dimensional data
• Preprocessing:
– Hash the data points using several LSH functions so that the probability of collision is higher for closer objects
Algorithm: Preprocessing
• Input
– Set of N points {p1, …, pN}
– L (number of hash tables)
• Output
– Hash tables Ti, i = 1, 2, …, L
• Foreach i = 1, 2, …, L
– Initialize Ti with a random hash function gi(·)
• Foreach i = 1, 2, …, L
– Foreach j = 1, 2, …, N
• Store point pj in bucket gi(pj) of hash table Ti
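The preprocessing loop above can be sketched in Python. This is a minimal sketch, assuming points are binary vectors (the paper embeds data into Hamming space, where sampling coordinates is a valid LSH family); names such as `build_tables` are illustrative, not from the paper.

```python
import random
from collections import defaultdict

def build_tables(points, L, k, d, seed=0):
    """Hash every point into L tables; each g_i samples k random coordinates."""
    rng = random.Random(seed)
    # Each g_i is defined by k randomly chosen coordinate indices.
    hash_fns = [tuple(rng.randrange(d) for _ in range(k)) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for j, p in enumerate(points):
        for i, g in enumerate(hash_fns):
            bucket = tuple(p[c] for c in g)  # g_i(p_j)
            tables[i][bucket].append(j)      # store p_j in bucket g_i(p_j)
    return hash_fns, tables
```

Each point is stored once per table, so the index uses O(N · L) bucket entries — the "extra storage overhead" noted in the conclusions.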
LSH - Algorithm
[Diagram: each point pi ∈ P is hashed by g1, …, gL into buckets g1(pi), …, gL(pi) of hash tables T1, …, TL.]
Algorithm: ε-NNS Query
• Input
– Query point q
– K (number of approximate nearest neighbors)
• Access
– Hash tables Ti, i = 1, 2, …, L
• Output
– Set S of K (or fewer) approximate nearest neighbors
• S ← ∅
• Foreach i = 1, 2, …, L
– S ← S ∪ {points found in bucket gi(q) of hash table Ti}
LSH - Analysis
• Family H of (r1, r2, p1, p2)-sensitive functions {hi(·)}
– dist(p,q) < r1 ⇒ ProbH[h(q) = h(p)] ≥ p1
– dist(p,q) ≥ r2 ⇒ ProbH[h(q) = h(p)] ≤ p2
– p1 > p2 and r1 < r2
• LSH functions: gi(·) = (h1(·), …, hk(·))
• For a proper choice of k and L, a simpler problem, (r, ε)-Neighbor, and hence the actual problem can be solved
• Query time: O(d · n^(1/(1+ε)))
– d: dimensions, n: data size
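As a concrete instance of the definition: for single-bit sampling in Hamming space, h_j(p) = p[j] with j uniform over the d coordinates, the collision probability is exactly Pr[h(p) = h(q)] = 1 − dist(p,q)/d, so the family is (r1, r2, 1 − r1/d, 1 − r2/d)-sensitive for any r1 < r2. A quick exact check (the example vectors are made up for illustration):

```python
def collision_prob(p, q):
    """Probability that a uniformly chosen coordinate agrees on p and q."""
    matches = sum(x == y for x, y in zip(p, q))
    return matches / len(p)  # exact: averages over the d choices of j

near = collision_prob((0,0,0,0,0,0,0,0), (0,0,0,0,0,0,0,1))  # dist 1 -> 7/8
far  = collision_prob((0,0,0,0,0,0,0,0), (1,1,1,1,0,0,0,0))  # dist 4 -> 1/2
```

Closer pairs collide more often (p1 > p2), and concatenating k such bits into gi(·) drives the collision probability for far pairs toward zero while L independent tables keep the probability for near pairs high.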
Experiments
• Data Sets
– Color images from the COREL Draw library (20,000 points, dimensions up to 64)
– Texture information of aerial photographs (270,000 points, 60 dimensions)
• Evaluation
– Speed, miss ratio, and error (%) for various data sizes, dimensions, and K values
– Compare performance with the SR-Tree (a spatial data structure)
Performance Measures
• Speed
– Number of disk block accesses needed to answer the query (proportional to the number of hash tables)
• Miss Ratio
– Fraction of cases in which fewer than K points are found for K-NNS
• Error
– Average fractional error in the distance to the point found by LSH, relative to the true nearest-neighbor distance, taken over the entire set of queries
Speed vs. Data Size
[Chart: Approximate 1-NNS. Disk accesses (0–20) vs. number of database points (0–20,000) for LSH at error = 0.2, 0.1, 0.05, 0.02 and for the SR-Tree.]
Speed vs. Dimension
[Chart: Approximate 1-NNS. Disk accesses (0–20) vs. dimensions (0–80) for LSH at error = 0.2, 0.1, 0.05, 0.02 and for the SR-Tree.]
Speed vs. Nearest Neighbors
[Chart: Approximate K-NNS. Disk accesses (0–16) vs. number of nearest neighbors (0–120) for LSH at error = 0.2, 0.1, 0.05.]
Speed vs. Error
[Chart: Disk accesses (0–450) vs. error (10–50%) for the SR-Tree and LSH.]
Miss Ratio vs. Data Size
[Chart: Approximate 1-NNS. Miss ratio (0–0.25) vs. number of database points (0–20,000) for error = 0.1 and error = 0.05.]
Conclusion
• Better query time than spatial data structures
• Scales well to higher dimensions and larger data sizes (sub-linear dependence)
• Predictable running time
• Extra storage overhead
• Inefficient for data with distances concentrated around the average
Future Work
• Investigate hybrid data structures obtained by merging tree-based and hash-based structures
• Make use of the structure of the data set to systematically obtain LSH functions
• Explore other applications of LSH-type techniques to data mining
Questions?