Similarity Search in High
Dimensions via Hashing
Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presented by:
Fatih Uzun
Outline
• Introduction
• Problem Description
• Key Idea
• Experiments and Results
• Conclusions
Introduction
• Similarity Search over High-Dimensional Data
– Image databases, document collections, etc.
• Curse of Dimensionality
– All space-partitioning techniques degrade to linear search in high dimensions
• Exact vs. Approximate Answer
– An approximate answer may be good enough and much faster
– Time-quality trade-off
Problem Description
• ε-Nearest Neighbor Search (ε-NNS)
– Given a set P of points in a normed space, preprocess P so as to efficiently return a point p ∈ P for any given query point q, such that
• dist(q,p) ≤ (1 + ε) · min_{r ∈ P} dist(q,r)
• Generalizes to K-nearest-neighbor search (K > 1)
Key Idea
• Locality-Sensitive Hashing (LSH) to get sub-linear dependence on the data size for high-dimensional data
• Preprocessing:
– Hash the data points using several LSH functions so that the probability of collision is higher for closer objects
Algorithm: Preprocessing
• Input
– Set of N points {p1, …, pN}
– L (number of hash tables)
• Output
– Hash tables Ti, i = 1, 2, …, L
• Foreach i = 1, 2, …, L
– Initialize Ti with a random hash function gi(·)
• Foreach i = 1, 2, …, L
– Foreach j = 1, 2, …, N
• Store point pj in bucket gi(pj) of hash table Ti
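The preprocessing loop above can be sketched in Python. This is a minimal sketch, assuming points are binary vectors (the paper embeds data into Hamming space, where sampling coordinates is a valid LSH family); names such as `build_tables` are illustrative, not from the paper.

```python
import random
from collections import defaultdict

def build_tables(points, L, k, d, seed=0):
    """Hash every point into L tables; each g_i samples k random coordinates."""
    rng = random.Random(seed)
    # Each g_i is defined by k randomly chosen coordinate indices.
    hash_fns = [tuple(rng.randrange(d) for _ in range(k)) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for j, p in enumerate(points):
        for i, g in enumerate(hash_fns):
            bucket = tuple(p[c] for c in g)  # g_i(p_j)
            tables[i][bucket].append(j)      # store p_j in bucket g_i(p_j)
    return hash_fns, tables
```

Each point is stored once per table, so the index uses O(N · L) bucket entries — the "extra storage overhead" noted in the conclusions.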
LSH - Algorithm
[Diagram: each point pi ∈ P is hashed by g1, …, gL into buckets g1(pi), …, gL(pi) of hash tables T1, …, TL.]
Algorithm: ε-NNS Query
• Input
– Query point q
– K (number of approximate nearest neighbors)
• Access
– Hash tables Ti, i = 1, 2, …, L
• Output
– Set S of K (or fewer) approximate nearest neighbors
• S ← ∅
• Foreach i = 1, 2, …, L
– S ← S ∪ {points found in bucket gi(q) of hash table Ti}
LSH - Analysis
• Family H of (r1, r2, p1, p2)-sensitive functions {hi(·)}
– dist(p,q) < r1 ⇒ ProbH[h(q) = h(p)] ≥ p1
– dist(p,q) ≥ r2 ⇒ ProbH[h(q) = h(p)] ≤ p2
– p1 > p2 and r1 < r2
• LSH functions: gi(·) = (h1(·), …, hk(·))
• For a proper choice of k and L, a simpler problem, (r, ε)-Neighbor, and hence the actual problem can be solved
• Query time: O(d · n^(1/(1+ε)))
– d: dimensions, n: data size
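As a concrete instance of the definition: for single-bit sampling in Hamming space, h_j(p) = p[j] with j uniform over the d coordinates, the collision probability is exactly Pr[h(p) = h(q)] = 1 − dist(p,q)/d, so the family is (r1, r2, 1 − r1/d, 1 − r2/d)-sensitive for any r1 < r2. A quick exact check (the example vectors are made up for illustration):

```python
def collision_prob(p, q):
    """Probability that a uniformly chosen coordinate agrees on p and q."""
    matches = sum(x == y for x, y in zip(p, q))
    return matches / len(p)  # exact: averages over the d choices of j

near = collision_prob((0,0,0,0,0,0,0,0), (0,0,0,0,0,0,0,1))  # dist 1 -> 7/8
far  = collision_prob((0,0,0,0,0,0,0,0), (1,1,1,1,0,0,0,0))  # dist 4 -> 1/2
```

Closer pairs collide more often (p1 > p2), and concatenating k such bits into gi(·) drives the collision probability for far pairs toward zero while L independent tables keep the probability for near pairs high.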
Experiments
• Data Sets
– Color images from the COREL Draw library (20,000 points, dimensions up to 64)
– Texture information of aerial photographs (270,000 points, 60 dimensions)
• Evaluation
– Speed, miss ratio, and error (%) for various data sizes, dimensions, and K values
– Compare performance with the SR-Tree (a spatial data structure)
Performance Measures
• Speed
– Number of disk block accesses needed to answer the query (proportional to the number of hash tables)
• Miss Ratio
– Fraction of cases in which fewer than K points are found for K-NNS
• Error
– Average fractional error in the distance to the point found by LSH, relative to the true nearest-neighbor distance, taken over the entire set of queries
Speed vs. Data Size
[Chart: Approximate 1-NNS. Disk accesses (0–20) vs. number of database points (0–20,000) for LSH at error = 0.2, 0.1, 0.05, 0.02 and for the SR-Tree.]
Speed vs. Dimension
[Chart: Approximate 1-NNS. Disk accesses (0–20) vs. dimensions (0–80) for LSH at error = 0.2, 0.1, 0.05, 0.02 and for the SR-Tree.]
Speed vs. Nearest Neighbors
[Chart: Approximate K-NNS. Disk accesses (0–16) vs. number of nearest neighbors (0–120) for LSH at error = 0.2, 0.1, 0.05.]
Speed vs. Error
[Chart: Disk accesses (0–450) vs. error (10–50%) for the SR-Tree and LSH.]
Miss Ratio vs. Data Size
[Chart: Approximate 1-NNS. Miss ratio (0–0.25) vs. number of database points (0–20,000) for error = 0.1 and error = 0.05.]
Conclusion
• Better query time than spatial data structures
• Scales well to higher dimensions and larger data sizes (sub-linear dependence)
• Predictable running time
• Extra storage overhead
• Inefficient for data with distances concentrated around the average
Future Work
• Investigate hybrid data structures obtained by merging tree-based and hash-based structures
• Make use of the structure of the data set to systematically obtain LSH functions
• Explore other applications of LSH-type techniques to data mining
Questions?