Algorithms for Nearest Neighbor Search
Piotr Indyk
MIT
Nearest Neighbor Search
• Given: a set P of n points in R^d
• Goal: a data structure, which given a query
point q, finds the nearest neighbor p of q
in P
[Diagram: a query point q and its nearest neighbor p in P]
Outline of this talk
• Variants
• Motivation
• Main memory algorithms:
– quadtrees
– kd-trees
– Locality Sensitive Hashing
• Secondary storage algorithms:
– R-tree (and its variants)
– VA-file
Variants of nearest neighbor
• Near neighbor (range search): find one/all
points in P within distance r from q
• Spatial join: given two sets P,Q, find all
pairs p in P, q in Q, such that p is within
distance r from q
• Approximate near neighbor: find one/all
points p’ in P, whose distance to q is at
most (1+ε) times the distance from q to its
nearest neighbor
Motivation
Depends on the value of d:
• low d: graphics, vision, GIS, etc
• high d:
– similarity search in databases (text, images etc)
– finding pairs of similar objects (e.g., copyright
violation detection)
– useful subroutine for clustering
Algorithms
• Main memory (Computational Geometry)
– linear scan
– tree-based:
• quadtree
• kd-tree
– hashing-based: Locality-Sensitive Hashing
• Secondary storage (Databases)
– R-tree (and numerous variants)
– Vector Approximation File (VA-file)
Quadtree
• Simplest spatial structure on Earth!
Quadtree ctd.
• Split the space into 2^d equal subsquares
• Repeat until done:
– only one pixel left
– only one point left
– only a few points left
• Variants:
– split only one dimension at a time
– k-d-trees (in a moment)
Range search
• Near neighbor (range search):
– put the root on the stack
– repeat
• pop the next node T from the stack
• for each child C of T:
– if C is a leaf, examine point(s) in C
– if C intersects with the ball of radius r around q, add C to
the stack
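A minimal Python sketch of this stack-based search, assuming an illustrative Node class (internal nodes hold children and a bounding box, leaves hold points); the class and helper names are not from the slides:

    import math

    class Node:
        # Illustrative quadtree node: internal nodes carry children, leaves carry points.
        def __init__(self, box, children=(), points=()):
            self.box = box                  # ((xmin, ymin), (xmax, ymax))
            self.children = list(children)  # empty for leaves
            self.points = list(points)      # non-empty only for leaves

    def min_dist_to_box(q, box):
        # Distance from q to the closest point of an axis-aligned box.
        lo, hi = box
        return math.sqrt(sum(max(l - c, 0.0, c - h) ** 2
                             for c, l, h in zip(q, lo, hi)))

    def range_search(root, q, r):
        # Report all points within distance r of q, following the slide's loop:
        # pop a node, examine leaves, push children whose box meets the ball around q.
        found, stack = [], [root]
        while stack:
            node = stack.pop()
            if not node.children:                          # leaf: examine its point(s)
                found += [p for p in node.points if math.dist(p, q) <= r]
            else:
                for child in node.children:
                    if min_dist_to_box(q, child.box) <= r:  # box intersects the ball
                        stack.append(child)
        return found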
Near neighbor ctd
Nearest neighbor
• Start range search with r = ∞
• Whenever a point is found, update r
• Only investigate nodes with respect to
current r
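A sketch of the same idea, reusing the hypothetical Node and min_dist_to_box helpers from the range-search sketch above (a plain depth-first version; ordering the stack by min_dist_to_box would prune even more):

    import math

    def nearest_neighbor(root, q):
        # Start with r = infinity and shrink it whenever a closer point is found;
        # only descend into boxes that can still contain something within r.
        best, r = None, math.inf
        stack = [root]
        while stack:
            node = stack.pop()
            if not node.children:                          # leaf: check its point(s)
                for p in node.points:
                    d = math.dist(p, q)
                    if d < r:
                        best, r = p, d
            else:
                for child in node.children:
                    if min_dist_to_box(q, child.box) <= r:
                        stack.append(child)
        return best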
Quadtree ctd.
• Simple data structure
• Versatile, easy to implement
• So why doesn’t this talk end here?
– Empty spaces: if the points form sparse clouds,
it takes a while to reach them
– Space exponential in dimension
– Time exponential in dimension, e.g., points on
the hypercube
Space issues: example
K-d-trees [Bentley’75]
• Main ideas:
– only one-dimensional splits
– instead of splitting in the middle, choose the
split “carefully” (many variations)
– near(est) neighbor queries: as for quadtrees
• Advantages:
– no (or less) empty spaces
– only linear space
• Exponential query time still possible
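Since the slide leaves the split rule open, here is a minimal sketch of one common variant (median split, cycling through the axes); it only illustrates why space stays linear and no region is empty:

    def build_kdtree(points, depth=0):
        # One-dimensional splits: cycle through the axes and cut at the median,
        # so every cell contains points (no empty space) and the tree has O(n) nodes.
        if len(points) <= 1:
            return {"leaf": points}
        axis = depth % len(points[0])
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return {"axis": axis, "split": points[mid][axis],
                "left": build_kdtree(points[:mid], depth + 1),
                "right": build_kdtree(points[mid:], depth + 1)}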
Exponential query time
• What does it mean exactly?
– Unless we do something really stupid, query time is at
most dn
– Therefore, the actual query time is
Min[ dn, exponential(d) ]
• This is still quite bad though, when the dimension
is around 20-30
• Unfortunately, it seems inevitable (both in theory
and practice)
Approximate nearest neighbor
• Can do it using (augmented) k-d trees, by
interrupting search earlier [Arya et al’94]
• Still exponential time (in the worst case)!
• Try a different approach:
– for exact queries, we can use binary search
trees or hashing
– can we adapt hashing to nearest neighbor
search?
Locality-Sensitive Hashing
[Indyk-Motwani’98]
• Hash functions are locality-sensitive if, for
a random hash function h and for any
pair of points p, q, we have:
– Pr[h(p)=h(q)] is “high” if p is “close” to q
– Pr[h(p)=h(q)] is “low” if p is “far” from q
Do such functions exist?
• Consider the hypercube, i.e.,
– points from {0,1}d
– Hamming distance D(p,q)= # positions on
which p and q differ
• Define hash function h by choosing a set I
of k random coordinates, and setting
h(p) = projection of p on I
Example
• Take
– d=10, p=0101110010
– k=2, I={2,5}
• Then h(p)=11
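The same example in Python, with a hypothetical make_hash helper that fixes a random coordinate set I once per hash function (coordinates are 1-indexed here only to match the slide):

    import random

    def project(p, I):
        # h(p): projection of the bit-string p on the coordinate set I (1-indexed).
        return ''.join(p[i - 1] for i in I)

    def make_hash(d, k, rng=None):
        # One locality-sensitive hash function: pick k random coordinates out of d.
        rng = rng or random.Random()
        I = sorted(rng.sample(range(1, d + 1), k))
        return lambda p: project(p, I)

    # The slide's example: d=10, p=0101110010, k=2, I={2,5}  ->  h(p) = "11"
    print(project("0101110010", [2, 5]))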
h’s are locality-sensitive
• Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k
• We can vary the probability by changing k
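A quick numeric check of that formula (the distances below are made up, purely for illustration): with d=10, a pair at Hamming distance 1 collides with probability 0.9, 0.81, 0.66 for k=1, 2, 4, while a pair at distance 6 collides with probability 0.4, 0.16, 0.03, so increasing k widens the gap between close and far pairs.

    d = 10
    for D in (1, 6):                 # Hamming distance D(p, q)
        for k in (1, 2, 4):          # number of sampled coordinates
            print(D, k, round((1 - D / d) ** k, 2))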
[Plots: Pr[h(p)=h(q)] as a function of distance, for k=1 and for k=2]
How can we use LSH?
• Choose several hash functions h_1..h_l
• Initialize a hash array for each h_i
• Store each point p in the bucket h_i(p) of the
i-th hash array, i=1...l
• In order to answer a query q
– for each i=1..l, retrieve the points in the bucket h_i(q)
– return the closest point found
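A toy Python sketch of this scheme for the Hamming case, reusing the hypothetical make_hash helper from above; the class name and parameters are illustrative, not from the slides:

    from collections import defaultdict
    import random

    def hamming(p, q):
        # D(p, q): the number of positions on which p and q differ.
        return sum(a != b for a, b in zip(p, q))

    class LSHIndex:
        def __init__(self, d, k, l, seed=0):
            rng = random.Random(seed)
            self.hs = [make_hash(d, k, rng) for _ in range(l)]   # h_1..h_l
            self.tables = [defaultdict(list) for _ in range(l)]  # one hash array each

        def insert(self, p):
            # Store p in the bucket h_i(p) of the i-th hash array, i = 1..l.
            for h, table in zip(self.hs, self.tables):
                table[h(p)].append(p)

        def query(self, q):
            # Retrieve the buckets h_i(q) and return the closest candidate found.
            candidates = {p for h, table in zip(self.hs, self.tables)
                          for p in table[h(q)]}
            return min(candidates, key=lambda p: hamming(p, q), default=None)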
What does this algorithm do?
• By a proper choice of the parameters k and l, we can
make, for any p, the probability that
h_i(p)=h_i(q) for some i
look like this:
• Can control:
– Position of the slope
– How steep it is
[Plot: collision probability as a function of distance]
The LSH algorithm
• Therefore, we can solve (approximately) the near
neighbor problem with given parameter r
• Worst-case analysis guarantees dn^(1/(1+ε)) query time
• Practical evaluation indicates much better behavior
[GIM’99,HGI’00,Buh’00,BT’00]
• Drawbacks:
– works best for the Hamming distance (although it can be generalized
to Euclidean space)
– requires the radius r to be fixed in advance
Secondary storage
• Seek time same as time needed to transfer
hundreds of KBs
• Grouping the data is crucial
• Different approach required:
– in main memory, any reduction in the number
of inspected points was good
– on disk, this is not the case !
Disk-based algorithms
• R-tree [Guttman’84]
– departing point for many variations
– over 600 citations! (according to CiteSeer)
– “optimistic” approach: try to answer queries in
logarithmic time
• Vector Approximation File [WSB’98]
– “pessimistic” approach: if we need to scan the whole
data set, we better do it fast
• LSH works also on disk
R-tree
• “Bottom-up” approach (the k-d-tree was “top-down”):
– Start with a set of points/rectangles
– Partition the set into groups of small cardinality
– For each group, find minimum rectangle
containing objects from this group
– Repeat
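A rough Python sketch of this bottom-up construction; the grouping heuristic here (sorting by the first coordinate) is only a stand-in for the many real packing strategies:

    def mbr(boxes):
        # Minimum bounding rectangle of a group of boxes, each given as (lo, hi) tuples.
        los, his = zip(*boxes)
        return (tuple(map(min, zip(*los))), tuple(map(max, zip(*his))))

    def build_rtree(boxes, fanout=4):
        # Bottom-up: partition the entries into small groups, wrap each group in its
        # MBR, and repeat one level at a time until a single root remains.
        nodes = [{"box": b, "children": []} for b in boxes]
        while len(nodes) > 1:
            nodes.sort(key=lambda n: n["box"][0][0])     # crude grouping heuristic
            groups = [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]
            nodes = [{"box": mbr([n["box"] for n in g]), "children": g} for g in groups]
        return nodes[0]

Points can be treated as degenerate boxes with lo == hi.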
R-tree ctd.
R-tree ctd.
• Advantages:
– Supports near(est) neighbor search (similar as
before)
– Works for points and rectangles
– Avoids empty spaces
– Many variants: X-tree, SS-tree, SR-tree etc
– Works well for low dimensions
• Not so great for high dimensions
VA-file [Weber, Schek, Blott’98]
• Approach:
– In high-dimensional spaces, all tree-based
indexing structures examine a large fraction of
the leaves
– If we need to visit so many nodes anyway, it is
better to scan the whole data set and avoid
performing seeks altogether
– 1 seek = transfer of a few hundred KB
VA-file ctd.
• Natural question: how to speed up the linear
scan?
• Answer: use approximation
– Use only i bits per dimension (and speed up the
scan by a factor of 32/i)
– Identify all points which could be returned as
an answer
– Verify the points using original data set
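A sketch of this filter-and-refine scan in Python with NumPy, assuming i=4 bits per dimension and an L2 range query; the names and bounds handling are illustrative, not the original VA-file layout:

    import numpy as np

    def va_range_query(data, q, r, bits=4):
        # Filter step: quantize every coordinate to `bits` bits and scan the codes.
        lo, hi = data.min(axis=0), data.max(axis=0)
        cells = 1 << bits
        width = (hi - lo) / cells
        width[width == 0] = 1.0                                  # guard constant dims
        grid = np.floor((data - lo) / width).clip(0, cells - 1)  # the approximation file
        cell_lo = lo + grid * width
        cell_hi = cell_lo + width
        # Lower bound on the true distance from q to any point inside its cell.
        lower = np.linalg.norm(np.clip(q, cell_lo, cell_hi) - q, axis=1)
        candidates = np.where(lower <= r)[0]
        # Refine step: verify the candidates against the original full-precision vectors.
        true_dist = np.linalg.norm(data[candidates] - q, axis=1)
        return candidates[true_dist <= r]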
Time to sum up
• “Curse of dimensionality” is indeed a curse
• In main memory, we can perform sublinear-time
search using trees or hashing
• In secondary storage, linear scan is pretty much all
we can do (for high dim)
• Personal thought: if linear search is all we can do,
we are not doing too well…
• Maybe it is time to buy a few GB of RAM
• …but in the end, everything depends on your data set
Resources
• Surveys:
– Berchtold & Keim: http://www.informatik.unihalle.de/~keim/PS/ICDE00.pdf
– Theodoridis: http://dias.cti.gr/~ytheod/research/ADBIS/handouts.pdf
– Agarwal et al (range searching): http://www.cs.duke.edu/~pankaj/papers.html
Resources
• Source code:
http://dias.cti.gr/~ytheod/research/indexing/
http://www.cs.sunysb.edu/~algorith/major_section/1.6.shtml
• References: see the surveys above, plus the very recent
– [Buh’00, BT’00]: J. Buhler et al: http://www.cs.washington.edu/homes/jbuhler/
– [HGI’00]: Haveliwala et al: http://theory.lcs.mit.edu/~indyk/webdb.ps
Contact
• If you have any question, feel free to e-mail
me at [email protected]
• Thank you!