Nearest Neighbor Queries*

Nick Roussopoulos    Stephen Kelley    Frédéric Vincent
Department of Computer Science
University of Maryland
College Park, MD 20742

Abstract

A frequently encountered type of query in Geographic Information Systems (GIS) is to find the k nearest neighbor objects to a given point in space. Processing such queries requires substantially different search algorithms than those for location or range queries. In this paper we present an efficient branch-and-bound R-tree traversal algorithm to find the nearest neighbor object to a point, and then generalize it to finding the k nearest neighbors. We also discuss metrics for an optimistic and a pessimistic search ordering strategy as well as for pruning. Finally, we present the results of several experiments obtained using the implementation of our algorithm and examine the behavior of the metrics and the scalability of the algorithm.

1 INTRODUCTION

The efficient implementation of Nearest Neighbor (NN) queries is of particular interest in Geographic Information Systems (GIS). For example, a user may point to a specific location or to an object on the screen and request the system to find the four nearest objects to it in the database. Another situation is when the user is not familiar with the layout of the spatial objects: finding the nearest object to a point of interest could require multiple unsuccessful attempts with windows of varying sizes if it were handled with traditional range queries. Another example is the case of an astrophysics database, where a query is to find the ten nearest stars which are at least five light-years away from a given point in the sky. The versatility of NN queries increases substantially if we consider all the variations of the problem, such as the k nearest neighbors, the k furthest neighbors, other spatial queries such as find the k NN to the East of a location, or even spatial joins with an NN join predicate, such as find the three closest restaurants for each of two different movie theaters.

Efficient processing of NN queries requires spatial data structures which capitalize on the proximity of the objects to focus the search on the potential neighbors only. R-trees [Gutt84], Packed R-trees [Rous85], [Kame94], and the R-tree variations [SRF87], [Beck90] have been primarily used for location and range queries and for spatial join queries [BKS93] based on overlap/containment. However, very few spatial access methods have been used for NN queries. In [Same90], heuristics are provided for finding the nearest objects in quadtrees. The exact k-NN problem is also posed for hierarchical spatial data structures such as the PM quadtree [Same89]. The proposed solution is a top-down recursive algorithm which first goes down the quadtree, exploring the subtree that contains the query point, in order to get a first estimate of the NN location. Then it backtracks and explores the remaining subtrees which potentially contain NN until no subtree needs be visited. In [FBF77] a NN algorithm for k-d-trees was proposed, which was later refined in [Spro91].

In this paper, we provide an efficient branch-and-bound search algorithm for processing exact k-NN queries on R-trees, introduce several metrics for ordering and pruning the search, and perform several experiments on synthetic and real-world data to demonstrate the performance and scalability of our approach. To the best of our knowledge, neither NN algorithms have been developed for R-trees, nor similar metrics for NN search. We would also like to point out that, although the algorithm and these metrics are presented in the context of R-trees, they are directly applicable to all other spatial data structures as well.

Section 2 of the paper contains the theoretical foundation for the nearest neighbor search. Section 3 describes the algorithm and the metrics for ordering the search and pruning during it. Section 4 has the experiments with the implementation of the algorithm. The conclusion is in Section 5.

* This research was sponsored partially by the National Science Foundation under grant BIR 9318183, by ARPA under contract 003195 Ve100ID, and by NASA/USRA under contract 5555-09.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
SIGMOD '95, San Jose, CA USA
© 1995 ACM 0-89791-731-6/95/0005...$3.50
2 NEAREST NEIGHBOR SEARCH USING R-TREES

R-trees were proposed as a natural extension of B-trees in higher than one dimensions [Gutt84]. They combine most of the nice features of both B-trees and quadtrees. Like B-trees, they remain balanced, while they maintain the flexibility of dynamically adjusting their grouping to deal with either dead-space or dense areas, like the quadtrees do. The decomposition used in R-trees is dynamic, driven by the spatial data objects, and with appropriate split algorithms, if a region of an n-dimensional space includes dead-space, no entry in the R-tree will be introduced for it.

Leaf nodes of the R-tree contain entries of the form (RECT, oid), where oid is an object-identifier used as a pointer to a data object and RECT is an n-dimensional Minimal Bounding Rectangle (MBR) which bounds the corresponding object. For example, in a 2-dimensional space, an entry RECT will be of the form (x_low, x_high, y_low, y_high), which represents the coordinates of the lower-left and upper-right corners of the rectangle. The spatial objects stored at the leaf level are considered atomic, i.e. they are not further decomposed into their spatial primitives such as quadrants, triangles, trapezoids, line segments, or pixels. Non-leaf R-tree nodes contain entries of the form (RECT, p), where p is a pointer to a successor node in the next level of the R-tree and RECT is a minimal rectangle which bounds all the entries in the descendent node. The term branching factor (or fan-out) can be used to specify the maximum number of entries that a node can have; each node of an R-tree with branching factor fifty, for example, points to a maximum of fifty descendants or leaf objects. To illustrate the way an R-tree is defined on some space, Figure 1 shows a collection of rectangles and Figure 2 the corresponding tree.

[Figure 1: Collection of Rectangles]
[Figure 2: R-tree Construction]

Performance of an R-tree search is measured by the number of disk accesses (reads) necessary to find (or not find) the desired object(s) in the database. So, the R-tree branching factor is chosen such that the size of a node is equal to (or a multiple of) the size of a disk block or file system page.
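As an illustration of the node layout just described, the following is a minimal Python sketch of leaf and non-leaf entries; the class and field names and the tuple-based rectangle encoding are assumptions made for this example only, not structures defined in the paper.

    # Minimal sketch of the R-tree node layout described above (names assumed).
    from dataclasses import dataclass
    from typing import List, Tuple, Union

    Rect = Tuple[float, float, float, float]   # (x_low, x_high, y_low, y_high)

    @dataclass
    class LeafEntry:          # (RECT, oid): MBR of one object plus its object identifier
        rect: Rect
        oid: int

    @dataclass
    class Node:
        entries: List[Union["LeafEntry", "NonLeafEntry"]]
        is_leaf: bool

    @dataclass
    class NonLeafEntry:       # (RECT, p): MBR bounding every entry of the child node
        rect: Rect
        child: Node

    # A branching factor of 50 means len(node.entries) <= 50 for every node;
    # in practice the factor is chosen so that a node fills one disk page.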
2.1 Metrics for Nearest Neighbor Search

Given a query point P and an object O enclosed in its MBR, we provide two metrics for ordering the NN search. The first one is based on the minimum distance (MINDIST) of the object O from P. The second metric is based on the minimum of the maximum possible distances (MINMAXDIST) from P to a face (or vertex) of the MBR containing O. MINDIST and MINMAXDIST offer a lower and an upper bound on the actual distance of O from P respectively. These bounds are used by the nearest neighbor algorithm to order and efficiently prune the paths of the search space in an R-tree.

Definition 1: A rectangle R in Euclidean space E(n) of dimension n will be defined by the two endpoints S and T of its major diagonal:

R = (S, T), where S = [s_1, s_2, ..., s_n] and T = [t_1, t_2, ..., t_n], and s_i <= t_i for 1 <= i <= n.

To avoid confusion, whenever we refer to distance in this paper we will in practice be using the square of the distance, and our metrics will reflect this. The first metric is a variation of the classic Euclidean distance applied to a point and a rectangle. If the point is inside the rectangle, the distance between them is zero. If the point is outside the rectangle, we use the square of the Euclidean distance between the point and the nearest edge of the rectangle. We use the square of the Euclidean distance because it involves fewer and less costly computations.

Definition 2: The distance of a point P in E(n) from a rectangle R in the same space, denoted MINDIST(P, R), is:

MINDIST(P, R) = \sum_{i=1}^{n} |p_i - r_i|^2

where

r_i = s_i if p_i < s_i;  r_i = t_i if p_i > t_i;  r_i = p_i otherwise.

Lemma 1: If P is outside R or on its perimeter, the distance of Definition 2 is equal to the square of the minimal Euclidean distance from P to any point on the perimeter of R; if P is inside R, it is zero.

Proof: If P is inside R, then MINDIST = 0 by the choice of the r_i. If P is on the perimeter, MINDIST = 0 and so is equal to the square of the minimal distance of P from its closest point on the perimeter, namely itself. If P is outside R and j coordinates of P, j = 1, 2, ..., n-1, satisfy s_j <= p_j <= t_j, then MINDIST measures the square of the length of a perpendicular segment from P to an edge (for j = 1), to a plane (for j = 2), or to a hyperface (for j >= 3). If none of the p_j coordinates falls between (s_j, t_j), then MINDIST is the square of the distance to the closest vertex of R, by the way of selecting the r_i.

Definition 3: The minimum distance of a point P from a spatial object o, denoted ||(P, o)||, is:

||(P, o)|| = min { \sum_{i=1}^{n} |p_i - x_i|^2 : for all X = [x_1, ..., x_n] in o }.

Theorem 1: Given a point P and an MBR R enclosing a set of objects O = {o_i, 1 <= i <= m}, the following is true: for every o_i in O, MINDIST(P, R) <= ||(P, o_i)||.

Proof: If P is inside R, then MINDIST = 0, which is less than or equal to the distance of any object within R, including one that may be touching P. If P is outside R, then, according to Lemma 1, for every point X on the perimeter of R, MINDIST(P, R) <= ||(P, X)||. But every object enclosed in R either touches the perimeter of R or lies in its interior, so the segment from P to any such object crosses the perimeter at some point X and its length is therefore at least MINDIST(P, R). The equality in the above theorem holds when an object of R touches the circle with center P and radius the square root of MINDIST.

When searching an R-tree for the nearest neighbor to a query point P, at each visited node of the R-tree one must decide which MBR to search first. MINDIST offers a first approximation of the NN distance to every MBR of the node and, therefore, can be used to direct the search. In general, deciding which MBR to visit first in order to minimize the total number of visited nodes is not that straightforward. In fact, in many cases, due to dead space inside the MBRs, visiting first the MBR with the smallest MINDIST may result in fake-drops, i.e. visits to unnecessary nodes: the NN might be much further away than MINDIST. For this reason, we introduce a second metric, MINMAXDIST. But first, the following lemma is necessary.

Lemma 2 (MBR Face Property): Every face (i.e. edge in dimension 2, rectangle in dimension 3, and 'hyperface' in higher dimensions) of any MBR (at any level of the R-tree) contains at least one point of some spatial object in the DB. (See Figures 3 and 4.)

Proof: At the leaf level of the R-tree (object level), assume by contradiction that one face of the enclosing MBR does not touch the enclosed object. Then there exists a smaller rectangle that encloses the object, which contradicts the definition of the Minimum Bounding Rectangle. For the non-leaf levels, we use an induction on the level in the tree of the MBR. Assume any level k > 0 MBR has the MBR face property, and consider an MBR at level k+1. By the definition of an MBR, each face of that MBR touches an MBR of lower level, and therefore a leaf object, by applying the inductive hypothesis.

Notice that computing MINDIST requires only a number of operations linear in the number of dimensions, O(n).
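As a concrete illustration of Definition 2, the following Python sketch computes the (squared) MINDIST between a point and a rectangle given by the two endpoints of its major diagonal; the function name and the list-based representation are assumptions made for this example, not part of the paper.

    # Minimal sketch of MINDIST (Definition 2); representation is assumed:
    # a point is a sequence of coordinates, a rectangle is the pair (S, T)
    # of its lower and upper corners with s_i <= t_i on every axis.
    def mindist(p, s, t):
        """Squared minimum distance from point p to rectangle R = (s, t)."""
        total = 0.0
        for p_i, s_i, t_i in zip(p, s, t):
            if p_i < s_i:
                r_i = s_i          # point lies below the rectangle on this axis
            elif p_i > t_i:
                r_i = t_i          # point lies above the rectangle on this axis
            else:
                r_i = p_i          # point is within the rectangle's extent
            total += (p_i - r_i) ** 2
        return total

    # MINDIST is 0 for a point inside the rectangle, and the squared
    # distance to the nearest edge otherwise.
    assert mindist([2, 2], [1, 1], [4, 4]) == 0.0
    assert mindist([0, 0], [1, 1], [4, 4]) == 2.0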
[Figure 3: MBR Face Property in 2-Space. Each edge of an MBR at level i is in contact with at least one object in the DB; the same property applies to the MBRs at level i+1.]

In order to avoid visiting unnecessary MBRs, we should have an upper bound of the NN distance to any object inside an MBR. This will allow us to prune MBRs that have MINDIST higher than this upper bound. The following distance construction (called MINMAXDIST) is introduced to compute the minimum value of all the maximum distances between the query point and points on each of the n axes respectively. MINMAXDIST guarantees that there is an object within the MBR at a distance less than or equal to MINMAXDIST.

Definition 4: Given a point P in E(n) and an MBR R = (S, T) of the same dimensionality, we define MINMAXDIST(P, R) as:

MINMAXDIST(P, R) = \min_{1 <= k <= n} ( |p_k - rm_k|^2 + \sum_{i != k, 1 <= i <= n} |p_i - rM_i|^2 )

where:

rm_k = s_k if p_k <= (s_k + t_k)/2, and t_k otherwise;
rM_i = s_i if p_i >= (s_i + t_i)/2, and t_i otherwise.

This construction can be described as follows. For each k, select the hyperplane H_k = rm_k which contains the closer of the two faces of the MBR orthogonal to the k-th space axis (one of these faces has H_k = s_k and the other has H_k = t_k). The point V_k = (rM_1, rM_2, ..., rM_{k-1}, rm_k, rM_{k+1}, ..., rM_n) is the farthest vertex from P on this face. MINMAXDIST, then, is the minimum of the squares of the distances from P to each of these points.

Notice that this expression can be efficiently implemented in O(n) operations by first computing S = \sum_{1 <= i <= n} |p_i - rM_i|^2 (the distance from P to the furthest vertex of the MBR), and then iteratively selecting the minimum of S - |p_k - rM_k|^2 + |p_k - rm_k|^2 for 1 <= k <= n.

[Figure 4: MBR Face Property for Enclosed MBRs or Objects in 3-Space]
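The sketch below implements Definition 4 directly, using the same assumed point/rectangle representation as the MINDIST sketch above; it first computes the squared distance to the farthest vertex and then swaps in the closer face one axis at a time, following the O(n) construction just described.

    # Minimal sketch of MINMAXDIST (Definition 4); same assumed representation
    # as the MINDIST sketch: point p, rectangle R = (s, t).
    def minmaxdist(p, s, t):
        """Squared MINMAXDIST from point p to rectangle R = (s, t)."""
        n = len(p)
        rm = [s[k] if p[k] <= (s[k] + t[k]) / 2 else t[k] for k in range(n)]
        rM = [s[i] if p[i] >= (s[i] + t[i]) / 2 else t[i] for i in range(n)]
        # Squared distance from p to the farthest vertex of the MBR.
        far = sum((p[i] - rM[i]) ** 2 for i in range(n))
        # For each axis k, replace the farthest coordinate by the closer face
        # and keep the minimum over all axes.
        return min(far - (p[k] - rM[k]) ** 2 + (p[k] - rm[k]) ** 2
                   for k in range(n))

    # MINDIST <= actual NN distance <= MINMAXDIST for any object in the MBR,
    # so the two values bracket the distance to the nearest enclosed object.
    # For example, with P = (0, 0) and R = ((1, 1), (4, 4)):
    #   mindist    = 1^2 + 1^2 = 2
    #   minmaxdist = 1^2 + 4^2 = 17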
Theorem 2: Given a point P and an MBR R enclosing a set of objects O = {o_i, 1 <= i <= m}, the following property holds: there exists an o_i in O with ||(P, o_i)|| <= MINMAXDIST(P, R).

Proof: Because of Lemma 2, we know that each face of the MBR is touching at least one object enclosed in the MBR. For each axis, the distance from P to the extremity (farthest vertex) of the face closest to P is an upper bound on the distance to an object touching that face. Since the definition of MINMAXDIST takes the minimum of these upper bounds, there is at least one object in R whose distance from P is within (less than or equal to) MINMAXDIST, so MINMAXDIST is a valid estimate of the NN distance.

Theorem 2 says that MINMAXDIST is the minimum distance that guarantees the presence of an object in R whose distance from P is less than or equal to it: a larger value would always 'catch' some object inside the MBR, but a smaller distance could 'miss' some object. The MBR face property guarantees that MINMAXDIST is greater than or equal to the NN distance; on the other hand, a point object located exactly at the vertex of the MBR at distance MINMAXDIST shows that no smaller distance gives the same guarantee. Figures 5 and 6 illustrate MINDIST and MINMAXDIST in a 2-space and a 3-space respectively.

3 NEAREST NEIGHBOR SEARCH ALGORITHM

In this section we present a branch-and-bound R-tree traversal algorithm to find the k-NN objects to a given query point. We first discuss the merits of using the MINDIST and MINMAXDIST metrics to order and prune the search tree, then we present the algorithm for finding the 1-NN, and finally we generalize it for finding the k-NN.
3.1 Search Ordering and Pruning

Branch-and-bound algorithms have been studied and used extensively in the areas of artificial intelligence and operations research [HS78]. If the ordering and pruning heuristics are chosen well, they can significantly reduce the number of nodes visited in a large search space.

Search Ordering: The ordering heuristics we use in our algorithm and in the following experiments are based on the MINDIST and MINMAXDIST metrics. The MINDIST ordering is the optimistic choice, while the MINMAXDIST metric is the pessimistic (though not worst case) one. Since MINDIST estimates the distance from the query point to any enclosed MBR or data object as the minimum distance from the point to the MBR itself, it is the most optimistic choice possible. Due to the properties of MBRs and the construction of MINMAXDIST, the MINMAXDIST ordering produces the most pessimistic ordering that need ever be considered.

[Figure 5: MINDIST and MINMAXDIST in 2-Space]
[Figure 6: MINDIST and MINMAXDIST in 3-Space]

In applying a depth first traversal to a query point in an R-tree, the optimal visit ordering of the MBRs depends not only on the distance from the query point to each of the MBRs along the path(s) from the root to the leaf node(s), but also on the size and layout of the MBRs (or objects) within each MBR in the leaf node. In particular, one can construct examples in which the MINDIST metric ordering produces tree traversals that are more costly (in terms of nodes visited) than the MINMAXDIST metric. This is shown in Figure 7, where the MINDIST ordering will lead the search to MBR1 first, which would require opening up M11 and M12. If, on the other hand, the MINMAXDIST metric ordering were used, visiting MBR2 first would result in a smaller estimate of the actual distance to the NN (which will be found to be in M21), which will then eliminate the need to examine M11 and M12. The MINDIST ordering optimistically assumes that the NN to P in MBR M is going to be close to MINDIST(M, P), which is not always the case. Similarly, counterexamples could be constructed for any predefined ordering. As we stated above, the MINDIST metric produces the most optimistic ordering, but that is not always the best choice; many other orderings are possible by choosing metrics which compute the distance from the query point to faces or vertices of the MBR which are further away. The importance of MINMAXDIST(P, M) is that it computes the smallest distance between point P and MBR M that guarantees the finding of an object in M at a Euclidean distance less than or equal to MINMAXDIST(P, M).

[Figure 7: MINDIST and MINMAXDIST orderings (the NN is somewhere in MBR2).
1. MINDIST ordering: If we visit MBR1 first, we have to visit M11, M12, MBR2 and M21 before finding the NN.
2. MINMAXDIST ordering: If we visit MBR2 first, and then M21, when we eventually visit MBR1, we can prune M11 and M12.]

Search Pruning: We utilize the two theorems we developed to formulate the following three strategies to prune MBRs during the search:

1. An MBR M with MINDIST(P, M) greater than the MINMAXDIST(P, M') of another MBR M' is discarded, because it cannot contain the NN (Theorems 1 and 2). We use this in downward pruning.

2. An actual distance from P to a given object O which is greater than the MINMAXDIST(P, M) of an MBR M can be discarded (actually, replaced by MINMAXDIST(P, M) as an estimate of the NN distance), because M contains an object O' which is nearer to P (Theorem 2). This is also used in downward pruning.

3. Every MBR M with MINDIST(P, M) greater than the actual distance from P to a given object O is discarded, because it cannot enclose an object nearer than O (Theorem 1). We use this in upward pruning.

Although we specify only the use of MINMAXDIST in downward pruning, in practice there are situations where it is better to apply MINDIST (and in fact strategy 3) instead. For example, when there is no dead space (or at least very little) in the nodes of the R-tree, MINDIST is a much better estimate of ||(P, N)||, the actual distance to the NN, than is MINMAXDIST, at all levels in the tree.
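As a minimal sketch of how these pruning rules might be applied to a list of branches (all names, and the reduction of the three strategies to a single distance bound, are assumptions for this example, not the paper's exact procedure):

    # Minimal sketch of pruning over an active branch list (names assumed).
    # Each branch is a pair (mindist, minmaxdist) computed for one MBR of the
    # current node; `nearest` is the current NN distance estimate (possibly an
    # actual object distance, as in strategy 3).
    def prune_branch_list(branches, nearest):
        """Return the surviving branches and a possibly improved NN estimate."""
        # Strategy 2 (downward): the MINMAXDIST of any MBR is a valid upper
        # bound on the NN distance, so it may replace a worse estimate.
        for _, mm in branches:
            nearest = min(nearest, mm)
        # Strategies 1 and 3: discard every MBR whose MINDIST exceeds the
        # best upper bound (another MBR's MINMAXDIST or an object distance).
        survivors = [(md, mm) for md, mm in branches if md <= nearest]
        return survivors, nearest

    # Example: starting from an infinite estimate, the branch (5.0, 9.0)
    # lowers the bound to 9.0 and the branch (12.0, 20.0) is then discarded.
    print(prune_branch_list([(5.0, 9.0), (12.0, 20.0)], float("inf")))
    # -> ([(5.0, 9.0)], 9.0)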
3.2 Finding the 1 Nearest Neighbor

The algorithm presented here implements an ordered depth first traversal. It begins with the R-tree root node and proceeds down the tree. Originally, our guess for the nearest neighbor distance (call it Nearest) is infinity. During the descending phase, at each newly visited non-leaf node, the algorithm computes the ordering metric bounds (e.g. MINDIST, Definition 2) for all its MBRs and sorts them (associated with their corresponding node) into an Active Branch List (ABL). We then apply pruning strategies 1 and 2 to the ABL to remove unnecessary branches. The algorithm iterates on this ABL until the ABL is empty: for each iteration, the algorithm selects the next branch in the list and applies itself recursively to the node corresponding to the MBR of this branch. At a leaf node (DB objects level), the algorithm calls a type specific distance function for each object, selects the smaller of the current value of Nearest and each computed value, and updates Nearest appropriately. At the return from the recursion, we take this new estimate of the NN and apply pruning strategy 3 to remove all branches with MINDIST(P, M) > Nearest for all MBRs M remaining in the ABL. See Figure 8 for the pseudo-code description of the algorithm.

RECURSIVE PROCEDURE nearestNeighborSearch(Node, Point, Nearest)
  NODE        Node       // Current node
  POINT       Point      // Search point
  NEARESTN    Nearest    // Nearest neighbor found so far
  // Local variables
  NODE        newNode
  BRANCHARRAY branchList
  integer     dist, last, i

  // At leaf level - compute distance to actual objects
  If Node.type = LEAF
    For i := 1 to Node.count
      dist := objectDIST(Point, Node.branch[i].rect)
      If (dist < Nearest.dist)
        Nearest.dist := dist
        Nearest.rect := Node.branch[i].rect
  // Non-leaf level - order, prune and visit nodes
  Else
    // Generate Active Branch List
    genBranchList(Point, Node, branchList)
    // Sort ABL based on ordering metric values
    sortBranchList(branchList)
    // Perform Downward Pruning (may discard all branches)
    last := pruneBranchList(Node, Point, Nearest, branchList)
    // Iterate through the Active Branch List
    For i := 1 to last
      newNode := Node.branch[branchList[i]]
      // Recursively visit child nodes
      nearestNeighborSearch(newNode, Point, Nearest)
      // Perform Upward Pruning
      last := pruneBranchList(Node, Point, Nearest, branchList)

Figure 8: Nearest Neighbor Search Pseudo-Code

3.3 Generalization: Finding the k Nearest Neighbors

The algorithm presented above can be easily generalized to answer queries of the type: find the k nearest neighbors to a given query point, where k is greater than zero. The only differences are:

● A sorted buffer of at most k current nearest neighbors is needed.
● The MBR pruning is done according to the distance of the furthest nearest neighbor in this buffer.

The next section provides experimental results using both MINDIST and MINMAXDIST.
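Before turning to the experiments, here is a minimal, self-contained Python sketch of the generalized search just described. The node layout, the function names, and the use of MINDIST alone for both ordering and pruning are assumptions for this example (the paper's procedure also uses MINMAXDIST in downward pruning); the buffer of the k current nearest neighbors is kept as a max-heap so that the pruning distance is always available.

    # Minimal sketch of the k-NN branch-and-bound traversal (assumed names).
    import heapq

    def mindist(p, s, t):
        return sum((pi - min(max(pi, si), ti)) ** 2
                   for pi, si, ti in zip(p, s, t))

    class Node:
        def __init__(self, entries, leaf):
            # leaf: entries are (rect, object); non-leaf: entries are (rect, child Node),
            # where rect = (S, T) are the two corners of the MBR.
            self.entries, self.leaf = entries, leaf

    def knn_search(node, point, k, heap=None):
        """heap keeps (-squared_distance, object) for the k best objects so far."""
        heap = [] if heap is None else heap
        worst = -heap[0][0] if len(heap) == k else float("inf")
        if node.leaf:
            for (s, t), obj in node.entries:
                d = mindist(point, s, t)   # stand-in for a type-specific objectDIST
                if d < worst:
                    heapq.heappush(heap, (-d, obj))
                    if len(heap) > k:
                        heapq.heappop(heap)
                    worst = -heap[0][0] if len(heap) == k else float("inf")
        else:
            # Active Branch List ordered by MINDIST (the optimistic ordering).
            abl = sorted(((mindist(point, s, t), child)
                          for (s, t), child in node.entries), key=lambda b: b[0])
            for d, child in abl:
                # Pruning: once the buffer is full, an MBR whose MINDIST exceeds
                # the k-th nearest distance cannot contribute; since the list is
                # sorted, neither can any later branch.
                if len(heap) == k and d > -heap[0][0]:
                    break
                knn_search(child, point, k, heap)
        return sorted((-negd, obj) for negd, obj in heap)

    # Example: a tiny two-level tree over four point objects (squared distances).
    leaf1 = Node([(((0, 0), (0, 0)), "a"), (((2, 2), (2, 2)), "b")], leaf=True)
    leaf2 = Node([(((9, 9), (9, 9)), "c"), (((5, 5), (5, 5)), "d")], leaf=True)
    root = Node([(((0, 0), (2, 2)), leaf1), (((5, 5), (9, 9)), leaf2)], leaf=False)
    print(knn_search(root, (1, 1), k=2))   # -> [(2.0, 'a'), (2.0, 'b')]

Keeping the buffer as a max-heap means the furthest of the k current neighbors, which is the pruning distance mentioned above, can be read in constant time at every step.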
4 EXPERIMENTAL RESULTS

We implemented our k-NN search algorithm and designed and carried out our experiments in order to demonstrate the capability and usefulness of our NN search approach as applied to GIS types of queries. We examined the behavior of our algorithm as the number of neighbors increased and as the cardinality of the data set grew, and how the MINDIST and MINMAXDIST metrics affected performance.

We performed our experiments on both publicly available real-world spatial data sets and synthetic data sets. The real-world data sets included segment-based TIGER data files for the city of Long Beach, CA and for Montgomery County, MD, and observation records from the International Ultraviolet Explorer (I.U.E.) satellite from N.A.S.A. Our examples will be from the Long Beach data, which consists of 55,000 street segments stored as pairs of latitude and longitude coordinates.

For the synthetic data experiments, we generated test data files of 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K and 256K points, stored as rectangles. The points were unique values generated randomly in a grid of 8K by 8K space, using a different seed value for each data set. We then generated 100 equally spaced query points in the 8K by 8K space.

We built the R-tree indexes by first presorting the data files using a Hilbert [Jaga90] number generating function, and then applying a modified version of the [Rous85] R-tree packing technique according to the suggestion of [Kame94]. The branching factor of both the terminal and non-terminal nodes was set to 50 in all indexes.
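The following is a minimal sketch of the bottom-up packing step of such an index build, under stated simplifications: the function and variable names are assumptions, the toy data is generated here for illustration, and the presort key is a plain coordinate order rather than the Hilbert value [Jaga90] used in the actual experiments.

    # Minimal sketch of packing presorted rectangles into nodes of capacity 50
    # (the branching factor used in the experiments); all names are assumed.
    import random

    random.seed(0)
    # Assumed toy data: 1,000 unit squares with random corners in an 8K by 8K space.
    data_mbrs = [((x, y), (x + 1.0, y + 1.0))
                 for x, y in ((random.uniform(0, 8192), random.uniform(0, 8192))
                              for _ in range(1000))]

    def pack_level(rects, capacity=50):
        """Group a presorted list of MBRs into parent MBRs of <= capacity entries."""
        parents = []
        for i in range(0, len(rects), capacity):
            group = rects[i:i + capacity]
            lo = tuple(min(r[0][d] for r in group) for d in range(2))
            hi = tuple(max(r[1][d] for r in group) for d in range(2))
            parents.append((lo, hi))
        return parents

    # The paper presorts by Hilbert value; as a stand-in, this sketch sorts the
    # rectangles lexicographically by their lower-left corner.
    level = pack_level(sorted(data_mbrs))
    while len(level) > 1:          # pack level by level until one root MBR remains
        level = pack_level(level)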
We generated three uniform sets of queries of 100 points each, based on regions of the Long Beach, CA data set. The first set was from a sparse region of the city (few or no segments at all), the second was from a dense region (a large number of segments per unit area), and the third was a uniform sample from the MBR of the whole city map. We then executed a series of nearest neighbor queries for each of the query points in each of these regions and plotted the average number of nodes accessed against the number of nearest neighbors. In Figure 9 we see the average of 100 queries for each of several different numbers of nearest neighbors, for both the MINDIST and MINMAXDIST ordering metrics applied to the Long Beach data.

[Figure 9: MINDIST and MINMAXDIST Metric Comparison (average pages accessed vs. number of neighbors)]

For this experiment we see that the graphs of the MINMAXDIST ordered searches were similar in shape to the graphs of the MINDIST ordered searches, but the number of pages accessed was consistently, approximately 20%, higher. MINMAXDIST performed the worst in dense regions of the various data sets, which was not surprising. It turned out that in all the experiments we performed comparing the two metrics, the results were similar to this one. Since this occurred both with real-world (pseudo) randomly distributed data and with spatially sorted and packed R-trees, we surmise that MINDIST is the ordering metric of choice. So, for the sake of clarity and simplicity, the rest of the figures show the results of the MINDIST metric only.

Figure 10 shows the results of an experiment using the synthetic data. We ran the 100 equally spaced query points and averaged the results for different values of k NN (ranging from 1 to 121) on the 1K, 4K, 16K, 64K and 256K data sets. We graphed the results as the (average) number of pages (nodes) accessed versus the number of nearest neighbors searched for.

[Figure 10: Synthetic Data Experiment 1 Results (Nearest Neighbor Scalability)]

From the experimental observations we noticed two distinct components in the curves. First, the average number of pages accessed increased with the number of nearest neighbors requested; the curves appeared to be piecewise linear, with a (small) constant fractional term, the steps occurring where a larger number of neighbors forces the search to examine the next group of nodes of the index. Second, the number of pages accessed grew with the logarithm (base 2) of the cardinality of the data set. Since all the synthetic data sets were created in the same space and with the same node size, the dominant factor is the height of the R-tree index: the 1K, 2K and 4K data sets have a depth of 2, the 8K, 16K, 32K and 64K data sets have a depth of 3, and the 128K and the 256K data sets have a depth of 4, and the cost of the NN search increases linearly with the height of the tree. This is an important observation, because it shows that the algorithm behaves well for ordered (spatially sorted and packed) data sets as the database grows.

In the second experiment we examined the scalability of the algorithm with the size of the database. We plotted the (average) number of pages accessed against the size of the data set, in terms of kilobytes (log base 2), for each of 1, 16, 32, 64 and 128 nearest neighbors.

[Figure 11: Synthetic Data Experiment 2 Results (Database Size Scalability)]

The curves clustered into distinct groupings, producing a banded pattern that strongly correlates with the height of the R-tree index, and within each band the number of page accesses grew only sublinearly as the data set size grew. This suggests that the algorithm scales well with the cardinality of the database as well as with the number of nearest neighbors requested.

5 CONCLUSIONS

In this paper, we developed a branch-and-bound R-tree traversal algorithm which finds the k Nearest Neighbors of a given query point. We also introduced two metrics, MINDIST and MINMAXDIST, that can be used to guide an ordered depth first spatial search and to prune it. The first metric, MINDIST, produces the most optimistic ordering possible, whereas the second, MINMAXDIST, is the most pessimistic one and effectively limits the MBRs that ever need be considered. Although many other orderings, or more sophisticated metrics using a linear combination of the two, are possible, these two metrics were shown to be valuable tools in directing and pruning the NN search. Our experiments have shown that the MINDIST ordering was more effective, at least in the case where the data were spatially sorted and packed.

We implemented and thoroughly tested our k-NN search algorithm. The experiments on both real-world and synthetic data sets showed that the algorithm scales up well with respect to both the number of nearest neighbors requested and the size of the data sets. Further research on NN queries in our group will focus on defining and analyzing other metrics for ordering and pruning the search, and on characterizing the behavior of our algorithm in dynamic as well as static database environments.

6 ACKNOWLEDGEMENTS

We would like to thank Christos Faloutsos for his insightful comments and suggestions.
References
[Beck90] Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B., "The R*-tree: an efficient and robust access method for points and rectangles," Proc. ACM SIGMOD, May 1990, pp. 322-331.

[BKS93] Brinkhoff, T., Kriegel, H.-P., and Seeger, B., "Efficient Processing of Spatial Joins Using R-trees," Proc. ACM SIGMOD, May 1993, pp. 237-246.

[FBF77] Friedman, J. H., Bentley, J. L., and Finkel, R. A., "An algorithm for finding best matches in logarithmic expected time," ACM Trans. Math. Software, 3, September 1977, pp. 209-226.

[Gutt84] Guttman, A., "R-trees: A Dynamic Index Structure for Spatial Searching," Proc. ACM SIGMOD, June 1984, pp. 47-57.

[HS78] Horowitz, E., and Sahni, S., "Fundamentals of Computer Algorithms," Computer Science Press, 1978, pp. 370-421.

[Jaga90] Jagadish, H. V., "Linear Clustering of Objects with Multiple Attributes," Proc. ACM SIGMOD, May 1990, pp. 332-342.

[Kame94] Kamel, I., and Faloutsos, C., "Hilbert R-Tree: an Improved R-Tree Using Fractals," Proc. of VLDB, 1994, pp. 500-509.

[Rous85] Roussopoulos, N., and Leifker, D., "Direct Spatial Search on Pictorial Databases Using Packed R-trees," Proc. ACM SIGMOD, May 1985.

[Same89] Samet, H., "The Design & Analysis Of Spatial Data Structures," Addison-Wesley, 1989.

[Same90] Samet, H., "Applications Of Spatial Data Structures: Computer Graphics, Image Processing and GIS," Addison-Wesley, 1990.

[Spro91] Sproull, R. F., "Refinements to Nearest-Neighbor Searching in k-Dimensional Trees," Algorithmica, 6, 1991, pp. 579-589.

[SRF87] Sellis, T., Roussopoulos, N., and Faloutsos, C., "The R+-tree: A Dynamic Index for Multi-dimensional Objects," Proc. 13th International Conference on Very Large Data Bases, 1987, pp. 507-518.