Range Searching - CSE-IITM

CS6100: Topics in Design and
Analysis of Algorithms
Range Searching
John Augustine
CS6100 (Even 2012): Range Searching
The Range Searching Problem
Given a set P of n points in R^d, for fixed integer d ≥ 1,
we want to preprocess and store it in a data structure
so that, given a query range, typically an axis-parallel
rectangle, we can report all the points in the range
quickly.
For 1D range searching, we will study (i) balanced
binary search trees and (ii) skip lists.
For 2D point sets, we will study (i) kd-trees and (ii)
Range trees, both of which can be extended to arbitrary
d-dimensional point sets.
[Figure: employee records plotted as points, with date of birth on the horizontal axis (19,500,000 to 19,559,999) and salary on the vertical axis (3,000 to 4,000). Example record: G. Ometer, born Aug 19, 1954, salary $3,500.]
Balanced Binary Search Trees (BBST)
Given a set P of n points in R stored in a sorted array
A, we can construct a tree that has depth O(log n).
For simplicity, we begin with the assumption that
n = 2^k for some integer k.
The data nodes are in the leaves. The internal nodes
store values that guide the search.
The root node stores the 2^(k−1)th element in A. While
searching for a value x in the query phase, if x is less
than or equal to the value stored in the root, the search
is guided to the left subtree. Otherwise, the search is
guided to the right subtree.
The left subtree is constructed recursively over points
in A stored from locations 1 through 2^(k−1). The right
subtree is constructed over points in A located from
positions 2^(k−1) + 1 through 2^k.
When constructing the internal node on 2 element
point sets, the left subtree simply points to the smaller
of the two points and the right subtree to the larger,
thus terminating the recursion.
The construction can be easily adapted for arbitrary n.
See below for an example.
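[Figure: an example BBST whose leaves store the points 3, 10, 19, 23, 30, 37, 49, 59, 62, 70, 80, 89, 100, 105, with root value 49; the query endpoints µ and µ′ are marked below the leaves.]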
Lemma 1. If the set of points is sorted, we can
construct the BBST in O(n) time. If not, it takes
O(n log n) time, as we first have to sort the point set. The
BBST data structure requires O(n) storage space.
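The recursive construction can be sketched as follows. This is a minimal illustration, not the course's reference code: the names Node, build_bbst and depth are made up here, the input is assumed to be a non-empty sorted sequence of distinct values, and for odd sizes the left half is (arbitrarily) given the extra element.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    value: float                      # guide value (internal) or data point (leaf)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def build_bbst(a: Sequence[float], lo: int = 0, hi: Optional[int] = None) -> Node:
    """Build a leaf-oriented balanced BST over the sorted slice a[lo:hi]."""
    if hi is None:
        hi = len(a)
    if hi - lo < 1:
        raise ValueError("need at least one point")
    if hi - lo == 1:
        return Node(a[lo])            # a single point becomes a leaf
    mid = (lo + hi + 1) // 2          # the left half holds a[lo:mid]
    # The guide value is the largest element of the left half, so a search
    # for x goes left exactly when x <= value.
    return Node(a[mid - 1], build_bbst(a, lo, mid), build_bbst(a, mid, hi))


def depth(node: Node) -> int:
    """Number of edges on the longest root-to-leaf path."""
    return 0 if node.is_leaf else 1 + max(depth(node.left), depth(node.right))
```

Each call does O(1) work besides its two recursive calls (indices are passed instead of copying slices), which is one way to see the O(n) bound for sorted input.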
To search for a single value µ, we start at the root
node and ask if µ is greater than the value stored in the
root. If it is, we move to the right subtree, otherwise,
we move to the left. We continue recursively till the
leaf, where we can report if µ is present.
To query a range [µ, µ′], we traverse the tree for both
µ and µ′ until we find the internal node where the two
search paths split; call it ν_split.
[Figure: the search paths from root(T) to µ and µ′ diverge at ν_split; the selected subtrees hang between the two paths.]
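At ν_split, the search paths for µ and µ′ part ways. As we
continue towards µ (below ν_split), every time the path moves
to a left child, we report all points in the right subtree of
that node. We deal with µ′ symmetrically: every time the path
moves to a right child, we report all points in the left subtree.
Finally, we check the two leaves where the paths end.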
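The query can be sketched in code as follows. This is an illustrative sketch under assumptions: the names Node, build, report_subtree and range_query are invented here, points are distinct, and the reported points are not returned in sorted order.

```python
from typing import List, Sequence


class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

    def is_leaf(self) -> bool:
        return self.left is None


def build(a: Sequence[float]) -> Node:
    """Leaf-oriented balanced BST over a non-empty sorted sequence."""
    if len(a) == 1:
        return Node(a[0])
    mid = (len(a) + 1) // 2
    return Node(a[mid - 1], build(a[:mid]), build(a[mid:]))


def report_subtree(node: Node, out: List[float]) -> None:
    """Append every point stored in the leaves below node."""
    if node.is_leaf():
        out.append(node.value)
    else:
        report_subtree(node.left, out)
        report_subtree(node.right, out)


def range_query(root: Node, lo: float, hi: float) -> List[float]:
    """Report all stored points p with lo <= p <= hi (in no particular order)."""
    out: List[float] = []
    # Walk down while the search paths for lo and hi coincide.
    split = root
    while not split.is_leaf() and (hi <= split.value or lo > split.value):
        split = split.left if hi <= split.value else split.right
    if split.is_leaf():
        if lo <= split.value <= hi:
            out.append(split.value)
        return out
    # Path towards lo: whenever it turns left, the right subtree is in range.
    v = split.left
    while not v.is_leaf():
        if lo <= v.value:
            report_subtree(v.right, out)
            v = v.left
        else:
            v = v.right
    if lo <= v.value <= hi:
        out.append(v.value)
    # Path towards hi: whenever it turns right, the left subtree is in range.
    v = split.right
    while not v.is_leaf():
        if hi > v.value:
            report_subtree(v.left, out)
            v = v.right
        else:
            v = v.left
    if lo <= v.value <= hi:
        out.append(v.value)
    return out
```

The build helper copies slices for brevity; it is only there to make the sketch self-contained.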
Lemma 2. The time to report the points in a range
[µ, µ′] is O(k + log n), where k is the number of points
in [µ, µ′].
Proof. The tree traversal requires O(log n) time.
Reporting the points in a selected subtree requires O(k′) time,
where k′ is the number of points on which that
particular subtree is built. The selected subtrees are disjoint
and together hold at most k points. Therefore, O(k) time is
required to report all k points.
Operation                          Bound
Preprocessing time                 O(n log n)
Space                              O(n)
Searching for 1 element            O(log n)
Reporting a range with k items     O(k + log n)
Insertion                          O(log n)
Deletion                           O(log n)
Table 1: Performance bounds of a BBST containing n
points.
Skip List
While the static implementation of a binary search
tree is very straightforward, making the data structure
dynamic (i.e., adding and deleting points from the
point set) is non-trivial.
The skip list is a randomized data structure that allows
easy implementation, including updates (insertions and
deletions).
In expectation, it has the same performance bounds
as BBSTs (in Table 1).
[Figure: a skip list; the head pointer points to the left boundary node of the top level.]
Construction
Again, we assume that the set P of n points is given
to us in sorted order. We denote the ith element of P
in the sorted list by p_i.
In our data structure, we use nodes with four pointers:
left, right, top and bottom.
We first construct the bottom level (or level 0), which
is a linked list of the sorted points built from four-pointer
nodes. The bottom pointers are set to null.
For each p_i, we toss a fair coin repeatedly until we
get Heads. Let ℓ_i be the number of Tails before we
obtain the first Heads.
Vertical Pointers. We make ℓ_i additional nodes
containing p_i, one for each of the levels 1 through ℓ_i, and we
chain them up above the level-0 node as follows. For 0 ≤ j < ℓ_i,
the top pointer of the level-j node points to the level-(j + 1) node,
and the bottom pointer of the level-(j + 1) node points to the
level-j node. The top pointer of the level-ℓ_i node is null.
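The coin-tossing step can be sketched as below. The function name sample_level, the seed and the sample size are arbitrary choices made for this illustration.

```python
import random


def sample_level(rng: random.Random) -> int:
    """Return the number of Tails seen before the first Heads (fair coin)."""
    tails = 0
    while rng.random() < 0.5:      # this outcome plays the role of Tails
        tails += 1
    return tails


rng = random.Random(2012)          # fixed seed so the run is repeatable
levels = [sample_level(rng) for _ in range(10_000)]
mean_level = sum(levels) / len(levels)
```

Since the probability of seeing at least j Tails first is 2^(−j), the expected value of each ℓ_i is 1, and mean_level should come out close to 1.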
The topmost level is ℓ = max_i ℓ_i. For each level, we
have two special boundary nodes, one to the left of all
nodes in that level, and the other to the right. The
boundary nodes are also chained up vertically.
Horizontal Pointers. We establish horizontal links at
each level j starting from j = 1 up to j = ℓ. We start
from the left boundary of level j. For each node η in
level j (starting from the left boundary) we step down
to its copy in level j − 1 and traverse to the right until
we come to a node in level j − 1 that has a copy η′ in
level j. We establish bidirectional links between η
and η′ and continue this process from η′ until we reach
the right boundary.
The head pointer points to the left boundary of level
ℓ.
Searching for a Point p
Here, given p, we want to report if P (stored using the
skip list data structure) contains p.
For simplicity, assume that the left boundary nodes
store −∞ and the right boundary nodes store +∞.
Start from the head pointer.
Repeat the following steps:
1. On the current level, move right to the last node whose
value is at most p. If the value is exactly p, we have found
it, so we can terminate.
2. Else, if we have reached level 0, then, report that p
is not in P and terminate.
3. Else, step directly down one level.
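The three steps can be sketched as follows. This is an illustration under assumptions: the names Node, build_skip_list and contains are invented, points are distinct and sorted, and the levels are passed in explicitly (in the randomized structure they would come from the coin tosses). The builder links each level directly from the list of points that reach it, which is a shortcut and not the walk along level j − 1 described under Horizontal Pointers, though it yields the same links.

```python
import math
from typing import Dict, Optional, Sequence


class Node:
    """Four-pointer node; searching only follows right and bottom."""

    def __init__(self, value: float):
        self.value = value
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None
        self.top: Optional["Node"] = None
        self.bottom: Optional["Node"] = None


def build_skip_list(points: Sequence[float], levels: Sequence[int]) -> Node:
    """levels[i] is the highest level holding points[i]. Returns the head,
    i.e. the left boundary node of the top level."""
    below: Dict[float, Node] = {}          # value -> node one level down
    head: Optional[Node] = None
    for level in range(max(levels, default=0) + 1):
        values = [-math.inf]
        values += [p for p, h in zip(points, levels) if h >= level]
        values.append(math.inf)
        current: Dict[float, Node] = {}
        prev: Optional[Node] = None
        for v in values:
            node = Node(v)
            current[v] = node
            if prev is not None:           # horizontal links, both ways
                prev.right, node.left = node, prev
            if v in below:                 # vertical links, both ways
                node.bottom, below[v].top = below[v], node
            prev = node
        below = current
        head = current[-math.inf]
    return head


def contains(head: Node, p: float) -> bool:
    node = head
    while True:
        # Step 1: move right to the last node on this level with value <= p.
        while node.right is not None and node.right.value <= p:
            node = node.right
        if node.value == p:
            return True
        # Step 2: at level 0 there is nowhere left to look.
        if node.bottom is None:
            return False
        # Step 3: step directly down one level.
        node = node.bottom
```

For example, with points [3, 10, 19, 23, 30] and levels [0, 2, 0, 1, 0], a search for 19 moves right to 10 on level 2, steps down twice, and then moves right once on level 0.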
Exercises
1. How do we search for points in a range?
2. How do we insert a new node?
3. How do we delete a node?
4. Suppose you are given a skip list, can you
strategically add and delete points so that the query
times become bad (i.e., ω(log n))? Note that you
will have to play the role of an adaptive adversary
that can see the coin tosses (and therefore see the
data structure as it evolves).
5. Suppose the coin tosses are hidden to you and you
can’t measure the actual query times. Can you still
strategically add and delete points so that the query
times become bad? (Such an adversary that cannot
see the coin tosses is called an oblivious adversary.)
6. An alternative way to ask the previous question
is the following. How do we prove that, under
an oblivious adversary, the expected performance
bounds of a skip list match Table 1?
kd-Trees
Recall that we now want to perform 2D range searches.
[Figure: employee records as points, date of birth (19,500,000 to 19,559,999) against salary (3,000 to 4,000). Example record: G. Ometer, born Aug 19, 1954, salary $3,500.]
So, we need a data structure that considers both the
x AND the y coordinates.
Kd-trees achieve this by alternating between x and y.
Let us now recursively construct the kd-tree given a
set P of n points in 2D.
As in BBSTs, the data is stored in the leaves. The
internal nodes serve the purpose of guiding searches to
the required leaves.
The root node (level 0) of the kd-tree corresponds to
the entire data set.
To construct the level 1 nodes, i.e., the left and right
children of the root, we split the data along the x
median. The subtree rooted at the left child of the
root node stores all points with x coordinate values no
more than the x median. The rest are stored in the
right subtree of the root node.
[Figure: a vertical splitting line ℓ through the x median divides the point set into P_left and P_right.]
To construct level 2 nodes, we again split the points
stored in the subtrees rooted at each of the level 1
nodes into two roughly equal halves. However, this
time, we split along the y median.
We continue recursively alternating between splitting
along x and y medians.
[Figure: a kd-tree on ten points p1, ..., p10, showing the splitting lines ℓ1, ..., ℓ9 in the plane and the corresponding tree with the points at its leaves.]
Algorithm BUILDKDTREE(P, depth)
Input. A set of points P and the current depth depth.
Output. The root of a kd-tree storing P.
  if P contains only one point
    then return a leaf storing this point
    else if depth is even
      then Split P into two subsets with a vertical line ℓ through the
           median x-coordinate of the points in P. Let P1 be the set of
           points to the left of ℓ or on ℓ, and let P2 be the set of
           points to the right of ℓ.
      else Split P into two subsets with a horizontal line ℓ through the
           median y-coordinate of the points in P. Let P1 be the set of
           points below ℓ or on ℓ, and let P2 be the set of points above ℓ.
    ν_left ← BUILDKDTREE(P1, depth + 1)
    ν_right ← BUILDKDTREE(P2, depth + 1)
    Create a node ν storing ℓ, make ν_left the left child of ν, and
    make ν_right the right child of ν.
    return ν
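A possible Python rendering of the pseudocode is sketched below. The names Leaf, Inner and build_kd_tree are invented for this illustration, and the points are assumed to have distinct x-coordinates and distinct y-coordinates. For brevity the sketch finds the median by sorting at every call, which costs O(n log^2 n) overall and does not achieve the O(n log n) preprocessing bound.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Point = Tuple[float, float]


@dataclass
class Leaf:
    point: Point


@dataclass
class Inner:
    axis: int            # 0: vertical splitting line (x), 1: horizontal (y)
    split: float         # coordinate of the splitting line
    left: "KdNode"       # points on or before the line
    right: "KdNode"      # points strictly after the line


KdNode = Union[Leaf, Inner]


def build_kd_tree(points: List[Point], depth: int = 0) -> KdNode:
    """Build a kd-tree over a non-empty list of points."""
    if len(points) == 1:
        return Leaf(points[0])
    axis = depth % 2                     # even depth: split on x, odd: on y
    ordered = sorted(points, key=lambda p: p[axis])
    mid = (len(ordered) + 1) // 2        # left part keeps the median point
    split = ordered[mid - 1][axis]
    return Inner(
        axis,
        split,
        build_kd_tree(ordered[:mid], depth + 1),
        build_kd_tree(ordered[mid:], depth + 1),
    )
```

Keeping the points presorted on both coordinates, or using a linear-time median selection, would be the usual ways to bring the construction down to O(n log n).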
Preprocessing Time and Storage
At each internal node, we have to split P into two
sets. This requires O(n) time if the internal node
is built on n elements (for example, using linear-time
median finding). Subsequently, two recursive
calls are made on point sets that contain roughly n/2
elements. Therefore, the recurrence relation for the
preprocessing time on n elements is:
T(n) = O(n) + 2T(n/2),
which evaluates to T(n) = O(n log n).
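One way to see the solution is to unroll the recurrence. In the sketch below, c is an assumed constant bounding the O(n) splitting cost, and n is taken to be a power of 2.

```latex
\begin{align*}
T(n) &\le cn + 2\,T(n/2) \\
     &\le cn + 2\left(\tfrac{cn}{2} + 2\,T(n/4)\right) = 2cn + 4\,T(n/4) \\
     &\;\;\vdots \\
     &\le i \cdot cn + 2^{i}\,T\!\left(n/2^{i}\right).
\end{align*}
Setting $i = \log_2 n$ gives
\[
T(n) \le cn \log_2 n + n\,T(1) = O(n \log n).
\]
```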
To analyse the space required by a kd-tree that
stores n points, first note that if a binary tree T
has n leaves and each of its internal nodes has exactly
two children, then T has n − 1 internal nodes. Since
any kd-tree is such a tree, the space required is O(n).
Region of a node
Note that each node in the kd-tree has a region
associated with it. The region associated with the
root is the entire plane. Subsequently, the region gets
divided based on where the points are split.
[Figure: a kd-tree with splitting lines ℓ1, ℓ2, ℓ3 and the region region(ν) associated with a node ν.]
Query Procedure
Traverse the kd-tree, but only visit nodes whose regions
intersect the query rectangle.
• When a region is fully contained in the query
rectangle, just report all points in the subtree.
• When traversal reaches a leaf, check its containment
in the query rectangle and report if necessary.
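The query procedure can be sketched as follows. This is an illustration with invented names (build, report_all, contained, intersects, query); the tree is stored as nested tuples, rectangles are closed and written as (x_lo, x_hi, y_lo, y_hi), and coordinates are assumed distinct per axis. The region of a right child is treated as closed at the splitting line, which is slightly larger than the true region and therefore safe.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]     # (x_lo, x_hi, y_lo, y_hi)
INF = float("inf")


def build(points: List[Point], depth: int = 0):
    """kd-tree as tuples: ('leaf', p) or ('node', axis, split, left, right)."""
    if len(points) == 1:
        return ("leaf", points[0])
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = (len(pts) + 1) // 2
    return ("node", axis, pts[mid - 1][axis],
            build(pts[:mid], depth + 1), build(pts[mid:], depth + 1))


def report_all(node, out: List[Point]) -> None:
    """Report every point in the subtree without any further tests."""
    if node[0] == "leaf":
        out.append(node[1])
    else:
        report_all(node[3], out)
        report_all(node[4], out)


def contained(region: Rect, q: Rect) -> bool:
    return (q[0] <= region[0] and region[1] <= q[1]
            and q[2] <= region[2] and region[3] <= q[3])


def intersects(region: Rect, q: Rect) -> bool:
    return (region[0] <= q[1] and q[0] <= region[1]
            and region[2] <= q[3] and q[2] <= region[3])


def query(node, q: Rect, region: Rect = (-INF, INF, -INF, INF),
          out: Optional[List[Point]] = None) -> List[Point]:
    out = [] if out is None else out
    if node[0] == "leaf":
        x, y = node[1]
        if q[0] <= x <= q[1] and q[2] <= y <= q[3]:
            out.append(node[1])
        return out
    _, axis, split, left, right = node
    # The splitting line caps the left region from above and bounds the
    # right region from below along the split axis.
    left_region = region[:2 * axis + 1] + (split,) + region[2 * axis + 2:]
    right_region = region[:2 * axis] + (split,) + region[2 * axis + 1:]
    for child, child_region in ((left, left_region), (right, right_region)):
        if contained(child_region, q):
            report_all(child, out)           # region fully inside the query
        elif intersects(child_region, q):
            query(child, q, child_region, out)
    return out
```

Regions are computed on the way down from the splitting lines, so nothing beyond the tree itself has to be stored.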
Lemma 3. A query with an axis-parallel rectangle in
a kd-tree of n points takes O(√n + k) time, where k
is the number of points reported.
Proof Sketch.
Reporting all points in a region fully contained in the
query rectangle takes time linear in the number of
points in the region. Therefore, the time to report all
points in regions contained within the query rectangle
will take O(k) time.
Consider the nodes that were visited, but whose regions
were not fully contained by the query rectangle. We
only spend O(1) time in each such node. Therefore,
we can account for the remaining running time by
(asymptotically) counting the number of such nodes.
• The region of each such node is cut by one of the
four boundaries of the query rectangle.
• Therefore, the number of such nodes is
asymptotically upper bounded by the maximum
number of regions in the kd-tree that an
axis-parallel line can intersect.
• We build a recurrence Q(n) that captures
the maximum number of regions in a kd-tree
storing n points that such a line can intersect.
• Since the kd-tree alternates between vertical and
horizontal splits, Q(n) must be defined across two
levels. In particular, Q(n) = 2 + 2Q(n/4), which
evaluates to Q(n) = O(√n).
Thus the total query time is O(√n + k).
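The solution of the recurrence can be checked by unrolling it, assuming for simplicity that n is a power of 4 and that Q(1) = O(1).

```latex
\begin{align*}
Q(n) &= 2 + 2\,Q(n/4) \\
     &= 2 + 4 + 4\,Q(n/16) \\
     &\;\;\vdots \\
     &= \sum_{j=1}^{i} 2^{j} + 2^{i}\,Q\!\left(n/4^{i}\right)
      = 2^{i+1} - 2 + 2^{i}\,Q\!\left(n/4^{i}\right).
\end{align*}
Setting $i = \log_4 n$, so that $2^{i} = \sqrt{n}$, gives
\[
Q(n) = 2\sqrt{n} - 2 + \sqrt{n}\,Q(1) = O(\sqrt{n}).
\]
```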
Range Trees
The Range Tree is a data structure for range searching
whose query time (ignoring the output-sensitive term) is
polylogarithmic in n instead of O(√n).
Its preprocessing time and space complexity are
O(n log n).
The key to designing multi-dimensional range searching
data structures is to combine searching along multiple
coordinate axes.
While we alternated between the x and y coordinates in
kd-trees, in range trees, we first build on the x-coordinate
and then, for each internal node of the x-coordinate
tree, we build a separate tree on the y-coordinate.
To construct the range tree, it is helpful to store two
copies of the set of points (at each recursive call), one
sorted according to the x coordinates and the other
sorted according to the y coordinates.
Recall 1D Range Searching
Before we see how 2D range trees can be constructed,
we first recall 1D BBST’s.
[Figure: a 1D range query on a BBST, showing the split node ν_split and the search paths to µ and µ′.]
We store the data as leaves in a balanced binary search
tree.
The canonical subset P(v) of a node v is the data
stored in the leaves of the subtree rooted at v.
In 2D range trees, the primary tree is a 1D BBST based
on the x-coordinate of the points. For each internal
node v, we additionally store an associated tree based
on the y-coordinates of the canonical subset P(v) of
v.
2D Range Tree
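[Figure: the primary tree T, a binary search tree on x-coordinates; each node ν with canonical subset P(ν) points to its associated structure T_assoc(ν), a binary search tree on the y-coordinates of P(ν).]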
Algorithm BUILD2DRANGETREE(P)
Input. A set P of points in the plane.
Output. The root of a 2-dimensional range tree.
  Construct the associated structure: Build a binary search tree T_assoc
  on the set P_y of y-coordinates of the points in P. Store at the leaves
  of T_assoc not just the y-coordinates of the points in P_y, but the
  points themselves.
  if P contains only one point
    then Create a leaf ν storing this point, and make T_assoc the
         associated structure of ν.
    else Split P into two subsets; one subset P_left contains the points
         with x-coordinate less than or equal to x_mid, the median
         x-coordinate, and the other subset P_right contains the points
         with x-coordinate larger than x_mid.
         ν_left ← BUILD2DRANGETREE(P_left)
         ν_right ← BUILD2DRANGETREE(P_right)
         Create a node ν storing x_mid, make ν_left the left child of ν,
         make ν_right the right child of ν, and make T_assoc the
         associated structure of ν.
  return ν
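A compact sketch of the construction is given below. It deliberately swaps in a simpler associated structure: each node stores its canonical subset as a list sorted by y, not as a binary search tree, which supports the same 1D range queries via binary search. The names RangeNode and build_range_tree are invented, and the input is assumed to be non-empty, sorted by x, with distinct x-coordinates.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class RangeNode:
    x_mid: float                          # guide value on the x-coordinate
    assoc: List[Point]                    # canonical subset, sorted by y
    left: Optional["RangeNode"] = None
    right: Optional["RangeNode"] = None


def build_range_tree(points_by_x: List[Point]) -> RangeNode:
    """Build the primary tree on x; every node keeps its points sorted by y."""
    if len(points_by_x) == 1:
        return RangeNode(points_by_x[0][0], list(points_by_x))
    mid = (len(points_by_x) + 1) // 2     # left part keeps the median
    left = build_range_tree(points_by_x[:mid])
    right = build_range_tree(points_by_x[mid:])
    # Merging the children's y-sorted lists takes linear time per node,
    # which avoids re-sorting the canonical subset from scratch.
    assoc = list(heapq.merge(left.assoc, right.assoc, key=lambda p: p[1]))
    return RangeNode(points_by_x[mid - 1][0], assoc, left, right)
```

Unlike the pseudocode, the sketch builds the children first and obtains the parent's associated list by merging, so each level of the primary tree costs O(n) time.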
Lemma 4. A 2D range tree on n data points takes
O(n log n) storage.
Proof. A data point p is stored only in the associated
trees attached to the nodes of the first level tree on
the path from the root to p. At a given level, a point
p is stored in only one associated structure. Since
each associated tree uses linear storage, each
data point contributes O(1) storage at each
of the O(log n) levels. Therefore, the total space is
O(n log n).
Algorithm 2DRANGEQUERY(T, [x : x′] × [y : y′])
Input. A 2-dimensional range tree T and a range [x : x′] × [y : y′].
Output. All points in T that lie in the range.
  ν_split ← FINDSPLITNODE(T, x, x′)
  if ν_split is a leaf
    then Check if the point stored at ν_split must be reported.
    else (∗ Follow the path to x and call 1DRANGEQUERY on the subtrees
            right of the path. ∗)
         ν ← lc(ν_split)
         while ν is not a leaf
           do if x ≤ x_ν
                then 1DRANGEQUERY(T_assoc(rc(ν)), [y : y′])
                     ν ← lc(ν)
                else ν ← rc(ν)
         Check if the point stored at ν must be reported.
         Similarly, follow the path from rc(ν_split) to x′, call
         1DRANGEQUERY with the range [y : y′] on the associated
         structures of subtrees left of the path, and check if the point
         stored at the leaf where the path ends must be reported.
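The query can be sketched in Python as below. As an assumption of this sketch, the associated structure of each node is a y-sorted list searched with the standard bisect module, standing in for the associated BST and 1DRANGEQUERY. The names Node, build, query_1d, check_leaf and query_2d are invented, and x-coordinates are assumed distinct.

```python
import heapq
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Point = Tuple[float, float]


class Node:
    def __init__(self, x_mid, assoc, left=None, right=None):
        self.x_mid, self.assoc, self.left, self.right = x_mid, assoc, left, right
        self.ys = [p[1] for p in assoc]       # keys for binary search on y


def build(pts: List[Point]) -> Node:
    """pts: non-empty, sorted by x."""
    if len(pts) == 1:
        return Node(pts[0][0], list(pts))
    mid = (len(pts) + 1) // 2
    left, right = build(pts[:mid]), build(pts[mid:])
    assoc = list(heapq.merge(left.assoc, right.assoc, key=lambda p: p[1]))
    return Node(pts[mid - 1][0], assoc, left, right)


def query_1d(node: Node, y_lo, y_hi, out: List[Point]) -> None:
    """Report the points of node's canonical subset with y in [y_lo, y_hi]."""
    i, j = bisect_left(node.ys, y_lo), bisect_right(node.ys, y_hi)
    out.extend(node.assoc[i:j])


def check_leaf(leaf: Node, x_lo, x_hi, y_lo, y_hi, out: List[Point]) -> None:
    x, y = leaf.assoc[0]
    if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
        out.append((x, y))


def query_2d(root: Node, x_lo, x_hi, y_lo, y_hi) -> List[Point]:
    out: List[Point] = []
    split = root                              # find the split node on x
    while split.left is not None and (x_hi <= split.x_mid or x_lo > split.x_mid):
        split = split.left if x_hi <= split.x_mid else split.right
    if split.left is None:
        check_leaf(split, x_lo, x_hi, y_lo, y_hi, out)
        return out
    v = split.left                            # path to x_lo
    while v.left is not None:
        if x_lo <= v.x_mid:
            query_1d(v.right, y_lo, y_hi, out)
            v = v.left
        else:
            v = v.right
    check_leaf(v, x_lo, x_hi, y_lo, y_hi, out)
    v = split.right                           # path to x_hi
    while v.left is not None:
        if x_hi > v.x_mid:
            query_1d(v.left, y_lo, y_hi, out)
            v = v.right
        else:
            v = v.left
    check_leaf(v, x_lo, x_hi, y_lo, y_hi, out)
    return out
```

Each of the O(log n) selected nodes costs one binary search plus the points it reports.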
Theorem 1. A 2D range tree on n data points
can be constructed in O(n log n) time and occupies
O(n log n) space. A range search query on a range
with k points in it takes O(k + log² n) time.
Proof. The construction time can be proved using
ideas from the proof of Lemma 4. On the primary BBST
(based on points sorted according to x-coordinates), we
perform a 1D range search for nodes whose canonical
subsets have x coordinates that lie within the x
range that we are searching for.
There are O(log n) such nodes. For each of these
nodes, we look at the associated BBST (based on the
canonical subset sorted according to the y-coordinate)
and perform a 1D range search for points whose y
coordinates fall within the range we are searching for.
Each such search traverses O(log n) nodes, so overall,
these traversals require O(log² n) time.
In these associated BBSTs we look for subtrees
that are fully contained within our search range and
report all points in such subtrees. Since such reporting
is linear in the number of points stored in those
subtrees, this adds an O(k) term to the query time.
Therefore, the total query time is O(k + log² n).