CS6100: Topics in Design and Analysis of Algorithms
Range Searching
John Augustine
CS6100 (Even 2012): Range Searching

The Range Searching Problem

Given a set P of n points in R^d, for a fixed integer d ≥ 1, we want to preprocess P and store it in a data structure so that, given a query range (typically an axis-parallel rectangle), we can report all the points in the range quickly.

For 1D range searching, we will study (i) balanced binary search trees and (ii) skip lists. For 2D point sets, we will study (i) kd-trees and (ii) range trees, both of which can be extended to arbitrary d-dimensional point sets.

[Figure: employees plotted by date of birth (horizontal axis) and salary (vertical axis). The query rectangle spans dates of birth 19,500,000 to 19,559,999 and salaries 3,000 to 4,000; it contains G. Ometer, born Aug 19, 1954, salary $3,500.]

Balanced Binary Search Trees (BBST)

Given a set P of n points in R stored in a sorted array A, we can construct a tree that has depth O(log n). For simplicity, we begin with the assumption that n = 2^k for some integer k. The data nodes are in the leaves. The internal nodes store values that guide the search. The root node stores the 2^(k−1)-th element of A. While searching for a value x in the query phase, if x is less than or equal to the value stored in the root, the search is guided to the left subtree. Otherwise, the search is guided to the right subtree.

The left subtree is constructed recursively over the points stored in A at positions 1 through 2^(k−1). The right subtree is constructed over the points at positions 2^(k−1) + 1 through 2^k. When constructing an internal node on a 2-element point set, the left child is simply the leaf storing the smaller of the two points and the right child is the leaf storing the larger, thus terminating the recursion. The construction can be easily adapted for arbitrary n.

[Figure: an example BBST on the points 3, 10, 19, 23, 30, 37, 49, 59, 62, 70, 80, 89, 100, 105, with the query endpoints µ and µ′ marked.]

Lemma 1.
If the set of points is sorted, we can construct the BBST in O(n) time. If not, it takes O(n log n) time, as we first have to sort the point set. The BBST data structure requires O(n) storage space.

To search for a single value µ, we start at the root node and ask whether µ is greater than the value stored in the root. If it is, we move to the right subtree; otherwise, we move to the left. We continue recursively until we reach a leaf, where we can report whether µ is present.

To query a range [µ, µ′], we traverse the tree for both µ and µ′ until we find the node where the two search paths part ways; call it ν_split.

[Figure: the search paths for µ and µ′ from root(T), diverging at ν_split, with the selected subtrees hanging between the two paths.]

As we continue towards µ (past ν_split), each time we move to a left subtree, we report all points in the right subtree that we leave behind. We deal with µ′ symmetrically. Finally, we check the leaf at the end of each path and report it if it lies in the range.

Lemma 2. The time to report the points in a range [µ, µ′] is O(k + log n), where k is the number of points in [µ, µ′].

Proof. The tree traversal requires O(log n) time. Reporting the points in a selected subtree requires O(k′) time, where k′ is the number of points on which that particular subtree is built, because a subtree with k′ leaves has fewer than 2k′ nodes. The selected subtrees are disjoint and contain only points in the range, so O(k) time is required to report all k points.

Preprocessing time                 O(n log n)
Space                              O(n)
Searching for 1 element            O(log n)
Reporting a range with k items     O(k + log n)
Insertion                          O(log n)
Deletion                           O(log n)

Table 1: Performance bounds of a BBST containing n points.

Skip List

While the static implementation of a binary search tree is very straightforward, making the data structure dynamic (i.e., adding points to and deleting points from the point set) is non-trivial. The skip list is a randomized data structure that allows easy implementation, including updates (insertions and deletions). In expectation, it has the same performance bounds as BBSTs (in Table 1).
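Since the skip list is measured against the BBST, it may help to see the BBST construction and the range query of Lemma 2 as code. What follows is a minimal Python sketch, not the course's reference implementation. The names Node, build, report_subtree and range_query are hypothetical, the points are assumed to be distinct and already sorted, and each internal node is assumed to store the largest key of its left subtree (which matches "the root stores the 2^(k−1)-th element" when n = 2^k).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    key: float                       # leaf: a data point; internal: largest key in the left subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def build(a: List[float], lo: int = 0, hi: Optional[int] = None) -> Node:
    """Build a BBST over the sorted, non-empty slice a[lo:hi] in linear time."""
    if hi is None:
        hi = len(a)
    if hi - lo == 1:
        return Node(a[lo])
    mid = (lo + hi + 1) // 2         # the left half gets the extra element when the size is odd
    return Node(a[mid - 1], build(a, lo, mid), build(a, mid, hi))


def report_subtree(v: Node, out: List[float]) -> None:
    """Append every point stored in the leaves below v."""
    if v.is_leaf:
        out.append(v.key)
    else:
        report_subtree(v.left, out)
        report_subtree(v.right, out)


def range_query(root: Node, mu: float, mu2: float) -> List[float]:
    """Report all points in [mu, mu2]; the output order is not sorted."""
    out: List[float] = []
    # Walk down while both search paths agree; stop at the split node.
    v = root
    while not v.is_leaf and (mu2 <= v.key or mu > v.key):
        v = v.left if mu2 <= v.key else v.right
    if v.is_leaf:
        if mu <= v.key <= mu2:
            out.append(v.key)
        return out
    # Path towards mu: whenever we go left, the right subtree lies in the range.
    u = v.left
    while not u.is_leaf:
        if mu <= u.key:
            report_subtree(u.right, out)
            u = u.left
        else:
            u = u.right
    if mu <= u.key <= mu2:
        out.append(u.key)
    # Path towards mu2: symmetric, reporting left subtrees.
    u = v.right
    while not u.is_leaf:
        if mu2 > u.key:
            report_subtree(u.left, out)
            u = u.right
        else:
            u = u.left
    if mu <= u.key <= mu2:
        out.append(u.key)
    return out


points = [3, 10, 19, 23, 30, 37, 49, 59, 62, 70, 80, 89, 100, 105]
tree = build(points)
print(sorted(range_query(tree, 18, 77)))   # [19, 23, 30, 37, 49, 59, 62, 70]
```

Building by index range rather than by slicing keeps the construction linear for sorted input, in line with Lemma 1. With duplicate keys this sketch could miss copies that fall on the other side of a split, which is why distinctness is assumed.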
[Figure: an example skip list with its head pointer.]

Construction

Again, we assume that the set P of n points is given to us in sorted order. We denote the i-th element of P in the sorted list by p_i. In our data structure, we use nodes with four pointers: left, right, top and bottom.

We first construct the bottom level (level 0), which is a linked list of the sorted points using the four-pointer node structure. The bottom pointers are set to null. For each p_i, we toss a fair coin repeatedly until we get Heads. Let ℓ_i be the number of Tails before we obtain the first Heads.

Vertical pointers. Besides its level-0 node, we make ℓ_i identical nodes containing p_i, one for each level from 1 up to ℓ_i, and we chain them up as follows. For 0 ≤ j < ℓ_i, the top pointer of the level-j node points to the level-(j + 1) node, and the bottom pointer of the level-(j + 1) node points to the level-j node. The top pointer of the level-ℓ_i node is null.

The topmost level is ℓ = max_i ℓ_i, so the levels are numbered 0 through ℓ. For each level, we have two special boundary nodes, one to the left of all nodes in that level and the other to the right. The boundary nodes are also chained up vertically.

Horizontal pointers. We establish horizontal links at each level j, starting from j = 1 up to j = ℓ. We start from the left boundary of level j. For each node η in level j (starting from the left boundary), we step down to its copy in level j − 1 and traverse to the right until we come to a node in level j − 1 that has a copy η′ in level j. We establish bidirectional links between η and η′ and continue this process from η′ until we reach the right boundary.

The head pointer points to the left boundary of level ℓ.

Searching for a Point p

Here, given p, we want to report whether P (stored using the skip list data structure) contains p. For simplicity, assume that the left boundary nodes store −∞ and the right boundary nodes store +∞. Start from the head pointer. Repeat the following steps:
1. Moving right along the current level, find the last node whose value is at most p.
   If the value is exactly p, we have found it, so we can terminate.
2. Else, if we have reached level 0, report that p is not in P and terminate.
3. Else, step directly down one level.

Exercises

1. How do we search for points in a range?
2. How do we insert a new node?
3. How do we delete a node?
4. Suppose you are given a skip list. Can you strategically add and delete points so that the query times become bad (i.e., ω(log n))? Note that you will have to play the role of an adaptive adversary that can see the coin tosses (and therefore see the data structure as it evolves).
5. Suppose the coin tosses are hidden from you and you cannot measure the actual query times. Can you still strategically add and delete points so that the query times become bad? (Such an adversary that cannot see the coin tosses is called an oblivious adversary.)
6. An alternative way to ask the previous question is the following. How do we prove that, under an oblivious adversary, the expected performance bounds of a skip list match Table 1?

kd-Trees

Recall that we now want to perform 2D range searches.

[Figure: the date of birth versus salary example again, with the query rectangle spanning dates of birth 19,500,000 to 19,559,999 and salaries 3,000 to 4,000 around G. Ometer, born Aug 19, 1954, salary $3,500.]

So, we need a data structure that considers both the x AND the y coordinates. kd-trees achieve this by alternating between x and y.

Let us now recursively construct the kd-tree given a set P of n points in 2D. As in BBSTs, the data is stored in the leaves. The internal nodes serve the purpose of guiding searches to the required leaves. The root node (level 0) of the kd-tree corresponds to the entire data set.

To construct the level 1 nodes, i.e., the left and right children of the root, we split the data along the x median.
The subtree rooted at the left child of the root node stores all points with x-coordinate no more than the x median. The rest are stored in the right subtree of the root node.

[Figure: a vertical line ℓ splits P into P_left and P_right.]

To construct the level 2 nodes, we again split the points stored in the subtrees rooted at each of the level 1 nodes into two roughly equal halves. However, this time, we split along the y median. We continue recursively, alternating between splitting along x and y medians.

[Figure: a kd-tree on the points p1 to p10 with splitting lines ℓ1 to ℓ9, shown both as a subdivision of the plane and as a binary tree whose leaves are the points.]

Algorithm BUILDKDTREE(P, depth)
Input. A set of points P and the current depth depth.
Output. The root of a kd-tree storing P.
    if P contains only one point
        then return a leaf storing this point
        else if depth is even
                 then Split P into two subsets with a vertical line ℓ through the median x-coordinate of the points in P. Let P1 be the set of points to the left of ℓ or on ℓ, and let P2 be the set of points to the right of ℓ.
                 else Split P into two subsets with a horizontal line ℓ through the median y-coordinate of the points in P. Let P1 be the set of points below ℓ or on ℓ, and let P2 be the set of points above ℓ.
             ν_left ← BUILDKDTREE(P1, depth + 1)
             ν_right ← BUILDKDTREE(P2, depth + 1)
             Create a node ν storing ℓ, make ν_left the left child of ν, and make ν_right the right child of ν.
             return ν

Preprocessing Time and Storage

At each internal node, we have to split P into two sets. This requires O(n) time if the internal node is built on n elements. Subsequently, two recursive calls are made on point sets that contain roughly n/2 elements each. Therefore, the recurrence relation for the preprocessing time on n elements is T(n) = O(n) + 2T(n/2), which evaluates to T(n) = O(n log n).
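The BUILDKDTREE recursion above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the names KdNode, build_kd_tree and kd_query are hypothetical, and the median is found by sorting at every node for brevity, which costs O(n log^2 n) overall instead of the O(n log n) obtained with linear-time splitting. A simplified query is included only so that the tree can be exercised: it descends into every child whose side of the splitting line overlaps the query rectangle, and it omits the shortcut of reporting a whole subtree once its region is known to be contained in the rectangle.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class KdNode:
    point: Optional[Point] = None    # set only at leaves
    axis: int = 0                    # 0: vertical splitting line (x), 1: horizontal (y)
    split: float = 0.0               # coordinate of the splitting line
    left: Optional["KdNode"] = None  # points on or to the left of / below the line
    right: Optional["KdNode"] = None


def build_kd_tree(points: List[Point], depth: int = 0) -> KdNode:
    """Build a kd-tree on a non-empty list of 2D points."""
    if len(points) == 1:
        return KdNode(point=points[0])
    axis = depth % 2                              # even depth: split on x; odd depth: split on y
    pts = sorted(points, key=lambda p: p[axis])   # sorting stands in for median selection
    mid = (len(pts) + 1) // 2                     # P1 receives the median point
    return KdNode(
        axis=axis,
        split=pts[mid - 1][axis],
        left=build_kd_tree(pts[:mid], depth + 1),
        right=build_kd_tree(pts[mid:], depth + 1),
    )


def kd_query(v: KdNode, x1: float, x2: float, y1: float, y2: float,
             out: Optional[List[Point]] = None) -> List[Point]:
    """Report the points in [x1, x2] x [y1, y2] (simplified traversal)."""
    if out is None:
        out = []
    if v.point is not None:                       # leaf: test the stored point directly
        x, y = v.point
        if x1 <= x <= x2 and y1 <= y <= y2:
            out.append(v.point)
        return out
    lo, hi = (x1, x2) if v.axis == 0 else (y1, y2)
    if lo <= v.split:                             # the rectangle reaches the left/lower side
        kd_query(v.left, x1, x2, y1, y2, out)
    if hi >= v.split:                             # the rectangle reaches the right/upper side
        kd_query(v.right, x1, x2, y1, y2, out)
    return out


pts = [(x, (7 * x) % 13) for x in range(13)]
tree = build_kd_tree(pts)
print(sorted(kd_query(tree, 2, 9, 3, 10)))
```

The right child is visited when hi >= split rather than hi > split so that points tied with the median coordinate, which may land on either side of the index-based split, are still found.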
To analyse the space required by a kd-tree that stores n points, first note that if a binary tree T has n leaves and each of its internal nodes has exactly two children, then T has n − 1 internal nodes. Since any kd-tree is such a tree, the space required is O(n).

Region of a node

Note that each node in the kd-tree has a region associated with it. The region associated with the root is the entire plane. Subsequently, the region gets divided based on where the points are split.

[Figure: the region region(ν) of a node ν, bounded by the splitting lines ℓ1, ℓ2 and ℓ3 on the path from the root to ν.]

Query Procedure

Traverse the kd-tree, but only visit nodes whose regions intersect the query rectangle.
• When a region is fully contained in the query rectangle, just report all points in the subtree.
• When the traversal reaches a leaf, check whether its point lies in the query rectangle and report it if so.

Lemma 3. A query with an axis-parallel rectangle in a kd-tree of n points takes O(√n + k) time, where k is the number of points reported.

Proof Sketch. Reporting all points in a region fully contained in the query rectangle takes time linear in the number of points in the region. Therefore, reporting all points in regions contained within the query rectangle takes O(k) time.

Consider the nodes that were visited, but whose regions were not fully contained in the query rectangle. We only spend O(1) time in each such node. Therefore, we can account for the remaining running time by (asymptotically) counting the number of such nodes.
• The region of each such node is cut by one of the four boundaries of the query rectangle.
• Therefore, the number of such nodes is asymptotically upper bounded by the maximum number of regions in the kd-tree that a single axis-parallel line can intersect.
• We build a recurrence Q(n) that captures the maximum number of regions in a kd-tree storing n points that such a line can intersect.
• Since the kd-tree alternates between vertical and horizontal splits, Q(n) must be defined across two levels. A vertical line, say, meets the region of only one child of a vertically split node, but it can meet the regions of both children of that child. So after two levels the line continues into at most two of the four grandchildren, each storing about n/4 points. In particular, Q(n) = 2 + 2Q(n/4), which evaluates to Q(n) = O(√n).

Thus the total query time is O(√n + k).

Range Trees

The range tree is a data structure for range searching whose query time (more precisely, the term that does not depend on the output size) is polylogarithmic in n instead of O(√n). Its preprocessing time and space complexity are both O(n log n).

The key to designing multi-dimensional range searching data structures is to combine searching along multiple coordinate axes. While we alternated between the x and y coordinates in kd-trees, in range trees we first build a tree on the x-coordinate and then, for each internal node of the x-coordinate tree, we build a separate tree on the y-coordinate.

To construct the range tree, it is helpful to store two copies of the set of points (at each recursive call), one sorted according to the x-coordinates and the other sorted according to the y-coordinates.

Recall 1D Range Searching

Before we see how 2D range trees can be constructed, we first recall 1D BBSTs.

[Figure: a 1D range query with split node ν_split and query endpoints µ and µ′.]

We store the data as leaves in a balanced binary search tree. The canonical subset P(ν) of a node ν is the data stored in the leaves of the subtree rooted at ν.

In 2D range trees, the primary tree is a 1D BBST based on the x-coordinates of the points. For each internal node ν, we additionally store an associated tree based on the y-coordinates of the canonical subset P(ν) of ν.

2D Range Tree

[Figure: the primary tree T is a binary search tree on x-coordinates; each node ν points to T_assoc(ν), a binary search tree on the y-coordinates of its canonical subset P(ν).]

Algorithm BUILD2DRANGETREE(P)
Input. A set P of points in the plane.
Output. The root of a 2-dimensional range tree.
    Construct the associated structure: Build a binary search tree T_assoc on the set P_y of y-coordinates of the points in P.
    Store at the leaves of T_assoc not just the y-coordinates of the points in P_y, but the points themselves.
    if P contains only one point
        then Create a leaf ν storing this point, and make T_assoc the associated structure of ν.
        else Split P into two subsets; one subset P_left contains the points with x-coordinate less than or equal to x_mid, the median x-coordinate, and the other subset P_right contains the points with x-coordinate larger than x_mid.
             ν_left ← BUILD2DRANGETREE(P_left)
             ν_right ← BUILD2DRANGETREE(P_right)
             Create a node ν storing x_mid, make ν_left the left child of ν, make ν_right the right child of ν, and make T_assoc the associated structure of ν.
    return ν

Lemma 4. A 2D range tree on n data points takes O(n log n) storage.

Proof. A data point p is stored only in the associated trees attached to the nodes of the primary tree on the path from the root to the leaf storing p. Hence, at a given level of the primary tree, p is stored in only one associated structure. Since each associated tree uses linear storage, each data point contributes O(1) storage in each of the O(log n) levels. Therefore, the total space is O(n log n).

Algorithm 2DRANGEQUERY(T, [x : x′] × [y : y′])
Input. A 2-dimensional range tree T and a range [x : x′] × [y : y′].
Output. All points in T that lie in the range.
    ν_split ← FINDSPLITNODE(T, x, x′)
    if ν_split is a leaf
        then Check if the point stored at ν_split must be reported.
        else (∗ Follow the path to x and call 1DRANGEQUERY on the subtrees right of the path. ∗)
             ν ← lc(ν_split)
             while ν is not a leaf
                 do if x ≤ x_ν
                        then 1DRANGEQUERY(T_assoc(rc(ν)), [y : y′])
                             ν ← lc(ν)
                        else ν ← rc(ν)
             Check if the point stored at ν must be reported.
             Similarly, follow the path from rc(ν_split) to x′, call 1DRANGEQUERY with the range [y : y′] on the associated structures of subtrees left of the path, and check if the point stored at the leaf where the path ends must be reported.

Theorem 1. A 2D range tree on n data points can be constructed in O(n log n) time and occupies O(n log n) space. A range search query on a range with k points in it takes O(k + log^2 n) time.

Proof. The construction time can be proved using ideas from the proof of Lemma 4.

On the primary BBST (based on the points sorted according to x-coordinates), we perform a 1D range search for nodes whose canonical subsets have x-coordinates that lie within the x-extent of the query range. There are O(log n) such nodes. For each of these nodes, we look at the associated BBST (based on the canonical subset sorted according to the y-coordinates) and perform a 1D range search for points whose y-coordinates fall within the query range. Each such search traverses O(log n) nodes, so overall these traversals require O(log^2 n) time.

In these associated BBSTs, we look for subtrees that are fully contained within our search range and report all points in such subtrees. Since such reporting is linear in the number of points stored in those subtrees, this adds an O(k) term to the query time. Therefore, the total query time is O(k + log^2 n).
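To tie BUILD2DRANGETREE and 2DRANGEQUERY together, here is a small Python sketch under several loudly stated assumptions. The names RangeNode, build_range_tree, report_y and range_query_2d are hypothetical. The associated structure is swapped for a y-sorted list searched with the standard bisect module instead of a second BBST; this keeps the O(log n + k′) cost per associated query but is not the structure the notes describe. The points are assumed to have distinct x-coordinates, and each node re-sorts its canonical subset by y, so construction here costs O(n log^2 n) rather than the O(n log n) of Theorem 1.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class RangeNode:
    x_mid: float                        # median x of the canonical subset (the point's x at a leaf)
    by_y: List[Point]                   # canonical subset P(v) sorted by y; stands in for T_assoc(v)
    ys: List[float]                     # the y-coordinates of by_y, kept separately for bisect
    left: Optional["RangeNode"] = None
    right: Optional["RangeNode"] = None


def build_range_tree(by_x: List[Point]) -> RangeNode:
    """by_x: non-empty list of points sorted by x, with distinct x-coordinates."""
    by_y = sorted(by_x, key=lambda p: p[1])
    mid = (len(by_x) + 1) // 2          # P_left holds the points with x <= x_mid
    node = RangeNode(by_x[mid - 1][0], by_y, [p[1] for p in by_y])
    if len(by_x) > 1:
        node.left = build_range_tree(by_x[:mid])
        node.right = build_range_tree(by_x[mid:])
    return node


def report_y(v: RangeNode, y1: float, y2: float, out: List[Point]) -> None:
    """1D range query on the associated structure of v."""
    out.extend(v.by_y[bisect_left(v.ys, y1):bisect_right(v.ys, y2)])


def range_query_2d(root: RangeNode, x1: float, x2: float,
                   y1: float, y2: float) -> List[Point]:
    out: List[Point] = []

    def check_leaf(u: RangeNode) -> None:
        x, y = u.by_y[0]
        if x1 <= x <= x2 and y1 <= y <= y2:
            out.append((x, y))

    # Find the split node of the x-interval in the primary tree.
    v = root
    while v.left is not None and (x2 <= v.x_mid or x1 > v.x_mid):
        v = v.left if x2 <= v.x_mid else v.right
    if v.left is None:
        check_leaf(v)
        return out
    # Path to x1: query the associated structures of subtrees right of the path.
    u = v.left
    while u.left is not None:
        if x1 <= u.x_mid:
            report_y(u.right, y1, y2, out)
            u = u.left
        else:
            u = u.right
    check_leaf(u)
    # Path to x2: symmetric, using subtrees left of the path.
    u = v.right
    while u.left is not None:
        if x2 > u.x_mid:
            report_y(u.left, y1, y2, out)
            u = u.right
        else:
            u = u.left
    check_leaf(u)
    return out


pts = sorted((x, (7 * x) % 13) for x in range(13))
tree = build_range_tree(pts)
print(sorted(range_query_2d(tree, 2, 9, 3, 10)))
```

Each point appears in the by_y list of every node on its root-to-leaf path, which is exactly the O(n log n) storage counted in Lemma 4.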