Eindhoven University of Technology
Department of Mathematics and Computer Science

An Experimental Evaluation of the Logarithmic Priority-R Tree

Master's thesis
by Ummar Abbas

Advisor: dr. Herman Haverkort
Review Committee: dr. Herman Haverkort, prof. dr. Mark de Berg, dr. ir. Huub van de Wetering

Eindhoven, November 2006

"No amount of experimentation can ever prove me right; a single experiment can prove me wrong." - Albert Einstein

Abstract

The Logarithmic PR-tree (LPR-tree) is an R-tree variant, based on the PR-tree, that maintains the worst-case query time while the tree structure is updated.
This thesis is dedicated to the experimental study of the LPR-tree using TPIE, a C++ template-based I/O-efficient library. It compares the performance of the LPR-tree with the R∗-tree, one of the most popular dynamic R-tree structures.

Acknowledgements

I am very grateful to dr. Herman Haverkort, my advisor, for all the support and help during this thesis. This work would not have been completed in time without the numerous discussions with him. I am also indebted to him for spending the huge effort of reviewing the early versions of this document in a very short time. My sincere thanks to prof. dr. Mark de Berg, head of the Algorithms group, for providing me with an opportunity to work here and for allowing me to work part-time on this thesis alongside my job. Special thanks to Micha Streppel, for the help and guidance in using TPIE. I am thankful to my wife Shabana and son Suhail, for giving me all the encouragement I needed and all the time that was necessary to complete this thesis. Finally, I would like to express my deepest gratitude to my parents, who motivated me to take up this course.

Ummar Abbas
October 25, 2006

Contents

1 Introduction
  1.1 Context
  1.2 Aim and Approach
  1.3 Overview
2 R-Trees
  2.1 Original R-tree
  2.2 Dynamic Versions of R-trees
    2.2.1 Guttman's R-tree update algorithms
    2.2.2 The R∗-tree
    2.2.3 The Hilbert R-tree
    2.2.4 R+-tree
    2.2.5 Compact R-tree
    2.2.6 Linear Node Splitting
  2.3 Static Versions of R-trees
    2.3.1 The Hilbert Packed R-tree
    2.3.2 TGS R-tree
    2.3.3 Buffer R-tree
3 PR-tree Family
  3.1 Priority R-tree
    3.1.1 Pseudo-PR-tree
    3.1.2 PR-tree
  3.2 LPR-tree
4 Design and Implementation
  4.1 General Implementation Issues
  4.2 Two Dimensional Rectangle
  4.3 Pseudo-PR-tree
    4.3.1 Data Structures
    4.3.2 Construction Algorithm
    4.3.3 Implementation Issues
  4.4 LPR-tree
    4.4.1 Structure
    4.4.2 Insertion Algorithm
    4.4.3 Deletion Algorithm
    4.4.4 Implementation Issues
5 Experiments
  5.1 Experimental Setup
  5.2 Datasets
    5.2.1 Real life data
    5.2.2 Synthetic data
  5.3 Bulk Load
  5.4 Insertion
  5.5 Deletion
  5.6 Query
6 Conclusions
A Tables of Experimental Results
  A.1 Bulk Load
    A.1.1 LPR-tree
    A.1.2 R∗-tree
  A.2 Insertion
    A.2.1 Insertion Time
    A.2.2 Insertion I/O's
  A.3 Deletion
    A.3.1 Deletion I/O's and time
  A.4 Query
    A.4.1 LPR-tree
    A.4.2 R∗-tree
B Brief Introduction to TPIE

List of Figures

5.1 Bulk Load CPU time - Uniform dataset
5.2 Bulk Load CPU time - Normal dataset
5.3 Bulk Load CPU time - TIGER dataset
5.4 Bulk Load I/O - Uniform dataset
5.5 Bulk Load I/O - Normal dataset
5.6 Bulk Load I/O - TIGER dataset
5.7 Insertion CPU time - LPR-tree
5.8 Insertion CPU time - R∗-tree
5.9 Insertion average CPU time - LPR-tree vs. R∗-tree
5.10 Insertion I/O's - LPR-tree
5.11 Insertion I/O's - R∗-tree
5.12 Insertion average I/O's - LPR-tree vs. R∗-tree
5.13 Deletion CPU time - LPR-tree
5.14 Deletion CPU time - R∗-tree
5.15 Deletion average CPU time - LPR-tree vs. R∗-tree
5.16 Deletion I/O - LPR-tree
5.17 Deletion I/O - R∗-tree
5.18 Deletion average I/O's - LPR-tree vs. R∗-tree
5.19 Query CPU time (in msec) per B rectangles output
5.20 Query I/O's per B rectangles output
5.21 Empirical Analysis - Theoretical vs. Experimental query I/O results for LPR-tree

Chapter 1 Introduction

1.1 Context

Spatial data management is required in several areas such as computer-aided design (CAD), VLSI design, geo-data applications, etc. Data objects in such a spatial database are multi-dimensional (typically 2D or 3D). It is important to be able to search and retrieve objects using their spatial position. Classical indexing mechanisms such as the B-tree and its variants are not suitable for multiple dimensions and for range queries.
The R-tree data structure introduced by Guttman [8] is considered one of the most efficient mechanisms to handle multi-dimensional spatial data. It is aimed at handling geometric data such as points, line segments, surfaces, volumes and hyper-volumes. Since its introduction to the scientific community, various variants of this structure have been proposed.

1.2 Aim and Approach

This thesis is dedicated to the experimental study of the LPR-tree, an R-tree variant that claims to maintain the worst-case-optimal query guarantees made by its static variant, the PR-tree [3]. In particular, the thesis aims to achieve the following:

- Verify experimentally the theoretical worst-case optimal query guarantees made for the LPR-tree.
- Compare the performance, in terms of I/O's and time, of the LPR-tree against state-of-the-art dynamic update algorithms such as the R∗-tree and the Hilbert R-tree.
- Derive some conclusions on when, and under what specific conditions or scenarios, the LPR-tree can (or cannot) outperform these R-tree variants.

We assume the following I/O model:

- There is a fast internal memory (main memory) that can hold M two-dimensional rectangles, and a single slow disk which holds the data and the results of computation.
- Data is transferred between the internal memory and the external memory in blocks. A block can hold B two-dimensional rectangles. When a block is read from or written to the external memory, an I/O is said to have been performed.
- Algorithms have full control over which block is evicted from main memory and written to disk, and vice versa.

I/O's are the most important bottleneck while working with very large datasets. Therefore, the number of block reads/writes performed by (external-memory) algorithms is considered the measure of the efficiency of an algorithm. To perform the above investigations comprehensively, the LPR-tree is implemented using TPIE, an I/O-efficient C++ library.
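As a minimal illustration of this cost model (the type and names below are our own, not TPIE's API): scanning n consecutive rectangles touches ⌈n/B⌉ blocks, and each block transfer counts as one I/O.

```cpp
#include <cstddef>

// Minimal sketch of the I/O model described above (names are
// hypothetical, not the TPIE API): a block holds B rectangles,
// and every block transferred between disk and memory is one I/O.
struct IoModel {
    std::size_t B;            // rectangles per block
    std::size_t io_count = 0; // block transfers performed so far

    // Scanning n consecutive rectangles reads ceil(n / B) blocks.
    std::size_t scan(std::size_t n) {
        std::size_t ios = (n + B - 1) / B;
        io_count += ios;
        return ios;
    }
};
```

Under this model, an algorithm's efficiency is judged by `io_count` rather than by CPU operations.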
Experiments are designed to observe the performance of the LPR-tree under the variation of several parameters, such as the distribution and size of the dataset. The same kind of experiments have to be repeated for other R-tree variants to compare and analyze the results. Due to time constraints and the non-availability of implementations of other dynamic R-trees on the TPIE platform, the thesis restricts its comparative study to the R∗-tree. The experiments are designed to cover the following areas:

- Performance (in terms of I/O's and time) of the LPR-tree update algorithms, including the cases where the tree gets rebuilt during updates.
- Comparison of the performance of the update algorithms of the LPR-tree and the R∗-tree.
- Comparison of the performance of the LPR-tree query algorithm after bulk load against the performance achieved when queries are interleaved with updates.
- Comparison of the query results of the LPR-tree and the R∗-tree on similar datasets and under similar conditions.

1.3 Overview

The thesis is organized into the following chapters:

- Chapter 2 introduces the original R-tree to the reader (in some detail) to provide context and show how various variants have been developed. This chapter also presents a survey of some popular R-tree variations that are based on different heuristics to achieve good query performance while still having good update performance. Because of the inherent heuristic nature of these algorithms, they do not give any asymptotic bounds on worst-case query performance.
- Chapter 3 describes the two-dimensional pseudo-PR-tree, PR-tree and LPR-tree data structures, together with their update algorithms. The PR-tree is the first variant of the R-tree that guarantees optimal worst-case query time and whose construction is not based on heuristics. To achieve the same query performance during updates, the LPR-tree introduces update algorithms for insertion and deletion.
- Chapter 4 describes all the practical implementation issues, together with pseudo-code of the algorithms and data structures.
- Chapter 5 describes the various experiments carried out on the LPR-tree, gives the results and provides an analysis of those results.
- Chapter 6 gives the conclusions derived from the experimental study, answering the questions raised in Section 1.2.
- Appendix A gives the exact figures of the various quantities measured during the experiments.
- Appendix B gives a brief introduction to TPIE, the I/O-efficient library used for the implementation of the LPR-tree.

Chapter 2 R-Trees

2.1 Original R-tree

An R-tree is a hierarchical data structure based on the B+-tree. A B+-tree is a height-balanced dynamic data structure used to index one-dimensional data, and it supports efficient insertion and deletion algorithms. All data is stored in leaf nodes, while internal nodes contain keys and pointers to other nodes. While performing queries on this structure, the keys stored in the internal nodes help in traversing the tree from the root down to a leaf using binary-search-type comparisons. The R-tree is a similar structure, characterized by the parameters k and K, where K is the maximum number of entries that fit in one node and k ≤ K/2 is the minimum number of such entries. The R-tree has the following properties:

- Each leaf node contains between k and K records, where each record is of the form (MBR, oid), with MBR the minimum bounding box enclosing the spatial data object identified by oid.
- Each internal node, except for the root, has between k and K children, represented by records of the form (MBR, p), where p is a pointer to a child and MBR is the minimum bounding box that spatially contains the MBRs in this child.

Based on this definition, an R-tree on N rectangles has a height of at most Θ(log_k N). A query to find a specific rectangle in an R-tree is called an exact match query.
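The node structure described above can be sketched in C++ as follows. This is a minimal in-memory illustration with invented names, not the TPIE-based implementation used in this thesis:

```cpp
#include <vector>

// Illustrative sketch of R-tree nodes (names are our own, not the
// thesis implementation). A leaf entry is (MBR, oid); an internal
// entry is (MBR, child pointer).
struct MBR {
    double xmin, ymin, xmax, ymax;
    bool contains(const MBR& r) const {
        return xmin <= r.xmin && ymin <= r.ymin &&
               xmax >= r.xmax && ymax >= r.ymax;
    }
};

struct Node {
    bool leaf = true;
    struct Entry {
        MBR box{};
        long oid = -1;          // object id, used in leaf entries
        Node* child = nullptr;  // child pointer, used in internal entries
    };
    std::vector<Entry> entries; // between k and K entries (root excepted)
};

// The defining invariant: every internal entry's MBR spatially contains
// the MBRs of all entries of the child it points to.
bool invariant_holds(const Node& n) {
    if (n.leaf) return true;
    for (const auto& e : n.entries) {
        for (const auto& ce : e.child->entries)
            if (!e.box.contains(ce.box)) return false;
        if (!invariant_holds(*e.child)) return false;
    }
    return true;
}
```

The `invariant_holds` check expresses the containment property that both the exact match query and the window query below rely on when pruning subtrees.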
However, the more common form of query is the window (or range) query. Given a rectangle Q, a window query asks: find all the data rectangles that are intersected by Q. To answer such a query, we simply start at the root of the R-tree and recursively visit all the nodes whose minimum bounding boxes intersect Q; when encountering a leaf l, we report all the rectangles in l that intersect Q. This results in the following algorithm:

Algorithm WindowQuery (ν, Q)
Input: Node ν in the R-tree and a query rectangle Q.
Output: Set A containing all the rectangles in the tree that intersect Q.
1. if ν is a leaf
2.   then Examine each data rectangle r in ν to check if it intersects Q. If it intersects, then A ← A ∪ {r}.
3.   else (∗ ν is an internal node. ∗)
4.     Examine each entry to find exactly those children whose minimum bounding box intersects Q. Let the set of intersecting child nodes be Sc.
5.     for µ ∈ Sc
6.       A ← A ∪ WindowQuery (µ, Q).
7. return A.

Various R-tree variations [12] have been proposed, some of them adapted for specific instances and environments. All the R-tree based data structures proposed in the literature can be classified into one of two categories:

- Dynamic versions of R-trees: R-tree based data structures where the objects are inserted or deleted on a one-by-one basis.
- Static versions of R-trees: R-tree based data structures that are built with bulk-loading algorithms using a-priori known static data.

2.2 Dynamic Versions of R-trees

2.2.1 Guttman's R-tree update algorithms

Guttman [8] provides insertion and deletion algorithms for the R-tree structure proposed by him. The insertion and deletion algorithms use the bounding boxes of the nodes to ensure that nearby elements are placed in the same leaf node. Inserting a rectangle into the R-tree basically involves adding the rectangle to a suitable leaf node. As rectangles get inserted, at a certain point a leaf node overflows, thus requiring a split.
This creates another child pointer in the parent node, which may cause the parent node to split, and so on; eventually, in the worst case, the root node is split. The insertion algorithm is based on heuristics for the splitting process, chosen so that good query performance can be achieved. The algorithm can be summarized as follows:

Algorithm Insert (r)
Input: A rectangle r to be inserted.
1. Descend through the tree to find a leaf node L whose MBR requires least enlargement to accommodate r.
2. if L does not have enough room to accommodate r
3.   then split the node L to obtain an additional leaf node LL containing r. Propagate the split upwards through the tree, if necessary splitting the root node to create a new root.
4.   else
5.     Place r in L.
6. return.

In step 1, the leaf node where the rectangle has to be placed is chosen. This is done by descending through the tree, starting at the root node, until a leaf is found; at each step, the entry whose MBR requires least enlargement to include r is chosen. When the leaf node is already full, the node is split, resulting in the K+1 rectangles being redistributed over two leaf nodes according to some splitting policy. If the newly added leaf makes its parent overflow, then the split has to be recursively propagated to the upper levels (step 3). There are three techniques to split nodes in step 3: the linear split, the quadratic split and the exponential split; their names reflect their complexity. They can be summarized as follows:

1. Linear Split. This algorithm is linear in K and in the number of dimensions. Conceptually, it chooses the two rectangles that are furthest apart as seeds. The remaining rectangles are considered in random order and added to the node that requires least enlargement of its MBR.

2. Quadratic Split. The cost is quadratic in K and linear in the number of dimensions.
The algorithm picks two of the K+1 entries to be the first elements of the two new groups, by choosing the pair for which the area of a rectangle covering both entries, minus the area of the entries themselves, would be greatest. The remaining entries are then assigned to the groups one at a time. At each step, the area expansion required to add each remaining entry to each group is calculated, and the entry assigned is the one showing the greatest difference between the two groups.

3. Exponential Split. All possible groupings are tested and the split that results in the least overlap area of the two MBR's is chosen. However, even for reasonably small values of K this strategy is very expensive, as the number of possible groupings is approximately 2^(K−1).

2.2.2 The R∗-tree

Guttman's update algorithms are based completely on minimizing the overlap area of MBR's. Insertions and deletions are intermixed with queries, and there is no periodic reorganization. The structure must allow overlapping rectangles, which means it cannot guarantee that there is only one search path for an exact match query. The R∗-tree is a variant of the R-tree, proposed by Beckmann et al. [5], that strives to reduce the number of search paths for queries by incorporating a combined optimization of several parameters. The following is a summary of the parameters it tries to optimize:

The area covered by each MBR should be minimized. Minimizing the dead space (area covered by the MBR but not by the data rectangles) will improve query performance, as decisions about which paths are to be traversed can be made at higher levels of the tree.

The overlap between MBR's should be minimized. A larger overlap implies more paths to be searched for a query; hence this optimization also serves the purpose of reducing search paths.

The perimeters of MBR's should be minimized. This optimization results in MBR's that are more quadratic. Query rectangles that are quadratic benefit the most from this optimization.
As quadratic rectangles can be packed more easily, the bounding boxes at higher levels in the R-tree are expected to be smaller. In fact, this optimization leads to less variance in the side lengths of bounding boxes, indirectly achieving area reduction.

Storage utilization should be optimized. Storage utilization is defined as the ratio of the total number of rectangles in the R-tree to the maximum number of rectangles (capacity) that can be stored across all the nodes of the R-tree. Low storage utilization means that a larger number of nodes has to be searched during queries. As a result, query cost is very high with low storage utilization, especially when a large part of the data set satisfies the query.

Optimization of the afore-mentioned parameters is not independent, as they affect each other in a complex way. For instance, to minimize dead space and overlap, more freedom in choosing the shape is necessary, which causes rectangles to be less quadratic. Also, minimization of perimeters may lead to reduced storage utilization. Based on the results of several experiments using these optimization criteria, Beckmann et al. propose the following two strategies to obtain a significant gain in query performance:

A new node splitting algorithm that uses the first three optimization criteria:

Algorithm Split (ν)
Input: Node ν in the R-tree that contains the maximum number K+1 of either data rectangles or MBR's.
Output: Nodes ν and µ containing the K+1 entries.
1. for each axis
2.   Sort the entries by min, then by max values on this axis.
3.   Determine the K − 2k + 2 distributions of the K+1 entries into two groups such that each group contains a minimum of k entries.
4.   Compute σ, the minimum sum of the perimeters of the two MBR's across all the distributions.
5. Choose the axis with the minimum σ. The split is performed perpendicular to this axis.
6. From the K − 2k + 2 distributions along the chosen axis, choose the distribution with minimum overlap.
   Resolve ties by choosing the distribution with minimum dead space. The two groups of entries are collected in the nodes ν and µ.
7. return ν and µ.

The split algorithm first determines the axis along which the split will be performed. To do this it considers, for each axis in step 3, K − 2k + 2 distributions of the K+1 entries into two groups, where the i-th distribution is determined by having the first group contain the first (k − 1) + i entries sorted along that axis, and the second group the remaining entries.

An insertion algorithm that uses the concept of forced reinsertion to reinsert a fraction of the entries of an overflowing node, re-balancing the tree at certain steps:

Algorithm Insert (r)
1. Invoke ChooseSubTree to determine the leaf node ν where the insertion has to take place.
2. if ν contains less than K data rectangles
3.   then Insert r in ν.
4.   else (∗ ν has the maximum number of entries. Handle overflow. ∗)
5.     if this is the first time an overflow has occurred at this level
6.       then Reinsert the 30% of the rectangles of ν whose centroid distances from the node centroid are largest.
7.     else
8.       Invoke Split (ν). Propagate the split upwards if necessary.
9. return.

Algorithm ChooseSubTree
1. ν ← root of the R∗-tree.
2. while the children of ν are not leaves
3.   ν ← the child µ whose MBR requires least area enlargement to include r. Resolve ties by choosing the child with least area.
4. Among the children of ν, choose the leaf node whose MBR requires least overlap enlargement to include r. Resolve ties by choosing the node that needs least area enlargement.

The dynamic reorganization of the R-tree by the reinsertion strategy during insertion achieves a kind of tree re-balancing and significantly improves query performance. As reinsertion is a costly operation, the fraction of rectangles reinserted has been experimentally tuned to 30% to yield the best performance. Also, reinsertion is restricted to be done once for each level in the tree.
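The selection of entries for forced reinsertion (line 6 of Insert) can be sketched as follows. This is an illustrative fragment with invented names, not the implementation used in the experiments; it only shows how the 30% of entries farthest from the node centroid are chosen.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch (not the thesis code) of R*-tree forced
// reinsertion: return the indices of the 30% of entries whose
// centroids lie farthest from the centroid of the node's MBR.
struct Rect { double xmin, ymin, xmax, ymax; };

static double cx(const Rect& r) { return 0.5 * (r.xmin + r.xmax); }
static double cy(const Rect& r) { return 0.5 * (r.ymin + r.ymax); }

std::vector<std::size_t> pick_reinsert(const std::vector<Rect>& entries) {
    // Compute the node's MBR; its centroid is the reference point.
    Rect mbr = entries.front();
    for (const Rect& r : entries) {
        mbr.xmin = std::min(mbr.xmin, r.xmin);
        mbr.ymin = std::min(mbr.ymin, r.ymin);
        mbr.xmax = std::max(mbr.xmax, r.xmax);
        mbr.ymax = std::max(mbr.ymax, r.ymax);
    }
    std::vector<std::size_t> idx(entries.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    // Sort indices by decreasing centroid distance from the node centroid.
    std::sort(idx.begin(), idx.end(), [&](std::size_t a, std::size_t b) {
        auto d = [&](const Rect& r) {
            return std::hypot(cx(r) - cx(mbr), cy(r) - cy(mbr));
        };
        return d(entries[a]) > d(entries[b]);
    });
    idx.resize(entries.size() * 30 / 100);  // keep the farthest 30%
    return idx;
}
```

In the actual algorithm, the selected entries are removed from the node and reinserted from the root, which tends to move them into better-fitting neighbouring nodes.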
2.2.3 The Hilbert R-tree

The Hilbert R-tree [10] is an R-tree variant which uses the notion of Hilbert value to define an ordering on the data rectangles. The Hilbert value of an n-dimensional point is calculated using the n-dimensional Hilbert curve. The Hilbert R-tree constructed on a dataset of n-dimensional rectangles uses the centroids of the rectangles to define the ordering. Such an ordering has been shown [10] to preserve the proximity of spatial objects quite well. The ordering of the data rectangles by Hilbert value also allows the Hilbert R-tree to use a conceptually different splitting technique, known as deferred splitting. In addition to this variation in the splitting algorithm, every internal node ν in the tree stores, besides the usual MBR, the largest Hilbert value (LHV) of the data rectangles stored in the subtree rooted at ν. In the original R-tree, when a node overflows, a split is performed, as a result of which two nodes are created from a single node. This is referred to as a 1-to-2 splitting policy. The Hilbert R-tree implements the concept of deferred splitting by using a 2-to-3 splitting policy: a split is not performed when a node overflows as long as that node or one of its siblings, known as its cooperating siblings, can accommodate an additional entry. In general there could be an s-to-(s+1) splitting policy. Finally, the concepts of ordering by Hilbert value and deferred splitting come together in the following insertion algorithm:

Algorithm Insert (r)
Input: A rectangle r to be inserted.
1. h ← Hilbert value of the centroid of r.
2. Recursively descend through the tree to select the leaf node ν for insertion. At each node, select the child entry with the minimum LHV value greater than h.
3. if ν has an empty slot
4.   then Insert r into ν.
5.   else
6.     HandleOverflow (ν, r), which will create a new leaf µ if a split was inevitable.
7.
   Propagate the node split of line 6 upwards, adjusting the MBR's and LHV's of the nodes. If an overflow of the root caused a split, create a new root whose children are the previous root and the new node created by the split.
8. return.

Algorithm HandleOverflow (ν, r)
Input: A rectangle r to be inserted.
Output: A new node µ if a split occurred.
1. ε ← all the entries of ν and its s − 1 cooperating siblings.
2. ε ← ε ∪ {r}.
3. if |ε| ≤ s·K (∗ at least one of the s − 1 cooperating siblings is not full ∗)
4.   then Distribute ε among the s nodes, respecting the Hilbert ordering.
5.     return null.
6.   else (∗ all the s − 1 siblings are full ∗)
7.     Create a new node µ and distribute ε among the s + 1 nodes, respecting the Hilbert ordering.
8.     return µ.

The Hilbert R-tree acts like a B+-tree for insertions and like an R-tree for queries, thereby achieving its acclaimed performance. However, it is vulnerable, performance-wise, to large objects. Performance is also penalized when the dimensionality of the space is increased: in this case, proximity is not preserved well by the Hilbert curve, leading to increased overlap of MBR's in internal nodes.

2.2.4 R+-tree

The R+-tree was proposed [11] as a variation of the R-tree structure to improve the performance of exact match queries. The original R-tree suffered from the problem that an exact match query may lead to the investigation of several paths from the root to the leaves, especially in cases where the data rectangles are dense or clustered. To obtain better query performance in such cases, the R+-tree introduces a variation of the R-tree structure that does not allow overlapping of MBR's at the same level of the tree. This is achieved by duplicating the stored data rectangles in more than one node. Because of this structural difference, the following changes are made to the algorithms:

Query - The query algorithm is similar to the one used for the R-tree, with the only difference of removing duplicate results.
Insertion - The insertion algorithm proceeds in the same way as in the original R-tree to find the node whose MBR overlaps the rectangle r to be inserted. Once such a node is found, r is either inserted there, if there is enough space, or the node is split, resulting in a, sometimes drastic, reorganization of the tree structure, which eventually stores r in duplicate. Under certain extreme circumstances this can even lead to a deadlock [12].

Deletion - The duplication of stored rectangles means that the deletion algorithm must take care to delete all occurrences of the rectangle to be deleted. Deletion is followed by a phase in which the MBR's have to be adjusted. Deletion may, however, reduce storage utilization significantly, requiring the tree to be periodically reorganized.

2.2.5 Compact R-tree

The Compact R-tree mechanism focuses on improving storage utilization in order to improve query performance. A very simple heuristic is applied to improve space utilization during insertions. When a node ν overflows, the K rectangles among the K+1 available rectangles whose MBR is the minimum possible are chosen. These rectangles are kept in ν, and the remaining rectangle is moved to one of ν's siblings, provided it has space, choosing the sibling whose MBR requires least enlargement. A split takes place only when all the siblings are completely filled with K rectangles each. This heuristic has experimentally been shown to improve utilization to around 97% to 99%. Insertion performance is improved by the fact that fewer splits are required. However, the performance of window queries is seen not to differ much from Guttman's R-tree.

2.2.6 Linear Node Splitting

The technique proposed in [1] introduces a new algorithm for linear node splitting, which can essentially be substituted for the splitting algorithm used in the original R-tree. It splits nodes based on the following heuristics, applied in the order stated:

- Distribute the rectangles as evenly as possible.
- Minimize the overlapping area between the nodes.
- Minimize the total perimeter of the two nodes.

This means that when a node ν overflows, each of the K+1 rectangles is assigned to two of four lists: Lxmin, Lymin, Lxmax and Lymax. More precisely, for each rectangle r it is determined whether the rectangle is closer to the left or the right edge of the MBR of ν, and r is assigned to the corresponding x-dimensional list, Lxmin or Lxmax. Then, according to the y-coordinates of the rectangle, it is assigned to one of the y-dimensional lists, Lymin or Lymax. The node is split along the x-dimension if MAX(|Lxmin|, |Lxmax|) > MAX(|Lymin|, |Lymax|). If this is not true, the split is performed along the y-dimension, unless the two maxima are equal; in the latter case the overlap of these sets is considered, and if this turns out to be equal as well, the total coverage is considered. Experiments have shown that these heuristics result in R-trees that have better characteristics and perform better for window queries than the quadratic algorithm proposed by Guttman.

2.3 Static Versions of R-trees

2.3.1 The Hilbert Packed R-tree

The Hilbert Packed R-tree [9] is an R-tree structure designed with the aim of achieving 100% space utilization together with good query performance, for applications where the R-tree does not require modifications, or requires them only very infrequently. In order to achieve very good query performance, data rectangles that are in close proximity must be clustered together in the same leaf. Like its dynamic counterpart, this structure uses the Hilbert curve, and the resulting Hilbert values of the centroids of the rectangles, as a heuristic to cluster rectangles. The tree is constructed bottom-up, starting at the leaf level and finishing at the root. The construction algorithm is outlined below:

Algorithm HilbertPack (S)
Input: Set S of data rectangles to be organized into an R-tree.
Output: R-tree τ packed with the data rectangles in the set S.
1. for each data rectangle r in S
2. do Calculate the Hilbert value of the centroid of r.
3. Sort the rectangles in S on ascending Hilbert values calculated in line 2.
4. l ← 0
5. while S ≠ ∅
6. do Create a new leaf node ν.
7. Add B rectangles to ν and remove them from S.
8. while there are > 1 nodes at level l
9. do I ← MBRs of the nodes at level l.
10. while I ≠ ∅
11. do Create a new internal node µ.
12. Add B MBRs with child pointers to µ and remove them from I.
13. l ← l + 1.
14. Set the root node of τ to the one node left.
15. return τ.

Experiments [9] showed that this variant of the R-tree significantly outperforms the original R-tree with quadratic split and the R∗-tree.

2.3.2 TGS R-tree

Unlike the Hilbert Packed R-tree, which constructs the R-tree bottom-up, the Top-down Greedy Splitting (TGS) method presented in [7] constructs the tree in a top-down manner, using an aggressive approach that greedily constructs the various subtrees of the R-tree. A top-down approach minimizes the cost of the levels that allow a potentially bigger reduction in the overall cost, i.e., the top levels of the R-tree. Essentially, the algorithm recursively partitions a set of N rectangles into two subsets by a cut orthogonal to an axis. This cut must satisfy the following two conditions:

1. The cost of an objective function f(r1, r2), where r1 and r2 are the MBRs of the resulting two subsets, is minimized.

2. One subset has a cardinality of i·S for some i, where S is fixed per level, so that the resulting subtrees are packed, i.e., their space utilization is 100%.

The algorithm to perform a cut is summarized as follows:

Algorithm TGS (n, f)
Input: n - number of rectangles in the data set. f - function f(r1, r2) that measures the cost of the split.
Output: Two subsets that form two subtrees of an R-tree.
1. if n ≤ K return.
2. for each dimension d
3. for each ordering in this dimension
4. for i ← 1 to n/M − 1
5. r1 ← MBR of the first i·S rectangles.
6. r2 ← MBR of the remaining rectangles.
7. Remember i if f(r1, r2) is at its best value so far.
8. Split the input set at the best position found in line 7.

The orderings in each dimension that are considered in line 3 are based on the min coordinate, the max coordinate, both (min followed by max), and the centroid of the input rectangles. The cutting process described above is repeated recursively on the resulting subsets until a cut is no longer possible. This binary split process can easily be extended to a K-ary split, where each internal node has K entries. This means that, to build the root of a (subtree of an) R-tree on a given set of rectangles, the algorithm repeatedly partitions the rectangles into two sets, until they are divided into K subsets of equal size. Each subset's bounding box is stored in the root, and the subtrees are constructed recursively on each of the subsets.

2.3.3 Buffer R-tree

The Buffer R-tree [2] is not really a static R-tree, but provides efficient algorithms for bulk updates. It achieves I/O efficiency by exploiting the available main memory to cache rectangles when a rectangle is inserted. More precisely, it attaches buffers to the nodes at every i·⌊log_B(M/(4B))⌋-th level of the tree, where i = 1, 2, ..., M is the maximum number of rectangles that fit in main memory, and B is the block size. A node with an attached buffer is called a buffer node. In contrast with many other R-tree variations, the BR-tree does not split a node immediately when it overflows due to an insertion. Instead it stores the rectangle in the buffer of the root node. When the number of items in this buffer exceeds M/4, a specialized procedure is executed to free buffer space. Essentially, this procedure moves rectangles from a full buffer to the next appropriate buffer nodes at lower levels of the tree. Such movements must respect the various branching heuristics. When a rectangle reaches a leaf, it is inserted there, and a split is performed when the leaf overflows. Evidently, some insertions incur no I/O at all.
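The buffering idea can be illustrated with a minimal main-memory sketch. This is illustrative only, not the BR-tree code: the real structure buffers disk blocks and routes rectangles by R-tree branching heuristics, whereas the toy below routes scalar keys through a single split value. The point it demonstrates is the amortization: a node touches its children only once per full buffer, so the cost of one flush is shared by a buffer's worth of insertions.

```cpp
#include <cstddef>
#include <vector>

// Toy buffer node (illustrative, not the BR-tree): insertions accumulate in
// `buffer` and are only pushed down to the children when the buffer holds
// more than `capacity` items, so each flush moves many items at once.
struct BufferNode {
    std::size_t capacity;           // buffer is emptied when size exceeds this
    double splitValue;              // keys are routed by this value
    BufferNode* left;               // child for keys < splitValue
    BufferNode* right;              // child for keys >= splitValue
    std::vector<double> buffer;     // pending insertions
    std::vector<double> leafItems;  // items stored here when the node is a leaf
    std::size_t flushes;            // how often the buffer was emptied

    BufferNode(std::size_t cap, double split,
               BufferNode* l = nullptr, BufferNode* r = nullptr)
        : capacity(cap), splitValue(split), left(l), right(r), flushes(0) {}

    void insert(double key) {
        buffer.push_back(key);
        if (buffer.size() > capacity) flush();
    }

    void flush() {
        ++flushes;
        for (double key : buffer) {
            if (left != nullptr && right != nullptr)
                (key < splitValue ? left : right)->insert(key);  // push down
            else
                leafItems.push_back(key);  // leaf: store directly
        }
        buffer.clear();
    }
};
```

With buffer capacity m, an internal node performs one flush per m+1 insertions it receives, which is the amortization argument behind the BR-tree's bulk-update bounds.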
The BR-tree supports bulk insertions, bulk deletions, bulk loading and batch queries. Experimental results show that the BR-tree requires smaller execution times to perform bulk updates and produces a good index for query processing.

Chapter 3 PR-tree Family

3.1 Priority R-tree

The Priority R-tree [3], or PR-tree, is the first R-tree variant that guarantees a worst-case query performance that is provably asymptotically optimal. The name of the tree derives from the use of priority rectangles to bulk load the tree. The bulk-loading algorithm makes use of an intermediate data structure called the pseudo-PR-tree. In the next section this data structure is described together with a construction algorithm. The exact pseudo-code describing the implementation of the algorithm is presented in chapter 4. For simplicity, the description is for the two-dimensional case. The discussion and the results can easily be generalized to higher dimensions.

3.1.1 Pseudo-PR-tree

Let S = {R1, ..., RN} be a set of N input data rectangles in the plane. The set is mapped to a set of four-dimensional points S∗, where each element of S∗ is obtained from the corresponding rectangle in S using the following relation:

Definition: S∗(Ri) ≡ (xmin(Ri), ymin(Ri), xmax(Ri), ymax(Ri)), where:
xmin(Ri), ymin(Ri) are the coordinates of the bottom-left vertex of Ri;
xmax(Ri), ymax(Ri) are the coordinates of the top-right vertex of Ri.

A pseudo-PR-tree T(S) is defined recursively as follows: the root has at most six children, namely four priority leaves and two so-called KD-nodes. The root also stores the MBR of each of its children.
The first priority leaf contains the B rectangles in S with the smallest xmin-coordinates; the second the B rectangles among the remaining rectangles with the smallest ymin-coordinates; the third the B rectangles among the remaining rectangles with the largest xmax-coordinates; and finally the fourth the B rectangles among the remaining rectangles with the largest ymax-coordinates. The set Sr of remaining rectangles is divided into two sets S1 and S2 of approximately the same size. The KD-nodes are the roots of the pseudo-PR-trees constructed using S1 and S2, respectively. The division of the rectangles is performed using the xmin-, ymin-, xmax- and ymax-coordinates in a round-robin fashion. This means that the division performed at the root node is based on the xmin-values, the division at the next level of recursion on the ymin-values, then on the xmax-values, then the ymax-values, then the xmin-values again, and so on. The split value used is stored in the internal node, in addition to the bounding boxes.

Since each leaf or node of the pseudo-PR-tree is stored in O(1) disk blocks, and since at least 4 of the six leaves contain Θ(B) rectangles, the tree occupies O(N/B) disk blocks. It has been proved [3] that a window query on a pseudo-PR-tree with N rectangles uses O(√(N/B) + T/B) I/O's in the worst case. Following this definition directly, the tree can be constructed from a set of N rectangles in O((N/B) log N) I/O's. However, bulk loading the tree can be done I/O-efficiently, using O((N/B) log_{M/B}(N/B)) I/O's. This is done using a four-dimensional grid that defines a partition of the four-dimensional space and stores the counts of the rectangles that lie in each cell of this partition. The construction algorithm recursively constructs Θ(log M) levels using this grid. Essentially, the grid helps in preventing the I/O's that would otherwise have been required to split the input set correctly based on the appropriate dimension. The following algorithm describes the construction of Θ(log M) levels of the tree.
The algorithm is then used recursively to construct the complete tree.

Algorithm Construct (S∗, ν, level)
Input: S∗ - set of 4-dimensional points representing the N rectangles in two-dimensional space. ν - root of a pseudo-PR-(sub)tree. level - level of the node ν.
Output: Pseudo-PR-tree rooted at ν containing the points in S∗.
1. Construct four sorted lists Lxmin, Lymin, Lxmax, Lymax containing the points in S∗ sorted by the respective dimensions.
2. z ← α·M^(1/4) (∗ Θ(M^(1/4)): α ≥ 0 has to be chosen by the implementation ∗)
3. Using the sorted lists created in step 1, create a four-dimensional grid using the (kN/z)-th coordinate along each dimension, where 0 ≤ k ≤ z − 1. Keep the counts of the points in each grid cell.
4. Split (S∗, level, ν)
5. Add priority leaves to all the nodes created in the previous step.
6. while S∗ has unprocessed points and the subtree rooted at ν has space
7. do
8. r ← next point in S∗.
9. Add r to the tree, so that all the properties of a pseudo-PR-tree are preserved.
10. return

First the lists are sorted along each of the four dimensions. This helps in initializing the rectangle counts in the grid. The algorithm Split is used in step 4 to create all the internal KD-nodes by recursively splitting the grid, until Θ(log M) levels of the tree are constructed. A final step in the construction algorithm (step 6) distributes the rectangles over the tree, respecting the properties of a pseudo-PR-tree. More precisely, we fill the priority leaves by scanning S∗ and filtering each point p through the tree, one by one, as follows: we start at the root ν of the tree and check its priority leaves νxmin, νymin, νxmax and νymax, one by one, in that order. If we encounter a non-full leaf, we simply place p there; if we encounter a full leaf νdim and p is more extreme than the least extreme point p′ in νdim, we replace p′ with p and continue the filtering process with p′.
After we have checked νymax, we continue checking (recursively) in one of the KD-nodes of ν. The KD-node to check is chosen by using the split value stored in ν. The algorithm Split that creates all the internal nodes is summarized below.

Algorithm Split (S, level, ν)
Input: S - set of 4-dimensional points to be split. level - current level being constructed. ν - current node.
(∗ Constructs Θ(log M) levels of the (sub)tree rooted at ν recursively. ∗)
1. if level > β·log M (∗ Θ(log M): β ≥ 0 has to be chosen by the implementation ∗)
2. then return
3. d ← split dimension of level.
4. Using the grid built in step 3 of Construct, find, along the split dimension d, the exact slice where the set can be divided into two sets S1 and S2 of roughly the same size.
5. Create two nodes ν1 and ν2 whose parent is ν, and store the split value used in ν.
6. Split (S1, level + 1, ν1)
7. Split (S2, level + 1, ν2)
8. return

At each recursive step, the appropriate dimension d is used to split the grid (step 4). Using the grid, it is easy to find the approximate slice l where the set of rectangles can be divided into two roughly equal halves. Once this is achieved, the exact split position can be found by scanning the stream sorted along dimension d. In order to do this, only O(N/(Bz)) blocks have to be scanned, as we only have to scan the rectangles that lie in slice l. Once this is done, a new slice l′ containing O(z³) grid cells is added to the grid, basically splitting slice l. The rectangle counts in the grid cells belonging to these slices are computed using the same O(N/(Bz)) blocks.

3.1.2 PR-tree

A two-dimensional PR-tree is an R-tree with a fanout of Θ(B), constructed using a pseudo-PR-tree. It maintains the query performance of O(√(N/B) + T/B) I/O's in the worst case. The following algorithm is used to construct a PR-tree, in a bottom-up manner, on a set S of two-dimensional rectangles:

Algorithm Construct (S)
Input: Set S of N rectangles in two-dimensional space.
Output: PR-tree rooted at node ν.
1. V0 ← leaves constructed from the set S, with Θ(B) rectangles in each leaf.
2. i ← 0.
3. while the number of MBRs of Vi ≥ B
4. do
5. τVi ← pseudo-PR-tree on the MBRs of Vi.
6. Vi+1 ← leaves of τVi.
7. Match the MBRs of the nodes in Vi+1 with the rectangles in Vi, and set the child pointers in Vi to the nodes in Vi+1.
8. i ← i + 1.
9. Construct the root node ν from the MBRs of Vi and set its children.
10. return PR-tree rooted at ν.

It can be proved [3] that this algorithm bulk loads the PR-tree in O((N/B) log_{M/B}(N/B)) I/O's. The PR-tree can be updated using standard heuristic R-tree update algorithms in O(log_B N) I/O's in the worst case, but then the query efficiency is no longer maintained.

3.2 LPR-tree

The Logarithmic Priority R-tree is an adaptation of the conventional R-tree structure, based on the pseudo-PR-tree, aimed at supporting the same worst-case query performance while the tree is updated. The adaptations to the structure are two-fold:

Internal nodes store additional information besides the MBRs.

Leaf nodes are not all at the same level.

The root of an LPR-tree has a number of subtrees of varying capacities. Each of these subtrees, known as Annotated PR-trees (APR-trees), is a normal pseudo-PR-tree where each internal node ν of the tree stores the following information:

Pointers to each of ν's children, and the MBR of each child.

The split value that is used to cut ν in the four-dimensional kd-tree.

For each priority leaf of ν, the least extreme value of the relevant coordinate of any rectangle stored in that leaf.

An LPR-tree has up to ⌈log(N/B)⌉ + 3 subtrees, τ0, τ1, τ2, ..., τ⌈log(N/B)⌉+2. τ0 can store at most B rectangles, and τi, for i > 0, has a capacity of at most 2^(i−1)·B rectangles. Since each APR-tree has a different capacity, the leaf nodes are not all at the same level.
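This geometric sequence of subtree capacities is the classical logarithmic method. As an illustrative sketch (a hypothetical helper, not part of the thesis code), the capacities can be computed directly from N and B:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Capacities of the APR-subtrees of an LPR-tree on up to N rectangles with
// block size B: tau_0 holds at most B rectangles and tau_i (i >= 1) at most
// 2^(i-1)*B, giving ceil(log2(N/B)) + 3 subtrees in total.
// Illustrative helper; name and interface are this sketch's, not the thesis'.
std::vector<std::size_t> aprCapacities(std::size_t N, std::size_t B) {
    std::size_t count = static_cast<std::size_t>(
        std::ceil(std::log2(static_cast<double>(N) / B))) + 3;
    std::vector<std::size_t> cap(count);
    cap[0] = B;                                      // tau_0
    for (std::size_t i = 1; i < count; ++i)
        cap[i] = (std::size_t{1} << (i - 1)) * B;    // tau_i = 2^(i-1) * B
    return cap;
}
```

Note that the capacities of τ0, ..., τ(j−1) sum to exactly 2^(j−1)·B, the capacity of τj; this is what makes the merge-and-rebuild cascade in the insertion algorithm below fit exactly.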
In addition to these adaptations, the LPR-tree structure prescribes a disk layout strategy for the nodes of the tree:

Each internal node of an APR-tree at depth i such that i ≡ 0 (mod ⌊log B⌋), together with its descendant internal nodes down to depth i + ⌊log B⌋ − 1, is stored in a single block.

The smaller APR-trees τi, for m ≥ i ≥ 0 where m = log(M/B), are stored in main memory.

Of the larger APR-trees τi, for ⌈log(N/B)⌉ + 2 ≥ i ≥ l where l = log(N/M), the top i − l levels are kept in main memory and the rest of the tree is stored on disk.

The remaining APR-trees τi, for l > i > m, are stored completely on disk.

The LPR-tree is bulk loaded with a set S of N rectangles by building the APR-tree τ⌈log(N/B)⌉+2; the other trees are left empty. Insertion is done using the following algorithm:

Algorithm Insert (r)
Input: r - rectangle to be inserted.
1. if the number of insertions made so far ≥ the number of rectangles with which the tree was bulk loaded
2. then
3. S ← {rectangles from all the APR-trees} ∪ {r}.
4. Reconstruct the LPR-tree using S.
5. return.
6. if τ0 has less than B rectangles
7. then
8. Insert r in τ0.
9. else
10. j ← 1.
11. for i ← 1 to ⌈log(N/B)⌉ + 2
12. if τi is empty
13. then j ← i and break (∗ continues with step 14 ∗)
14. S ← rectangles from all trees τk where 0 ≤ k ≤ j − 1.
15. Empty all trees τk where 0 ≤ k ≤ j − 1.
16. Build an APR-tree using S and store it as τj.
17. Insert r in τ0.
18. return.

Insertion using this algorithm takes O((1/B)(log_{M/B}(N/B))(log₂(N/M))) I/O's amortized. Deletion is done using the following algorithm:

Algorithm Delete (r)
Input: r - rectangle to be deleted from the LPR-tree.
1. if the number of deletions made so far ≥ half the number of rectangles with which the tree was bulk loaded
2. then
3. S ← {rectangles from all the APR-trees} \ {r}.
4. Reconstruct the LPR-tree using S.
5. return.
6. Search for r in each subtree.
7. if r is not found
8. then return.
9. L ← priority leaf where r is found.
10. Delete r from L.
11. νp ← parent of L.
12. if L contains more than B/2 rectangles
13. then return.
14. Replenish L with rectangles from the sibling priority nodes and from the children priority nodes of the parent of L.
15. return.

The search algorithm in line 6 is quite trivial. It recursively searches the tree, starting at the root. It uses the annotated information stored at each internal node about its priority leaves, and the split value, to determine the child node in which to continue the search. Eventually it either locates the rectangle in a leaf, or reports failure when the search rectangle is not found. If deleting a rectangle from a node L causes the node to underflow, then this node has to be replenished with rectangles (step 14). This is done by moving to L the most extreme B/2 rectangles among the sibling priority nodes that follow L and the children priority nodes of the sibling KD-nodes of L. It is possible that one of the priority nodes from which rectangles were moved underflows in turn. Such priority nodes are replenished recursively, in the same manner as L. Chapter 4 will describe the pseudo-code used in replenishing rectangles in detail. Taking into account the rebuilding of the tree in step 1 of the Delete algorithm and the replenishing of nodes, it can be shown [3] that deleting a rectangle from an LPR-tree takes O(log_B(N/M) · log₂(N/M)) I/O's amortized.

Chapter 4 Design and Implementation

I implemented the PR-tree family for two-dimensional rectangles. The sections that follow explain the design and implementation of the data structures and the algorithms. Section 4.1 describes some general implementation issues that came up during the implementation and how they were handled. Section 4.2 describes the fundamental data structures used throughout the implementation. Sections 4.3 and 4.4 describe the data structures of the pseudo-PR-tree and the LPR-tree in terms of TPIE concepts. These sections also explicitly describe implementation issues related to the respective areas.
4.1 General Implementation Issues

Although the PR-tree had been implemented before, that code could not be reused for implementing the update algorithms, because the underlying TPIE library had undergone major changes (for example, in the caching mechanisms). Most of the following issues occurred at different places in the implementation. Almost all of them are related to the usage of TPIE (see Appendix B).

Memory tracking

AMI block objects represent logical disk blocks in TPIE. These blocks have unique identifiers in an AMI collection. Very often these block objects are created in one place in the code and have to be deleted at a different place (for instance, due to caching). When blocks are not deleted, the available memory runs out and TPIE simply aborts the application. Also, when a deleted block is accessed, exceptions occur at unpredictable places. To solve this range of errors, a memory tracker class was created. Every time a new statement is executed, the memory tracker is invoked to record the block id and the line number where the allocation is made. Similarly, every time a delete statement is executed, the memory tracker is invoked to release the allocation reference (block id). This memory tracker has the following benefits:

– It can be used at different points in the code to check whether allocations and de-allocations match or have occurred correctly.

– It can detect when blocks are re-allocated without being de-allocated.

– At the end of the program, it can dump the allocations that have not yet been de-allocated, together with the line numbers where they were created, which greatly helps in fixing memory leaks. In fact, this trick was used to fix the R∗-tree implementation that was included in the TPIE distribution.

I/O Counts

TPIE provides an interface to obtain the number of block reads and writes for AMI stream and AMI collection objects. However, the interface to obtain these statistics for AMI streams does not work.
Hence, to work around this problem, we take the number of item reads and writes and divide it by the number of such items that fit in a block. For the LPR-tree we know that this gives a good approximation, as most of the block I/O for streams occurs during sorting, when almost all blocks that are read or written are full.

Miscellaneous issues

There were several minor problems related to using TPIE; the following are some of the most important ones, which took some effort to trace:

– It is not possible to delete an item from an AMI stream. This is a problem for the LPR-tree implementation, as it is required to filter out the rectangles already placed in a tree before starting the next recursive step in the construction of the tree. To do so, it would be nice to be able to delete the already placed rectangles from the four streams that are sorted along the four respective dimensions. To work around this problem, we filter out the rectangles to be placed in the tree and sort the remainder again along the four dimensions.

– Sorting an AMI stream that is already sorted returns an error code indicating that the stream is already sorted, as opposed to the normal success scenario where the return value would indicate success. Also, in such a case, the stream that is provided to store the sorted objects is left empty, as the user is expected to reuse the original stream. This problem was easily worked around.

– When a block collection (usually representing a tree) is stored to disk, TPIE also stores a stack of free blocks in a separate file. Care should be taken that, when making a copy of a tree stored on disk, the stack (.stk) file is copied as well. Not doing so results in crashes at different places in TPIE, not easily revealing the actual problem.
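The memory tracker described in section 4.1 can be approximated by a map from block id to allocation site. The sketch below is illustrative (names and interface are this sketch's, not the thesis code): allocations register an id and a source line, deallocations erase the entry, and whatever remains at program exit identifies a leak together with the line that created it.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Minimal sketch of a block-allocation tracker (illustrative, not the thesis
// code). Every `new` registers the block id and the source line; every
// `delete` removes the entry; leftover entries point at leaks.
class MemoryTracker {
    std::map<long, int> liveBlocks;   // block id -> line number of allocation
    std::size_t doubleAllocs = 0;     // re-allocations without a delete

public:
    void onNew(long blockId, int line) {
        // insert() fails if the id is already live: a re-allocation bug.
        if (!liveBlocks.insert({blockId, line}).second)
            ++doubleAllocs;
    }

    void onDelete(long blockId) { liveBlocks.erase(blockId); }

    bool balanced() const { return liveBlocks.empty(); }
    std::size_t doubleAllocations() const { return doubleAllocs; }

    // (id, line) pairs of blocks never deleted, e.g. dumped at program exit.
    std::vector<std::pair<long, int>> leaks() const {
        return {liveBlocks.begin(), liveBlocks.end()};
    }
};
```

In the actual implementation the calls to such a tracker would be placed next to every new and delete of an AMI block, so mismatches can be checked at any point in the program.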
4.2 Two-Dimensional Rectangle Data Structure

A two-dimensional axis-parallel rectangle is represented by the following data structure:

DataStructure TwoDRectangle
Begin
double min[2]
double max[2]
AMI bid id
End

The arrays min and max contain the minimum and maximum values of the coordinates in the x and y directions, respectively. id is a unique identifier in a stream of rectangles. This data structure will simply be referred to as rectangle in the subsequent discussion. A two-dimensional stream of rectangles, TwoDRectangleStream, is an AMI stream of TwoDRectangle objects.

Operations

During the various operations on the LPR-tree, the following two operations are frequently performed on a TwoDRectangle:

– Intersection of rectangles

Two rectangles are said to intersect if their edges intersect or if one rectangle is completely contained in the other. It should be noted that intersection of rectangles is commutative. The following pseudo-code is used to determine rectangle intersection:

PseudoCode Intersects (r1, r2)
Input: Two TwoDRectangle objects r1 and r2.
Output: true if r1 and r2 intersect, false otherwise.
1. b1 ← r2.xmin > r1.xmax (∗ r2 lies to the right of r1 ∗)
2. b2 ← r2.xmax < r1.xmin (∗ r2 lies to the left of r1 ∗)
3. b3 ← r2.ymin > r1.ymax (∗ r2 lies above r1 ∗)
4. b4 ← r2.ymax < r1.ymin (∗ r2 lies below r1 ∗)
5. return not (b1 or b2 or b3 or b4)

Basically, the above algorithm checks whether the rectangles do not intersect and returns the negation of that result.

– Computing the minimum bounding box

Given a list of rectangles, the minimum bounding box is computed by linearly traversing the list while keeping track of the most extreme coordinates in the respective directions. The following pseudo-code describes this procedure:

PseudoCode ComputeMinimumBoundingBox (r, n)
Input: List r of n TwoDRectangle objects.
Output: TwoDRectangle that is the minimum bounding box of all the rectangles in the specified list.
1. mbb ← r0 (∗ mbb is set to the first rectangle in the list ∗)
2. for i ← 1 to n − 1
3. do
4. if mbb.xmin > ri.xmin
5. then mbb.xmin ← ri.xmin
6. if mbb.ymin > ri.ymin
7. then mbb.ymin ← ri.ymin
8. if mbb.xmax < ri.xmax
9. then mbb.xmax ← ri.xmax
10. if mbb.ymax < ri.ymax
11. then mbb.ymax ← ri.ymax
12. return mbb.

4.3 Pseudo-PR-tree

4.3.1 Data Structures

We first describe the priority node and the internal node (KD-node) structures. The tree itself is represented by its root node, which is an internal node. The complete tree is stored on disk as an AMI collection.

PriorityNode

A PriorityNode is an AMI block that can hold at most B rectangles. All these rectangles are stored in the el field of the block. In implementation terms, the priority node class is derived from AMI block. The info field of the block, in this case, contains the number of rectangles currently present in the block. The priority node stores its rectangles in sorted order. The sorting order is determined by the dimension the node represents. For example, if the node is an xmin or a ymin priority node, the rectangles are sorted in ascending order of their coordinate value in that dimension. The sorting order is descending when the node is an xmax or a ymax node. A rectangle can be added to a priority node only when it contains fewer than B rectangles. The rectangles stored are always maintained in sorted order according to the dimension the priority node represents in the pseudo-PR-tree. The insertion pseudo-code is described below; it follows a binary search pattern to locate the position of insertion.

PseudoCode InsertRectangle (r, dim)
Input: TwoDRectangle r to be inserted along the specified dimension dim. (∗ It is assumed that the priority node has at least one position free. ∗)
1. insertValue ← r.dim
2. first ← 0; last ← (n − 1); mid ← 0
3. compare ← <
4. if IsMinDimension(dim)
5. then compare ← >
6. while first ≤ last
7. do
8. mid ← ⌊(first + last)/2⌋
9. midRectangle ← el[mid]
10. midValue ← midRectangle.dim
11. if compare(midValue, insertValue) = true
12. then
13. last ← (mid − 1)
14. lastRectangle ← el[last]
15. lastValue ← lastRectangle.dim
16. if compare(lastValue, insertValue) ≠ true
17. then break
18. else if midValue ≠ insertValue
19. then first ← (mid + 1)
20. else break
21. el[mid] ← r

When the rectangle is inserted in step 21, all rectangles from the position of insertion onward are moved one position to the right. A similar procedure is followed for deleting a rectangle. When the priority node is already full, the least extreme rectangle, that is, the rectangle at position (n − 1), is replaced, provided the rectangle being inserted is more extreme than it.

Internal Node (KD-node)

The following data structure describes the internal node of the pseudo-PR-tree:

DataStructure InternalNode
Begin
TwoDRectangle minimumBoundingBoxes[6]
double leastExtremeValues[4]
double splitValue
AMI bid priorityNodes[4]
AMI bid subTreeIds[2]
int subTreeIndices[2]
End

The internal node holds pointers (block id's) to the children priority nodes and to the roots of the recursive subtrees. The node itself is stored in an AMI block. It also contains the annotated information, namely the minimum bounding boxes of all the children nodes and the least extreme value of each of the priority nodes. There can be at most 4 priority nodes and 2 child KD-nodes. To identify the children KD-nodes completely, it is also necessary to store the location within the block where the node can be found, in addition to the block id.

4.3.2 Construction Algorithm

The construction algorithm recursively builds the tree in a top-down fashion. I/O efficiency in the construction is obtained using a grid. The grid is a four-dimensional structure defined by the coordinate axes xmin, xmax, ymin, ymax.
There are z three-dimensional slices defined orthogonal to each dimension d, using the input stream that has the four-dimensional points sorted on the coordinate values of dimension d. In particular, these slices are defined such that the numbers of rectangles that lie between adjacent slice boundaries are equal. The z slices orthogonal to each of the four dimensions divide the four-dimensional space defined by the set S∗ into a grid that contains z⁴ grid cells. Each grid cell holds the count of the number of rectangles that lie in that cell. As the grid is kept in main memory, splitting the internal nodes to create sibling KD-nodes can be performed I/O-efficiently, as only a limited number of block I/O's is necessary to perform such a split. Using the grid, each recursive step builds a part of the tree in memory and distributes the rectangles. The distribution ensures that the properties of a pseudo-PR-tree are maintained correctly. We first describe the structure of the grid and some of the important operations on the grid, followed by the pseudo-code for the construction algorithm itself.

Axis Segments

There are z axis segments orthogonal to each of the four dimensions. As described earlier, the grid is split recursively during the construction algorithm, to create the internal nodes of the pseudo-PR-tree. The grid helps in determining the slice l that requires a split. However, in order to determine the exact position of the split, the part of the input stream that is sorted along the dimension of the split and contains the rectangles that lie in the slice l needs to be accessed. To achieve this, for each axis segment orthogonal to a certain dimension d, we need to store the offset into the input stream sorted along d. This is achieved by having four hash tables, AxisSegments[4], one for each dimension, whose keys are the coordinate values defining an axis segment and whose values are the offsets to the correct position in the sorted stream.
To use memory efficiently, these hash tables are shared across the sub-grids that are created as a result of splitting the grid. During the process of splitting, new axis segments get added to the hash tables.

Grid

Given the memory constraints, the size z of the grid is tuned to 16. The grid is implemented as a collection of GridCell objects. To be able to efficiently retrieve cells from the grid, given their addresses in terms of the defining coordinate values, the grid cells are assigned id's. This is required for the operations on the grid that will be described later. The grid is then a hash table from these cell id's to the actual grid cells.

DataStructure GridCell
Begin
double axisBegin[4]
double axisEnd[4]
int numberOfRectangles
int id
End

The id of a grid cell can be calculated from the indices of the four coordinate axes defining the grid cell, using the following formula:

id = 16³ · Index(AxisSegments_xmin, axisBegin_xmin) + 16² · Index(AxisSegments_ymin, axisBegin_ymin) + 16 · Index(AxisSegments_xmax, axisBegin_xmax) + Index(AxisSegments_ymax, axisBegin_ymax)

Index(h, k) is a function that retrieves the sorted position of the key k in the hash table h.

Splitting the Grid

At each step of the construction, the grid has to be split along a specified dimension d, depending on the level at which the tree is being constructed. The grid has to be split in such a way that the rectangles are distributed approximately evenly between the two halves. The split has to be performed at a coordinate value along dimension d, such that the rectangles that are greater than or equal to this value are on one side and the rest on the other side. The efficiency of the grid shows here, as the grid keeps the counts of all the rectangles. The following pseudo-code describes the splitting of the grid:

PseudoCode SplitGrid (grid, d)
Input: Dimension d along which the grid has to be split.
Output: Two grids, each containing approximately half of the rectangles.
1. sortedStream ← sorted stream along dimension d.
2. slices ← GetSortedSlices(grid, d) (∗ slices is a map from coordinate value to the rectangle count in that slice ∗)
3. count ← 0
4. Choose the smallest i such that count ← Σ_{j=0..i} slices[j] > StreamLength(sortedStream)/2
5. splitSlice ← Key(slices, i)
6. offset ← AxisSegments_d[splitSlice] (∗ find the precise split coordinate ∗)
7. previousCount ← count − slices[i]
8. while ReadItem(sortedStream, r) and Contains(grid, r)
9. do
10. ++previousCount
11. offset ← CurrentPosition(sortedStream)
12. if previousCount ≥ count
13. then break (∗ continues with step 14 ∗)
14. splitValue ← r.d
15. Insert(AxisSegments_d, offset, splitValue) (∗ insert a new axis ∗)
16. grid1 ← nil; grid2 ← nil (∗ create two new empty grids ∗)
17. for each GridCell in grid (∗ distribute the cells over the two grids ∗)
18. do
19. g ← CurrentItem(grid)
20. if g.axisBegin_d < splitSlice
21. then AddGridCell(grid1, g)
22. else if g.axisBegin_d > splitSlice then AddGridCell(grid2, g)
23. else (∗ the cell lies in the slice being split; split it into two and add it to both grids ∗)
24. g1 ← g; g2 ← g
25. if g1.axisEnd_d ≠ splitValue (∗ splitSlice is not the exact split ∗)
26. then g1.axisEnd_d ← splitValue
27. AddGridCell(grid1, g1)
28. g2.axisBegin_d ← splitValue
29. g2.numberOfRectangles ← 0
30. AddGridCell(grid2, g2)
31. else
32. (∗ the slice boundary coincides with the split value ∗)
33. AddGridCell(grid2, g)
34. Seek(sortedStream, offset) (∗ update the rectangle counts if a slice was split in step 26 ∗)
35. while ReadItem(sortedStream, r) and Contains(slice, r)
36. do
37. cellId ← GetCellId(r)
38. g1 ← grid1[cellId]; g2 ← grid2[cellId]
39. −−g1.numberOfRectangles; ++g2.numberOfRectangles
40. return grid1, grid2
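The slice-selection step at the start of SplitGrid can be sketched in isolation. The helper below is illustrative (its name and interface are this sketch's, not the thesis code) and assumes the per-slice rectangle counts are available in memory, as they are for the in-memory grid:

```cpp
#include <cstddef>
#include <vector>

// Step-4-style slice selection (illustrative sketch): given the rectangle
// count of each slice along the split dimension, find the smallest index i
// whose prefix sum exceeds half of the total number of rectangles, i.e. the
// slice in which the median coordinate, and hence the split value, must lie.
std::size_t findSplitSlice(const std::vector<std::size_t>& sliceCounts,
                           std::size_t totalRectangles) {
    std::size_t prefix = 0;
    for (std::size_t i = 0; i < sliceCounts.size(); ++i) {
        prefix += sliceCounts[i];
        if (prefix > totalRectangles / 2)
            return i;                     // the median lies in slice i
    }
    return sliceCounts.size() - 1;        // fallback: last slice
}
```

Only the rectangles inside the returned slice then have to be scanned from the sorted stream to find the exact split coordinate, which is what keeps the split I/O-efficient.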
This axis and the offset to the axis in the sorted stream are added to the hash table AxisSegments[d] (step 15). Two new grids are then created. Grid cells that lie to the left or to the right of the split value are easily distributed. Care has to be taken when a grid cell lies in the slice that was split (step 23). If the slice selected in step 4 was an exact split, we simply add this cell to the second grid (step 33). However, when a new axis is introduced, we have to update the axes defining the boundaries of the grid cells in step 26. When this happens, the rectangle counts are adjusted by scanning the rectangles that lie in this slice in step 35.

Construction using the grid
The construction algorithm is implemented using a set of procedures that share data by being part of one class. The following class describes the data and operations necessary to perform construction.

Class PseudoPRTree
Data
TwoDRectangleStream inputStream
InternalNodeBlock cachedBlocks[]
PriorityNode cachedPriorityNodes[]
Queue leavesToConstruct
Operations
void Distribute(r, currentBlock, index, level, stream)
void Construct(rootBlock, index)
void SplitNode(currentBlock, grid, currentCount, level)

The construction algorithm builds the tree recursively, with each recursive step constructing z nodes of the tree. The following pseudo-code describes the sequence of operations performed in one recursive step of the construction algorithm:

PseudoCode Construct (rootBlock, index)
Input: rootBlock and index identifying the rootNode of the tree to be constructed.
Output: pseudo-PR-tree constructed from the input stream.
1. for each dimension in xmin, ymin, xmax, ymax
2. AMI_sort(inputStream, dimension) (∗ sort in all dimensions ∗)
3. grid ← ConstructGrid(sortedStreams) (∗ construct the initial grid ∗)
4. SplitNode(rootBlock, grid, 0, 0)
5. remainingRectangles ← nil (∗ rectangles that could not be distributed are tracked ∗)
6. while ReadItem(inputStream, r) (∗ distribute the stream ∗)
7. do
8.
Distribute(r, rootBlock, 0, 0, remainingRectangles) (∗ end while ∗)
9. substreams[|leavesToConstruct|] ← nil (∗ filter the streams associated with each leaf ∗)
10. Clear(cachedPriorityNodes) (∗ write the priority nodes to disk ∗)
11. Clear(cachedBlocks) (∗ write the internal nodes to disk ∗)
12. while ReadItem(remainingRectangles, r)
13. do
14. for each leaf l in leavesToConstruct
15. if Contains(l, r)
16. then AddItem(substreams_l, r)
17. for each leaf l in leavesToConstruct
18. tree ← CreateTree(substreams_l)
19. Construct(l.rootBlock, l.index)

The operation Construct takes an internal node and constructs a subtree rooted at that node in memory, containing at most z nodes. Each such construction phase uses the recursive function SplitNode (step 4) to construct all the internal nodes according to the properties of the pseudo-PR-tree. A distribution phase (step 6), performed by the function Distribute, then distributes the rectangles over the created internal nodes. The construction phase creates a grid (step 3) using the sorted streams (created in step 1), which guides the splitting process. The last phase of the construction algorithm recursively creates subtrees for all the leaves that could not be constructed completely because of memory constraints. The addresses of such leaves are stored in the queue leavesToConstruct. The pseudo-code that constructs the internal nodes of the tree using the grid is given below:

PseudoCode SplitNode (currentBlock, grid, currentCount, level)
Input:
– currentBlock - block to store the internal nodes created.
– grid - Grid object used to split the nodes evenly.
– currentCount - number of internal nodes created so far.
– level - current level being constructed.
1. currentNode ← currentBlock_index
2. if currentCount < z (∗ z = 16, the size of the grid ∗)
3. then
4. if NumberOfRectangles(grid) > 4*B (∗ if sufficient rectangles are available for a split ∗)
5. then
6. splitDimension ← level % 4
7.
g1, g2 ← SplitGrid(grid, splitDimension)
8. for g in g1, g2
9. if NumberOfRectangles(g) > 0
10. then if currentBlock is full
11. then currentBlock ← NewInternalNodeBlock()
12. ++currentCount
13. AddInternalNode(currentBlock)
14. SplitNode(currentBlock, g1, currentCount, level + 1)
15. SplitNode(currentBlock, g2, currentCount, level + 1)
16. else (∗ reached the memory limit ∗)
17. if NumberOfRectangles(grid) > 4*B
18. then (∗ the node needs to be split; add it to leavesToConstruct ∗)
19. Add(leavesToConstruct, currentBlockId, level)
20. else
21. AddInternalNode(currentBlock)

The operation SplitNode uses the grid to split nodes, at each level, in a KD-tree fashion. Once a block is allocated, it is recursively reused to store the internal nodes created, until it has no space left for more nodes. This disk layout strategy achieves I/O efficiency during queries. When the maximum number z of internal nodes has been created, the leaves that need further splits are stored in the queue leavesToConstruct (step 19). The class PseudoPRTree also contains an internal cache of priority nodes and of blocks for internal nodes. These caches are filled when the internal nodes are created in step 11 and are used later during the distribution phase. The distribution phase is described by the following pseudo-code:

PseudoCode Distribute (r, currentBlock, index, level, remainingRectangles)
Input:
– r - TwoDRectangle object to be distributed in the tree.
– currentBlock - block storing the internal nodes created.
– index - index of the location of the node in currentBlock.
– level - level at which distribution occurs.
– remainingRectangles - stream of rectangles that could not be distributed.
1. currentNode ← currentBlock_index
2. inserted ← false
3. for each dimension in xmin, ymin, xmax, ymax
4. if priorityNode_dimension in currentNode does not exist
5. then
6. Create priorityNode_dimension
7. if NumberOfRectangles(priorityNode_dimension) < B
8. then
9. AddRectangle(priorityNode_dimension, r)
10. inserted ← true
11. break
12. else
13. rr ← GetRectangle(priorityNode_dimension, B − 1)
14. if ReplaceRectangle(priorityNode_dimension, r) (∗ replace the rectangle if r is more extreme than rr along dimension ∗)
15. then r ← rr
16. if inserted = false
17. then n ← NumberOfSubtrees(currentNode)
18. if n > 0
19. then
20. splitDimension ← level % 4
21. subtreeBlock, index ← choose the subtree according to the split value and splitDimension
22. Distribute(r, subtreeBlock, index, level + 1, remainingRectangles)
23. else AddItem(remainingRectangles, r)

The operation Distribute takes a rectangle r that has to be placed in the subtree rooted at the node currentBlock_index, such that the properties of a pseudo-PR-tree are not violated. To this end, the algorithm first tries to find a place in the priority nodes xmin, ymin, xmax, ymax, in that order. If a priority node p has fewer than B rectangles, the rectangle r is added at the correct sorted position according to the dimension of the priority node (step 9). Otherwise, it is checked whether r is more extreme than the least extreme rectangle rr in p (step 14). If this is indeed the case, we replace rr by r in p and continue the distribution process with rr (step 15). If none of the priority nodes could accommodate r (or rr, in case r was placed by replacement), the search for the correct position continues in the appropriate subtree, which is chosen using the dimension in which the tree was split and the split value (step 21). If the tree is completely full, r is added to the stream remainingRectangles (step 23). This stream is later filtered for each of the leaves that have to be constructed recursively.

4.3.3 Implementation Issues

Designing the structure of the grid
The grid is a collection of grid cells. During the construction of the pseudo-PR-tree, it is often desirable to quickly obtain the grid cell that contains a particular rectangle.
This happens, for instance, when the grid cell has to be updated with rectangle counts. To be able to do this, the grid was implemented as a hash table mapping a unique key (id) that identifies a grid cell to the grid cell itself. The formula that obtains the id from the indices of the cell's defining coordinates was already presented. The id of a grid cell should, however, also be derivable from the coordinates of any rectangle that lies in the cell. The solution to this problem is to obtain the key from the indices of the coordinates that define the boundary of the grid cell containing the rectangle. These indices can easily be obtained from the AxisSegments hash table.

Maintaining vs. computing the minimum bounding box
The minimum bounding box to be kept in the parent node can either be computed once the tree has been constructed, or be maintained while the tree is being constructed. Maintaining the bounding box was found to be more expensive, because rectangles enter and leave the priority nodes very often during the construction phase. Hence it was chosen to compute the bounding box after construction has completed.

Improving the algorithms to update a PriorityNode
The rectangles in a priority node are kept in sorted order so that the least extreme value, which is frequently required during the distribution of rectangles, can be obtained easily. To store rectangles in sorted order along a particular dimension, it is important to insert new rectangles at the correct position. A linear traversal of this list made the construction algorithm very slow. Hence a binary search is performed to locate the position of insertion or deletion before actually inserting or deleting a rectangle.

4.4 LPR-tree

4.4.1 Structure
The LPR-tree contains a sequence of pseudo-PR-trees (with annotated information). The most important operations on the LPR-tree structure are the Insert, Delete and Query algorithms. The following class describes how these algorithms are implemented.
Class LPRTree
Data
InternalNodeBlock cachedBlocks[]
PriorityNode cachedPriorityNodes[]
RootNodeBlock rootBlock
AMI_collection outputStream
Operations
void Insert(r)
bool Delete(r)
void Query(queryWindow, outputSize, nodes, priorityNodes)
void GetRectangles()
void CacheTree(treeIndex)

The root node of the LPR-tree contains links to the root node blocks of the child pseudo-PR-trees. Depending on the size of these pseudo-PR-trees, either the full tree or a part of the tree is cached. This makes updating the tree I/O efficient. The info field of the root node block contains the following fields:
– numberOfRectangles - the number of rectangles with which the tree was last bulk loaded.
– numberOfInsertions - the number of insertions made to the tree since the last time the tree was bulk loaded.
– numberOfDeletions - the number of deletions made to the tree since the last time the tree was bulk loaded.

During updates, the tree is reconstructed by collecting all the rectangles. This happens in the following two situations:
– the number of insertions becomes equal to the number of rectangles with which the tree was bulk loaded;
– the number of deletions becomes equal to half the number of rectangles with which the tree was bulk loaded.

When such a situation arises, the caches are emptied and the tree is bulk loaded again, resetting these counters to 0 and setting numberOfRectangles to the correct number. The GetRectangles method recursively traverses the entire tree and writes all the rectangles to a stream, subsequently deleting all blocks and nodes from the tree. Bulk loading an LPR-tree involves constructing the child pseudo-PR-tree τ_(⌈log(N/B)⌉+1). This is followed by a phase in which part or all of the tree is cached. Given the tree index, the number of levels to cache is determined by the following pseudo-code:

PseudoCode CacheTree (treeIndex)
Input: Index of the child pseudo-PR-tree to be cached.
1.
block ← GetBlock(rootNodeBlock, treeIndex) (∗ get the root block ∗)
2. maxLevels ← −1; cache ← true
3. l ← log(N/M)
4. m ← log(M/B)
5. if NumberOfItems(block) > 0
6. then
7. if treeIndex ≤ m
8. then maxLevels ← −1
9. else if l > (m + 1) and treeIndex < l
10. then cache ← false
11. else maxLevels ← treeIndex − l
12. if cache = true
13. then recursively cache all the priority nodes and internal nodes up to a depth of maxLevels

4.4.2 Insertion Algorithm
The following pseudo-code describes the high-level implementation of inserting a rectangle into an LPR-tree:

PseudoCode Insert (r)
Input: TwoDRectangle object r to be inserted.
1. if (numberOfInsertions + 1) = numberOfRectangles
2. then
3. stream ← GetRectangles()
4. AddItem(stream, r)
5. Construct(stream)
6. else
7. t0Count ← NumberOfRectangles(t0Block)
8. if t0Count ≥ B
9. then
10. stream ← nil
11. for i ← 0 to n
12. block ← GetBlock(rootNodeBlock, i) (∗ get the root block of τi ∗)
13. if NumberOfRectangles(block) > 0
14. then GetRectangles(i) (∗ add all the rectangles of τi to stream ∗)
15. else break (∗ continue with step 16 ∗)
16. tree ← ConstructPseudoPRTree(stream, outputStream)
17. CacheTree(tree)
18. else (∗ the τ0 block has space for r ∗)
19. AddRectangle(t0Block, r)

When the number of rectangles inserted has reached the threshold, the tree is reconstructed by collecting all rectangles in a stream that also contains the rectangle to be inserted. When this is not the case, the rectangle is inserted into the τ0 tree. If τ0 is already full, a search is made for the first empty tree τi (steps 11-15). All the rectangles in the trees preceding this tree are moved to a stream and those trees are discarded. τi is then bulk loaded with the collected stream of rectangles. Having now made sure that τ0 has space, the rectangle is inserted into τ0.
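The cascading part of this insertion algorithm (steps 8-17) is an instance of the classical logarithmic method. The following C++ fragment is a minimal in-memory sketch of that cascade, not the thesis implementation: a tree τi is represented only by its rectangle count, τ0 holds at most B rectangles, and the full rebuild triggered by numberOfInsertions is omitted; all names (LPRSketch, insert, total) are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Minimal in-memory sketch of the cascading insertion (logarithmic method).
// trees[i] models the rectangle count of the pseudo-PR-tree tau_i; tau_0
// accepts at most B rectangles before it overflows into a larger tree.
struct LPRSketch {
    std::size_t B;                    // capacity of tau_0 (block size)
    std::vector<std::size_t> trees;   // trees[i] = #rectangles in tau_i

    explicit LPRSketch(std::size_t blockSize) : B(blockSize), trees{0} {}

    void insert() {
        if (trees[0] >= B) {
            // Search for the first empty tree tau_i, extending the list
            // of trees if every existing tree is non-empty.
            std::size_t i = 0;
            while (i < trees.size() && trees[i] > 0) ++i;
            if (i == trees.size()) trees.push_back(0);
            // Collect all rectangles of tau_0 .. tau_{i-1} and bulk load
            // tau_i with them (a stream plus ConstructPseudoPRTree in the
            // real structure); the emptied trees are discarded.
            std::size_t collected = 0;
            for (std::size_t j = 0; j < i; ++j) {
                collected += trees[j];
                trees[j] = 0;
            }
            trees[i] = collected;
        }
        ++trees[0];                   // tau_0 now has room for r
    }

    std::size_t total() const {       // rectangles stored across all trees
        return std::accumulate(trees.begin(), trees.end(), std::size_t{0});
    }
};
```

With B = 4, ten insertions leave the structure as τ0 = 2, τ1 = 0, τ2 = 8: each cascade empties a prefix of the trees into the first empty slot, so a tree of roughly 2^i · B rectangles is rebuilt only about once every 2^i · B insertions, which is the intuition behind the amortized insertion bound.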
4.4.3 Deletion Algorithm
The deletion algorithm uses the least extreme values of the priority nodes and the split values of the nodes to search for the rectangle that has to be deleted. As a consequence of the distribution strategy of rectangles in a pseudo-PR-tree, there is guaranteed to be exactly one path in a pseudo-PR-tree along which to search for a specific rectangle. Following this path, either the rectangle is found, or it is concluded that the rectangle is not present in the tree. While deleting a rectangle, we take care to update the minimum bounding boxes of the priority nodes. The internal nodes are updated after deletion by keeping track of the path followed during deletion. In this section we describe the pseudo-code for deletion, followed by a description of the algorithm that replenishes rectangles in under-full nodes.

PseudoCode Delete (r, block, index, path)
Input:
– r - TwoDRectangle object to be deleted.
– block - InternalNodeBlock representing a node in the tree (initially the root block).
– index - index of the position of the current node in block.
– path - path to the deleted rectangle, initially empty.
Output: true if r is found and deleted, false otherwise.
1. node ← block_index
2. currentValue ← r_dimension
3. for each dimension in xmin, ymin, xmax, ymax
4. lev ← node.leastExtremeValue_dimension
5. if lev ≤ currentValue (∗ ≥ for max nodes ∗)
6. then
7. p ← node.priorityNodes_dimension
8. if RemoveRectangle(p, r)
9. then
10. Add(path, dimension)
11. if NumberOfRectangles(p) < B/2
12. then replenish this node and update the bounding boxes
13. else ComputeMinimumBoundingBox(p, node)
14. return true
15. else
16. if lev = currentValue
17. then goto step 3 with the next dimension
18. else return false
19. if NumberOfSubtrees(node) > 0
20. then
21. subtreeBlock, subtreeIndex ← choose the subtree according to the split value
22. Add(path, subtreeIndex)
23. return Delete(r, subtreeBlock, subtreeIndex, path)
24.
else return false

Given a rectangle r, the above pseudo-code recursively traverses the tree. In step 3 it checks each of the priority nodes, based on the least extreme values stored in the parent node. If r falls in this range, either it gets deleted, or it is concluded that the rectangle is not in the tree; the exception is when the least extreme value equals the coordinate value of the search rectangle (step 17). This is because more than one rectangle in the tree may have the same coordinate value along a specified dimension, which also happens to be the least extreme value of the priority node. If none of the priority nodes could be checked because the rectangle falls out of range, or when the special case of step 17 occurs, the subtrees are searched recursively in step 20. The path to the deleted rectangle is stored in order to be able to correct the minimum bounding boxes. This path stores the indices of the children from the root down to the priority node that contained the deleted rectangle. More precisely, the index of a child is the index of the subtree in the case of an internal node (step 22), and the dimension of the priority node otherwise (step 10). Note that we do not have to store the id's of the blocks, as this information is already present in the internal nodes. As the recursion terminates when a priority node is reached, the last index in this path is always the dimension of a priority node. When a deletion succeeds and the rectangle count of the priority node falls below B/2, the node is replenished (step 12) with rectangles from its priority node siblings or from the priority node children of its sibling KD-nodes. This is described by the following pseudo-code:

PseudoCode ReplenishNodes (p, d, node)
Input: PriorityNode p, the d-th child of the InternalNode node, that is under-full.
1. stream ← nil (∗ the stream contains rectangles together with their priority node addresses ∗)
2. underFlowStack ← nil
3.
priorityNodes ← nil (∗ list of the addresses of the nodes from which rectangles are collected ∗)
4. for each dimension succeeding d in xmin, ymin, xmax, ymax
5. q ← node.priorityNodes_dimension
6. rectangles ← GetRectangles(q)
7. AddItem(stream, rectangles)
8. Add(priorityNodes, Address(q))
9. for each sub-tree of node
10. subtreeNode ← root of the sub-tree
11. for each dimension in xmin, ymin, xmax, ymax
12. q ← subtreeNode.priorityNodes_dimension
13. rectangles ← GetRectangles(q)
14. AddItem(stream, rectangles)
15. Add(priorityNodes, Address(q))
16. n ← MIN(B/2, StreamLength(stream))
17. AMI_sort(stream, d)
18. for i ← 0 to n
19. ReadItem(stream, r, q) (∗ q is the priority node in which rectangle r resides ∗)
20. AddRectangle(p, r)
21. RemoveRectangle(q, r)
22. ComputeMinimumBoundingBox(p)
23. for i ← 0 to Count(priorityNodes)
24. q ← priorityNodes_i
25. if NumberOfRectangles(q) < B/2
26. then Push(q, underFlowStack)
27. else ComputeMinimumBoundingBox(q)
28. while underFlowStack is not empty
29. q ← Pop(underFlowStack)
30. ReplenishNodes(q.priorityNode, q.dimension, q.parent)

Rectangles are first collected in an AMI stream from the succeeding sibling priority nodes (steps 4-8) and from the children priority nodes of the KD-nodes of the parent of p (steps 9-15). The stream is then sorted along dimension d (step 17). The number of rectangles to be replenished is the minimum of B/2 and the number of rectangles collected (step 16); this is required to ensure the correctness of the LPR-tree after deletion. While replenishing p, the nodes from which rectangles are borrowed may themselves become under-full. The addresses of these priority nodes are collected on a stack (step 26). The address of a priority node is defined by the id of its AMI block, its dimension, and the id of the AMI block of its parent. These underflowing priority nodes are replenished recursively in step 30. It is important to note that these rectangles are replenished in reverse order.
This means, for instance, that if while replenishing the xmin priority leaf all its siblings run under-full, those children must be replenished in the order ymax, xmax and ymin. This is to ensure that each priority node can be replenished with the correct number of rectangles while still preserving all the properties of a pseudo-PR-tree. Such a reverse order can easily be ensured by storing the addresses of the under-flowing nodes on a stack.

4.4.4 Implementation Issues

Correctness of LPR-tree structure
Implementing the update algorithms while taking care of every precise detail is very difficult. To find and fix implementation bugs easily, a small procedure was written to check the correctness of the tree. When working with large datasets, this was a great help in diagnosing problems. The correctness is checked by verifying certain trivial facts about the structure of the LPR-tree. The following are some of these rules:
– Any priority node that is under-full (having fewer than B/2 rectangles) is a node that has no succeeding sibling priority nodes, and the parent of such a node does not have any subtrees.
– The most extreme value of a priority node (of a certain dimension d) at level i (i > 0) is less extreme than the least extreme value of the corresponding priority node in the same dimension of its parent at level i − 1.
– All priority nodes of the upper levels of a tree that are children of a node having KD-nodes as children must be full.

Replenishing nodes
A lot of effort was spent on correctly implementing the replenishing of nodes during deletion. These were the two main problem areas encountered:
– When a priority node, for instance p_xmin, underflows due to a deletion, it gets replenished from its sibling priority nodes or from the children priority nodes of sibling KD-nodes. It may be possible that some of the nodes that gave their rectangles to p_xmin underflow in turn. In such a situation, these priority nodes also need to be replenished recursively, in the reverse order.
Reverse order means that first the priority nodes of the sibling KD-nodes have to be replenished, and then the sibling priority nodes, in the order ymax, xmax, ymin, xmin. This order is necessary to maintain the correctness of the tree. It also ensures that any node that needs replenishment can be adequately replenished.
– It is important to replenish an underflowing priority node with at most B/2 rectangles, and not more, to preserve the correctness of the tree. Before explaining the problem, the following observations are made.

Observation 1: The least extreme rectangle of a priority node, in any dimension, is more extreme than the most extreme value along that dimension across all the priority node children of the sibling KD-nodes. This observation follows immediately from the structure of the LPR-tree.

Observation 2: A priority node can underflow to an extent far below B/2, thereby requiring more than B/2 rectangles to be completely packed. Consider the situation that at least two priority nodes, ν and µ, have B/2 + 1 rectangles each. When a rectangle of the first of these priority nodes, ν, gets deleted, ν requires replenishment. It is possible that most of the replenishing rectangles are taken from µ. In this case, µ subsequently requires more than B/2 rectangles to be completely packed.

Observation 3: For a priority node p_d in any dimension d succeeding xmin, having n rectangles where n < B, the following holds: the n rectangles in p_d, together with the rectangles of its priority node siblings, may not contain the most extreme B rectangles along dimension d in the subtree rooted at its parent. Any priority node having KD-siblings is completely full only right after a bulk load. After a bulk load, all priority nodes of a node ν collectively hold the B most extreme rectangles along dimension d in the subtree rooted at its parent.

Now assume that we completely pack the priority nodes that are underflowing.
Also assume that the ymin, xmax and ymax priority nodes, whose parent is η, have exactly B/2 + 1 rectangles each, and that η has only one sibling KD-node φ: the other sibling KD-node ψ was deleted because all the rectangles in its subtree had been deleted earlier. Now, deleting one rectangle from ymin may cause xmax to underflow below B/2 (by Observation 2); assume also that no rectangles are taken from ymax. As a result, xmax requires replenishment but ymax does not, as ymax has just enough rectangles to stay above the threshold. Replenishing xmax to the fullest extent may now cause ymax to run completely empty, while also borrowing rectangles from the priority nodes of the sibling KD-node φ. The ymax node will then be replenished with rectangles from the children priority nodes of its KD-siblings. All priority nodes of φ together provide the B most extreme rectangles in the ymax direction. However, it is possible that these B rectangles are not the most extreme ymax rectangles in the subtree rooted at φ (by Observation 3). Moving these rectangles to the ymax node of η violates Observation 1, which makes the LPR-tree inconsistent. Moving at most B/2 rectangles, however, ensures that no priority node can run empty while one of its sibling priority nodes or the priority nodes of the sibling KD-nodes have more than B/2 rectangles. Thereby, the above-mentioned problem cannot occur.

Chapter 5

Experiments

5.1 Experimental Setup
The LPR-tree is implemented in C++. The code has been developed using the Microsoft Visual Studio 2003 compiler for the Windows XP platform. TPIE[4] is used as the library to control block allocations and count I/O's (see Appendix B). Each rectangle has a size of 40 bytes. The implementation uses a maximum possible block size of 1638 rectangles. The experiments were run on a Pentium 4 CPU at 2.00 GHz with 256 MB of RAM. The amount of memory available to TPIE is restricted to 64 MB.
This constraint is taken from the earlier experiments on (static) PR-trees[3]. The experiments are performed using the LPR-tree implementation presented in this thesis and the R∗-tree implementation included in the distribution of TPIE[4, 6].

5.2 Datasets
We use both real-life and synthetic datasets.

5.2.1 Real-life data
As real-life data we use the TIGER data of the geographical features in the United States of America. As most of our test results are expected to vary with the size of the datasets, we would like to have real-life datasets of varying sizes. The TIGER dataset is distributed over six CD-ROMs. We choose to experiment with the road line segments of two of the CD-ROMs. We collect 17 million rectangles from this dataset and create 7 streams: five of these streams contain one million rectangles each, and the rest is split into two streams of 4 million and 8 million rectangles respectively.

5.2.2 Synthetic data
To investigate the query performance of these dynamic R-trees under various extreme parameters and distribution characteristics, we use the following datasets.
1. Uniform Distribution: This dataset is designed to test the performance of R-trees on rectangles whose centers follow a uniform distribution in the unit square. The width and height of the rectangles are also generated uniformly at random as numbers between 0 and 0.001. These figures are the same as in the earlier experiments on (static) PR-trees[3]. In general, we refer to this distribution as the UNIFORM dataset.
2. Normal Distribution: This dataset has fixed-size squares of size 0.001. The centers of the squares follow a normal distribution whose mean is 0.50 and whose standard deviation is 0.25. In general, we refer to this distribution as the NORMAL dataset.

The following table describes these datasets.

Dataset Identifier: UNIFORM(n^a)
Rectangle ρ: Center(ρ) = (x, y), where x and y are generated uniformly at random such that 0 ≤ x, y ≤ 1.0, with the current system time used as the seed^b for randomization. Width(ρ) = UniformRandom(0, 0.001); Height(ρ) is chosen in a similar manner.

Dataset Identifier: NORMAL(n)
Rectangle ρ: Center(ρ) = (x, y), where x and y are generated at random from a normal distribution with µ = 0.50 and σ = 0.25, with the current system time used as the seed for randomization. Width(ρ) = Height(ρ) = 0.001.

a: n is the number of rectangles to generate.
b: note that the generator is seeded only once for the generation of n rectangles.

Both distributions are used to generate 17 streams similar to the TIGER datasets. For convenience, the datasets are named as indicated below:

Uni1_1 .. Uni1_5 - Uniform distribution of 1 million rectangles.
Uni4 - Uniform distribution of 4 million rectangles.
Uni8 - Uniform distribution of 8 million rectangles.
Uni1.5 - Rectangles from Uni1_1 and half the rectangles from Uni1_2.
Nor1_1 .. Nor1_5 - Normal distribution of 1 million rectangles.
Nor4 - Normal distribution of 4 million rectangles.
Nor8 - Normal distribution of 8 million rectangles.
Nor1.5 - Rectangles from Nor1_1 and half the rectangles from Nor1_2.
Tig1_1 .. Tig1_5 - TIGER dataset of 1 million rectangles.
Tig4 - TIGER dataset of 4 million rectangles.
Tig8 - TIGER dataset of 8 million rectangles.
Tig1.5 - Rectangles from Tig1_1 and half the rectangles from Tig1_2.

5.3 Bulk Load Experiment

Description
Hypothesis 5.3.1. The comparative bulk load performance of the R∗-tree and the LPR-tree is consistent with that of their static variants[3]; for instance, the performance of the R∗-tree bulk loaded with the TIGER dataset using the Hilbert construction algorithm and some R∗-tree heuristics is three times better than the performance of the LPR-tree.
Hypothesis 5.3.2. The bulk load performance of the R∗-tree increases linearly with the number of rectangles in the dataset. The LPR-tree shows similar behavior.

Procedure
Perform a bulk load with the Uni1, Uni4 and Uni8 datasets.
Repeat this with similar datasets in the NORMAL and TIGER sets. It is expected that the I/O's increase linearly with dataset size in the case of all the datasets, while the data distribution of the synthetic datasets should have little or no effect on the I/O performance of the algorithm.

Results
The CPU times taken to bulk load the various datasets with the LPR-tree and the R∗-tree are shown in figures 5.1, 5.2 and 5.3. The CPU time appears to increase in a slightly super-linear fashion with the dataset size for both the LPR-tree and the R∗-tree. For the LPR-tree, this can be explained by the fact that the grid implementation creates two new grids with each split. The LPR-tree also shows a negligible time difference between the different distributions. In fact, the minor differences can be attributed more to the sorting time than to the nature of the distribution. In contrast, the R∗-tree is more sensitive to the distribution, performing worst for the normal dataset and best, with about 25% less time, for the uniform dataset.

Figure 5.1: Bulk Load CPU time - Uniform dataset.
Figure 5.2: Bulk Load CPU time - Normal dataset.

The I/O counts shown in figures 5.4, 5.5 and 5.6 increase almost linearly with dataset size. This is in line with the expectations for the LPR-tree. For the R∗-tree, the I/O's are also expected to scale linearly with dataset size within the same distribution. The results are independent of the type of distribution for the LPR-tree, whereas for the R∗-tree there is some variation, with almost 13% more I/O's incurred in the worst case than in the best case for the 8 million datasets. The same figure for the LPR-tree is less than 1.3%. The most important observation is the large I/O difference between the LPR-tree and the R∗-tree. A large part of this I/O cost for the LPR-tree can be attributed to the sorting of streams that is done at each recursive step in the implementation.

Figure 5.3: Bulk Load CPU time - TIGER dataset.
Figure 5.4: Bulk Load I/O - Uniform dataset.
Figure 5.5: Bulk Load I/O - Normal dataset.
Figure 5.6: Bulk Load I/O - TIGER dataset.

5.4 Insertion Experiment

Description
Hypothesis 5.4.1. The average number of I/O's incurred per rectangle inserted remains approximately the same until the LPR-tree is rebuilt. The same is true for the CPU time spent per insertion.
Hypothesis 5.4.2. The average number of I/O's per insertion (amortized) on the LPR-tree is better than the average number of I/O's per insertion (amortized) on the R∗-tree, on similar datasets and under similar conditions.
Hypothesis 5.4.3. The performance results of the insertion algorithm on the LPR-tree show little or no variation (on average) with the nature of the distribution of the data rectangles.

Procedure
First the LPR-tree that was bulk loaded with Uni4 is loaded. We insert the rectangles in Uni1_i for i ∈ [1, 5]. After the insertion of every Uni1_i, we measure the average number of I/O's required per insertion. We expect a peak in I/O's when inserting Uni1_4; this is due to the cleanup/rebuilding of the tree. We repeat this experiment with the NORMAL and TIGER datasets. Similar experiments are carried out on the R∗-tree.

Results
Figure 5.7 shows that the insertion CPU time for the LPR-tree is almost the same per rectangle. The graph shows the CPU time for every million rectangles inserted into the tree. We do see a sudden increase when a million rectangles are inserted for the fourth time. This is due to the rebuilding of the tree, which happens when the insertion count becomes equal to the number of rectangles in the tree. At this point the tree is doubled in capacity and bulk loaded. The CPU time taken seems to be almost independent of variations in the distribution of the rectangles. It is important to note that when the tree does get rebuilt, its capacity doubles; as a result, the expensive cost of rebuilding is incurred again only much later.

Figure 5.7: Insertion CPU time - LPR-tree.
Figure 5.8 shows that the insertion CPU time for the R∗-tree is always above 600 seconds for every million rectangles inserted. The probability that insertion with the R∗-tree takes more time than with the LPR-tree is therefore quite high. Figure 5.9 shows that the average CPU time for insertion is almost the same, with the R∗-tree having only a slightly better performance.

Figure 5.8: Insertion CPU time - R∗-tree.

Figure 5.10 shows the average I/O counts per 100 rectangles inserted for the LPR-tree. The graph is plotted for every million rectangles inserted. The sudden increase on inserting the fourth million rectangles is caused by the rebuilding of the tree. On average over the 5 million rectangles inserted, we see in figure 5.12 that there are around 23 I/O's per 100 rectangles inserted. There is a sharp deviation of around +16 I/O's from the best case; however, if we look at the insertion cost per rectangle, this is not much. Once again there is negligible deviation in results across dataset types, with the TIGER dataset performing the worst. This is expected, as the insertion algorithm essentially performs bulk loading, which is insensitive to variations in dataset types.

Figure 5.9: Insertion average CPU time - LPR-tree vs. R∗-tree.
Figure 5.10: Insertion I/O's - LPR-tree.

Figure 5.11 shows the average I/O counts per 100 rectangles inserted for the R∗-tree. There is quite a bit of variation with the nature of the distribution, with the TIGER dataset giving the worst performance. The I/O comparison between the LPR- and R∗-trees shown in figure 5.12 is interesting: it clearly shows that LPR-tree insertion outperforms the R∗-tree by a very large amount. The poor R∗-tree performance could be attributed to the very dense datasets; the cost of re-inserting 30% of the rectangles from an overflowing node seems to be quite high, even though this is done only once per level.

Figure 5.11: Insertion I/O's - R∗-tree.
Figure 5.12: Insertion average I/O's - LPR-tree vs. R∗-tree.
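The "I/O's per 100 rectangles inserted" figures quoted here (and tabulated in Appendix A.2.2) are obtained by summing the four I/O counters and scaling to 100 insertions. A minimal sketch, using the measured counts from Appendix A.2.2 for the first million uniform rectangles inserted into the LPR-tree (the function name is illustrative, not the thesis' code):

```cpp
// Sum the block-read, block-write, stream-read and stream-write counters and
// normalize them to 100 insertions.
double io_per_100(long blk_reads, long blk_writes,
                  long strm_reads, long strm_writes, long insertions) {
    double total = static_cast<double>(blk_reads + blk_writes +
                                       strm_reads + strm_writes);
    return 100.0 * total / static_cast<double>(insertions);
}

// Example: io_per_100(9807, 10297, 45396, 22962, 1000000) yields 8.8462,
// matching the "lpr uni4 / uni1 1" row of table A.2.2.
```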
5.5 Deletion

Experiment Description

Hypothesis 5.5.1. The average number of I/O's incurred per rectangle deleted from the LPR-tree decreases on average as rectangles get deleted. The same is true for the CPU time spent per deletion, except when the tree gets bulk loaded, during which the CPU time taken is significantly higher.

Hypothesis 5.5.2. The average number of I/O's per deletion (amortized) on the LPR-tree is better than the average number of I/O's per deletion (amortized) on the R∗-tree, on similar datasets and under similar conditions.

Hypothesis 5.5.3. The results of the delete algorithm on the LPR-tree show little or no variation (on average) with the nature of the distribution of the data rectangles.

Procedure

We load the LPR-tree containing 4 million rectangles and insert the rectangles in Uni1 i where i ∈ [1, 3]. This tree is saved and used as the base of the deletion experiments. We then successively delete the rectangles in Uni1 i where i ∈ [1, 3], measuring the average number of I/O's after every deletion of Uni1 i. We expect deletions to take more I/O's than insertions on average, because of the reorganization resulting from replenishing underfull nodes. Also, a peak in I/O's is expected when half the rectangles have been deleted, due to the cleanup/rebuilding of the tree. However, the average number of I/O's per deletion is expected to decrease as rectangles get deleted, since the tree is becoming smaller.

Results

The results of the CPU time measurements for the LPR-tree are depicted in Figure 5.13. In line with the other results, there seems to be very little effect of the variation of the distribution of rectangles in the datasets. The first million rectangles deleted take around 700 seconds. The second million rectangles take more CPU time, as cost is incurred in the reconstruction of the tree. The deletion of the third million rectangles seems to go faster than the first million.
This can be explained by the fact that the tree is smaller than the tree from which the first million rectangles were deleted.

Figure 5.13: Deletion CPU time - LPR-tree.

Figure 5.14 shows the CPU time measurements for the R∗-tree. The CPU time taken to delete 1 million rectangles is on average 1200 seconds for the normal and TIGER datasets. An equivalent experiment on the uniform datasets failed to complete even after 4.5 hours, hence the data shown here does not include statistics for this dataset. In comparison to the R∗-tree, the deletion CPU time of the LPR-tree shows less variation over the different types of datasets. It also shows that deletion in the LPR-tree is very often expected to be much faster than in the R∗-tree. An overall average comparison of deletion times, shown in Figure 5.15, shows the LPR-tree having an advantage over the R∗-tree.

Figure 5.14: Deletion CPU time - R∗-tree.

The I/O's incurred per deletion for the LPR-tree are shown in Figure 5.16. Interestingly, the average counts per deletion go down with every million rectangles deleted. Also, the cost incurred in restructuring is not much; this is because the tree is bulk loaded with only 2 million rectangles. It is hard to explain why there is no variation in the I/O counts with variation in dataset distribution. The best guess one could make is that the rectangles in the dataset are very dense, so every deletion has roughly the same effect w.r.t. replenishing nodes. Deletion is more expensive than insertion w.r.t. I/O's, and this is clear from the results.

Figure 5.15: Deletion average CPU time - LPR-tree vs. R∗-tree.
Figure 5.16: Delete I/O - LPR-tree.

The I/O's incurred per deletion for the R∗-tree are shown in Figure 5.17. The R∗-tree apparently performs slightly better on the TIGER dataset compared to the NORMAL dataset.
In contrast with the LPR-tree, where the cost decreases with every million rectangles deleted (across datasets), the R∗-tree does not show any such behavior, with I/O's staying at almost the same level. The comparison of average I/O's (Figure 5.18) clearly shows that the R∗-tree needs twice the number of I/O's on average compared to the LPR-tree to perform deletion.

Figure 5.17: Deletion I/O - R∗-tree.
Figure 5.18: Deletion average I/O's - LPR-tree vs. R∗-tree.

5.6 Query

Experiment Description

Hypothesis 5.6.1. LPR-tree updates (insertions and deletions) should not affect the query performance. In other words, query performance in terms of I/O's and CPU time should stay close to the query performance right after bulk load.

Hypothesis 5.6.2. The average number of I/O's incurred per output rectangle reported by the LPR-tree is better than that of the R∗-tree, on similar datasets and under similar conditions.

Hypothesis 5.6.3. The R∗-tree performs best on the NORMAL dataset (squares) and slightly worse on the UNIFORM and TIGER datasets.

Procedure

The LPR-tree is bulk loaded with uni1.5. For each of the UNIFORM, NORMAL and TIGER datasets, 1000 randomly generated window queries are performed on this tree. Each rectangle ρ representing a window query is generated as follows:

UNIFORM
- Center(ρ) = (x, y), where x and y are generated uniformly at random such that 0.3 ≤ x, y ≤ 0.7, with the current system time used as the seed for randomization.
- Width(ρ) = UniformRandom(u, v), where u and v are numbers chosen uniformly at random such that 0.2 ≤ u < v ≤ 0.4. The height is chosen in a similar manner.

NORMAL
- Center(ρ) = (x, y), where x and y are generated at random from a normal distribution with µ = 0.50, σ = 0.25, with the current system time used as the seed for randomization.
- Width(ρ) = Height(ρ) = UniformRandom(0.001, 0.1).

TIGER
- ρ = ComputeMinimumBoundingBox(s), where s is a set of rectangles chosen from the tig1 stream such that s_i = tig1[2k + j_i], where j_i = (Σ_{q=0}^{i−1} j_q) + UniformRandom(0, 10), and k is chosen at uniformly distributed offsets for each query rectangle.

Then uni 3 and uni 4 are inserted, with the 1000 random queries being performed in between these insertions. This is followed by the deletion of uni 3 and uni 4; once again the queries are performed before and after the deletions. Finally, uni 3 and uni 4 are re-inserted, with the queries performed before and after the insertions. Similar experiments are carried out with the NORMAL and TIGER datasets. The same set of experiments is repeated for the R∗-tree.

Results

Figure 5.19 shows the query CPU time per B rectangles reported for the LPR- and the R∗-trees. Clearly, the LPR-tree performs much better than the R∗-tree for all types of distributions, with query times two to three times better for the LPR-tree. The variation with distribution can be explained by the fact that the queries for the different distributions return very different output sizes. For instance, the uniform datasets report very good CPU time per output, because their output size on average is quite big (more than 100000) compared to the normal dataset (around 2000-3000). This shows that the query cost gets amortized when a large number of queries report large output sizes. The same variation w.r.t. datasets can be seen for the R∗-tree. The implementation of deletion in the R∗-tree hangs while deleting the uniform dataset, so there are no statistics for this dataset. A last interesting observation is that the query CPU time remains the same over the interleaved insertions and deletions, as was expected for the LPR-tree. However, in the case of the R∗-tree the query time deteriorates after deletions, especially for the NORMAL dataset. This could be attributed to the same kind of increase in I/O's for this dataset.
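The window-query generation described in the procedure above (the UNIFORM and NORMAL cases) can be sketched as follows. The struct and function names are illustrative, not the thesis' actual code; a fixed seed replaces the system-time seed only to keep the sketch reproducible, and a single draw of (u, v) is reused for both extents for brevity.

```cpp
#include <random>
#include <utility>

struct Window { double cx, cy, w, h; };  // center and extents of query rectangle

// UNIFORM case: center uniform in [0.3, 0.7]^2, extents drawn from
// UniformRandom(u, v) with 0.2 <= u <= v <= 0.4.
Window uniform_query(std::mt19937& gen) {
    std::uniform_real_distribution<double> center(0.3, 0.7);
    std::uniform_real_distribution<double> bound(0.2, 0.4);
    double u = bound(gen), v = bound(gen);
    if (u > v) std::swap(u, v);                        // enforce u <= v
    std::uniform_real_distribution<double> len(u, v);  // Width = UniformRandom(u, v)
    return { center(gen), center(gen), len(gen), len(gen) };
}

// NORMAL case: center from N(0.50, 0.25), width = height = UniformRandom(0.001, 0.1).
Window normal_query(std::mt19937& gen) {
    std::normal_distribution<double> center(0.50, 0.25);
    std::uniform_real_distribution<double> len(0.001, 0.1);
    return { center(gen), center(gen), len(gen), len(gen) };
}
```

Generating 1000 such windows per dataset mirrors the query workload's shape: uniform-case centers stay in [0.3, 0.7] and extents in [0.2, 0.4], while normal-case extents lie in [0.001, 0.1].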
Figure 5.20 shows the I/O's per B rectangles reported for the LPR- and R∗-trees. As with the time statistics, the LPR-tree outperforms the R∗-tree by a good margin across all datasets. There is very little effect of the updates on the query performance, with the graph showing a near-horizontal line. To answer the same queries, the R∗-tree requires three to six times more I/O's than the LPR-tree. The interleaved updates once again seem to make query costs expensive for the R∗-tree on the NORMAL dataset.

Verifying the theoretical worst-case query guarantees of the LPR-tree experimentally is very difficult. However, for all the queries performed on the LPR-trees across all the datasets, I plot the theoretical value √(N/B) + T/B, using the average output size and the average number of I/O's needed to answer these queries (Figure 5.21). Then I plotted the experimentally measured I/O's on a different scale on the same graph. By adjusting the scale of the experimental results, it was possible to figure out the worst-performing set of queries, which is point 11 on this graph. When the scale is adjusted, the experimental curve lies completely below the theoretical curve. As we know that the R∗-tree takes more I/O's to answer these sets of queries, the worst-case behavior of the LPR-tree is closer to the worst-case optimal number of I/O's than that of the R∗-tree.

Figure 5.19: Query CPU time (in msec) per B rectangles output.
Figure 5.20: Query I/O's per B rectangles output.
Figure 5.21: Empirical Analysis - Theoretical vs. experimental query I/O results for the LPR-tree.

Chapter 6

Conclusions

From the experiments and the results obtained, the following conclusions can be drawn. The R∗-tree is more efficient in terms of I/O and time for constructing static R-tree structures. However, the query performance of the LPR-tree is much better than that of the R∗-tree. The update algorithms of the LPR-tree outperform the R∗-tree by a large amount in terms of I/O's.
However, insertion and deletion times are quite close on average. Considering that the LPR-tree performs the updates faster most of the time, with the rebuilding cost getting amortized, one would still prefer the LPR-tree over the R∗-tree. Query performance for the LPR-tree remains very good under interleaved insertions and deletions, and here the LPR-tree definitely has an edge over the R∗-tree. Based on the above observations, the LPR-tree should be preferred over the R∗-tree when indexing data that is frequently updated. For static data, if query performance is more important than indexing cost, the LPR-tree is still preferable over the R∗-tree; otherwise the R∗-tree is better. In other words, for static R-tree structures over relatively small datasets (less than 1 million data objects), the R∗-tree is preferable. Experimental verification of the query guarantees shows a single set of queries that performs worse than all the remaining queries. As we know that the R∗-tree takes more I/O's to answer this query set, the worst-case behavior of the LPR-tree is closer to the worst-case optimal number of I/O's than that of the R∗-tree.
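As a worked illustration of the guarantee referred to above: the PR-tree (on which the LPR-tree is based) answers a window query in O(√(N/B) + T/B) I/O's in the worst case [3], where N is the number of rectangles, B the block capacity and T the output size. The parameter values below are hypothetical, chosen only for the example, not taken from the experiments.

```cpp
#include <cmath>

// Evaluate the worst-case query bound sqrt(N/B) + T/B (constant factors
// omitted) for given N, B and T.
double worst_case_query_ios(double n, double b, double t) {
    return std::sqrt(n / b) + t / b;
}

// Example: with N = 8 million rectangles, B = 1000 rectangles per block and
// T = 100000 reported rectangles, the bound evaluates to sqrt(8000) + 100,
// i.e. roughly 189 I/O's: about 89 to locate the output, 100 to report it.
```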
Appendix A

Tables of Experimental Results

A.1 Bulk Load

A.1.1 LPR-tree

Dataset   Blk Reads (a)   Blk Writes (b)   Strm Reads (c)   Strm Writes (d)   I/O's     Time(sec)
Uni1 1    1062            1731             20892            10048             33733     143
Uni4      4314            6960             122763           67626             201663    859
Uni8      8678            12984            285043           160675            467380    2247
Nor1 1    1102            1783             20903            10054             33842     138
Nor4      4311            6902             126850           68017             206080    882
Nor8      8589            13776            288579           160596            471540    2223
Tig1 1    1072            1746             21999            10049             34866     133
Tig4      4379            7085             132221           69765             213450    855
Tig8      8642            13879            285393           159672            467586    2047

(a) Number of AMI block objects read from the external memory.
(b) Number of AMI block objects written to the external memory.
(c) Number of block reads incurred while reading from an AMI stream object.
(d) Number of block writes incurred while writing to an AMI stream object.

A.1.2 R∗-tree

Dataset   Blk Reads   Blk Writes   Strm Reads   Strm Writes   I/O's    Time(sec)
Uni1 1    1035        2072         3053         1832          7992     65
Uni4      4168        8338         12216        7331          32053    262
Uni8      8361        16724        24431        14663         64179    588
Nor1 1    1352        2706         3054         1833          8945     67
Nor4      5472        10946        12216        7332          35966    302
Nor8      11088       22178        24433        14665         72364    809
Tig1 1    1233        2468         3054         1833          8588     58
Tig4      4837        9676         12216        7332          34061    314
Tig8      9845        19692        24433        14664        68634    694

A.2 Insertion

A.2.1 Insertion Time

Dataset (a)   Dataset inserted (mln)   Time(sec)
lpr uni4      uni1 1                   363
              uni1 2                   529
              uni1 3                   397
              uni1 4                   2839
              uni1 5                   378
lpr nor4      nor1 1                   347
              nor1 2                   505
              nor1 3                   386
              nor1 4                   2795
              nor1 5                   338
lpr tig4      tig1 1                   371
              tig1 2                   531
              tig1 3                   408
              tig1 4                   3136
              tig1 5                   367
r* uni4 (b)   uni1 1                   767
              uni1 2                   583
              uni1 3                   840
              uni1 4                   945
              uni1 5                   1010
r* nor4       nor1 1                   593
              nor1 2                   655
              nor1 3                   1090
              nor1 4                   1212.5
              nor1 5                   565
r* tig4       tig1 1                   782.5
              tig1 2                   850
              tig1 3                   787.5
              tig1 4                   787.5
              tig1 5                   767.5

(a) lpr uni4 denotes the LPR-tree bulk loaded with the Uni4 dataset.
(b) r* uni4 denotes the R∗-tree bulk loaded with the Uni4 dataset.

A.2.2 Insertion I/O's

Dataset    Dataset inserted   Blk Reads   Blk Writes   Strm Reads   Strm Writes   I/O per 100
lpr uni4   uni1 1             9807        10297        45396        22962         8.8462
           uni1 2             21725       22750        118024       61834         13.5871
           uni1 3             31993       33588        168995       87616         9.7859
           uni1 4             66509       69094        582407       324700        72.0518
           uni1 5             76207       79359        627216       347357        8.7429
lpr nor4   nor1 1             9918        10408        45327        22904         8.8557
           nor1 2             22013       23059        119834       62764         13.9113
           nor1 3             32385       33999        170680       88467         9.7861
           nor1 4             66983       69488        587696       326224        72.486
           nor1 5             76859       79975        632260       348717        8.742
lpr tig4   tig1 1             9884        10370        45360        22925         8.8539
           tig1 2             21995       23052        121660       62376         14.0544
           tig1 3             32355       33976        173206       88046         9.85
           tig1 4             67051       69596        619578       330984        75.9626
           tig1 5             76885       79997        665682       353791        8.9146
r* uni4    uni1 1             5878028     5878420      611          1             1175.706
           uni1 2             6098508     6099128      613          2             1219.8251
           uni1 3             6521546     6522337      612          2             1304.4497
           uni1 4             6828814     6829608      613          2             1365.9037
           uni1 5             6735878     6736691      612          2             1347.3183
r* nor4    nor1 1             5409684     5410228      612          2             1082.0526
           nor1 2             6162222     6163002      612          1             1232.5837
           nor1 3             6723870     6724688      612          2             1344.9172
           nor1 4             6155400     6156271      612          2             1231.2285
           nor1 5             5971599     5972418      613          2             1194.4632
r* tig4    tig1 1             9127032     9127874      613          2             1825.5521
           tig1 2             8311476     8312339      613          3             1662.4431
           tig1 3             8633678     8634539      613          3             1726.8833
           tig1 4             7828134     7828961      613          2             1565.771
           tig1 5             7477655     7478448      613          2             1495.6718

A.3 Deletion

A.3.1 Deletion I/O's and time

Dataset    Dataset deleted   Blk I/O's   Strm Reads   Strm Writes   Time(sec)   I/O per 100
lpr uni4   uni1 1            6265945     6437         9359          720         6.281741
           uni1 2            11823870    177900       116444        1680        5.836473
           uni1 3            16244805    177800       116000        515         4.420391
lpr nor4   nor1 1            6452651     5610         9907          744         6.468168
           nor1 2            12065165    177882       115540        1674        5.890419
           nor1 3            16514678    178493       115540        523         4.450124
lpr tig4   tig1 1            6266815     6537         10359         700         6.283711
           tig1 2            11825313    177279       116567        1682        5.835448
           tig1 3            16246248    177890       116567        518         4.421546
r* uni4    uni1 1            -           -            -             -           -
           uni1 2            -           -            -             -           -
           uni1 3            -           -            -             -           -
r* nor4    nor1 1            12403475    611          0             1272.5      12.403475
           nor1 2            11967682    611          1             1272.5      11.967682
           nor1 3            13588870    610          1             1360        13.58887
r* tig4    tig1 1            8692591     611          1             912.5       8.692591
           tig1 2            9992747     611          1             895         9.992747
           tig1 3            8590855     611          1             912.5       8.590855

(The R∗-tree deletion experiment on the uniform datasets did not complete; see Section 5.5.)

A.4 Query

A.4.1 LPR-tree

Dataset   Action (b)   Avg |OP| (a)   I/O's    Time(sec)   I/O per OP
Nor       BL           1711.11        12864    12.3        7.51792696
          Ins1         2852.05        22082    18.84       7.742501008
          Ins2         3995.42        18054    14.078      4.518673882
          Del1         2854.49        18040    13.81       6.319867997
          Del2         1711.11        13278    10.53       7.759875169
          Ins1         2852.05        22550    19.9        7.906593503
          Ins2         3995.42        18414    13.98       4.60877705
Uni       BL           81719.47       143384   11          1.754587983
          Ins1         136112.5       231690   26.563      1.702194876
          Ins2         190476.5       279534   43.5        1.467551115
          Del1         136083.5       279408   37.84       2.053209978
          Del2         81719.47       183310   25.89       2.243161881
          Ins1         136112.5       272290   43.54       2.000477546
          Ins2         190476.5       317856   44.08       1.668741288
Tig       BL           6749.02        17449    14.37       2.585412401
          Ins1         6749.02        27996    14.1        4.148157807
          Ins2         6749.02        23378    12.57       3.463910316
          Del1         6749.02        23158    13.6        3.431312991
          Del2         6749.02        19796    11.12       2.933166593
          Ins1         6749.02        20379    10.31       3.019549505
          Ins2         6749.02        24818    13.85       3.677274627

(a) OP is the number of rectangles returned by the window query.
(b) For the Nor rows: BL - queries after bulk loading nor1.5; Ins1/Ins2 - queries after inserting nor1 1 / nor1 2; Del1/Del2 - queries after deleting nor1 1 / nor1 2; the final Ins1/Ins2 - queries after re-inserting nor1 1 / nor1 2. Analogously for the Uni and Tig rows.

A.4.2 R∗-tree

Dataset   Action   Avg |OP| (a)   I/O's    Time(sec)   I/O per OP
Nor       BL       1711.11        68560    16          40.06755849
          Ins1     2852.05        68417    27          23.98870988
          Ins2     3995.42        68573    38          17.16290152
          Del1     2854.49        76195    36          26.69303448
          Del2     1711.11        68158    27          39.83262327
          Ins1     2852.05        73538    27          25.78426044
          Ins2     3995.42        74391    59          18.61906883
Uni       BL       81719.47       300226   304         3.673861321
          Ins1     136112.5       269338   430         1.978789604
          Ins2     190476.5       249673   372         1.31078112
          Del1     136083.5       -        -           -
          Del2     81719.47       -        -           -
          Ins1     136112.5       -        -           -
          Ins2     190476.5       -        -           -
Tig       BL       6749.02        84970    42.5        12.58997603
          Ins1     6749.02        86201    37.5        12.77237288
          Ins2     6749.02        88053    40          13.04678309
          Del1     6749.02        78794    35          11.6748802
          Del2     6749.02        78794    40          11.6748802
          Ins1     6749.02        77752    37.5        11.52048742
          Ins2     6749.02        77752    35          11.52048742

(a) OP is the number of rectangles returned by the window query.
(Deletion in the R∗-tree hangs on the uniform dataset, see Section 5.6; the corresponding query measurements are therefore missing.)
Appendix B

Brief Introduction to TPIE

TPIE [4, 6] is a software environment (written in C++) that facilitates the implementation of I/O-efficient algorithms. The goal of theoretical work in the area of external-memory algorithms (also called I/O algorithms or out-of-core algorithms) has been to develop algorithms that minimize the I/O (i.e., the transfer of data between main memory and disk) performed when solving problems on very large data sets. The TPIE library consists of a kernel and a set of I/O-efficient algorithms and data structures implemented on top of the kernel. Most of the functionality is provided as templated classes and functions in C++. The following are some of the important structures of TPIE used in the implementation of the LPR-tree:

AMI stream

AMI stream is a templated class that stores a list of user-defined objects in external memory. The stream provides interfaces to read and write items. Dedicated streams such as the TwoDRectangleStream are AMI stream objects parameterized with the object type (such as TwoDRectangle) they hold. TPIE provides a sorting algorithm, AMI sort, for a stream: given a comparison function or object, it sorts the stream using the external-memory merge sort algorithm.

AMI block

AMI block is a templated class that represents a logical block, the unit amount of data transferred between external memory and main memory. A block can store data and hold links to other blocks. It also provides an information structure to hold information such as the number of items allocated. Given the type of data objects that have to be stored and the number of links, the maximum number of data objects that can be stored represents the capacity of the block. Blocks are used in the implementation of the LPR-tree to store priority nodes and the internal KD-node data structures.

AMI collection

AMI collection is a class that represents a collection of AMI block objects.
Any block in a collection can be identified and retrieved using the unique block identifier associated with the block. The collection provides the means to control the data layout strategies required by many I/O-efficient algorithms. The LPR-tree is in fact one such block collection.

Bibliography

[1] Chuan-Heng Ang, S. T. Tan, and T. C. Tan. Bitmap R-trees. Informatica (Slovenia), 24(2), 2000.
[2] L. Arge, K. Hinrichs, J. Vahrenhold, and J. S. Vitter. Efficient bulk operations on dynamic R-trees. Algorithmica, 33, 2002.
[3] Lars Arge, Mark de Berg, Herman J. Haverkort, and Ke Yi. The priority R-tree: A practically efficient and worst-case optimal R-tree. In SIGMOD, pages 347-358, 2004.
[4] Lars Arge, Octavian Procopiuc, and Jeffrey Scott Vitter. Implementing I/O-efficient data structures using TPIE. In ESA, pages 88-100, 2002.
[5] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Hector Garcia-Molina and H. V. Jagadish, editors, SIGMOD, pages 322-331. ACM Press, 1990.
[6] TPIE distribution. http://www.cs.duke.edu/tpie.
[7] Yvan J. García, Mario A. Lopez, and Scott T. Leutenegger. A greedy algorithm for bulk loading R-trees. In ACM-GIS, pages 163-164, 1998.
[8] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In Beatrice Yormark, editor, SIGMOD, pages 47-57, Boston, Massachusetts, June 1984.
[9] Ibrahim Kamel and Christos Faloutsos. On packing R-trees. In CIKM, pages 490-499, 1993.
[10] Ibrahim Kamel and Christos Faloutsos. Hilbert R-tree: An improved R-tree using fractals. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, VLDB, Proceedings of the 20th International Conference on Very Large Data Bases, Santiago de Chile, pages 500-509. Morgan Kaufmann, 1994.
[11] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects. In VLDB, pages 507-518, Brighton, England, 1987.
[12] Y. Manolopoulos, A. Nanopoulos, A. Papadopoulos, and Y. Theodoridis. R-trees: Theory and Applications. Springer, 2006.