TECHNISCHE UNIVERSITEIT EINDHOVEN
Department of Mathematics and Computer Science
An Experimental Evaluation of the
Logarithmic Priority-R tree
by
Ummar Abbas
Advisor
dr. Herman Haverkort
Review Committee
dr. Herman Haverkort
prof. dr. Mark de Berg
dr. ir. Huub van de Wetering
Eindhoven, November 2006
"No amount of experimentation can ever prove me right; a single experiment can prove me wrong."
— Albert Einstein
Abstract
The Logarithmic PR-tree (LPR-tree) is an R-tree variant, based on the PR-tree, that maintains the PR-tree's worst-case optimal query time while the tree structure is updated. This thesis is dedicated to an experimental study of the LPR-tree, using TPIE, a C++ template-based I/O-efficient library. It compares the performance of the LPR-tree with that of the R∗-tree, one of the most popular dynamic R-tree structures.
Acknowledgements
I am very grateful to dr. Herman Haverkort, my advisor, for all the support and help during this thesis. This work would not have been completed in time without the numerous discussions with him. I am also indebted to him for the huge effort of reviewing the early versions of this document in a very short time. My sincere thanks to prof. dr. Mark de Berg, head of the Algorithms group, for providing me with the opportunity to work here and for allowing me to work on this thesis part-time, alongside my job. Special thanks to Micha Streppel for his help and guidance in using TPIE.
I am thankful to my wife Shabana and my son Suhail for giving me all the encouragement I needed, and all the time that was necessary to complete this thesis. Finally, I would like to express my deepest gratitude to my parents, who motivated me to take up this course.
Ummar Abbas
October 25, 2006
Contents

1 Introduction
  1.1 Context
  1.2 Aim and Approach
  1.3 Overview

2 R-Trees
  2.1 Original R-tree
  2.2 Dynamic Versions of R-trees
    2.2.1 Guttman's R-tree update algorithms
    2.2.2 The R∗-tree
    2.2.3 The Hilbert R-tree
    2.2.4 R+-tree
    2.2.5 Compact R-tree
    2.2.6 Linear Node Splitting
  2.3 Static Versions of R-trees
    2.3.1 The Hilbert Packed R-tree
    2.3.2 TGS R-tree
    2.3.3 Buffer R-tree

3 PR-tree Family
  3.1 Priority R-tree
    3.1.1 Pseudo-PR-tree
    3.1.2 PR-tree
  3.2 LPR-tree

4 Design and Implementation
  4.1 General Implementation Issues
  4.2 Two Dimensional Rectangle
  4.3 Pseudo-PR-tree
    4.3.1 Data Structures
    4.3.2 Construction Algorithm
    4.3.3 Implementation Issues
  4.4 LPR-tree
    4.4.1 Structure
    4.4.2 Insertion Algorithm
    4.4.3 Deletion Algorithm
    4.4.4 Implementation Issues

5 Experiments
  5.1 Experimental Setup
  5.2 Datasets
    5.2.1 Real life data
    5.2.2 Synthetic data
  5.3 Bulk Load
  5.4 Insertion
  5.5 Deletion
  5.6 Query

6 Conclusions

A Tables of Experimental Results
  A.1 Bulk Load
    A.1.1 LPR-tree
    A.1.2 R∗-tree
  A.2 Insertion
    A.2.1 Insertion Time
    A.2.2 Insertion I/O's
  A.3 Deletion
    A.3.1 Deletion I/O's and time
  A.4 Query
    A.4.1 LPR-tree
    A.4.2 R∗-tree

B Brief Introduction to TPIE
List of Figures

5.1  Bulk Load CPU time - Uniform dataset.
5.2  Bulk Load CPU time - Normal dataset.
5.3  Bulk Load CPU time - TIGER dataset.
5.4  Bulk Load I/O - Uniform dataset.
5.5  Bulk Load I/O - Normal dataset.
5.6  Bulk Load I/O - TIGER dataset.
5.7  Insertion CPU time - LPR-tree.
5.8  Insertion CPU time - R∗-tree.
5.9  Insertion average CPU time - LPR-tree vs. R∗-tree.
5.10 Insertion I/O's - LPR-tree.
5.11 Insertion I/O's - R∗-tree.
5.12 Insertion average I/O's - LPR-tree vs. R∗-tree.
5.13 Deletion CPU time - LPR-tree.
5.14 Deletion CPU time - R∗-tree.
5.15 Deletion average CPU time - LPR-tree vs. R∗-tree.
5.16 Deletion I/O - LPR-tree.
5.17 Deletion I/O - R∗-tree.
5.18 Deletion average I/O's - LPR-tree vs. R∗-tree.
5.19 Query CPU time (in msec) per B rectangles output.
5.20 Query I/O's per B rectangles output.
5.21 Empirical Analysis - Theoretical vs. experimental query I/O results for the LPR-tree.
Chapter 1
Introduction
1.1 Context
Spatial data management is required in several areas, such as computer-aided design (CAD), VLSI design and geo-data applications. Data objects in such a spatial database are multi-dimensional (typically 2D or 3D), and it is important to be able to search for and retrieve objects by their spatial position. Classical indexing mechanisms such as the B-tree and its variants are not suitable for multiple dimensions or for range queries. The R-tree data structure introduced by Guttman [8] is considered one of the most efficient mechanisms for handling multi-dimensional spatial data. It is aimed at handling geometrical data such as points, line segments, surfaces, volumes and hyper-volumes. Since its introduction to the scientific community, various variants of this structure have been proposed.
1.2 Aim and Approach
This thesis is dedicated to the experimental study of the LPR-tree, an R-tree variant that claims to maintain the worst-case optimal query guarantees of its static counterpart, the PR-tree [3]. In particular, the thesis aims to achieve the following:

• Verify experimentally the theoretical worst-case optimal query guarantees made for the LPR-tree.
• Compare the performance, in terms of I/O's and time, of the LPR-tree against state-of-the-art dynamic update algorithms such as those of the R∗-tree and the Hilbert R-tree.
• Derive conclusions on when and under which specific conditions or scenarios the LPR-tree can (or cannot) outperform these R-tree variants.
We assume the following I/O model:

• There is a fast internal memory (main memory) that can hold M two-dimensional rectangles, and a single slow disk that holds the data and the results of computation.
• Data is transferred between the internal memory and the external memory in blocks. A block can hold B two-dimensional rectangles. When a block is read from or written to the external memory, an I/O is said to have been performed.
• Algorithms have full control over which block is evicted from main memory and written to disk, and vice versa.
• I/O's are the most important bottleneck when working with very large datasets. Therefore, the number of block reads and writes performed by (external-memory) algorithms is considered the measure of the efficiency of an algorithm.
To perform the above investigations comprehensively, the LPR-tree is implemented using TPIE, an I/O-efficient C++ library. Experiments are designed to observe the performance of the LPR-tree under variation of several parameters, such as the distribution and the size of the dataset. The same kind of experiments would have to be repeated for other R-tree variants to compare and analyze the results. Due to time constraints and the non-availability of implementations of other dynamic R-trees on the TPIE platform, the thesis restricts its comparative study to the R∗-tree. The experiments are designed to cover the following areas:

• Performance (in terms of I/O's and time) of the LPR-tree update algorithms, including the cases where the tree gets rebuilt during updates.
• Comparison of the performance of the update algorithms of the LPR-tree and the R∗-tree.
• Comparison of the performance of the LPR-tree query algorithm after bulk load against the performance achieved when queries are interleaved with updates.
• Comparison of the query results of the LPR-tree and the R∗-tree on similar datasets and under similar conditions.
1.3 Overview
The thesis is organized into the following chapters:

• Chapter 2 introduces the original R-tree (in some detail) to provide context and show how its various variants have been developed. This chapter also presents a survey of some popular R-tree variations that are based on different heuristics to achieve good query performance while still having good update performance. Because of the inherently heuristic nature of these algorithms, they do not give any asymptotic bounds on the worst-case query performance.
• Chapter 3 describes the two-dimensional pseudo-PR-tree, PR-tree and LPR-tree data structures, together with their update algorithms. The PR-tree is the first R-tree variant that guarantees optimal worst-case query time and whose construction is not based on heuristics. To achieve the same query performance during updates, the LPR-tree introduces update algorithms for insertion and deletion.
• Chapter 4 describes all the practical implementation issues, together with the pseudo-code of the algorithms and data structures.
• Chapter 5 describes the various experiments carried out on the LPR-tree, gives the results and provides an analysis of those results.
• Chapter 6 gives the conclusions derived from the experimental study, answering the questions raised in Section 1.2.
• Appendix A contains the exact figures of the various quantities measured during the experiments.
• Appendix B gives a brief introduction to TPIE, the I/O-efficient library used for the implementation of the LPR-tree.
Chapter 2
R-Trees
2.1 Original R-tree
An R-tree is a hierarchical data structure based on the B+-tree. A B+-tree is a height-balanced dynamic data structure used to index one-dimensional data; it supports efficient insertion and deletion algorithms. All data is stored in the leaf nodes, while internal nodes contain keys and pointers to other nodes. While performing queries on this structure, the keys stored in the internal nodes help in traversing the tree from the root down to a leaf using binary-search-type comparisons. The R-tree is a similar structure, characterized by the parameters k and K, where K is the maximum number of entries that fit in one node and k ≤ K/2 is the minimum number of such entries. The R-tree has the following properties:

• Each leaf node contains between k and K records, where each record is of the form (MBR, oid): MBR is the minimum bounding box enclosing the spatial data object identified by oid.
• Each internal node, except for the root, has between k and K children, represented by records of the form (MBR, p), where p is a pointer to a child and MBR is the minimum bounding box that spatially contains the MBR's in this child.

Based on this definition, an R-tree on N rectangles has height O(log_k N). A query that looks for one specific rectangle in an R-tree is called an exact match query. However, the more common form of query is the window (or range) query: given a rectangle Q, find all the data rectangles that intersect Q. To answer such a query, we simply start at the root of the R-tree and recursively visit all the nodes whose minimum bounding boxes intersect Q; when encountering a leaf l, we report all the rectangles in l that intersect Q. This results in the following algorithm:
Algorithm WindowQuery(ν, Q)
Input: Node ν in the R-tree and a query rectangle Q.
Output: Set A containing all the rectangles in the subtree of ν that intersect Q.
1. if ν is a leaf
2.   then examine each data rectangle r in ν to check whether it intersects Q; if it does, A ← A ∪ {r}.
3.   else (∗ ν is an internal node ∗)
4.     examine each entry to find exactly those children whose minimum bounding box intersects Q; let the set of intersecting children be Sc.
5.     for µ ∈ Sc
6.       A ← A ∪ WindowQuery(µ, Q).
7. return A.
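For concreteness, the recursion translates into C++ as follows. This is a minimal in-memory sketch with explicit child pointers, not the I/O-efficient TPIE implementation used later in this thesis; all type names are illustrative:

    #include <cstddef>
    #include <vector>

    struct Rect { double xmin, ymin, xmax, ymax; };

    bool intersects(const Rect& a, const Rect& b) {
        return !(b.xmin > a.xmax || b.xmax < a.xmin ||
                 b.ymin > a.ymax || b.ymax < a.ymin);
    }

    struct Node {
        bool leaf;
        std::vector<Rect>  rects;     // data rectangles (leaves only)
        std::vector<Rect>  childMBR;  // bounding box of each child
        std::vector<Node*> children;  // child pointers (internal nodes only)
    };

    // Collect all data rectangles in the subtree of v that intersect Q.
    void windowQuery(const Node* v, const Rect& Q, std::vector<Rect>& A) {
        if (v->leaf) {
            for (const Rect& r : v->rects)
                if (intersects(r, Q)) A.push_back(r);
        } else {
            // Recurse only into children whose MBR intersects Q.
            for (std::size_t i = 0; i < v->children.size(); ++i)
                if (intersects(v->childMBR[i], Q))
                    windowQuery(v->children[i], Q, A);
        }
    }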
Various R-tree variations [12] have been proposed, some of them adapted to specific settings and environments. All the R-tree based data structures proposed in the literature can be classified into one of two categories:

• Dynamic versions of R-trees: R-tree based data structures in which the objects are inserted or deleted one by one.
• Static versions of R-trees: R-tree based data structures that are built with bulk-loading algorithms from a-priori known static data.
2.2 Dynamic Versions of R-trees
2.2.1 Guttman's R-tree update algorithms
Guttman [8] provides insertion and deletion algorithms for the R-tree structure he proposed. The insertion and deletion algorithms use the bounding boxes of the nodes to ensure that nearby elements are placed in the same leaf node. Inserting a rectangle into the R-tree basically amounts to adding the rectangle to a suitable leaf node. As rectangles get inserted, at a certain point a leaf node overflows, requiring a split. This creates another child pointer in the parent node, which may cause the parent node to split, and so on; in the worst case, the root node is eventually split. The insertion algorithm relies on heuristics in the splitting process so that good query performance can be achieved. The algorithm can be summarized as follows:

Algorithm Insert(r)
Input: A rectangle r to be inserted.
1. Descend through the tree to find a leaf node L whose MBR requires the least enlargement to accommodate r.
2. if L does not have enough room to accommodate r
3.   then split the node L to obtain an additional leaf node LL containing r. Propagate the split upwards through the tree, if necessary splitting the root node to create a new root.
4.   else
5.     place r in L.
6. return.
In step 1, the leaf node where the rectangle is to be placed is chosen. This is done by descending through the tree, starting at the root node, until a leaf is found; at each step, the entry whose MBR requires the least enlargement to include r is chosen. When the leaf node is already full, the node is split, and the K + 1 rectangles are redistributed over two leaf nodes according to some splitting policy. If the newly added leaf makes its parent overflow, the split is propagated recursively to the upper levels (step 3).

There are three techniques to split a node in step 3: the linear split, the quadratic split and the exponential split. Their names refer to their complexity. The three splitting techniques can be summarized as follows:

1. Linear split: This algorithm is linear in K and in the number of dimensions. Conceptually, it chooses the two rectangles that are furthest apart as seeds. The remaining rectangles are processed in random order, and each is added to the group that requires the least enlargement of its MBR.
2. Quadratic split: The cost is quadratic in K and linear in the number of dimensions. The algorithm picks two of the K + 1 entries to be the first elements of the two new groups, choosing the pair for which the area of a rectangle covering both entries, minus the area of the entries themselves, is greatest. The remaining entries are then assigned to the groups one at a time; at each step, the area expansion required to add each remaining entry to each group is calculated, and the entry assigned is the one showing the greatest difference between the two groups.
3. Exponential split: All possible groupings are tested, and the split that results in the least overlap area of the two MBR's is chosen. However, even for reasonably small values of K this strategy is very expensive, as the number of possible groupings is approximately 2^{K−1}.
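The seed selection of the linear split can be sketched as follows. This follows Guttman's published description of linear seed picking (normalized separation along each dimension); it is an illustrative sketch, not code from the thesis implementation:

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Rect { double min[2], max[2]; };

    // Linear seed picking: in each dimension, find the rectangle with the
    // highest minimum and the one with the lowest maximum, normalize their
    // separation by the total extent, and take the pair with the greatest
    // normalized separation (assumes at least two rectangles).
    std::pair<std::size_t, std::size_t>
    linearPickSeeds(const std::vector<Rect>& rs) {
        double bestSep = -1.0;
        std::pair<std::size_t, std::size_t> best(0, 1);
        for (int d = 0; d < 2; ++d) {
            std::size_t hiMin = 0, loMax = 0;
            double lo = rs[0].min[d], hi = rs[0].max[d];
            for (std::size_t i = 0; i < rs.size(); ++i) {
                if (rs[i].min[d] > rs[hiMin].min[d]) hiMin = i;
                if (rs[i].max[d] < rs[loMax].max[d]) loMax = i;
                lo = std::min(lo, rs[i].min[d]);
                hi = std::max(hi, rs[i].max[d]);
            }
            double width = (hi > lo) ? (hi - lo) : 1.0;
            double sep = (rs[hiMin].min[d] - rs[loMax].max[d]) / width;
            if (hiMin != loMax && sep > bestSep) {
                bestSep = sep;
                best = std::make_pair(loMax, hiMin);
            }
        }
        return best;
    }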
2.2.2 The R∗-tree
Guttman's update algorithms are based entirely on minimizing the area of the MBR's. Insertions and deletions are intermixed with queries, and there is no periodic reorganization. The structure must allow overlapping rectangles, which means it cannot guarantee that there is only one search path for an exact match query. The R∗-tree, a variant of the R-tree proposed by Beckmann et al. [5], strives to reduce the number of search paths for queries by a combined optimization of several parameters. The following is a summary of the parameters it tries to optimize:

• The area covered by each MBR should be minimized: Minimizing the dead space (the area covered by the MBR but not by the data rectangles) improves query performance, as decisions about which paths to traverse can be made at higher levels of the tree.
• The overlap between MBR's should be minimized: A larger overlap implies that more paths must be searched for a query; this optimization therefore also serves to reduce the number of search paths.
• The perimeters of the MBR's should be minimized: This optimization results in MBR's that are more quadratic (square-like). Query rectangles that are quadratic benefit the most from this optimization. As quadratic rectangles can be packed more easily, the bounding boxes at higher levels of the R-tree are expected to be smaller. In fact, this optimization leads to less variance in the side lengths of the bounding boxes, indirectly achieving area reduction as well.
• Storage utilization should be optimized: Storage utilization is defined as the ratio of the total number of rectangles in the R-tree to the maximum number of rectangles (the capacity) that can be stored across all nodes of the R-tree. Low storage utilization means that queries have to search a larger number of nodes. As a result, query cost is very high with low storage utilization, especially when a large part of the data set satisfies the query.
Optimization of the aforementioned parameters is not independent, as they affect each other in complex ways. For instance, to minimize dead space and overlap, more freedom in choosing the shape is necessary, which would cause rectangles to be less quadratic. Also, minimization of perimeters may lead to reduced storage utilization. Based on the results of several experiments using these optimization criteria, Beckmann et al. propose the following two strategies to obtain a significant gain in query performance:

• A new node-splitting algorithm that uses the first three optimization criteria:

Algorithm Split(ν)
Input: Node ν in the R∗-tree that contains the maximum number, K + 1, of either data rectangles or MBR's.
Output: Nodes ν and µ, together containing the K + 1 entries.
1. for each axis
2.   sort the entries by their min and then by their max values on this axis.
3.   determine the K − 2k + 2 distributions of the K + 1 entries into two groups such that each group contains at least k entries.
4.   compute σ, the sum of the perimeters of the two MBR's over all the distributions.
5. Choose the axis with the minimum σ; the split is performed perpendicular to this axis.
6. Among the K − 2k + 2 distributions along the chosen axis, choose the distribution with minimum overlap; resolve ties by choosing the distribution with minimum dead space. The two groups of entries are collected in the nodes ν and µ.
7. return ν and µ.
The split algorithm first determines the axis along which the split will be performed. To do this, it considers, for each axis (step 3), K − 2k + 2 distributions of the K + 1 entries into two groups, where the i-th distribution is determined by having the first group contain the first (k − 1) + i entries in sorted order along that axis and the second group the remaining entries. (A code sketch of this axis choice is given at the end of this section.)
• An insertion algorithm that uses the concept of forced reinsertion: at certain steps, a fraction of the entries of an overflowing node is reinserted to re-balance the tree.

Algorithm Insert(r)
1. Invoke ChooseSubTree to determine the leaf node ν where the insertion has to take place.
2. if ν contains fewer than K data rectangles
3.   then insert r in ν.
4.   else (∗ ν has the maximum number of entries; handle the overflow ∗)
5.     if this is the first time an overflow has occurred on this level
6.       then reinsert the 30% of the rectangles of ν whose centroid distances from the node centroid are largest.
7.     else
8.       invoke Split(ν) and propagate the split upwards if necessary.
9. return.

Algorithm ChooseSubTree
1. ν ← root of the R∗-tree.
2. while the children of ν are not leaves
3.   ν ← the child µ whose MBR requires the least area enlargement to include r; resolve ties by choosing the entry whose MBR has the least area.
4. Choose the leaf node whose MBR requires the least overlap enlargement to include r; resolve ties by choosing the node that needs the least area enlargement.

The dynamic reorganization of the R-tree by the reinsertion strategy during insertion achieves a kind of tree re-balancing and significantly improves query performance. As reinsertion is a costly operation, the fraction of rectangles reinserted has been experimentally tuned to 30% to yield the best performance. Reinsertion is also restricted to happen only once per level of the tree.
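The axis choice (steps 1-5) of the Split algorithm above can be sketched as follows. The sketch is simplified: it sorts the entries only by their min values, whereas the full algorithm also considers the sorting by max values; all names are illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Rect { double min[2], max[2]; };

    double perimeter(const Rect& r) {
        return 2.0 * ((r.max[0] - r.min[0]) + (r.max[1] - r.min[1]));
    }

    // MBR of rs[from..to) by a linear scan.
    Rect boundOf(const std::vector<Rect>& rs, std::size_t from, std::size_t to) {
        Rect b = rs[from];
        for (std::size_t i = from; i < to; ++i)
            for (int d = 0; d < 2; ++d) {
                b.min[d] = std::min(b.min[d], rs[i].min[d]);
                b.max[d] = std::max(b.max[d], rs[i].max[d]);
            }
        return b;
    }

    // Sum of the perimeters of the two group MBR's over all K-2k+2
    // distributions along one axis; the i-th distribution puts the first
    // (k-1)+i entries (in sorted order) into the first group. The split is
    // performed along the axis with the smaller sum.
    double marginSum(std::vector<Rect> rs, int axis, std::size_t k) {
        std::sort(rs.begin(), rs.end(), [axis](const Rect& a, const Rect& b) {
            return a.min[axis] < b.min[axis];
        });
        double sum = 0.0;
        for (std::size_t split = k; split + k <= rs.size(); ++split)
            sum += perimeter(boundOf(rs, 0, split)) +
                   perimeter(boundOf(rs, split, rs.size()));
        return sum;
    }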
2.2.3 The Hilbert R-tree
The Hilbert R-tree [10] is an R-tree variant that uses the notion of Hilbert values to define an ordering on the data rectangles. The Hilbert value of an n-dimensional point is calculated using the n-dimensional Hilbert curve. A Hilbert R-tree constructed on a dataset of n-dimensional rectangles uses the centroids of the rectangles to define the ordering. Such an ordering has been shown [10] to preserve the proximity of spatial objects quite well. The ordering of the data rectangles based on the Hilbert value also enables the Hilbert R-tree to use a conceptually different splitting technique known as deferred splitting.

In addition to this variation in the splitting algorithm, every internal node ν stores, besides the usual MBR, the largest Hilbert value (LHV) of the data rectangles stored in the subtree rooted at ν.

In the original R-tree, when a node overflows, a split is performed, as a result of which two nodes are created from a single node. This is referred to as a 1-to-2 splitting policy. The Hilbert R-tree implements the concept of deferred splitting by using a 2-to-3 splitting policy: a split is not performed when a node overflows if either that node or one of its siblings (the so-called cooperating siblings) can accommodate an additional entry. In general there can be an s-to-(s + 1) splitting policy.
Finally, the concepts of node ordering according to Hilbert values and the deferred splitting approach come together in the following insertion algorithm:

Algorithm Insert(r)
Input: A rectangle r to be inserted.
1. h ← Hilbert value of the centroid of r.
2. Recursively descend through the tree to select the leaf node ν for insertion: at each node, select the child entry with the minimum LHV greater than h.
3. if ν has an empty slot
4.   then insert r into ν.
5.   else
6.     HandleOverflow(ν, r), which creates a new leaf µ if a split was inevitable.
7. Propagate the node split of line 6 upwards, adjusting the MBR's and LHV's of the nodes. If an overflow of the root caused a split, create a new root whose children are the previous root and the new node created by the split.
8. return.

Algorithm HandleOverflow(ν, r)
Input: A rectangle r to be inserted.
Output: A new node µ if a split occurred.
1. E ← all the entries of ν and its s − 1 cooperating siblings.
2. E ← E ∪ {r}.
3. if |E| < s·K (∗ at least one of the s − 1 cooperating siblings is not full ∗)
4.   then
5.     distribute E among the s nodes, respecting the Hilbert ordering.
6.     return null.
7.   else (∗ all the s − 1 siblings are full ∗)
8.     create a new node µ and distribute E among the s + 1 nodes, respecting the Hilbert ordering.
9.     return µ.
The Hilbert R-tree acts like a B+-tree for insertions and like an R-tree for queries, thereby achieving its acclaimed performance. However, it is vulnerable, performance-wise, to large objects. Performance is also penalized when the dimensionality of the space increases: in that case, proximity is not preserved well by the Hilbert curve, leading to increased overlap of MBR's in internal nodes.
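The descent of step 2 of Insert, which selects at each node the child with the minimum LHV greater than h, can be sketched as follows (illustrative names; the LHV type is assumed to be an unsigned integer):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct HilbertEntry {
        std::uint64_t lhv;  // largest Hilbert value in the child's subtree
        int child;          // index (or pointer) of the child node
    };

    // Return the index of the entry whose LHV is the minimum value greater
    // than h; if h exceeds all LHV's, fall back to the maximum-LHV entry.
    int chooseChild(const std::vector<HilbertEntry>& entries, std::uint64_t h) {
        int best = -1;
        std::size_t maxChild = 0;
        for (std::size_t i = 0; i < entries.size(); ++i) {
            if (entries[i].lhv > entries[maxChild].lhv) maxChild = i;
            if (entries[i].lhv > h &&
                (best < 0 || entries[i].lhv < entries[std::size_t(best)].lhv))
                best = int(i);
        }
        return best >= 0 ? best : int(maxChild);
    }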
2.2.4 R+-tree
The R+-tree was proposed [11] as a variation of the R-tree structure to improve the performance of exact match queries. The original R-tree suffers from the problem that an exact match query may lead to the investigation of several root-to-leaf paths, especially in cases where the data rectangles are dense or clustered. To obtain better query performance in such cases, the R+-tree does not allow overlapping MBR's at the same level of the tree. This is achieved by duplicating stored data rectangles in more than one node. Because of this structural difference, the following changes are made to the algorithms:

• Query: The query algorithm is similar to the one used for the R-tree, with the only difference that duplicate results must be removed.
• Insertion: The insertion algorithm proceeds in the same way as in the original R-tree to find the nodes whose MBR overlaps the rectangle r to be inserted. Once such a node is found, r is either inserted there, if there is enough space, or the node is split, resulting in a (sometimes drastic) reorganization of the tree structure, which eventually stores r in multiple places. Under certain extreme circumstances this can even lead to a deadlock [12].
• Deletion: The duplication of stored rectangles means that the deletion algorithm must take care to delete all occurrences of the rectangle to be deleted. Deletion is followed by a phase in which the MBR's have to be adjusted. Deletion may reduce storage utilization significantly, requiring the tree to be reorganized periodically.
2.2.5 Compact R-tree
The Compact R-tree focuses on improving storage utilization in order to improve query performance. A very simple heuristic is applied to improve space utilization during insertions. When a node ν overflows, the K rectangles among the K + 1 available rectangles whose MBR is smallest are chosen. These rectangles are kept in ν, and the remaining rectangle is moved to a sibling that has space left and whose MBR requires the least enlargement. A split takes place only when all the siblings are completely filled with K rectangles each. This heuristic has experimentally been shown to improve utilization to around 97% to 99%. Insertion performance is improved by the fact that fewer splits are required; the performance of window queries, however, is seen not to differ much from Guttman's R-tree.
2.2.6 Linear Node Splitting
This technique, proposed in [1], introduces a new algorithm for linear node splitting, which can be substituted for the splitting algorithm used in the original R-tree. It splits nodes based on the following heuristics, applied in the order stated:

• Distribute the rectangles as evenly as possible.
• Minimize the overlapping area of the two nodes.
• Minimize the total perimeter of the two nodes.

When a node ν overflows, each of the K + 1 rectangles is assigned to two of the four lists L_xmin, L_ymin, L_xmax and L_ymax. More precisely, for each rectangle r it is determined whether the rectangle is closer to the left or to the right edge of the MBR of ν, and r is assigned to the corresponding x-dimensional list, L_xmin or L_xmax. Analogously, according to the y-coordinates of the rectangle, it is assigned to one of the y-dimensional lists, L_ymin or L_ymax. The node is split along the x-dimension if MAX(|L_xmin|, |L_xmax|) < MAX(|L_ymin|, |L_ymax|), that is, along the dimension with the more even distribution; otherwise the split is performed along the y-dimension, unless the two maxima are equal. In the latter case, the overlap of the two candidate splits is considered, and if that turns out to be equal as well, the total coverage is considered. Experiments have shown that these heuristics result in R-trees with better characteristics and better window query performance than Guttman's quadratic algorithm.
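The choice of split dimension can be sketched as follows (an illustrative sketch; the tie-breaking by overlap and total coverage is omitted):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Rect { double min[2], max[2]; };

    // Assign each rectangle to the nearer edge of the node's MBR in each
    // dimension, then split along the dimension whose fuller list is
    // smaller, i.e. whose distribution is more even.
    int chooseSplitDimension(const std::vector<Rect>& rs, const Rect& nodeMBR) {
        std::size_t nearMin[2] = {0, 0}, nearMax[2] = {0, 0};
        for (const Rect& r : rs)
            for (int d = 0; d < 2; ++d) {
                double distToMin = r.min[d] - nodeMBR.min[d];
                double distToMax = nodeMBR.max[d] - r.max[d];
                if (distToMin <= distToMax) ++nearMin[d]; else ++nearMax[d];
            }
        std::size_t maxX = std::max(nearMin[0], nearMax[0]);
        std::size_t maxY = std::max(nearMin[1], nearMax[1]);
        return maxX < maxY ? 0 : 1;  // 0 = split along x, 1 = along y
    }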
2.3 Static Versions of R-trees
2.3.1 The Hilbert Packed R-tree
The Hilbert Packed R-tree [9] is an R-tree structure designed with the aim of achieving 100% space utilization with good query performance, for applications where the R-tree never or only infrequently requires modification. In order to achieve very good query performance, data rectangles that are in close proximity must be clustered together in the same leaf. Like its dynamic counterpart, this structure uses the Hilbert curve, applied to the centroids of the rectangles, as a heuristic to cluster rectangles. The tree is constructed bottom-up, starting from the leaf level and finishing at the root. The construction algorithm is outlined below:
Algorithm HilbertPack(S)
Input: Set S of data rectangles to be organized into an R-tree.
Output: R-tree τ packed with the data rectangles in the set S.
1. for each data rectangle r in S
2.   calculate the Hilbert value of the centroid of r.
3. Sort the rectangles in S on ascending Hilbert value as calculated in line 2.
4. while S ≠ ∅
5.   do
6.     create a new leaf node ν.
7.     add B rectangles to ν and remove them from S.
8. l ← the leaf level.
9. while there is more than one node at level l
10.  do
11.    I ← the MBR's of the nodes at level l.
12.    while I ≠ ∅
13.      do
14.        create a new internal node µ.
15.        add B MBR's with child pointers to µ and remove them from I.
16.    l ← l + 1.
17. Set the root node of τ to the one node left.
18. return τ.
Experiments [9] showed that this variant of the R-tree significantly outperforms both the original R-tree with quadratic split and the R∗-tree.
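The packing itself is a sort followed by chunking. The following sketch uses the standard iterative xy-to-index conversion for the Hilbert curve and assumes the input coordinates have already been scaled to an n × n grid with n a power of two; it is illustrative, not the implementation evaluated in this thesis:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Rect { double min[2], max[2]; };

    // Rotate/flip a quadrant (standard Hilbert-curve helper).
    static void rot(std::uint32_t n, std::uint32_t& x, std::uint32_t& y,
                    std::uint32_t rx, std::uint32_t ry) {
        if (ry == 0) {
            if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
            std::swap(x, y);
        }
    }

    // Map a point of an n-by-n grid to its index along the Hilbert curve.
    static std::uint64_t xy2d(std::uint32_t n, std::uint32_t x, std::uint32_t y) {
        std::uint64_t d = 0;
        for (std::uint32_t s = n / 2; s > 0; s /= 2) {
            std::uint32_t rx = (x & s) ? 1u : 0u;
            std::uint32_t ry = (y & s) ? 1u : 0u;
            d += std::uint64_t(s) * s * ((3 * rx) ^ ry);
            rot(n, x, y, rx, ry);
        }
        return d;
    }

    // Hilbert value of a rectangle's centroid; assumes coordinates in [0, n).
    static std::uint64_t hilbertValue(const Rect& r, std::uint32_t n) {
        std::uint32_t cx = std::uint32_t((r.min[0] + r.max[0]) / 2.0);
        std::uint32_t cy = std::uint32_t((r.min[1] + r.max[1]) / 2.0);
        return xy2d(n, cx, cy);
    }

    // Lines 1-7 of HilbertPack: sort by Hilbert value of the centroid and
    // cut the sorted sequence into leaves of B rectangles each. Repeating
    // the same chunking on each level's MBR's builds the tree bottom-up.
    std::vector<std::vector<Rect>> packLeaves(std::vector<Rect> S,
                                              std::size_t B, std::uint32_t n) {
        std::sort(S.begin(), S.end(), [n](const Rect& a, const Rect& b) {
            return hilbertValue(a, n) < hilbertValue(b, n);
        });
        std::vector<std::vector<Rect>> leaves;
        for (std::size_t i = 0; i < S.size(); i += B)
            leaves.emplace_back(S.begin() + i,
                                S.begin() + std::min(i + B, S.size()));
        return leaves;
    }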
2.3.2 TGS R-tree
Unlike the Hilbert Packed R-tree, which constructs the R-tree bottom-up, the Top-down Greedy Splitting (TGS) method presented in [7] constructs the tree top-down, using an aggressive approach that greedily constructs the subtrees of the R-tree. A top-down approach minimizes the cost at the levels that allow a potentially bigger reduction in the overall cost, i.e., the top levels of the R-tree. Essentially, the algorithm recursively partitions a set of N rectangles into two subsets by a cut orthogonal to an axis. This cut must satisfy the following two conditions:

1. The cost of an objective function f(r1, r2), where r1 and r2 are the MBR's of the two resulting subsets, is minimized.
2. One subset has a cardinality of i·S for some i, where S is fixed per level, so that the resulting subtrees are packed, i.e., their space utilization is 100%.
The algorithm to perform a cut is summarized as follows:

Algorithm TGS(n, f)
Input:
  n - the number of rectangles in the data set.
  f - a function f(r1, r2) that measures the cost of a split.
Output: Two subsets that form two subtrees of an R-tree.
1. if n ≤ K then return.
2. for each dimension d
3.   for each ordering in this dimension
4.     for i ← 1 to ⌈n/S⌉ − 1
5.       r1 ← MBR of the first i·S rectangles.
6.       r2 ← MBR of the other rectangles.
7.       remember i if f(r1, r2) is the best value found so far.
8. Split the input set at the best position found in line 7.

The orderings in each dimension considered in line 3 are based on the min coordinate, the max coordinate, both (min followed by max), and the centroid of the input rectangles. The cutting process described above is repeated recursively on the resulting subsets until a cut is no longer possible.

This binary split process can easily be extended to a K-ary split, where each internal node has K entries. This means that, to build the root of (a subtree of) an R-tree on a given set of rectangles, the algorithm repeatedly partitions the rectangles into two sets until they are divided into K subsets of equal size. The bounding box of each subset is stored in the root, and the subtrees are constructed recursively on the subsets.
2.3.3 Buffer R-tree
The Buffer R-tree [2] is not really a static R-tree, but it provides efficient algorithms for bulk updates. It achieves I/O efficiency by exploiting the available main memory to cache rectangles when rectangles are inserted. More precisely, it attaches buffers to every node at the i·⌊log_B(M/(4B))⌋-th level of the tree, for i = 1, 2, ..., where M is the maximum number of rectangles that fit in main memory and B is the block size. A node with an attached buffer is called a buffer node.

In contrast with many other R-tree variations, the BR-tree does not split a node immediately when it overflows due to an insertion. Instead, it stores the inserted rectangle in the buffer of the root node. When the number of items in this buffer exceeds M/4, a specialized procedure is executed to free buffer space. Essentially, this procedure moves rectangles from a full buffer to the appropriate buffer nodes at lower levels of the tree; such movements must respect the usual branching heuristics. When a rectangle reaches a leaf, it is inserted, and a split is performed if the leaf overflows. Evidently, some insertions incur no I/O at all. The BR-tree supports bulk insertions, bulk deletions, bulk loading and batch queries. Experimental results show that the BR-tree requires smaller execution times to perform bulk updates and produces a good index for query processing.
Chapter 3
PR-tree Family
3.1 Priority R-tree
The Priority R-tree [3], or PR-tree, is the first R-tree variant that guarantees a worst-case query performance that is provably asymptotically optimal. The name of the tree derives from the use of priority rectangles in bulk loading the tree. The bulk-loading algorithm makes use of an intermediate data structure called the pseudo-PR-tree. The next section describes this data structure together with a construction algorithm; the exact pseudo-code describing the implementation of the algorithm is presented in Chapter 4. For simplicity, the description is restricted to the two-dimensional case; the discussion and the results can easily be generalized to higher dimensions.
3.1.1 Pseudo-PR-tree
Definition. Let S = {R1, ..., RN} be a set of N input data rectangles in the plane. The set is mapped to a set of four-dimensional points S∗, where each element of S∗ is obtained from the corresponding rectangle in S using the following relation:

S∗(Ri) ≡ (xmin(Ri), ymin(Ri), xmax(Ri), ymax(Ri))

where xmin(Ri), ymin(Ri) are the coordinates of the bottom-left vertex of Ri, and xmax(Ri), ymax(Ri) are the coordinates of the top-right vertex of Ri.

A pseudo-PR-tree T(S) is defined recursively as follows:

• The root has at most six children, namely four priority leaves and two so-called kd-nodes. The root also stores the MBR of each of its children.
• The first priority leaf contains the B rectangles in S with the smallest xmin-coordinates; the second, the B rectangles among the remaining rectangles with the smallest ymin-coordinates; the third, the B rectangles among the remaining rectangles with the largest xmax-coordinates; and finally the fourth, the B rectangles among the remaining rectangles with the largest ymax-coordinates.
• The set Sr of remaining rectangles is divided into two sets S1 and S2 of approximately the same size. The kd-nodes are the roots of the pseudo-PR-trees constructed on S1 and S2, respectively. The division of the rectangles is performed on the xmin-, ymin-, xmax- and ymax-coordinates in a round-robin fashion: the division at the root node is based on the xmin-values, the division at the next level of recursion on the ymin-values, then on the xmax-values, then the ymax-values, then the xmin-values again, and so on. The split value used is stored in the internal node, in addition to the bounding boxes.

Since each leaf or node of the pseudo-PR-tree is stored in O(1) disk blocks, and since at least 4 of the six children contain Θ(B) rectangles, the tree occupies O(N/B) disk blocks. It has been proved [3] that a window query on a pseudo-PR-tree with N rectangles uses O(√(N/B) + T/B) I/O's in the worst case, where T is the number of rectangles reported.
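In code, this recursive definition corresponds to a node with up to four priority leaves and two kd-children. A minimal in-memory sketch (the actual on-disk layout used in Chapter 4 differs):

    #include <vector>

    struct Rect { double min[2], max[2]; };

    // The four priority-leaf extraction orders, in the order in which the
    // leaves are filled.
    enum Dim { XMIN = 0, YMIN = 1, XMAX = 2, YMAX = 3 };

    // Is a more extreme than b in dimension d, i.e. would a be extracted
    // earlier into the corresponding priority leaf?
    bool moreExtreme(const Rect& a, const Rect& b, Dim d) {
        switch (d) {
            case XMIN: return a.min[0] < b.min[0];
            case YMIN: return a.min[1] < b.min[1];
            case XMAX: return a.max[0] > b.max[0];
            default:   return a.max[1] > b.max[1];
        }
    }

    struct PseudoPRNode {
        std::vector<Rect> priorityLeaf[4];  // at most B rectangles each
        double splitValue;                  // kd split value for this level
        PseudoPRNode* child[2];             // the two kd-subtrees
        Rect childMBR[6];                   // MBR of each of the six children
    };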
Following this definition directly, the tree can be constructed on a set of N rectangles in O((N/B) log N) I/O's. However, the tree can be bulk loaded I/O-efficiently using O((N/B) log_{M/B}(N/B)) I/O's. This is done using a four-dimensional grid that defines a partition of the four-dimensional space and stores, for each cell of the partition, the number of rectangles that lie in it. The construction algorithm recursively constructs Θ(log M) levels at a time using this grid. Essentially, the grid prevents the I/O's that would otherwise have been required to split the input set correctly on the appropriate dimension. The following algorithm describes the construction of Θ(log M) levels of the tree; it is then applied recursively to construct the complete tree.
Algorithm Construct(S∗, ν, level)
Input:
  S∗ - set of 4-dimensional points representing the N rectangles in two-dimensional space.
  ν - root of a pseudo-PR (sub)tree.
  level - level of the node ν.
Output: Pseudo-PR-tree rooted at ν containing the points in S∗.
1. Construct four sorted lists L_xmin, L_ymin, L_xmax, L_ymax containing the points in S∗ sorted on the respective dimensions.
2. z ← α·M^{1/4} (∗ Θ(M^{1/4}); α ≥ 0 has to be chosen by the implementation ∗)
3. Using the sorted lists created in step 1, create a four-dimensional grid using the (kN/z)-th coordinate along each dimension, for 0 ≤ k ≤ z − 1. Keep the count of points in each grid cell.
4. Split(S∗, level, ν)
5. Add priority leaves to all the nodes created in the previous step.
6. while S∗ has unprocessed points and the subtree rooted at ν has space
7.   do
8.     r ← next point in S∗.
9.     add r to the tree so that all the properties of a pseudo-PR-tree are preserved.
10. return.
First, the lists are sorted along each of the four dimensions; this helps in initializing the rectangle counts in the grid. The algorithm Split is used in step 4 to create all the internal kd-nodes by recursively splitting the grid until Θ(log M) levels of the tree have been constructed. A final step of the construction algorithm (step 6) distributes the rectangles over the tree, respecting the properties of a pseudo-PR-tree. More precisely, we fill the priority leaves by scanning S∗ and filtering each point p through the tree, one by one, as follows. We start at the root ν of the tree and check its priority leaves ν_xmin, ν_ymin, ν_xmax and ν_ymax, one by one, in that order. If we encounter a non-full leaf, we simply place p there; if we encounter a full leaf ν_dim and p is more extreme than the least extreme point p′ in ν_dim, we replace p′ with p and continue the filtering process with p′. After checking ν_ymax, we continue (recursively) in one of the kd-nodes of ν; which kd-node to enter is determined by the split value stored in ν.
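Continuing the node sketch above, the filtering step can be written as follows (an illustrative sketch; it assumes the kd-children created by Split exist, and leastExtreme and coordinate are small helpers defined here):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Coordinate of r as a 4-dimensional point (xmin, ymin, xmax, ymax).
    double coordinate(const Rect& r, int d) {
        switch (d) {
            case 0:  return r.min[0];
            case 1:  return r.min[1];
            case 2:  return r.max[0];
            default: return r.max[1];
        }
    }

    // Index of the least extreme rectangle in a priority leaf (in the real
    // structure the leaf is kept sorted, so this is simply the last entry).
    std::size_t leastExtreme(const std::vector<Rect>& leaf, Dim d) {
        std::size_t w = 0;
        for (std::size_t i = 1; i < leaf.size(); ++i)
            if (moreExtreme(leaf[w], leaf[i], d)) w = i;
        return w;
    }

    // Filter one rectangle down the partially built tree: try each priority
    // leaf in order, evicting the least extreme occupant of a full leaf
    // when the incoming rectangle beats it, and finally recurse into the
    // kd-child selected by the split value of the current level.
    void filterDown(PseudoPRNode* v, Rect p, std::size_t B, int level) {
        for (int d = 0; d < 4; ++d) {
            std::vector<Rect>& leaf = v->priorityLeaf[d];
            if (leaf.size() < B) { leaf.push_back(p); return; }
            std::size_t w = leastExtreme(leaf, Dim(d));
            if (moreExtreme(p, leaf[w], Dim(d)))
                std::swap(p, leaf[w]);  // continue filtering the evicted one
        }
        int side = coordinate(p, level % 4) < v->splitValue ? 0 : 1;
        filterDown(v->child[side], p, B, level + 1);
    }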
The algorithm Split, which creates all the internal nodes, is summarized below.

Algorithm Split(S, level, ν)
Input: S - set of 4-dimensional points to be split; level - current level being constructed; ν - node.
(∗ Constructs Θ(log M) levels of the (sub)tree rooted at ν recursively. ∗)
1. if level > β·log M (∗ Θ(log M); β ≥ 0 has to be chosen by the implementation ∗)
2.   then return.
3. d ← split dimension of level.
4. Using the grid of step 3 of Construct, find, along the split dimension d, the exact position where the set can be divided into two sets S1 and S2 of roughly equal size.
5. Create two nodes ν1 and ν2 whose parent is ν, and store the split value used in ν.
6. Split(S1, level + 1, ν1)
7. Split(S2, level + 1, ν2)
8. return.

At each recursive step, the appropriate dimension d is used to split the grid (step 4). Using the grid, it is easy to find the approximate slice l where the set of rectangles can be divided into roughly two equal halves. Once this is known, the exact split position can be found by scanning the sorted stream along dimension d. Only O(N/(Bz)) blocks have to be scanned for this, as only the rectangles that lie in slice l need to be inspected. Once this is done, a new slice l′ containing O(z³) grid cells is added to the grid, effectively splitting slice l. The rectangle counts of the grid cells belonging to these slices are computed using the same O(N/(Bz)) blocks.
3.1.2 PR-tree
A two-dimensional PR-tree is an R-tree with fanout Θ(B), constructed from a pseudo-PR-tree. It maintains the worst-case query performance of O(√(N/B) + T/B) I/O's. The following algorithm is used to construct a PR-tree bottom-up on a set S of two-dimensional rectangles:

Algorithm Construct(S)
Input: Set S of N rectangles in two-dimensional space.
Output: PR-tree rooted at node ν.
1. V0 ← leaves built from the set S, with Θ(B) rectangles in each leaf.
2. i ← 0.
3. while the number of MBR's of Vi ≥ B
4.   do
5.     τ_Vi ← pseudo-PR-tree on the MBR's of Vi.
6.     V_{i+1} ← the leaves of τ_Vi.
7.     match the MBR's in the nodes of V_{i+1} with the rectangles in Vi and set the child pointers in V_{i+1} to the nodes in Vi.
8.     i ← i + 1.
9. Construct the root node ν from the MBR's of Vi and set its children.
10. return the PR-tree rooted at ν.

It can be proved [3] that this algorithm bulk loads the PR-tree in O((N/B) log_{M/B}(N/B)) I/O's. The PR-tree can be updated using standard heuristic R-tree update algorithms in O(log_B N) I/O's in the worst case, but then its query efficiency is no longer maintained.
3.2 LPR-tree
The Logarithmic Priority R-tree (LPR-tree) is an adaptation of the conventional R-tree structure, based on the pseudo-PR-tree, that aims to support the same worst-case query performance while the tree is updated. The adaptations to the structure are two-fold:

• Internal nodes store additional information besides the MBR.
• Leaf nodes are not all at the same level.

The root of an LPR-tree has a number of subtrees of varying capacities. Each of these subtrees, known as annotated PR-trees (APR-trees), is a normal pseudo-PR-tree in which each internal node ν stores the following information:

• Pointers to each of ν's children, and the MBR of each child.
• The split value that is used to cut ν in the four-dimensional kd-tree.
• For each priority leaf of ν, the least extreme value of the relevant coordinate of any rectangle stored in that leaf.

An LPR-tree has up to ⌈log(N/B)⌉ + 3 subtrees, τ0, τ1, τ2, ..., τ_{⌈log(N/B)⌉+2}. τ0 can store at most B rectangles, and τi for i > 0 has a capacity of at most 2^{i−1}·B rectangles. Since the APR-trees have different capacities, their leaf nodes are not all at the same level.

In addition to these adaptations, the LPR-tree structure prescribes a disk layout strategy for the nodes of the tree:

• Each internal node of an APR-tree at depth i, with i ≡ 0 (mod ⌊log B⌋), is stored in the same block as its descendant internal nodes down to level i + ⌊log B⌋ − 1.
• The smaller APR-trees τi, for m ≥ i ≥ 0 with m = log(M/B), are stored in main memory.
• Of the larger APR-trees τi, for ⌈log(N/B)⌉ + 2 ≥ i ≥ l with l = log(N/M), the top i − l levels are kept in main memory and the rest of the tree is stored on disk.
• The remaining APR-trees τi, for l > i > m, are stored completely on disk.

The LPR-tree is bulk loaded with a set S of N rectangles by building the APR-tree τ_{⌈log(N/B)⌉+2}; the other trees are left empty.

Insertion is done using the following algorithm:
Algorithm Insert(r)
Input: r - rectangle to be inserted.
1. if the number of insertions made so far ≥ the number of rectangles with which the tree was bulk loaded
2.   then
3.     S ← {rectangles from all the APR-trees} ∪ {r}.
4.     reconstruct the LPR-tree using S.
5.     return.
6. if τ0 has fewer than B rectangles
7.   then
8.     insert r in τ0.
9.   else
10.    j ← 1.
11.    for i ← 1 to ⌈log(N/B)⌉ + 2
12.      if τi is empty
13.        then j ← i and break. (∗ continue with step 14 ∗)
14.    S ← the rectangles from all trees τk with 0 ≤ k ≤ j − 1.
15.    empty all trees τk with 0 ≤ k ≤ j − 1.
16.    build an APR-tree on S and store it as τj.
17.    insert r in τ0.
18. return.

Insertion using this algorithm takes O((1/B)(log_{M/B}(N/B))(log₂(N/M))) I/O's amortized.
Algorithm Delete (r)
Input: r - rectangle to be deleted from the LPR-tree.
1. if number of deletions made so far ≥ half the number
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
tree was bulk loaded.
then
S ←{Rectangles
of rectangles with which the
from all the APR-trees} \ {r}.
Reconstruct the LPR-tree using S .
return.
Search for r in each subtree.
if r is not found
then return .
L ←Priority leaf where r is found.
Delete r from L.
νp ←parent of L.
if L contains more than B/2 nodes
then return .
Replenish L with rectangles from the sibling priority nodes and from the children
priority nodes of the parent of L.
15. return.
The search algorithm in line 6 is quite simple. It recursively searches the tree, starting at the root, using the annotated information stored in each internal node about its priority leaves, together with the split value, to determine the child in which to continue the search. Eventually it either locates the rectangle in a leaf, or reports failure when the search rectangle is not found. If deleting a rectangle from a leaf L causes the leaf to underflow, the leaf has to be replenished with rectangles (step 14). This is done by moving to L the most extreme B/2 rectangles among the sibling priority leaves that follow L and the children priority leaves of the sibling kd-nodes of L. It is possible that one of the priority leaves from which rectangles were moved underflows in turn; such priority leaves are replenished recursively in the same manner as L. Chapter 4 describes the pseudo-code used for replenishing in detail.

Taking into account the rebuilding of the tree in step 1 of the Delete algorithm and the replenishing of leaves, it can be shown [3] that deleting a rectangle from an LPR-tree takes O(log_B(N/M) · log₂(N/M)) I/O's amortized.
Chapter 4
Design and Implementation
I implemented the PR-tree family for two-dimensional rectangles. The sections that follow explain the design and implementation of the data structures and the algorithms. Section 4.1 describes some general implementation issues that surfaced during the implementation and how they were handled. Section 4.2 describes fundamental data structures used throughout the implementation. Sections 4.3 and 4.4 describe the data structures of the pseudo-PR-tree and the LPR-tree in terms of TPIE concepts; these sections also explicitly describe the implementation issues related to the respective areas.
4.1 General Implementation Issues
Although the PR-tree had been implemented before, that code could not be reused for implementing the update algorithms, because the underlying TPIE library had undergone major changes (for example, in the caching mechanisms). Most of the following issues occurred in several places in the implementation; almost all of them are related to the usage of TPIE (see Appendix B).

Memory tracking
AMI_block objects represent logical disk blocks in TPIE. These blocks have unique identifiers in an AMI_collection. Very often these block objects are created in one place in the code and have to be deleted in a different place (for instance, due to caching). When blocks are not deleted, the available memory runs out and TPIE simply aborts the application; when a deleted block is accessed, exceptions occur at unpredictable places. To solve this range of errors, a memory tracker class was created. Every time a new statement is executed, the memory tracker is invoked to record the block id and the line number where the allocation is made. Similarly, every time a delete statement is executed, the memory tracker is invoked to release the allocation reference (the block id). This memory tracker has the following benefits:

• It can be used at different points in the code to check whether allocations and de-allocations match or have occurred correctly.
• It can detect whether blocks are re-allocated without having been de-allocated.
• At the end of the program, it can dump the allocations that have not yet been de-allocated, together with the line numbers where they were created, which greatly helps in fixing memory leaks. In fact, this trick was used to fix the R∗-tree implementation that was included in the TPIE distribution.
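A sketch of such a memory tracker follows. It is illustrative: the real class records TPIE block id's, while here the reporting and the usual __LINE__ convention are shown with plain integer id's:

    #include <cstdint>
    #include <iostream>
    #include <map>

    // Minimal memory tracker: records the source line at which each block
    // id was allocated and reports double allocations, stray deletions
    // and, at the end of the program, leaks.
    class MemoryTracker {
        std::map<std::uint64_t, int> live_;  // block id -> allocation line
    public:
        void onNew(std::uint64_t bid, int line) {
            if (!live_.emplace(bid, line).second)
                std::cerr << "block " << bid
                          << " re-allocated without de-allocation (line "
                          << line << ")\n";
        }
        void onDelete(std::uint64_t bid) {
            if (live_.erase(bid) == 0)
                std::cerr << "block " << bid << " deleted but not live\n";
        }
        void dumpLeaks() const {  // call at program exit
            for (const auto& kv : live_)
                std::cerr << "leaked block " << kv.first
                          << " allocated at line " << kv.second << "\n";
        }
    };

    // At each allocation site:  tracker.onNew(idOfNewBlock, __LINE__);
    // At each matching delete:  tracker.onDelete(idOfBlock);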
I/O counts
TPIE provides an interface to obtain the number of block reads and writes for AMI_stream and AMI_collection objects. However, the interface that should provide these statistics for an AMI_stream does not work. To work around this problem, we take the number of item reads and writes and divide it by the number of items that fit in a block. For the LPR-tree we know that this gives a good approximation, as most of the block I/O on streams occurs during sorting, when almost all blocks that are read or written are full.
Miscellaneous issues
There were several minor problems related to using TPIE; the following are the most important ones, which took some effort to trace:

• It is not possible to delete an item from an AMI_stream. This is a problem for the LPR-tree implementation, because rectangles that have already been placed in a tree must be filtered out before the next recursive step of the construction starts. Ideally, the placed rectangles would simply be deleted from the four streams that are sorted along the four respective dimensions. To work around this problem, we filter the rectangles still to be placed into new streams and sort them again along the four dimensions.
• Sorting an AMI_stream that is already sorted returns an error code indicating that the stream is already sorted, as opposed to the normal scenario where the return value indicates success. Moreover, in that case the stream that is provided to store the sorted objects remains empty, as the user is expected to reuse the original stream. This problem was easily worked around.
• When a block collection (usually representing a tree) is stored to disk, TPIE also stores a stack of free blocks in a separate file. Care should be taken that, when making a copy of a tree stored on disk, the stack (.stk) file is copied as well. Not doing so results in crashes at various places inside TPIE that do not easily reveal the actual problem.
4.2 Two Dimensional Rectangle
Data structure
A two-dimensional axis-parallel rectangle is represented by the following data structure:

DataStructure TwoDRectangle
Begin
  double min[2]
  double max[2]
  AMI_bid id
End

The arrays min and max contain the minimum and maximum values of the coordinates in the x and y directions, respectively; id is a unique identifier within a stream of rectangles. This data structure will simply be referred to as a rectangle in the subsequent discussion. A two-dimensional stream of rectangles, TwoDRectangleStream, is an AMI_stream of TwoDRectangle objects.

Operations
During the various operations on the LPR-tree, the following two operations are frequently performed on a TwoDRectangle:

• Intersection of rectangles
Two rectangles are said to intersect if their edges intersect or if one rectangle is completely contained in the other. Note that intersection of rectangles is commutative. The following pseudo-code is used to determine rectangle intersection:

PseudoCode Intersects(r1, r2)
Input: Two TwoDRectangle objects r1 and r2.
Output: true if r1 and r2 intersect, false otherwise.
1. b1 ← (r2.xmin > r1.xmax) (∗ r2 lies to the right of r1 ∗)
2. b2 ← (r2.xmax < r1.xmin) (∗ r2 lies to the left of r1 ∗)
3. b3 ← (r2.ymin > r1.ymax) (∗ r2 lies above r1 ∗)
4. b4 ← (r2.ymax < r1.ymin) (∗ r2 lies below r1 ∗)
5. return not (b1 or b2 or b3 or b4).

Essentially, the algorithm checks whether the rectangles cannot intersect and returns the negation of that result.
–
Computing the minimum bounding box
35
Given a list of rectangles the minimum bounding box is computed by linearly
traversing the list and keeping track of the most extreme coordinates in the
respective directions. The following pseudo code describes this procedure.
PseudoCode ComputeMinimumBoundingBox (r, n)
Input: List r of n TwoDRectangle objects.
Output: TwoDRectangle that is the minimum bounding box of all the rectangles in the specified list.
1. mbb ← r_0 (∗ mbb is set to the first rectangle in the list ∗)
2. for i ← 1 to n − 1
3.   do
4.     if mbb_xmin > r_i_xmin
5.       then mbb_xmin ← r_i_xmin
6.     if mbb_ymin > r_i_ymin
7.       then mbb_ymin ← r_i_ymin
8.     if mbb_xmax < r_i_xmax
9.       then mbb_xmax ← r_i_xmax
10.    if mbb_ymax < r_i_ymax
11.      then mbb_ymax ← r_i_ymax
12. return mbb.
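The same scan can be sketched in C++ over an in-memory list (the actual implementation reads the rectangles from a TPIE stream, but the loop body is identical):

    #include <cstddef>
    #include <vector>

    // Linear scan keeping the most extreme coordinate seen in each direction.
    TwoDRectangle ComputeMinimumBoundingBox(const std::vector<TwoDRectangle>& r) {
        TwoDRectangle mbb = r[0];  // mbb starts as the first rectangle
        for (std::size_t i = 1; i < r.size(); ++i) {
            for (int d = 0; d < 2; ++d) {  // d = 0: x-direction, d = 1: y-direction
                if (r[i].min[d] < mbb.min[d]) mbb.min[d] = r[i].min[d];
                if (r[i].max[d] > mbb.max[d]) mbb.max[d] = r[i].max[d];
            }
        }
        return mbb;
    }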
4.3 Pseudo-PR-tree

4.3.1 Data Structures

We first describe the priority node and the internal node (KD-node) structure. The tree itself is represented by the root node, which is an internal node. The complete tree is stored on disk as an AMI collection.
PriorityNode
PriorityNode is an AMI block that can hold at most B rectangles. All these rectangles are stored in the el field of the block. In implementation terms, the priority node class is derived from an AMI block. The info field of the block, in this case, contains the number of rectangles currently present in the block. The priority node stores rectangles in sorted order. The sorting order is determined by the dimension the node represents. So, for example, if the node is an xmin or a ymin priority node, the rectangles are sorted in ascending order according to their coordinate value in that dimension. The sorting order is descending when the node is an xmax or a ymax node.
A rectangle can be added to a priority node only when it contains fewer than B rectangles. The stored rectangles are always maintained in sorted order according to the dimension the priority node represents in the pseudo-PR-tree. The insertion pseudo-code is described below. It follows a binary search pattern to locate the position of insertion.
PseudoCode InsertRectangle (r, dim)
Input: TwoDRectangle r to be inserted along the specified dimension dim; n is the number of rectangles currently in the node. (∗ Assumption: the priority node has at least one position free. ∗)
1. insertValue ← r_dim
2. first ← 0; last ← (n − 1); mid ← 0
3. compare ← <
4. if IsMinDimension (dim)
5.   then compare ← >
6. while first ≤ last
7.   do
8.     mid ← ⌊(first + last)/2⌋
9.     midRectangle ← el_mid
10.    midValue ← midRectangle_dim
11.    if compare (midValue, insertValue) = true
12.      then last ← (mid − 1)
13.           lastRectangle ← el_last
14.           lastValue ← lastRectangle_dim
15.           if compare (lastValue, insertValue) ≠ true
16.             then break
17.      else
18.        if midValue ≠ insertValue
19.          then first ← (mid + 1)
20.          else break
21. el_mid ← r
When a rectangle gets inserted in step 21, all the rectangles from the position of insertion onwards are moved one position to the right. A similar procedure is followed for deleting a rectangle.
When the priority node is already full, the least extreme rectangle, that is, the rectangle at position (n − 1), is replaced if the rectangle being inserted is more extreme than it.
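The binary search can be written compactly in C++; the following sketch assumes the dimension encoding 0-3 for xmin, ymin, xmax, ymax and a small coordinate accessor (both illustrative, not the literal thesis code):

    // Coordinate of rectangle r in dimension dim
    // (0 = xmin, 1 = ymin, 2 = xmax, 3 = ymax).
    double Coord(const TwoDRectangle& r, int dim) {
        return (dim < 2) ? r.min[dim] : r.max[dim - 2];
    }

    // Position at which r should be inserted into the sorted array el of n
    // rectangles: min-dimensions sort ascending, max-dimensions descending,
    // so the least extreme rectangle is always at position n - 1.
    int FindInsertPosition(const TwoDRectangle* el, int n,
                           const TwoDRectangle& r, int dim) {
        int lo = 0, hi = n;
        bool ascending = (dim < 2);
        double key = Coord(r, dim);
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            double v = Coord(el[mid], dim);
            bool before = ascending ? (v < key) : (v > key);
            if (before) lo = mid + 1; else hi = mid;
        }
        return lo;  // rectangles from position lo shift one place to the right
    }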
Internal Node (KD-node)
The following data structure describes the internal node of the pseudo-PR-tree.

DataStructure InternalNode
Begin
  TwoDRectangle minimumBoundingBoxes [6]
  double leastExtremeValues [4]
  double splitValue
  AMI bid priorityNodes [4]
  AMI bid subTreeIds [2]
  int subTreeIndices [2]
End

The internal node holds pointers (block id's) to the children priority nodes and to the roots of the recursive subtrees. The node itself is stored in an AMI block. It also contains the annotated information, namely, the minimum bounding boxes of all the children nodes and the least extreme values of each of the priority nodes. There can be at most 4 priority nodes and 2 internal nodes. To identify the children KD-nodes completely, it is also necessary to store the location in the block where the node can be found, in addition to their block id's.
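A C++ sketch of this record, with the field sizes of the data structure above (AMI_bid as in the earlier sketch):

    // KD-node record: 4 priority-node children plus up to 2 subtree roots.
    struct InternalNode {
        TwoDRectangle minimumBoundingBoxes[6];  // 4 priority nodes + 2 subtrees
        double        leastExtremeValues[4];    // one per priority node
        double        splitValue;               // KD-split coordinate
        AMI_bid       priorityNodes[4];         // block id's of the priority nodes
        AMI_bid       subTreeIds[2];            // blocks holding the subtree roots
        int           subTreeIndices[2];        // positions of the roots in those blocks
    };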
4.3.2 Construction Algorithm

The construction algorithm recursively builds the tree in a top-down fashion. I/O efficiency in the construction is obtained using a grid. The grid is a 4-dimensional structure defined by the coordinate axes xmin, xmax, ymin, ymax. There are z 3-dimensional slices defined orthogonal to each dimension d, using the input stream that has the four-dimensional points sorted on the coordinate values of dimension d. In particular, these slices are defined such that the numbers of rectangles that lie between any two adjacent slices are equal. The z slices orthogonal to each of the four dimensions divide the four-dimensional space defined by the set S∗ into a grid that contains z^4 grid cells. Each grid cell holds the count of the number of rectangles that lie in that cell. As the grid is kept in main memory, splitting the internal nodes to create sibling KD-nodes can be performed I/O-efficiently, as only a limited number of block I/O's will be necessary to perform such a split. Using the grid, each recursive step builds a part of the tree in memory and distributes the rectangles. The distribution ensures that the properties of a pseudo-PR-tree are maintained correctly. We first describe the structure of the grid and some of the important operations on the grid, followed by the pseudo-code for the construction algorithm itself.
Axis Segments
There are z axis segments orthogonal to each of the four dimensions. As described earlier, the grid will be split recursively during the construction algorithm, to create the internal nodes of the pseudo-PR-tree. The grid helps in determining the slice l that requires a split. However, in order to determine the exact position of the split, the part of the input stream that is sorted along the dimension of the split and contains the rectangles that lie in the slice l needs to be accessed. To achieve this, for each axis segment orthogonal to a certain dimension d, we need to store the offsets into the input stream sorted along d. This is achieved by having four hash tables, AxisSegments[4], one for each of the dimensions, whose keys are the coordinate values defining an axis and whose values are the offsets to the correct position in the sorted stream. To use memory efficiently, these hash tables are shared across the sub-grids that are created as a result of splitting the grid. During the process of splitting, new axis segments get added to the hash tables.
Grid
Given the memory constraints, the size of the grid is tuned to 16. The grid is implemented as a collection of GridCell objects. To be able to efficiently retrieve cells from the grid given their address in terms of the defining coordinate values, the grid cells are assigned id's. This is required for the operations on the grid that will be described later. The grid is then a hash table from these cell id's to the actual grid cells.

DataStructure GridCell
Begin
  double axisBegin [4]
  double axisEnd [4]
  int numberOfRectangles
  int id
End

The id of a grid cell can be calculated from the indices of the four coordinate axes defining the grid cell, using the following formula:

id = 16^3 ∗ Index(AxisSegments_xmin, axisBegin_xmin) +
     16^2 ∗ Index(AxisSegments_ymin, axisBegin_ymin) +
     16 ∗ Index(AxisSegments_xmax, axisBegin_xmax) +
     Index(AxisSegments_ymax, axisBegin_ymax)

Index(h, k) is a function that retrieves the sorted position of the key k in the hash table h.
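A sketch of this computation in C++, using sorted vectors of axis coordinates as a stand-in for the AxisSegments hash tables (GridCell as defined above; the helper names are illustrative):

    #include <algorithm>
    #include <vector>

    enum Dimension { XMIN = 0, YMIN = 1, XMAX = 2, YMAX = 3 };

    // One sorted list of axis-segment coordinates per dimension.
    static std::vector<double> axisSegments[4];

    // Sorted rank of key among the axis segments of dimension dim.
    static int Index(Dimension dim, double key) {
        const std::vector<double>& a = axisSegments[dim];
        return (int)(std::lower_bound(a.begin(), a.end(), key) - a.begin());
    }

    // Cell id as a base-16 number built from the four axis indices.
    int CellId(const GridCell& c) {
        return 16 * 16 * 16 * Index(XMIN, c.axisBegin[XMIN])
             + 16 * 16      * Index(YMIN, c.axisBegin[YMIN])
             + 16           * Index(XMAX, c.axisBegin[XMAX])
             +                Index(YMAX, c.axisBegin[YMAX]);
    }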
Splitting the Grid
At each step of the construction, the grid is required to be split along a specified dimension d, depending on the level at which the tree is being constructed. The grid has to be split in such a way that the rectangles are distributed approximately evenly among the two halves. The split has to be performed at a coordinate value along dimension d, such that rectangles that are greater than or equal to this value are on one side and the rest on the other side. The efficiency of the grid can be seen here, as the grid keeps the counts of all the rectangles. The following pseudo-code describes the splitting of the grid:
PseudoCode SplitGrid (grid, d)
Input: Dimension d along which the grid has to be split.
Output: Two grids, each having approximately half the rectangles.
1. sortedStream ← sorted stream along dimension d
2. slices ← GetSortedSlices (grid, d) (∗ slices is a map of the coordinate value to the rectangle count in that slice. ∗)
3. count ← 0
4. Choose the smallest i such that count ← Σ_{j=0..i} slices[j] > StreamLength (sortedStream)/2
5. splitSlice ← Key (slices, i)
6. offset ← AxisSegments_d[splitSlice] (∗ Find the precise coordinate ∗)
7. previousCount ← count − slices[i]
8. while ReadItem (sortedStream, r) and Contains (grid, r)
9.   do
10.    ++previousCount
11.    offset ← CurrentPosition (sortedStream)
12.    if previousCount ≥ count
13.      then break (∗ goes to step 14 ∗)
14. splitValue ← r_d
15. Insert (AxisSegments_d, offset, splitValue) (∗ Insert a new axis ∗)
16. grid1 ← nil; grid2 ← nil (∗ Create two new empty grids ∗)
17. for each GridCell in grid (∗ Distribute the cells into two grids ∗)
18.   g ← CurrentItem (grid)
19.   if g.axisBegin_d < splitSlice
20.     then AddGridCell (grid1, g)
21.   else if g.axisBegin_d > splitSlice
22.     then AddGridCell (grid2, g)
23.   else (∗ Split the cell into two and add it to both grids. ∗)
24.     g1, g2 ← g
25.     if g1.axisEnd_d ≠ splitValue
26.       then (∗ selectedSlice is not the exact split ∗)
27.         g1.axisEnd_d ← splitValue
28.         AddGridCell (grid1, g1)
29.         g2.axisBegin_d ← splitValue
30.         g2.numberOfRectangles ← 0
31.         AddGridCell (grid2, g2)
32.       else
33.         AddGridCell (grid2, g)
34. Seek (sortedStream, offset)
(∗ Update the rectangle counts if a slice was split in step 26 ∗)
35. while ReadItem (sortedStream, r) and Contains (slice, r)
36.   do
37.     cellId ← GetCellId (r)
38.     g1 ← grid1[cellId]; g2 ← grid2[cellId]
39.     −−g1.numberOfRectangles; ++g2.numberOfRectangles
40. return grid1, grid2
We first obtain the slice l where the split should occur, in step 4. We then (in step 8) look up the sorted stream along dimension d by seeking to the correct offset using the hash table AxisSegments[d]. We go through a small number of rectangles in this stream that lie in the selected slice. This gives a more precise split value, which then becomes a new axis along dimension d. This axis and the offset to the axis in the sorted stream are added to the hash table AxisSegments[d] (step 15). Two new grids are then created. Grid cells that lie to the left or to the right of the split value are easily distributed. Care has to be taken when a grid cell lies in the slice that was split (step 23). If the slice selected in step 4 was an exact split, we simply add this cell to the second grid (step 33). However, when a new axis is introduced, we have to update the axes defining the boundaries of the grid cells in step 26. When this happens, the rectangle counts are adjusted by scanning the rectangles that lie in this slice in step 35.
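Steps 3-4 of SplitGrid amount to a prefix-sum scan over the per-slice counts; a small C++ sketch (assuming the counts are given in sorted coordinate order):

    #include <cstddef>
    #include <vector>

    // First slice index at which the running count passes half the stream.
    int ChooseSplitSlice(const std::vector<long>& sliceCounts, long totalRectangles) {
        long count = 0;
        for (std::size_t i = 0; i < sliceCounts.size(); ++i) {
            count += sliceCounts[i];
            if (count > totalRectangles / 2)
                return (int)i;  // slice i contains the median rectangle
        }
        return (int)sliceCounts.size() - 1;  // degenerate input
    }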
Construction using the grid
The construction algorithm is implemented using a set of procedures that share some data by being part of a class. The following class describes the necessary data and operations to perform the construction.

Class PseudoPRTree
Data
  TwoDRectangleStream inputStream
  InternalNodeBlock cachedBlocks[]
  PriorityNode cachedPriorityNodes[]
  Queue leavesToConstruct
Operations
  void Distribute(r, currentBlock, index, level, stream)
  void Construct(rootBlock, index)
  void SplitNode(currentBlock, grid, currentCount, level)

The construction algorithm constructs the tree recursively, with each recursive step constructing z nodes in the tree. The following pseudo-code describes the sequence of operations performed in one recursive step of the construction algorithm:
PseudoCode Construct (rootBlock, index)
Input: rootBlock and index identifying the rootNode of the tree constructed.
Output: pseudo-PR-tree constructed using the input stream.
1. for each dimension in xmin, ymin, xmax, ymax
2.   AMI_sort (inputStream, dimension) (∗ Sort in all dimensions ∗)
3. grid ← ConstructGrid (sortedStreams) (∗ Construct the initial grid. ∗)
4. SplitNode (rootBlock, grid, 0, 0)
5. remainingRectangles ← nil (∗ rectangles that could not be distributed are tracked ∗)
6. while ReadItem (inputStream, r) (∗ Distribute the stream ∗)
7.   do
8.     Distribute (r, rootBlock, 0, 0, remainingRectangles) (∗ end while ∗)
9. substreams[|leavesToConstruct|] ← nil (∗ Filter the streams associated with each leaf ∗)
10. Clear (cachedPriorityNodes) (∗ Write the priority nodes to the disk ∗)
11. Clear (cachedBlocks) (∗ Write the internal nodes to the disk ∗)
12. while ReadItem (remainingRectangles, r)
13.   do
14.     for each leaf l in leavesToConstruct
15.       if Contains (l, r)
16.         then AddItem (substreams_l, r)
17. for each leaf l in leavesToConstruct
18.   tree ← CreateTree (substreams_l)
19.   Construct (l.rootBlock, l.index)
The operation Construct takes an internal node and constructs a subtree rooted at that node in memory, containing at most z nodes. Each such construction phase uses the recursive function SplitNode (step 4) to construct all the internal nodes as per the properties of the pseudo-PR-tree. Later, a distribution phase (step 6), performed by the function Distribute, distributes the rectangles across the created internal nodes. The construction phase creates a grid (step 3) using the sorted streams (created in step 1) that guides the splitting process. The last phase in the construction algorithm involves recursively creating subtrees for all the leaves that could not be constructed completely because of memory constraints. The addresses of such leaves are stored in the queue leavesToConstruct. The pseudo-code that constructs the internal nodes of the tree using the grid is described below:
PseudoCode SplitNode (currentBlock, grid, currentCount, level)
Input:
– currentBlock - Block to store internal nodes created.
– grid - Grid object to split the nodes evenly.
– currentCount - Number of internal nodes created so far.
– level - Current level being constructed.
1. currentNode ← currentBlock_index
2. if currentCount < z (∗ z = 16, size of the grid ∗)
3.   then
4.     if NumberOfRectangles (grid) > 4∗B (∗ If sufficient rectangles are available for a split ∗)
5.       then
6.         splitDimension ← level % 4
7.         g1, g2 ← SplitGrid (grid, splitDimension)
8.         for g in g1, g2
9.           if NumberOfRectangles (g) > 0
10.            then if currentBlock is full
11.                   then currentBlock ← NewInternalNodeBlock ( )
12.                 ++currentCount
13.                 AddInternalNode (currentBlock)
14.        SplitNode (currentBlock, g1, currentCount, level + 1)
15.        SplitNode (currentBlock, g2, currentCount, level + 1)
16. else (∗ Reached memory limit ∗)
17.   if NumberOfRectangles (grid) > 4∗B
18.     then (∗ node needs to be split, add it to leavesToConstruct ∗)
19.       Add (leavesToConstruct, currentBlockId, level)
20.     else
21.       AddInternalNode (currentBlock)
The operation SplitNode uses the grid to split nodes, at each level, in a KD-tree fashion. Once a block is allocated, it is recursively used to store the internal nodes created, until it has no space left for more nodes. This disk layout strategy achieves I/O efficiency during queries. When the maximum number of internal nodes z is created, the leaves that need further splits are stored in the queue leavesToConstruct (step 19). The class PseudoPRTree also shows an internal cache of priority nodes and blocks for internal nodes. These caches are filled when the internal nodes are created in step 11 and later used during the distribution phase. The distribution phase is described by the following pseudo-code:
PseudoCode Distribute (r, currentBlock, index, level, remainingRectangles)
Input:
– r - TwoDRectangle object to be distributed in the tree.
– currentBlock - Block to store internal nodes created.
– index - index of the location of the node in the currentBlock.
– level - Level at which distribution occurs.
– remainingRectangles - Stream of rectangles that could not be distributed.
1. currentNode ← currentBlock_index
2. inserted ← false
3. for each dimension in xmin, ymin, xmax, ymax
4.   if priorityNode_dimension in currentNode does not exist
5.     then
6.       Create priorityNode_dimension
7.   if NumberOfRectangles (priorityNode_dimension) < B
8.     then
9.       AddRectangle (priorityNode_dimension, r)
10.      inserted ← true
11.      break
12.    else
13.      rr ← GetRectangle (priorityNode_dimension, (B − 1))
14.      if ReplaceRectangle (priorityNode_dimension, r) (∗ Replace rectangle if r is more extreme than rr along dimension ∗)
15.        then r ← rr
16. if inserted = false
17.   n ← NumberOfSubtrees (currentNode)
18.   if n > 0
19.     then
20.       splitDimension ← level % 4
21.       subtreeBlock, index ← Choose subtree according to split value and splitDimension
22.       Distribute (r, subtreeBlock, index, level + 1, remainingRectangles)
23.     else AddItem (remainingRectangles, r)
The operation Distribute takes a rectangle r that has to be placed in the subtree rooted at the node currentBlock_index, such that the properties of a pseudo-PR-tree are not violated. In order to do this, the algorithm first tries to find a place in the priority nodes xmin, ymin, xmax, ymax, in that order. This means that if a priority node p has fewer than B rectangles, the rectangle r gets added in the correct sorted position according to the dimension of the priority node (step 9). If this is not the case, it is checked whether r is more extreme than the least extreme rectangle rr in p (steps 12-14). If this is indeed the case, we replace rr by r in p and continue the distribution process with rr. If none of the priority nodes could accommodate r (or rr, in case r replaced it in step 14), the search for the correct position continues in the appropriate subtree, which is chosen using the dimension in which the tree was split and the split value (step 21). If the tree is completely full, r is added to the stream remainingRectangles. This stream is later filtered for each of the leaves that have to be recursively constructed.
4.3.3 Implementation Issues

Designing the structure of the grid
The grid is a collection of grid cells. During the construction of the pseudo-PR-tree, it is often desirable to quickly obtain the grid cell that contains a particular rectangle. This happens, for instance, when the grid cell has to be updated with counts of rectangles. To be able to do this, the grid was implemented as a hash table that contains a mapping from a unique key (id) that identifies a grid cell to the grid cell itself. The formula to obtain the id from the coordinates of a rectangle was already presented. The id of the grid cell should also be derivable from the coordinates of the rectangle that lies in the grid cell. The solution to this problem is to obtain the key from the indices of the coordinates that define the boundary of the grid cell. These indices can easily be obtained using the AxisSegments hash tables.

Maintaining vs. computing the minimum bounding box
The minimum bounding box to be kept in the parent node could either be computed once the tree has been constructed or maintained while the tree is being constructed. Maintaining the bounding box was found to be more expensive, because rectangles enter and leave the priority nodes very often during the construction phase. Hence the bounding box is computed after the construction has completed.

Improving the algorithms to update a PriorityNode
The rectangles in a priority node are kept in sorted order to easily obtain the least extreme value, which is frequently required during the distribution of rectangles. To store rectangles in sorted order along a particular dimension, it is important to insert new rectangles at the correct position. A linear traversal of this list made the construction algorithm very slow. Hence a binary search is performed to locate the position of insertion or deletion before actually inserting or deleting a rectangle.
4.4 LPR-tree

4.4.1 Structure

The LPR-tree contains a sequence of pseudo-PR-trees (with annotated information). The most important operations related to the LPR-tree structure are the Insert, Delete and Query algorithms. The following class describes how these algorithms are implemented.
Class LPRTree
Data
  InternalNodeBlock cachedBlocks[]
  PriorityNode cachedPriorityNodes[]
  RootNodeBlock rootBlock
  AMI collection outputStream
Operations
  void Insert(r)
  bool Delete(r)
  void Query(queryWindow, outputSize, nodes, priorityNodes)
  void GetRectangles()
  void CacheTree(treeIndex)
The root node of the LPR-tree contains links to the root node blocks of the children pseudo-PR-trees. Depending on the size of these pseudo-PR-trees, either the full tree or a part of the tree is cached. This gives the benefit of being I/O-efficient while updating the tree. The info field of this root node block contains the following fields:
– numberOfRectangles - The number of rectangles with which the tree was last bulk loaded.
– numberOfInsertions - The number of insertions made to the tree since the last time the tree was bulk loaded.
– numberOfDeletions - The number of deletions made to the tree since the last time the tree was bulk loaded.
During updates, the tree is reconstructed by collecting all the rectangles. This happens in the following two situations (see the sketch below):
– The number of insertions becomes equal to the number of rectangles with which the tree was bulk loaded.
– The number of deletions becomes equal to half the number of rectangles with which the tree was bulk loaded.
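A minimal sketch of this rebuild test, using the counter names of the info field (the actual trigger sits inside the Insert and Delete operations):

    // True when the LPR-tree must be rebuilt: as many insertions as the
    // last bulk load size, or deletions reaching half of it.
    bool NeedsRebuild(long numberOfRectangles,
                      long numberOfInsertions,
                      long numberOfDeletions) {
        return numberOfInsertions >= numberOfRectangles ||
               2 * numberOfDeletions >= numberOfRectangles;
    }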
When such a situation occurs, the caches are emptied and the tree is bulk loaded again, resetting these counters to 0 and setting numberOfRectangles to the correct number. The GetRectangles method recursively traverses the entire tree and writes all the rectangles to a stream, subsequently deleting all the blocks and nodes from the tree. Bulk loading an LPR-tree involves constructing the ⌈log(N/B)⌉ + 1 child pseudo-PR-trees. This is followed by a phase in which a part of the tree, or the whole tree, is cached. Given the tree index, the number of levels to cache is determined by the following pseudo-code:
PseudoCode CacheTree (treeIndex)
Input: Index of the child pseudo-PR-tree to be cached.
1. block ← GetBlock (rootNodeBlock, treeIndex) (∗ Gets the root block ∗)
2. maxLevels ← −1; cache ← true
3. l ← log(N/M)
4. m ← log(M/B)
5. if NumberOfItems (block) > 0
6.   then
7.     if treeIndex ≤ m
8.       then maxLevels ← −1 (∗ cache all levels ∗)
9.     else if l > (m + 1) and treeIndex < l
10.      then cache ← false
11.    else maxLevels ← (treeIndex − l)
12. if cache = true
13.   then recursively cache all the priority nodes and internal nodes up to a depth of maxLevels.

4.4.2 Insertion Algorithm

The following pseudo-code describes the high-level implementation details to insert a rectangle in an LPR-tree:
PseudoCode Insert (r)
Input: TwoDRectangle object r to be inserted.
1. if (numberOfInsertions + 1) = numberOfRectangles
2.   then
3.     stream ← GetRectangles ( )
4.     AddItem (stream, r)
5.     Construct (stream)
6.   else
7.     t0Count ← NumberOfRectangles (t0Block)
8.     if t0Count ≥ B
9.       then
10.        stream ← nil
11.        for i ← 0 to n
12.          block ← GetBlock (rootNodeBlock, i) (∗ Gets the root block of τi ∗)
13.          if NumberOfRectangles (block) > 0
14.            then GetRectangles (i) (∗ Add all the rectangles to stream ∗)
15.            else break (∗ continues with step 16 ∗)
16.        tree ← ConstructPseudoPRTree (stream, outputStream)
17.        CacheTree (tree)
18.      else (∗ τ0 block has space for r ∗)
19.        AddRectangle (t0Block, r)
When the number of rectangles inserted has reached the threshold, the tree is reconstructed by collecting all rectangles in a stream that also contains the rectangle to be inserted. When this is not the case, the rectangle is inserted in the τ0 tree. If τ0 is already full, a search is made for the first empty tree τi in step 12. All the rectangles in the trees preceding this tree are moved to a stream and those trees are discarded. τi is then bulk loaded with the stream of rectangles collected. Having now made sure that τ0 has space, the rectangle is inserted into τ0.
4.4.3 Deletion Algorithm

The deletion algorithm uses the least extreme values of the priority nodes and the split values of the internal nodes to search for the rectangle that has to be deleted. As a consequence of the distribution strategy of rectangles in a pseudo-PR-tree, there is guaranteed to be exactly one path in a pseudo-PR-tree along which to search for a specific rectangle to delete. Following this path, either the rectangle is found or it is concluded that this rectangle is not present in the tree. While deleting a rectangle, we take care to update the minimum bounding boxes of the priority nodes. The internal nodes are updated after deletion by keeping track of the path followed for deletion. In this section we describe the pseudo-code for deletion, followed by a description of the algorithm to replenish rectangles in under-full nodes.
PseudoCode Delete (r, block, index, path)
Input:
– r - TwoDRectangle object to be deleted.
– block - InternalNodeBlock representing a node in the tree (initially the root block).
– index - index of the position in block of the current node.
– path - path to the deleted rectangle; initially empty.
Output: true if r is found and deleted, false otherwise.
1. node ← block_index
2. currentValue ← r_dimension (∗ for the dimension currently considered in step 3 ∗)
3. for each dimension in xmin, ymin, xmax, ymax
4.   lev ← node.leastExtremeValue_dimension
5.   if lev ≤ currentValue (∗ ≥ for max nodes ∗)
6.     then
7.       p ← node.priorityNodes_dimension
8.       if RemoveRectangle (p, r)
9.         then
10.          Add (path, dimension)
11.          if NumberOfRectangles (p) < B/2
12.            then replenish this node and update bounding boxes
13.            else ComputeMinimumBoundingBox (p, node)
14.          return true
15.        else
16.          if lev = currentValue
17.            then goto step 3 with the next dimension
18.            else return false
19. if NumberOfSubtrees (node) > 0
20.   then
21.     subtreeBlock, subtreeIndex ← Choose subtree according to split value
22.     Add (path, subtreeIndex)
23.     return Delete (r, subtreeBlock, subtreeIndex, path)
24.   else return false
Given a rectangle r, the above pseudo-code recursively goes through the tree. In step 3 it checks each of the priority nodes based on the least extreme values stored in the parent node. If r falls in this range, either it gets deleted or it is concluded that the rectangle is not in the tree, except when the least extreme value is the same as the coordinate value of the search rectangle (step 17). This is because it may be possible that more than one rectangle in the tree has the same coordinate value along a specified dimension, which also happens to be the least extreme value of the priority node. If none of the priority nodes could be checked because the rectangle falls out of range, or when the special case of step 17 occurs, the subtrees are searched recursively in steps 19-23. The path to the deleted rectangle is stored to be able to correct the minimum bounding boxes. This path stores the indices of the children from the root down to the priority node that contained the deleted rectangle. More precisely, the index of a child is the index of the subtree in the case of an internal node (step 22), and otherwise it is the dimension of the priority node (step 10). Note that we do not have to store the id's of the blocks, as this information is already present in the internal nodes. As the recursion terminates when a priority node is reached, the last index in this path must be the dimension of a priority node.
When deletion happens successfully and the rectangle count in the priority node falls below B/2, the node is replenished (step 12) with rectangles from its priority node siblings or from the priority node children of its sibling KD-nodes. This is described by the following pseudo-code:
PseudoCode ReplenishNodes (p, d, node)
Input: PriorityNode p that is the d-th child of the InternalNode node and that is under-full.
1. stream ← nil (∗ Stream contains items of rectangles and their priority node address ∗)
2. underFlowStack ← nil
3. priorityNodes ← nil (∗ List of addresses of nodes from which rectangles are collected ∗)
4. for each dimension succeeding d in xmin, ymin, xmax, ymax
5.   q ← node.priorityNodes_dimension
6.   rectangles ← GetRectangles (q)
7.   AddItem (stream, rectangles)
8.   Add (priorityNodes, Address(q))
9. for each sub-tree of node
10.   subtreeNode ← root of the sub-tree
11.   for each dimension in xmin, ymin, xmax, ymax
12.     q ← subtreeNode.priorityNodes_dimension
13.     rectangles ← GetRectangles (q)
14.     AddItem (stream, rectangles)
15.     Add (priorityNodes, Address(q))
16. n ← MIN(B/2, StreamLength (stream))
17. AMI_sort (stream, d)
18. for i ← 0 to n
19.   ReadItem (stream, r, q) (∗ q is the priority node where the rectangle r exists. ∗)
20.   AddRectangle (p, r)
21.   RemoveRectangle (q, r)
22. ComputeMinimumBoundingBox (p)
23. for i ← 0 to Count (priorityNodes)
24.   q ← priorityNodes_i
25.   if NumberOfRectangles (q) < B/2
26.     then Push (q, underFlowStack)
27.     else ComputeMinimumBoundingBox (q)
28. while underFlowStack is not empty
29.   q ← Pop (underFlowStack)
30.   ReplenishNodes (q.priorityNode, q.dimension, q.parent)
Rectangles are first collected in an AMI stream from the sibling priority nodes (steps 4-8) and from the children priority nodes of the sibling KD-nodes of the parent of p (steps 9-15). The stream is then sorted. The number of rectangles to be replenished is the minimum of B/2 and the number of rectangles collected (step 16). This is required to ensure the correctness of the LPR-tree after deletion. During this process of replenishing p, the nodes from which rectangles were borrowed may become under-full. The addresses of these priority nodes are collected on a stack (step 26). The address of a priority node is defined by the id of its AMI block, its dimension and the id of the AMI block of its parent. These underflowing priority nodes are replenished recursively in step 30. It is important to note that these rectangles are replenished in reverse order. This means, for instance, that if while replenishing the xmin priority leaf all its siblings become under-full, then those siblings must be replenished in the order ymax, xmax and ymin. This is to ensure that each priority node can be replenished with the correct number of rectangles while still preserving all the properties of a pseudo-PR-tree. Such a reverse order can easily be ensured by storing the addresses of these under-flowing nodes on a stack.
4.4.4 Implementation Issues

Correctness of the LPR-tree structure
Implementing the update algorithms while taking care of every precise detail is very difficult. To easily find and fix implementation bugs, a small procedure was written to check the correctness of the tree. When working with large datasets, this was a good help in diagnosing problems. The correctness is checked by verifying certain trivial facts about the structure of the LPR-tree. The following are some of these rules:
– Any priority node that is under-full (having less than B/2 rectangles) is a node that has no succeeding sibling priority nodes, and the parent of such a node does not have any subtrees.
– The most extreme value of a priority node (of a certain dimension d) at level i (i > 0) is less extreme than the least extreme value of the corresponding priority node in the same dimension of its parent at level i − 1.
– All priority nodes of the upper levels of a tree that are children of a node having KD-nodes as children must be full.
Replenishing nodes
A lot of effort was spent on correctly implementing the replenishing of nodes during deletion. The following were the two main problem areas encountered:
– When a priority node, for instance p_xmin, underflows due to a deletion, it gets replenished from its sibling priority nodes or from the children priority nodes of its sibling KD-nodes. It may be possible that some of the nodes that gave their rectangles to p_xmin underflow in turn. In such a situation, these priority nodes also need to be replenished recursively, in reverse order. Reverse order means that first the priority nodes of the sibling KD-nodes have to be replenished, then the sibling priority nodes in the order ymax, xmax, ymin, xmin. This order is necessary to maintain the correctness of the tree. It also ensures that any node that needs replenishment can be adequately replenished.
– It is important to replenish an underflowing priority node with at most B/2 rectangles and not more, to preserve the correctness of the tree. Before explaining the problem, the following observations are made.
Observation 1: The least extreme rectangle of a priority node, in any dimension, is more extreme than the most extreme value along that dimension across all the priority node children of the sibling KD-nodes. This observation immediately follows from the structure of the LPR-tree.
Observation 2: A priority node can underflow to an extent far below B/2 and thereby require more than B/2 rectangles to be completely packed. Consider the situation that at least two priority nodes (ν and µ) have (B/2) + 1 rectangles. When a rectangle from the first of these priority nodes, ν, gets deleted, ν requires replenishment. It is possible that most of the replenishing rectangles are taken from µ. In this case, µ subsequently requires more than B/2 rectangles to be completely packed.
Observation 3: For a priority node p_d, in any dimension d greater than xmin, having n rectangles where n < B, the following is true: the n rectangles in p_d, together with the rectangles of its priority node siblings, may not contain the most extreme B rectangles along dimension d in the subtree rooted at its parent. A priority node having KD-siblings is completely full only right after a bulk load; after a bulk load, the priority nodes of a node ν collectively contain the B most extreme rectangles along dimension d in the subtree rooted at its parent.
Now assume that we completely pack the priority nodes that are underflowing. Also assume that the ymin, xmax and ymax priority nodes, whose parent is η, have exactly (B/2) + 1 rectangles, and that η has only one sibling KD-node φ, as the other sibling KD-node ψ was deleted when all the rectangles in its subtree were deleted earlier. Now, deleting one rectangle from ymin may cause xmax to underflow below B/2 (from Observation 2); also assume that no rectangles are taken from ymax. As a result, xmax requires replenishment but ymax does not, as ymax has just enough rectangles to stay above the threshold. Replenishing xmax to the fullest extent may now cause ymax to become completely empty, while also borrowing rectangles from the priority nodes of the sibling KD-node φ. This ymax node must then be replenished with rectangles from the children priority nodes of its KD-siblings. All priority nodes of φ together provide the most extreme B rectangles in the ymax direction. However, it is possible that these B rectangles are not the most extreme ymax rectangles in the subtree rooted at φ (from Observation 3). Moving these rectangles to the ymax node of η violates Observation 1, which makes the LPR-tree inconsistent.
However, moving a maximum of B/2 rectangles ensures that no priority node can become empty while one of its sibling priority nodes or the priority nodes of the sibling KD-nodes have more than B/2 rectangles. Thereby, the above-mentioned problem will not occur.
Chapter 5

Experiments

5.1 Experimental Setup

The LPR-tree is implemented in C++. The code has been developed using the Microsoft Visual Studio 2003 compiler for the Windows XP platform. TPIE[4] is used as the library to control block allocations and count I/O's (see Appendix B). Each rectangle has a size of 40 bytes. The implementation uses a maximum possible block size of 1638 rectangles. The experiments were run on a Pentium 4 CPU, 2.00 GHz, with 256 MB of RAM. The amount of memory available to TPIE is restricted to 64 MB. This constraint is taken from the experiments on (static) PR-trees performed earlier[3]. The experiments are performed using the LPR-tree implementation presented in this thesis and the R∗-tree implementation included in the distribution of TPIE[4, 6].
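For reference, the memory bound is imposed through TPIE's memory manager; the exact header name and entry point may differ between TPIE releases, so the following is only a sketch:

    #include "mm.h"  // TPIE memory manager (header name varies per release)

    int main() {
        // Restrict the memory TPIE may use to 64 MB, as in the experiments.
        MM_manager.set_memory_limit(64 * 1024 * 1024);
        // ... bulk load, update and query experiments run here ...
        return 0;
    }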
5.2 Datasets

We use both real-life and synthetic datasets.

5.2.1 Real-life data

As real-life data we use the TIGER data of the geographical features in the United States of America. As most of our test results are expected to vary with the size of the datasets, we would like to have real-life datasets of varying sizes. The TIGER dataset is distributed over six CD-ROMs. We choose to experiment with the road line segments of two of the CD-ROMs. We collect 17 million rectangles from this dataset and create 7 streams. Five of these streams contain one million rectangles each, and the rest is split into two streams of 4 million and 8 million rectangles respectively.
5.2.2 Synthetic data

To investigate the query performance of these dynamic R-trees over various extreme parameters and distribution characteristics, we use the following datasets.
1. Uniform Distribution: This dataset is designed to test the performance of R-trees over rectangles whose centers follow a uniform distribution in the unit square. The width and height of the rectangles are also generated uniformly at random as a number between 0 and 0.001. These figures are the same as in the experiments on (static) PR-trees performed earlier[3]. In general, we refer to this distribution as the UNIFORM dataset.
2. Normal Distribution: This dataset has fixed-size squares of size 0.001. The centers of the squares follow a normal distribution whose mean is 0.50 and standard deviation is 0.25. In general, we refer to this distribution as the NORMAL dataset.

The following table describes these datasets.

Dataset Identifier   Rectangle ρ
UNIFORM(n^a)         Center(ρ) = (x, y), where x and y are uniformly generated at
                     random such that 0 ≤ x, y ≤ 1.0, with the current system time
                     used as the seed^b for randomization.
                     Width(ρ) = UniformRandom (0, 0.001); Height is also chosen in
                     a similar manner.
NORMAL(n)            Center(ρ) = (x, y), where x and y are generated at random from
                     a normal distribution with µ = 0.50, σ = 0.25, with the current
                     system time used as the seed for randomization.
                     Width(ρ) = Height(ρ) = 0.001.

a n - number of rectangles to generate
b note that the generator is seeded only once for the generation of n rectangles.

Both these distributions are used to generate 17 streams similar to the TIGER datasets.
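A sketch of the UNIFORM generator (helper names are illustrative; the NORMAL generator differs only in drawing the center from a normal distribution and using a fixed side length of 0.001):

    #include <cstdlib>
    #include <ctime>

    // Uniform random double in [lo, hi].
    static double UniformRandom(double lo, double hi) {
        return lo + (hi - lo) * (std::rand() / (double)RAND_MAX);
    }

    // One UNIFORM rectangle: center uniform in the unit square, width and
    // height uniform in (0, 0.001). std::srand(std::time(0)) is called once
    // before the n rectangles of a stream are generated.
    TwoDRectangle MakeUniformRectangle() {
        double cx = UniformRandom(0.0, 1.0);
        double cy = UniformRandom(0.0, 1.0);
        double w  = UniformRandom(0.0, 0.001);
        double h  = UniformRandom(0.0, 0.001);
        TwoDRectangle r;
        r.min[0] = cx - w / 2;  r.max[0] = cx + w / 2;
        r.min[1] = cy - h / 2;  r.max[1] = cy + h / 2;
        r.id = 0;  // assigned when the rectangle is appended to a stream
        return r;
    }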
For convenience, the described datasets are given names as indicated below:

Dataset Identifier   Description
Uni1 1 .. Uni1 5     Uniform distribution of 1 million rectangles.
Uni4                 Uniform distribution of 4 million rectangles.
Uni8                 Uniform distribution of 8 million rectangles.
Uni1.5               Rectangles from Uni1 1 and half the rectangles from Uni1 2.
Nor1 1 .. Nor1 5     Normal distribution of 1 million rectangles.
Nor4                 Normal distribution of 4 million rectangles.
Nor8                 Normal distribution of 8 million rectangles.
Nor1.5               Rectangles from Nor1 1 and half the rectangles from Nor1 2.
Tig1 1 .. Tig1 5     TIGER dataset of 1 million rectangles.
Tig4                 TIGER dataset of 4 million rectangles.
Tig8                 TIGER dataset of 8 million rectangles.
Tig1.5               Rectangles from Tig1 1 and half the rectangles from Tig1 2.

5.3 Bulk Load
Experiment Description
Hypothesis 5.3.1. The comparative bulk load performance of the R∗-tree and the LPR-tree is consistent with their static variants[3]; for instance, the performance of the R∗-tree bulk loaded with the TIGER dataset using the Hilbert construction algorithm and some R∗-tree heuristics is three times better than the performance of the LPR-tree.
Hypothesis 5.3.2. The bulk load performance of the R∗-tree increases linearly with the number of rectangles in the dataset. The LPR-tree shows a similar behavior.

Procedure
Perform a bulk load with the Uni1 1, Uni4 and Uni8 datasets. Repeat this with the similar datasets in the NORMAL and TIGER sets. It is expected that I/O's increase linearly with dataset size in the case of the eastern datasets, while the data distribution of the synthetic datasets should have little or no effect on the I/O performance of the algorithm.
Results
The CPU time taken to bulk load the various datasets for the LPR-tree and the R∗-tree is shown in figures 5.1, 5.2 and 5.3. The CPU time appears to increase in a slightly super-linear fashion with the dataset size for both the LPR-tree and the R∗-tree. For the LPR-tree, this can be explained by the fact that the grid implementation creates two new grids with each split. The LPR-tree also shows a negligible time difference between different distributions. In fact, the minor differences can be attributed more to the sorting time than to the nature of the distribution. In contrast, the R∗-tree is more sensitive to the distribution, performing worst for the normal dataset and best, with about 25% less time, for the uniform dataset.
Figure 5.1: Bulk Load CPU time - Uniform dataset.
Figure 5.2: Bulk Load CPU time - Normal dataset.
The I/O counts shown in figures 5.4, 5.5 and 5.6 seem to increase almost linearly with the dataset size. This is in line with the expectations for the LPR-tree. For the R∗-tree, I/O's are expected to scale linearly with dataset size within the same distribution. The results seem to be independent of the type of distribution for the LPR-tree, whereas for the R∗-tree there is variation, with almost 13% more I/O's incurred in the worst case compared to the best case for the 8 million datasets. The same figure for the LPR-tree is less than 1.3%.
The most important observation is the large I/O difference between the LPR and the R∗-trees. A large part of this I/O cost for the LPR-tree can be attributed to the sorting of streams that is done at each recursive step in the implementation.
Figure 5.3: Bulk Load CPU time - TIGER dataset.
Figure 5.4: Bulk Load I/O - Uniform dataset.
Figure 5.5: Bulk Load I/O - Normal dataset.
Figure 5.6: Bulk Load I/O - TIGER dataset.
5.4 Insertion

Experiment Description
Hypothesis 5.4.1. The average number of I/O's incurred per rectangle inserted remains approximately the same until the LPR-tree is rebuilt. The same is true for the CPU time spent per insertion.
Hypothesis 5.4.2. The average (amortized) number of I/O's per insertion on the LPR-tree is better than the average (amortized) number of I/O's per insertion on the R∗-tree, on similar datasets and under similar conditions.
Hypothesis 5.4.3. The performance results of the insertion algorithm on the LPR-tree show little or no variation (on average) with the nature of the distribution of the data rectangles.
Procedure
First the LPR-tree that was bulk loaded with Uni4 is loaded. We insert the rectangles in Uni1 i, where i ∈ [1, 5]. After the insertion of every Uni1 i, we measure the average number of I/O's required per insertion. We expect a peak in I/O's when inserting Uni1 4. This is due to the cleanup/rebuilding of the tree. We repeat this experiment with the NORMAL and TIGER datasets. Similar experiments are carried out on the R∗-tree.
Results
Figure 5.7 shows that the insertion CPU time for the LPR-tree is almost the same per rectangle. The graph shows the CPU time for every million rectangles inserted in the tree. We do see a sudden increase when a million rectangles are inserted for the fourth time. This is due to the rebuilding of the tree that happens when the insertion count becomes equal to the number of rectangles in the tree. At this point the tree is doubled in capacity and bulk loaded. The CPU time taken seems to be almost independent of variations in the distribution of the rectangles. It is important to note that when the tree does get rebuilt, its capacity doubles; as a result, the expensive cost of rebuilding has to be incurred again only much later.
Figure 5.7: Insertion CPU time - LPR-tree.
Figure 5.8 shows that the insertion CPU time for the R∗-tree is always above 600 seconds for every million rectangles inserted, at any time. The probability that the insertion time with the R∗-tree is going to be greater than with the LPR-tree is quite high. Figure 5.9 shows that the average CPU time for insertion is almost the same, with the R∗-tree having only a slightly better performance.

Figure 5.8: Insertion CPU time - R∗ tree.

Figure 5.9: Insertion average CPU time - LPR tree vs. R∗ tree.

Figure 5.10 shows the average I/O counts per 100 rectangles inserted for the LPR-tree. The graph is plotted for every million rectangles inserted. The sudden increase on inserting the fourth million rectangles is because of the rebuilding of the tree. But on average, over the 5 million rectangles inserted, we see in figure 5.12 that there are around 23 I/O's per 100 rectangles inserted. There is a sharp deviation of around +16 I/O's from the best case. However, if we look at the insertion cost per rectangle, this is not much. Once again there is negligible deviation in results across dataset types, with the TIGER dataset performing the worst. This is as expected, as the insertion algorithm basically performs bulk loading, which is insensitive to variations in dataset types.

Figure 5.10: Insertion I/O's - LPR tree.
Figure 5.11 shows the average I/O counts per 100 rectangles inserted for the R∗-tree. There is quite a bit of variation with the nature of the distribution, with the TIGER dataset giving the worst performance. The I/O count comparison between the LPR and the R∗-tree shown in figure 5.12 is interesting. It clearly shows that insertion in the LPR-tree outperforms the R∗-tree by a really large amount. The poor R∗-tree performance can be attributed to the really dense datasets; the cost of re-inserting 30% of the rectangles from an overflowing node seems to be quite high, even though this is done once per level.

Figure 5.11: Insertion I/O's - R∗ tree.

Figure 5.12: Insertion Average I/O's - LPR tree vs. R∗ tree.
5.5 Deletion

Experiment Description
Hypothesis 5.5.1. The average number of I/O's incurred per rectangle deleted in the LPR-tree reduces on average as rectangles get deleted. The same is true for the CPU time spent per deletion, except when the tree gets bulk loaded, during which the CPU time taken is significantly higher.
Hypothesis 5.5.2. The average (amortized) number of I/O's per deletion on the LPR-tree is better than the average (amortized) number of I/O's per deletion on the R∗-tree, on similar datasets and under similar conditions.
Hypothesis 5.5.3. The results of the delete algorithm on the LPR-tree show little or no variation (on average) with the nature of the distribution of the data rectangles.
Procedure
We load the LPR-tree containing 4 million rectangles. We insert the rectangles in Uni1 i, where i ∈ [1, 3]. This tree is saved and used as the base of the deletion experiments. We then successively delete the rectangles in Uni1 i, where i ∈ [1, 3]. We measure the average number of I/O's after every deletion of a Uni1 i. We expect deletions to take more I/O's than insertions on average, because of the reorganization resulting from replenishing under-full nodes. Also, a peak in I/O's is expected when half the rectangles get deleted. This is due to the cleanup/rebuilding of the tree. However, the average number of I/O's per deletion is expected to decrease as rectangles get deleted. This is due to the fact that the tree is becoming smaller in size.
Results
The results of the CPU time measurements for the LPR-tree are depicted in figure 5.13. In line with the other results, there seems to be very little effect of the variation of the distribution of rectangles in the datasets. The first million rectangles deleted take around 700 seconds. The second million rectangles take more CPU time, as cost is incurred in the reconstruction of the tree. The deletion of the third million rectangles seems to go faster than the first million. This can be explained by the fact that the tree is smaller than the tree from which the first million rectangles were deleted.
Figure 5.13: Deletion CPU time - LPR-tree.

Figure 5.14 shows the CPU time measurements for the R∗-tree. The CPU time taken to delete 1 million rectangles is on average 1200 seconds for the normal and TIGER datasets. An equivalent experiment on the uniform datasets failed to complete even after 4.5 hours; hence the data shown here does not have statistics for this dataset. In comparison to the R∗-tree, the deletion CPU time on the LPR-tree shows less variation over the different types of datasets. It also shows that deletion in the LPR-tree can very often be expected to be much faster than in the R∗-tree. An overall average comparison of deletion times, shown in figure 5.15, shows the LPR-tree having an advantage over the R∗-tree.

Figure 5.14: Deletion CPU time - R∗ tree.
The I/O's incurred per deletion for the LPR-tree are shown in figure 5.16. Interestingly, the average counts per deletion go down with every million rectangles deleted. Also, the cost incurred in restructuring is not much. This is because the tree is bulk loaded with only 2 million rectangles. It is hard to explain why there is no variation in the I/O counts with variation in the dataset distribution. The best guess one can make is that the rectangles are very dense in the dataset, and every deletion seems to have an equal amount of effect with respect to replenishing nodes. Deletion is more expensive than insertion in terms of I/O's, and this is clear from the results.

Figure 5.15: Deletion average CPU time - LPR tree vs. R∗ tree.

Figure 5.16: Delete I/O - LPR-tree.

The I/O's incurred per deletion for the R∗-tree are shown in figure 5.17. The R∗-tree apparently performs slightly better on the TIGER dataset than on the NORMAL dataset. In contrast with the LPR-tree, where the cost is reduced for every million rectangles deleted (across datasets), the R∗-tree does not show any predictable behavior, with I/O's staying at almost the same level.
The comparison of average I/O's (figure 5.18) clearly shows that the R∗-tree needs twice the number of I/O's on average compared to the LPR-tree to perform a deletion.

Figure 5.17: Deletion I/O - R∗ tree.

Figure 5.18: Deletion average I/O's - LPR tree vs. R∗ tree.
5.6 Query

Experiment Description
Hypothesis 5.6.1. LPR-tree updates (insertions and deletions) should not affect the query performance. In other words, the query performance in terms of I/O's and CPU time should be close to the query performance after a bulk load.
Hypothesis 5.6.2. The average number of I/O's incurred per output rectangle reported, in an LPR-tree, is better than in the R∗-tree on similar datasets and under similar conditions.
Hypothesis 5.6.3. The R∗-tree performs best on the NORMAL dataset (squares) and slightly worse on the UNIFORM and TIGER datasets.
Procedure
The LPR-tree is bulk loaded with Uni1.5. For each of the UNIFORM, NORMAL and TIGER datasets, 1000 randomly generated window queries are performed on this tree. Each rectangle representing a window query is generated as follows:

Dataset Identifier   Window query ρ
UNIFORM              Center(ρ) = (x, y), where x and y are uniformly generated at
                     random such that 0.3 ≤ x, y ≤ 0.7, with the current system time
                     used as the seed for randomization.
                     Width(ρ) = UniformRandom (u, v), where u and v are numbers
                     uniformly chosen at random such that 0.2 ≤ u < v ≤ 0.4. Height
                     is also chosen in a similar manner.
NORMAL               Center(ρ) = (x, y), where x and y are generated at random from
                     a normal distribution with µ = 0.50, σ = 0.25, with the current
                     system time used as the seed for randomization.
                     Width(ρ) = Height(ρ) = UniformRandom (0.001, 0.1).
TIGER                ρ = ComputeMinimumBoundingBox (s), where s is a set of
                     rectangles chosen from the Tig1 2 stream such that
                     s_i = Tig1 2_{k+j_i}, with j_i = (Σ_{q=0..i−1} j_q) +
                     UniformRandom (0, 10). k is chosen at uniformly distributed
                     offsets for each query rectangle.

Then Uni 3 and Uni 4 are inserted, with the 1000 random queries being performed in between these insertions. This is followed by the deletion of Uni 3 and Uni 4. Once again, queries are performed before and after the deletions. Finally, Uni 3 and Uni 4 are re-inserted, with queries performed before and after the insertions. Similar experiments are carried out with the NORMAL and TIGER datasets. The same set of experiments is repeated for the R∗-tree.
Results
Figure 5.19 shows the query CPU time per B rectangles reported for the LPR and the R∗-trees. Clearly, the LPR-tree performs much better than the R∗-tree for all types of distributions, with query times being two to three times better for the LPR-tree. The variation between distributions can be explained by the fact that the queries for the different distributions return very different output sizes. For instance, the uniform datasets report a very good CPU time per output rectangle, because their output size on average is quite big (more than 100000) compared to the normal dataset (around 2000-3000). This shows that the query cost gets amortized when a large number of queries report large output sizes. The same variation with respect to datasets can be seen in the R∗-tree. The implementation of deletion in the R∗-tree hangs while deleting the uniform dataset, so there are no statistics for this dataset. A last interesting observation is that the query CPU time seems to remain the same over the interleaved insertions and deletions, which was expected for the LPR-tree. However, in the case of the R∗-tree, the query time deteriorates after deletions, especially for the NORMAL dataset. This can be attributed to the same kind of increase in I/O's for this dataset.
Figure 5.20 shows the I/O's per B rectangles reported for the LPR and the R∗-tree. As with the time statistics, the LPR-tree outperforms the R∗-tree by a good margin across all datasets. There is very little effect of the updates on the query performance, with the graph showing a near-horizontal line. To answer the same queries, the R∗-tree requires three to six times more I/O's compared to the LPR-tree. The interleaved updates once again seem to make query costs expensive for the R∗-tree on the NORMAL dataset.
It is very difficult to verify the theoretical worst-case query guarantees of the LPR-tree experimentally. However, for all the queries performed on the LPR-trees across all the datasets, I plot the theoretical value √(N/B) + T/B, using the average output size T and the average number of I/O's to answer these queries (figure 5.21). Then I plotted the experimentally measured I/O's on a different scale on the same graph. Adjusting the scale of the experimental results, it was possible to figure out the worst performing set of queries, which is point 11 on this graph. When the scale is adjusted, the experimental curve lies completely below the theoretical curve. As we know that the R∗-tree takes more I/O's to answer these sets of queries, the worst-case behavior of the LPR-tree is closer to the worst-case optimal number of I/O's than that of the R∗-tree.
Figure 5.19: Query CPU time (in msec) per B rectangles output.

Figure 5.20: Query I/O's per B rectangles output.

Figure 5.21: Empirical Analysis - Theoretical vs. Experimental query I/O results for LPR-tree.
Chapter 6

Conclusions

From the experiments and the results obtained, the following conclusions can be made.
– The R∗-tree is more efficient in terms of I/O and time for constructing static R-tree structures. However, the query performance of the LPR-tree is much better than that of the R∗-tree.
– The update algorithms of the LPR-tree outperform the R∗-tree by a large amount in terms of I/O's, while insertion and deletion times are quite close on average. Considering that the LPR-tree performs the updates faster most of the time, with the rebuilding cost getting amortized, one would still prefer the LPR-tree over the R∗-tree.
– Query performance for the LPR-tree remains very good under interleaved insertions and deletions, and here the LPR-tree definitely has an edge over the R∗-tree.
– Based on the above observations, the LPR-tree should be preferred over the R∗-tree when indexing data that is frequently updated. For static data, if query performance is more important than indexing time, the LPR-tree is still preferable over the R∗-tree; otherwise the R∗-tree is better. In other words, for static R-tree structures over relatively small datasets (less than 1 million data objects), the R∗-tree is preferable.
– Experimental verification of the query guarantees shows a single set of queries that has worse performance than all the remaining queries. As we know that the R∗-tree takes more I/O's to answer these queries, the worst-case behavior of the LPR-tree is closer to the worst-case optimal number of I/O's than that of the R∗-tree.
Appendix A

Tables of Experimental Results

A.1 Bulk Load

A.1.1 LPR-tree
Dataset  Blk Reads^a  Blk Writes^b  Strm Reads^c  Strm Writes^d  I/O's    Time(sec)
Uni1 1   1062         1731          20892         10048          33733    143
Uni4     4314         6960          122763        67626          201663   859
Uni8     8678         12984         285043        160675         467380   2247
Nor1 1   1102         1783          20903         10054          33842    138
Nor4     4311         6902          126850        68017          206080   882
Nor8     8589         13776         288579        160596         471540   2223
Tig1 1   1072         1746          21999         10049          34866    133
Tig4     4379         7085          132221        69765          213450   855
Tig8     8642         13879         285393        159672         467586   2047

a Number of AMI block objects read from the external memory
b Number of AMI block objects written to the external memory
c Number of block reads incurred while reading from an AMI stream object
d Number of block writes incurred while writing to an AMI stream object
A.1.2 R∗-tree

Dataset  Blk Reads  Blk Writes  Strm Reads  Strm Writes  I/O's   Time(sec)
Uni1 1   1035       2072        3053        1832         7992    65
Uni4     4168       8338        12216       7331         32053   262
Uni8     8361       16724       24431       14663        64179   588
Nor1 1   1352       2706        3054        1833         8945    67
Nor4     5472       10946       12216       7332         35966   302
Nor8     11088      22178       24433       14665        72364   809
Tig1 1   1233       2468        3054        1833         8588    58
Tig4     4837       9676        12216       7332         34061   314
Tig8     9845       19692       24433       14664        68634   694
A.2 Insertion

A.2.1 Insertion Time

Dataset      Dataset inserted (mln)  Time(sec)
lpr uni4^a   uni1 1                  363
             uni1 2                  529
             uni1 3                  397
             uni1 4                  2839
             uni1 5                  378
lpr nor4     nor1 1                  347
             nor1 2                  505
             nor1 3                  386
             nor1 4                  2795
             nor1 5                  338
lpr tig4     tig1 1                  371
             tig1 2                  531
             tig1 3                  408
             tig1 4                  3136
             tig1 5                  367
r* uni4^b    uni1 1                  767
             uni1 2                  583
             uni1 3                  840
             uni1 4                  945
             uni1 5                  1010
r* nor4      nor1 1                  593
             nor1 2                  655
             nor1 3                  1090
             nor1 4                  1212.5
             nor1 5                  565
r* tig4      tig1 1                  782.5
             tig1 2                  850
             tig1 3                  787.5
             tig1 4                  787.5
             tig1 5                  767.5

a LPR-tree on the Uni4 dataset
b R∗-tree on the Uni4 dataset
A.2.2 Insertion I/O's

Dataset    Dataset inserted   Blk Reads   Blk Writes   Strm Reads   Strm Writes   I/O per 100
lpr uni4   uni1 1                  9807        10297        45396         22962        8.8462
           uni1 2                 21725        22750       118024         61834       13.5871
           uni1 3                 31993        33588       168995         87616        9.7859
           uni1 4                 66509        69094       582407        324700       72.0518
           uni1 5                 76207        79359       627216        347357        8.7429
lpr nor4   nor1 1                  9918        10408        45327         22904        8.8557
           nor1 2                 22013        23059       119834         62764       13.9113
           nor1 3                 32385        33999       170680         88467        9.7861
           nor1 4                 66983        69488       587696        326224       72.486
           nor1 5                 76859        79975       632260        348717        8.742
lpr tig4   tig1 1                  9884        10370        45360         22925        8.8539
           tig1 2                 21995        23052       121660         62376       14.0544
           tig1 3                 32355        33976       173206         88046        9.85
           tig1 4                 67051        69596       619578        330984       75.9626
           tig1 5                 76885        79997       665682        353791        8.9146
r* uni4    uni1 1                5878028      5878420         611             1     1175.706
           uni1 2                6098508      6099128         613             2     1219.8251
           uni1 3                6521546      6522337         612             2     1304.4497
           uni1 4                6828814      6829608         613             2     1365.9037
           uni1 5                6735878      6736691         612             2     1347.3183
r* nor4    nor1 1                5409684      5410228         612             2     1082.0526
           nor1 2                6162222      6163002         612             1     1232.5837
           nor1 3                6723870      6724688         612             2     1344.9172
           nor1 4                6155400      6156271         612             2     1231.2285
           nor1 5                5971599      5972418         613             2     1194.4632
r* tig4    tig1 1                9127032      9127874         613             2     1825.5521
           tig1 2                8311476      8312339         613             3     1662.4431
           tig1 3                8633678      8634539         613             3     1726.8833
           tig1 4                7828134      7828961         613             2     1565.771
           tig1 5                7477655      7478448         613             2     1495.6718
A.3 Deletion

A.3.1 Deletion I/O's and time

Dataset    Dataset deleted   Blk I/O's   Strm Reads   Strm Writes   Time(sec)   I/O per 100
lpr uni4   uni1 1              6265945         6437          9359        720       6.281741
           uni1 2             11823870       177900        116444       1680       5.836473
           uni1 3             16244805       177800        116000        515       4.420391
lpr nor4   nor1 1              6452651         5610          9907        744       6.468168
           nor1 2             12065165       177882        115540       1674       5.890419
           nor1 3             16514678       178493        115540        523       4.450124
lpr tig4   tig1 1              6266815         6537         10359        700       6.283711
           tig1 2             11825313       177279        116567       1682       5.835448
           tig1 3             16246248       177890        116567        518       4.421546
r* uni4    uni1 1             12403475          611             0       1272.5    12.403475
           uni1 2             11967682          611             1       1272.5    11.967682
           uni1 3             13588870          610             1       1360      13.58887
r* nor4    nor1 1              8692591          611             1        912.5     8.692591
           nor1 2              9992747          611             1        895       9.992747
           nor1 3              8590855          611             1        912.5     8.590855
r* tig4    tig1 1                    -            -             -          -              -
           tig1 2                    -            -             -          -              -
           tig1 3                    -            -             -          -              -
A.4 Query

A.4.1 LPR-tree

Dataset   Action    Avg |OP|     I/O's   Time(sec)   I/O per OP a
Nor       BL b       1711.11     12864      12.3      7.51792696
          Ins1 c     2852.05     22082      18.84     7.742501008
          Ins2 d     3995.42     18054      14.078    4.518673882
          Del1 e     2854.49     18040      13.81     6.319867997
          Del2 f     1711.11     13278      10.53     7.759875169
          Ins1 g     2852.05     22550      19.9      7.906593503
          Ins2 h     3995.42     18414      13.98     4.60877705
Uni       BL        81719.47    143384      11        1.754587983
          Ins1     136112.5     231690      26.563    1.702194876
          Ins2     190476.5     279534      43.5      1.467551115
          Del1     136083.5     279408      37.84     2.053209978
          Del2      81719.47    183310      25.89     2.243161881
          Ins1     136112.5     272290      43.54     2.000477546
          Ins2     190476.5     317856      44.08     1.668741288
Tig       BL         6749.02     17449      14.37     2.585412401
          Ins1       6749.02     27996      14.1      4.148157807
          Ins2       6749.02     23378      12.57     3.463910316
          Del1       6749.02     23158      13.6      3.431312991
          Del2       6749.02     19796      11.12     2.933166593
          Ins1       6749.02     20379      10.31     3.019549505
          Ins2       6749.02     24818      13.85     3.677274627

a OP is the number of rectangles returned by the window query.
b BL - Queries after bulk loading nor1.5
c Ins1 - Queries after inserting nor1 1
d Ins2 - Queries after inserting nor1 2
e Del1 - Queries after deleting nor1 1
f Del2 - Queries after deleting nor1 2
g Ins1 - Queries after inserting nor1 1
h Ins2 - Queries after inserting nor1 2
A.4.2 R∗-tree

Dataset   Action    Avg |OP|     I/O's   Time(sec)   I/O per OP a
Nor       BL         1711.11     68560      16       40.06755849
          Ins1       2852.05     68417      27       23.98870988
          Ins2       3995.42     68573      38       17.16290152
          Del1       2854.49     76195      36       26.69303448
          Del2       1711.11     68158      27       39.83262327
          Ins1       2852.05     73538      27       25.78426044
          Ins2       3995.42     74391      59       18.61906883
Uni       BL        81719.47    300226     304        3.673861321
          Ins1     136112.5     269338     430        1.978789604
          Ins2     190476.5     249673     372        1.31078112
          Del1     136083.5          -       -                 -
          Del2      81719.47         -       -                 -
          Ins1     136112.5          -       -                 -
          Ins2     190476.5          -       -                 -
Tig       BL         6749.02     84970      42.5     12.58997603
          Ins1       6749.02     86201      37.5     12.77237288
          Ins2       6749.02     88053      40       13.04678309
          Del1       6749.02     78794      35       11.6748802
          Del2       6749.02     78794      40       11.6748802
          Ins1       6749.02     77752      37.5     11.52048742
          Ins2       6749.02     77752      35       11.52048742

a OP is the number of rectangles returned by the window query.
Appendix B
Brief Introduction to TPIE
TPIE [4, 6] is a software environment (written in C++) that facilitates the implementation
of I/O-efficient algorithms. The goal of theoretical work in the area of external-memory
algorithms (also called I/O algorithms or out-of-core algorithms) has been to develop
algorithms that minimize the I/O (i.e., the transfer of data between main memory and
disk) performed when solving problems on very large data sets. The TPIE library consists
of a kernel and a set of I/O-efficient algorithms and data structures implemented on top
of the kernel. Most of the functionality is provided as templated classes and functions
in C++.

The following are some of the important structures of TPIE used in the implementation
of the LPR-tree:
AMI stream

AMI stream is a templated class that stores a list of user-defined objects in
external memory. A stream provides interfaces to read and write items. Dedicated
streams such as the TwoDRectangleStream are AMI stream objects parameterized
with the object type (such as TwoDRectangle) they hold.

TPIE provides a sorting algorithm, AMI sort, for streams. Given a comparison
function or object, it sorts the stream using the external-memory merge sort
algorithm. A minimal usage sketch is given below.
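The following sketch illustrates this interface, assuming TPIE's classic AMI API
as described in [4] and the TPIE manual (AMI_STREAM, read_item, write_item, and
AMI_sort with a comparison object). The rectangle type and the comparison class
are hypothetical stand-ins, not the types of the thesis implementation, and the
build flags that select a BTE stream implementation are omitted.

    // Sketch: writing, sorting, and scanning an AMI stream of rectangles.
    // Error handling is abbreviated to keep the example short.
    #include <ami_stream.h>
    #include <ami_sort.h>

    struct TwoDRectangle {            // hypothetical stand-in type
        double xlo, ylo, xhi, yhi;
    };

    // Comparison object for AMI_sort: order by the lower x-coordinate.
    class CompareXlo {
    public:
        int compare(const TwoDRectangle &a, const TwoDRectangle &b) {
            if (a.xlo < b.xlo) return -1;
            if (a.xlo > b.xlo) return +1;
            return 0;
        }
    };

    int example() {
        AMI_STREAM<TwoDRectangle> in, out;

        // Append items to the stream; TPIE buffers them into blocks.
        for (int i = 0; i < 1000; ++i) {
            TwoDRectangle r = { 1000.0 - i, 0.0, 1001.0 - i, 1.0 };
            if (in.write_item(r) != AMI_ERROR_NO_ERROR) return -1;
        }

        // External-memory merge sort of the whole stream.
        CompareXlo cmp;
        if (AMI_sort(&in, &out, &cmp) != AMI_ERROR_NO_ERROR) return -1;

        // Sequentially scan the sorted stream.
        out.seek(0);
        TwoDRectangle *r;
        while (out.read_item(&r) == AMI_ERROR_NO_ERROR) {
            // process *r in sorted order of xlo
        }
        return 0;
    }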
AMI block

AMI block is a templated class that represents a logical block: the unit of data
transferred between external memory and main memory. A block can store data
objects and hold links to other blocks. It also provides an information structure
to hold bookkeeping data such as the number of items allocated. Given the type of
the data objects to be stored and the number of links, the maximum number of data
objects that still fit determines the capacity of the block. Blocks are used in
the implementation of the LPR-tree to store the priority nodes and the internal
KD-node data structures. The capacity arithmetic is sketched below.
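The capacity follows from simple arithmetic: whatever remains of a block after the
information structure and the links is available for data objects. The sketch below
is illustrative only; the exact per-block bookkeeping inside TPIE may differ
slightly, and the sizes in the example are made up.

    #include <cstddef>

    // Capacity of a logical block of B bytes that must also hold an info
    // structure and a fixed number of links (block identifiers).
    std::size_t block_capacity(std::size_t B, std::size_t info_size,
                               std::size_t links, std::size_t link_size,
                               std::size_t object_size)
    {
        std::size_t overhead = info_size + links * link_size;
        return (B - overhead) / object_size;  // whole objects that still fit
    }

    // Example: a 4096-byte block, a 16-byte info structure, 2 links of
    // 8 bytes each, and 32-byte rectangles:
    //   (4096 - 16 - 2*8) / 32 = 127 objects per block.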
AMI collection

AMI collection is a class that represents a collection of AMI block objects. Any
block in a collection can be identified and retrieved using the unique block
identifier associated with it. Collections give the programmer control over the
data layout strategies required by many I/O-efficient algorithms. The LPR-tree is
in fact one such block collection. A sketch of the interplay between blocks and
collections follows.
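The sketch below follows the AMI interface described in [4]: the member names el,
lk, info(), and bid() are taken from that paper, while the info and element types
here are hypothetical, BTE_COLLECTION stands for the block-transfer engine chosen
at build time, and the persistence details are simplified.

    #include <ami_coll.h>
    #include <ami_block.h>

    struct NodeInfo { unsigned int num_items; };   // hypothetical info structure
    struct TwoDRectangle { double xlo, ylo, xhi, yhi; };

    AMI_bid store_one_block(AMI_collection_single<BTE_COLLECTION> *coll)
    {
        const size_t links = 2;   // this block keeps identifiers of two children

        // Passing no block identifier allocates a fresh block in the collection.
        AMI_block<TwoDRectangle, NodeInfo> *blk =
            new AMI_block<TwoDRectangle, NodeInfo>(coll, links);

        blk->info()->num_items = 1;   // bookkeeping in the info structure
        blk->el[0].xlo = blk->el[0].ylo = 0.0;   // store one data object
        blk->el[0].xhi = blk->el[0].yhi = 1.0;
        blk->lk[0] = 0;               // no child blocks yet
        blk->lk[1] = 0;

        AMI_bid id = blk->bid();      // unique identifier within the collection
        delete blk;                   // block is written back to the collection
        return id;   // later, AMI_block(coll, links, id) retrieves the block
    }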
Bibliography
[1] Chuan-Heng Ang, S. T. Tan, and T. C. Tan. Bitmap R-trees. Informatica (Slovenia), 24(2), 2000.

[2] Lars Arge, Klaus H. Hinrichs, Jan Vahrenhold, and Jeffrey Scott Vitter. Efficient bulk operations on dynamic R-trees. Algorithmica, 33, 2002.

[3] Lars Arge, Mark de Berg, Herman J. Haverkort, and Ke Yi. The priority R-tree: A practically efficient and worst-case optimal R-tree. In SIGMOD, pages 347–358, 2004.

[4] Lars Arge, Octavian Procopiuc, and Jeffrey Scott Vitter. Implementing I/O-efficient data structures using TPIE. In ESA, pages 88–100, 2002.

[5] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Hector Garcia-Molina and H. V. Jagadish, editors, SIGMOD, pages 322–331. ACM Press, 1990.

[6] TPIE distribution. http://www.cs.duke.edu/tpie.

[7] Yvan J. García, Mario A. López, and Scott T. Leutenegger. A greedy algorithm for bulk loading R-trees. In ACM-GIS, pages 163–164, 1998.

[8] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In Beatrice Yormark, editor, SIGMOD, pages 47–57, Boston, Massachusetts, June 1984.

[9] Ibrahim Kamel and Christos Faloutsos. On packing R-trees. In CIKM, pages 490–499, 1993.

[10] Ibrahim Kamel and Christos Faloutsos. Hilbert R-tree: An improved R-tree using fractals. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, VLDB, Proceedings of 20th International Conference on Very Large Data Bases, Santiago de Chile, pages 500–509. Morgan Kaufmann, 1994.

[11] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects. In VLDB, pages 507–518, Brighton, England, 1987.

[12] Y. Manolopoulos, A. Nanopoulos, A. N. Papadopoulos, and Y. Theodoridis. R-trees: Theory and applications. Springer, 2006.