The Problem of Indexing Mobile Objects
by
Hans Jørgen Ekker
Per Gunnar Ulveseth
Mattis Vidnes
Term paper in GIS 300 –
Geographical Information Systems
Department of Mapping Science (IKF)
Agricultural University of Norway (NLH)
Autumn 2001
Table of Contents
1 INTRODUCTION
2 DIFFERENT APPROACHES
2.1 TIME-PARAMETERIZED R-TREE (TPR-TREE)
2.2 THE B-TREE
2.2.1 The B+-tree
2.3 THE KD-TREE APPROACH
2.4 VARIANTS OF THE KD-TREE
2.4.1 The hB-tree
2.4.2 How to make the hB-tree a -tree
2.5 THE LSD-TREE
2.6 THE QUADTREE APPROACH
3 COMPARISON
4 DISCUSSION
5 REFERENCES
1 Introduction
In traditional database management systems (DBMSs), data is stored in the database and
remains constant unless explicitly modified through an update. While this approach is well
suited for many applications where data change in discrete steps, it is not suitable for
applications with constantly changing data. The storage of mobile objects in a spatio-temporal
database is one such application. Mobile objects are objects that continuously change their
location (e.g. cars, trains, air-traffic control, mobile communication systems, and military
units). Thus with a traditional DBMS the database would have to update its information at
every unit of time. This is of course not efficient due to an enormous update overhead. The
simulation of an object's location as a function of time f(t) is thought to be a much better
approach. Doing this the database only has to be updated every time the parameters of the
function f(t) change (e.g. a change of speed or direction). This approach will lower the update
overhead. However, the approach will also introduce new problems (e.g. how to find
appropriate data models and optimization techniques) since the database is storing a
function, not data values.
This term paper considers databases that keep track of mobile objects moving in one and two dimensions.
The objects are modelled as points that move with a constant velocity, starting from a specific
location at a specific time instant. Given this information, the location of an object can be
computed at any time in the future, as long as its movement characteristics remain the same.
The objects are responsible for updating their motion information every time their direction or
speed changes. Furthermore, the objects can move inside a finite terrain, but when an object
reaches the limits of the terrain, it has to issue an update (either because it is deleted or
reflected). Finally, the system is dynamic, i.e. it is allowed to insert new objects and to delete
old ones. Figure 1 gives a geometric representation of the problem by showing the trajectories
and a query in the space-time (t, y) plane.
Figure 1: Trajectories and query in the space-time (t, y) plane. Labels in the figure: objects O1 to O4; 0, yq1, yq2 and ymax on the y axis; t1 and t2 on the t axis.
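The motion model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `MovingPoint` and `in_range_during` are hypothetical and not taken from [KGT99], and the query test is a brute-force check on a single object rather than an index lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MovingPoint:
    """1-D object with constant velocity (hypothetical helper type)."""
    y0: float  # position at the reference time t0
    v: float   # constant velocity
    t0: float  # time of the last motion update

    def position(self, t: float) -> float:
        # Linear motion: valid as long as speed and direction do not change.
        return self.y0 + self.v * (t - self.t0)


def in_range_during(p: MovingPoint, y_lo: float, y_hi: float,
                    t1: float, t2: float) -> bool:
    """True if p is inside [y_lo, y_hi] at some time in [t1, t2]."""
    a, b = p.position(t1), p.position(t2)
    # With linear motion the positions visited during [t1, t2] form
    # exactly the interval [min(a, b), max(a, b)].
    return min(a, b) <= y_hi and max(a, b) >= y_lo
```

For example, an object starting at y = 0 with velocity 1 satisfies a query with yq1 = 3, yq2 = 4 over the time interval [0, 10], because it passes through that range between t = 3 and t = 4.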
Generally, an object may move anywhere in the 3-dimensional space with complex motions.
However, the term paper limits the treatment to objects in 1- and 2-dimensional space and
whose location is described by a linear function of time. The motivation for such an approach
is twofold [KGT99]. First, it is well grounded in real-world
applications (e.g. cars move in networks of highways which can be approximated by
connected straight-line segments on a plane). Second, solving simple 1- and 2-dimensional
problems may provide intuition for how to address the more difficult problem of indexing
general multidimensional functions.
This term paper focuses on the problem of indexing mobile objects. We chose to describe a
selection of methods of indexing that have been studied in the context of spatio-temporal
databases. Our selection includes variants of several tree-structures (e.g. R-tree, B-tree, kd-tree, and quad-tree). After describing the methods of indexing we discuss the performance of
some of them. Finally, we address the current trend in indexing mobile objects.
2 Different approaches
2.1 Time-parameterized R-tree (TPR-tree)
Index structure
The TPR-tree is a balanced, multi-way tree with the structure of an R-tree. Entries in leaf
nodes are pairs of position of a moving point object and a pointer to the moving point object,
and entries in internal nodes are pairs of a pointer to a subtree and a rectangle that bounds the
positions of all moving point objects or other bounding rectangles in that subtree [SJLL99].
The position of a moving point object is represented by a reference position and a
corresponding velocity vector (x,v) in the one-dimensional case, where x = x(tref). We choose
tref to be equal to the index load time, tl [SJLL99].
In the TPR-tree, the bounding rectangles in the tree are functions of time, as are the moving
points being indexed. Intuitively, the bounding rectangles are capable of continuously
following the enclosed data points or other rectangles as these move. Like the R-trees, the
new index is capable of indexing points in one-, two-, and three-dimensional space [SJLL99].
To bound a group of d-dimensional moving points, d-dimensional bounding rectangles are
used that are also time-parameterized, i.e., their coordinates are functions of time. A time-parameterized bounding rectangle bounds all enclosed points or rectangles at all times not
earlier than the current time [SJLL99].
A trade-off exists between how tightly a bounding rectangle bounds the enclosed moving
points or rectangles across time and the storage needed to capture the bounding rectangles. It
would be ideal to employ time-parameterized bounding rectangles that are always minimum,
but the storage cost appears to be excessive [SJLL99].
Instead of using true, always minimum bounding rectangles, the TPR-tree employs
"conservative" bounding rectangles, which are minimum at some point, possibly (and most
likely!) not at later times. In the one-dimensional case, the lower bound of a conservative
interval is set to move with the minimum speed of the enclosed points, while the upper bound
is set to move with the maximum speed of the enclosed points (speeds are negative or positive
depending on the direction). This ensures that conservative bounding intervals are indeed
bounding for all times considered [SJLL99].
Figure 2: Conservative (Dashed) Versus Always Minimum (Solid) Bounding Intervals
Figure 2 illustrates conservative bounding intervals. The left-hand side of the conservative
interval in the figure starts at the position of object A at time 0 and moves left at the speed of
object B, and the right-hand side of the interval starts at the position of object B at time 0 and
moves right at the speed of object A. The figure also shows that, in the worst case, such an
interval may grow to become larger than its corresponding minimum bounding interval by up
to twice its initial length. In the figure, the conservative bounding interval at time 3 has length
11, which is exactly twice its length at time 0 plus the length of its corresponding minimum
bounding interval at time 3. It is worth noting that conservative bounding intervals never
shrink. In the best case, when all of the enclosed points have the same velocity vector, a
conservative bounding interval has constant size, although it may move [SJLL99].
Following the representation of moving points, we let tref = tl and capture a one-dimensional
time-parameterized bounding interval
$[x^{\vdash}(t),\, x^{\dashv}(t)] = [x^{\vdash}(t_l) + v^{\vdash}(t - t_l),\; x^{\dashv}(t_l) + v^{\dashv}(t - t_l)]$
as (x├, x┤, v├, v┤), where ├ and ┤ (⊢ and ⊣ in the displayed formulas) mark lower and upper bounds:
$x^{\vdash} = x^{\vdash}(t_l) = \min_i \{o_i.x^{\vdash}(t_l)\}$
$x^{\dashv} = x^{\dashv}(t_l) = \max_i \{o_i.x^{\dashv}(t_l)\}$
$v^{\vdash} = \min_i \{o_i.v^{\vdash}\}$
$v^{\dashv} = \max_i \{o_i.v^{\dashv}\}$
Here the oi range over the bounding rectangles to be enclosed. If instead the bounding interval
being defined is to bound moving points, the oi range over these points, oi.x├(tl) and oi.x┤(tl)
are replaced by oi.x(tl), and oi.v├ and oi.v┤ are replaced by oi.v [SJLL99].
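As a minimal sketch of these definitions for the point case, the following Python computes a load-time conservative bounding interval and compares it with the always-minimum interval. The function names are hypothetical, and the two sample points are illustrative values chosen so that the conservative interval has length 11 at time 3, as in the text; they are not read off Figure 2.

```python
def conservative_interval(points):
    """points: list of (x_at_t_l, v). Returns (x_lo, x_hi, v_lo, v_hi)."""
    xs = [x for x, _ in points]
    vs = [v for _, v in points]
    # Lower bound starts at the smallest position and moves with the
    # minimum velocity; upper bound starts at the largest position and
    # moves with the maximum velocity.
    return min(xs), max(xs), min(vs), max(vs)


def interval_at(tpbi, t, t_l=0.0):
    """Evaluate a time-parameterized bounding interval at time t >= t_l."""
    x_lo, x_hi, v_lo, v_hi = tpbi
    return x_lo + v_lo * (t - t_l), x_hi + v_hi * (t - t_l)


def minimum_interval_at(points, t, t_l=0.0):
    """Always-minimum bounding interval of the points at time t."""
    pos = [x + v * (t - t_l) for x, v in points]
    return min(pos), max(pos)


# Illustrative data: A starts at 0 moving right, B starts at 2 moving left.
A, B = (0.0, 1.0), (2.0, -2.0)
tpbi = conservative_interval([A, B])
lo, hi = interval_at(tpbi, 3.0)
m_lo, m_hi = minimum_interval_at([A, B], 3.0)
```

With these values the conservative interval at time 3 is [-6, 5] (length 11), while the minimum interval is [-4, 3] (length 7), so the conservative one exceeds the minimum by twice the initial length of 2.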
The rectangles defined above are termed load-time bounding rectangles and are bounding for
all times not before tl. Because the rectangles never shrink, but may actually grow too much,
it is desirable to be able to adjust them occasionally. As the index is only queried for times
greater than or equal to the current time, it follows that it is attractive to adjust the bounding
rectangles every time any of the moving points or rectangles that they bound are updated. The
following formulas specify the adjustments to the bounding rectangles that may be made
during updates.
$x^{\vdash} = \min_i \{o_i.x^{\vdash}(t_{upd})\} - v^{\vdash}(t_{upd} - t_l)$
$x^{\dashv} = \max_i \{o_i.x^{\dashv}(t_{upd})\} - v^{\dashv}(t_{upd} - t_l)$
Here, tupd is the time of the update, and the formulas may be restricted to apply to the
bounding of points rather than intervals, as above. Each formula involves five terms, which
may differ by orders of magnitude. Special care must be taken to manage the rounding errors
that may occur in the finite-precision floating-point arithmetic used for implementing the
formulas.
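The adjustment can be sketched as follows for the point case. This is an assumption-laden illustration, not the implementation of [SJLL99]: the function name is hypothetical, points are given as (position at tl, velocity), and no attention is paid to the rounding issues mentioned above.

```python
def update_time_interval(points, t_upd, t_l=0.0):
    """Tighten a bounding interval at time t_upd.

    points: list of (x_at_t_l, v).
    Returns (x_lo, x_hi, v_lo, v_hi) with positions still expressed
    relative to t_l, so the same evaluation formula keeps working.
    """
    dt = t_upd - t_l
    v_lo = min(v for _, v in points)
    v_hi = max(v for _, v in points)
    pos = [x + v * dt for x, v in points]  # actual positions at t_upd
    # Project the tight bounds at t_upd back to the reference time t_l.
    x_lo = min(pos) - v_lo * dt
    x_hi = max(pos) - v_hi * dt
    return x_lo, x_hi, v_lo, v_hi


def interval_at(tpbi, t, t_l=0.0):
    x_lo, x_hi, v_lo, v_hi = tpbi
    return x_lo + v_lo * (t - t_l), x_hi + v_hi * (t - t_l)


pts = [(0.0, 1.0), (2.0, -2.0)]          # illustrative values
adjusted = update_time_interval(pts, t_upd=3.0)
tight = interval_at(adjusted, 3.0)        # minimum at the update time
later = interval_at(adjusted, 4.0)        # still bounding afterwards
```

Note that the stored reference positions may cross (here x├ = 2 and x┤ = 0), which is harmless because the adjusted interval is only meant to be evaluated for times not before tupd.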
We call these rectangles update-time bounding rectangles. The two types of bounding
rectangles are shown in Figure 3. The bold top and bottom lines capture the load-time, time-parameterized bounding interval for the four moving objects represented by the four lines. At
time tupd, a narrower and thus better update-time interval is introduced that is bounding
from tupd and onwards [SJLL99].
Figure 3: Load-Time (Bold) and Update-Time (Dashed) Bounding Intervals for Four Moving Points
Figure 4: Intersection of an Interval and a Query
Querying
Queries may retrieve objects based on their positions at any future time points. But because
the positions as predicted at query time become less and less accurate as queries move into
the future, and because updates not known at query time may occur, queries far in the future
are likely to be of little value [SJLL99].
The queries retrieve all points with positions within specified regions. We distinguish between
three kinds of queries, based on the regions they specify.
Type 1 timeslice query: Q = (R,t) specifies a hyper-rectangle R located at time point t.
Type 2 window query: Q = (R,t├, t┤) specifies a hyper-rectangle R that covers the interval
[t├,t┤]. Stated differently, this query retrieves points with trajectories in (x,t)-space
crossing the (d+1)-dimensional hyper-rectangle ([a├1, a┤1], [a├2, a┤2], ... , [a├d, a┤d],
[t├, t┤]).
Type 3 moving query: Q = (R1,R2,t├, t┤) specifies the (d+1)-dimensional trapezoid obtained
by connecting R1 at time t├ to R2 at time t┤.
Answering a timeslice query proceeds as for the regular R-tree, the only difference being that
all bounding rectangles are computed for the time tq specified in the query before intersection
is checked. Thus, a bounding interval specified by (x├, x┤, v├, v┤) satisfies a query (([a├,
a┤]), tq), if and only if
$a^{\vdash} \le x^{\dashv} + v^{\dashv}(t_q - t_l) \;\wedge\; a^{\dashv} \ge x^{\vdash} + v^{\vdash}(t_q - t_l)$.
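The one-dimensional timeslice test amounts to an ordinary interval-overlap check after evaluating the bounding interval at the query time. A minimal sketch, with a hypothetical function name and a bounding interval given as the tuple (x_lo, x_hi, v_lo, v_hi):

```python
def satisfies_timeslice(tpbi, a_lo, a_hi, t_q, t_l=0.0):
    """True if the bounding interval overlaps [a_lo, a_hi] at time t_q."""
    x_lo, x_hi, v_lo, v_hi = tpbi
    dt = t_q - t_l
    # Two intervals overlap iff each one starts before the other ends.
    return a_lo <= x_hi + v_hi * dt and a_hi >= x_lo + v_lo * dt
```

In a tree traversal this predicate would decide whether a subtree needs to be visited; for d dimensions the same test would be applied to each dimension in turn.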
To answer window queries and moving queries, we need to be able to check if, in (x,t)-space,
the trapezoid of a query intersects with the trapezoid formed by the part of the trajectory of a
bounding rectangle that is between the start and end times of the query. With one spatial
dimension, this is relatively simple. For more dimensions, generic polyhedron-polyhedron
intersection tests may be used, but due to the restricted nature of this problem, a simpler and
more efficient algorithm may be devised [SJLL99].
Specifically, we provide an algorithm for checking if a d-dimensional time-parameterized
bounding rectangle R given by parameters (x├1, x┤1, x├2, x┤2, ... ,x├d, x┤d, v├1, v┤1, v├2, v┤2, ...
, v├d, v┤d) intersects a moving query Q = (([a├1, a┤1], [a├2, a┤2], ... , [a├d, a┤d], [w├1, w┤1],
[w├2, w┤2], ... , [w├d, w┤d]), t├, t┤). This formulation of a moving query as a time-parameterized rectangle with starting and ending times is more convenient than the definition
given earlier. The velocities w are obtained by subtracting R1 from R2 in the earlier definition
and then normalising them with the length of interval [t├, t┤] [SJLL99].
The algorithm is based on the observation that for two moving rectangles to intersect,
there has to be a time point when their extents intersect in each dimension. Thus, for
each dimension j (j = 1,2, ... ,d), the algorithm computes the time interval
$I_j = [t_j^{\vdash}, t_j^{\dashv}] \subseteq [t^{\vdash}, t^{\dashv}]$
when the extents of the rectangles intersect in that dimension. If
$I = \bigcap_{j=1}^{d} I_j = \emptyset$,
the moving rectangles do not intersect and an empty result is returned; otherwise,
the algorithm provides the time interval I when the rectangles intersect. The formulas
for the per-dimension intervals are given in [SJLL99].
To see how t├j and t┤j are computed, consider the case where Q is below R at t┤. Then
Q must not be below R at t├, as otherwise Q is always below R and there is no
intersection (the case of no intersection is already accounted for). This means that the
line a┤j + w┤j(t - t├) intersects the line x├j(t├) + v├j(t - t├) within the time interval [t├, t┤].
Solving for t gives the desired intersection time [SJLL99].
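The per-dimension reasoning can be sketched as below. This is not the case-by-case formulation of [SJLL99]; it is an equivalent restatement, under the assumption that all positions are given at the query start time, which solves the two linear inequalities "upper bound of R not below lower bound of Q" and "upper bound of Q not below lower bound of R" and intersects the resulting time intervals. All function names are hypothetical.

```python
def _nonneg_interval(c, m, t_lo, t_hi):
    """Times t in [t_lo, t_hi] with c + m * (t - t_lo) >= 0, or None."""
    if m == 0:
        return (t_lo, t_hi) if c >= 0 else None
    root = t_lo - c / m  # time at which the linear expression is zero
    if m > 0:
        lo, hi = max(t_lo, root), t_hi
    else:
        lo, hi = t_lo, min(t_hi, root)
    return (lo, hi) if lo <= hi else None


def overlap_interval_1d(r, q, t_lo, t_hi):
    """r = (x_lo, x_hi, v_lo, v_hi), q = (a_lo, a_hi, w_lo, w_hi),
    both with positions given at t_lo. Returns the time interval
    during which the two moving intervals overlap, or None."""
    x_lo, x_hi, v_lo, v_hi = r
    a_lo, a_hi, w_lo, w_hi = q
    first = _nonneg_interval(x_hi - a_lo, v_hi - w_lo, t_lo, t_hi)
    second = _nonneg_interval(a_hi - x_lo, w_hi - v_lo, t_lo, t_hi)
    if first is None or second is None:
        return None
    lo, hi = max(first[0], second[0]), min(first[1], second[1])
    return (lo, hi) if lo <= hi else None


def overlap_interval(rs, qs, t_lo, t_hi):
    """d-dimensional case: intersect the per-dimension time intervals."""
    lo, hi = t_lo, t_hi
    for r, q in zip(rs, qs):
        iv = overlap_interval_1d(r, q, t_lo, t_hi)
        if iv is None:
            return None
        lo, hi = max(lo, iv[0]), min(hi, iv[1])
        if lo > hi:
            return None
    return (lo, hi)
```

For instance, a static interval [0, 1] and a query interval starting at [3, 4] and moving down with speed 1 overlap exactly during the time interval [2, 4].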
Heuristics for Tree Organization
The values of three problem parameters affect the indexing problem and the qualities
of a TPR-tree. The first specifies exactly how far into the future queries may reach.
The second specifies for how long the index is to remain functional. The third is
simply the sum of the first two.
• Querying window (W): how far queries can "look" into the future. Thus, is(Q) ≤
t ≤ is(Q) + W for Type 1 queries, and is(Q) ≤ t├ ≤ t┤ ≤ is(Q) + W for queries of
Type 2 and 3, where is(Q) denotes the time at which query Q is issued.
• Index usage time (U): the time interval during which an index will be used.
Thus, tl ≤ is(Q) ≤ tl + U, where tl is the index creation or bulkloading time. After
tl + U, the index is considered obsolete.
• Time horizon (H): the time interval from which all times (t, t├, t┤) specified in
queries are drawn. The time horizon for an index is the index usage time plus
the querying window.
Figure 5: Time Horizon H, Index Usage Time U, and Querying Window W
As a precursor to designing the (dynamic) insertion and bulkloading algorithms for the
TPR-tree, we discuss how to group moving objects into nodes so that the tree most
efficiently supports timeslice queries when assuming a time horizon H. The objective
is to identify principles, or heuristics, that apply to both dynamic insertions and
bulkloading, and to any number of dimensions. The goal is to obtain a versatile index.
Assuming the timeslice queries to be uniformly distributed between time tl (the
bulkloading time) and tl + H and keeping in mind that the moving points are
represented by linear trajectories, it seems intuitive to bulkload the index based on the
projected positions of the moving points at time tl + H/2. Indeed, this would be a
promising approach if always-minimum bounding rectangles were employed, but it does not
work for conservative bounding rectangles. This insight is illustrated for one-dimensional
space in Figure 2, where points A and B are placed in the same bounding interval, due to their
proximity at time H/2.
Although the idea presented above is not useful for bulkloading, one may wonder if it is
useful in the tree's insertion algorithm. The idea would be to compute the area, margin, and
other characteristics of bounding rectangles relevant to the insertion algorithm as of H/2 time
units after the time of the insertion. However, this approach has the problem that for more
than one dimension, the area of a bounding rectangle does not grow linearly with time. A
different approach is necessary [SJLL99].
It is clear that when H is close to zero, the tree may simply use the usual R-tree insertion and
bulk-loading algorithms. The movement of the point objects and the growth of the bounding
rectangles become irrelevant - only their initial positions and extents matter. In contrast, when
H is large, grouping the moving points according to their velocity becomes important because
it is desirable that the bounding rectangles are as small as possible at all times in [tl,tl + H],
and how fast a bounding rectangle grows depends on its "velocity extents". (In one-dimensional space, the velocity extent of a bounding interval is equal to v┤ - v├.)
This leads to the following general approach. The insertion and bulkloading algorithms of the
R*-tree, which we consider extending to moving points, aim to minimize objective functions
such as the areas of the bounding rectangles, their margins (perimeters), and the overlap
among the bounding rectangles. In our context, these functions are time dependent, and we
should consider their evolution in [tl,tl+H]. Specifically, given an objective function A(t), the
following integral should be minimized:
$\int_{t_l}^{t_l+H} A(t)\,dt \qquad (1)$
If A(t) is area, the integral computes the area of the trapezoid that represents part of the
trajectory of a bounding rectangle in (x,t)-space. We have so far assumed that the times tq of
the timeslice queries (R,tq) are distributed uniformly across [tl,tl+H]. Perhaps most
prominently, this occurs if the times when queries are issued are uniformly distributed and the
tq's of the queries are always equal to the times when they are issued, i.e., W = 0. If queries
are issued uniformly, but W > 0, or window queries and moving queries are issued, the
probability that a time point in the middle of interval [tl,tl+H] will be in a query is higher than
that of a time point near tl or tl + H. If p(t) is the probability that time point t is in some query,
then the integral $\int_{t_l}^{t_l+H} p(t)A(t)\,dt$ should be minimized [SJLL99].
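For the unweighted case where A(t) is the area of a conservative bounding rectangle, the integral has a closed form, because the extent in each dimension grows linearly and the area is therefore a polynomial in t. The sketch below is an illustration only: the function name is hypothetical, tl is taken as 0, and the extents and growth rates (velocity extents) are assumed to be non-negative.

```python
def area_integral(extents, growth_rates, H):
    """Integral over [0, H] of prod_j (extents[j] + growth_rates[j] * t).

    extents[j]      : spatial extent in dimension j at time 0
    growth_rates[j] : velocity extent (upper minus lower bound speed)
    """
    coeffs = [1.0]  # polynomial in t, lowest degree first
    for e, g in zip(extents, growth_rates):
        nxt = [0.0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            nxt[k] += c * e        # multiply by the constant term
            nxt[k + 1] += c * g    # multiply by the linear term
        coeffs = nxt
    # Integrate term by term: t^k becomes H^(k+1) / (k+1).
    return sum(c * H ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
```

In one dimension this reduces to the area of a trapezoid in (x,t)-space: an interval of length 2 growing at rate 3 has integral 2·3 + 3·9/2 = 19.5 over a horizon of 3.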
Insertion and Deletion
The insertion algorithm of the R*-tree employs functions that compute the area of a bounding
rectangle, the intersection of two bounding rectangles, the margin of a bounding rectangle
(when splitting a node), and the distance between the centers of two bounding rectangles
(used when doing forced reinsertions). The TPR-tree's insertion algorithm is the same as that
of the R*-tree, with one exception: instead of the functions mentioned here, integrals as in
Formula 1 of those functions are used [SJLL99].
Computing the integrals of the area, margin, and distance is relatively straightforward. The
algorithm that computes the integral of the intersection of two time-parameterized rectangles
is an extension of the algorithm for checking if such rectangles overlap. At each time point
when the rectangles intersect, the intersection area is a rectangle and, in each dimension, the
upper (lower) bound of this rectangle is defined by the upper (lower) bound of one of the two
intersecting rectangles [SJLL99].
The algorithm thus divides the time interval returned by the overlap-checking algorithm into
consecutive time intervals so that, during each of these, the intersection is defined by a time-parameterized rectangle. The intersection area integral is thus computed as a sum of area
integrals [SJLL99].
The parameter H = U + W was defined based on a static setting, and for static data. In a
dynamic setting, W remains a component of H, which is the length of the time period where
integrals are computed in the insertion algorithm. How large the other component of H should
be depends on the update frequency. If this is high, the effect of an insertion on the tree will
not persist long and, thus H should not exceed W by much [SJLL99].
The introduction of the integrals is the most important step in rendering the R*-tree insertion
algorithm suitable for the TPR-tree, but one more aspect of the R*-tree algorithm must be
revisited. The R*-tree split algorithm selects one distribution of entries between two nodes
from a set of candidate distributions, which are generated based on sortings of point positions
along each of the coordinate axes. In the TPR-tree split algorithm, moving point (or rectangle)
positions at different time points are used when sorting. With load-time bounding rectangles,
positions at tl are used, and with update-time bounding rectangles, positions at the current
time are used [SJLL99].
Finally, in addition to sortings along spatial dimensions, the split algorithm is extended to
consider also sortings along the velocity dimensions, i.e., sortings obtained by sorting on the
coordinates of the velocity vectors. The rationale is that distributing the moving points based
on the velocity dimensions may result in bounding rectangles with smaller "velocity extents"
and which consequently grow more slowly. Deletions in the TPR-tree are performed as in the
R*-tree. If a node gets underfull, it is eliminated and its entries are reinserted [SJLL99].
Bulkloading the Tree
The bulkloading algorithm presented here attempts to minimize the area integrals of the tree's
time-parameterized bounding rectangles across [tl, tl+H]. Without loss of generality, we let tl
= 0. We also assume one-dimensional, uniform moving point data. More precisely, if the one-dimensional moving points are represented as two-dimensional points in (x(tref),v)-space, we
assume that they are uniformly distributed in a rectangular region with extents S and V.
Packing these points into tree nodes corresponds to partitioning this region into bounding
rectangles. Due to the uniformity, we choose all bounding rectangles to be equal. The
important parameter, which we need to determine, is then the ratio between the velocity
extents and the reference-position (or spatial) extents of the bounding rectangles. For
example, Figure 6 illustrates two different partitionings of a region [SJLL99].
Figure 6: Two Subdivisions of a Data Region in (x(tref),v)-Space and the Evolution of the Corresponding
Intervals in (x,t)-Space.
Partitioning a) equally prioritizes position and velocity, while Partitioning b) completely
ignores velocity and packs data points according to position only. To compare the two
partitionings with respect to different values for H, we consider the trapezoids in (x,t)-space
that correspond to the bottom-left and bottom partitions in the two partitionings. For H = 1,
the areas of the trapezoids are 2.25 versus 1.5 for the two partitionings. But for H = 5, the
areas are 16.25 versus 17.5. It is not difficult to see that although the trapezoids that
correspond to other partitions are different, their areas are equal to those of the two partitions
considered. Thus, for small values of H, Partitioning b) is best, and for large values of H,
Partitioning a) is best [SJLL99].
The partitioning parameter - the velocity-space aspect ratio, α - is a function of H. To
determine α(H), let the extents of a bounding rectangle of a partition in (x(tref),v)-space be (s,
αs). Then, if the number of rectangles in a partitioning (which is also the number of nodes in
the leaf level of a tree) is k, the equation $k = (S/s)\,(V/(\alpha s))$ holds, meaning that $s = \sqrt{SV/(\alpha k)}$.
Knowing s, the length of a bounding interval at time t is $A(t) = s + \alpha s\,t = s(1 + \alpha t)$.
To find α we express the area integral of the interval over [0, H] and minimize it with respect to α, which gives $\alpha = 2/H$.
This result confirms that the larger the time horizon H, the smaller α should be, i.e., the
narrower the bounding rectangles should be in the velocity dimension. Note also that α is
independent of parameters such as the extents of the data set and the number of nodes
[SJLL99].
To actually achieve bounding rectangles that have a velocity-space aspect ratio close to α, we
use an adapted version of the STR-tree packing algorithm [LEL97].
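The step from the area integral to the optimal aspect ratio can be written out as follows. The symbol α for the velocity-space aspect ratio is an assumed notation here; the derivation itself uses only the quantities S, V, k, s and H defined in the text, with tl = 0.

```latex
\begin{align*}
F(\alpha) &= \int_0^H A(t)\,dt
           = \int_0^H s\,(1+\alpha t)\,dt
           = s\left(H + \tfrac{1}{2}\alpha H^2\right),
           \qquad s = \sqrt{\frac{SV}{\alpha k}} \\[4pt]
F(\alpha) &= \sqrt{\frac{SV}{k}}
             \left(H\,\alpha^{-1/2} + \tfrac{1}{2}H^2\,\alpha^{1/2}\right) \\[4pt]
\frac{dF}{d\alpha} &= \sqrt{\frac{SV}{k}}
             \left(-\tfrac{1}{2}H\,\alpha^{-3/2} + \tfrac{1}{4}H^2\,\alpha^{-1/2}\right) = 0
             \;\Longrightarrow\; \alpha = \frac{2}{H}
\end{align*}
```

The factor \sqrt{SV/k} multiplies every term and drops out when the derivative is set to zero, which is why the optimum does not depend on the extents of the data set or on the number of nodes.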
Performance Experiments - Experiment Setup And Workload Generation
The implementation of the TPR-tree used in the experiments is based on the Generalized
Search Tree Package, GiST [HNP95]. The page size (and tree node size) is set to 4k bytes,
which results in 204 and 146 entries per leaf-node for two- and three-dimensional data,
respectively. A page buffer of 200k bytes, i.e., 50 pages, is used [LL98], where the root of a
tree is pinned and the least-recently-used page replacement policy is employed. The nodes
changed during an index operation are marked as “dirty” in the buffer and are written to disk
at the end of the operation or when they otherwise have to be removed from the buffer. The
performance studies are based on workloads that intermix queries and update operations on
the index, thus simulating index usage across a period of time. In addition, each workload
initially bulkloads the index [SJLL99].
As moving point data where the positions and velocities of the objects are uniformly
distributed seems to be rather unrealistic, we attempt to generate more realistic two-dimensional data by simulating a scenario where the objects, e.g., cars, move in a network of
routes, e.g., roads, connecting a number of destinations, e.g., cities. In addition to simulating
cars moving between cities, the scenario is also motivated by the fact that usually, even if
there is no underlying infrastructure, moving points tend to have destinations. For example,
fishing boats follow schools of fish or return to ports [SM99].
The TPR-tree insertion algorithm depends on the parameter H, which is equal to W plus some
duration that is dependent on the frequency of updates. Figures 7 and 8 show the results. The
horizontal axes correspond to the part of parameter H that should depend on the frequency of
updates. Curves are shown for experiments with different querying windows W. The graphs
demonstrate a pattern, namely that the best values of H lie between UI/2 + W and UI + W, where UI is the update interval.
This is not surprising. In UI/2 time units, approximately half of the entries of each leaf node in
the tree are updated, and after UI time units, almost all entries are updated. Note also the
difference in average search disk access numbers in Figures 7 and 8. A higher update rate (a
smaller UI) means tighter bounding rectangles and, thus, better query performance [SJLL99].
Figure 7: Search Performance For UI = 60 and Varying Settings of H
Figure 8: Search Performance For UI = 120 and Varying Settings of H
A set of experiments with varying workloads was performed in order to compare the relative
performance of the R-tree, the TPR-tree with load-time bounding rectangles, and the TPR-tree
with update-time bounding rectangles [SJLL99].
Figure 9 shows the average number of I/O operations per query for the three indices when the
number of destinations in the simulation is varied. As shown, increased skew leads to a
decrease in the numbers of I/Os for all three approaches, especially for the TPR-tree. This is
expected because when there are more objects with similar velocities, it is easier to pack them
into bounding rectangles that have small velocity extents and also are not too big in the spatial
dimensions.
The figure also demonstrates that the TPR-tree is an order of magnitude better than the R-tree
[SJLL99].
Figure 9: Search Performance For Varying Numbers of Destinations and Uniform Data
Figure 10: Search Performance for Varying W
The study indicates quite clearly that the TPR-tree indeed is capable of supporting queries on
moving objects quite efficiently and that it outperforms its competitors so far. The study also
demonstrates that the tree does not degrade severely as time passes. See Figure 11.
Figure 11: Degradation of Search Performance with Time
Summary
The TPR-tree is a versatile adaptation of the R*-tree that supports the efficient querying of the
current and anticipated future locations of moving points in one-, two-, and three-dimensional
space. Whereas the R*-tree's algorithms use functions that compute the areas, margins, and
overlaps of bounding rectangles, the TPR-tree employs integrals of these functions, thus
taking into consideration the values of these functions across the time when the tree is
queried.
2.2 The B-tree
The B-tree is a widely used technique for organizing a file and its index. In fact, it is the
standard organization for indices in a database system [Com79]. There are several variants of
the B-tree (e.g. B*-trees, B+-trees, Prefix B+-trees, Virtual B+-trees), and Kollios' [KGT99]
query approximation algorithm is based on the use of several B+-trees.
In the late 1960s computer manufacturers and independent research groups competitively
developed general purpose file systems and so-called "access methods" for their machines.
Building on this work, Bayer and McCreight in 1972 proposed an external index mechanism
with relatively low cost for most basic operations and called it the B-tree [Com79]. The B-tree
is a generalization of the binary search tree in which more than two paths leave a given node
[Com79][RSV02]. The beauty of B-trees lies in the methods for inserting and deleting records.
The result is a balanced tree with all leaves at the same depth.
2.2.1 The B+-tree
In a B+-tree, all keys reside in the leaves. The upper levels, which are organized as a B-tree,
consist only of an index to enable rapid location of the index and key parts [Com79]. The
result is a separation of the index and the key parts. Of course index nodes and leaf nodes may
have different formats or sizes. Furthermore, leaf nodes are usually linked together left-to-right as a sequence set (Figure 12). This sequence set allows easy sequential processing
[Com79].
Figure 12: A B+-tree with separate index and key parts. Operations "by key" begin at the root as in the B-tree;
sequential processing begins at the leftmost leaf. Labels in the figure: random search, sequential search, index: B-tree, keys: the sequence set.
(From [Com79])
B-trees, which support low-cost find, insert, and delete operations, may require log n accesses
to secondary storage to process a next operation. The B+-tree implementation retains the
logarithmic cost properties for operations by key [Com79][RSV02], but gains the advantage
of requiring at most 1 access to satisfy a next operation [Com79]. Furthermore, during the
sequential processing of a file, no node will be accessed more than once, so space for only 1
node needs to be available in the main memory. Therefore, B+-trees are well suited to
applications which entail both random and sequential processing [Com79].
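The role of the sequence set can be illustrated with a small sketch. It is deliberately simplified and makes several assumptions: keys are unique, the index part is replaced by a single sorted list of separator keys searched with `bisect` (a stand-in for the descent through a real B-tree index), and the names `Leaf`, `build_sequence_set` and `range_scan` are hypothetical.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys   # sorted keys stored in this leaf
        self.next = None   # link to the right sibling (the sequence set)


def build_sequence_set(sorted_keys, capacity):
    """Pack sorted, unique keys into leaves and link them left to right."""
    leaves = [Leaf(sorted_keys[i:i + capacity])
              for i in range(0, len(sorted_keys), capacity)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves


def range_scan(leaves, lo, hi):
    """Return all keys k with lo <= k <= hi."""
    if not leaves:
        return []
    # "By key" step: locate the leaf that may contain lo.
    separators = [leaf.keys[0] for leaf in leaves[1:]]
    leaf = leaves[bisect_right(separators, lo)]
    out = []
    # Sequential step: follow the leaf links; each leaf is read once.
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out
```

After the initial lookup, each "next" step touches at most one further leaf, which is the property that makes a range search on a single coordinate cheap.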
2.2.1.1.1 Query approximation method (Hough-Y transformation)
Kollios et al. [KGT99] use a query approximation algorithm with multiple B+-trees to index
moving objects in the Dual transform model. The query region is defined by the intersection
of two half-plane queries (Figure 13). Since access methods are more efficient for rectangle
queries, the authors approximate the simplex query with a rectangular one. Thus, the query
area is enlarged by the area E = E1 + E2. E should be minimized since it represents a
measure of the extra I/Os that the algorithm has to perform to answer a one-dimensional MOR query.
Figure 13: Query in the dual Hough-Y plane. (Labels in the figure: n, 1/vmin, 1/vmax, tq1, tq2, and the enlargement areas E1 and E2.)
Kollios et al. [KGT99] propose to keep c indices (where c is a small constant) at equidistant
yr's. All c indices contain the same information about the objects, but use different yr's.
If a query is executed at a single observation index, area E becomes large. In order to bound
E, each subterrain is indexed. Each of the c subterrain indices records the time interval when
a moving object was in the subterrain. In this way, the query is decomposed into a collection
of smaller subqueries. A given one-dimensional MOR query is forwarded to the indices
that minimize E. Since all 2-dimensional approximate queries have the same rectangle side
(Figure 13), the rectangle range search is equivalent to a simple range search on the b
coordinate axis. Thus, each of the c "observation" indices can simply be a B+-tree.
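The following is a minimal sketch of this idea under several assumptions that are not taken from [KGT99]: each object moves as y(t) = a + v*t with a speed v between the hypothetical bounds V_MIN and V_MAX, b is taken to be the time at which the object crosses the observation line y = yr, a sorted list with binary search stands in for each B+-tree, the width of the b-range serves as a proxy for E, and the query is sent to a single index rather than decomposed into subqueries.

```python
from bisect import bisect_left, bisect_right

V_MIN, V_MAX = 1.0, 4.0   # assumed speed bounds (positive velocities only)

def b_range(y_r, y1q, y2q, t1q, t2q):
    """Conservative range of crossing times b for objects that can satisfy the query."""
    offsets = [(y - y_r) / v for y in (y1q, y2q) for v in (V_MIN, V_MAX)]
    return t1q - max(offsets), t2q - min(offsets)

class ObservationIndex:
    """Stand-in for one B+-tree: objects sorted by crossing time at y = y_r."""
    def __init__(self, y_r, objects):
        self.y_r = y_r
        self.entries = sorted(((y_r - a) / v, oid) for oid, (a, v) in objects.items())
        self.keys = [b for b, _ in self.entries]

    def range_search(self, lo, hi):
        i, j = bisect_left(self.keys, lo), bisect_right(self.keys, hi)
        return [oid for _, oid in self.entries[i:j]]

def mor_query(indices, objects, y1q, y2q, t1q, t2q):
    """Objects inside [y1q, y2q] at some time in [t1q, t2q]."""
    def width(ix):
        lo, hi = b_range(ix.y_r, y1q, y2q, t1q, t2q)
        return hi - lo
    best = min(indices, key=width)            # proxy for "the index that minimizes E"
    lo, hi = b_range(best.y_r, y1q, y2q, t1q, t2q)
    result = []
    for oid in best.range_search(lo, hi):     # rectangular approximation: a superset
        a, v = objects[oid]
        if a + v * t1q <= y2q and a + v * t2q >= y1q:   # exact filter for v > 0
            result.append(oid)
    return sorted(result)

objects = {1: (0.0, 1.0), 2: (10.0, 2.0), 3: (50.0, 1.0), 4: (-10.0, 4.0)}
indices = [ObservationIndex(y_r, objects) for y_r in (0.0, 25.0, 50.0)]
```

In this example the observation line closest to the queried interval yields the narrowest b-range, so fewer false candidates have to be filtered out afterwards.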
2.3 The KD-tree approach
The KD-tree is a binary tree, specially designed for indexing multidimensional points. It was
designed by Bentley in 1975. Its structure is k-dimensional, thus the name kd-tree.
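A minimal in-memory sketch of a kd-tree with an orthogonal range search is given below. The function names build and range_search, the dictionary-based nodes and the median split are choices made for this illustration only; a disk-based variant would need paging, which is not shown.

```python
def build(points, depth=0):
    """Build a kd-tree by splitting on the median, cycling through the dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def range_search(node, lo, hi, out=None):
    """Collect all points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    out = [] if out is None else out
    if node is None:
        return out
    p, axis = node["point"], node["axis"]
    if all(l <= c <= h for l, c, h in zip(lo, p, hi)):
        out.append(p)
    if lo[axis] <= p[axis]:          # the query box reaches into the left half-space
        range_search(node["left"], lo, hi, out)
    if p[axis] <= hi[axis]:          # the query box reaches into the right half-space
        range_search(node["right"], lo, hi, out)
    return out

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```

Because each node compares only one coordinate, the node layout is the same for any number of dimensions k.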
2.3.1.1.1 Point access method (Hough-X transformation)
The Hough-X transformation (Figure 14) [KGT99] uses index structures based on R-trees
or, even better, kd-trees for indexing moving objects. Using an algorithm for answering
simplex range queries by Goldstein et al. [GRSY97], we can answer the MOR query
in the dual space.
Figure 14: Query in the dual Hough-X plane. (Labels in the figure: axes v and a; vmin, vmax; yq1, yq2; objects O5, O6, O7, O8.)
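To make the query in the dual space concrete, the sketch below tests a dual point (v, a) against the two half-planes that bound the query region. It assumes a trajectory y(t) = a + v*t with positive speed v and a query asking for objects inside [y1q, y2q] at some time in [t1q, t2q]; the function name in_mor_query is hypothetical, and negative velocities are not covered.

```python
def in_mor_query(v, a, y1q, y2q, t1q, t2q):
    """True if y(t) = a + v*t, with v > 0, is inside [y1q, y2q] at some t in [t1q, t2q].

    For v > 0 the position only increases, so the object qualifies exactly when it
    has reached y1q by time t2q and has not yet passed y2q at time t1q. Each
    condition is a half-plane in the (v, a) plane.
    """
    return a + v * t2q >= y1q and a + v * t1q <= y2q

# Dual points (v, a) keyed by object ID; the values are illustrative only.
dual_points = {5: (2.0, 10.0), 6: (1.0, 0.0), 7: (1.0, 50.0), 8: (4.0, -10.0)}
hits = sorted(oid for oid, (v, a) in dual_points.items()
              if in_mor_query(v, a, 20.0, 30.0, 5.0, 10.0))
```

An index on the dual points only has to report the points inside this wedge, which is the simplex range query mentioned above.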
Kollios et al. [KGT99] argue that an index structure based on kd-trees is more suitable than a
method based on R-trees, because a kd-tree based method will use both dimensions to split,
while the R-tree will only split in one dimension. The kd-tree is therefore expected to perform
better for the MOR query.
There are several advantages of using multi-attribute trees, like the kd-tree [Sal91]:
• Good space utilization in both index and data nodes
• High fan-out (the index should be significantly smaller than the data collection)
• Fast exact match search (given the coordinates, the data should be obtained quickly)
• Fair clustering in data pages by all attributes for good range search performance
• Easy integration with the query, locking, and recovery systems of existing DBMSs
• Simple design for incremental growth and shrinkage (insertion and deletion algorithms)
But there are also some common drawbacks which are typical for multidimensional binary
trees [HSW89]:
• Multidimensional binary trees may become unbalanced, i.e. may contain long paths with almost no branches
• No suitable method for paging a multidimensional binary tree is known
2.4 Variants of the KD-tree
We present two variants of the kd-tree: the LSD-tree and the hBΠ-tree. Both of them have been
proposed as suitable for indexing mobile objects [KGT99].
2.4.1 The hB-tree
The hB-tree was introduced by Evangelidis et al. [ELS97]. It is a combination of the hB-tree,
a multi-attribute index, and the -tree, an abstract index which offers efficient concurrency
and recovery methods. The reasons for combining these trees was the need for an efficient
tree-structure suitable for indexing multi-dimensional applications, which would perform well
for all kinds of data distributions, and the fact that the hB-tree is fairly insensitive to
increases in dimension [ELS97].
A kd-tree node always stores the value of exactly one attribute. Thus, the size of a kd-tree
node (and, consequently, the size of the kd-trees that reside in the hB-tree nodes) does not
depend on the number of indexing attributes. However, the hB-tree node stores its own
boundaries, and an increase in the number of dimensions will affect the space required to store
a node’s boundaries. This additional space is not significant for large page sizes. With a
page size of 1K bytes and larger, there is almost no effect on the size of the hB-tree and the
node space utilization as the dimensions increase. In contrast, in the R-tree, the size of the
index is proportional to the dimension of the space. Experiments with various types and
distributions of data show that even the most restrictive versions of the hBΠ-tree, which do
not offer worst case storage utilization and index term size guarantees, perform very well in
terms of storage utilization, index size, exact-match and range searching [ELS97]. To
understand the structure of the hBΠ-tree, we have to know more about the Π-tree and the hB-tree.
2.4.1.1 Facts about the Π-tree
A node of the Π-tree can be directly responsible for some part of the space, but it can also delegate
responsibility for part of the space to its sibling nodes. Its pointers to sibling nodes are called
side pointers. In the Π-tree it is possible for a node to be referred to by more than one parent;
such a child is called a multi-parent node.
2.4.1.2 Facts about the hB-tree
The hB-tree consists of index nodes and data nodes [LS90]. Index nodes contain kd-trees with
information about children on the next lower level of the hB-tree and about regions which
have been extracted and transferred to siblings on the same level. Data nodes contain
the actual data records and may also contain kd-trees. Each node stores a description of the
space it is responsible for, by using two attributes, called the boundaries of the hB-tree.
Unlike other multi-attribute indexes that split nodes by hyperplanes, the hB-tree can use more
than one attribute to describe the extracted region.
2.4.2 How to make the hB-tree a Π-tree
We make the hB-tree into a Π-tree by applying some transformations [ELS97]. We adopt side
pointers from the Π-tree, which eases searching. We also modify the algorithm for
splitting a node at its kd-tree root. In short, when there is a split at the root, we keep the kd-tree root in the original node and simply extract the appropriate kd-subtree, which
becomes the kd-tree of the new hB-tree node. To be able to support node consolidation, we
make further structural changes: we make it possible to determine the containment order of
the children of a node N and whether a child node of N is multi-parent or not.
The main innovations of the hBΠ-tree compared to the hB-tree are the introduction of side pointers and the fact that node splitting and index term posting are performed by separate
actions. There has not been very much research on the performance of the hBΠ-tree used for
indexing mobile objects. Kollios et al. [KGT99] is
2.5 The LSD-tree
The Local Split Decision tree (LSD-tree) was introduced in 1989 by Henrich et al. [HSW89]
as a data structure supporting efficient spatial access to geometric objects. It performs well for
all reasonable data distributions, cover quotients, and bucket capacities, and it handles
multidimensional points as well as non-point geometric objects. The LSD-tree is well
suited to the implementation of spatial access paths in geometric databases [HSW89].
Like other tree structures, the LSD-tree partitions the data space into pairwise disjoint cells with
associated buckets of fixed size. However, in contrast to the grid file and other structures, the
LSD structure is not grid oriented. Splits occur at arbitrary positions, in what [HSW89] calls
locally optimal positions. The split positions are optimal with respect only to the cell to be
split and independent from other existing cell boundaries.
After a certain number of insertions the initial bucket has been filled, and an attempt to insert
an additional object causes the need for a bucket split. For this purpose, a split line is
determined and the objects on one side of the split line are stored in one bucket, while those
on the other side are stored in another bucket. After some further insertions, the capacity of
another bucket will be exceeded. In this case, a split line in the corresponding bucket region is
determined, thus splitting this region into two subregions. This process is repeated each time
the capacity of a bucket is exceeded.
The split lines of the LSD-tree are maintained in a directory which is a generalized kd-tree.
For each split, a new node containing the position and the dimension of the split line is
inserted into the directory tree. The leaves of the directory tree reference the buckets in which
the actual objects are stored. When the directory grows to a point where it cannot be kept
in main memory any longer, subtrees of the directory are stored on secondary memory, while
the part of the directory near the root remains in main memory [Hen96].
The LSD-tree overcomes the typical drawbacks for multidimensional data by using a special
paging algorithm. When an LSD-tree grows to a size where it cannot be kept in main
memory any longer, the paging algorithm determines a subtree to be paged out to secondary
storage. The main memory is then freed and made ready for additional input.
For storing rectangles in the LSD-tree, we use a transformation technique. Thus 2D-rectangles
are stored as 4D-points. This is a simple idea, used in all kd-trees, but some problems do arise.
First, there is a strong correlation between the lower and upper bounds in the rectangle,
because the upper bound of a rectangle is always greater than (or equal to) the lower bound.
Therefore all the points are located in a triangular shaped subspace. Furthermore, since the
rectangles usually are small compared to the data space, the points are located in a small strip
above the diagonal. The LSD-tree overcomes the drawbacks of the transformation technique
by using one of two bucket split strategies: the data dependent split strategy and the space
dependent strategy.
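The transformation technique itself can be sketched in a few lines. The coordinate order (x_lo, x_hi, y_lo, y_hi) and the function names to_point and intersects are assumptions made for this illustration; [HSW89] may order the coordinates differently.

```python
def to_point(rect):
    """Map a 2D rectangle ((x_lo, y_lo), (x_hi, y_hi)) to a 4D point."""
    (x_lo, y_lo), (x_hi, y_hi) = rect
    return (x_lo, x_hi, y_lo, y_hi)

def intersects(point, query):
    """Rectangle intersection expressed as a range condition on the 4D point."""
    x_lo, x_hi, y_lo, y_hi = point
    (qx_lo, qy_lo), (qx_hi, qy_hi) = query
    return x_lo <= qx_hi and x_hi >= qx_lo and y_lo <= qy_hi and y_hi >= qy_lo

p = to_point(((2, 3), (4, 5)))
```

Since x_lo <= x_hi and y_lo <= y_hi always hold, every transformed point lies on or above the diagonal in each pair of coordinates, which is the correlation described above.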
Figure 15: Split positions achieved by two basic split strategies
The data dependent strategy chooses the average over all the coordinates in the bucket to be
split, including the object to be inserted, as split position.
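A minimal sketch of bucket splitting with the data dependent strategy follows. The class names Bucket, Split and LSDSketch are hypothetical, the split dimension simply alternates with the depth in the directory (an assumption, not taken from [HSW89]), and paging of the directory is not modelled.

```python
class Bucket:
    """Leaf of the directory: holds the actual points."""
    def __init__(self, points=None):
        self.points = points or []

class Split:
    """Directory node: dimension and position of one split line."""
    def __init__(self, dim, pos, left, right):
        self.dim, self.pos, self.left, self.right = dim, pos, left, right

class LSDSketch:
    def __init__(self, capacity=2, dims=2):
        self.capacity, self.dims, self.root = capacity, dims, Bucket()

    def insert(self, point):
        parent, side, node, depth = None, None, self.root, 0
        while isinstance(node, Split):           # descend the directory
            parent = node
            side = "left" if point[node.dim] < node.pos else "right"
            node = getattr(node, side)
            depth += 1
        node.points.append(point)
        if len(node.points) > self.capacity:     # bucket overflow: split locally
            new = self._split(node, depth % self.dims)
            if parent is None:
                self.root = new
            else:
                setattr(parent, side, new)

    def _split(self, bucket, dim):
        # Data dependent strategy: average of the coordinates in the bucket,
        # including the point that caused the overflow.
        pos = sum(p[dim] for p in bucket.points) / len(bucket.points)
        left = Bucket([p for p in bucket.points if p[dim] < pos])
        right = Bucket([p for p in bucket.points if p[dim] >= pos])
        return Split(dim, pos, left, right)

tree = LSDSketch(capacity=2)
for pt in [(1, 1), (2, 5), (9, 3)]:
    tree.insert(pt)
```

The split position depends only on the contents of the overflowing bucket, which reflects the local split decision that gives the structure its name.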
The space dependent strategy is based on two assumptions. The first one is that all rectangles
degenerate to points, i.e. the upper and the lower bounds coincide for each dimension.
Hence, all points are located on the diagonal. A suitable split position will split the data into
two cells containing equally long parts of the diagonal. In Figure 15, this split position is
named SP1. The second assumption is that all points are uniformly distributed over the
triangular subspace. A suitable split position halves the data cell into two cells of equal area. In
Figure 15, this split position is named SP2. The final split position SP is calculated as a
weighted sum of the positions from the two assumptions.
A performance study by Henrich et al. [HSW89] shows that the size of the directory does
not depend on the data distribution but on the split strategy (and of course on the size of the
data set and the bucket capacity). With insertion of unsorted data, the data dependent strategy
performs significantly better than the distribution dependent one, while with sorted data, the
distribution dependent solution is best.
2.6 The quadtree approach
The idea common to all quadtree variations is the recursive decomposition of the indexed space.
When a quadrant is split, four sub-quadrants are created. Our main interest is the region
quadtree, and particularly the PMR quadtree, which is a quadtree-based indexing structure
for line segments. The idea of the PMR quadtree is to store information about a line segment
in every quadrant of the underlying space that it crosses [TUW98]. The data space is
partitioned until no more than B lines cross a single quadrant. B is called the bucket size, and
will typically be equal to the number of data records that fit in a single disk page. The PMR
quadtree is quite similar to the point region (PR) quadtree. The difference lies in the semantics of a
data point, which makes a split involve more than a simple distribution of points over the
four subquadrants.
In the quadtree approach, as in the other ways of indexing mobile objects, the
indexing is based on an equation of motion, f(t) = at + b, where a is the
speed of the object (the slope) and b is the intercept. Thus index records consist of an object
ID, the intercept b and the slope a. When a data page overflows (at the (B + 1)-th insertion), a
page or bucket split takes place. In short, a bucket split inserts the corresponding
<ID, a, b> record in every crossed subquadrant.
Figure 16: Example of overflow in the PMR-structure
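The overflow handling can be sketched as follows. This is an illustrative sketch only: the class name Quad and the helper leaves are hypothetical, every trajectory f(t) = a*t + b is assumed to span the whole indexed time range, and an overflowing quadrant is split once per overflow rather than repeatedly, which is a simplification made here.

```python
class Quad:
    """Quadrant covering [t_lo, t_hi] x [y_lo, y_hi] in the time-position plane."""
    def __init__(self, t_lo, t_hi, y_lo, y_hi):
        self.box = (t_lo, t_hi, y_lo, y_hi)
        self.records = []        # (oid, a, b) records while this quadrant is a leaf
        self.children = None

    def crosses(self, a, b):
        """True if the line f(t) = a*t + b passes through this quadrant."""
        t_lo, t_hi, y_lo, y_hi = self.box
        f_lo, f_hi = sorted((a * t_lo + b, a * t_hi + b))
        return f_lo <= y_hi and f_hi >= y_lo

    def insert(self, rec, bucket_size):
        _, a, b = rec
        if not self.crosses(a, b):
            return
        if self.children is not None:
            for child in self.children:
                child.insert(rec, bucket_size)
            return
        self.records.append(rec)
        if len(self.records) > bucket_size:      # the (B + 1)-th record arrived
            self._split()

    def _split(self):
        t_lo, t_hi, y_lo, y_hi = self.box
        t_mid, y_mid = (t_lo + t_hi) / 2, (y_lo + y_hi) / 2
        self.children = [Quad(t_lo, t_mid, y_lo, y_mid), Quad(t_mid, t_hi, y_lo, y_mid),
                         Quad(t_lo, t_mid, y_mid, y_hi), Quad(t_mid, t_hi, y_mid, y_hi)]
        old, self.records = self.records, []
        for rec in old:                          # duplicate into every crossed subquadrant
            for child in self.children:
                if child.crosses(rec[1], rec[2]):
                    child.records.append(rec)

def leaves(node):
    if node.children is None:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)

root = Quad(0, 8, 0, 8)
for rec in [(1, 0.0, 1.0), (2, 0.0, 7.0), (3, 1.0, 1.0)]:
    root.insert(rec, 2)
```

With a bucket size of 2, the third insertion triggers a split, and the diagonal trajectory of object 3 ends up stored in three of the four subquadrants.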
A bucket split thus leads to duplication of index elements. The same trajectory which
was represented by a single point before the split may be represented by three points after
the split. This might lead to significant storage overhead. However, performance experiments
by Tayeb, Ulusoy and Wolfson [TUW98] show that the PMR index performs very well for
what they define as instantaneous queries, which average two disk accesses per query.
3 Comparison
In their model, Kollios et al. [KGT99] compare cB+-trees (where c = 4, 6 and 8), an hBΠ-tree,
and an R* approach. In Figure 17 the results for the average number of I/Os per query for the
different methods are presented. The approximation method using several B+-trees was
better than the hBΠ-tree approach. The R* approach had the worst performance.
Figure 17: Query performance.
(From [KGT99])
The space consumption and the average number of I/Os per update are plotted in Figure 18.
The update and space performance of the hBΠ-tree is better than that of the cB+-tree. This is because
the objects are stored only once and are better clustered than in the R*-tree.
Figure 18: Space consumption.
(From [KGT99])
Although the query approximation algorithm is efficient, later work by Chon et al. [CAE01]
has tested and compared the Dual transform model of [KGT99] with their own SV model. Chon
et al. [CAE01] compared Kollios et al.’s [KGT99] cB+-tree and hBΠ-tree approaches with an
SS-tree approach. Using generated realistic data, a performance test showed that the SV model
has significantly lower overhead than the Dual transform model.
4 Discussion
We have introduced and described a few different approaches for indexing mobile objects.
There has not been much research on this topic. We have been in touch with Hae Don Chon,
one of the authors of [CAE01] and one of the most up-to-date researchers on the topic, and he
confirms that currently there are just a few people in the world doing research on mobile
objects. Several research groups have proposed variants of the R-tree, and a few have created
some kd-tree structures or B-tree structures. However, most of these structures have only been
introduced theoretically, and the models or index structures have hardly ever been
adopted in a DBMS. According to Chon, R-tree structures, like the TPR-tree, are the ones mostly
used in DBMSs, not because they perform better than other structures, but because they are
easier to implement. Research is still going on using some of the tree structures
described in this paper. Some of them have not been the subject of any research since they were
first introduced, while others are still being developed and investigated. According to B.
Salzberg, one of the authors of [ELS97], there has not been any further research using the
hBΠ-tree for spatio-temporal queries after 1997. However, there is currently some unpublished
work on this subject, mainly for temporal data. The LSD-tree is suitable for
indexing mobile objects, but research using this tree is mainly on accessing feature vectors, as
an LSDh-tree. The TPR-tree is the most appropriate R-tree, and is probably the most used
method for indexing mobile objects.
5 References
[CAE01] Chon, H. D., Agrawal, D., and El Abbadi, A. Storage and Retrieval of Moving Objects.
[Com79] Comer, D. 1979. The ubiquitous B-tree. Computing Surveys, Vol. 11(2): 121-137.
[ELS97] Evangelidis, G., Lomet, D., and Salzberg, B. 1997. The hBΠ-tree: A Multi-attribute Index Supporting
Concurrency, Recovery and Node Consolidation.
[GRSY97] Goldstein, J., Ramakrishnan, R., Shaft, U., and Yu, J. B. 1997. Processing Queries By Linear
Constraints. In: Proc. 16th ACM PODS Symposium on Principles of Database Systems, Tucson, Arizona.
[Hen96] Henrich, A. 1996. A Hybrid Split Strategy for kd-tree Based Access Structures. Proceedings of the 4th
ACM Workshop on Advances in Geographic Information Systems (GIS '96),
pp. 1-8, Rockville, Maryland, USA, November 1996.
[HNP95] J.M. Hellerstein, J.F. Naughton, and A. Pfeffer. Generalized Search Trees for Database Systems. In
Proc. of the VLDB Conf., pp. 562-573 (1995).
[HSW89] Henrich, A., Six, H.-W., and Widmayer, P. 1989. The LSD-tree: Spatial Access to Multidimensional
Point and Non-point Objects.
[KGT99] Kollios, G., Gunopulos, D., and Tsotras, V. J. 1999. On Indexing Mobile Objects.
[LEL97] S. T. Leutenegger, J. M. Edgington, and M. A. Lopez. STR: A Simple and Efficient Algorithm for R-Tree Packing. ICDE’97. pp. 497-506.
[LL98] S.T. Leutenegger and M.A. Lopez. The Effect of Buffering on the Performance of R-trees. In Proc. of
the ICDE Conf., pp. 164-171 (1998).
[RSV02] Rigaux, P., Scholl, M., and Voisard, A. 2002. Spatial Databases With Application to GIS. Academic
Press, USA.
[Sal91] Salzberg, B. 1991. Practical spatial database access methods. In: Proceedings of the Symposium on
Applied Computing, pp. 82-90.
[SJLL00] Simonas Saltenis, Christian S. Jensen, Scott T. Leutenegger, and Mario A. Lopez. Indexing the
Positions of Continuously Moving Objects.
[SM99] J.-M. Saglio and J. Moreira. Oporto: A Realistic Scenario Generator for Moving Objects. In Proc. of
DEXA Workshops, pp. 426-432 (1999).
[TUW98] Tayeb, J., Ulusoy, O., and Wolfson, O. 1998. A Quadtree Based Dynamic Attribute Indexing Method.
The Computer Journal, 41(3):185-200.