The Problem of Indexing Mobile Objects

by Hans Jørgen Ekker, Per Gunnar Ulveseth, and Mattis Vidnes

Term paper in GIS 300 – Geographical Information Systems
Department of Mapping Science (IKF)
Agricultural University of Norway (NLH)
Autumn 2001

Table of Contents
1 INTRODUCTION
2 DIFFERENT APPROACHES
2.1 TIME-PARAMETERIZED R-TREE (TPR-TREE)
2.2 THE B-TREE
2.2.1 The B+-tree
2.3 THE KD-TREE APPROACH
2.4 VARIANTS OF THE KD-TREE
2.4.1 The hBΠ-tree
2.4.2 How to make the hB-tree a Π-tree
2.5 THE LSD-TREE
2.6 THE QUADTREE APPROACH
3 COMPARISON
4 DISCUSSION
5 REFERENCES

1 Introduction

In traditional database management systems (DBMSs), data is stored in the database and remains constant unless explicitly modified through an update. While this approach is well suited for many applications where data change in discrete steps, it is not suitable for applications with constantly changing data. The storage of mobile objects in a spatio-temporal database is one such application. Mobile objects are objects that continuously change their location (e.g. cars, trains, aircraft under air-traffic control, mobile communication devices, and military units). With a traditional DBMS, the database would therefore have to update its information at every unit of time, which is inefficient because of the enormous update overhead.

Modelling an object's location as a function of time, f(t), is considered a much better approach. The database then only has to be updated when the parameters of the function f(t) change (e.g. a change of speed or direction), which lowers the update overhead. However, the approach also introduces new problems (e.g. how to find appropriate data models and optimization techniques), since the database is storing a function, not data values.

This term paper considers databases that keep track of mobile objects moving in one and two dimensions. The objects are modelled as points that move with a constant velocity, starting from a specific location at a specific time instant.
Given this information, the location of an object can be computed at any time in the future, as long as its movement characteristics remain the same. The objects are responsible for updating their motion information every time their direction or speed changes. Furthermore, the objects move inside a finite terrain, and when an object reaches the limits of the terrain, it has to issue an update (either because it is deleted or because it is reflected). Finally, the system is dynamic, i.e. new objects may be inserted and old ones deleted. Figure 1 gives a geometric representation of the problem by showing the trajectories and a query in the space-time (t, y) plane.

Figure 1: Trajectories and query in space-time (t, y) plane.

Generally, an object may move anywhere in 3-dimensional space with complex motions. However, this term paper limits the treatment to objects in 1- and 2-dimensional space whose location is described by a linear function of time. The motivation for such an approach is twofold [KGT99]. First, it is well founded in real-world applications (e.g. cars move in networks of highways, which can be approximated by connected straight-line segments on a plane). Second, solving simple 1- and 2-dimensional problems may provide intuition for how to address the more difficult problem of indexing general multidimensional functions.

This term paper focuses on the problem of indexing mobile objects. We describe a selection of indexing methods that have been studied in the context of spatio-temporal databases. Our selection includes variants of several tree structures (e.g. R-tree, B-tree, kd-tree, and quadtree). After describing the indexing methods, we discuss the performance of some of them. Finally, we address the current trend in indexing mobile objects.
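The motion model described in this introduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `MovingPoint`, `position`, and `in_range_during` are hypothetical and do not come from any of the cited papers. The range test relies on the fact that, for linear motion, the positions an object takes during [t1, t2] span exactly the interval between its positions at t1 and t2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MovingPoint:
    """A point moving on a line with constant velocity (hypothetical type)."""
    y0: float        # location reported at the reference time t0
    v: float         # constant velocity; negative means decreasing y
    t0: float = 0.0  # time instant at which y0 was reported

    def position(self, t: float) -> float:
        """Predicted location at time t, valid until the object issues an update."""
        return self.y0 + self.v * (t - self.t0)


def in_range_during(p: MovingPoint, y_lo: float, y_hi: float,
                    t1: float, t2: float) -> bool:
    """True if p lies inside [y_lo, y_hi] at some time in [t1, t2]."""
    a, b = p.position(t1), p.position(t2)
    # Linear motion: the visited positions form the interval [min(a, b), max(a, b)].
    return min(a, b) <= y_hi and max(a, b) >= y_lo


car = MovingPoint(y0=0.0, v=2.0)
print(car.position(3.0))                         # 6.0
print(in_range_during(car, 5.0, 7.0, 0.0, 2.0))  # False: the car only reaches y = 4
print(in_range_during(car, 5.0, 7.0, 2.0, 3.0))  # True: the car moves from y = 4 to y = 6
```

Only the three parameters (y0, v, t0) need to be stored per object, which is why the database has to be touched only when speed or direction changes.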
2 Different approaches

2.1 Time-parameterized R-tree (TPR-tree)

Index structure

The TPR-tree is a balanced, multi-way tree with the structure of an R-tree. Entries in leaf nodes are pairs of the position of a moving point object and a pointer to the moving point object, and entries in internal nodes are pairs of a pointer to a subtree and a rectangle that bounds the positions of all moving point objects or other bounding rectangles in that subtree [SJLL99].

The position of a moving point object is represented by a reference position and a corresponding velocity vector, $(x, v)$ in the one-dimensional case, where $x = x(t_{ref})$. We choose $t_{ref}$ to be equal to the index load time, $t_l$ [SJLL99].

In the TPR-tree, the bounding rectangles in the tree are functions of time, as are the moving points being indexed. Intuitively, the bounding rectangles are capable of continuously following the enclosed data points or other rectangles as these move. Like the R-tree, the new index is capable of indexing points in one-, two-, and three-dimensional space [SJLL99].

To bound a group of d-dimensional moving points, d-dimensional bounding rectangles are used that are also time-parameterized, i.e., their coordinates are functions of time. A time-parameterized bounding rectangle bounds all enclosed points or rectangles at all times not earlier than the current time [SJLL99].

A trade-off exists between how tightly a bounding rectangle bounds the enclosed moving points or rectangles across time and the storage needed to capture the bounding rectangles. It would be ideal to employ time-parameterized bounding rectangles that are always minimum, but the storage cost appears to be excessive [SJLL99]. Instead of using true, always-minimum bounding rectangles, the TPR-tree employs "conservative" bounding rectangles, which are minimum at some point in time but most likely not at later times.
In the one-dimensional case, the lower bound of a conservative interval is set to move with the minimum speed of the enclosed points, while the upper bound is set to move with the maximum speed of the enclosed points (speeds are negative or positive depending on the direction). This ensures that conservative bounding intervals are indeed bounding for all times considered [SJLL99].

Figure 2: Conservative (Dashed) Versus Always Minimum (Solid) Bounding Intervals

Figure 2 illustrates conservative bounding intervals. The left-hand side of the conservative interval in the figure starts at the position of object A at time 0 and moves left at the speed of object B, and the right-hand side of the interval starts at the position of object B at time 0 and moves right at the speed of object A. The figure also shows that, in the worst case, such an interval may grow to become larger than its corresponding minimum bounding interval by up to twice its initial length. In the figure, the conservative bounding interval at time 3 has length 11, which is exactly twice its length at time 0 plus the length of its corresponding minimum bounding interval at time 3. It is worth noting that conservative bounding intervals never shrink. In the best case, when all of the enclosed points have the same velocity vector, a conservative bounding interval has constant size, although it may move [SJLL99].

Following the representation of moving points, we let $t_{ref} = t_l$ and capture a one-dimensional time-parameterized bounding interval

$[x^{\vdash}(t),\, x^{\dashv}(t)] = [x^{\vdash}(t_l) + v^{\vdash}(t - t_l),\; x^{\dashv}(t_l) + v^{\dashv}(t - t_l)]$

as $(x^{\vdash}, x^{\dashv}, v^{\vdash}, v^{\dashv})$, where

$x^{\vdash} = x^{\vdash}(t_l) = \min_i \{o_i.x^{\vdash}(t_l)\}$
$x^{\dashv} = x^{\dashv}(t_l) = \max_i \{o_i.x^{\dashv}(t_l)\}$
$v^{\vdash} = \min_i \{o_i.v^{\vdash}\}$
$v^{\dashv} = \max_i \{o_i.v^{\dashv}\}$

Here the $o_i$ range over the bounding rectangles to be enclosed. If instead the bounding interval being defined is to bound moving points, the $o_i$ range over these points, $o_i.x^{\vdash}(t_l)$ and $o_i.x^{\dashv}(t_l)$ are replaced by $o_i.x(t_l)$, and $o_i.v^{\vdash}$ and $o_i.v^{\dashv}$ are replaced by $o_i.v$ [SJLL99].
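The definition above can be illustrated with a small Python sketch for the case where the interval bounds moving points. The class and function names are hypothetical, and the two sample points are not read off Figure 2; they are chosen here so that the lengths quoted in the text (11 at time 3, equal to twice the initial length plus the minimum length at time 3) are reproduced.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Point1D:
    x: float  # position at the load time t_l
    v: float  # velocity (negative = moving left)


@dataclass(frozen=True)
class ConservativeInterval:
    x_lo: float  # min of the enclosed positions at t_l
    x_hi: float  # max of the enclosed positions at t_l
    v_lo: float  # min of the enclosed velocities
    v_hi: float  # max of the enclosed velocities

    def at(self, dt: float) -> Tuple[float, float]:
        """Interval bounds dt time units after t_l (dt >= 0)."""
        return (self.x_lo + self.v_lo * dt, self.x_hi + self.v_hi * dt)


def bound(points: List[Point1D]) -> ConservativeInterval:
    """Conservative bounding interval: extreme positions paired with extreme speeds."""
    return ConservativeInterval(
        x_lo=min(p.x for p in points), x_hi=max(p.x for p in points),
        v_lo=min(p.v for p in points), v_hi=max(p.v for p in points),
    )


def minimum_interval(points: List[Point1D], dt: float) -> Tuple[float, float]:
    """Always-minimum interval at time t_l + dt, recomputed from the points."""
    pos = [p.x + p.v * dt for p in points]
    return (min(pos), max(pos))


a = Point1D(x=0.0, v=1.5)   # starts on the left, moves right
b = Point1D(x=2.0, v=-1.5)  # starts on the right, moves left
ci = bound([a, b])
lo, hi = ci.at(3.0)
m_lo, m_hi = minimum_interval([a, b], 3.0)
print(hi - lo)       # 11.0: conservative length at time 3
print(m_hi - m_lo)   # 7.0: minimum length at time 3, so 11 = 2 * 2 + 7
```

The conservative interval needs only four numbers regardless of how many points it encloses, which is the storage advantage the text refers to; the price is the extra length once the points have crossed.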
The rectangles defined above are termed load-time bounding rectangles and are bounding for all times not before $t_l$. Because the rectangles never shrink, but may actually grow too much, it is desirable to be able to adjust them occasionally. As the index is only queried for times greater than or equal to the current time, it is attractive to adjust the bounding rectangles every time any of the moving points or rectangles that they bound are updated. The following formulas specify the adjustments to the bounding rectangles that may be made during updates.

$x^{\vdash} = \min_i \{o_i.x^{\vdash}(t_{upd})\} - v^{\vdash}(t_{upd} - t_l)$
$x^{\dashv} = \max_i \{o_i.x^{\dashv}(t_{upd})\} - v^{\dashv}(t_{upd} - t_l)$

Here, $t_{upd}$ is the time of the update, and the formulas may be restricted to apply to the bounding of points rather than intervals, as above. Each formula involves five terms, which may differ by orders of magnitude. Special care must be taken to manage the rounding errors that may occur in the finite-precision floating-point arithmetic used for implementing the formulas.

We call these rectangles update-time bounding rectangles. The two types of bounding rectangles are shown in Figure 3. The bold top and bottom lines capture the load-time, time-parameterized bounding interval for the four moving objects represented by the four lines. At time $t_{upd}$, a narrower and thus better update-time interval is introduced that is bounding from $t_{upd}$ onwards [SJLL99].

Figure 3: Load-Time (Bold) and Update-Time (Dashed) Bounding Intervals for Four Moving Points

Figure 4: Intersection of an Interval and a Query

Querying

Queries may retrieve objects based on their positions at any future time point. But because the positions predicted at query time become less and less accurate as queries move into the future, and because updates not known at query time may occur, queries far in the future are likely to be of little value [SJLL99]. The queries retrieve all points with positions within specified regions.
We distinguish between three kinds of queries, based on the regions they specify.

Type 1 timeslice query: $Q = (R, t)$ specifies a hyper-rectangle R located at time point t.

Type 2 window query: $Q = (R, t^{\vdash}, t^{\dashv})$ specifies a hyper-rectangle R that covers the interval $[t^{\vdash}, t^{\dashv}]$. Stated differently, this query retrieves points with trajectories in (x,t)-space crossing the (d+1)-dimensional hyper-rectangle $([a_1^{\vdash}, a_1^{\dashv}], [a_2^{\vdash}, a_2^{\dashv}], \ldots, [a_d^{\vdash}, a_d^{\dashv}], [t^{\vdash}, t^{\dashv}])$.

Type 3 moving query: $Q = (R_1, R_2, t^{\vdash}, t^{\dashv})$ specifies the (d+1)-dimensional trapezoid obtained by connecting $R_1$ at time $t^{\vdash}$ to $R_2$ at time $t^{\dashv}$.

Answering a timeslice query proceeds as for the regular R-tree, the only difference being that all bounding rectangles are computed for the time $t_q$ specified in the query before intersection is checked. Thus, a bounding interval specified by $(x^{\vdash}, x^{\dashv}, v^{\vdash}, v^{\dashv})$ satisfies a query $(([a^{\vdash}, a^{\dashv}]), t_q)$ if and only if

$a^{\vdash} \le x^{\dashv} + v^{\dashv}(t_q - t_l) \;\wedge\; a^{\dashv} \ge x^{\vdash} + v^{\vdash}(t_q - t_l)$.

To answer window queries and moving queries, we need to be able to check if, in (x,t)-space, the trapezoid of a query intersects with the trapezoid formed by the part of the trajectory of a bounding rectangle that is between the start and end times of the query. With one spatial dimension, this is relatively simple. For more dimensions, generic polyhedron-polyhedron intersection tests may be used, but due to the restricted nature of this problem, a simpler and more efficient algorithm may be devised [SJLL99].

Specifically, we provide an algorithm for checking if a d-dimensional time-parameterized bounding rectangle R given by parameters $(x_1^{\vdash}, x_1^{\dashv}, x_2^{\vdash}, x_2^{\dashv}, \ldots, x_d^{\vdash}, x_d^{\dashv}, v_1^{\vdash}, v_1^{\dashv}, v_2^{\vdash}, v_2^{\dashv}, \ldots, v_d^{\vdash}, v_d^{\dashv})$ intersects a moving query $Q = (([a_1^{\vdash}, a_1^{\dashv}], [a_2^{\vdash}, a_2^{\dashv}], \ldots, [a_d^{\vdash}, a_d^{\dashv}], [w_1^{\vdash}, w_1^{\dashv}], [w_2^{\vdash}, w_2^{\dashv}], \ldots, [w_d^{\vdash}, w_d^{\dashv}]), t^{\vdash}, t^{\dashv})$. This formulation of a moving query as a time-parameterized rectangle with starting and ending times is more convenient than the definition given earlier.
The velocities w are obtained by subtracting $R_1$ from $R_2$ in the earlier definition and then normalising the result by the length of the interval $[t^{\vdash}, t^{\dashv}]$ [SJLL99].

The algorithm is based on the observation that for two moving rectangles to intersect, there has to be a time point when their extents intersect in each dimension. Thus, for each dimension j (j = 1, 2, ..., d), the algorithm computes the time interval $I_j = [t_j^{\vdash}, t_j^{\dashv}] \subseteq [t^{\vdash}, t^{\dashv}]$ when the extents of the rectangles intersect in that dimension. If $I = \bigcap_{j=1}^{d} I_j = \emptyset$, the moving rectangles do not intersect and an empty result is returned; otherwise, the algorithm provides the time interval I when the rectangles intersect. The interval endpoints for each dimension are obtained by intersecting the boundary lines of Q and R.

To see how $t_j^{\vdash}$ and $t_j^{\dashv}$ are computed, consider the case where Q is below R at $t^{\dashv}$. Then Q must not be below R at $t^{\vdash}$, as otherwise Q is always below R and there is no intersection (the case of no intersection is already accounted for). This means that the line $a_j^{\dashv} + w_j^{\dashv}(t - t^{\vdash})$ intersects the line $x_j^{\vdash}(t^{\vdash}) + v_j^{\vdash}(t - t^{\vdash})$ within the time interval $[t^{\vdash}, t^{\dashv}]$. Solving for t gives the desired intersection time, which in this case is $t_j^{\dashv} = t^{\vdash} + \dfrac{x_j^{\vdash}(t^{\vdash}) - a_j^{\dashv}}{w_j^{\dashv} - v_j^{\vdash}}$ [SJLL99].

Heuristics for Tree Organization

The values of three problem parameters affect the indexing problem and the qualities of a TPR-tree. The first specifies exactly how far into the future queries may reach. The second specifies for how long the index is to remain functional. The third is simply the sum of the first two. In the following, is(Q) denotes the time at which query Q is issued.

Querying window (W): how far queries can "look" into the future. Thus, $is(Q) \le t \le is(Q) + W$ for Type 1 queries, and $is(Q) \le t^{\vdash} \le t^{\dashv} \le is(Q) + W$ for queries of Type 2 and 3.

Index usage time (U): the time interval during which an index will be used. Thus, $t_l \le is(Q) \le t_l + U$, where $t_l$ is the index creation or bulkloading time. After $t_l + U$, the index is considered obsolete.

Time horizon (H): the time interval from which all times ($t$, $t^{\vdash}$, $t^{\dashv}$) specified in queries are drawn.
The time horizon for an index is the index usage time plus the querying window, i.e. H = U + W.

Figure 5: Time Horizon H, Index Usage Time U, and Querying Window W

As a precursor to designing the (dynamic) insertion and bulkloading algorithms for the TPR-tree, we discuss how to group moving objects into nodes so that the tree most efficiently supports timeslice queries when assuming a time horizon H. The objective is to identify principles, or heuristics, that apply to both dynamic insertions and bulkloading, and to any number of dimensions. The goal is to obtain a versatile index.

Assuming the timeslice queries to be uniformly distributed between time $t_l$ (the bulkloading time) and $t_l + H$, and keeping in mind that the moving points are represented by linear trajectories, it seems intuitive to bulkload the index based on the projected positions of the moving points at time $t_l + H/2$. Indeed, this would be a promising approach if always-minimum bounding rectangles were employed, but it does not work for conservative bounding rectangles. This insight is illustrated for one-dimensional space in Figure 2, where points A and B are placed in the same bounding interval due to their proximity at time H/2.

Although the idea presented above is not useful for bulkloading, one may wonder if it is useful in the tree's insertion algorithm. The idea would be to compute the area, margin, and other characteristics of bounding rectangles relevant to the insertion algorithm as of H/2 time units after the time of the insertion. However, this approach has the problem that for more than one dimension, the area of a bounding rectangle does not grow linearly with time. A different approach is necessary [SJLL99].

It is clear that when H is close to zero, the tree may simply use the usual R-tree insertion and bulkloading algorithms. The movement of the point objects and the growth of the bounding rectangles become irrelevant; only their initial positions and extents matter.
In contrast, when H is large, grouping the moving points according to their velocity becomes important because it is desirable that the bounding rectangles are as small as possible at all times in $[t_l, t_l + H]$, and how fast a bounding rectangle grows depends on its "velocity extents". (In one-dimensional space, the velocity extent of a bounding interval is equal to $v^{\dashv} - v^{\vdash}$.) This leads to the following general approach.

The insertion and bulkloading algorithms of the R*-tree, which we consider extending to moving points, aim to minimize objective functions such as the areas of the bounding rectangles, their margins (perimeters), and the overlap among the bounding rectangles. In our context, these functions are time dependent, and we should consider their evolution in $[t_l, t_l + H]$. Specifically, given an objective function A(t), the following integral should be minimized:

$\int_{t_l}^{t_l + H} A(t)\, dt \qquad (1)$

If A(t) is area, the integral computes the area of the trapezoid that represents part of the trajectory of a bounding rectangle in (x,t)-space.

We have so far assumed that the times $t_q$ of the timeslice queries $(R, t_q)$ are distributed uniformly across $[t_l, t_l + H]$. Perhaps most prominently, this occurs if the times when queries are issued are uniformly distributed and the $t_q$'s of the queries are always equal to the times when they are issued, i.e., W = 0. If queries are issued uniformly, but W > 0, or window queries and moving queries are issued, the probability that a time point in the middle of the interval $[t_l, t_l + H]$ will be in a query is higher than that of a time point near $t_l$ or $t_l + H$. If p(t) is the probability that time point t is in some query, then the integral $\int_{t_l}^{t_l + H} p(t) A(t)\, dt$ should be minimized [SJLL99].
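To make the area integral concrete, the following Python sketch evaluates it exactly for a d-dimensional conservative bounding rectangle. It is an illustration only, under two assumptions that go beyond the text: query times are uniform (p(t) is constant and dropped), and $t_l = 0$. The function name `area_integral` and the input format are hypothetical. Each side of the rectangle has length (initial extent) + (velocity extent) · t, so the area is a polynomial of degree d in t, which also shows why the area does not grow linearly for more than one dimension.

```python
from typing import List, Tuple


def area_integral(extents: List[Tuple[float, float]], H: float) -> float:
    """Integral over [0, H] of prod_j (length_j + rate_j * t).

    extents holds one (length_j, rate_j) pair per dimension, where length_j
    is the side length at load time and rate_j is the velocity extent
    (upper-bound speed minus lower-bound speed) in that dimension.
    """
    coeffs = [1.0]  # polynomial in t, lowest degree first
    for length, rate in extents:
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] += c * length    # multiply by the constant term
            nxt[i + 1] += c * rate  # multiply by the linear term
        coeffs = nxt
    # Integrate term by term: c * t^i  ->  c * H^(i+1) / (i+1)
    return sum(c * H ** (i + 1) / (i + 1) for i, c in enumerate(coeffs))


# One dimension: the trapezoid area, length * H + rate * H^2 / 2
print(area_integral([(3.0, 2.0)], 2.0))              # 3*2 + 2*4/2 = 10.0
# Two dimensions: the area (1 + t)^2 is quadratic in t
print(area_integral([(1.0, 1.0), (1.0, 1.0)], 1.0))  # 7/3
```

An insertion heuristic in the spirit of the text would compare such integrals for candidate bounding rectangles instead of comparing their areas at a single time point.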
Insertion and Deletion

The insertion algorithm of the R*-tree employs functions that compute the area of a bounding rectangle, the intersection of two bounding rectangles, the margin of a bounding rectangle (when splitting a node), and the distance between the centers of two bounding rectangles (used when doing forced reinsertions). The TPR-tree's insertion algorithm is the same as that of the R*-tree, with one exception: instead of the functions mentioned here, integrals of those functions, as in Formula 1, are used [SJLL99].

Computing the integrals of the area, margin, and distance is relatively straightforward. The algorithm that computes the integral of the intersection of two time-parameterized rectangles is an extension of the algorithm for checking if such rectangles overlap. At each time point when the rectangles intersect, the intersection is a rectangle and, in each dimension, the upper (lower) bound of this rectangle is defined by the upper (lower) bound of one of the two intersecting rectangles [SJLL99].

The algorithm thus divides the time interval returned by the overlap-checking algorithm into consecutive time intervals so that, during each of these, the intersection is defined by a time-parameterized rectangle. The intersection area integral is thus computed as a sum of area integrals [SJLL99].

The parameter H = U + W was defined based on a static setting, and for static data. In a dynamic setting, W remains a component of H, which is the length of the time period over which integrals are computed in the insertion algorithm. How large the other component of H should be depends on the update frequency. If this is high, the effect of an insertion on the tree will not persist long and, thus, H should not exceed W by much [SJLL99].

The introduction of the integrals is the most important step in rendering the R*-tree insertion algorithm suitable for the TPR-tree, but one more aspect of the R*-tree algorithm must be revisited.
The R*-tree split algorithm selects one distribution of entries between two nodes from a set of candidate distributions, which are generated based on sortings of point positions along each of the coordinate axes. In the TPR-tree split algorithm, moving point (or rectangle) positions at different time points are used when sorting. With load-time bounding rectangles, positions at $t_l$ are used, and with update-time bounding rectangles, positions at the current time are used [SJLL99].

Finally, in addition to sortings along spatial dimensions, the split algorithm is extended to also consider sortings along the velocity dimensions, i.e., sortings obtained by sorting on the coordinates of the velocity vectors. The rationale is that distributing the moving points based on the velocity dimensions may result in bounding rectangles with smaller "velocity extents", which consequently grow more slowly.

Deletions in the TPR-tree are performed as in the R*-tree. If a node becomes underfull, it is eliminated and its entries are reinserted [SJLL99].

Bulkloading the Tree

The bulkloading algorithm presented here attempts to minimize the area integrals of the tree's time-parameterized bounding rectangles across $[t_l, t_l + H]$. Without loss of generality, we let $t_l = 0$. We also assume one-dimensional, uniform moving point data. More precisely, if the one-dimensional moving points are represented as two-dimensional points in $(x(t_{ref}), v)$-space, we assume that they are uniformly distributed in a rectangular region with extents S and V. Packing these points into tree nodes corresponds to partitioning this region into bounding rectangles. Due to the uniformity, we choose all bounding rectangles to be equal. The important parameter, which we need to determine, is then the ratio between the velocity extents and the reference-position (or spatial) extents of the bounding rectangles. For example, Figure 6 illustrates two different partitionings of a region [SJLL99].
Figure 6: Two Subdivisions of a Data Region in $(x(t_{ref}), v)$-Space and the Evolution of the Corresponding Intervals in (x,t)-Space.

Partitioning a) equally prioritizes position and velocity, while Partitioning b) completely ignores velocity and packs data points according to position only. To compare the two partitionings with respect to different values for H, we consider the trapezoids in (x,t)-space that correspond to the bottom-left and bottom partitions in the two partitionings. For H = 1, the areas of the trapezoids are 2.25 versus 1.5 for the two partitionings. But for H = 5, the areas are 16.25 versus 17.5. It is not difficult to see that although the trapezoids that correspond to other partitions are different, their areas are equal to those of the two partitions considered. Thus, for small values of H, Partitioning b) is best, and for large values of H, Partitioning a) is best [SJLL99].

The partitioning parameter, the velocity-space aspect ratio α, is a function of H. To determine α(H), let the extents of a bounding rectangle of a partition in $(x(t_{ref}), v)$-space be $(s, \alpha s)$. Then, if the number of rectangles in a partitioning (which is also the number of nodes in the leaf level of a tree) is k, the equation $k = (S/s) \cdot (V/(\alpha s))$ holds, meaning that $s = \sqrt{SV/(k\alpha)}$. Knowing s, the length of a bounding interval at time t is $A(t) = s + \alpha s \cdot t = s(1 + \alpha t)$. To find α, we express the area integral of the interval,

$\int_0^H A(t)\, dt = s\left(H + \frac{\alpha H^2}{2}\right) = \sqrt{\frac{SV}{k}} \left(\frac{H}{\sqrt{\alpha}} + \frac{\sqrt{\alpha}\, H^2}{2}\right)$,

and set its derivative with respect to α to zero. Solving gives $\alpha = 2/H$. This result confirms that the larger the time horizon H, the smaller α should be, i.e., the narrower the bounding rectangles should be in the velocity dimension. Note also that α is independent of parameters such as the extents of the data set and the number of nodes [SJLL99].

To actually achieve bounding rectangles that have a velocity-space aspect ratio close to α, we use an adapted version of the STR-tree packing algorithm [LEL97].
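The numbers in this subsection can be checked with a short Python sketch. The function names are hypothetical, and the data-region values S = 4, V = 1 and k = 4 are assumptions inferred here from the quoted trapezoid areas (they are not stated in the text): with them, Partitioning a) has rectangles of extent (2, 0.5), i.e. α = 0.25, and Partitioning b) has rectangles of extent (1, 1), i.e. α = 1.

```python
import math


def trapezoid_area(s: float, v_ext: float, H: float) -> float:
    """Area integral over [0, H] of an interval of length s + v_ext * t."""
    return s * H + v_ext * H ** 2 / 2.0


def leaf_area(alpha: float, H: float, S: float, V: float, k: int) -> float:
    """Area integral of one of k equal partitions with aspect ratio alpha."""
    s = math.sqrt(S * V / (k * alpha))  # spatial extent of one partition
    return trapezoid_area(s, alpha * s, H)


S, V, k = 4.0, 1.0, 4  # assumed data region and number of leaf nodes

for H in (1.0, 5.0):
    a = leaf_area(0.25, H, S, V, k)  # Partitioning a): extents (2, 0.5)
    b = leaf_area(1.0, H, S, V, k)   # Partitioning b): extents (1, 1)
    print(H, a, b)                   # H=1: 2.25 vs 1.5, H=5: 16.25 vs 17.5

# The analytic optimum alpha = 2/H beats nearby aspect ratios.
H = 5.0
best = 2.0 / H
print(leaf_area(best, H, S, V, k) < leaf_area(0.8 * best, H, S, V, k))  # True
print(leaf_area(best, H, S, V, k) < leaf_area(1.2 * best, H, S, V, k))  # True
```

For H = 5 the optimum α = 0.4 lies between the two partitionings of Figure 6, and neither of them is optimal; Partitioning a) is merely the closer of the two.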
Performance Experiments - Experiment Setup And Workload Generation

The implementation of the TPR-tree used in the experiments is based on the Generalized Search Tree Package, GiST [HNP95]. The page size (and tree node size) is set to 4k bytes, which results in 204 and 146 entries per leaf node for two- and three-dimensional data, respectively. A page buffer of 200k bytes, i.e., 50 pages, is used [LL98], where the root of a tree is pinned and the least-recently-used page replacement policy is employed. The nodes changed during an index operation are marked as "dirty" in the buffer and are written to disk at the end of the operation or when they otherwise have to be removed from the buffer.

The performance studies are based on workloads that intermix queries and update operations on the index, thus simulating index usage across a period of time. In addition, each workload initially bulkloads the index [SJLL99].

As moving point data where the positions and velocities of the objects are uniformly distributed seems rather unrealistic, we attempt to generate more realistic two-dimensional data by simulating a scenario where the objects, e.g., cars, move in a network of routes, e.g., roads, connecting a number of destinations, e.g., cities. In addition to simulating cars moving between cities, the scenario is also motivated by the fact that usually, even if there is no underlying infrastructure, moving points tend to have destinations. For example, fishing boats follow schools of fish or return to ports [SM99].

The TPR-tree insertion algorithm depends on the parameter H, which is equal to W plus some duration that is dependent on the frequency of updates. Figures 7 and 8 show the results of experiments with different settings of H, where UI is the update interval length. The horizontal axes correspond to the part of parameter H that should depend on the frequency of updates. Curves are shown for experiments with different querying windows W. The graphs demonstrate a pattern, namely that the best values of H lie between UI/2 + W and UI + W.
This is not surprising. In UI/2 time units, approximately half of the entries of each leaf node in the tree are updated, and after UI time units, almost all entries are updated. Note also the difference in average search disk access numbers in Figures 7 and 8. A higher update rate (a smaller UI) means tighter bounding rectangles and, thus, better query performance [SJLL99].

Figure 7: Search Performance for UI = 60 and Varying Settings of H

Figure 8: Search Performance for UI = 120 and Varying Settings of H

A set of experiments with varying workloads was performed in order to compare the relative performance of the R-tree, the TPR-tree with load-time bounding rectangles, and the TPR-tree with update-time bounding rectangles [SJLL99].

Figure 9 shows the average number of I/O operations per query for the three indices when the number of destinations in the simulation is varied. As shown, increased skew leads to a decrease in the number of I/Os for all three approaches, especially for the TPR-tree. This is expected because when there are more objects with similar velocities, it is easier to pack them into bounding rectangles that have small velocity extents and also are not too big in the spatial dimensions. The figure also demonstrates that the TPR-tree is an order of magnitude better than the R-tree [SJLL99].

Figure 9: Search Performance for Varying Numbers of Destinations and Uniform Data

Figure 10: Search Performance for Varying W

The study indicates quite clearly that the TPR-tree is indeed capable of supporting queries on moving objects efficiently and that it outperforms the competitors it was compared with. The study also demonstrates that the tree does not degrade severely as time passes (see Figure 11).

Figure 11: Degradation of Search Performance with Time

Summary

The TPR-tree is a versatile adaptation of the R*-tree that supports the efficient querying of the current and anticipated future locations of moving points in one-, two-, and three-dimensional space.
Whereas the R*-tree's algorithms use functions that compute the areas, margins, and overlaps of bounding rectangles, the TPR-tree employs integrals of these functions, thus taking into consideration the values of these functions across the time when the tree is queried.

2.2 The B-tree

The B-tree is a widely used technique for organizing a file and its index. In fact, it is the standard organization for indices in a database system [Com79]. There are several variants of the B-tree (e.g. B*-trees, B+-trees, Prefix B+-trees, Virtual B+-trees), and the query approximation algorithm of Kollios et al. [KGT99] is based on the use of several B+-trees.

In the late 1960s, computer manufacturers and independent research groups competitively developed general-purpose file systems and so-called "access methods" for their machines. Building on this work, Bayer and McCreight in 1972 proposed an external index mechanism with relatively low cost for most basic operations and called it the B-tree [Com79]. The B-tree is a generalization of the binary search tree in which more than two paths leave a given node [Com79][RSV02]. The beauty of B-trees lies in the methods for inserting and deleting records. The result is a balanced tree with all leaves at the same depth.

2.2.1 The B+-tree

In a B+-tree, all keys reside in the leaves. The upper levels, which are organized as a B-tree, consist only of an index to enable rapid location of the index and key parts [Com79]. The result is a separation of the index and the key parts. Index nodes and leaf nodes may of course have different formats or sizes. Furthermore, leaf nodes are usually linked together left-to-right as a sequence set (Figure 12). This sequence set allows easy sequential processing [Com79].

Figure 12: A B+-tree with separate index and key parts. Operations "by key" begin at the root as in the B-tree; sequential processing begins at the leftmost leaf.
(From [Com79])

B-trees, which support low-cost find, insert, and delete operations, may require log n accesses to secondary storage to process a next operation. The B+-tree implementation retains the logarithmic cost properties for operations by key [Com79][RSV02], but gains the advantage of requiring at most 1 access to satisfy a next operation [Com79]. Furthermore, during the sequential processing of a file, no node will be accessed more than once, so space for only 1 node needs to be available in main memory. Therefore, B+-trees are well suited to applications which entail both random and sequential processing [Com79].

2.2.1.1.1 Query approximation method (Hough-Y transformation)

Kollios et al. [KGT99] use a query approximation algorithm with multiple B+-trees to index moving objects in the dual transform model. The query region is defined by the intersection of two half-plane queries (Figure 13). Since access methods are more efficient for rectangle queries, the article approximates the simplex query with a rectangular one. Thus, the query area is enlarged by the area E = E1 + E2. E should be minimized since it represents a measure of the extra I/Os that the algorithm will have to perform for solving a one-dimensional MOR query.

Figure 13: Query in the dual Hough-Y plane.

Kollios et al. [KGT99] propose to keep c indices (where c is a small constant) at equidistant yr's. All c indices contain the same information about the objects, but use different yr's. If a query is executed at a single observation index, the area E becomes large. In order to bound E, each subterrain is indexed. Each of the c subterrain indices records the time interval when a moving object was in the subterrain. In this way, the query is decomposed into a collection of smaller subqueries. A given one-dimensional MOR query will be forwarded to the indices that minimize E.
Since all 2-dimensional approximate queries have the same rectangle side (Figure 13), the rectangle range search is equivalent to a simple range search on the b coordinate axis. Thus, each of the c "observation" indices can simply be a B+-tree.

2.3 The KD-tree approach

The KD-tree is a binary tree, specially designed for indexing multidimensional points. It was designed by Bentley in 1975. Its structure is k-dimensional, thus the name kd-tree.

2.3.1.1.1 Point access method (Hough-X transformation)

The Hough-X transformation (Figure 14) [KGT99] uses index structures based on R-trees or, even better, kd-trees for indexing moving objects. By using an algorithm for answering simplex range queries, created by Goldstein et al. [GRSY97], we can answer the MOR query in the dual space.

Figure 14: Query in the dual Hough-X plane. (Labels: axes a and v, with vmin and vmax marked on v and yq1 and yq2 marked on a; points O5, O6, O7, O8.)

Kollios et al. [KGT99] argue that an index structure based on kd-trees is more suitable than a method based on R-trees, because a kd-tree based method will use both dimensions to split, while the R-tree will only split in one dimension. So the kd-tree will perform better for the MOR query.

There are several advantages of using multi-attribute trees, like the kd-tree [Sal91]:
- Good space utilization in both index and data nodes
- High fan-out (the index should be significantly smaller than the data collection)
- Fast exact match search (given the coordinates, the data should be obtained quickly)
- Fair clustering in data pages by all attributes for good range search performance
- Easy integration with the query, locking, and recovery systems of existing DBMSs
- Simple design for incremental growth and shrinkage (insertion and deletion algorithms)

But there are also some common drawbacks which are typical for multidimensional binary trees [HSW89]:
- Multidimensional binary trees may become unbalanced, i.e.
may contain long paths with almost no branches
- No suitable method for paging a multidimensional binary tree is known

2.4 Variants of the KD-tree

We introduce two different KD-trees: the LSD-tree and the hBΠ-tree. Both of them have been introduced as suitable for indexing mobile objects [KGT99].

2.4.1 The hBΠ-tree

The hBΠ-tree was introduced by Evangelidis et al. [ELS97]. It is a combination of the hB-tree, a multi-attribute index, and the Π-tree, an abstract index which offers efficient concurrency and recovery methods. The reasons for combining these trees were the need for an efficient tree structure suitable for indexing multi-dimensional applications, which would perform well for all kinds of data distributions, and the fact that the hB-tree is fairly insensitive to increases in dimension [ELS97].

A kd-tree node always stores the value of exactly one attribute. Thus, the size of a kd-tree node (and, consequently, the size of the kd-trees that reside in the hB-tree nodes) does not depend on the number of indexing attributes. However, the hB-tree node stores its own boundaries, and an increase in the number of dimensions will affect the space required to store a node's boundaries. But this additional space is not significant for large page sizes. With a page size of 1K bytes and larger, there is almost no effect on the size of the hB-tree and the node space utilization as the dimensions increase. In contrast, in the R-tree, the size of the index is proportional to the dimension of the space. Experiments with various types and distributions of data show that even the most restrictive versions of the hB-tree, which do not offer worst-case storage utilization and index term size guarantees, perform very well in terms of storage utilization, index size, exact-match and range searching [ELS97].

To understand the structure of the hBΠ-tree, we have to know more about the Π-tree and the hB-tree.
2.4.1.1 Facts about the Π-tree

A node in the Π-tree can be directly responsible for some part of the space, but it can also delegate responsibility for part of the space to its sibling nodes. Its pointers to sibling nodes are called side pointers. In the Π-tree it is possible for a node to be referred to by more than one parent; the child is then called a multi-parent node.

2.4.1.2 Facts about the hB-tree

The hB-tree consists of index nodes and data nodes [LS90]. Index nodes contain kd-trees with information about children on the next lower level of the hB-tree and about regions which have been extracted and transferred to siblings on the same level. Data nodes contain the actual data records. They may contain kd-trees. Each node stores a description of the space it is responsible for, by using two attributes, called the boundaries of the hB-tree. Unlike other multi-attribute indexes that split nodes by hyperplanes, the hB-tree can use more than one attribute to describe the extracted region.

2.4.2 How to make the hB-tree a Π-tree

We make the hB-tree into a Π-tree by using some transformations [ELS97]. We adopt side pointers from the Π-tree. This eases the searching. We also modify the algorithm for splitting a node at its kd-tree root. In short, when there is a split at the root, we keep the kd-tree root in the original node and simply extract the appropriate kd-subtree, which then becomes the kd-tree of the new hB-tree node. To be able to support node consolidation, we make further structural changes. We make it possible to determine the containment order of the children of a node N and whether a child node of N is multi-parent or not.

The main innovations of the hBΠ-tree compared to the hB-tree are the introduction of side pointers and the fact that node splitting and index term posting are performed by separate actions.

There has not been very much research on the performance of the hB-tree used for indexing mobile objects. Kollios et al.
[KGT99] is

2.5 The LSD-tree

The Local Split Decision tree (LSD-tree) was introduced in 1989 by Henrich et al. [HSW89] as a data structure supporting efficient spatial access to geometric objects. It performs well for all reasonable data distributions, cover quotients, and bucket capacities, and maintains multidimensional points as well as arbitrary geometric objects. The LSD-tree is extremely suitable for the implementation of spatial access paths in geometric databases [HSW89].

Like other tree structures, the LSD-tree partitions the data space into pairwise disjoint cells with associated buckets of fixed size. However, in contrast to the grid file and other structures, the LSD structure is not grid oriented. Splits occur at arbitrary positions, in what [HSW89] calls locally optimal positions. The split positions are optimal with respect only to the cell to be split and are independent of other existing cell boundaries.

After a certain number of insertions the initial bucket has been filled, and an attempt to insert an additional object causes the need for a bucket split. For this purpose, a split line is determined and the objects on one side of the split line are stored in one bucket, while those on the other side are stored in another bucket. After some further insertions, the capacity of another bucket will be exceeded. In this case, a split line in the corresponding bucket region is determined, thus splitting this region into two subregions. This process is repeated each time the capacity of a bucket is exceeded.

The split lines of the LSD-tree are maintained in a directory which is a generalized kd-tree. For each split, a new node containing the position and the dimension of the split line is inserted into the directory tree. The leaves of the directory tree reference the buckets in which the actual objects are stored.
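The bucket-splitting process described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm of [HSW89]: the split dimension is taken to be the one with the largest extent in the overflowing bucket, the split position is the mean coordinate in that dimension, and the whole directory is kept in main memory. The class names Bucket, SplitNode and LSDTreeSketch are hypothetical.

```python
class Bucket:
    """Leaf of the directory: holds the actual points."""

    def __init__(self, points=None):
        self.points = points if points is not None else []


class SplitNode:
    """Directory node: stores the dimension and position of one split line."""

    def __init__(self, dim, pos, left, right):
        self.dim, self.pos, self.left, self.right = dim, pos, left, right


class LSDTreeSketch:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.root = Bucket()

    def insert(self, point):
        # Walk the kd-tree directory down to the bucket responsible for point.
        parent, side, node = None, None, self.root
        while isinstance(node, SplitNode):
            parent = node
            side = "left" if point[node.dim] < node.pos else "right"
            node = getattr(node, side)
        node.points.append(point)
        if len(node.points) > self.capacity:
            replacement = self._split(node)
            if parent is None:
                self.root = replacement
            else:
                setattr(parent, side, replacement)

    def _split(self, bucket):
        pts = bucket.points
        # Assumed rule: split the dimension with the largest extent ...
        dim = max(
            range(len(pts[0])),
            key=lambda d: max(p[d] for p in pts) - min(p[d] for p in pts),
        )
        # ... at the mean coordinate; only this bucket's contents are used.
        pos = sum(p[dim] for p in pts) / len(pts)
        left = [p for p in pts if p[dim] < pos]
        right = [p for p in pts if p[dim] >= pos]
        if not left or not right:
            return bucket  # e.g. all points identical: leave the bucket oversized
        return SplitNode(dim, pos, Bucket(left), Bucket(right))
```

The split decision looks only at the contents of the overflowing bucket, which mirrors the local nature of the split decisions, and each split adds exactly one directory node holding a dimension and a position.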
When the directory grows to a point where it cannot be kept in main memory any longer, subtrees of the directory are stored on secondary memory, while the part of the directory near the root remains in main memory [Hen96].

The LSD-tree overcomes the typical drawbacks of multidimensional binary trees by using a special paging algorithm. When an LSD-tree grows to a size where it cannot be kept in main memory any longer, the paging algorithm determines a subtree to be paged out to secondary storage. The main memory will then be emptied and made ready for additional input.

For storing rectangles in the LSD-tree, we use a transformation technique. Thus 2D-rectangles are stored as 4D-points. This is a simple idea, used in all kd-trees, but some problems do arise. First, there is a strong correlation between the lower and upper bounds of a rectangle, because the upper bound is always greater than (or equal to) the lower bound. Therefore all the points are located in a triangular subspace. Furthermore, since the rectangles usually are small compared to the data space, the points are located in a small strip above the diagonal. The LSD-tree overcomes the drawbacks of the transformation technique by using one of two bucket split strategies: the data dependent split strategy and the space dependent strategy.

Figure 15: Split positions achieved by two basic split strategies

The data dependent strategy chooses the average over all the coordinates in the bucket to be split, including the object to be inserted, as split position.

The space dependent strategy is based on two assumptions. The first one is that all rectangles are degenerated to points, i.e. the upper and the lower bounds coincide for each dimension. Hence, all points are located on the diagonal. A suitable split position will split the data into two cells containing equally long parts of the diagonal. In Figure 15, the split position is named SP1.
The second assumption is that all points are uniformly distributed over the triangular subspace. A suitable split strategy halves the data cell into two cells of equal area. In Figure 15, the split position is named SP2. The final split position SP is calculated as a weighted sum of the positions from the two split strategies.

A performance study made by Henrich et al. [HSW89] shows that the size of the directory does not depend on the data distribution but on the split strategy (and of course on the size of the data set and the bucket capacity). With insertion of unsorted data, the data dependent strategy performs significantly better than the distribution dependent strategy, while with sorted data, the distribution dependent solution is best.

2.6 The quadtree approach

The idea common to all quadtree variations is the recursive decomposition of indexed space. When a quadrant is split, four sub-quadrants are created. Our main interest is the region quadtree, and particularly the PMR quadtree, which is the quadtree-based indexing structure for line segments.

The idea of the PMR quadtree is to store information about a line segment in every quadrant of the underlying space that it crosses [TUW98]. The data space is partitioned until no more than B lines cross a single quadrant. B is called the bucket size, and will typically be equal to the number of data records that fit in a single disk page. The PMR tree is quite similar to the point region (PR) quadtree. The difference is in the semantics of a data point, which makes a split involve more than a simple distribution of points over the four subquadrants.

In the quadtree approach, as well as in the other ways of indexing mobile objects, the indexing of mobile objects is based on an equation of motion, f(t) = at + b, where a is the speed of the object (the slope) and b is the intercept. Thus index records consist of an object ID, the intercept b and the slope a.
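As a small illustration of these index records, the sketch below stores a trajectory as <ID, a, b> and determines which of the four subquadrants of a rectangular region in the (t, f(t)) plane the trajectory crosses, i.e. the quadrants in which a PMR quadtree would keep information about it. It is a sketch under assumptions, not code from [TUW98]: the names MotionRecord, crosses and crossed_subquadrants are hypothetical, the quadrant labels SW, SE, NW and NE are chosen here for readability, and quadrants are treated as closed rectangles, so touching a boundary counts as crossing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionRecord:
    obj_id: str
    a: float  # slope (speed of the object)
    b: float  # intercept (position at t = 0)

    def position_at(self, t: float) -> float:
        """Evaluate the equation of motion f(t) = a*t + b."""
        return self.a * t + self.b


def crosses(rec, t_lo, t_hi, y_lo, y_hi):
    """True if the line f(t) passes through the closed rectangle."""
    f0, f1 = rec.position_at(t_lo), rec.position_at(t_hi)
    # f is linear, so its values over [t_lo, t_hi] lie between f0 and f1.
    return min(f0, f1) <= y_hi and max(f0, f1) >= y_lo


def crossed_subquadrants(rec, t_lo, t_hi, y_lo, y_hi):
    """Names of the four subquadrants of the region that the trajectory crosses."""
    t_mid = (t_lo + t_hi) / 2
    y_mid = (y_lo + y_hi) / 2
    quadrants = {
        "SW": (t_lo, t_mid, y_lo, y_mid),
        "SE": (t_mid, t_hi, y_lo, y_mid),
        "NW": (t_lo, t_mid, y_mid, y_hi),
        "NE": (t_mid, t_hi, y_mid, y_hi),
    }
    return [name for name, box in quadrants.items() if crosses(rec, *box)]
```

For example, the record MotionRecord("car1", 1.0, 2.0) in the region 0 <= t <= 8, 0 <= f(t) <= 8 crosses the subquadrants SW, NW and NE, so a single trajectory can end up stored in several quadrants.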
When a data page overflows (at the (B + 1)th insertion), a page or bucket split takes place. In short, a bucket split involves inserting the corresponding <ID, a, b> record in every crossed subquadrant.

Figure 16: Example of overflow in the PMR-structure

Obviously, a bucket split leads to duplication of index elements. The same trajectory which was represented by a single point before the split becomes represented by three points after the split. This might lead to significant storage overhead. However, performance experiments by Tayeb, Ulusoy and Wolfson [TUW98] show that the PMR index performs very well for what they define as instantaneous queries, averaging two disk accesses per query.

3 Comparison

In their model, Kollios et al. [KGT99] compare cB+-trees (where c = 4, 6, and 8), a hB-tree, and an R* approach. In Figure 17 the results for the average number of I/Os per query for the different methods are presented. The approximation method using several B+-trees was better than the hB-tree approach. The R* approach had the worst performance.

Figure 17: Query performance. (From [KGT99])

The space consumption and the average number of I/Os per update are plotted in Figure 18. The update and space performance of the hB-tree is better than that of the cB+-tree. This is because the objects are stored only once and are better clustered than in the R*-tree.

Figure 18: Space consumption. (From [KGT99])

Although the query approximation algorithm is efficient, later work by Chon et al. [CAE01] has tested the Dual transform model of that article against their SV model. Chon et al. [CAE01] compared the cB+-tree and hB-tree approaches of Kollios et al. [KGT99] with an SS-tree approach. Using generated realistic data, a performance test showed that the SV model has significantly lower overhead than the Dual transform model.

4 Discussion

We have introduced and described a few different approaches for indexing mobile objects. There has not been much research on this topic.
We have been in touch with Hae Don Chon, one of the authors of [CAE01] and one of the most up-to-date researchers on the topic, and he confirms that currently there are just a few people in the world doing research on mobile objects. Several research groups have proposed variants of the R-tree, and a few have created some kd-tree structures or B-tree structures. However, most of these structures have only been introduced theoretically, and the models or index structures have hardly ever been adopted in a DBMS. According to Chon, R-tree structures, like the TPR-tree, are mostly used in DBMSs, not because they perform better than other structures, but because they are easier to implement.

There is still research going on using some of the tree structures described in this paper. Some of them have not been the subject of any research since they were first introduced, while others are still being developed and investigated. According to B. Salzberg, one of the authors of [ELS97], there has not been any further research using the hB-tree for tempo-spatial queries after 1997. However, there is currently some work on this subject which is not published yet, mainly for temporal data. The LSD-tree is suitable for indexing mobile objects, but research using this tree is mainly on accessing feature vectors, as an LSDh-tree. The TPR-tree is the most appropriate R-tree, and is probably the most used method for indexing mobile objects.

5 References

[CAE01] Chon, H. D., Agrawal, D., and El Abbadi, A. Storage and Retrieval of Moving Objects.
[Com79] Comer, D. 1979. The ubiquitous B-tree. Computing Surveys, Vol. 11(2): 121-137.
[ELS97] Evangelidis, G., Lomet, D., and Salzberg, B. 1997. The hBΠ-tree: A Multi-attribute Index Supporting Concurrency, Recovery and Node Consolidation.
[GRSY97] Goldstein, J., Ramakrishnan, R., Shaft, U., and Yu, J. B. 1997. Processing Queries By Linear Constraints. In: Proc. 16th ACM PODS Symposium on Principles of Database Systems, Tucson, Arizona.
[Hen96] Henrich, A. 1996. A Hybrid Split Strategy for kd-tree Based Access Structures. Proceedings of the 4th ACM Workshop on Advances in Geographic Information Systems (GIS '96), pp. 1-8, Rockville, Maryland, USA, November 1996.
[HNP95] J.M. Hellerstein, J.F. Naughton, and A. Pfeffer. Generalized Search Trees for Database Systems. In Proc. of the VLDB Conf., pp. 562-573 (1995).
[HSW89] Henrich, A., Six, H.-W., and Widmayer, P. 1989. The LSD-tree: Spatial Access to Multidimensional Point and Non-point Objects.
[KGT99] Kollios, G., Gunopulos, D., and Tsotras, V. J. 1999. On Indexing Mobile Objects.
[LEL97] S. T. Leutenegger, J. M. Edgington, and M. A. Lopez. STR: A Simple and Efficient Algorithm for R-Tree Packing. ICDE'97, pp. 497-506.
[LL98] S.T. Leutenegger and M.A. Lopez. The Effect of Buffering on the Performance of R-trees. In Proc. of the ICDE Conf., pp. 164-171 (1998).
[RSV02] Rigaux, P., Scholl, M., and Voisard, A. 2002. Spatial Databases With Application to GIS. Academic Press, USA.
[Sal91] Salzberg, B. 1991. Practical spatial database access methods. In: Proceedings of the Symposium on Applied Computing, pp. 82-90.
[SJLL00] Simonas Saltenis, Christian S. Jensen, Scott T. Leutenegger, and Mario A. Lopez. Indexing the Positions of Continuously Moving Objects.
[SM99] J.-M. Saglio and J. Moreira. Oporto: A Realistic Scenario Generator for Moving Objects. In Proc. of DEXA Workshops, pp. 426-432 (1999).
[TUW98] Tayeb, J., Ulusoy, O., and Wolfson, O. 1998. A Quadtree Based Dynamic Attribute Indexing Method. The Computer Journal, 41(3):185-200.