Study and Optimization of T-tree Index in Main Memory Database
Fengdong Sun (1,a), Quan Guo (2,b), Lan Wang (3,c)
(1,2,3) Department of Computer Science and Technology, Dalian Neusoft University of Information,
Dalian, China
Email: [email protected], [email protected], [email protected]
Keywords: T-tree, rotate, balance, T-tail, TTB-tree
Abstract
In a main memory database the bottleneck is not disk I/O: CPU clock speed has grown faster than
memory speed, so memory access limits performance. Designing new index structures that improve
memory access speed is therefore a good way to achieve high performance in a main memory
database. This paper first presents the T-tree index structure and its algorithms in a main memory
database, then presents two optimizations of the T-tree index, the T-tail tree and the TTB-tree. Our
results indicate that the T-tree provides good overall performance in main memory.
1. Introduction
In a disk-based database, the bottleneck that limits performance is disk I/O. In a main memory
database the bottleneck is not disk I/O: CPU clock speed has increased at a much faster rate than
memory speed, so memory access cannot keep up with the CPU's processing speed [1]. It is
therefore necessary to design new index structures for main memory databases that improve
memory access speed.
Index structures designed for main memory are different from those designed for disk-based
systems. The primary goals for a disk-oriented index structure are to minimize the number of disk
accesses and to minimize disk space. A main memory oriented index structure is contained in main
memory, hence there are no disk accesses to minimize. Thus, the primary goals of a main memory
index are to reduce overall computation time while using as little memory as possible.
Researchers have actively sought to design and develop new index architectures for improving
performance in main memory databases, such as the T-tree, the B-tree, and their variants [2]. The
T-tree is an order-preserving tree index structure designed specifically for use in main memory
databases (MMDB), and it has been widely used.
This paper describes the T-tree index structure and its operation algorithms in detail, then
describes two T-tree variants, the T-tail tree and the TTB-tree.
2. T-tree index
2.1 T-tree structure
The T-tree was proposed by Tobin J. Lehman and Michael J. Carey in 1986. It is a type of binary
tree data structure used by main memory databases. The T-tree evolved from the AVL tree and the
B-tree, and it combines advantages of both [3]. A T-tree is a balanced index tree data structure
optimized for cases where both the index and the actual data are kept fully in memory, just as a
B-tree is an index structure optimized for storage on block-oriented secondary storage devices
such as hard disks. The T-tree seeks to gain the performance benefits of in-memory tree structures
such as AVL trees while avoiding the large storage space overhead that is common to them.
[Footnote: This work is supported by the National Natural Science Foundation of China under
Grants No. 61170168 and No. 61170169.]
A T-tree does not keep copies of the indexed data fields within the index tree nodes themselves.
Instead, it takes advantage of the fact that the actual data is always in main memory together with
the index, so that it contains only pointers to the actual data fields. A T-tree node usually consists
of pointers to the parent node and to the left and right child nodes, an ordered array of data
pointers, and some extra control data. Fig.1 shows the T-tree node structure.
Nodes with two subtrees are called internal nodes, nodes without subtrees are called leaf nodes,
and nodes with only one subtree are called half-leaf nodes. A node is called the bounding node for
a value if the value lies between the node's current minimum and maximum values, inclusive. For
each internal node, there exist leaf or half-leaf nodes that contain the predecessor of its smallest
data value (called the greatest lower bound) and the successor of its largest data value (called the
least upper bound). Leaf and half-leaf nodes can contain any number of data elements from one to
the maximum size of the data array. Internal nodes keep their occupancy between predefined
minimum and maximum numbers of elements [4]. Fig.2 shows the T-tree structure.
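The node layout and the node categories described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name TNode, its field names, and the use of plain integer keys in place of pointers to data records are all hypothetical.

```python
class TNode:
    """One T-tree node (illustrative field names, not taken from the paper)."""

    def __init__(self, keys, parent=None, left=None, right=None):
        self.keys = keys      # sorted list; stands in for the ordered array of data pointers
        self.parent = parent  # pointer to the parent node
        self.left = left      # pointer to the left child node
        self.right = right    # pointer to the right child node

    def kind(self):
        """Classify the node by the number of subtrees it has."""
        subtrees = (self.left is not None) + (self.right is not None)
        return ("leaf", "half-leaf", "internal")[subtrees]

    def bounds(self, value):
        """True if min <= value <= max, i.e. this node is the bounding node for value."""
        return bool(self.keys) and self.keys[0] <= value <= self.keys[-1]


root = TNode([40, 45, 50])
root.left = TNode([10, 20], parent=root)
print(root.kind(), root.left.kind())      # half-leaf leaf
print(root.bounds(45), root.bounds(60))   # True False
```

The bounding test only inspects the first and last entries of the sorted array, which is why a node can be accepted or rejected with at most two comparisons.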
2.2 T-tree operation
In a main memory database, the T-tree index is mainly used for data search, insertion, and
deletion. Insertion and deletion are both based on search [5].
(1) Search
1) Search starts at the root node.
2) If the current node is the bounding node for the search value, search its data array. Search
fails if the value is not found in the data array.
3) If the search value is less than the minimum value of the current node, continue the search in
its left subtree. Search fails if there is no left subtree.
4) If the search value is greater than the maximum value of the current node, continue the search
in its right subtree. Search fails if there is no right subtree.
From these steps, searching for an element requires log2 N + (1/2) × log2 N / K comparisons, so
its time complexity is O(log2 N + (1/2) × log2 N / K), where N is the number of elements in the
T-tree and K is the number of elements in a node.
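The four search steps can be sketched as follows. This is a minimal sketch, assuming integer keys stored directly in a sorted list per node; the names TNode and t_search are hypothetical, and binary search inside the node is one reasonable choice for scanning the data array.

```python
from bisect import bisect_left


class TNode:
    def __init__(self, keys, left=None, right=None):
        self.keys = keys    # sorted, non-empty list of keys
        self.left = left
        self.right = right


def t_search(root, value):
    """Return True if value is stored in the T-tree rooted at root."""
    node = root                          # step 1: start at the root
    while node is not None:
        if value < node.keys[0]:         # step 3: below the node minimum, go left
            node = node.left
        elif value > node.keys[-1]:      # step 4: above the node maximum, go right
            node = node.right
        else:                            # step 2: bounding node, search its data array
            i = bisect_left(node.keys, value)
            return node.keys[i] == value
    return False                         # ran out of subtrees: search fails


tree = TNode([40, 50, 60], left=TNode([10, 20, 30]), right=TNode([70, 80, 90]))
print(t_search(tree, 20), t_search(tree, 55), t_search(tree, 95))  # True False False
```

Only the minimum and maximum of each node are compared on the way down; the full data array is examined in at most one node, the bounding node.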
(2) Insertion
1) Search for a bounding node for the new value.
2) If such a node exists, check whether there is still space in its data array; if so, insert the
new value and finish.
3) If no space is available in the node, remove the minimum value from the node's data array
and insert the new value. Then proceed to the node holding the greatest lower bound for the node
into which the new value was inserted. If the removed minimum value still fits there, add it as the
new maximum value of that node; otherwise create a new right subnode for that node.
4) If no bounding node was found, insert the value into the last node searched if it still fits.
In this case the new value becomes either the new minimum or the new maximum value. If the
value does not fit, create a new left or right subtree.
5) If a new node was added, the tree might need to be rebalanced, as described below.
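Steps 1 to 4 of the insertion procedure can be sketched as below. This is a simplified sketch, not a complete implementation: rebalancing (step 5) is deliberately left out, keys are assumed to be distinct integers, and the names CAPACITY, TNode, t_insert, and in_order are hypothetical. One case the text does not spell out is a full bounding node with no left subtree; the sketch assumes the evicted minimum then becomes a new left child.

```python
from bisect import insort

CAPACITY = 3  # assumed maximum number of keys per node


class TNode:
    def __init__(self, keys):
        self.keys = keys    # sorted list of keys
        self.left = None
        self.right = None


def t_insert(root, value):
    """Insert value and return the root. Rebalancing (step 5) is omitted."""
    if root is None:
        return TNode([value])
    node = root
    while True:                                   # step 1: find bounding or last node
        if value < node.keys[0] and node.left is not None:
            node = node.left
        elif value > node.keys[-1] and node.right is not None:
            node = node.right
        else:
            break
    if value in node.keys:                        # ignore duplicates
        return root
    if len(node.keys) < CAPACITY:                 # steps 2 and 4: there is room
        insort(node.keys, value)
    elif value < node.keys[0]:                    # step 4: full, new left subtree
        node.left = TNode([value])
    elif value > node.keys[-1]:                   # step 4: full, new right subtree
        node.right = TNode([value])
    else:                                         # step 3: full bounding node
        smallest = node.keys.pop(0)
        insort(node.keys, value)
        if node.left is None:                     # assumption: evicted key becomes a left child
            node.left = TNode([smallest])
        else:
            glb = node.left                       # walk to the greatest-lower-bound node
            while glb.right is not None:
                glb = glb.right
            if len(glb.keys) < CAPACITY:
                glb.keys.append(smallest)         # becomes the new maximum there
            else:
                glb.right = TNode([smallest])
    return root


def in_order(node):
    """Return all keys in ascending order."""
    if node is None:
        return []
    return in_order(node.left) + node.keys + in_order(node.right)


root = None
for v in [50, 30, 70, 20, 40, 60, 80, 10, 35, 65]:
    root = t_insert(root, v)
print(in_order(root))  # the ten keys in ascending order
```

Because the evicted minimum is larger than every key in the left subtree, appending it to the greatest-lower-bound node keeps that node sorted without a search.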
(3) Deletion
1) Search for the node that bounds the value to delete. Search for the value within this node;
report an error and stop if it is not found.
2) If the delete will not cause an underflow (that is, the node has more than the minimum allowable
number of entries prior to the delete), simply delete the value and stop. Otherwise, if this is an
internal node, delete the value and borrow the greatest lower bound of this node from a leaf or
half-leaf to bring this node's element count back up to the minimum. Otherwise, this is a leaf or a
half-leaf, so just delete the element. (Leaves are permitted to underflow, and half-leaves are
handled in step (3).)
3) If the node is a half-leaf and can be merged with a leaf, coalesce the two nodes into one node
(a leaf) and discard the other node. Proceed to step (5).
4) If the current node (a leaf) is not empty, stop. Otherwise, free the node and proceed to step (5)
to rebalance the tree.
5) For every node along the path from the leaf up to the root, if the two subtrees of the node
differ in height by more than one, perform a rotation. Since a rotation at one node may create an
imbalance for a node higher up in the tree, balance checking for deletion must examine the entire
search path until a node of even balance is discovered.
2.3 Rotation and balancing
The T-tree has good update and storage characteristics, but insertions and deletions typically
require rotations to keep the tree balanced.
A T-tree is implemented on top of an underlying self-balancing binary search tree. The T-tree
balances like an AVL tree: it becomes out of balance when a node's child trees differ in height by
at least two levels. This can happen after the insertion or deletion of a node. After an insertion or
deletion, the tree is scanned from the leaf to the root. If an imbalance is found, one tree rotation
or a pair of rotations is performed, which is guaranteed to balance the whole tree. After a rotation,
the side of the rotation increases its height by 1 while the side opposite the rotation decreases its
height by 1. Therefore, rotations can be applied strategically to nodes whose left and right
children differ in height by more than 1. Self-balancing binary search trees apply this operation
automatically. When a rotation results in an internal node having fewer than the minimum number
of items, items from the node's new child or children are moved into the internal node.
In the case of an insertion, at most one rotation is needed to rebalance the tree, so processing
stops after one rotation. In the case of a deletion, a rotation on one node may trigger an imbalance
for a node higher up in the tree, so processing continues after a rotation until an evenly balanced
node is found.
There are four types of rotation used to rebalance a T-tree. The rotation types (LL, LR, RR, and
RL) are named after the position of the subtree that causes the imbalance: the LL rotation is
caused by the left subtree of the left child, the LR rotation by the right subtree of the left child,
and so on. The algorithms for the RR and RL rotations are symmetrical to those for the LL and LR
rotations. Fig.3 shows these rotations for the case of an insertion. There are four cases in all; the
case is chosen by looking at the direction of the first two nodes on the path from the unbalanced
node to the newly inserted node and matching them to the topmost row of the figure. The root is
the initial parent before a rotation, and the pivot is the child that takes the root's place.
The rebalancing rotations for deletion are identical to rebalancing after insertion, except that the
cause of the imbalance in the tree is that a subtree has grown shorter rather than longer.
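The four rotation cases can be sketched with ordinary AVL-style rotations applied to T-tree nodes. This is a minimal sketch under assumptions: the names TNode, height, rotate_right, rotate_left, and rebalance are hypothetical, heights are recomputed recursively for brevity (a real index would cache them or store balance factors), and the T-tree-specific step of moving items into an under-filled internal node after a rotation is not shown.

```python
class TNode:
    def __init__(self, keys, left=None, right=None):
        self.keys = keys
        self.left = left
        self.right = right


def height(node):
    """Height of a subtree; an empty subtree has height 0."""
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def rotate_right(root):
    """LL case: the left child (pivot) takes the root's place."""
    pivot = root.left
    root.left = pivot.right
    pivot.right = root
    return pivot


def rotate_left(root):
    """RR case: the right child (pivot) takes the root's place."""
    pivot = root.right
    root.right = pivot.left
    pivot.left = root
    return pivot


def rebalance(root):
    """Apply the LL, LR, RR, or RL rotation if root is out of balance."""
    balance = height(root.left) - height(root.right)
    if balance > 1:                                              # left side too tall
        if height(root.left.left) < height(root.left.right):    # LR: rotate child first
            root.left = rotate_left(root.left)
        return rotate_right(root)
    if balance < -1:                                             # right side too tall
        if height(root.right.right) < height(root.right.left):  # RL: rotate child first
            root.right = rotate_right(root.right)
        return rotate_left(root)
    return root


# LR case: the imbalance comes from the right subtree of the left child.
unbalanced = TNode([50, 60], left=TNode([10, 20], right=TNode([30, 40])))
balanced = rebalance(unbalanced)
print(balanced.keys, balanced.left.keys, balanced.right.keys)  # [30, 40] [10, 20] [50, 60]
```

In the LR and RL cases the pivot that ends up as the new subtree root is often a sparsely filled former leaf, which is the situation in which items must be moved into the new internal node.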
3. T-tail Tree
3.1 T-tail tree structure
In order to reduce the number of rotation and balancing operations on the T-tree, reference [7]
proposes the T-tail tree structure. When the insertion of a key causes a T-tree node to overflow,
the T-tail tree does not rotate to rebalance the tree; instead it creates a new pointer entry that
points to a new node, which stores the keywords that no longer fit. In this way it delays the
rotation and balancing of the T-tree, thereby reducing the number of rebalancing rotations.
Fig.4 shows a complete T-tail tree node structure. Fig.4 (a) is similar to a T-tree node; the
difference is that the T-tail node has a tail_pointer entry that points to its tail node (if the tail
node exists). Fig.4 (b), the tail node, is also similar to a T-tree node, except that it has no parent
pointer in its control information and no additional information such as subtree pointers; it stores
only keywords. In the T-tail tree implementation, a node of the tree has at most one tail node.
3.2 T-tail Algorithm
Search. In the T-tail tree, the search algorithm is similar to the classic T-tree search algorithm.
The difference arises when searching for a keyword within a T node. Assume the current search
node is A. If the keyword does not exist in the T node and tail_pointer is not null, the search
continues in A's tail node. If A has no tail node, the search algorithm is the same as the classic
T-tree search algorithm.
Insertion. When inserting a new keyword, the algorithm first determines the bounding node with
the search algorithm; this step is the same as in classic T-tree insertion. If the bounding node is
full, a tail node is generated, the minimum keyword in the bounding node is moved to the tail node,
and the new keyword is inserted into the bounding node. Subsequent overflowing insertions are
then performed in the tail node.
Deletion. When deleting a keyword, the algorithm first determines the bounding node with the
search operation. Then, if the deletion causes the bounding node to underflow, the minimum
keyword in the tail node is moved to the bounding node. If the tail node is empty after the move,
the tail node is deleted.
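The node-level part of the T-tail search and insertion can be sketched as follows. This is only an illustration under several assumptions that are not stated in the text: the tail node is modelled as a plain sorted list, both the node and its tail hold at most CAPACITY keys, a key smaller than the node's minimum is placed directly in the tail, and node_insert returns False when both are full so that the caller can fall back to classic T-tree overflow handling. The names TTailNode, node_search, and node_insert are hypothetical.

```python
from bisect import bisect_left, insort

CAPACITY = 3  # assumed capacity of a node and of its tail node


class TTailNode:
    def __init__(self, keys):
        self.keys = keys   # sorted keys of the T node itself
        self.tail = None   # tail_pointer: sorted key list of the tail node, or None


def _contains(sorted_keys, key):
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


def node_search(node, key):
    """Look in the T node first, then in its tail node if one exists."""
    if _contains(node.keys, key):
        return True
    return node.tail is not None and _contains(node.tail, key)


def node_insert(node, key):
    """Return True if the node or its tail absorbed key, False if both are full."""
    if len(node.keys) < CAPACITY:
        insort(node.keys, key)
        return True
    if node.tail is None:
        node.tail = []                          # generate the tail node on first overflow
    if len(node.tail) >= CAPACITY:
        return False                            # caller must fall back to classic handling
    if key < node.keys[0]:
        insort(node.tail, key)                  # assumption: small keys go straight to the tail
    else:
        insort(node.tail, node.keys.pop(0))     # move the node minimum to the tail
        insort(node.keys, key)                  # new keyword goes into the node
    return True


n = TTailNode([20, 30, 40])
print(node_insert(n, 35), n.keys, n.tail)       # True [30, 35, 40] [20]
print(node_search(n, 20), node_search(n, 25))   # True False
```

With these assumptions every key in the tail is smaller than every key in the node, so the two lists together remain one ordered run and no rotation is needed until the tail itself fills up.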
3.3 T-tail performance analysis
Analysis of the T-tail tree structure and its algorithms shows that, if a node has a tail node,
keywords are compared not only in the node but also in the tail node. On the surface, this
increases the number of comparisons relative to the T-tree.
In addition, the T-tail tree performs a few pointer operations that do not exist in the T-tree.
Because pointer operations in memory are very fast, this pointer overhead is very small compared
with the time overhead of the T-tree itself, so the average search length in the T-tail tree is the
same as in the T-tree. Some insertion and deletion steps in the T-tail tree involve the tail node,
for example moving keywords to keep them ordered across the node and its tail node. However,
this overhead is relatively small, and part of this work is also needed in the T-tree. Therefore, the
cost of insertion and deletion in the T-tail tree is the same as in the T-tree.
Compared with the T-tree, the T-tail tree adds only one pointer entry, tail_pointer, and this
increase is negligible. Therefore, the T-tail tree and the T-tree have the same time complexity and
space complexity.
4. TTB-tree
In order to reduce the number of data overflows during insertion and deletion in the T-tree,
reference [8] proposes the TTB-tree structure. Fig.5 shows the TTB-tree node structure and Fig.6
shows the TTB-tree structure.
The TTB-tree node structure is similar to that of the classic T-tree node; the difference is that
each TTB-tree node has a successor pointer that points to its successor node. This makes data
overflow during insertion and deletion easy to handle. At the same time, a search operation does
not need to traverse the entire tree and avoids examining useless data.
The TTB-tree can quickly find a node's successor through the successor pointer, which effectively
solves the problem of data overflow.
The TTB-tree search, insertion, and deletion algorithms are similar to the T-tree algorithms. The
differences are as follows. In insertion, when the tree needs to be balanced, the successor pointer
allows the new maximum boundary value to be inserted at the corresponding location as the
minimum of the node's right subtree, rather than as the maximum of its left subtree, which avoids
data overflow. In deletion, when deleting a value leads to data underflow, the node can directly
take the minimum of its right subtree through the successor pointer and avoid traversing the entire
tree.
Search. Search starts at the root node. If the current node is the bounding node, search the
current node using binary search; the search fails if the value is not found in the current node. If
the search value is less than the minimum value of the current node, continue the search in its left
subtree; the search fails if there is no left subtree. If the search value is greater than the maximum
value of the current node, search the successor node pointed to by the successor pointer.
Insertion. Search for a bounding node for the new value. If such a node exists, check whether
there is still space in the node; if so, insert the new value and finish. If the bounding node exists
but has no space available, generate a new leaf node and insert the new value into it as the
minimum value; then update the bounding node's successor pointer to point to the new node and
rotate to rebalance the tree. If the new value is larger than the maximum in the bounding node,
insert the new value into the successor node as its minimum, then update the maximum and
minimum of the bounding node and the successor node. If the new value is less than the maximum
in the bounding node, move the maximum of the bounding node into the successor node as its
minimum, then update the maximum and minimum of the bounding node and the successor node.
Deletion. Search for the bounding node. If the bounding node does not exist, the deletion fails.
If the bounding node exists but the value to delete is not in it, the deletion fails. If the bounding
node exists, the value to delete is in it, and the deletion will not cause an underflow, the operation
succeeds. If the deletion causes an underflow, nodes must be merged and the tree rebalanced; this
process is similar to that of the classic T-tree.
5. Conclusion
In a main memory database all data is stored in memory rather than on disk, so traditional indexes
designed for disk-based storage are not suitable for the main memory database environment. It is
important to design indexes that improve performance in this environment. This paper introduces
the T-tree index structure and its algorithms in detail, and describes the rotation and balancing of
the T-tree. Based on the T-tree, the paper presents the T-tail and TTB index structures, which are
optimizations of the T-tree index, and analyzes and compares these index structures and
algorithms.
6. References
[1] Jun Rao, Kenneth A. Ross. Cache Conscious Indexing for Decision-Support in Main Memory.
Journal of Computer Science and Technology, 2009, 24(4): 708-722.
[2] Lu Hongjun, Yuet Yeung Ng, Tian Zengping. T-tree or B-tree: Main Memory Database Index
Structure Reviewed. Australasian Database Conference, 2000: 65-73.
[3] Tobin J. Lehman, Michael J. Carey. A Study of Index Structures for Main Memory Database
Management Systems. Proceedings of the Twelfth International Conference on Very Large Data
Bases, 1986: 294-303.
[4] T-tree. http://en.wikipedia.org/wiki/T-tree. 2013.
[5] A. Aho, J. Hopcroft, J. D. Ullman. The Design and Analysis of Computer Algorithms.
Addison-Wesley Publishing Company, 1974.
[6] Ig-Hoon Lee, Sang-Goo Lee, Junho Shim. Making T-Trees Cache Conscious on Commodity
Microprocessors. Journal of Information Science and Engineering, 27, 143-161 (2011).
[7] Lin Peng, Li Hang, Xu Xuezhou. Optimization of T-tree Index of Main Memory Database in
Critical Application. Computer Engineering, 2004, Vol.30 No.17.
[8] Wang Shan, Xiao Yan-qin, Liu Da-wei, Qin Xiong-pai. Research of main memory database.
Computer Applications, 2007, Vol.27 No.10.