Study and Optimization of T-tree Index in Main Memory Database

Fengdong.Sun1,a, Quan.Guo2,b, Lan.Wang3,c
1,2,3 Department of Computer Science and Technology, Dalian Neusoft University of Information, Dalian, China
a [email protected], b [email protected], c [email protected]

Keywords: T-tree, rotate, balance, T-tail, TTB-tree

Abstract

In a main memory database the bottleneck is not disk I/O but the gap between CPU clock speed and the slower memory access speed. To achieve high performance in a main memory database, it is therefore a good approach to design new index structures that improve memory access speed. This paper first presents the T-tree index structure and its algorithms in a main memory database. It then presents two optimizations of the T-tree index, the T-tail tree and the TTB-tree. Our results indicate that the T-tree provides good overall performance in main memory.

1. Introduction

In a disk-based database, the bottleneck that limits performance is disk I/O. In a main memory database, however, the bottleneck is not disk I/O: CPU clock speed has increased at a much faster rate than memory speed, so memory access cannot keep up with CPU processing [1]. It is therefore necessary to design new index structures for main memory databases that improve memory access speed.

Index structures designed for main memory differ from those designed for disk-based systems. The primary goals of a disk-oriented index structure are to minimize the number of disk accesses and to minimize disk space. A main memory oriented index structure is contained in main memory, so there are no disk accesses to minimize. The primary goals of a main memory index are therefore to reduce overall computation time while using as little memory as possible.
Researchers have actively sought to design and develop new index architectures that improve performance in main memory databases, such as the T-tree, the B-tree and their variants [2]. The T-tree is an order-preserving tree index structure designed specifically for use in a main memory database (MMDB), and it has been widely used. This paper describes the T-tree index structure and its operation algorithms in detail, and then describes two T-tree variants, the T-tail tree and the TTB-tree.

The work is supported by two National Natural Science Foundations of China under Grant No.61170168 and No.61170169.

2. T-tree index

2.1 T-tree structure

The T-tree was proposed by Tobin J. Lehman and Michael J. Carey in 1986. It is a type of binary tree data structure used by main memory databases. The T-tree evolved from the AVL tree and the B-tree, and it inherits advantages of both [3].

A T-tree is a balanced index tree data structure optimized for cases where both the index and the actual data are kept entirely in memory, just as a B-tree is an index structure optimized for block-oriented secondary storage devices such as hard disks. The T-tree seeks to gain the performance benefits of in-memory tree structures such as AVL trees while avoiding the large storage overhead that is common to them.

A T-tree does not keep copies of the indexed data fields within the index tree nodes themselves. Instead, it takes advantage of the fact that the actual data is always in main memory together with the index, so a node only contains pointers to the actual data fields.

A T-tree node usually consists of pointers to the parent node and to the left and right child nodes, an ordered array of data pointers, and some extra control data. Fig.1 shows the T-tree node structure. Nodes with two subtrees are called internal nodes, nodes without subtrees are called leaf nodes, and nodes with only one subtree are called half-leaf nodes.
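The node layout just described can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the class name `TNode`, the method `kind`, and the use of plain sorted integer keys in place of data pointers are assumptions made for the example.

```python
class TNode:
    """Illustrative T-tree node (names are hypothetical, not from the paper)."""

    def __init__(self, keys, parent=None, left=None, right=None):
        # Sorted keys stand in for the ordered array of data pointers.
        self.keys = sorted(keys)
        self.parent = parent
        self.left = left
        self.right = right

    def kind(self):
        """Classify the node by how many subtrees it has."""
        subtrees = (self.left is not None) + (self.right is not None)
        return {0: "leaf", 1: "half-leaf", 2: "internal"}[subtrees]


# Small usage example: a root with one child, then with two children.
root = TNode([40, 50, 60])
low = TNode([10, 20], parent=root)
root.left = low
kind_with_one_child = root.kind()
root.right = TNode([80, 90], parent=root)
kind_with_two_children = root.kind()
```

The control data mentioned in the text (for example balance information) is left out of the sketch to keep it short.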
A node is called the bounding node for a value if the value lies between the node's current minimum and maximum values, inclusive. For each internal node there exists a leaf or half-leaf node that contains the predecessor of its smallest data value (called the greatest lower bound) and one that contains the successor of its largest data value (called the least upper bound). Leaf and half-leaf nodes can contain any number of data elements from one up to the maximum size of the data array. Internal nodes keep their occupancy between predefined minimum and maximum numbers of elements [4]. Fig.2 shows the T-tree structure.

2.2 T-tree operation

In a main memory database, the T-tree index is mainly used for data search, insertion and deletion. Insertion and deletion are both based on search [5].

(1) Search
1) Search starts at the root node.
2) If the current node is the bounding node for the search value, search its data array. Search fails if the value is not found in the data array.
3) If the search value is less than the minimum value of the current node, continue the search in its left subtree. Search fails if there is no left subtree.
4) If the search value is greater than the maximum value of the current node, continue the search in its right subtree. Search fails if there is no right subtree.

It follows that searching for an element requires log2 N + (1/2) × log2 N / K comparisons, so the time complexity is O(log2 N + (1/2) × log2 N / K), where N is the number of elements in the T-tree and K is the number of elements in a node.

(2) Insertion
1) Search for a bounding node for the new value.
2) If such a node exists, check whether there is still space in its data array. If so, insert the new value and finish.
3) If no space is available in the node, remove the minimum value from the node's data array and insert the new value. Then proceed to the node holding the greatest lower bound of the node into which the new value was inserted.
If the removed minimum value still fits there, add it as the new maximum value of that node; otherwise create a new right subnode for that node.
4) If no bounding node was found, insert the value into the last node searched if it still fits. In this case the new value becomes either the new minimum or the new maximum value of that node. If the value does not fit, create a new left or right subtree.
5) If a new node was added, the tree might need to be rebalanced, as described below.

(3) Deletion
1) Search for the node that bounds the value to delete. Search for the value within this node, reporting an error and stopping if it is not found.
2) If the delete will not cause an underflow (that is, the node has more than the minimum allowable number of entries before the delete), simply delete the value and stop. Otherwise, if this is an internal node, delete the value and borrow the greatest lower bound of this node from a leaf or half-leaf to bring this node's element count back up to the minimum. Otherwise, this is a leaf or a half-leaf, so just delete the element. (Leaves are permitted to underflow, and half-leaves are handled in step 3.)
3) If the node is a half-leaf and can be merged with a leaf, coalesce the two nodes into one node (a leaf) and discard the other node. Proceed to step 5.
4) If the current node (a leaf) is not empty, stop. Otherwise, free the node and proceed to step 5 to rebalance the tree.
5) For every node along the path from the leaf up to the root, if the two subtrees of the node differ in height by more than one, perform a rotation operation. Since a rotation at one node may create an imbalance for a node higher up in the tree, balance checking for deletion must examine the whole search path until a node of even balance is discovered.

2.3 Rotation and balancing

The T-tree has good update and storage characteristics, but insertion and deletion operations typically require rotating and rebalancing the tree.
A T-tree is implemented on top of an underlying self-balancing binary search tree. It balances like an AVL tree: it becomes out of balance when a node's child trees differ in height by at least two levels. This can happen after the insertion or deletion of a node. After an insertion or deletion, the tree is scanned from the leaf to the root. If an imbalance is found, one tree rotation or a pair of rotations is performed, which in the case of an insertion is guaranteed to balance the whole tree.

After a rotation, the side of the rotation increases its height by 1 while the side opposite the rotation decreases its height by 1. Therefore, rotations can be applied strategically to nodes whose left and right children differ in height by more than 1. Self-balancing binary search trees apply this operation automatically. When a rotation results in an internal node having fewer than the minimum number of items, items from the node's new child or children are moved into the internal node.

In the case of an insertion, at most one rotation is needed to rebalance the tree, so processing stops after one rotation. In the case of a deletion, a rotation on one node may trigger an imbalance for a node higher up in the tree, so processing continues after a rotation until an evenly balanced node is found.

There are four types of rotations used to rebalance a T-tree. The rotation types (LL, LR, RR and RL) are named after the child of the node that causes the imbalance. The LL rotation is caused by the left subtree of the left child, the LR rotation is caused by the right subtree of the left child, and so on. The algorithms for the RR and RL rotations are symmetrical to those for the LL and LR rotations. Fig.3 shows these rotations for the case of an insertion.
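As a rough illustration of the four rotation types, the sketch below applies AVL-style rotations to a subtree root. It is a simplified sketch under stated assumptions: the names `Node`, `height`, `rotate_left`, `rotate_right` and `rebalance` are hypothetical, heights are recomputed recursively rather than stored as control data, and the T-tree-specific step of moving items into an underfull internal node after a rotation is omitted.

```python
class Node:
    """Minimal tree node for the rotation sketch (hypothetical names)."""

    def __init__(self, keys, left=None, right=None):
        self.keys = keys
        self.left = left
        self.right = right


def height(node):
    """Height of a subtree; an empty subtree has height 0."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def rotate_right(root):
    """Single rotation for the LL case: the left child (pivot) replaces root."""
    pivot = root.left
    root.left = pivot.right
    pivot.right = root
    return pivot


def rotate_left(root):
    """Single rotation for the RR case: the right child (pivot) replaces root."""
    pivot = root.right
    root.right = pivot.left
    pivot.left = root
    return pivot


def rebalance(root):
    """Return the new subtree root after fixing an imbalance at root, if any."""
    balance = height(root.left) - height(root.right)
    if balance > 1:
        # LR case: the right subtree of the left child is the taller one.
        if height(root.left.left) < height(root.left.right):
            root.left = rotate_left(root.left)
        return rotate_right(root)  # LL case, or second half of LR
    if balance < -1:
        # RL case: the left subtree of the right child is the taller one.
        if height(root.right.right) < height(root.right.left):
            root.right = rotate_right(root.right)
        return rotate_left(root)  # RR case, or second half of RL
    return root


# LL example: a chain leaning left is rebalanced with one rotation.
ll_leaf = Node([1])
ll_mid = Node([5], left=ll_leaf)
ll_top = Node([9], left=ll_mid)
ll_result = rebalance(ll_top)

# LR example: the imbalance sits in the right subtree of the left child.
lr_leaf = Node([5])
lr_mid = Node([1], right=lr_leaf)
lr_top = Node([9], left=lr_mid)
lr_result = rebalance(lr_top)
```

In both examples the returned node is the pivot that takes the place of the original subtree root, and the subtree height drops from 3 to 2.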
There are four cases in all. The case is chosen by taking the directions of the first two steps on the path from the unbalanced node to the newly inserted node and matching them against the topmost row of the figure. Root is the initial parent before a rotation, and pivot is the child that takes the root's place. The rebalancing rotations for deletion are identical to those for insertion, except that the cause of the imbalance is that a subtree has grown shorter rather than longer.

3. T-tail Tree

3.1 T-tail tree structure

To reduce the number of rotation and balancing operations in the T-tree, reference [7] proposes the T-tail tree structure. When the insertion of a key causes a T-tree node to overflow, the T-tail tree does not rotate to balance the tree. Instead, it generates a new pointer entry that points to a new node, which stores the overflowing keywords. In this way the rotation and balancing of the tree can be delayed, which reduces the number of balancing rotations.

Fig.4 shows a complete T-tail tree node structure. The node in Fig.4 (a) is similar to a T-tree node; the difference is that a T-tail node has a tail_pointer entry that points to its tail node (if the tail node exists). The tail node in Fig.4 (b) is also similar to a T-tree node, except that it has no parent pointer in its control information and no additional information such as subtree pointers; it only stores keywords. In the T-tail tree implementation, a node of the tree has at most one tail node.

3.2 T-tail Algorithm

Search. In the T-tail tree, the search algorithm is similar to the classic T-tree search algorithm. The difference arises when searching for a keyword inside a T node. Let the current node be A. If the keyword does not exist in A and A's tail_pointer is not null, the search continues in A's tail node. If A has no tail node, the search algorithm is the same as the classic T-tree search algorithm.
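The T-tail search can be sketched as below. This is a hedged sketch rather than the algorithm of reference [7]: the names `TNode`, `contains` and `search` are hypothetical, the tail node is modeled as a plain sorted list of keywords, and it is assumed that tail keywords are smaller than the node's own minimum, so the bounding test uses the tail's minimum as the lower bound.

```python
from bisect import bisect_left


class TNode:
    """T-tail node sketch: keys are sorted and non-empty; tail is a list or None."""

    def __init__(self, keys, left=None, right=None, tail=None):
        self.keys = keys
        self.left = left
        self.right = right
        self.tail = tail  # stands in for tail_pointer (assumption: sorted list)


def contains(sorted_keys, key):
    """Binary search for key in a sorted list."""
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


def search(node, key):
    """Return True if key is stored in the tree rooted at node."""
    while node is not None:
        # Assumption: the tail holds keywords below the node's minimum.
        low = node.tail[0] if node.tail else node.keys[0]
        if key < low:
            node = node.left
        elif key > node.keys[-1]:
            node = node.right
        else:
            # Bounding node: look in the node first, then in its tail node.
            if contains(node.keys, key):
                return True
            return node.tail is not None and contains(node.tail, key)
    return False


# Usage example: the root carries a tail node holding 35 and 38.
tree = TNode([40, 50, 60],
             left=TNode([10, 20]),
             right=TNode([80, 90]),
             tail=[35, 38])
found_in_node = search(tree, 50)
found_in_tail = search(tree, 38)
missing = search(tree, 36)
```

When a node has no tail, the loop reduces to the classic T-tree search of Section 2.2.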
Insertion. When inserting a new keyword, the boundary node is first determined by the search algorithm, as in the classic T-tree insertion. If the boundary node is full, a tail node is generated, the minimum keyword in the boundary node is moved to the tail node, and the new keyword is inserted into the boundary node. Subsequent insertions that overflow this node are then absorbed by the tail node.

Deletion. When deleting a keyword, the boundary node is first determined by the search operation. If the deletion causes the boundary node to underflow, the minimum keyword in the tail node is moved to the boundary node. After the move, if the tail node is empty, the tail node is deleted.

3.3 T-tail performance analysis

From the definition of the T-tail tree structure and its algorithms, it can be seen that if a node has a tail node, keywords are compared not only in the node but also in the tail node. On the surface, this increases the number of comparisons relative to the T-tree. In addition, the T-tail tree performs a few pointer operations that do not exist in the T-tree. Because pointer operations in memory are very fast, the pointer overhead of the T-tail tree is very small compared with the time overhead of the T-tree itself, so the average search length in the T-tail tree is the same as in the T-tree.

In insertion and deletion, some operations in the T-tail tree involve the tail node, for example moving keywords to keep the keywords in the node and in the tail node in order. However, this overhead is relatively small, and part of these operations is also needed in the T-tree. Therefore, the cost of insertion and deletion in the T-tail tree is the same as in the T-tree. Compared with the T-tree, the T-tail tree only adds one pointer entry, tail_pointer, and this increase is negligible. Therefore, the T-tail tree and the T-tree have the same time complexity and space complexity.
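To make the insertion rule of Section 3.2 concrete, the sketch below handles a single boundary node that has already been located. It is an illustration under assumptions, not the method of reference [7]: `NODE_CAPACITY`, `TNode` and `insert_into_boundary_node` are hypothetical names, the tail node is given the same capacity as a regular node, keywords are assumed to be distinct, and a keyword smaller than the node's minimum is placed directly in the tail so that tail keywords stay below the node's keywords.

```python
from bisect import insort

NODE_CAPACITY = 4  # assumed maximum number of keywords per node and per tail


class TNode:
    """Boundary node with an optional tail (modeled as a sorted list)."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.tail = None  # created lazily on the first overflow


def insert_into_boundary_node(node, key):
    """Insert key; return False when node and tail are both full.

    A False result means the caller must fall back to the classic
    T-tree overflow handling (new node plus possible rotation).
    """
    if len(node.keys) < NODE_CAPACITY:
        insort(node.keys, key)
        return True
    if node.tail is None:
        node.tail = []  # generate the tail node instead of rotating
    if len(node.tail) >= NODE_CAPACITY:
        return False
    if key < node.keys[0]:
        # Assumption: keys below the node's minimum go straight to the tail.
        insort(node.tail, key)
    else:
        # Move the node's minimum to the tail, then store the new key.
        insort(node.tail, node.keys.pop(0))
        insort(node.keys, key)
    return True


# Usage example: a full node absorbs four more keywords through its tail.
n = TNode([10, 20, 30, 40])
results = [insert_into_boundary_node(n, k) for k in (25, 15, 35, 37)]
overflow = insert_into_boundary_node(n, 32)
```

After the four insertions the node holds the larger keywords and the tail the smaller ones; the fifth insertion is rejected because both are full, which is the point at which rebalancing can no longer be postponed.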
4. TTB-tree

To reduce the number of data overflows during insertion and deletion in the T-tree, reference [8] proposes the TTB-tree structure. Fig.5 shows the TTB-tree node structure and Fig.6 shows the TTB-tree structure. The TTB-tree node structure is similar to the classic T-tree node structure; the difference is that each TTB-tree node has a successor pointer that points to its successor node. This makes it easy to handle data overflow when inserting and deleting data. At the same time, the search operation does not need to traverse the entire tree, which avoids examining irrelevant data. The TTB-tree can quickly find a node's successor through the successor pointer and can effectively solve the problem of data overflow.

The TTB-tree algorithms for searching, inserting and deleting are similar to the T-tree algorithms, with the following differences. In the insertion operation, when the tree needs to be balanced, the successor pointer allows the new maximum boundary value to be inserted at the corresponding location as the minimum of the right subtree, rather than as the maximum of the left subtree, which avoids the data overflow. In the deletion operation, when deleting a value leads to data underflow, the minimum of the right subtree can be used directly through the successor pointer, which avoids traversing the entire tree.

Search. Search starts at the root node. If the current node is the bounding node, search the current node using binary search. Search fails if the value is not found in the current node. If the search value is less than the minimum value of the current node, continue the search in its left subtree. Search fails if there is no left subtree. If the search value is greater than the maximum value of the current node, search the successor node pointed to by the successor pointer.
Insertion. Search for a bounding node for the new value. If such a node exists, check whether there is still space in the node. If so, insert the new value and finish. If the bounding node exists but no space is available in it, generate a new leaf node and insert the new value into the new node as its minimum value. Then update the bounding node's successor pointer to point to the new node and rotate to rebalance the tree. If the new value is larger than the maximum in the boundary node, insert the new value into the successor node as the minimum of the successor node, then update the maximum and minimum of the boundary node and the successor node. If the new value is less than the maximum in the boundary node, move the maximum in the boundary node into the successor node as its minimum, then update the maximum and minimum of the boundary node and the successor node.

Deletion. Search for the boundary node. If the boundary node does not exist, the deletion fails. If the boundary node exists but the value to delete is not in the boundary node, the deletion fails. If the boundary node exists, the value to delete is in the boundary node, and the deletion will not cause an underflow, the operation succeeds. If the deletion causes an underflow, nodes must be merged and the tree rebalanced. This process is similar to that of the classic T-tree.

5. Conclusion

In a main memory database all data is stored in memory rather than on disk, so traditional indexes designed for disk-based storage are not suitable for the main memory database environment.
It is important to design indexes that improve performance in the main memory database environment. This paper introduces the T-tree index structure and its algorithms in detail, and also describes the rotation and balancing of the T-tree. Based on the T-tree, the paper presents the T-tail and TTB index structures, which are optimizations of the T-tree index. The paper also analyzes and compares these index structures and algorithms.

6. References

[1] Jun Rao, Kenneth A. Ross. Cache Conscious Indexing for Decision-Support in Main Memory. Journal of Computer Science and Technology, 2009, 24(4): 708-722.
[2] Lu Hongjun, Yuet Yeung Ng, Tian Zengping. T-tree or B-tree: Main Memory Database Index Structure Reviewed. Australasian Database Conference, 2000: 65-73.
[3] Tobin J. Lehman, Michael J. Carey. A Study of Index Structures for Main Memory Database Management Systems. Proceedings of the Twelfth International Conference on Very Large Data Bases, 1986: 294-303.
[4] T-tree. http://en.wikipedia.org/wiki/T-tree. 2013.
[5] A. Aho, J. Hopcroft and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, 1974.
[6] Ig-Hoon Lee, Sang-Goo Lee, Junho Shim. Making T-Trees Cache Conscious on Commodity Microprocessors. Journal of Information Science and Engineering 27, 143-161 (2011).
[7] Lin Peng, Li Hang, Xu Xuezhou. Optimization of T-tree Index of Main Memory Database in Critical Application. Computer Engineering, 2004, Vol.30 No.17.
[8] Wang Shan, Xiao Yan-qin, Liu Da-wei, Qin Xiong-pai. Research of main memory database. Computer Applications, 2007, Vol.27 No.10.