* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download btrees - COW :: Ceng
Survey
Document related concepts
Transcript
Multilevel Indexing and B+ Trees CENG 351 File Structures 1 Problems with simple indexes If index does not fit in memory: 1. Seeking the index is slow (binary search): – We don’t want more than 3 or 4 seeks for a search. N Log(N+1) 15 keys 4 1000 ~10 100,000 ~17 1,000,000 ~20 2. Insertions and deletions take O(N) disk accesses. CENG 351 File Structures 2 Indexed Sequential Files • Provide a choice between two alternative views of a file: 1. Indexed: the file can be seen as a set of records that is indexed by key; or 2. Sequential: the file can be accessed sequentially (physically contiguous records), returning records in order by key. CENG 351 File Structures 3 Example of applications • Student record system in a university: – Indexed view: access to individual records – Sequential view: batch processing when posting grades • Credit card system: – Indexed view: interactive check of accounts – Sequential view: batch processing of payments CENG 351 File Structures 4 The initial idea • Maintain a sequence set: – Group the records into blocks in a sorted way. – Maintain the order in the blocks as records are added or deleted through splitting, concatenation, and redistribution. • Construct a simple, single level index for these blocks. – Choose to build an index that contain the key for the last record in each block. CENG 351 File Structures 5 Maintaining a Sequence Set • Sorting and re-organizing after insertions and deletions is out of question. We organize the sequence set in the following way: – Records are grouped in blocks. – Blocks should be at least half full. – Link fields are used to point to the preceding block and the following block (similar to doubly linked lists) – Changes (insertion/deletion) are localized into blocks by performing: • Block splitting when insertion causes overflow • Block merging or redistribution when deletion causes underflow. CENG 351 File Structures 6 Example: insertion • Block size = 4 • Key : Last name Block 1 ADAMS … BIXBY … CARSON … COLE … • Insert “BAIRD …”: Block 1 ADAMS … BAIRD … Block 2 CARSON .. BIXBY … COLE … CENG 351 File Structures 7 Example: deletion Block 1 ADAMS … BAIRD … BIXBY … Block 2 BYNUM… CARSON .. CARTER .. BOONE … Block 3 DENVER… ELLIS … Block 4 COLE… DAVIS • Delete “DAVIS”, “BYNUM”, “CARTER”, CENG 351 File Structures 8 Add an Index set Key Block BERNE CAGE DUTTON EVANS FOLK GADDIS 1 2 3 4 5 6 CENG 351 File Structures 9 Tree indexes • This simple scheme is nice if the index fits in memory. • If index doesn’t fit in memory: – Divide the index structure into blocks, – Organize these blocks similarly building a tree structure. • Tree indexes: – – – – B Trees B+ Trees Simple prefix B+ Trees … CENG 351 File Structures 10 Separators Block 1 Range of Keys Separator ADAMS-BERNE BOLEN 2 BOLEN-CAGE CAMP 3 CAMP-DUTTON EMBRY 4 EMBRY-EVANS FABER 5 FABER-FOLK FOLKS 6 FOLKS-GADDIS CENG 351 File Structures 11 root EMBRY Index set BOLEN FABER CAMP-DUTTON ADAMS-BERNE 1 CAMP BOLEN-CAGE 3 FOLKS EMBRY-EVANS 4 2 FOLKS-GADDIS FABER-FOLK 6 5 CENG 351 File Structures 12 B Trees • B-tree is one of the most important data structures in computer science. • What does B stand for? (Not binary!) • B-tree is a multiway search tree. • Several versions of B-trees have been proposed, but only B+ Trees has been used with large files. • A B+tree is a B-tree in which data records are in leaf nodes, and faster sequential access is possible. CENG 351 File Structures 13 Formal definition of B+ Tree Properties • Properties of a B+ Tree of order v : – All internal nodes (except root) has at least v keys and at most 2v keys . – The root has at least 2 children unless it’s a leaf.. – All leaves are on the same level. – An internal node with k keys has k+1 children CENG 351 File Structures 14 B+ tree: Internal/root node structure P0 K1 P1 K2 ……………… Pn-1 Kn Pn Each Pi is a pointer to a child node; each Ki is a search key value # of search key values = n, # of pointers = n+1 Requirements: K1 < K2 < … < K n For any search key value K in the subtree pointed by Pi, If Pi = P0, we require K < K1 If Pi = Pn, Kn K If Pi = P1, …, Pn-1, Ki < K Ki+1 CENG 351 File Structures 15 B+ tree: leaf node structure L K 1 r1 K 2 R ……………… K n rn Pointer L points to the left neighbor; R points to the right neighbor K1 < K2 < … < Kn v n 2v (v is the order of this B+ tree) We will use Ki* for the pair <Ki, ri> and omit L and R for simplicity CENG 351 File Structures 16 Example: B+ tree with order of 1 • Each node must hold at least 1 entry, and at most 2 entries Root 40 10* 15* 20 33 20* 27* 51 33* 37* 40* 46* CENG 351 File Structures 51* 63 55* 63* 97* 17 Example: Search in a B+ tree order 2 • Search: how to find the records with a given search key value? – Begin at root, and use key comparisons to go to leaf • Examples: search for 5*, 16*, all data entries >= 24* ... – The last one is a range search, we need to do the sequential scan, starting from the first leaf containing a value >= 24. Root 13 2* 3* 5* 7* 14* 15* 17 24 19* 20* 22* 30 24* 27* 29* CENG 351 File Structures 33* 34* 38* 39* 18 Cost for searching a value in B+ tree • Typically, a node is a page (block or cluster) • Let H be the height of the B+ tree: we need to read H+1 pages to reach a leaf node • Let F be the (average) number of pointers in a node (for internal node, called fanout ) Height – Level 1 = 1 page = F0 page 0 – Level 2 = F pages = F1 pages 1 – Level 3 = F * F pages = F2 pages – Level H+1 = …….. = FH pages (i.e., leaf nodes) – Suppose there are D data entries. So there are D/(F-1) leaf nodes – D/(F-1) = FH. That is, H = logF( FD 1) CENG 351 File Structures level 1 2 19 B+ Trees in Practice • Typical order: 100. Typical fill-factor: 67%. – average fanout = 133 (i.e, # of pointers in internal node) • Can often hold top levels in buffer pool: – – – Level 1 = 1 page = 8 Kbytes Level 2 = 133 pages = 1 Mbyte Level 3 = 17,689 pages = 133 MBytes • Suppose there are 1,000,000,000 data entries. – – H = log133(1000000000/132) < 4 The cost is 5 pages read CENG 351 File Structures 20 How to Insert a Data Entry into a B+ Tree? • Let’s look at several examples first. CENG 351 File Structures 21 Inserting 16*, 8* into Example B+ tree Root 2* 3* 5* 7* 8* 13 17 24 30 15* 16* 14* You overflow 13 2* 3* 5* 7* 17 24 30 8* One new child (leaf node) generated; must add one more pointer to its parent, thus one more CENG 351 File Structures key value as well. 22 Inserting 8* (cont.) • Copy up the middle value (leaf split) 13 17 24 30 Entry to be inserted in parent node. (Note that 5 is s copied up and continues to appear in the leaf.) 5 2* 5 3* 13 5* 17 24 7* 30 CENG 351 File Structures 8* You overflow! 23 Insertion into B+ tree (cont.) • Understand difference between copy-up and push-up • Observe how minimum occupancy is guaranteed in both leaf and index pg splits. 5 13 17 24 30 We split this node, redistribute entries evenly, and push up middle key. 17 5 13 24 CENG 351 File Structures Entry to be inserted in parent node. (Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.) 30 24 Example B+ Tree After Inserting 8* Root 17 5 2* 3* 24 13 5* 7* 8* 14* 15* 19* 20* 22* 30 24* 27* 29* 33* 34* 38* 39* Notice that root was split, leading to increase in height. CENG 351 File Structures 25 Inserting a Data Entry into a B+ Tree: Summary • Find correct leaf L. • Put data entry onto L. – – If L has enough space, done! Else, must split L (into L and a new node L2) • Redistribute entries evenly, put middle key in L2 • copy up middle key. • Insert index entry pointing to L2 into parent of L. • This can happen recursively – To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) • Splits “grow” tree; root split increases height. – Tree growth: gets wider or one level taller at top. CENG 351 File Structures 26 Deleting a Data Entry from a B+ Tree • Examine examples first … CENG 351 File Structures 27 Delete 19* and 20* Root 17 5 2* 3* 24 13 5* 7* 8* 14* 16* 19* 20* 22* 30 24* 27* 29* 33* 34* 38* 39* 22* 22* 24* 27* 29* Have we still forgot something? CENG 351 File Structures 28 Deleting 19* and 20* (cont.) Root 17 5 2* 3* • • • • 27 13 5* 7* 8* 14* 16* 22* 24* 30 27* 29* 33* 34* 38* 39* Notice how 27 is copied up. But can we move it up? Now we want to delete 24 Underflow again!CENG But351can we redistribute this time? File Structures 29 Deleting 24* • Observe the two leaf nodes are merged, and 27 is discarded from their parent, but … • Observe `pull down’ of index entry (below). New root 2* 3* 5* 5 7* 8* 13 14* 16* 30 22* 17 27* 29* 33* 34* 38* 39* 30 22* 27* 29* CENG 351 File Structures 33* 34* 38* 39* 30 Deleting a Data Entry from a B+ Tree: Summary • Start at root, find leaf L where entry belongs. • Remove the entry. – – If L is at least half-full, done! If L has only d-1 entries, • Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). • If re-distribution fails, merge L and sibling. • If merge occurred, must delete entry (pointing to L or sibling) from parent of L. • Merge could propagate to root, decreasing height. CENG 351 File Structures 31 Example of Non-leaf Re-distribution • Tree is shown below during deletion of 24*. (What could be a possible initial tree?) • In contrast to previous example, can re-distribute entry from left child of root to right child. Root 22 5 2* 3* 5* 7* 8* 13 14* 16* 17 30 20 17* 18* 20* 21* CENG 351 File Structures 22* 27* 29* 33* 34* 38* 39* 32 After Re-distribution • Intuitively, entries are re-distributed by `pushing through’ the splitting entry in the parent node. • It suffices to re-distribute index entry with key 20; we’ve re-distributed 17 as well for illustration. Root 17 5 2* 3* 5* 7* 8* 13 14* 16* 20 17* 18* 20* 21* CENG 351 File Structures 22 30 22* 27* 29* 33* 34* 38* 39* 33 Primary vs Secondary Index • Note: We were assuming the data items were in sorted order – This is called primary index • Secondary index: – Built on an attribute that the file is not sorted on. • Can have many different indexes on the same file. CENG 351 File Structures 34 A Secondary B+-Tree index Root 17 5 2 3 5 7 8 13 14 16 22* 3* 14* 8* 20 17 18 2* 16* 5* 39* 20 21 22 30 22 27 29 33 34 38 39 20* 7* 38* 29* Actual data buckets More… • Hash-based Indexes – Static Hashing – Extendible Hashing – Linear Hashing • Grid-files • R-Trees • etc… CENG 351 File Structures 36 Terminology • Bucket Factor: the number of records which can fit in a leaf node. • Fan-out : the average number of children of an internal node. • A B+tree index can be used either as a primary index or a secondary index. – Primary index: determines the way the records are actually stored (also called a sparse index) – Secondary index: the records in the file are not grouped in buckets according to keys of secondary indexes (also called a dense index) CENG 351 File Structures 37 Summary • Tree-structured indexes are ideal for rangesearches, also good for equality searches. • B+ tree is a dynamic structure. – – – – Inserts/deletes leave tree height-balanced; High fanout (F) means depth rarely more than 3 or 4. Almost always better than maintaining a sorted file. Typically, 67% occupancy on average. If data entries are data records, splits can change rids! • Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS. CENG 351 File Structures 38