Download Tree Indexing

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Quadtree wikipedia , lookup

Lattice model (finance) wikipedia , lookup

Interval tree wikipedia , lookup

Red–black tree wikipedia , lookup

Binary tree wikipedia , lookup

B-tree wikipedia , lookup

Binary search tree wikipedia , lookup

Transcript
CS 2604 Spring 2004
Indexing with Trees
Index Trees 1
Hash tables suffer from several defects, including:
- good, general purpose hash functions are very difficult to find
- static table size requires costly resizing if indexed set is highly dynamic
- search performance degrades considerably as the table nears its capacity
Using a structured tree (BST, AVL) as an index offers some advantages:
- self-contained, data-independent implementation
- easily accommodates insertion and deletion
- balancing is relatively cheap, although complex in design
But, log(N) search cost is much higher than that of a good hash table.
The tree MUST be well-balanced for good performance.
Most troubling, BST and AVL trees are difficult to store efficiently on disk if the index is
huge.
Data Structures & File Management
Computer Science Dept Va Tech January 2004
©2000-2004 McQuain WD
Disk Representation
Index Trees 2
It is a relatively simple matter to write any binary tree to a disk file, by representing each
tree node by a data record that holds the data element and two file offsets specifying the
locations of the children, if any of that node.
D
B
G
C
E
D
Data
H
F
lChild
rChild
0
D
24
72
24
B
-1
48
48
C
-1
-1
72
G
96
168
96
E
120
144
The nodes don't need to be stored in any
particular order.
120
D
-1
-1
144
F
-1
-1
NULL pointers may be represented by a
negative offset.
168
H
-1
-1
Computer Science Dept Va Tech January 2004
©William D McQuain January 2004
Data Structures & File Management
©2000-2004 McQuain WD
1
CS 2604 Spring 2004
Disk Representation
Index Trees 3
The problem is that this disk representation will require too many individual disk
accesses when processing a typical tree operation, such as a search or a traversal.
Why?
These tree operations typically require transiting from a node to one or both of its
children.
But there's no reason that the child nodes
will be stored anywhere near the parent
node (although we could at least guarantee
that siblings are adjacent).
Since each node stores only one data value,
and the nodes we might well perform one
disk access for every node that is accessed
during the tree operation.
Given the extremely slow nature of disk
access, this is unacceptable.
Computer Science Dept Va Tech January 2004
Data
lChild
rChild
0
D
24
72
24
B
-1
48
48
C
-1
-1
72
G
96
168
96
E
120
144
120
D
-1
-1
144
F
-1
-1
168
H
-1
-1
Data Structures & File Management
2-3 Trees
©2000-2004 McQuain WD
Index Trees 4
The first of these difficulties can be moderated by allowing a tree node to have more than
two children.
The third can be addressed by allowing an internal node to store a larger number of
values.
In a 2-3 tree:
- each node contains either one or two key values
- every internal node either holds one key value and has two children, or holds two
key values and has three children
- every leaf is at the same level in the tree
- there is a BST-like arrangement of values for efficient searching
Compared to a BST:
- 2-3 trees have lower update cost on average
- a 2-3 tree storing N keys will be shallower than the corresponding BST
Computer Science Dept Va Tech January 2004
©William D McQuain January 2004
Data Structures & File Management
©2000-2004 McQuain WD
2
CS 2604 Spring 2004
2-3 Tree Properties
Index Trees 5
A 2-3 tree:
12
10
15
20
21
18
33
23
30
24
48
31
45
47
50
52
Each node can store up to two key values and up to three pointers.
The left child holds values less than the first key.
The middle child holds values greater than or equal to the first key and less than the
second key.
The right child holds values greater than or equal to the second key.
Computer Science Dept Va Tech January 2004
Data Structures & File Management
©2000-2004 McQuain WD
2-3 Tree Insertion
Index Trees 6
Insertion into a 2-3 tree is similar to a BST. First find the appropriate leaf…
12
10
14 15
20
21
18
33
23
30
24
Inserting 14…
Inserting 55…
leaf is not full, so
just insert new
key, shifting old
key if
necesssary.
leaf is full, but parent is
not, so create a new right
child, moving larger key
from middle child to
parent.
Computer Science Dept Va Tech January 2004
©William D McQuain January 2004
Data Structures & File Management
48
31
45
47
48
45
47
50
50
52
52
55
©2000-2004 McQuain WD
3
CS 2604 Spring 2004
2-3 Tree Splitting
Index Trees 7
If a subtree is sufficiently full, insertion may cause the parent to split:
Inserting 19…
Median value is passed up to
the parent node… which has
now acquired another child…
leaf is full, and so is the parent
23
23
20
21
20
30
24
31
Computer Science Dept Va Tech January 2004
19
21
30
24
31
Data Structures & File Management
©2000-2004 McQuain WD
2-3 Tree Splitting the Root
Index Trees 8
In the worst case, the tree root splits, increasing the height of the tree:
Inserting 19 caused an internal node to split and a displaced value to be passed up.
Here, the parent is the tree root and is already full… so it splits also…
23
18
33
20
19
21
Computer Science Dept Va Tech January 2004
©William D McQuain January 2004
30
24
Data Structures & File Management
31
©2000-2004 McQuain WD
4
CS 2604 Spring 2004
B Trees
Index Trees 9
2-3 trees provide the motivation for the B-tree family. In a B-tree of order m:
- The root is either a leaf or has at least two children.
- Aside from the leaves, and possibly the root, each node has between m/2 and m
children.
- Aside from the leaves, for each node the number of key values stored is one less
than the number of children.
- All leaves are at the same level in the tree, so the tree is always height balanced.
- Values are organized as in a 2-3 tree for efficient searching.
Computer Science Dept Va Tech January 2004
Data Structures & File Management
B Tree Example
©2000-2004 McQuain WD
Index Trees 10
A B-tree of order 5:
A B-tree will be relatively shallow compared to a binary tree storing the same number of
key values. Since a binary search may be applied to the key values in each node,
searching is highly efficient.
Computer Science Dept Va Tech January 2004
©William D McQuain January 2004
Data Structures & File Management
©2000-2004 McQuain WD
5
CS 2604 Spring 2004
B Tree Insertion
Computer Science Dept Va Tech January 2004
Index Trees 11
Data Structures & File Management
B Trees
©2000-2004 McQuain WD
Index Trees 12
In a B-tree:
- The value m is usually chosen so that each node will occupy a disk sector.
- So, depending upon the key type, an internal node may have hundreds of children.
- Each internal node must be at least 50% full, reducing the height of the tree.
- The average cost of search, insertion and deletion, for large key sets, is Θ(logKN)
where K is the average branching factor in the tree.
Computer Science Dept Va Tech January 2004
©William D McQuain January 2004
Data Structures & File Management
©2000-2004 McQuain WD
6
CS 2604 Spring 2004
B+ Trees
Index Trees 13
In B+ trees:
- Internal nodes store only key values and pointers*.
- All records, or pointers to records, are stored in leaves.
- Commonly, the leaves are simply the logical blocks of a database file index,
storing key values and offsets. In this case, many key values will occur twice in
the tree, once at an internal node to guide searching, and again in a leaf.
- If the leaves are simply an index, it is common to implement the leaf level as a
linked list of B tree nodes… why?
The B+ tree is the most commonly implemented variant of the B-tree family, and the
structure of choice for large databases.
* In small databases, it is fairly common to use a B-tree
as a direct data structure, with nodes storing records.
Computer Science Dept Va Tech January 2004
©William D McQuain January 2004
Data Structures & File Management
©2000-2004 McQuain WD
7