Data Structures
2-3-4 Trees
Phil Tayco
Slide version 1.0
Apr. 23, 2015
Binary trees revisited
• Binary search trees combine the best of both worlds:
dynamic memory usage and the binary search you could
perform on a sorted array
• Searching a binary search tree only achieves O(log n)
as long as the tree is balanced
• Balance depends on the order in which nodes are
inserted and deleted, which can leave the tree lopsided
• In the worst case, an imbalanced tree degenerates into
what is basically a linked list, with O(n) search performance
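To make the degenerate case concrete, here is a minimal sketch that builds a plain, non-balancing binary search tree from the same seven keys in two different orders and compares the resulting heights. The class name BstHeightSketch and its helper methods are illustrative and do not come from the slides.

```java
class BstHeightSketch {
    static final class Node {
        final int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    // Plain BST insert with no rebalancing; duplicates go to the right.
    static Node insert(Node root, int key) {
        Node fresh = new Node(key);
        if (root == null) return fresh;
        Node cur = root;
        while (true) {
            if (key < cur.key) {
                if (cur.left == null) { cur.left = fresh; return root; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = fresh; return root; }
                cur = cur.right;
            }
        }
    }

    // Number of levels in the tree (0 for an empty tree).
    static int height(Node n) {
        return n == null ? 0 : 1 + Math.max(height(n.left), height(n.right));
    }

    static Node build(int[] keys) {
        Node root = null;
        for (int k : keys) root = insert(root, k);
        return root;
    }

    public static void main(String[] args) {
        int[] sorted = {1, 2, 3, 4, 5, 6, 7};  // every key goes right: a chain
        int[] mixed  = {4, 2, 6, 1, 3, 5, 7};  // middle keys first: balanced
        System.out.println(height(build(sorted)));  // 7 levels
        System.out.println(height(build(mixed)));   // 3 levels
    }
}
```

Inserting already-sorted keys produces one level per key, which is the linked-list behavior described above.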
Advanced tree ideas
• As with other data structures, we try to address
the cons
• For trees, we want to efficiently maintain balance
as inserts and deletes are performed
• There are tree algorithms that already look at
ways to do this:
– AVL trees
– Red-black trees
• These trees keep the basic binary node structure
• As you would guess, their insert and delete algorithms
are more complex than those of the standard tree
Multiway tree
• What if we modified the tree node instead?
root [20 40 60]
    [10]  [30]  [50]  [70 80]
• Notice each node here contains multiple data elements and
multiple child links
• The modified structure is interesting, but needs to work
within a set of rules to guarantee balance
Multiway tree
• A non-leaf node with 1 data item always has 2
children
root [20]
    [10]  [30]
Multiway tree
• A non-leaf node with 2 data items always has 3
children
root [20 40]
    [10]  [30]  [50 60]
Multiway tree
• A non-leaf node with 3 data items always has 4
children
root [20 40 60]
    [10]  [30]  [50]  [70 80]
Multiway tree
• Leaf nodes have no children and can hold any number of
data items, up to the maximum of 3
root [20 40 60]
    [10]  [30 31 32]  [50]  [70 80]
Multiway tree
• As before, order is maintained: the child node to the
left of a data item holds smaller values and the child
node to its right holds larger values
root [20 40 60]
    [10]  [30 31 32]  [50]  [70 80]
Similarities to Binary trees
• While the number of items and node children
have increased, the basic order is the same
• This promotes a search and insert performance
similar to binary trees at O(log n)
• Search starts at root examining data items
against the search value and traverses down
nodes appropriately
• Insert adds new data items at the appropriate
leaf level
• The algorithms will show that balance will always
be achieved. This makes search and insert
perform at O(log n)
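The search described above can be sketched as follows. This is an illustrative stand-alone version, not the slides' Node234 class: the names MultiwaySearchSketch, Node, leaf and contains are assumptions, and nodes here are immutable and built by hand to mirror the example tree with root 20 40 60.

```java
class MultiwaySearchSketch {
    static final class Node {
        final int[] items;      // sorted data items, 1 to 3 of them
        final Node[] children;  // items.length + 1 children, or none for a leaf

        Node(int[] items, Node... children) {
            this.items = items;
            this.children = children;
        }
    }

    static Node leaf(int... items) {
        return new Node(items);
    }

    // Walk down from the given node, scanning each node's items left to right.
    static boolean contains(Node node, int value) {
        while (true) {
            int i = 0;
            while (i < node.items.length && value > node.items[i]) i++;
            if (i < node.items.length && node.items[i] == value) return true;
            if (node.children.length == 0) return false;  // leaf: nowhere left to go
            node = node.children[i];  // child between items[i - 1] and items[i]
        }
    }

    // The example tree from the slides: root [20 40 60] with four leaves.
    static Node sample() {
        return new Node(new int[] {20, 40, 60},
                leaf(10), leaf(30, 31, 32), leaf(50), leaf(70, 80));
    }

    public static void main(String[] args) {
        Node root = sample();
        System.out.println(contains(root, 31));  // found in the second leaf
        System.out.println(contains(root, 45));  // falls between 40 and 60, not present
    }
}
```

Each step either finds the value in the current node or picks exactly one child to descend into, so the number of nodes visited is bounded by the height of the tree.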
Insert
• New data items will be inserted at the leaf level
• In order to maintain balance, as we perform the
normal search for the appropriate leaf to insert
the new data element, we add a rule to the
algorithm:
– When visiting any node, if it is full, “split” the node
– Whether or not a split has occurred, continue down the
path using the standard search until a leaf node is
reached
– Once a leaf is reached, add the new data element to it
(if it is full, perform another “split”)
Split
• The splitting of a node requires creating a new or
modifying an existing parent node as well as
creating a new sibling node
• Data elements are moved and child pointers are
readjusted as follows:
– A new node is created as a sibling to the full node
– The 3rd data item of the full node is moved to the sibling
node as its 1st data item
– The 2nd data item of the full node is added to the parent
node
– The 1st data item of the full node remains where it is
– The 3rd and 4th child pointers of the full node move to
the sibling node as its 1st and 2nd child pointers
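The five bullets above can be sketched as one function. This is a simplified illustration rather than the Tree234 code shown later: the names SplitSketch, Node, split and slideExample are assumptions, the node being split is assumed to be full and to have a parent with room, and the parent's insert position is found by locating the full node among the parent's children.

```java
class SplitSketch {
    static final class Node {
        int numItems;
        final int[] items = new int[3];
        final Node[] children = new Node[4];
        Node parent;

        Node(int... values) {
            for (int v : values) items[numItems++] = v;
        }

        void setChild(int index, Node child) {
            children[index] = child;
            if (child != null) child.parent = this;
        }
    }

    // Splits a full, non-root node whose parent is not full; returns the sibling.
    static Node split(Node full) {
        Node parent = full.parent;
        Node sibling = new Node();               // a new sibling node
        sibling.items[0] = full.items[2];        // 3rd item becomes the sibling's 1st
        sibling.numItems = 1;
        int middle = full.items[1];              // 2nd item goes up to the parent
        full.numItems = 1;                       // 1st item stays where it is
        sibling.setChild(0, full.children[2]);   // 3rd and 4th children move over
        sibling.setChild(1, full.children[3]);
        full.children[2] = null;
        full.children[3] = null;

        int at = 0;                              // position of the full node under its parent
        while (parent.children[at] != full) at++;
        for (int j = parent.numItems; j > at; j--) {
            parent.items[j] = parent.items[j - 1];        // shift items right
            parent.setChild(j + 1, parent.children[j]);   // and their right-hand children
        }
        parent.items[at] = middle;
        parent.numItems++;
        parent.setChild(at + 1, sibling);        // the sibling fills the hole
        return sibling;
    }

    // Root [14] over a full node [3 6 10] and a node [17], each with leaves.
    static Node slideExample() {
        Node root = new Node(14);
        Node current = new Node(3, 6, 10);
        current.setChild(0, new Node(1, 2));
        current.setChild(1, new Node(4));
        current.setChild(2, new Node(7, 8));
        current.setChild(3, new Node(12));
        Node right = new Node(17);
        right.setChild(0, new Node(16));
        right.setChild(1, new Node(18, 20));
        root.setChild(0, current);
        root.setChild(1, right);
        return root;
    }

    public static void main(String[] args) {
        Node root = slideExample();
        Node sibling = split(root.children[0]);
        System.out.println(root.items[0] + " " + root.items[1]);  // items now in the root
        System.out.println(sibling.items[0]);                     // item moved to the sibling
    }
}
```

After the call, the root holds 6 and 14, the original node keeps 3 with its first two children, and the sibling holds 10 with the other two.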
Split example 1
• We want to add 5 to the tree below. We start at the
root, whose 1st data item is 14. Since 5 is less than 14,
we go down the 1st child pointer. That node is full, so
we must split it
root [14]
    [3 6 10]
        [1 2]  [4]  [7 8]  [12]
    [17]
        [16]  [18 20]
Step 1: Create new sibling node
• Notice the parent node in this case is the root, and the
sibling is not yet attached to the parent (the 2nd child
pointer of the root still points to the node with 17)
root (parent) [14]
    (current) [3 6 10]
        [1 2]  [4]  [7 8]  [12]
    [17]
        [16]  [18 20]
(sibling) [ ]  not yet attached
Step 2: Move 3rd item to become the 1st item of the new
node
• 10 moves from current to the new sibling node
root (parent) [14]
    (current) [3 6]
        [1 2]  [4]  [7 8]  [12]
    [17]
        [16]  [18 20]
(sibling) [10]  not yet attached
Step 3: Move 2nd item to parent
• Notice 6 is inserted into the data item list of the
parent. This shifts 14 to the right, along with the child
pointer to its right
root (parent) [6 14]
    (current) [3]
        [1 2]  [4]  [7 8]  [12]
    (sibling) [10]
    [17]
        [16]  [18 20]
Step 5: Move 3rd and 4th child pointers as 1st
and 2nd child pointers of sibling
• This keeps the parent-child relationships and
orders intact and balanced
root (parent) [6 14]
    (current) [3]
        [1 2]  [4]
    (sibling) [10]
        [7 8]  [12]
    [17]
        [16]  [18 20]
Split Analysis
• The split keeps the non-leaf and leaf rules intact
• Guarantees non-leaf nodes with 1, 2 or 3 data
items have 2, 3 or 4 child nodes
• The split is performed as full nodes are
encountered on the way down
• In the previous example, the insert of 5 still has
not been performed
• The insert process resumes at the parent. Note
that if the parent becomes full as a result of the
split, it is not split at this point
Resume insert at parent
• 5 is less than 6 so we go down child pointer 1. 5
is greater than 3 and there is only 1 data item, so
we go down 2nd child pointer. Node with data
item 4 is a leaf and is not full so we add 5 there.
root [6 14]
    [3]
        [1 2]  [4 5]
    [10]
        [7 8]  [12]
    [17]
        [16]  [18 20]
Insert Analysis
• The algorithm keeps the tree balanced
• New nodes are created as needed by adding
siblings before adding levels
• Levels are increased when the root node is the
one that requires splitting
• When splitting the root, the same split algorithm
applies, but instead of adding the 2nd data item
to the parent node, a new parent node is created
(as the new root)
Splitting the root
• Here, we will insert 15. Before we even go down
a child node, we must split the root because it is
full
root [20 40 60]
    [10]  [30 31 32]  [50]  [70 80]
Step 1: Create the sibling node
• The algorithm works the same as before, except
there is no “parent” node (yet)
(current) root [20 40 60]
    [10]  [30 31 32]  [50]  [70 80]
(sibling) [ ]  not yet attached
Step 2: Create new root as parent
• Since the current node is root, we create another
new node to be the parent (and new root)
(parent) [ ]
(current) root [20 40 60]
    [10]  [30 31 32]  [50]  [70 80]
(sibling) [ ]
Step 3: Move data items
• The normal split occurs. 3rd item of current
moves to 1st of sibling and 2nd item of current
moves to 1st of parent
(parent) [40]
(current) root [20]
(sibling) [60]
leaves: [10]  [30 31 32]  [50]  [70 80]
Step 4: Update pointers
• 3rd and 4th child pointers of current become 1st
and 2nd of sibling. 1st and 2nd of new parent get
current and sibling nodes respectively
(parent) [40]
    (current) root [20]
        [10]  [30 31 32]
    (sibling) [60]
        [50]  [70 80]
Step 5: New root and continue
• Make the parent the new root of the tree.
Resume the insert from the root (15 ends up going
down to, and being added to, the leaf node with 10)
root [40]
    [20]
        [10 15]  [30 31 32]
    [60]
        [50]  [70 80]
• Notice the full leaf node 30, 31, 32 is not split.
This is because it is never visited
Insert Analysis
• Splitting will only occur when a visited node is full, keeping the 2-3-4 tree rules intact
• Levels of the tree increase “upward” when the root node is full
(because the new parent is created at that moment and becomes
the new root)
• Splitting a leaf node will never result in more than 4 children for a
parent node (if the parent node had 4 children, it would be full
and split before reaching any of the child leaf nodes)
• Balance is maintained because even if one side gets “heavy” with
data items, the number of nodes will remain balanced because of
the splitting algorithm
• The best way to understand the algorithm is to insert a series
of numbers and draw the resulting tree
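One way to check a hand-drawn tree is to run the same inserts through code and print the result level by level. The sketch below is a compact illustration, not the Node234/Tree234 code from the slides: the names Tree234Sketch, splitChild and levels are assumptions, and instead of using parent pointers it splits a full child just before descending into it, which splits the same nodes as the "split every full node you visit" rule.

```java
import java.util.ArrayList;
import java.util.List;

class Tree234Sketch {
    static final class Node {
        int numItems;
        final int[] items = new int[3];
        final Node[] children = new Node[4];

        boolean isFull() { return numItems == 3; }
        boolean isLeaf() { return children[0] == null; }
    }

    Node root = new Node();

    void insert(int value) {
        if (root.isFull()) {                  // splitting the root adds a level
            Node newRoot = new Node();
            newRoot.children[0] = root;
            splitChild(newRoot, 0);
            root = newRoot;
        }
        Node cur = root;
        while (!cur.isLeaf()) {
            int i = 0;
            while (i < cur.numItems && value >= cur.items[i]) i++;
            if (cur.children[i].isFull()) {
                splitChild(cur, i);
                if (value >= cur.items[i]) i++;   // the middle item moved up to index i
            }
            cur = cur.children[i];
        }
        int i = cur.numItems++;               // the leaf reached here is never full
        while (i > 0 && cur.items[i - 1] > value) {
            cur.items[i] = cur.items[i - 1];
            i--;
        }
        cur.items[i] = value;
    }

    // Splits the full child at parent.children[index]; parent must not be full.
    static void splitChild(Node parent, int index) {
        Node full = parent.children[index];
        Node sibling = new Node();
        sibling.items[0] = full.items[2];         // 3rd item goes to the sibling
        sibling.numItems = 1;
        sibling.children[0] = full.children[2];   // 3rd and 4th children go with it
        sibling.children[1] = full.children[3];
        full.children[2] = null;
        full.children[3] = null;
        for (int j = parent.numItems; j > index; j--) {
            parent.items[j] = parent.items[j - 1];
            parent.children[j + 1] = parent.children[j];
        }
        parent.items[index] = full.items[1];      // 2nd item goes up to the parent
        parent.children[index + 1] = sibling;
        parent.numItems++;
        full.numItems = 1;                        // 1st item stays
    }

    // One string per level, each node written as [a b c].
    List<String> levels() {
        List<String> out = new ArrayList<>();
        List<Node> level = new ArrayList<>();
        level.add(root);
        while (!level.isEmpty()) {
            StringBuilder sb = new StringBuilder();
            List<Node> next = new ArrayList<>();
            for (Node n : level) {
                if (sb.length() > 0) sb.append(' ');
                sb.append('[');
                for (int i = 0; i < n.numItems; i++) {
                    if (i > 0) sb.append(' ');
                    sb.append(n.items[i]);
                }
                sb.append(']');
                if (!n.isLeaf()) {
                    for (int i = 0; i <= n.numItems; i++) next.add(n.children[i]);
                }
            }
            out.add(sb.toString());
            level = next;
        }
        return out;
    }

    public static void main(String[] args) {
        Tree234Sketch tree = new Tree234Sketch();
        for (int v : new int[] {10, 20, 30, 40, 50, 60, 70}) tree.insert(v);
        for (String line : tree.levels()) System.out.println(line);
    }
}
```

For the inserts 10 through 70 in steps of 10, the program prints `[20 40]` followed by `[10] [30] [50 60 70]`, which can be compared against a drawing made by hand.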
public class Node234
{
    private int numItems;        // number of data items currently stored
    private Node234 parent;      // link back to the parent (used by split)
    private Node234[] children;  // up to 4 child links
    private int[] dataItems;     // up to 3 data items; -1 marks an empty slot

    public Node234()
    {
        numItems = 0;
        parent = null;
        children = new Node234[4];
        dataItems = new int[3];
        for (int n = 0; n < 4; n++)
            children[n] = null;
        for (int n = 0; n < 3; n++)
            dataItems[n] = -1;
    }
public class Tree234
{
    private Node234 root;

    public Tree234()
    {
        root = new Node234();
    }
Node234 and Tree234 Code
• More properties needed here for the node
– numItems to keep track of how many data items are in the
node
– Reference to parent node (useful for handling splits)
– Array of child pointers
– Array of data items
• The array sizes are defined in the constructor and initialized
to null (for children) and -1 (for data items)
• We could also use a linked list for the child and data
arrays, but they are so small that we don’t necessarily need
to, and arrays keep the code simpler to start
• The Tree is just the root node. Note that it is not initialized
to null, but to a new Node234 object with no data items
public void insert(int value)
{
    Node234 current = root;
    while (true)
    {
        if (current.isFull())
        {
            // Split the full node, step back up to its parent, then
            // pick the child to continue down based on the value
            split(current);
            current = current.getParent();
            current = getNextChild(current, value);
        }
Tree234 Insert Code
• We start with a current node at root
• The loop plans to go down child nodes of the tree
until we reach a leaf
• Along the way, if the node.isFull method returns
true, we have to split it
• After the split, we set current to its parent
followed by finding the appropriate child to go to
based on the value to be inserted
• Many methods are being used here: isFull, split,
getParent and getNextChild
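The slides use getNextChild without showing it. A minimal sketch of the decision it has to make is given below, assuming the same layout as Node234 (a sorted data array with -1 in unused slots). The name nextChildIndex and the choice to return an index rather than a Node234 are assumptions made to keep the example self-contained.

```java
class NextChildSketch {
    // Returns the index of the child link to follow for value: 0 if value is
    // below items[0], 1 if it falls between items[0] and items[1], and so on.
    static int nextChildIndex(int[] items, int numItems, int value) {
        int i = 0;
        while (i < numItems && value >= items[i]) i++;
        return i;
    }

    public static void main(String[] args) {
        int[] items = {20, 40, -1};  // two items in use; -1 marks the empty slot
        System.out.println(nextChildIndex(items, 2, 5));   // first child
        System.out.println(nextChildIndex(items, 2, 30));  // second child
        System.out.println(nextChildIndex(items, 2, 45));  // third child
    }
}
```

Only the first numItems entries are compared, so the -1 placeholder is never mistaken for a data item.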
public boolean isFull()
{
    return (numItems == 3);
}

public Node234 getParent()
{
    return parent;
}
// Note: these methods appear in the Node234 class
// (split and getNextChild are in Tree234)
private void split(Node234 n)
{
    int thirdItem = n.removeItem();
    int secondItem = n.removeItem();
    Node234 fourthChild = n.removeChild(3);
    Node234 thirdChild = n.removeChild(2);
    Node234 sibling = new Node234();
    Node234 parent;
Tree234 Split Code
• If you haven’t been drawing pictures while going through the
code, it is important that you start doing so now…
• Split begins with removing the 2nd and 3rd data items from
the full node and storing their values – these will be
transferred to the parent and sibling nodes respectively
• We do the same with disconnecting the 3rd and 4th child
pointers of the node (so we can transfer them to the
sibling)
• We then create a new sibling node and a parent pointer
(parent is not a new node yet as we haven’t determined if
the full node is root at this point)
• The setup is complete, but there are 2 new methods in
Node234 to review: removeItem and removeChild
public int removeItem()
{
    int lastItem = dataItems[numItems - 1];
    dataItems[--numItems] = -1;
    return lastItem;
}
// This removes the last data item in the data array (setting it
// to -1), decrements numItems and returns the value that was
// removed
public Node234 removeChild(int n)
{
    Node234 child = children[n];
    children[n] = null;
    return child;
}
// This sets the given child of the node to null while returning
// a reference to that child
// Now we can look at the next part of the split function…
    if (n == root)
    {
        parent = new Node234();
        root = parent;
        root.setChild(0, n);
    }
    else
        parent = n.getParent();
// If the node being split is root, now create a new node as
// parent and root and set its first child to the current node
// Otherwise, a parent exists and we just get it
    int itemLocation = parent.insertItem(secondItem);
    int parentItems = parent.getNumItems();
    int c = parentItems - 1;
    while (c > itemLocation)
    {
        Node234 temp = parent.removeChild(c);
        parent.setChild(c + 1, temp);
        c--;
    }
    parent.setChild(itemLocation + 1, sibling);
Tree234 Split Code – adjusting the parent
• The second item from the full node being split is inserted into the parent
node using the Node’s insertItem function
• The location of that insert can vary, so it is returned here to determine
how to adjust the child pointers of the parent
• This is done by getting the number of items in the parent and looping from
the last child pointer down to the location of the new item that was inserted
– At each iteration, we remove the child pointer at the current position and
set it one position to the right; this shifts right all the child pointers
that come after the inserted item
• Once that shift is complete, there will be a “hole” to the right of where the
item was inserted into the parent
• This hole is filled by connecting it to the new sibling node just created!
• Notice we have more Node234 functions: insertItem and getNumItems…
public int getNumItems()
{
    return numItems;
}
// This method is a standard get function of a class, returning
// the numItems property
// insertItem is not as simple…
public int insertItem(int data)
{
    numItems++;
    // From right to left of the data items array, we check for
    // non-empty data items (denoted as not equal to -1); if a spot
    // is empty, ignore it
    for (int n = 2; n >= 0; n--)
    {
        if (dataItems[n] == -1)
            continue;
        else
        {
            int d = dataItems[n];
            if (data < d)
                dataItems[n + 1] = dataItems[n];
            else
            {
                dataItems[n + 1] = data;
                return n + 1;
            }
        }
    }
    dataItems[0] = data;
    return 0;
}
Node234 Code – inserting a data item
• The “else” branch here deals with encountering a data item as we go right
to left in the data array looking for the correct place to insert the new data
item
• When a data item is found, compare it to the new item
– If the new item is less than it, the new item belongs to the left so we shift the
data item in the array to the right by 1
– Otherwise, the new data item belongs to the right of this item in the array so we
set it there and return that index
• If we reach the end of the loop, that means all data items in the array
shifted to the right and the new item belongs in the first spot (index 0).
We insert it there and return that index
• A lot of bouncing back and forth between Node234 and Tree234! We’re
almost done though. At this point, we’ve created the sibling node and
inserted the 2nd data item of the full node into the parent (created or
existing)
• All that is left in the split function is to give the sibling its data item
and child pointers
    sibling.insertItem(thirdItem);
    sibling.setChild(0, thirdChild);
    sibling.setChild(1, fourthChild);
}
// Using the Node functions previously discussed, we insert the
// 3rd data item from the full node into the sibling and set its
// 1st and 2nd child pointers to what was once the full node’s
// 3rd and 4th children
Efficiency
• The insert algorithm and the splits with the 2-3-4
tree guarantee balance
• The balance leads to an O(log n) category
performance
• Each node reserves space for 3 data items, which
implies extra storage and an impact on performance
• Question: is the performance impact of
traversing each node’s data array significant?
• Question 2: is the array allocation of 3 elements
per node a significant amount of data storage?
Performance
• In the worst case, the entire data array of every node visited
is traversed before finding the element or determining the next
level to descend (this is what happens when searching for the
tree’s maximum value)
• Because of the way the insert and split algorithms work, it
is rare to find full nodes on every level
• Also, even if every node visited were full, at most 3 data items
are examined per level, so the number of data item comparisons
is still O(log n) in the total number of data elements
• This makes the search performance ultimately comparable
to a balanced binary search tree
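The O(log n) claim can be backed by a short bound, assuming every leaf sits at the same depth and every non-leaf node has between 2 and 4 children. Here h (the number of levels) and n (the number of data items) are symbols introduced for this sketch. The sparsest tree has one item per node and the densest has three, so:

```latex
2^{h} - 1 \;\le\; n \;\le\; 4^{h} - 1
\quad\Longrightarrow\quad
\log_4(n + 1) \;\le\; h \;\le\; \log_2(n + 1)

\text{comparisons per search} \;\le\; 3h \;\le\; 3\log_2(n + 1) \;=\; O(\log n)
```

The constant factor of 3 is the cost of scanning a full node, which is why it does not change the overall category.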
Data Storage
• With most nodes in the tree not usually full, that implies an
amount of unused data space
• The math works out to about 2/7 of unused space based on
the number of elements inserted into the tree
• Compared to self-balancing binary trees like red-black trees
and AVL trees, the overhead those trees spend on balancing is
comparable to the unused space here (you get a little better
performance with 2-3-4 trees than with the balancing binary
trees, at a relative price in data storage)
• Why not use a linked list instead of an array? There is an
increased amount of overhead with doing that as well, but
if that is necessary to relieve the unused space, it can be
used
Tree Traversal
• Displaying data in order with a binary search tree involved
using simple recursion of displaying the subtree on the left,
printing the current element and then displaying the
subtree on the right
• The same concept can apply with a 2-3-4 tree except you
must now account for the multiple data items and child
pointers:
– If the current node is not null, print the child[0] subtree, print
data item[0], print child[1] subtree
– If the current node has 2 data items, also print data item[1]
and then print the child[2] subtree
– If the current node is full, also print data item[2] and then
print the child[3] subtree
private void displayInOrder(Node234 current)
{
    if (current != null)
    {
        displayInOrder(current.getChild(0));
        int n = current.getNumItems();
        for (int c = 0; c < n; c++)
        {
            System.out.println(current.getItem(c));
            displayInOrder(current.getChild(c + 1));
        }
    }
}
Delete
• As you can imagine, the delete function appears quite
challenging:
– Removing an item at the leaf level is not hard
– Removing an item at a non-leaf level requires rearranging
nodes and child pointers
• The “cop out” discussed with Binary Trees is even more
necessary here
– Make each data item a class with an additional “isDeleted”
property
– Set isDeleted to true on a data item when it is removed
– Rebuild the 2-3-4 tree as needed walking through the tree and
inserting elements into a new tree that are not flagged for
deletion
– The new 2-3-4 tree will still be a balanced tree
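The mark-and-rebuild approach above can be sketched as follows. This is an illustration under several assumptions: the names LazyDeleteSketch, Item, markDeleted, rebuild, copyLive and countItems are hypothetical, values are assumed to be distinct, and the insert code is a compact variant that splits a full child before descending rather than the Tree234 code from the slides.

```java
class LazyDeleteSketch {
    static final class Item {
        final int value;
        boolean isDeleted;                    // the extra flag described above
        Item(int value) { this.value = value; }
    }
    static final class Node {
        int numItems;
        final Item[] items = new Item[3];
        final Node[] children = new Node[4];
        boolean isFull() { return numItems == 3; }
        boolean isLeaf() { return children[0] == null; }
    }

    Node root = new Node();

    static int childIndex(Node n, int value) {
        int i = 0;
        while (i < n.numItems && value >= n.items[i].value) i++;
        return i;
    }

    void insert(int value) {
        if (root.isFull()) {
            Node newRoot = new Node();
            newRoot.children[0] = root;
            splitChild(newRoot, 0);
            root = newRoot;
        }
        Node cur = root;
        while (!cur.isLeaf()) {
            int i = childIndex(cur, value);
            if (cur.children[i].isFull()) {
                splitChild(cur, i);
                if (value >= cur.items[i].value) i++;
            }
            cur = cur.children[i];
        }
        int i = cur.numItems++;
        for (; i > 0 && cur.items[i - 1].value > value; i--) cur.items[i] = cur.items[i - 1];
        cur.items[i] = new Item(value);
    }

    static void splitChild(Node parent, int index) {
        Node full = parent.children[index];
        Node sibling = new Node();
        sibling.items[0] = full.items[2];
        sibling.numItems = 1;
        sibling.children[0] = full.children[2];
        sibling.children[1] = full.children[3];
        full.children[2] = null;
        full.children[3] = null;
        for (int j = parent.numItems; j > index; j--) {
            parent.items[j] = parent.items[j - 1];
            parent.children[j + 1] = parent.children[j];
        }
        parent.items[index] = full.items[1];
        parent.children[index + 1] = sibling;
        parent.numItems++;
        full.numItems = 1;
    }

    // Lazy delete: find the item and flag it; the tree shape is untouched.
    boolean markDeleted(int value) {
        for (Node cur = root; cur != null;
                cur = cur.isLeaf() ? null : cur.children[childIndex(cur, value)]) {
            for (int i = 0; i < cur.numItems; i++) {
                if (cur.items[i].value == value) {
                    cur.items[i].isDeleted = true;
                    return true;
                }
            }
        }
        return false;
    }

    // Rebuild: walk the old tree in order and insert only unflagged items.
    LazyDeleteSketch rebuild() {
        LazyDeleteSketch fresh = new LazyDeleteSketch();
        copyLive(root, fresh);
        return fresh;
    }
    static void copyLive(Node n, LazyDeleteSketch into) {
        if (n == null) return;
        for (int i = 0; i < n.numItems; i++) {
            copyLive(n.children[i], into);
            if (!n.items[i].isDeleted) into.insert(n.items[i].value);
        }
        copyLive(n.children[n.numItems], into);
    }

    // Counts stored items, flagged or not.
    static int countItems(Node n) {
        if (n == null) return 0;
        int total = n.numItems;
        for (int i = 0; i <= n.numItems; i++) total += countItems(n.children[i]);
        return total;
    }
}
```

Flagged items still take up space and are still compared during searches until rebuild is called, so how often to rebuild is a trade-off between wasted space and the cost of reinserting every live item.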
Applications
• Guaranteed balance is a big advantage that you
get with a 2-3-4 over a binary search tree
• Minimized node count and balance also reduces
the amount of node visits
• Reduced node visits can be useful in applications
where each node represents a significant unit of
data
– With disk blocks as nodes, fewer node visits mean less
time spent locating blocks of data on disk, which is slow
– Disk storage is a popular use of this data structure
Other Multiway Trees
• 2-3 trees are similar to 2-3-4 trees:
– Up to 2 data items and 3 child pointers per node
– Same non-leaf node rules apply
• Trees with larger nodes follow the same rules for
number of data items and children (links = data
items + 1), which makes the insert and split
algorithms the same
• 2-3 trees split only when the leaf is full and
recursively split full parents up the tree (this
keeps the number of splits necessary per insert
to a minimum)
Summary
• Whether using a self-balancing binary search tree or a
2-3-4-type tree, balance is the theme to keep
performance at O(log n)
• Self-balancing binary trees reduce memory usage
and make it more dynamic, while the
algorithms for 2-3-4 trees are not as complex
• The search is optimized by way of storing the
data in some determined order
• Search can reach O(1) performance if that order
was not as significant and the data elements
could be mapped in a different way…