CS503: Eighth Lecture, Fall 2008
Binary Trees
Michael Barnathan

Project Ideas
• General idea: something that won’t take more than a couple of weeks but that you can use in a portfolio. These are just my ideas; feel free to come up with your own.
• Web crawler.
  – Sockets, Trees, Recursion.
  – Caveats: dead links, status codes, redirects.
• Fast file indexer based on frequent words.
  – File I/O, Trees, Analysis of Algorithms.
  – Compression (storing the index).
    • Works well because of Zipf/power-law distributions.
  – Caveats: binary files, errors opening files. Make sure you open read-only.
• Trend analysis tool using regression.
  – Sorting, Recursion, Analysis.
  – Predictors: I can help you with these.
• AI opponent for a simple game (such as checkers or Reversi).
  – Trees, Storage, Recursion, Heuristic Search, Analysis of Algorithms.
• Distributed command server (to order around a bunch of machines).
  – Sockets, File I/O, Priority Queues (especially for synchronization), Analysis of Algorithms.

Grading
• Not having exams is going to require adjusting the percentages a bit.
• Assignments 40%, Project 30%, Labs 20%, Participation 10%?

Here’s what we’ll be learning:
• Data structures:
  – Binary Trees.
  – Binary Search Trees.
• Theory:
  – Tree traversals.
    • Preorder.
    • Inorder.
    • Postorder.
  – Balanced and complete trees.
  – Recursion on Binary Trees.

Linear Structures
• Arrays, Linked Lists, Stacks, and Queues are linear data structures.
  – Even circularly linked lists.
• One element follows another: there is always just one “next” element.
• As we mentioned, these usually yield recurrences of the form T(n) = T(n-1) + f(n).
• What if we violate this assumption? What if a structure had two “next” elements?
  – In a way, we started with the most restrictive structure (arrays), which we have been progressively relaxing.
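To see concretely why that recurrence pins linear structures to linear time, take constant per-element work, f(n) = c, and unroll (a quick sketch, not from the slides):

```latex
T(n) = T(n-1) + c
     = T(n-2) + 2c
     = \dots
     = T(0) + nc
     = O(n).
```

One element is consumed per step, so there are n steps of constant work.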
Frankenstein’s Data Structures

  Data:  10 21 44     Data:  5 10     Data: 42
  Index:  0  1  2     Index: 1  2     Index:  2

It hardly even makes sense to talk about a node having more than one successor in an array. What would data[2] be here? It’s not clear.

[Diagram: a list node with more than one “next” node.]

Linked lists make more sense, but what is 1->next now? We need more information.

Binary Trees
• There are now two next nodes, not one.
• Clearly, we need two pointers to model them.
• So let’s rotate that list…

            1
     Left /   \ Right
        2       3
       / \     / \
      4   5   6   7

• Let’s call the pointers “left” and “right”.
• This structure is known as a binary tree.
  – Binary: branches into two nodes.
    • Higher-order trees exist too; we’ll talk about these later.
  – Why we call it a tree should be obvious.

Nomenclature
• Some from botany, some from genealogy.
• Root: the “highest” node in the tree; that is, the one without a parent.
  – Almost all tree algorithms start at the root.
• Child/Subtree: a node one level below the current node. Traversing the left or right pointers will bring you to a node’s “left” or “right” child.
• Leaf: a node at the “bottom” of the tree; i.e., one without children (or really, with two null children).
• Parent: the node one level above.
• Siblings: nodes with the same parent.
• “Complete” tree: a tree where every node has either 0 or 2 children and all leaves are at the same level. (Basically, it’s fully “filled in”.)

Recursive Definition
• Binary trees have a nice recursive definition:
• A binary tree is a value, a left binary tree, and a right binary tree.
• Thus, each individual node is itself a tree.
• Base case: the empty tree.
  – Leaves’ left and right children are both empty.
  – We usually represent this with nulls.

Node Access
• All you must store is a reference to the root.
• You can get to the rest of the nodes by traversing the tree.
• Example: accessing 5 in the tree above.
• There’s a problem.
  – Anyone see it?

Traversal
• We have no way of knowing where 5 is.
• In order to find it, we need to check every node in the tree.
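The recursive definition maps almost one-to-one onto code. Here is a minimal sketch (the class and field names are my own, not from the lecture) that builds the seven-node tree used in these slides:

```java
//A binary tree node, per the recursive definition: a value plus
//left and right subtrees, where null represents the empty tree.
class BinaryTree {
    int value;
    BinaryTree left, right;

    BinaryTree(int value) {
        this.value = value; //left and right start null: a leaf.
    }
}

public class BuildTree {
    public static void main(String[] args) {
        //Build the example tree: 1 at the root, 2 and 3 below it, 4-7 as leaves.
        BinaryTree root = new BinaryTree(1);
        root.left = new BinaryTree(2);
        root.right = new BinaryTree(3);
        root.left.left = new BinaryTree(4);
        root.left.right = new BinaryTree(5);
        root.right.left = new BinaryTree(6);
        root.right.right = new BinaryTree(7);

        //All we store is the root; every other node is reached through it.
        System.out.println(root.left.right.value); //Prints 5.
    }
}
```

Notice that reaching 5 this way required already knowing its path (left, then right); finding a value without knowing its path is exactly the traversal problem.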
• So what’s the complexity of access?
  – “Check every” should ring alarm bells by now.
• This is a nonlinear data structure, however, so we have more than one way to traverse.
• There are three common tree traversals:
  – Preorder, inorder, and postorder.
  – There are some more exotic ones, too: traversals based on pointer inversion, threaded traversal, Robson traversal…
• Since binary trees are recursive structures, tree algorithms are usually recursive. Traversals are no exception.
• Remember how we reversed the output of printTo(n) by moving the output above or below the recursive call?
  – It turns out you can change the order of the traversal in the same way.

Traversals and “Visiting”
• You can do anything to a node inside of a tree traversal algorithm!
• You certainly can search for a value.
• But you can also output its value, modify its value, insert a node there, etc.
• This generic action is simply called “visiting” the node when discussing traversals.

Preorder Traversal
• If I gave you a binary tree and asked you to search for an element, how would you go about it?
  – You would check the value.
  – You would search the left subtree.
  – You would search the right subtree.
• This is how preorder traversal works.
  – We check/output/do something with the node value.
  – We recurse on the left subtree.
  – We recurse on the right subtree.
• We stop once we run out of nodes.
• This is also called depth-first traversal, because it first focuses on traversing down specific nodes before broadly visiting others.
  – Like taking a lot of CS courses first before satisfying a core curriculum.
  – Preorder traversal is a special case of depth-first search on data structures called graphs, which we will discuss soon.

Preorder Traversal
BinaryTree preorder(BinaryTree node, int targetvalue) {
    if (node == null)                    //Base case.
        return null;
    else if (node.value == targetvalue)  //Found it.
        return node;
    BinaryTree lchild = preorder(node.left, targetvalue);  //Traverse left.
    if (lchild != null)
        return lchild;                   //Found on the left.
    return preorder(node.right, targetvalue);              //Traverse right.
}

Preorder Traversal: Illustration
Order: [1 2 4 5 3 6 7]
The root is always the first node to be visited.

Inorder Traversal
• We have two recursive calls in the preorder traversal: left and right.
• In preorder, we checked the node before calling either of them.
• In an inorder traversal, we check in-between the two calls.
• We dive down all the way on the left before outputting, then we visit the right.
  – To use recursive stack language, we output after popping on the left but before pushing on the right.
• There is no inherent advantage to choosing one traversal over another on a regular binary tree unless you deliberately want a certain ordering.
• However, inorder traversal is important on a variation of the binary tree. More on that in just a moment.

Inorder Traversal
BinaryTree inorder(BinaryTree node, int targetvalue) {
    if (node == null)                    //Base case.
        return null;
    //All we did was swap the order of these two lines.
    BinaryTree lchild = inorder(node.left, targetvalue);   //Traverse left.
    if (node.value == targetvalue)       //Found it.
        return node;
    if (lchild != null)
        return lchild;                   //Found on the left.
    return inorder(node.right, targetvalue);               //Traverse right.
}

Inorder Traversal: Illustration
Order: [4 2 5 1 6 3 7]
The root is always the middle node.

Postorder Traversal
• The obvious next step: output after both recursive calls.
• This causes the algorithm to dive down to the bottom of the tree and output/visit the node when going back up.
  – Similar to what we did in printTo(), actually.
  – We are outputting on the pop.

Postorder Traversal
BinaryTree postorder(BinaryTree node, int targetvalue) {
    if (node == null)                    //Base case.
        return null;
    BinaryTree lchild = postorder(node.left, targetvalue);  //Traverse left.
    BinaryTree rchild = postorder(node.right, targetvalue); //Traverse right.
    if (node.value == targetvalue)       //Found it.
        return node;
    if (lchild != null)
        return lchild;                   //Found on the left.
    return rchild;                       //Found on the right or not at all.
}

Postorder Traversal: Illustration
Order: [4 5 2 6 7 3 1]
The root is always the last node to be visited.

CRUD: Binary Trees
• Insertion: ?
• Access: ?
• Updating an element: ?
• Deleting an element: ?
• Search/Traversal: O(n).

• All three traversals are linear: they visit every node in sequence.
  – They each just follow different sequences.
  – You can search by traversing, so search is also O(n).
• How long would it take to access a node, though?
  – If I knew I wanted the left child’s left child, how many pointers would I need to follow to get to it?

Tree Height
• To analyze worst-case access, we need to talk about tree height.
• The height of a tree is the number of vertical levels it contains, not including the root level.
• Or you can think of it as the number of times you’d have to traverse down the tree to get from the root to the lowest leaf node.
• Nodes in the tree are said to have a depth, based on how many vertical levels they are down from the root.
  – The root itself has a depth of 0.
  – The root’s children have a depth of 1.
  – Their children have a depth of 2…
  – Etc.
• The height is thus also the depth of the lowest node.

Height
            1        Depth = 0
          /   \
        2       3    Depth = 1
       / \     / \
      4   5   6   7  Depth = 2
Height = 2. Remember, don’t count the root level.

Height Balance
• A tree is considered balanced (or height-balanced) if the depth of the highest and lowest leaves differs by no more than 1.
• This turns out to be an important property because it forms a lower bound on the access time of the tree and lets us find the height.
• Question: If we have n nodes in a balanced binary tree, what is the height of the tree?
  – floor(log2 n).
  – Note that we had 7 nodes in the previous tree, but a height of 2. The tree was full; adding an 8th node would take the height to 3.
• The time to access a node depends on the height; thus we know it is O(log n) on a balanced tree.

Degeneracy
• Trees with only left or right pointers degenerate into linked lists.
  – Which gives you another perspective on why Quicksort became quadratic with everything on one side of the pivot.
  – Access on linked lists is O(n).
• Performance gets worse even as we approach this condition, so we want to keep trees balanced.

[Diagram: a right-only chain and a left-only chain of the nodes 1, 2, 3.]

CRUD: Balanced Binary Trees
• Insertion: ?
• Access: O(log n).
• Updating an element: ?
• Deleting an element: ?
• Search/Traversal: O(n).

• Once we know where to insert, insertion is simple.
  – Just add a new leaf there: O(1).
• However, discovering where to insert is a bit trickier.
  – Anywhere that a null child used to be will work.
  – We don’t want to upset the balance of the tree.
  – A good strategy is to traverse down the tree based on the value of each node. This creates a partitioning at each level.

Binary Tree Insertion
void insert(BinaryTree root, BinaryTree newtree) {
    //This can only happen now if the user passes in an empty tree.
    if (root == null)
        root = newtree;                    //Empty. Insert the root.
    else if (newtree.value < root.value) { //Go left if <.
        if (root.left == null)
            root.left = newtree;           //Found a place to insert.
        else
            insert(root.left, newtree);    //Keep traversing.
    } else {                               //Go right if >=.
        if (root.right == null)
            root.right = newtree;          //Found a place to insert.
        else
            insert(root.right, newtree);   //Keep traversing.
    }
}

Insertion Analysis
• This is similar to a traversal, but guided by the value of the node.
• We choose left or right based on whether the node is < or >=.
• We split into one subproblem of size n/2 each time we traverse.
  – What recurrence would we have for this?
  – What would be the solution?

CRUD: Balanced Binary Trees
• Insertion: O(log n).
• Access: O(log n).
• Updating an element: O(1).
• Deleting an element: ?
• Search/Traversal: O(n).

• If we’re already at the element we need to update, we can just change the value, thus O(1).
  – Note that we can say the same for insertion, but finding a place to put the node is usually considered part of it.
• Deletion is quite complex, on the other hand.
  – If there are no children, just remove the node – O(1).
  – If there is one child, just replace the node with its child – O(1).
  – If there are two… well, that’s the tricky case.

Deletion
• If we need to delete a node with two children, we need to find a suitable node to replace it with.
• One good choice is the inorder successor of the node, which will be the leftmost node of the right subtree of the node we’re deleting.
  – Inorder successor meaning the next node in an inorder traversal.
• So our course is clear: inorder traverse, stop at the next node, swap.

Deletion
void deleteWithTwoSubtrees(BinaryTree targetnode) {
    if (targetnode == null)
        return;                           //Deleting a null is a no-op.

    //Find the inorder successor and its parent.
    BinaryTree inorder_succ;
    BinaryTree inorder_parent = targetnode;
    for (inorder_succ = targetnode.right; inorder_succ.left != null;
         inorder_succ = inorder_succ.left)
        inorder_parent = inorder_succ;    //Keep track of the parent.

    //Set the value of the target node to that of the inorder successor…
    targetnode.value = inorder_succ.value;

    //Delete the inorder successor (here’s why we needed the parent),
    //splicing in its right subtree (it cannot have a left subtree):
    if (inorder_parent == targetnode)
        inorder_parent.right = inorder_succ.right;
    else
        inorder_parent.left = inorder_succ.right;
}

CRUD: Balanced Binary Trees
• Insertion: O(log n).
• Access: O(log n).
• Updating an element: O(1).
• Deleting an element: O(log n).
• Search/Traversal: O(n).

• Finding the inorder successor requires time proportional to the height of the tree. If the tree is balanced, this is O(log n).

CRUD: Unbalanced Binary Trees
• Insertion: O(n).
• Access: O(n).
• Updating an element: O(1).
• Deleting an element: O(1).
• Search/Traversal: O(n).
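To see where the unbalanced O(n) figures come from, here is a small sketch (mine, not the lecture’s; it reuses the value-guided insertion strategy from earlier and a hypothetical height helper) that feeds sorted input into a tree and measures the resulting height:

```java
//Demonstrates how sorted insertions degenerate a value-guided
//binary tree into a linked list.
class BinaryTree {
    int value;
    BinaryTree left, right;
    BinaryTree(int value) { this.value = value; }
}

public class Degeneracy {
    //The insertion strategy from the slides: go left if <, right if >=.
    static void insert(BinaryTree root, BinaryTree newtree) {
        if (newtree.value < root.value) {
            if (root.left == null) root.left = newtree;
            else insert(root.left, newtree);
        } else {
            if (root.right == null) root.right = newtree;
            else insert(root.right, newtree);
        }
    }

    //Height = depth of the lowest node; the root level is not counted.
    static int height(BinaryTree node) {
        if (node == null) return -1; //Empty tree, by convention.
        return 1 + Math.max(height(node.left), height(node.right));
    }

    public static void main(String[] args) {
        BinaryTree root = new BinaryTree(1);
        for (int i = 2; i <= 7; i++)
            insert(root, new BinaryTree(i)); //Sorted input: the worst case.
        System.out.println(height(root));    //Prints 6: a 7-node chain.
    }
}
```

A balanced arrangement of the same 7 nodes has height 2, so access cost has gone from O(log n) to O(n).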
• The worst sort of unbalanced tree is just a linked list.
• The deletion algorithm would always hit the second case (only one child), so we’d never experience O(log n) behavior…
• But the insertion algorithm is not as efficient as that of a linked list.
  – Unless we check for this condition explicitly, in which case we get O(1).

Binary Search Trees
• Binary Search Trees (BSTs) capture the notion of “splitting into two”.
• Or, to use the Quicksort term, partitioning.
  – The value of a node is the pivot.
  – The left tree contains elements < the pivot.
  – The right tree contains elements >= the pivot.
• They are simply binary trees that are kept sorted in the manner stated above.

BSTs: What do they entail?
• Like priority queues and sorted arrays, binary search trees are inherently sorted containers.
• This means inserting a sequence of elements and then reading them back will get them out in sorted order.
  – Ah, but this time we have three ways to read them back out. All three can’t give us the same order.
  – The elements of an inorder traversal are sorted in binary search trees.
• It also means that we’ll have to do some extra work to ensure that this guarantee holds.
  – But will this work influence the asymptotic performance?

A Binary Search Tree
            4
          /   \
        2       6
       / \     / \
      1   3   5   7
Inorder traversal: [1 2 3 4 5 6 7]
< on the left, >= on the right.

CRUD: Binary Search Trees
• Insertion: O(log n).
• Access: O(log n).
• Updating an element: ?
• Deleting an element: O(log n).
• Search: O(log n).
• Traversal: O(n).

• Search and traversal are no longer the same operation!
  – Traversal is analogous to linear search: look at every element, one at a time, and try to find the target.
  – Search on a BST is analogous to binary search: the data is sorted around the value of the node we’re at, so it guides us to eliminate half of the remaining elements at each step.
  – Just like other unsorted containers, we have to traverse to search a standard binary tree.
  – And like other sorted containers, a BST lets us do a binary search.
• Remember, BSTs are sorted in an inorder traversal.
  – Therefore, the deletion algorithm we previously specified will preserve the ordering.

Access on a BST
• Use the same strategy we used in binary search:
  – Compare the node.
  – If the target value is less than the node’s value, go left (eliminates the right subtree).
  – If the target value is greater than the node’s value, go right (eliminates the left subtree).
  – If it’s equal, we’ve found the target.
  – If we hit a null, the target isn’t in the tree.
• This exhibits the same performance: O(log n).
  – If the tree is balanced. In the degenerate case, we are binary searching a linked list, which is O(n).

Insertion on a BST
• The algorithm I gave you for insertion was actually the BST insertion algorithm as well.
  – That was one of the reasons why I chose that strategy, although it does result in a fairly balanced tree if the data distribution is uniform.
• In order to keep elements partitioned around the pivot, we need to traverse left when the new element has a value < the pivot and right when it’s >=.
• It was O(log n) before, and it still is.

Deletion on a BST
• I also gave you the BST deletion algorithm.
• As the inorder traversal is in sorted order, the inorder successor is the next element after the one we’re deleting in sorted order.
• If we replace the element we’re deleting with the next element in the sequence, the sequence is still sorted.
  – e.g., [1 3 5 8 13] after deleting 3 -> [1 5 8 13].
• It was O(log n) before, and it still is.

Updating a BST
• Ah, here’s something different.
• Updating unsorted containers is usually a constant-time operation, while updating sorted containers usually takes longer.
• When we change the value of a node in a BST, we may be required to change the node’s position in the tree to preserve the ordering.
  – This is why updating sorted containers is usually a slow operation.
• No one seems to want to deal with updating these, so most sources (including your textbook) just define it as “delete and reinsert”.
  – Which works fully and is very simple to do.
  – Don’t be afraid to do “quick and dirty” things if they don’t harm your performance.
• So does this harm performance?
  – Insertion is O(log n).
  – Deletion is O(log n).
  – Unless we can update in O(1) on a BST (we can’t), then no.

CRUD: Binary Search Trees
• Insertion: O(log n).
• Access: O(log n).
• Updating an element: O(log n).
• Deleting an element: O(log n).
• Search: O(log n).
• Traversal: O(n).

• This is the ultimate compromise data structure.
  – Arrays, Lists, Stacks, and Queues all did some things in constant time and other things in linear time.
  – This does everything (except traversal, which is inherently a linear operation) in logarithmic time.
  – But remember, logarithmic time isn’t much worse than constant.
  – So these are pretty good data structures.
• As usual, there’s a catch…

The Importance of Balance
• Every operation on a tree begins to degenerate when balance is lost.
  – And in the worst case, you end up with a less efficient linked list.
• Keeping the tree balanced is thus important.
  – There is one who is prophesied to bring balance to the Force, but I don’t think that includes your trees.
  – So the burden falls on you, my young padawan.
• Since BSTs are the structural analogue of Quicksort, you may have an idea of what insertion sequence will produce the worst case.
  – Yep, sorted or inverse-sorted, just as in Quicksort.
• Most data is not arranged like this already, and on average, BSTs stay fairly well balanced.
• But this is enough of a problem that various self-balancing structures have been invented. We will discuss these next week.

A General Note
• Although I put numbers in most of my examples, any sort of data can go in these.
  – Strings, Objects, Employees.
• Caveat: When using Java’s sorted containers, make sure your class implements Comparable.
  – Java doesn’t give you a BinaryTree class outright, but it does give you TreeSet and TreeMap.
  – TreeMap in particular is very neat; check it out.
  – We’ll do some things with these in Thursday’s lab.

“In all my life I’ll never see / a thing so beautiful as a tree.”
• The study of trees goes very deep.
  – We’ve just scratched the surface.
  – We’ll come back to self-balancing trees, heaps, and perhaps splay trees.
• The lesson:
  – Ideas are universal. They can come from your study. They can come from outside of your study. They can come from nature. They can come from anywhere.
• Next class: linear-time sorting, B+ trees, lab.
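As a small taste of those classes ahead of the lab, here is a sketch (my example, not the lab’s) using the standard java.util sorted containers, both of which are backed by red-black trees, a self-balancing BST:

```java
import java.util.TreeMap;
import java.util.TreeSet;

//A quick look at Java's built-in BST-backed sorted containers.
public class SortedContainers {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>();
        for (int v : new int[] {4, 2, 6, 1, 3, 5, 7})
            set.add(v); //O(log n) insertion; the set stays sorted.

        System.out.println(set);           //Iterates inorder: [1, 2, 3, 4, 5, 6, 7]
        System.out.println(set.first());   //Smallest element: 1
        System.out.println(set.higher(4)); //Inorder successor of 4: 5

        //TreeMap: the same idea, with values attached to sorted keys.
        TreeMap<String, Integer> ages = new TreeMap<>();
        ages.put("Carol", 31);
        ages.put("Alice", 29);
        ages.put("Bob", 40);
        System.out.println(ages.firstKey()); //Alphabetically first key: Alice
    }
}
```

Note that Integer and String already implement Comparable; a class of your own must do the same (or you must supply a Comparator) before it can go into these containers.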