* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Array implementation of binary trees
Bloom filter wikipedia , lookup
Lattice model (finance) wikipedia , lookup
Linked list wikipedia , lookup
Rainbow table wikipedia , lookup
Comparison of programming languages (associative array) wikipedia , lookup
Red–black tree wikipedia , lookup
Interval tree wikipedia , lookup
Array implementation of binary trees Array implementation allows Store elements of the tree by levels g b • to travel in the tree from a parent to a child, i • to travel from a child to a parent. a l c very efficient for complete binary trees, j f 0 4 2 g b i a c 1 3 8 6 l 5 10 12 9 14 16 18 j f 7 very inefficient for trees that are not close to complete, 11 13 not a dynamic data structure. 15 17 1 Data structure Heap 2 Heap ADT • a complete binary tree class heap { private: ELEM* Heap; //pointer to array int size; //max size of heap int n; //current size of heap void siftdown(int); //helper for insert/delete public: heap(ELEM*, int,int); // constructor int heapsize() const; // return current size bool isLeaf(int pos) const; //True if a leaf posit. int leftchid(int pos) const; //return leftchild int rightchild(int pos) const; //return rightchild int parent(int pos) const; // return parent of pos void insert(const ELEM val); // insert val ELEM removemax(); ELEM remove(int pos) void buildheap(); // arrange elements into a heap }; • nodes partially sorted Max heap: for any node v, value in v ≥ value of any child of v Min heap: for any node v, value in v ≤ value of any child of v Heaps are used: As a priority queue In a heap sort We talk mostly maxheaps. 3 4 Function siftdown places one element in its proper place in a heap. void heap::siftdown(int pos) { while (!isLeaf(pos)){ insert an element into a heap: • find j, the position of the larger of the children of pos; • if Heap[j] ≤ Heap[pos] return; Otherwise swap (Heap[j], Heap[pos]); pos =j; } time required is O(log2n) void heap :: buildheap() { for (int i = n/2-1; i >=0; i--) siftdown(i); } void heap::insert(const ELEM val) { assert(n <size); int curr = n++; Heap[curr]= val; //go upwards to put it in right place while ((curr!=0) && (Heap[curr]>Heap[parent(curr)])) { swap(Heap[curr], Heap[parent(curr)]); curr = parent[curr] } } time required is O(log2n) time required is O(n) 5 6 AVL Trees delete the maximal element from a heap ELEM heap::removemax() { assert(n >0); swap(Heap[0], Heap[--n]); //put max // element in last place if (n !=0) siftdown(0); return Heap[n]; } The height of a binary tree depends greatly on the order in which the nodes are inserted. If eight nodes 50, 30, 40, 70, 20, 60, 80 are inserted in the above order, we obtain a complete binary tree T1 of height 3. time required is O(log2n) If, however the nodes are inserted in order Very suitable for a priority queue, a queue in which deletions are done in order of the priority of elements. 20, 30, 40, 50, 60, 70, 80 then we obtain a tree T2 of height 7. insertion, deletion is O(log2n) In T2 a search for a node corresponds to a linear search. 7 8 It is important to created a binary tree so that it is “close” to a complete binary tree. AVL tree (named after Adelson, Velskii and Landis) or balanced binary search tree is a tree on n nodes whose height is always O(log n). 20 80 20 30 Definition We say that a binary search tree T is an AVL tree, or balanced tree if for every node v of the tree the difference in heights of the left and right sub-trees of v is at most one. 40 40 a) balanced 60 30 90 70 80 60 b) unbalanced at 80 70 9 An AVL tree is not necessarily the best possible search tree, but an AVL property guarantees that the height of the tree is not far from optimal: Theorem Let T be an AVL tree on n nodes. Then height(T ) ≤ 1.44 log n. We now discuss how the make an insertion of a node into an AVL tree so that 1. the tree remains a binary search tree, 2. the tree remains balanced, 40 −1 20 −1 30 0 80 1 60 −1 90 0 70 0 We have in each node of the tree the balance factor: the difference of the heights of the two subtrees of the node. 3. a search in the tree is O(log n) 4. the insertion of a new node, after the search requires O(1) operations. 10 11 Insertion in AVL tree Initially, insert a node u into an AVL tree like in ordinary binary search trees, as a leaf. The balanced factor of A before the insertion was made was = 0, the insertion of one node can change any balance factor at most by + − 1. If the tree remains balanced, then insertion terminates. Four different types of rotations may be performed: If the tree is not balanced then a “rotation” is performed that makes the tree balanced again. 1. bf (A) = 2 and bf (left child B) = 1: Perform the right rotation around A. This “balancing” involves the node A such that 2. bf (A) = −2 and bf (right child B) = −1: Perform the left rotation around A. 1. A is on the path from the root to the node u that was inserted, 2. BF (A) = 2 or −2, 3. bf (A) = 2 and bf (left child B) = −1: Perform left-right rotation around A. 3. A is closer to u than any other node with balance factor 2 or −2. 4. bf (A) = −2 and bf (right child B) = 1: Perform the right-left rotation around A. 12 2 13 2 0 A 0 A B -1 0 1 B C 0 B A 1 T3 C T 1 T2 T T2 T T3 4 T3 T 2 T1 T1 1 A B Right rotation around A T 4 T3 T2 Double left-right rotation around A -2 A 0 -1 B B -2 A 0 0 0 A B 14 T1 A 15 1 C T2 C T 1 T 2 T 3 T1 T B Any of the rotations is done by changing 2 pointers in a single rotation and by changing 4 pointers in a double rotation. Summary of important facts to remember: 1. A rotation is done in the node that Any of these rotations is Θ(1). (a) has balance factor 2 or -2. (b) is closest to where the new node was inserted. Deletion of a node from an AVL tree can be done so that the tree remains balanced and so that O(log n) operations are needed for it. 2. In any rotation, the subtrees T1 T2 T3 in single rotations, or subtrees T1, T2, T3, T4 in double rotations, remain in the same order. We only change the positions of the nodes A and B in single rotations or of the nodes A, B, C in double rotations. Code for insertion and deletion on node in AVL trees can be found in many data structure books, we can mention the book Example of the construction of an AVL tree by inserting the following items in the given order. 3. The type of rotation depends on the signs of the balance factors. 20, 30, 40, 50, 60, 45 16 a) Insert 20 17 nsert 60 b)Insert 30 −2 30 30 0 20 20 −1 −1 0 0 20 40 −2 0 20 50 0 30 0 −1 0 60 40 50 c) Insert 40 20 0 t of balance at 40, do left rotation −2 f) 60 Out of balance at 20, do a left rotation 30 −1 ert 45. f) 30 −1 −2 30 0 40 0 0 0 20 1 50 20 50 e) Insert 50 0 0 30 0 60 −1 0 60 40 30 0 0 20 40 −1 40 0 20 18 40 19 −1 0 45 Out of balance at 30, do a double right−left 40 d) 0 50 0 rotati Sorting We will allow duplicate keys. We have a collection of records and we need to sort them by one field of the record the sort key. Example: Collection of student records, each record contains id-number, name, specialization If we sort then by id-numbers then id-numbers are the sort keys. If we sort then by names then names are the sort keys. Internal Sorting: All records are available in the main memory in an array. External Sorting: All records are available in a file on a disc memory. We will do mostly internal sorting. Sorting is stable if the sorting does not change the original relative order of records with identical keys. Internal sorting algorithms Some trivial sorting algorithms: Insertion sort Each record is put in the proper place of records already processed. Bubble sort Keep on swapping any two records which are out of place. Selection sort In the ith pass put ith record in its proper place. 20 21 Heapsort All trivial sorting algorithms require Θ(n2) average time to sort n records. Basic idea: 1. Arrange the records of the ARRAY into a Max-heap. We will discuss better sorting algorithms that require O(n log2 n) time to sort n records. If n = 10000, n2 = 100000000, n log2 n ≈ 13300, i.e. n log2 n is lower by a factor of 750. If n = 100000, n2 = 10000000000, n log2 n ≈ 1670000, i.e. n log2 n is lower by a factor of ≈ 6000. 2. for (int i = n; i > 0; i--) { swap (ARRAY[0], ARRAY[i]); siftdown ARRAY[0] to the correct place; consider only the heap in locations 0 to i; } Step 1 need Θ(n) operations, Siftdown needs at most log2 n operations Loop is repeated n times, Heapsort needs Θ(n + n log2 n) = Θ(n log2 n) operations. 22 23 Quicksort The fastest known method on average. Very widely used. Using the heap member functions we can code it as follows: Similarly like a binary search, it is a divide and conquer algorithm. void heapsort(ELEM* ARRAY, int n) { heap H(ARRAY, n, n); for (int i = n; i > 0; i--) H.removemax(); } It repeatedly splits the problem of sorting an array into sorting two half-size arrays. (invented by C.A.R. Hoare). qsort(ELEM* array, int i, int j){ // sort array from position i to j 1. Select one value P from the array; // P is the pivot 2. Partition the records so that first we have all records with keys < pivot, followed by the pivot; followed by records with values >= pivot; let k be the position of the pivot; 3. if ((k-i) > 1) qsort(array, i, k-1); if ((j-k) > 1) qsort(array, k+1, j); } Best case, average case, worst case needs Θ(n log2 n) operations. Heapsort only needs a constant number of additional memory locations. 25 24 25 13 40 17 63 5 81 33 10 18 55 11 11 13 40 17 63 5 81 33 10 18 55 11 13 18 17 10 5 81 33 63 40 55 2 11 13 18 17 10 5 25 33 63 40 55 8 55 8 55 3 55 3 25 sort it sort it 11 11 11 13 18 17 63 13 18 17 10 13 18 17 10 5 5 81 81 33 10 40 55 25 81 33 63 40 55 25 5 33 63 40 26 55 11 13 18 17 10 5 13 18 5 10 18 40 pivot = 33 pivot = 11 5 25 33 63 17 10 11 81 17 13 11 81 63 63 40 25 27 40 Code for the quicksort 5 10 18 17 13 11 void qsort(ELEM* array, int i, int j) { int pivotindex = findpivot(array, i,j); 5 17 18 10 13 11 swap(array[pivotindex], array[j]); undo last swap. // put pivot at the end int k = partition(array, i-1, j, 5 10 18 17 13 11 key(array[j])); // k is where pivot should be placed place pivot. swap(array[k],array[j]); 5 10 11 17 13 18 if ((k-i) > 1) qsort(array, i, k-1); if ((j-k) > 1) qsort(array, k+1, j); etc. } 29 28 Find Pivot .... O(1) time. Partitioning of the array: Partitioning of an array with n records: O(n) time. int partition(ELEM* array, int l, int r, KEY pivot) { do { while (key(array[++l]) < pivot); while (r && key(array[--r])> pivot); swap(array[l],aray[r]); } If each pivot divides the array evenly, we have to do O(log2 n) stages. Each stage involves in total partitioning of an array with n records. Total time: O(n log2 n) in the best and average case. while (l < r); // reverse unnecessary last swap swap(array[l],aray[r]); return l; //position for the pivot If each pivot divides the array very unevenly (one part could be empty) we have to do O(n) stages. Total time: O(n2) } in the worst case. 30 31 Choice of pivot is important so that it breaks the array evenly. Many possibilities for pivot choice. Mergesort A good possibility: Select the pivot as the median of keys of three randomly chosen records. Quicksort can be implemented without recursion by using a stack to store the indices of parts of the array to be sorted later. Quicksort needs additional space O(log2n) General idea Sort the left-half of the array; Sort the right-half of the array; Merge the two sorted sub-arrays; It is not an in-place sorting algorithm. It uses additional space of the same size as the array to be sorted. Quicksort and Heapsort are in-place sorting algorithms. The values of records stay in the originally given array. 32 33 // merge sorted subarrays int i1 = left; int i2=mid+1; int curr=left; void mergesort(ELEM*array, ELEM* temp, while (i1 <= mid) && (i2 <= right) int left, int right){ if key(temp[i1]) <= key(temp[i2]) if (left == right) return; array [curr++] = temp[i1++]; int mid = (left+right)/2; else array [curr++] = temp[i2++]; mergesort(array,temp,left,mid); // copy remaining records from left part mergesort(array,temp,mid+1,right); for (i=i1; i <= mid; i++); // copy array into temp array [curr++] = temp[i1]; for (int i = left; i <= right; i++) // copy remaining records from right part temp[i] = array[i]; for (i=i2; i <= right; i++); array [curr++] = temp[i2]; } 34 35 Radix Sort In the best, average, worst case, Mergesort is Θ(n log2 n) It is a stable sort. Heap and Quicksort are not stable. Needs O(n) additional space. Theorem Any sorting algorithm that is based on comparisons of keys needs at least n log2 n This is a sorting method that was used on mechanical sorting machines before computers were available. General idea (1) Sort all records by the last digit of the keys; (2) Sort all records by the second from last digit of the keys; (3) Sort all records by the third from last digit of the keys; (4) Sort all records by the fourth from last digit of the keys; etc. The sorting by a digit is done by placing the records into 0,1,2,3, .. 9 “bins” and then putting the records from the bins back into the array. comparisons on average. 37 36 void radix(ELEM* A, ELEM* B, int n, int k, int r, int * count) Not an in-place sorting method, // sort A of n records , each key contains needs O(n) additional space, // a stable sorting algorithm. k digits, we use radix r for (int i=0; rtok=1; i<k; i++) { In our examples, we use bin[0], bin[1],,..bin[9], but on computers, we usually use 8 bins corresponding to the last 3 digits of binary number representation. // repeat for k digits for (int j=0;j<r, j++) //initialize counts count[j]=0; // count records in each bin We put all bins together to save space and use count[i] to indicate the available place in bin[i]. for (j=0; j <n; j++) count[(key(A[j])/rtok)%r]++; We sort by the remainder or radix of each key. Variable rtok is rk . // count the bottom of each bin for (j=1;j<r; j++) count[j]=count[j-1]+count[j]; 38 39 // scan A from right to left and // put records into bins in right to left // direction Radix sort time depends on the (1) number of records . . . n (20 number of digits in the keys . . . k O(kn) for (j=n-1; j >=0; j--) B[--count[(key(A[j])/rtok)%r]] = A[j]; // copy B into A for (j=0; j<n; j++) A[j] =B[j]; } k ≥ logr n but it could be more! (if some keys are very long) not an in-place sort. } it is stable. 40 41 Empirical comparison of Sorting algorithms PC computer: Algorithm Insert. sort bubble sort quicksort mergesort heapsort radix sort8 100 .18 .23 .05 .07 .17 .06 array size 10000 30000 1847 16544 2274 20452 7.3 24 10.7 35 36 122 6.0 17 There are many other algorithms, we considered the most common and the best. Excellent reference on sorting algorithms: D.E. Knuth: Sorting and Searching, AddisonWesley. DEC station: Algorithm Insert. sort bubble sort quicksort mergesort heapsort radix sort8 100 .017 .026 .006 .013 .017 .018 array size 10000 100000 168 23382 257 41874 .9 12 2.3 30 3.5 49 1.6 18 42 43 Search Problem: Given a key K, find an object Ai so that Search: Ai.key() = K We have a collection of compound objects A1, A2, . . . , An Each object contains one element that is used for search, the search key ....... Ai = (ki, In) We assume that Ai.key() Ai.inf o() gives ki gives Ii, Search can be successful if an object containing K is found, unsuccessful if no object containing K is found, exact match query: We search only for objects Ai.key() = K range query We search for objects K − c ≤ Ai.key() ≤ K + c Different techniques are used when objects are stored on a disk rather than in the RAM 44 45 Assume we have n objects: We have seen: 1. lists: 2. Binary Search Trees, linked implementation: (a) array implementation, unordered: sequential search O(n), insert, delete (after search) O(1) search O(height), insert, delete O(height) (b) array implementation, ordered: binary search O(log2 n), insert, delete (after search) O(n) Height can be made Θ(log2n) with balanced search trees (AVL trees) (c) linked implementation, unordered: sequential search O(n), insert, delete (after search) O(1) Any search based on comparisons of keys needs Θ(log2n) comparisons. (d) linked implementation, ordered: sequential search O(n), insert, delete (after search) O(1) 46 47 A way to improve the binary search in an array. Ways to improve sequential search Interpolation or Dictionary Search: Given a key K, calculate an index of the array where the object might be. left = 0; // search objects between right = n-1; // left and right, inclusive while ( lef t <= right) { if ((right − lef t) == 0) pos = lef t; else { q = f loat((K − key().A[lef t])/ (A[right].key() − A[lef t].key())); pos = lef t + q ∗ (right − lef t + 1); if (K < A[pos].key()) right = pos − 1; else if (K > A[pos].key()) lef t = pos + 1; else return pos; } } (1) If we know the frequency of accesses of each element, we can store them in the list in the order of frequency of accesses, starting with the highest frequency. Frequency is given as a number between 0 and 1. 0.75 frequency for x means that 75% of the time we access x. It can give a good performance in some cases. Frequencies are often not known. Can work well if the distribution of key values is uniform. 48 (2) 49 Example of Count method: Self-Organizing Lists Lists in which the order of objects in the list is changed based on searches which are done. Place the objects that are being searched at the head, or closer to the head of the list. This can speed-up the search Basic Strategies 1. Count method: each object contains a count of searches for the object. Order objects by decreasing count. 2. Move to Front: Any object found is placed first. 3. Transpose: Any object found is interchanged with the preceding object 4. there are other possibilities. Assume we have a list of keys and counts, we omit the information stored in objects. After some searches, we have the list: (354, 3), (567, 2), (999, 2), (102, 1), (122, 1), (234, 1) Search for 122: 5 comp. (354, 3), (567, 2), (999, 2), (122, 2), (102, 1), (234, 1) Search for 122: 4 comp. (354, 3), (122, 3), (567, 2), (999, 2), (102, 1), (234, 1) Search for 999: 4 comp. (354, 3), (122, 3), (999, 3), (567, 2), (102, 1), (234, 1) Search for 122: 2 comp. (122, 4), (354, 3), (999, 3), (567, 2), (102, 1), (234, 1) Total comparisons = 15 comparisons 50 51 Example of Move to Front: Example of Transpose: Assume we have a list of keys, we omit the information stored in objects. After some searches, we have the list: Assume we have a list of keys, we omit the information stored in objects. After some searches, we have the list: 354, 567, 999, 102, 122, 234, 189, 333 Search for 122: 5 comp. 354, 567, 999, 102, 122, 234, 189, 333 Search for 122: 5 comp. 354, 567, 999, 122, 102, 234, 189, 333 Search for 122: 4 comp. 122, 354, 567, 999, 102, 234, 189, 333 Search for 333: 8 comp. 354, 567, 122, 999, 102, 234, 189, 333 Search for 999: 4 comp. 333, 122, 354, 567, 999, 102, 234, 189 Search for 333: 1 comp 354, 567, 999, 122, 102, 234, 189, 333 Search for 122: 4 comp. 333, 122, 354, 567, 999, 102, 234, 189 Search for 234: 7 comp. 354, 567, 122, 999, 102, 234, 189, 333 234, 333, 122, 354, 567, 999, 102, 189 Total comparisons: 17 comparisons. 52 Count method is better when some objects are accessed very frequently, but these accesses are not grouped together. Does not react very dynamically to changes of frequency of access patterns. It needs additional space. Move to front is better when some accesses are clustered together. It reacts very dynamically to changes of access patterns. Easy to implement for linked lists only. Transpose is easy to implement for both, arrays and linked lists. There are always some access patterns that make any of the above rather bad. A self-organizing list can perform well in some situations. 54 53 Hashing It has, on average, and with high probability, a constant search and insertion time O(1) Objects are stored in an array Table in n locations T able[0], T able[1], . . . , T able[n − 1] where n is usually larger than the number of objects we deal with. There is a hash function h, an object A is stored at the location T able[h(A.key())] whenever possible. i.e., the location of the object containing a given key is calculated by h. h only takes O(1) operations to calculate. 55 Example: Store student information for a class. key: I.D. numbers. hash function: int H(int id) { return id % 89;} T able[0], T able[1], . . . , T able[88] id’s: 2670860 3661156 3810151 3930688 3954765 4051254 4079469 4094603 4106091 4129245 4136977 4158601 4198948 3331423 3734927 3836991 3930947 4032292 4057279 4083733 4097890 4110226 4129350 4137876 4158695 4208072 3400034 3736660 3893715 3936007 4036417 4060695 4086899 4101901 4110625 4129911 4138104 4187490 4208668 3464121 3782239 3895440 3952002 4043316 4072626 4090985 4105370 4111265 4130928 4139739 4190238 4209389 3526100 3791238 3929388 3952746 4047958 4079124 4094026 4106032 4117182 4131037 4149882 4193830 4210344 2670860mod89 = 59, 3400034 mod89 = 56, 3526100 mod89 = 9, 3734927 mod89 = 42, 3782239 mod89 = 6, 3810151 mod89 = 61, 3893715 mod89 = 54, 3929388 mod89 = 38, 3930947 mod89 = 84, 3952002 mod89 = 46, 3954765 mod89 = 50, 4036417 mod89 = 0, 4047958 mod89 = 60, 4057279 mod89 = 36, 4072626 mod89 = 75, 4079469 mod89 = 65, 4086899 mod89 = 19, 4094026 mod89 = 26, 56 4097890 mod89 = 63, 4106032 mod89 = 17, 4110226 mod89 = 28, 4111265 mod89 = 88, 4129245 mod89 = 1, 4129911 mod89 = 44, 4131037 mod89 = 13, 4190238 mod89 = 29, 4198948 mod89 = 17, 4208668 mod89 = 36, 4210344 mod89 = 21 4101901 mod89 = 69, 4106091 mod89 = 76, 4110625 mod89 = 71, 4117182 mod89 = 42, 4129350 mod89 = 17, 4130928 mod89 = 82, 4136977 mod89 = 79, 4193830 mod89 = 61, 4208072 mod89 = 63, 4209389 mod89 = 45, Two keys that hash to the same locations are said to collide 4129350mod89 = 17 4198948mod89 = 17 4106032mod89 = 17 They collide. We need to have a policy how to solve the collisions. 58 3331423 mod89 = 64, 3464121 mod89 = 63, 3661156 mod89 = 52, 3736660 mod89 = 84, 3791238 mod89 = 16, 3836991 mod89 = 23, 3895440 mod89 = 88, 3930688 mod89 = 3, 3936007 mod89 = 71, 3952746 mod89 = 78, 4032292 mod89 = 58, 4043316 mod89 = 46, 4051254 mod89 = 63, 4060695 mod89 = 70, 4079124 mod89 = 76, 4083733 mod89 = 57, 4090985 mod89 = 11, 4094603 mod89 = 69, 57 In total, we had 3 3 2 2 2 3 2 2 2 2 2 keys keys keys keys keys keys keys keys keys keys keys hash hash hash hash hash hash hash hash hash hash hash to to to to to to to to to to to 17 36 42 46 61 63 69 71 76 84 88 Number of keys hashing to one location is typically small. (like above). 59 -1 ... no object stored in that slot Open hashing Objects that collide are stored outside the main table. A good open hashing strategy: store all objects hashed to the same location in a linked list. It is called Separate Chaining. 0 4036417 1 4129245 2 -1 3 3930688 4 -1 . . . 16 3791238 17 4106032 4198948 36 4057279 4208668 88 4111265 4129350 . . Objects in lists may be ordered: (1) by keys (2) by the entry time (3) frequency of access 3895440 60 The Separate Chaining is an excellent choice when we have additional space for the linked lists. Disadvantage: Space in the array not fully used, no control on how much additional space is needed! For hashing, the number of comparisons needed to find/insert an object does not depend on the number of objects we search among. It depends on the load factor 61 In the example: Load factor = 65/89 = 0.73 Search for an object that is in the table: 3 objects that need 3 key comparisons 8+3 objects that need 2 key comparisons 51 objects that need 1 key comparison Average number of comparisons: 3∗3+11∗2+51∗1 65 of objects load factor= no.table size = 82 65 = 1.27 The binary search tree would need on average 5 comparisons. We often distinguish: Search for an object in the table. Search for an object not in the table. Search for an object that is not in the table: Insert Ai = Search for an object not in the table, it stops when we reach an empty location that should contain Ai, and assigning Ai to that location. 62 3∗3+8∗2+78∗1 89 = 103 89 = 1.16 63 Closed Hashing Bucket Hashing All elements are stored in the hash table T able of size m. To avoid infinite loop, we need to have at least one position that does not contain an object. Hash table consists of B buckets each bucket contain m/B slots. + one overflow bucket following the table. h pos is the bucket address of Ai. home position of object Ai with key value Key = Ai.key() is h(Key) Insert: store Ai in the first available slot in the bucket, if a slot is available; else store it in the first available slot in the overflow bucket (1) h pos = h(Key); (2) If (T able[h pos] == N o Entry) T able[h pos] = Ai; // store it at home else apply collision resolution; We now discuss different collision resolution for close hashing. good for hash tables on a disk, we read in the whole bucket. bad for tables in RAM as a lots of space is often wasted, and the search can be slow for items in the overflow bucket. 64 Linear Probing 65 Some improvements: collision resolution puts the object in the next unoccupied slot going downward in the table (with a wrap-around) probe sequence is T able[h pos], T able[(h pos + 1)% T able size], T able[(h pos + 2)% T able size], etc., quadratic probing, its probe sequence is T able[h pos], T able[(h pos + 1)% T able size], T able[(h pos + 22)% T able size], T able[(h pos + 32)% T able size], T able[(h pos + 42)% T able size], etc., pseudo-random probing, its probe sequence is based on a fixed pseudo-random sequence r1 , r 2 , r 3 , . . . This has many drawbacks: It tends to cluster objects together, which increases the probability of further collisions and larger clusters. Large clusters imply a high probability of long searches. primary clustering 66 T able[h pos], T able[(h pos + r[1])% T able size], T able[(h pos + r[2])% T able size], T able[(h pos + r[3])% T able size], etc., Problem: Secondary clustering: two keys having the same home address have the same probe sequence. 67 Further improvement: Example on closed hashing: double hashing Main hash function: h(key) = key mod 13 We have one more hash function h2 > 0 which is used to calculate the steps in the probe sequence. hash function used for double hashing: h2(key) = key mod 7 + 1 step = h2(Key) its probe sequence is a table of keys and their home positions. key 267 133 340 246 352 166 373 275 178 479 139 T able[h pos], T able[(h pos + step)% T able size], T able[(h pos + 2 ∗ step)% T able size], T able[(h pos + 3 ∗ step)% T able size], It avoids most problems like primary and secondary clustering. Double Hashing is a good method for Closed Hashing. home pos. 267 mod 13 = 7 133 mod 13 = 3 340 mod 13 = 2 246 mod 13 = 12 352 mod 13 = 1 166 mod 13 = 10 373 mod 13 = 9 275 mod 13 = 2 178 mod 13 = 9 479 mod 13 = 11 139 mod 13 = 9 step for double hash 275 mod 7+1 = 3 178 mod 7+1 = 4 139 mod 7+1 = 7 68 Tables obtained: quadrat. linear 0 479 3 1 352 2 3 4 5 3 1 178 352 340 133 1 340 1 133 1 1 275 3 139 10 6 1 69 double hash 479 7 1 352 6 178 352 2 2 3 340 133 5 1 340 4 275 3 133 1 1 5 139 2 139 4 6 2 7 275 4 3 275 267 1 139 0 4 Linear probing no. of probes (unsuccessful search) 1 267 2 1 8 267 1 267 1 1 10 373 166 373 166 1 1 373 166 1 1 11 178 479 246 1 1 479 246 1 1 Average = 69/13 = 5.3 comparisons. 12 3 1 7 8 9 1 246 no. of probes (successful search) 70 11 10 373 166 11 178 12 246 9 8 9 10 In close hash tables, deleted items are replaced by tombstone. 71 Expected number of comparisons of keys S ... successful search U ... unsuccessful search collision resolution Number of comparison for different load factors .25 .50 .75 .9 .95 Open Hashing separate chaining Closed Hashing linear probing double hashing S U 1.12 1.03 1.25 1.11 1.38 1.22 1.45 1.31 1.48 1.34 S U S U 1.17 1.39 1.15 1.33 1.50 2.50 1.39 2.00 2.50 8.50 1.85 4.00 5.50 50.5 2.56 10.00 10.50 200.5 3.15 20.00 Some conclusions: Open hashing with separate chaining works well even when the load is high. Closed hashing should be used with: Important: It is the load factor and not the number of objects that is important for a good performance. double hashing, load not more than .9 to have a good performance. The numbers are valid on average, worst case could be worse. 72 Hash Functions 73 (2) mid-square method (1) mod method. h(key) = key%n; where n is the table size. Used for integer keys, Performs very well when keys are random. Does not perform very well when many keys end with the same suffix (sometimes last digits in i.d numbers indicate the area.) n must be a prime number . Example If we need to have a table for 8000 items and closed hashing, choose n ≈ 1.12 ∗ 8000 = 8960 Can choose n = 8963 from prime number tables. 74 h(key) = (((key ∗ key) >> b)%2a); Extracts a bits at positions a + b, . . . b + 1 from the left. Choose a, b so that we take the middle bits. Performs very well, good for integer keys. Table is of size 2a. Example If we need to have a table for 8000 items and open hashing, choose n = 213 = 8192 Each key is of length 16 bits, key 2 is of length 32 bits, Choose a = 13, b = 10 if key = 11234 = 0010101111100010, key 2 = 000001111|0000101101100|1110000100 h(key) = 0000101101100 = 364 75 (3) Folding Method Used for strings, h(key) = ((int)s[1]+(int)s[2]+· · ·+(int)s[i])%n; Add all characters together, take it mod table size; Example: n = 557 key = Smith, Jamie 83 + 109 + 105 + 116 + 104 + 42 + 32 + 74 + 109 + 105 + 101 = 980 mod 557 = 423 (4) other variations of the previous methods, see the textbook. When we know more about the keys, we can custom design a hash function that minimizes the collisions. It table size is bigger, add the numerical values of pairs of characters. The above explains the use of term hashing: chop identifier into pieces and mix it up 76 77 Evaluation of hash-based search techniques Excellent method in many cases. Search, insert, delete O(1) on average. -1 ... no object stored in that slot 0 4036417 1 4129245 2 3 Drawbacks: 4 Some space wasted, . .. Performance O(1) cannot be guaranteed, worst case could be O(n). obj. 2 -1 obj. 4 3930688 obj. 3 -1 obj. 7 16 3791238 17 4106032 obj. 5 4198948 4129350 . . Not suitable for range queries, 36 obj. 1 4057279 4208668 obj. 6 Cannot print items in order without sorting them first. In many cases, the hash table does not contain the objects, but only the keys and pointers to the objects. 78 obj. 8 88 4111265 obj. 67 3895440 79 Trees are not sufficient to represent or model some situations. Example: course prerequisites: C326 Sometimes we need a graphs C346 A graph consists of C353 C354 set of vertices ... V set of edges .... E C352 each edge is a pair of vertices, E ⊆ V × V If edge (u, v) is directed from u to v, we talk about a directed graph. digraph C229 C335 C249 C239 If edge (u, v) is considered to be between u and v, we talk about an undirected graph. Whether we use a directed or undirected graph depends on an application. C228 A graph whose edges are assigned a cost or weight are called weighted C238 C248 directed, no weights on edges. 80 81 We need to have a computer representation of the course prerequisites for student’s registration. example: Air Canada Network Sud. Th. B. 800 St. J. Que. Ott. 200 200 400 800 350 Review of some terminology: 250 800 250 550 Tor. 150 Montr. 500 650 London We need to have a computer representation of an airline network to find flights between cities. 1000 path: A sequence of vertices v1, v2, . . . , vn where (v1, v2), (v2, v3), . . . , (vn−1, vn) are edges in the graph. Hal. Weighted, each weight represents the distance, a simple path: all vertices of the path are distinct. cycle: a path that begins and ends in the same vertex. undirected. a subgraph of G: a subset S of vertices of G and a subset E1 of edges of G, E1 ⊆ S × S. 82 83 Graph Representations b connected graph: There is a path between any two vertices. a c component: a connected subgraph that cannot be made larger. e acyclic graph: a graph without a cycle. f d DAG a directed acyclic graph. tree can be viewed as connected, undirected graph with no cycle. the trees that we studied earlier are considered rooted trees since one node is designated as the root. 0 1 a e e f a d f 2 3 4 5 d a e c b a e e c d b a f b c e b c d Adjacency list for a graph 85 84 b a c The space cost of each representation e Adjacency matrix: Θ(|V |2) f d a b c d e f a ne 1 ne 1 1 1 b 1 ne 1 ne 1 ne ne 1 1 ne Adjacency list for a graph Θ(|V | + |E|) Notice that |E| in the worst case is Θ(|V |2), but is often only Θ(|V |) if the graph is sparse ne = no edge c ne 1 d 1 ne 1 ne 1 ne e 1 1 1 1 ne 1 f 1 ne ne ne 1 ne If the graph is undirected then the adjacency matrix is symmetric. Adjacency matrix: a 0-1 matrix, 1 indicate an edge. we need one bit per entry. 86 87 Graph ADT class Graph { public: Graph(); // constructor ~Graph(); // destructor int n(); // number of vertices int e(); // number of edges Edge first(int); // get the first edge for a vertex bool isEdge(Edge); // true if this is an edge Edge next(Edge); // get the next edge for a vertex int v1(Edge); // return the origin vertex of edge int v2(Edge); // return the target vertex of edge int weight(int,int); // return weight of edge int weight(Edge); // return weight of edge } 88 Matrix is a one-dimensional array of (numV ertex)2 N OEDGE indicates no edge there rows, columns numbered 0, 1, . . . , numV ertex − 1 Edge id the address of the entry in the adjacency matrix. element in row i, column j of the adjacency matrix is matrix[i ∗ numV ertex + j] 90 typedef int* Edge; class Graph { // Adjacency matrix implementation private: int * matrix // the edge matrix int numVertex; // Number of vertices int numEdge; // Number of edges bool* Mark; // used for marking visited vertices // in some algorithms public: Graph(); // constructor ~Graph(); // destructor int n(); // number of vertices int e(); // number of edges Edge first(int); // get the first edge for a vertex bool isEdge(Edge); // true if this is an edge Edge next(Edge); // get the next edge for a vertex int v1(Edge); // return the origin vertex of edge int v2(Edge); // return the target vertex of edge int weight(int,int); // return weight of edge int weight(Edge); // return weight of edge } 89 EdgeGraph::first(int v) { // Get the first edge of v int stop = (v+1)*numVertex; for (int pos = v*numVertex; pos <stop; pos++) if (matrix[pos] !=NOEDGE) return & matrix[pos]; return NULL; } EdgeGraph::next(Edge e) { // Get the first edge of v int stop = (v1(e)+1)*numVertex; // for (int pos = e-matrix+1; // pos is left of Edge pos <stop; pos++) if (matrix[pos] !=NOEDGE) return & matrix[pos]; return NULL; EdgeGraph:: v1(Edge e) // return the origin of e { return (e-matrix) / NumVertex;} // find the row of the matrix containing e 91 Graph Traversal Many algorithms require to traverse all vertices of a graph. Sometimes we want to traverse vertices of a graph from a given vertex using the edges of the graph until some specific goal is achieved. Example of goals: - visit all vertices in an order, - reach another given vertex (shortest path problem), - reach a vertex having some specific property, Must avoid running in a cycle in the graph. To traverse all vertices: repeat the traversal from a specific vertex until all vertices are visited. void graph_traverse(Graph *G) { for (v=0; v<G.n(); v++) G.Mark[v] = UNVISITED; // initialize the marks for (v=0; v<G.n(); v++) if (G.Mark[v]== UNVISITED) do_traverse(G,v); } do-traverse visits all vertices reachable from v Use the Mark array by marking every vertex already visited, 92 93 Depth-First Search (like preorder) When we visit a vertex v, do a depth first search for each vertex adjacent to v. h // Depth-First Search void DFS(Graph & G, int v) { PreVisit(G,v); // can print, change etc e g f G.Mark[v]= VISITED; // avoid looping for (Edge e = G.first(v); G.isEdge(e); b i c e=G.next(e)) d if (G.Mark[G.v2(e)] == UNVISITED) a DFS(G, v2(e)); j PostVisit(G,v); // can change, update, Recursively explore all nodes accessible from an edge before going sideways. } If we start from h with edge (h, e) we get: h, e, b, a, j, c, d, g, f, i, 94 95 Breadth-First Search (like level order traversal for trees) When we visit a vertex v, examine all nodes connected to v before going any further. h e b g f i a c d j We get h, e, f, g, b, i, c, d, a, j // Breadth-First Search void BFS(Graph & G, int v) { Queue Q(G.n()); // create a queue of // of sufficient length Q.enqueue(v); G.Mark[v]= VISITED; // avoid looping while (!Q.isEmpty()){ int u = Q.dequeue(): PreVisit(G,u); // can print, change etc for (Edge e = G.first(u); G.isEdge(e); e=G.next(e)) if (G.Mark[G.v2(e)] == UNVISITED){ (G.Mark[G.v2(e)] =VISITED; Q.enqueue(G.v2(e)); } PostVisit(G,v); // can change, update, } } We use a queue for this traversal. 96 97 C326 C346 C353 C352 C354 C335 Topological Sort Given a DAG G, find a linear order of all vertices of G u1 , u 2 , u 3 , . . . , u n C229 C249 C239 so that there is no edge (ui, uj ), i < j in G. C228 C238 C248 Possible orders: c228, c238, c248, c229, c239, c249, c335, c352, c346, c353, c354, c326 or c248, c238, c228, c239, c229, c249, c352, c335, c354, c346, c326, c353, or ... 98 99 Method 2: use a queue Q: 1. count no. of edges going into each node. Method 1: 1. Do DFS 2. enqueue in Q all nodes with no incoming edge. 2. when preVisit, do nothing 3. while !isEmpty(Q) (a) dequeue a node v 3. when posVisit, print out the node. (b) print the node v 4. it prints nodes in the reversed of a topological sort. (c) decrease the count of any node pointed at by v (d) if any count of a node becomes 0, enqueue the node. 100 101 Memory management There are many important graph algorithms: When a computer program is running, space must be allocated to • Shortest path problem • Minimum-cost spanning tree problem • each function call for the variables associated with the function, • Travelling Salesman problem • each data item created by a call to new When a function terminates or we call delete, we have to reclaim the space for reuse. • Maximum flow problem • many other problems This must be done dynamically, as the execution of the code procedes. they all use the data-structure graph. This allocation, de-allocation of space is called dynamic storage allocation. 102 103 For the variables associated with each function call we use a stack. (was discussed earlier.) Stack is not appropriate for the dynamically created variables. (to be discussed now.) We assume that to each program being executed, a contiguous segment of memory locations is allocated (in addition to the run-time stack). From this space we assign segments of it for dynamically created variables. Traditionally, this space is called a heap. Heap ≡ a messy pile of items of all sizes. It is different form data structure heap we used earlier. (Example of name overloading). Do not confuse the two different notions! 104 Any operating system also contains a memory management system to handle the memory requests for each process and deallocate the space of processes that terminate. Since the allocation/deallocation is dynamic, the heap can look like the following: heap: occupied block Some blocks are not being used at the time - free blocks. The status of blocks changes dynamically. In C + +: Each call of new ⇔ call the memory manager to create a new reserved block of appropriate size. Each call of delete ⇔ call the memory manager to deallocate a reserved block into a free block. The memory manager must keep track of the blocks so that the process of creation of a reserved/free block can be done efficiently. 105 Block allocation policy: Method used by the memory manager to select what free block or what part of a free block is given to a memory request. Failure policy: The policy used by the memory manager when no suitable free block is available for a request. A memory manager usually puts all free blocks into a linked list. freelist. Nodes in the freelist are of variable size. If m is close enough to n than the whole block is allocated. Some blocks are allocated to variables - reserved blocks To a request of size n manager must allocate a block of size m with m ≥ n. We see heap as an array of memory locations. It is divided into blocks of various sizes, free block Memory request of size n: A call of the memory manager to allocate a block of n consecutive words. 106 This creates internal fragmentation: some free space is located inside allocated blocks. If m is quite larger than n than the block is split: A block of size n is allocated and a free block of size m − n is kept in the freelist. 107 This creates external fragmentation: many free blocks are too small for future requests. Sequential Fit Method The free blocks are organized into a doubly linked (circular) list. Each allocated block contains 3 fields used by the memory manager: Request of size n creates an occupied block of size n + 3. These additional fields are used to minimize external fragmentation. (We assume below that each field is one word, but in practice it can be less.) Assume we allocate a block of size n to a request of size m: if n − m > M IN EXT RA then split the free block in two parts: the first part remains in the free list with changed size, second part is marked allocated and its address is given to the request. if n−m < M IN EXT RA then allocate the whole block: a node is removed from the linked list and marked allocated. 108 Tag Size + m #begin{slide} #define STARTTAG 0 #define SIZE 1 #define PREV 2 #define NEXT 3 #define ENDSIZE 4 #define ENDTAG 5 Llink Rlink m Free block + Size Tag Tag Size − m Reserved block m − Size Tag 109 110 int* { // // if allocate(int m) // Return a block of size >= m The size field store the actual number of free spaces, not including maintenance fields (m < 3) m = 3; // Must be big enough to be //a free block later int* temp = pick_free_block(m); // the address of block is in temp // Must be at least m+3 units if (temp[SIZE] >= m+MINEXTRA) { // Split block, save excess int start = temp[SIZE] - m + 3; // First unit of reserved block 111 temp[start] = temp[temp[SIZE] + ENDTAG] = RESERVED temp[start+SIZE] = m; temp[SIZE] -= m+3; // This much was reserved temp[temp[SIZE] + ENDSIZE] = temp[SIZE]; temp[temp[SIZE] + ENDTAG] = FREE; return &temp[start]; } else { // give over the whole block, // remove it from free list temp[STARTTAG]=temp[temp[SIZE] + ENDTAG]=RESERVED; temp[SIZE] += 3; // for the extra maint. fields // Freelist pointers point directly to positions // of neighboring blocks in array MemoryPool. MemoryPool[temp[PREV]] = temp[NEXT]; MemoryPool[temp[NEXT]] = temp[PREV]; return temp; } De-allocation of a block: If a de-allocated block at address start is adjacent to a free block on either side (check the tags at temp[start − 1] and at temp[start + SIZE + 3]) merge it with on or two free blocks. If a de-allocated block at address start is not adjacent to a free block on either side then change its tag fields, insert it into the list of free nodes as a new node. } 112 113 • best fit: go down the list and allocate the space from the free block whose size n is larger than m and for which n − m is the smallest from among all free blocks. Possible Allocation strategies Assume that we have a request of size m. We have to allocate a block of size n with n ≥ m. If n − m > M IN EXT RA then leave in the list a free block of size n − m. • first fit: go down the list and allocate the space from the first free block whose size n is larger than m. Possible disadvantage: it may allocate a large block that would be more suitable for subsequent requests. Possible disadvantage: must examine the whole list in most cases and can create bad external fragmentation. • worst fit: go down the list and allocate the space from the largest free block. Possible disadvantage: must examine the whole list and might fail when some request are very large. Which strategy is best: none, it depends on the expected sizes of requests. If expected sizes not known, the first fit is used. 114 115 Buddy Method: Initially lk contains one node, all other lists are empty. Assumes the available memory is of size 2k for some k. Any free or reserved block is of size 2i for i ≤ k. Notice that any buddies are adjacent and of the same size. De-allocation is inverse of allocation: It keeps k lists of free blocks, list li keeps free blocks of size 2i. For a request of size m, we allocate a block of size 2j where j = log2 m . if list lj is not empty: allocate a block from it; else if list lj+1 is not empty:{ take a block from it; split it in two buddies; allocate one of them; put the other in lj ; } else repeat the process recursively with lj+2. To free a block of size 2j : inspect list lj ; if the list does not contain the buddy of the block, insert it in lj ; else {remove the buddy from lj ; merge two buddies in block of size 2j+1; proceed recursively with list lj+1; } Advantage: less of external fragmentation, easy merging of free blocks. Disadvantage: more of internal fragmentation. 116 117 Failure policies It refers to actions taken when a memory request cannot be satisfied. Compaction: move reserved blocks so that all free blocks are merged. (We must take care to adjust addresses of data items that are moved.) When out of memory, the memory manager collects all nodes that are not accessible through variables in the program. This is referred to as the garbage collection. It can recover the memory space lost through dangling pointers. Two typical algorithms: Garbage collection: When all blocks are of the same size, the memory manager links all free nodes into a freelist. For any request a node is detached from the freelist. No attempts are made to return nodes to the freelist until the memory manager runs out of memory. 118 reference count: each node contains the count of pointers pointing to a node. When count of a node is 0, the node can be recovered. (cannot handle recursive references) 119 mark and sweep: each node contains a mark bit. 1. visit sequentially all nodes and turn their mark bits off. 2. in each node reachable through variables in the program turn the mark on. 3. visit sequentially all nodes and append all unmark nodes into the freelist. This can handle most general cases, but can be slow. 120