SORTING AND SEARCHING
Sorting and Searching, November 13, 2005

• Sorting
• Algorithm Analysis
• Searching

• Sorting and searching are among the most frequent operations.
• Elementary sorting:
  – bubble sort
  – selection sort
  – insertion sort
• Merge sort
• Algorithm analysis is important: performance matters! We can observe and/or analyze, then tune or revise the algorithm.

Analysis of Algorithms
• Computing resources consumed:
  – running time
  – memory space
• Implementation of the algorithm:
  – machine (UltraSPARC 10, Mac, PC, ...)
  – language (Java, C++, ...)
• For a given input, the time and space used depend on the implementation.
• Size of the input data, denoted N, e.g.:
  – number of elements to sort
  – number of nodes in a tree to be visited
• Worst-case time complexity T(N): the maximum running time of the algorithm over all possible inputs of size N.

Big-Oh Notation - OrderOf()
• How to abstract from the implementation? Big-Oh notation.
• O(N) means each element is accessed once:
  – N elements * 1 access/element = N accesses
• O(N^2) means each element is accessed N times:
  – N elements * N accesses/element = N^2 accesses
• Only consider "asymptotic behavior", i.e., when N >> 1 (N is much greater than 1):
  – N is unimportant compared to N^2
• Disregard constant factors:
  – a newer machine might run all programs twice as fast
  – a line of code in one language may be several lines in another
• Remember, only the largest-N expression, without constants, matters:
  – 3N^2 is O(N^2)
  – sum_{i=1..N} i = N(N+1)/2 is O(N^2)
  – N/2 is O(N)
  – 4N^2 + 2N + 1 is O(N^2)

f(N) on linear graph paper
[Figure: f(N) vs. N on linear axes for f(N) = 2^N, N^2, N log N, N, and log N; the faster-growing functions dominate long before N = 140]

f(N) on log-log graph paper
[Figure: the same functions on log-log axes, N from 10 to 10^7]
• x-axis: log N; y-axis: log f(N)
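The access counts behind O(N) and O(N^2) above can be made concrete with a small sketch. This is an illustrative example (class and method names are hypothetical), counting array accesses for a single pass versus a nested pass:

```java
public class AccessCount {
    // One pass over the data: each element is accessed once, so O(N).
    static long linearPass(int[] a) {
        long accesses = 0;
        for (int i = 0; i < a.length; i++) {
            int x = a[i];   // 1 access per element
            accesses++;
        }
        return accesses;
    }

    // Nested passes: each element is accessed N times, so O(N^2).
    static long quadraticPass(int[] a) {
        long accesses = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a.length; j++) {
                int x = a[j];   // N accesses per element
                accesses++;
            }
        }
        return accesses;
    }

    public static void main(String[] args) {
        int[] a = new int[100];
        System.out.println(linearPass(a));     // 100   = N
        System.out.println(quadraticPass(a));  // 10000 = N^2
    }
}
```

Doubling N doubles the first count but quadruples the second, which is exactly the asymptotic behavior Big-Oh captures.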
• the diagram of cf(N) is obtained by "shifting" the diagram of f(N) up by log c.

Example: Bubble Sort
• Iterate through the sequence, comparing each element to its right neighbor.
• Exchange adjacent elements if necessary.
• Keep passing through the sequence until no exchanges are required (up to N times).
• Each pass causes the largest remaining element to bubble into place: 1st pass, largest; 2nd pass, 2nd largest, ...

  49  2 36 55  4 72 23    before a pass
   2 36 49 55  4 72 23    middle of the first pass
   2 36 49  4 55 23 72    after one pass

Worst-Case Time, Bubble Sort

  i = N; sorted = false;              // executed 1 time each
  while ((i > 1) && (!sorted)) {      // N
      sorted = true;                  // N - 1
      for (int j = 2; j <= i; j++) {  // (N-1)+(N-2)+ ... + 2 + 1 = N(N-1)/2
          if (a[j-1] > a[j]) {
              temp   = a[j-1];        // exchange a[j-1] and a[j]
              a[j-1] = a[j];
              a[j]   = temp;
              sorted = false;
          }
      }
      i--;                            // N - 1
  }

  – N is the number of objects in the sequence
  – a[j] is shorthand for accessing the jth element in the sequence

Worst-case analysis:
• the while-loop is iterated N - 1 times
• iteration i executes 2 + 6(i - 1) instructions
• Total: 2 + N + 2(N-1) + 6[(N-1) + ... + 2 + 1] = 3N + 6N(N-1)/2 = 3N^2 + ... = O(N^2)

Insertion Sort
• Like inserting a new card into a partially sorted hand by bubbling it to the left into the sorted subarray; a little less brute-force than bubble sort:
  – add one element a[i] at a time
  – find its proper position, j+1, to the left by shifting its left neighbors a[i-1], a[i-2], ..., a[j+1] to the right, until a[j] < a[i]
  – move a[i] into the vacated a[j+1]

  2  4 36 55  9 72 23    a[i] = 9 is the next element to insert
  2  4  _ 36 55 72 23    36 and 55 are shifted right, leaving a hole at j+1
  2  4  9 36 55 72 23    9 is moved into a[j+1]

• After iteration i < n, a[1] ...
a[i] are in sorted order, but not necessarily in their final positions.

Time Complexity of Insertion Sort

Pseudocode implementation (note: a[i] must be saved in a temporary before shifting, or the first shift would overwrite it):

  for (int i = 2; i <= n; i++) {
      temp = a[i];
      int j;
      for (j = i - 1; (j > 0) && (a[j] > temp); j--) {
          a[j+1] = a[j];    // move a[j] forward
      }
      a[j+1] = temp;        // move the saved a[i] to a[j+1]
  }

Analysis
• The most executed instructions are the moves in the inner for-loop.
• Such instructions are executed at most 1 + 2 + ... + (N-2) + (N-1) times.
• Time complexity: O(N^2) worst case; constants do not matter for Big-Oh.

Selection Sort
• Find the smallest element and put it in a[1].
• Find the 2nd smallest element and put it in a[2].
• etc.
• Less data movement (no bubbling).

Pseudocode:

  for (int i = 1; i < n; i++) {
      find the minimum element a[min] in the subsequence a[i..n]
      exchange a[min] and a[i]
  }

• After iteration i, a[1] ... a[i] are in their final positions.

Time Complexity of Selection Sort

  for (int i = 1; i < n; i++) {
      int min = i;
      for (int j = i + 1; j <= n; j++) {
          if (a[j] < a[min]) {
              min = j;
          }
      }
      temp = a[min]; a[min] = a[i]; a[i] = temp;
  }

Worst-Case Analysis
• The most executed instruction is the comparison in the inner for-loop (the if).
• Each such instruction is executed (N-1) + (N-2) + ...
+ 2 + 1 times.
• Time complexity: O(N^2)

Comparison of Elementary Sorting Algorithms

                 Selection    Insertion    Bubble
  Comparisons
    Best         n^2/2        n            n
    Average      n^2/2        n^2/4        n^2/2
    Worst        n^2/2        n^2/2        n^2/2
  Movements
    Best         0            0            0
    Average      n            n^2/4        n^2/2
    Worst        n            n^2/2        n^2/2

Note: smaller terms omitted.

Merge Sort
• Divide-and-conquer algorithm
• Time complexity: O(N log N)
• Simple recursive formulation
• Stable: preserves the initial order of equal keys

Merging Two Sorted Lists

  A:  1  5  9 25            (length M)
  B:  2  3 17               (length N)
  C:  1  2  3  5  9 17 25   (length M + N)

• Time: O(M + N)

Outline of Recursive (Top-Down) Merge Sort
• Partition the sequence into two sub-sequences of N/2 elements.
• Recursively sort each sub-sequence.
• Merge the sorted sub-sequences.

Recursive Merge Sort
• listSequence is the sequence to sort.
• first and last are the smallest and largest indices of the sequence.

  public class Sorts {
      // other code here
      public void mergeSort(ItemSequence listSequence, int first, int last) {
          if (first < last) {
              int middle = (first + last) / 2;
              mergeSort(listSequence, first, middle);
              mergeSort(listSequence, middle + 1, last);
              merge(listSequence, first, middle, last);
          }
      }
  }

Bottom-Up Merge Sort

[Figure: eight elements merged pairwise, level by level: (2 7), (3 5), (4 8), (1 6) at the first level; (2 3 5 7) and (1 4 6 8) at the second; (1 2 3 4 5 6 7 8) at the top]

This should remind you of ... a tree!

Bottom-Up Merge Sort

  for k = 1, 2, 4, 8, ..., N/2 {
      merge all pairs of consecutive sub-sequences of size k
      into sorted sub-sequences of size 2k.
  }

• The number of iterations is log2 N:
  – if there are 1000 elements in the list, only 10 iterations are required!
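The recursive mergeSort above is written against an ItemSequence. A minimal runnable sketch of the same algorithm on a plain int array (class and helper names are illustrative; the merge uses a temporary buffer) might look like this:

```java
import java.util.Arrays;

public class Sorts {
    // Sort a[first..last] (inclusive) by recursive merge sort.
    public static void mergeSort(int[] a, int first, int last) {
        if (first < last) {
            int middle = (first + last) / 2;
            mergeSort(a, first, middle);        // sort left half
            mergeSort(a, middle + 1, last);     // sort right half
            merge(a, first, middle, last);      // merge the two sorted halves
        }
    }

    // Merge the sorted sub-sequences a[first..middle] and a[middle+1..last].
    private static void merge(int[] a, int first, int middle, int last) {
        int[] buffer = new int[last - first + 1];  // temporary copy
        int i = first, j = middle + 1, k = 0;
        while (i <= middle && j <= last) {
            // <= keeps equal keys in their original order: the sort is stable
            buffer[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        }
        while (i <= middle) buffer[k++] = a[i++];  // drain the left half
        while (j <= last)   buffer[k++] = a[j++];  // drain the right half
        System.arraycopy(buffer, 0, a, first, buffer.length);
    }

    public static void main(String[] args) {
        int[] a = {7, 2, 8, 5, 4, 3, 1, 6};
        mergeSort(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```

Each merge scans both halves once, so each level of recursion does O(N) work, giving the O(N log N) total derived later.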
• Each iteration (one merging pass over all pairs of sub-sequences) takes O(N) time in total.
• Total time: T(N) = O(N log2 N).
• Sequential data access:
  – merge sort accesses data sequentially
  – particularly useful for sorting linked lists and data stored on disk

Merging Guidelines
• Use a buffer sequence (a temporary copy) to put the merged sub-sequences into.
• Scan each half of the original sequence.
• Compare two elements from the two different halves of the sequence.
• Continue until you have reached the end of both halves of the sequence.

Recurrence Relations for Time Complexity
• Recurrence relation (or just recurrence):
  – a recursive definition of a function
  – often used to express the time complexity of a recursive algorithm
• Example:
  f(1) = 1
  f(N) = f(N - 1) + N   for N >= 2
• Solution (by "unfolding"):
  f(N) = f(N - 1) + N
       = f(N - 2) + (N - 1) + N
       = f(N - 3) + (N - 2) + (N - 1) + N
       = 1 + 2 + 3 + ... + (N - 2) + (N - 1) + N
       = N(N + 1)/2 = O(N^2)

Example: Towers of Hanoi
• Goal: transfer all N disks from peg A to peg C.
• Rules:
  – move one disk at a time
  – never place a larger disk above a smaller one
• Recursive solution:
  – transfer N - 1 disks from A to B
  – move the largest disk from A to C
  – transfer N - 1 disks from B to C
• Total number of moves:
  – T(N) = 2 T(N - 1) + 1
  – T(1) = 1

Solution of the Time Complexity Recurrence for Towers of Hanoi
• Recurrence relation:
  – T(N) = 2 T(N - 1) + 1
  – T(1) = 1
• Solution by unfolding:
  T(N) = 2 (2 T(N - 2) + 1) + 1 = 4 T(N - 2) + 2 + 1
       = 4 (2 T(N - 3) + 1) + 2 + 1 = 8 T(N - 3) + 4 + 2 + 1
       ...
       = 2^i T(N - i) + 2^(i-1) + 2^(i-2) + ... + 2^1 + 2^0
• The expansion stops when i = N - 1:
  T(N) = 2^(N-1) + 2^(N-2) + 2^(N-3) + ...
+ 2^1 + 2^0
• This is a geometric sum, so we have:
  T(N) = 2^N - 1 = O(2^N)

Time Complexity of Recursive Merge Sort
• Recurrence: two recursive sorts of size N/2, plus a merge costing N:
  T(N) = 2 T(N/2) + N   for N >= 2
  T(1) = 1
• Unfolding:
  (1) T(N) = 2 T(N/2) + N
  (2) T(N) = 2 [2 T(N/4) + N/2] + N = 4 T(N/4) + N + N
  (3) T(N) = 8 T(N/8) + N + N + N
  ...
  (i) T(N) = 2^i T(N/2^i) + N + N + ... + N   (i copies of N)
• Unfolding stops when N/2^i = 1, i.e., i = log2 N:
  T(N) = 2^(log2 N) + N log2 N = N + N log2 N = O(N log2 N)

Sorting and Searching, Part II: Searching
• Using:
  – sequences (arrays, vectors)
  – linked lists
  – trees
  – hash tables

Searching in Different Structures
• Searching is one of the most common and fundamental computer tasks. We learned how to sort so that we could search faster.
• A search function takes an input key and returns the output value(s) or object.
[Figure: key/value layouts in 1. sequences, 2. linked lists, and 3. binary trees]

Simple Searches in Sequences
• Searching an unordered sequence:
  – dumb linear search
  – Time = N when duplicates must be found, or when the worst-case scenario arises
  – Time = N/2 in the average case for successful searches
  – Time = N for unsuccessful searches
• Searching an ordered sequence:
  – use binary search (e.g., a telephone book)
  – average and worst-case time = log2 N for both successful and unsuccessful searches
  – duplicates can be found by checking adjacent elements
• Using binary search on an ordered sequence thus offers logarithmic rather than linear runtime.
• But we don't always use sequences to store our data:
  – insertion and deletion require data movement!
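The binary search on an ordered sequence described above can be sketched as follows (a minimal illustration; it returns the key's index, or -1 for an unsuccessful search, and duplicates would be found by checking adjacent elements):

```java
public class Search {
    // Binary search in a sorted array: O(log2 N) comparisons.
    static int binarySearch(int[] a, int key) {
        int low = 0, high = a.length - 1;
        while (low <= high) {
            int middle = (low + high) / 2;   // halve the search range each step
            if (a[middle] == key) return middle;
            if (a[middle] < key) low = middle + 1;   // key is in the right half
            else                 high = middle - 1;  // key is in the left half
        }
        return -1;   // unsuccessful search
    }

    public static void main(String[] args) {
        int[] a = {1, 5, 9, 25, 36, 49, 72};
        System.out.println(binarySearch(a, 25));  // 3
        System.out.println(binarySearch(a, 4));   // -1
    }
}
```

Each iteration halves the range, so at most about log2 N iterations are needed, successful or not.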
  – linked lists obviate most data movement at the cost of extra pointers

Simple Searches in Linked Lists
• Searching an unordered linked list:
[Figure: head -> key1/value -> key2/value -> key3/value -> tail]
• Linear search; we can't do binary search.
• Time = N if duplicates must be found, and in the worst case.
• Time = N/2 on average for successful searches.

Simple Searches in Linked Lists (cont.)
• Searching an ordered linked list:
[Figure: head -> Mike -> Sara -> Kate -> tail]
  – time = N/2 in the average case for successful, unsuccessful, and duplicate searches
• An easy-to-implement dynamic data structure, at the price of increased running time:
  – O(N) >> O(log2 N)

Binary Trees
• Offer the advantages of both sequences and linked lists:
  – logarithmic search time for mostly balanced trees: proportional to the depth of the tree, log2 N on average
  – easy inserting
• What's the catch?
  – an extra reference in each node
  – somewhat more complicated deletions, as shown in the Trees lecture

n-ary Trees
• Anything faster than a binary tree?
• An n-ary tree (n references per node)!
  – searches in logn N time
• But:
  – the difference between the logs is small, because log (which corresponds to tree depth) grows slowly; e.g., log2 10^6 = 20, log10 10^6 = 6
  – requires more complex node processing
  – not in common use (in this form)
• Is there a scheme that offers the advantages of both log2 and logn?

Hybrid Trees
• n-ary at the root, binary thereafter.
[Figure: a root node indexed by alphanumeric keys A, B, C, D, E, F, G, ...; under A hangs a binary tree with all keys with first letter = A; under E hangs a
binary tree with all keys with first letter = E; and so on]

Most Efficient Data Structures
• The most efficient search would be a 1-level tree where we could access each key immediately:
[Figure: key1 -> value 1, key2 -> value 2, ..., keyn -> value n, all reachable in one step]
• Implement this dream-tree as a sequence; then, if each key were an integer in the range of the sequence size, use it as an index. Check the contents of the sequence at that location, and done!
• Sounds perfect, right?

Symbol Table Example
• Consider stupid old BASIC with one-character variable names.
• These correspond to 8-bit integer keys in the ranges 65-90 and 97-122, using the ASCII code.
• The value is a link to the contents (value) of the identifier.
• ASCII value for 'A' = 01000001 = 65
[Figure: a symbol table indexed from 65 (value for 'A') through 90 (value for 'Z') and from 97 (value for 'a') through 122 (value for 'z')]
• But does this generalize to realistic-size keys and tables?

Problems with Different-Sized Keys
• Creating a table to look up CS15 students based on ID number would be a tremendous waste of space:
  – an ID number is one letter followed by five digits (e.g., D00011), so there are 26 * 10^5 combinations!
  – we do not want to allocate 2,600,000 words for no more than 100 students (1 word = 4 bytes)
  – the sequence would be rather sparse...
• What about using their social security numbers?
  – we would need to allocate 10^9 words, about 4 gigabytes, for no more than 100 students!
• Thus, two major problems:
  – how can we deal with arbitrarily long keys, both numeric and alphanumeric?
  – how can we build a small, dense sequence that we can index into to find keys and values?

Introduction to Hashing
• Hashing refers to deriving a sequence index from an arbitrarily large key using a hash function.
• The index leads to the value or object. Therefore, a two-step process.
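The one-character BASIC symbol table above amounts to indexing an array directly by the ASCII code of the key. A toy sketch (class and method names are hypothetical) shows both why lookup is instant and why the table is sparse:

```java
public class SymbolTable {
    // One slot per 8-bit character code; most slots stay unused (sparse!).
    private final Integer[] table = new Integer[128];

    void put(char name, int value) {
        table[name] = value;    // 'A' = 65 is used directly as the index
    }

    Integer get(char name) {
        return table[name];     // one access, no searching at all
    }

    public static void main(String[] args) {
        SymbolTable st = new SymbolTable();
        st.put('A', 42);
        st.put('z', 7);
        System.out.println(st.get('A'));  // 42
        System.out.println(st.get('B'));  // null
    }
}
```

This is exactly the scheme that fails to generalize: for the six-character student IDs above, the table would need 26 * 10^5 slots for at most 100 entries.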
[Figure: key -> hash function -> index -> hash table -> value]

Introduction to Hashing (cont.)
• A hash table typically holds several hundred to several thousand entries.
[Figure: a sequence of links to instances of the class Student; Hash('Mike') = 1, Hash('Sara') = 3, Hash('Kate') = 5; entries 2 and 4 are null]

Collisions
• Problem: we normally have more keys than entries in our table. It is therefore inevitable that two keys hash to the same position...
  – e.g., Hash('Monica') = 4
  – and Hash('Scott') = 4
• This is called a collision, and it is the rule, not the exception.
• But sequences won't let us store multiple elements at one location!
• This did look too good to be true...

Handling Collisions
• Since, by design, we can't avoid collisions, we use buckets to catch the extra entries.
• Consider a stupid hash that returns the integer value of the first letter of each name.
• Each entry in the table could be a reference to a bucket:
  – implement the bucket with an unsorted linked list
  – then we only need to search within (small) linked lists
  – this is called "chaining"
  – for longer lists, we could sort the bucket, or use binary trees: a hybrid tree!
[Figure: table entries 65 through 90 pointing to linked-list buckets; one bucket holds "Mike", another holds "Sara" and "Kate"]

Linear Buckets
• Alternatively, each hash table entry could point to a (maximum-sized) sequence for its bucket, overflowing into the adjacent bucket if needed.
[Figure: buckets at 45, 65, and 83 holding Kate, Mike, Monica, Sara, and Scott]
• Wastes space.
• May have to search multiple sequence buckets.

Building a Good Hash Function
• Good hash functions:
  – take into account all the information in the key
  – fill out the hash table as uniformly as possible
• Thus, a function that uses only the first character (or any single character) is a terrible hash function:
  – not many Q's or Z's, lots of A's, M's, etc.
• % (mod) provides a simple method of ensuring that any integer can be brought within the desired range (pick a prime as the table size for best results).
• Thus, we could take a string, chop it into 4-letter sections, take the value of the 32 bits that make up each 4-letter section, XOR them together, and then % that result by the table size.
• Almost any reasonable function that uses all the bits will do, so choose a fast one!
  – example: hashValue = (key.GetValue()) % 101;
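A sketch of the string hash just described: chop the key into 4-character sections, pack each into a 32-bit int, XOR the sections together, then % by a prime table size. The exact packing (8 bits per character) is an assumption; any scheme that uses all the bits would do:

```java
public class Hash {
    static final int TABLE_SIZE = 101;  // prime table size, as suggested above

    static int hash(String key) {
        int h = 0;
        // Pack each 4-character section into a 32-bit int, then XOR them together.
        for (int start = 0; start < key.length(); start += 4) {
            int section = 0;
            int end = Math.min(start + 4, key.length());
            for (int i = start; i < end; i++) {
                section = (section << 8) | key.charAt(i);  // 8 bits per character
            }
            h ^= section;
        }
        // % brings the result within table range; mask the sign bit first,
        // since in Java % of a negative int would be negative.
        return (h & 0x7fffffff) % TABLE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(hash("Mike"));
        System.out.println(hash("Monica"));
        // every result lies in [0, TABLE_SIZE)
    }
}
```

Because every character contributes bits to the XOR, anagrams of different sections tend to land in different slots, unlike the first-letter hash above.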