Data Structures & Computational Complexity
ECE 244 2013
Vaughn Betz

Data Structure Complexity

Operation                 Unordered List                          Vector                             Binary Search Tree
Build (N items)           O(N)                                    O(N)                               O(N log N)
Find item by value/key    O(N) in general; O(1) at head / tail    O(N)                               O(log N)
Insert item               O(1)                                    O(N) in general; O(1) at back      O(log N)
Delete item               O(N) in general; O(1) at head / tail    O(N) in general; O(1) at back      O(log N)

• BST: all operations reasonably fast
  – Good match to database (find, insert, delete common)
• Vector:
  – Slow to insert/delete except at end/back
  – Slow find by value/key: must search through everything to find an entry
• Unordered List: fast insert, slow find and delete
  – Good if you only find/delete at head and tail
  – Stacks and queues

Important Vector Special Case

• If key is an integer
• AND range of key is limited: [0 to N-1]
• Can store in array/vector, use key as index

    MyObject array[N];
    int key;
    MyObject object_with_key = array[key];

  array: MyObject 0, MyObject 1, MyObject 2, ..., MyObject N-2, MyObject N-1

• Build: O(N)
• Find, Insert, Delete: O(1)
• Fastest data structure!

Hash Tables

Hash Table - Idea

• Stores data in a specially indexed array
• Enables O(1) find, insert, delete operations
  – Very fast!
  – Trade-off: wastes some space
• Each item has a unique key
  – Key range large (e.g. student numbers: 900123456)
  – Many more keys than items to store
  – Can't just make array[MAX_KEY]: too big
• Know roughly how many items to store
  – E.g. number of students in ECE department
• Create array with more entries than we have items to store
• "Hashing function" h(key) maps from key to array entry
• Reference: Algorithms by Cormen et al, Chap. 11

Hash Table - Implementation

    MyObject hash_array[M];

  key (0 .. 999999999) -> h(key) -> index [0..M-1]

– Example:
  – h(key) = key % M;
  – M = 10
  – key = 900200204
  – h(key) = 4

  hash_array: MyObject 0, MyObject 1, MyObject 2, ..., MyObject M-2, MyObject M-1
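The modulo hash h(key) = key % M from the implementation example can be sketched as a small function. This is only an illustration: the name hash_index and the choice of long long for keys are assumptions, not part of the slides.

```cpp
#include <cassert>

// Hypothetical helper (name not from the slides): maps a non-negative
// integer key to an index in [0, table_size - 1] with the modulo hash.
int hash_index(long long key, int table_size) {
    assert(key >= 0 && table_size > 0);  // modulo hash assumes these
    return static_cast<int>(key % table_size);
}
```

With M = 10, hash_index(900200204, 10) gives 4, matching the example key on the slide.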

Hashing Collisions

    const int M = 10;
    MyObject hash_array[M];

– h(key) = key % M;
– M = 10
– key0 = 789222308, h(key0) = 8
– key1 = 900200205, h(key1) = 5
– key2 = 988777335, h(key2) = 5: collision!
• Problem: can't overwrite data of another object/key

  hash_array: MyObject 0, MyObject 1, ..., MyObject 9

Solution 1: Hashing with Chaining

• Also known as "open hashing"
• Don't store object at array entry i
• Instead store pointer to linked list of objects
  – Every object with a key such that h(key) == i

    ListofObjects hash_array[M];

– key0 = 789222304, h(key0) = 4
– key1 = 900200201, h(key1) = 1
– key2 = 988777331, h(key2) = 1

  hash_array:
    ...
    List 5 = NULL
    List 4 -> [key = 789222304, next = NULL]
    List 3 = NULL
    List 2 = NULL
    List 1 -> [key = 900200201] -> [key = 988777331, next = NULL]
    List 0 = NULL

Open Hashing

• If hash function is "good"
  – Spreads out keys across array evenly
• And if hash table (array) is big enough
  – More array locations than keys we want to insert
• Then average length of list is approx. 1
  – Achieves O(1) access

Open Hashing & Hash Quality

• Bad hash function example:
  – Instead of h(key) = key % M, could use h'(key) = key / 100000000
• E.g.
  – key = 988777334
  – h'(key) = 9
  – If keys are student numbers, and many start with the same digit (entered university at same time), h' will not map evenly across the array
  – Lists will be longer, and access slower

Solution 2: Closed Hashing

• If h(key) == i, and location i occupied
  – check location (i+1) % M in hash table
  – store key/object there if empty
  – else check location (i+2) % M
  – …

    const int M = 10;
    MyObject hash_array[M];

– key0 = 789222308, h(key0) = 8
– key1 = 900200204, h(key1) = 4
– key2 = 988777334, h(key2) = 4

  hash_array:
    index 8: key=789222308
    index 5: key=988777334
    index 4: key=900200204
    all other indices: empty

Closed Hashing

• Consider:
  – h(key) == i
  – i occupied
• Linear probing:
  – Check (i+1) % M, (i+2) % M, …
  – Until we find an empty slot
  – Problem: used entries tend to cluster, even with a good hash function
    • More probes per insert or find
    • Slows down average insert and average find
• Quadratic probing:
  – Check (i+1) % M, (i+4) % M, (i+9) % M, …
  – Less clustering (find an empty slot faster)

Closed Hashing

• Double hashing (or re-hashing)
  – Defines two hash functions: h(key) and h2(key)
  – Example:
    • h(key) = key % M
    • h2(key) = (key * 7) % M
  – First map to h(key)
  – If collision, try [h(key) + h2(key)] % M
  – Then [h(key) + 2 * h2(key)] % M
  – Then [h(key) + 3 * h2(key)] % M
  – …
  – Note: h2(key) must be nonzero, otherwise every probe lands on h(key) again
• +ve: reduces clustering
• -ve: more time to compute hash functions

Closed Hashing: Class

    // Storing Element type objects, using an integer key.
    // Using closed hashing with linear probing in this class.
    class HashTable {
    private:
       Element *htable;     // [0..m_table_size-1]
       int m_table_size;
       int n_items;         // Num items stored in table.
       int hash (int key);              // Helper function: the hash
       bool is_empty (int index);       // Helper functions
       void set_to_empty (int index);
    public:
       HashTable(int table_size);
       ~HashTable();
       bool insert (const Element &item);
       bool find (int key, Element &item);
       // NB: 'delete' is a reserved word in C++; compilable code
       // needs another name for this member, e.g. remove.
       bool delete (const Element &item);
    };

Closed Hashing: Constructor

    HashTable::HashTable (int table_size) {
       n_items = 0;
       m_table_size = table_size;
       htable = new Element[m_table_size];
       // Need to store a special value for empty items,
       // so we can tell where we have spots to insert things
       // in the table.
       for (int i = 0; i < m_table_size; i++)
          set_to_empty (i);
    }

Closed Hashing: Insert

    bool HashTable::insert (const Element &item) {
       if (n_items == m_table_size - 1)
          return (false);   // Table full! Need to resize.
       int index = hash (item.key);   // "Home" spot for item.
       while (!is_empty(index)) {
          if (item.key == htable[index].key)
             return (false);   // Already in table.
          // Look for free entry, using linear probing.
          index = (index + 1) % m_table_size;
       }
       // Found free entry. Store item.
       htable[index] = item;
       n_items++;
       return (true);
    }

Closed Hashing: Find

    bool HashTable::find (int key, Element &item) {
       int index = hash (key);   // "Home" spot for item.
       while (!is_empty(index)) {
          if (key == htable[index].key) {
             // Found item!
             item = htable[index];
             return (true);
          }
          // Keep looking, using linear probing.
          index = (index + 1) % m_table_size;
       }
       // Hit an empty spot, without finding item with key.
       // Will have at least one empty spot, since insert
       // only allowed m_table_size - 1 items.
       return (false);
    }

Closed Hashing: Delete

    HashTable my_hash(10);
    item0.key = 789222308;
    my_hash.insert (item0);
    item1.key = 900200205;
    my_hash.insert (item1);
    item2.key = 988777335;
    my_hash.insert (item2);

  htable after the three inserts:
    index 8: key=789222308
    index 6: key=988777335   (home index 5 was occupied)
    index 5: key=900200205
    all other indices: <empty>

    my_hash.delete (900200205);              // index 5 becomes <empty>
    my_hash.find (988777335, found_item);

• Find will fail (return false)!
• Problem: find stops at the now-empty index 5, so it doesn't reproduce insert's probe sequence
• How can we fix?

Closed Hashing: Delete

• Solution: can't mark deleted items as empty. Mark as "deleted" (another special value)
  – OK for insert to re-use.
  – But find must keep looking past them, as it probes.
  – Ensures original (insert) probe sequence is reproduced by find

  htable:
    index 8: key=789222308
    index 6: key=988777335
    index 5: <deleted>   (was key=900200205)
    all other indices: <empty>

    my_hash.delete (900200205);
    my_hash.find (988777335, found_item);

• Now finds item in index 6
• What do we need to change in insert & find code?

Closed Hashing: Fixed Insert

    bool HashTable::insert (const Element &item) {
       int index;
       if (n_items == m_table_size - 1)
          return (false);   // Table full! Need to resize.
       index = hash (item.key);   // "Home" spot for item.
       // Note: stopping at the first deleted slot means a duplicate key
       // further along the probe sequence is not detected.
       while (!is_empty(index) && !is_deleted(index)) {
          if (item.key == htable[index].key)
             return (false);   // Already in table.
          // Look for free entry, using linear probing.
          index = (index + 1) % m_table_size;
       }
       // Found free entry. Store item.
       htable[index] = item;
       n_items++;
       return (true);
    }

Closed Hashing: Fixed Find

    bool HashTable::find (int key, Element &item) {
       int index = hash (key);   // "Home" spot for item.
       while (!is_empty(index)) {
          if (!is_deleted(index) && key == htable[index].key) {
             // Found item!
             item = htable[index];
             return (true);
          }
          // Keep looking, using linear probing.
          index = (index + 1) % m_table_size;
       }
       // Hit an empty spot, without finding item with key.
       return (false);   // Can't find key in table.
    }

String Hash Functions

• Can hash more than integers
• Pair of ints, strings, …

    string s = "Hash tables enable fast find operations!";
    int hash_index = hash_func (s);

    int hash_func (string s) {
       return (s[0] % m);   // m is the hash table size
    }

Good? No!
1. Indices all from 0 to 255. If m = 5000, most indices never used.
2. Some letters (e.g. 'e') more likely than others
   – Poor spreading over indices even from 0 to 255
3. Empty string has no first character (no s[0]) and may crash

Better String Hash Function

    int hash_func (string s) {
       // unsigned so that overflow wraps around instead of going negative
       unsigned int hash = 0;
       int length = s.length();
       for (int i = 0; i < length; i++) {
          // Multiply by a non-power of 2 for "bit scrambling"
          hash = hash * 127 + s[i];
       }
       hash = hash % m;   // m is the hash table size
       return (hash);
    }

1. Produces larger numbers: uses all indices
   – With strings of length 4, hash values up to 500 million possible (before the % m)
2. Each bit of hash is a scramble of the string characters
   – Spreads various strings better across indices

Complexity Analysis – Open Hashing

• Let α = n / m
  – n = items in table
  – m = table size (array size)
  – α is the "load factor"
• Worst case?
  – All items could hash to one location
  – Linked list with n items to search
  – O(n) find and delete
  – O(1) insert (at head)

Complexity Analysis – Open Hashing

• Average case
  – Good hash function: all m locations equally likely
  – Assume O(1) time to compute hash function
  – Expected length of each linked list: n/m = α
  – Find and delete are: O(1 + α)
    • 1 to compute hash and go to "home" index
    • α to look through a linked list of length α
  – If α kept small (constant):
    • O(1) find and delete
    • Already had O(1) insert
  – To keep α small, keep m within a constant factor of n
    • Often want m >= n for highest speed

Complexity Analysis – Closed Hashing

• Worst case
  – Insert, find, delete could all take O(n) time
• Average case
  – Good hash function: all m locations equally likely
  – Insert and unsuccessful find:
    • O(1 / (1 − α))
  – Successful find:
    • O((1 / α) ∗ ln(1 / (1 − α)))
  – O(1) if α kept well below 1
    • α = 0.5: average 2 probes per insert and 1.4 per successful find
  – Degrades rapidly as α approaches 1
  – Table fails (cannot insert) when α == 1
  – Usually slightly faster than open hashing for small α
  – Degrades more rapidly than open hashing for larger α

Average Probes vs. Load Factor

[Figure: probes (comparisons) per find, plotted against the load factor α from 0 to 1, with one curve for open hashing find and one for closed hashing find.]
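
The average-case estimates can be evaluated numerically to see how quickly closed hashing degrades as the load factor grows. The sketch below assumes the usual uniform-hashing estimates: 1 / (1 − α) probes for an insert or unsuccessful find, (1 / α) ∗ ln(1 / (1 − α)) for a successful find, and 1 + α for an open hashing find. The function names are hypothetical and not from the slides.

```cpp
#include <cassert>
#include <cmath>

// Closed hashing: estimated probes for an insert or unsuccessful find.
// alpha is the load factor n / m and must be below 1.
double closed_insert_probes(double alpha) {
    assert(alpha >= 0.0 && alpha < 1.0);
    return 1.0 / (1.0 - alpha);
}

// Closed hashing: estimated probes for a successful find.
// Needs alpha > 0 because the estimate divides by alpha.
double closed_find_probes(double alpha) {
    assert(alpha > 0.0 && alpha < 1.0);
    return std::log(1.0 / (1.0 - alpha)) / alpha;
}

// Open hashing (chaining): cost of a find, 1 for the hash and home
// index plus alpha for walking a list of expected length alpha.
double open_find_cost(double alpha) {
    assert(alpha >= 0.0);
    return 1.0 + alpha;
}
```

At α = 0.5 these give 2 probes per insert and about 1.39 per successful find, in line with the figures quoted for closed hashing. At α = 0.9 the insert estimate is already about 10 probes, while the open hashing cost is only 1.9.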