Data Structures
CSCI 132
Hash Tables

Tables with complicated index functions
•Index functions are not always simple functions that compute an integer value from integer inputs.
•Often, the key used for table lookup is not a number, but rather an object or string.
•Example: Keys that consist of 8-character words.
•Problem: There are 26^8 ≈ 2 × 10^11 possible arrangements of characters. There is not enough memory to contain a table with one position for each possible word. Furthermore, only a few of the table positions would be filled; it would be a sparse table.

Hash Tables
•Hash tables use an index function that maps many possible keys to a single location.
•If the table is sparse, then most of the time only one key will go to each location.
•If two records do get assigned to the same location (a collision), we use a method for reassigning the second record (collision resolution).
[Figure: A hash table]

The Hash Table Algorithm
Insertion:
1) Calculate the hash function of the key of the record to be inserted.
2) If the location is empty, insert the record there.
3) If the location contains the same record, do not insert.
4) If the location contains a different record, find a new location for insertion with the collision resolution method.
Retrieval:
1) Calculate the hash function of the key.
2) If the record is at that location, retrieve it.
3) Otherwise, follow the collision resolution method to find the record.

Creating Hash Functions
Hash functions should:
1) Be easy and quick to compute.
2) Achieve an even distribution of keys across the table.
Methods:
•Truncation
•Folding
•Modular arithmetic

A Hash Function Example
class Key: public String {
public:
   char key_letter(int position) const;
   void make_blank( );
   // Add constructors and other methods.
};

int hash(const Key &target)
{
   int value = 0;
   for (int position = 0; position < 8; position++)
      value = 4 * value + target.key_letter(position);
   return value % hash_size;
}

Collision Resolution
Methods:
•Linear probing
•Quadratic probing
•Key-dependent increments
•Random probing
•Chaining

Chaining
Chaining uses a table of linked lists. Collisions are resolved by inserting the new elements into a list at the shared location.

Advantages and disadvantages of chaining
Advantages:
•Create an array of addresses rather than records. If the records are large, this saves considerable space.
•Collision handling is simple: insert colliding records into a list.
•Allows more records to be stored than the size of the table.
•Deletion of records is easy.
Disadvantages:
•If the table is full (or nearly full) there may be long lists at some key locations. This can slow down retrieval because you have to search the list for your record.
•Pointers take up memory space. This may be wasteful if the records are small.

The C++ Hash Table Specification
const int hash_size = 997;   // a prime number of appropriate size

class Hash_table {
public:
   Hash_table( );
   void clear( );
   Error_code insert(const Record &new_entry);
   Error_code retrieve(const Key &target, Record &found) const;
private:
   Record table[hash_size];
};

Implementation of insert( )
Error_code Hash_table :: insert(const Record &new_entry)
{
   Error_code result = success;
   int probe_count,   // Counter to be sure that table is not full.
       increment,     // Increment used for quadratic probing.
       probe;         // Position currently probed in the hash table.
   Key null;          // Null key for comparison purposes.
   null.make_blank( );
   probe = hash(new_entry);   // Find location to insert new_entry.
   probe_count = 0;
   increment = 1;

insert( ) continued
   // We will complete this in class.
}

Likelihood of collisions
•How many people have to be in a room before the probability that two of them have the same birthday reaches 50%? P=?
•The calculation for the probability of a collision in a table is similar.
•The table does not have to be very full for the probability of a collision to reach at least 50%.
•Therefore: Collisions happen! We must handle them efficiently.

Counting Probes
•We can analyze the running time of hash tables by counting comparisons.
•Comparisons take place when "probing" an entry: looking at an entry and comparing its key to a target.
•The number of probes done depends on how full the table is.
n = number of entries in the table
t = total number of positions in the table (= hash_size)
λ = n/t = load factor
λ = 0 means no entries in the table
λ = 0.5 means the table is 1/2 full
λ ≤ 1 for a contiguous table without chaining (open addressing)
λ can be greater than 1 if using chaining

Number of comparisons for chaining
Unsuccessful searches:
•If entries are distributed evenly over the table, then the expected number of entries in each chain is: n/t = λ.
•For an unsuccessful search, we must do one probe for each entry in the list, so the average number of probes (or comparisons) is λ.
Successful searches:
•The average number of comparisons for a sequential search of a list with k items is: (k + 1)/2
•The node we are looking for is in our list; the other n − 1 nodes are distributed evenly over the table, so the average number of nodes in our list will be: k = (n − 1)/t + 1 ≈ n/t + 1 = λ + 1.
•The average number of comparisons will be (λ + 1 + 1)/2 = λ/2 + 1

Open addressing (without chaining)
Evenly distributed entries. Number of comparisons (approx.):
Random probing:
   Successful case: (1/λ) ln(1/(1 − λ))
   Unsuccessful case: 1/(1 − λ)
Linear probing:
   Successful case: 0.5(1 + 1/(1 − λ))
   Unsuccessful case: 0.5(1 + 1/(1 − λ)^2)

Theoretical and empirical results

Hash Tables vs. Other Methods
•Speed of retrieval from a hash table does not depend on the total number of entries, but on the ratio of entries/table-size (the load factor λ).
•A table of size 40 with 20 entries has the same performance as a table of size 4000 with 2000 entries.
Sequential search: Θ(n)
Binary search: Θ(lg n)
Hash table retrieval: O(1) for a small load factor λ.
•Read section 9.8 on choosing a method for storage and retrieval of data.

Radix sort
Radix sort creates a table of queues. Each queue corresponds to a letter of the alphabet. Sort from the least significant letter to the most significant letter.

Implementation of Radix Sort
const int key_size = 5;
const int max_chars = 28;

template <class Record>
void Sortable_list<Record> :: radix_sort( )
{
   Record data;
   Queue queues[max_chars];
   for (int position = key_size - 1; position >= 0; position--) {
      // Loop from the least to the most significant position.
      while (remove(0, data) == success) {
         int queue_number = alphabetic_order(data.key_letter(position));
         queues[queue_number].append(data);   // Queue operation.
      }
      rethread(queues);   // Reassemble the list.
   }
}
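
The radix_sort( ) member above depends on the course's Sortable_list, Queue, and rethread( ), which are not shown on the slides. As a rough, self-contained sketch of the same idea, the version below uses std::vector and std::queue instead. It assumes every key is exactly key_size lowercase letters, so it uses 26 queues rather than the 28 on the slide, and the alphabetic_order( ) helper here is a hypothetical stand-in, not the course's function.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Assumption: keys are fixed-length strings of lowercase letters 'a'..'z'.
const std::size_t key_size = 5;
const int max_chars = 26;

// Hypothetical stand-in for the slide's alphabetic_order():
// maps a lowercase letter to its queue index.
int alphabetic_order(char c)
{
    assert(c >= 'a' && c <= 'z');
    return c - 'a';
}

void radix_sort(std::vector<std::string> &list)
{
    std::vector<std::queue<std::string>> queues(max_chars);
    // Loop from the least to the most significant position.
    for (std::size_t position = key_size; position-- > 0; ) {
        // Distribute every key into the queue for its letter at this position.
        for (const std::string &key : list) {
            assert(key.size() == key_size);
            queues[alphabetic_order(key[position])].push(key);
        }
        // Reassemble the list by emptying the queues in alphabetical order.
        list.clear();
        for (std::queue<std::string> &q : queues) {
            while (!q.empty()) {
                list.push_back(q.front());
                q.pop();
            }
        }
    }
}
```

Because each queue preserves the order in which keys arrive, every pass is stable, and that is what lets the later (more significant) passes keep the ordering established by the earlier ones.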
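
Returning to the hash table slides: the body of insert( ) is left to be completed in class. One possible way to finish it with quadratic probing is sketched below. This is only an illustration under several assumptions, not the course's solution: records are reduced to std::string keys, the empty string plays the role of the null key, hash_key( ) is a stand-in for hash( ) that pads short keys with blanks, and the Error_code values duplicate_error, overflow, and not_present are assumed names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const int hash_size = 997;   // a prime, as on the specification slide

// Assumed error codes; only `success` appears on the slides.
enum Error_code { success, overflow, duplicate_error, not_present };

// Stand-in for the slide's hash(): combine 8 characters, then reduce.
// Positions past the end of the key are treated as blanks (an assumption).
int hash_key(const std::string &key)
{
    int value = 0;
    for (std::size_t position = 0; position < 8; position++) {
        char c = position < key.size() ? key[position] : ' ';
        value = 4 * value + static_cast<unsigned char>(c);
    }
    return value % hash_size;
}

class Hash_table {
public:
    Hash_table() : table(hash_size) {}   // every slot starts empty
    Error_code insert(const std::string &new_entry);
    Error_code retrieve(const std::string &target, std::string &found) const;
private:
    std::vector<std::string> table;      // empty string marks an unused slot
};

Error_code Hash_table::insert(const std::string &new_entry)
{
    assert(!new_entry.empty());          // empty string is reserved as the null key
    int probe = hash_key(new_entry);     // position currently probed
    int probe_count = 0;                 // guards against a table that is too full
    int increment = 1;                   // increment used for quadratic probing
    while (!table[probe].empty()                    // slot is occupied
           && table[probe] != new_entry             // by a different key
           && probe_count < (hash_size + 1) / 2) {  // distinct slots remain
        probe_count++;
        probe = (probe + increment) % hash_size;
        increment += 2;                  // offsets 1, 4, 9, ... from the home slot
    }
    if (table[probe].empty()) {
        table[probe] = new_entry;
        return success;
    }
    if (table[probe] == new_entry) return duplicate_error;
    return overflow;
}

Error_code Hash_table::retrieve(const std::string &target, std::string &found) const
{
    if (target.empty()) return not_present;   // the null key is never stored
    int probe = hash_key(target);
    int probe_count = 0;
    int increment = 1;
    // Follow the same probe sequence that insert() would have used.
    while (!table[probe].empty() && table[probe] != target
           && probe_count < (hash_size + 1) / 2) {
        probe_count++;
        probe = (probe + increment) % hash_size;
        increment += 2;
    }
    if (table[probe] == target) {
        found = table[probe];
        return success;
    }
    return not_present;
}
```

The probe limit of (hash_size + 1)/2 reflects a property of quadratic probing with a prime table size: only about half of the positions are reachable from a given home slot, so probing further would revisit slots already checked. Retrieval must follow the same probe sequence as insertion, which is why the two loops are identical.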