DSLec (Hashing) - CSE246 Data Structures
Data Structures and Algorithms Lecture (Searching)
Instructor: Quratulain
Date: 4 and 8 December, 2009
Faculty of Computer Science, IBA

Basic search terminologies
A table or file is a group of elements, each of which is called a record. Each record is associated with a key that differentiates it from the other records. A key may be internal (embedded within the record) or external (kept outside the table). A key that is unique for every record is called a primary key. A table of records in which a key is used for retrieval is often called a search table or dictionary.

Basic search terminologies
A table is organized around a specific search technique. The table may be contained completely in memory, completely in auxiliary storage, or divided between the two. In this course we concentrate on internal searching (tables held completely in memory).

Dictionary as an Abstract Data Type
A dictionary is an ordered or unordered list of key-element pairs, where keys are used to locate elements in the list.
Example:
◦ Consider a data structure that stores student information; it can be viewed as a dictionary, where the Student ID serves as the key identifying each Student object.
Operations on a dictionary:
◦ size(), empty(), findItem(key), removeItem(), insertItem(key, element)
Note: Java has a built-in abstract class java.util.Dictionary.

Search Techniques for the Dictionary ADT
◦ Sequential search
◦ Indexed sequential search (book handout)
◦ Binary search
◦ Hashing
◦ Tree searching: binary search trees, multiway search trees, B-trees, AVL trees, red-black trees

Sequential Search
The simplest search algorithm for a dictionary, implemented over either an array or a linked list. The algorithm examines each key in turn; upon finding one that matches the search argument, its index or value is returned. In the worst case the number of comparisons is O(n).

Indexed Sequential Search
This technique increases search efficiency for a sorted file but needs extra space to store the indexes. The indexes can be applied to both arrays and linked lists.

Binary Search
The most efficient comparison-based search technique. It requires a sorted array and cannot be implemented on a linked list. The worst-case complexity is O(log n).

Hashing
Hashing is a method for directly referencing an element in a table by performing arithmetic transformations that map keys to table addresses. This is carried out in two steps:
Step 1: Compute the hash function H: K -> A.
Step 2: Collision resolution, which handles cases where two or more different keys hash to the same table address.

Hashing
A hash function maps each key to a location in memory and minimizes the number of comparisons needed to insert or find a record. A key's location does not depend on other elements and does not change after insertion, unlike in a sorted list. A good hash function should be easy to compute, minimize hash collisions, and distribute records uniformly over the table. With such a hash function, the dictionary operations can be implemented in O(1) time.

Hashing
Suppose a company has 100 inventory records; then two-digit array indices would be sufficient. But if those 100 records carry 5-digit codes and we use the codes directly as 5-digit array indices, we waste memory by reserving an array of 100,000 slots to store only 100 records.
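Before moving on to hash functions, the following is a minimal Java sketch (not part of the original lecture; class and method names are illustrative) of the two comparison-based searches described above: sequential search and binary search over an int array.

    // Illustrative sketch: sequential vs. binary search over an int array.
    public class SearchDemo {

        // Sequential search: examine each key in turn; O(n) comparisons in the worst case.
        static int sequentialSearch(int[] keys, int target) {
            for (int i = 0; i < keys.length; i++) {
                if (keys[i] == target) return i;   // return the index of the matching key
            }
            return -1;                             // not found
        }

        // Binary search: requires a sorted array; O(log n) comparisons in the worst case.
        static int binarySearch(int[] sortedKeys, int target) {
            int lo = 0, hi = sortedKeys.length - 1;
            while (lo <= hi) {
                int mid = lo + (hi - lo) / 2;      // middle index, written to avoid overflow
                if (sortedKeys[mid] == target) return mid;
                if (sortedKeys[mid] < target) lo = mid + 1;
                else hi = mid - 1;
            }
            return -1;
        }

        public static void main(String[] args) {
            int[] keys = {6, 13, 36, 40, 41, 43, 53, 67, 75, 88, 91};  // already sorted
            System.out.println(sequentialSearch(keys, 43)); // 5
            System.out.println(binarySearch(keys, 43));     // 5
        }
    }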
Hash functions
Map key values to hash table addresses: keys -> hash table addresses. This applies to find, insert, and remove. Usually integers are mapped to {0, 1, 2, ..., Hsize-1}; a typical example is f(n) = n mod Hsize. Non-numeric keys are converted to numbers first. For example, strings can be converted to numbers using:
◦ the sum of their ASCII values
◦ the first three characters

Hash collision
When h(k1) = h(k2), the two values cannot be placed in one location; this is called a hash collision or hash clash. There are two methods to deal with clashes:
◦ Rehashing (apply a secondary hash function repeatedly until a new, empty location is found)
◦ Chaining (build a linked list of all items with the same hash key)

Resolve hash clashes using open addressing (rehashing)
The simplest method of resolving clashes is to place the record in the next position in the array that is still open. This technique of rehashing is called linear probing.
Rehash function:
◦ rh = (h(k) + 1) % m [covers all indices]
◦ rh = (h(k) + 2) % m [covers only even or only odd indices, depending on the starting index]
The probing loop would execute forever looking for a location if the hash table is full; therefore maintain a count of occupied slots.

Resolve hash clashes using open addressing: example
Keys are hashed with H(k) = k mod 17 into a table with indices between 0 and 16:
A = 53 (H = 2), B = 41 (H = 7), C = 91 (H = 6), D = 75 (H = 7), E = 13 (H = 13), F = 6 (H = 6), G = 43 (H = 9), H = 67 (H = 16), I = 88 (H = 3), J = 36 (H = 2), K = 40 (H = 6).
Colliding keys (for example D with B at index 7, and F and K with C at index 6) are placed by probing forward to the next open slot. [The slide shows the resulting 17-slot table after all insertions.]

Hashing
A good hash function is one that minimizes collisions. Hashing allows direct access and is therefore preferable to the other search techniques. It is more efficient to initialize the hash table with extra space, to reduce how often rehashing is needed.

Resolve hash clashes by open addressing
Store all elements in the table itself. If a cell is occupied, try another cell. Linear probing tries the cells
◦ H(k), (H(k) + 1) mod m, (H(k) + 2) mod m, ...
If rh(i) = (i + 2) % 17 is used as the rehash function, then any key that hashes to an even integer rehashes only to successive even integers, even when the odd locations are empty, and likewise for odd integers.

Primary Clustering
The phenomenon where keys that hash to different values compete with each other in successive rehashes is called primary clustering. Example: in the table below, slots 4, 5 and 6 are occupied and slot 7 is free, so any newly arriving record whose key hashes to 4, 5, 6 or 7 will be placed in slot 7, whereas ideally only a record whose key hashes to 7 would be placed in that location.
Index: 0 -> 4700, 1 -> empty, 2 -> empty, 3 -> empty, 4 -> 114, 5 -> 665, 6 -> 346, 7 -> empty, 8 -> 508, 9 -> 109

Primary Clustering
One way to eliminate primary clustering is to allow the rehash to depend on the number of times the hash function has been applied: rh(i, j) is the rehash value of i when the key is being rehashed for the j-th time, e.g. rh(i, j) = (i + j) % tablesize. Another approach is to use a random offset between 1 and the table size so that no two rehashes of the same key conflict. These approaches eliminate primary clustering but do not eliminate secondary clustering.

Secondary clustering
Secondary clustering is the phenomenon in which different keys that hash to the same value follow the same rehash path. To eliminate all clustering, use a technique called double hashing.

Double hashing
Double hashing involves the use of two hash functions, h1(key) and h2(key). h1 is the primary hash function; if the position is occupied, rehash with rh(i, key) = (i + h2(key)) % tablesize until an empty location is found. The rehash function thus uses the original key together with the current hash value. As long as h2(key1) does not equal h2(key2), the two records do not compete for the same set of locations.

Analysis
The efficiency of a hashing method is measured by the average number of table positions that must be examined when searching for a particular item.
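The following is a minimal Java sketch of open addressing with linear probing as described above, assuming integer keys, H(k) = k mod m, and the rehash step rh(i) = (i + 1) mod m. The class and method names are illustrative and not from the lecture; a double-hashing variant would only change the probe step from +1 to h2(key).

    // Minimal open-addressing table: integer keys, H(k) = k mod m, linear probing.
    public class LinearProbingTable {
        private final int[] slots;
        private final boolean[] used;
        private int count;                              // occupied slots, to avoid an endless probe loop

        public LinearProbingTable(int size) {
            slots = new int[size];
            used = new boolean[size];
        }

        private int hash(int key) {
            return Math.floorMod(key, slots.length);    // H(k) = k mod m
        }

        public boolean insert(int key) {
            if (count == slots.length) return false;    // table full: probing would loop forever
            int i = hash(key);
            while (used[i]) {
                if (slots[i] == key) return true;       // key already present
                i = (i + 1) % slots.length;             // linear probing: rh(i) = (i + 1) mod m
            }
            slots[i] = key;
            used[i] = true;
            count++;
            return true;
        }

        public boolean find(int key) {
            int i = hash(key);
            for (int probes = 0; probes < slots.length && used[i]; probes++) {
                if (slots[i] == key) return true;
                i = (i + 1) % slots.length;
            }
            return false;                               // hit an empty slot or scanned the whole table
        }

        public static void main(String[] args) {
            LinearProbingTable t = new LinearProbingTable(17);
            int[] keys = {53, 41, 91, 75, 13, 6, 43, 67, 88, 36, 40};  // keys from the slide example
            for (int k : keys) t.insert(k);
            System.out.println(t.find(36));   // true
            System.out.println(t.find(99));   // false
        }
    }

Keeping a count of occupied slots lets insert report failure instead of probing forever when the table is full, as noted above.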
Let n be the number of items currently in the hash table and tablesize be the number of positions in the table. For a large table it has been proved that the average number of probes required for a successful search depends only on how full the table is.

Analysis
The load factor x is defined as the ratio between occupied slots and the table size: x = n/tablesize. If the fraction of locations that are occupied equals the load factor x, we can approximate the number of probes (checks) needed for a successful search under linear rehashing. When the load factor reaches a chosen threshold (for example, when the table is 80% full), create a hash table of double the size and rehash into it, to minimize further rehashing.

Analysis
Worst case: all keys hash to the same bucket. Insert takes O(1), but delete and find take O(N). In linear probing the maximum load factor is <= 1. The performance of hash tables based on an open addressing scheme is very sensitive to the table's load factor. If the load factor exceeds a threshold of about 0.7, the table's speed degrades drastically. Indeed, the length of the probe sequence is proportional to loadFactor / (1 - loadFactor). In the extreme case, when the load factor approaches 1, the length of the sequence approaches infinity. In practice this means there are no more free slots in the table and the algorithm will never find a place to insert a new element. Hence, this kind of hash table should support dynamic resizing in order to be efficient.

Chaining
One simple scheme is to chain all collisions in lists attached to the appropriate slot. This allows an unlimited number of collisions to be handled and doesn't require a priori knowledge of how many elements are contained in the collection. The trade-off is the same as with linked-list versus array implementations of collections.

Chaining
In the worst case, all keys map to the same index. A good hash function should distribute keys uniformly over all indices. The load factor is x = n/tablesize, where n is the number of elements stored. With load factor x, a search costs O(1 + x): one hash computation plus, on average, x comparisons in the linked list (or array list) referred to by the index.

Analysis
In chaining the load factor is n/m. Worst case: all keys hash to the same bucket. Insert takes O(1), but delete and find take O(N).

Chaining
Average case:
◦ If keys are uniformly distributed, then all buckets are about the same size.
◦ Each linked list has expected size O(N/M). Thus, insert is still O(1), but delete and find become O(N/M). N/M is called the load factor.

Chaining vs Open addressing
◦ Collision resolution: chaining uses an external data structure; open addressing uses the hash table itself.
◦ Memory waste: chaining has a pointer-sized overhead per entry (plus the list heads stored in the table); open addressing has no overhead.
◦ Performance dependence on the table's load factor: for chaining it is directly proportional to the load factor; for open addressing it is proportional to loadFactor / (1 - loadFactor).
◦ Storing more items than the hash table size: chaining allows this; open addressing does not, and it is recommended to keep the table's load factor below 0.7.
◦ Hash function requirements: chaining needs a uniform distribution; open addressing needs a uniform distribution and should also avoid clustering.
◦ Handling removals: removals are fine with chaining; with open addressing removals clog the hash table with "DELETED" entries.
◦ Implementation: chaining is simple; a correct implementation of an open-addressing hash table is quite tricky.

Hash Maps
A map data structure stores (key, value) pairs. Values are retrieved by searching for the appropriate key. Maps are also sometimes called dictionaries or associative arrays. Maps can be implemented in a number of ways; one of the most common is to store the (key, value) pairs in a hash table. This is called a hash map.
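As a complement to the chaining discussion above, here is a minimal Java sketch of a chaining-based hash map with load-factor-driven resizing. The class name ChainedHashMap, the 0.75 resize threshold, and the doubling strategy are illustrative assumptions, not values prescribed by the lecture.

    import java.util.LinkedList;

    // Minimal separate-chaining hash map: each bucket holds a linked list of entries,
    // and the table doubles in size when the load factor passes a threshold.
    public class ChainedHashMap<K, V> {
        private static class Entry<K, V> {
            final K key; V value;
            Entry(K key, V value) { this.key = key; this.value = value; }
        }

        private LinkedList<Entry<K, V>>[] buckets;
        private int size;                                   // number of (key, value) pairs stored

        @SuppressWarnings("unchecked")
        public ChainedHashMap(int capacity) {
            buckets = new LinkedList[capacity];
        }

        private int index(K key, int capacity) {
            return Math.floorMod(key.hashCode(), capacity); // hash the key into {0, ..., capacity-1}
        }

        public void put(K key, V value) {
            if ((double) (size + 1) / buckets.length > 0.75) resize();  // keep load factor bounded
            int i = index(key, buckets.length);
            if (buckets[i] == null) buckets[i] = new LinkedList<>();
            for (Entry<K, V> e : buckets[i]) {
                if (e.key.equals(key)) { e.value = value; return; }     // overwrite existing key
            }
            buckets[i].add(new Entry<>(key, value));
            size++;
        }

        public V get(K key) {
            int i = index(key, buckets.length);
            if (buckets[i] != null) {
                for (Entry<K, V> e : buckets[i]) {
                    if (e.key.equals(key)) return e.value;
                }
            }
            return null;                                    // key not present
        }

        @SuppressWarnings("unchecked")
        private void resize() {
            LinkedList<Entry<K, V>>[] old = buckets;
            buckets = new LinkedList[old.length * 2];       // double the table size and rehash
            size = 0;
            for (LinkedList<Entry<K, V>> chain : old) {
                if (chain != null) for (Entry<K, V> e : chain) put(e.key, e.value);
            }
        }

        public static void main(String[] args) {
            ChainedHashMap<String, Integer> map = new ChainedHashMap<>(4);
            map.put("A", 53); map.put("B", 41); map.put("C", 91); map.put("D", 75);
            System.out.println(map.get("C"));   // 91
            System.out.println(map.get("Z"));   // null
        }
    }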