Survey

# Download Hashing

Document related concepts

Transcript

Hashing Chapter 20 Hash Table • A hash table is a data structure that allows fast find, insert, and delete operations (most of the time). • The simplest way of implementation a hash table is an array. Example: Suppose we want to store integers in the range of 0- 65535. Create any array “a” of size 65536 with indices in the range of 0-65535 and initialize the array with all zeros. insert (i) -a[i]++ find (i) -Is a[i] > 0 ? remove (i) -if ( a[i] > 0) a[i] -- • If our keys are 8-letter alphabetic words, there are 268 or about 200 billion possible keys [about 200 ‘gig’ of keys]. • Only a small fraction of these keys will actually occur • Conceptually, a very large array, with very few cells occupied. • We need a better way • Allow many of the different possible keys which can occur to be mapped to the same location under the action of the index function. • A hash function takes a key and maps it to some index (possibly a smaller index) in the array. • A collision occurs when the hash function maps two actual keys to the same index. Hash table operations • Given a hash function hash(key) which returns an integer: The simplistic approach is as follows – insert(key): A[hash(key)] = object to insert – find(key): is the object at A[hash(key)] ? – remove(key): remove object at A[hash(key)] • But what happens on collisions? Choosing a hash function • Desirable properties of a good hash function – quick and easy to compute – uniformly distributes the keys over the range of indices – minimizes collisions Methods of building a hash function • Truncation – ignore part of the key and use the remaining part as the index • Folding – Partition the key into several parts and combine these parts to obtain the index • Modular arithmetic – Convert the key to an integer and mod by the table size Example • Given 8-digit integers and a table of size 1000 – Truncation • e.g. -- use the 4th, 7th and 8th digits to form the index: hash(62538194) = 394 – Folding • e.g. -- break into groups of 3, 3, and 2 digits, add the parts and truncate if necessary: hash(62538194) = (625+381+94) mod 1000 = 1100 mod 1000 = 100 Example cont’d – Modular arithmetic • e.g. Simply mod by the table size: hash(62538194) = 62538194 mod 1000 = 194 • It seems best to have a table size which is a prime number for modular arithmetic, so a table size of 997 or 1009 would perform a little better. • A combination of these techniques may be even better Collision Resolution • Open addressing – The table is an array which holds at most one object per index -- contiguous storage • Chaining – The table is an array of chains, all elements on a chain have the same index [these chains are sometimes called “buckets”] -- dynamic storage Open Addressing • Linear probing – This is the simplest method of collision resolution – Start with the hash index and perform a linear search for the desired key or an empty location – The table is considered circular, the search wraps around from the last index to the first Open Addressing • Clustering – The major drawback of linear probing is that when the table becomes about half full, these is a tendency toward clustering – Clustering occurs when records start to appear to as long strings of adjacent positions, which may have several different hash values – Linear searches for empty locations become longer and longer An Example Insert the items 67, 89, 17, 20, 90, 19 into an empty hash table using an array of size 10 and using the following hash function hash (key) = key mod 10. Use linear probing to handle collisions. Open Addressing • Other techniques of collision resolution – Rehashing • use a second hash function to find an alternative position – Quadratic Probing • if hash(key) = h, probe at locations h+1, h+4, h+9, h+16, etc. [i.e., locations h+i2 for i = 1,2,3,4,…] – Random Probing • use a seeded pseudo-random number generator to obtain the increment Open Addressing • Deletions – deletions with open addressing is awkward. (why?) – lazy deletion is the preferred means – that is, making items as deleted rather than physically removing them from the table. Chaining • Advantages to linked storage – with a good hash function, the linked lists will be short – clustering is not a problem -- records with different keys are on different chains – The size of the table is of less concern – Deletions are easy and efficient – [The chains could be binary search trees or other structures] Load Factor • The load factor of a hash table is the ratio of the number of items in the table to size of the hash table – – – – – n - the number of items in the table t - the size of the hash table the load factor = n/t = 0 indicates an empty table = 0.5 indicates a table half full Load Factor • In open addressing, may never exceed 1, and in practice, > 0.5 will begin to cause clustering problems. • In chaining, there is no limit to the size of . Linear Probing Theorem: The average number of cells examined in an insertion using linear probing is [1 + 1/(1 – k)^2]/2 where k is the load factor. Theorem: The average number of cells examined in a successful search is approximately [1 + 1/(1 – k)]/2 where k is the load factor. Quadratic Probing Note that in linear probing, each probe tries a different cell. Does quadratic probing guarantees that, when a cell is tried, we have not already tried it during the course of the current access? Does quadratic probing guarantees that, when we are inserting x and the table is not full, x will be inserted? Quadratic Probing • Theorem: If quadratic probing is used and the table size is prime, then a new element can always be inserted if the table is at least half empty. Furthermore, in the course of the insertion, no cell is probed twice. Hash Table Vs. BST Insert and find operations can be implemented using a BST with average insert/find time of O(logn). However, a BST is generally a more powerful data structure than a hash table as it can easily support routines that require order, for example, finding the smallest/largest element. Hash Table VS. BST If the input is sorted, a BST will perform poorly. Although balanced trees can be used to avoid the O (n) time insert/find, they are quite expensive to implement. Hence, if no ordering information is required and there is any suspicion that the input might be sorted, hashing is the data structure of choice. Applications of Hash Tables Hash tables are used in implementing • Symbol Tables • Game Programs • Spelling Checkers HW: Problems 20.1-20.6 on page 710