Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Hashing A quick lookup strategy Copyright © 2003-2005Curt Hill What is a hash? • Hashing is another name for key transformation • The original key is usually a character string or other sparse key • The result is usually a dense integer key Copyright © 2003-2005Curt Hill Components • Every hash scheme has three pieces – A hash function – A collection of buckets – Collision scheme • The hash function transforms the key into an integer 0 – N • The buckets are numbered 0 – N and hold the resulting data Copyright © 2003-2005Curt Hill Basics • The idea: – Have N containers for the data – Have a function that maps the original key to a number in the range 0 – (N-1) – One access retrieves the data regardless of size – When it works it is the fastest way to lookup a value Copyright © 2003-2005Curt Hill Hash function • Convert a sparse key into a dense key in the form of an integer • The input is the real key which is a sparse key – Sparse keys use very few of very many possible key values • Typical key is a character string while the result needs an integer Copyright © 2003-2005Curt Hill Typical hash function • Input is a character string • Output is an integer in range 0 - N • Sum the ordinal value of each character of the string • Divide the result by N and take the remainder – This guarantees the right range • Number theory tells us to make N prime Copyright © 2003-2005Curt Hill Example values with N = 256 • • • • • • • • “Abcdef” returns 53 “Hi there” returns 233 “ABCDEF” returns 149 “FEDCBA” returns 149 “A character string” returns 197 “A big character string” returns 23 “Zoology” returns 243 See the pattern yet? Copyright © 2003-2005Curt Hill Hash functions • Computing the function is easy • Two problems – One result produced by two different keys – called a collision – The order of the original key is scrambled in the result • Very good for an equality test and very bad for a range test Copyright © 2003-2005Curt Hill Is it that simple? • Not usually • The previous scheme maps all words with exactly the same characters to the same integer value • Many variation – Do a different computation on even and odd characters so a re-arrangment will produce different values Copyright © 2003-2005Curt Hill Collisions • Two keys giving the same integer – Assume each bucket may only hold one record • Prevent hashing from the optimal search technique it could be • Almost every hashing scheme needs a collision strategy • There have been many variations Copyright © 2003-2005Curt Hill Collision Strategies • Linear probing – Add one to index until right one is found • Quadratic probing – Square the index when the collision occurs • Rehash – Secondary hash function • Chaining – Link duplicates in an overflow area Copyright © 2003-2005Curt Hill Collision Strategies Linear 0 Quadratic Chaining 0 6 y 0 2 e 1 1d 1 1 d 1 2 2 e 2 2 e 2 2 t 3 2 t 3 3 6 y 4 4 2 t 4 5 5 c 5 5 c 5 6 6 m 6 6 m 6 7 6 y 7 7 Copyright © 2003-2005Curt Hill 1 d 5 c 6 m Free Space • As the free space decreases the collisions increase • The best plan is to have about twice as many buckets as will be needed Copyright © 2003-2005Curt Hill Effect of load using linear probing Load factor Number of probes 10% 1.06 25% 1.17 50% 1.5 75% 2.5 90% 5.5 95% 10.5 Copyright © 2003-2005Curt Hill Dynamics • A hash table is much harder to both grow and shrink • Increasing or decreasing the number of buckets requires: – Allocate a new table (larger or smaller) – Rehashing every item in old into the new • Deleting a record (without resizing) is also a problem because of collisions Copyright © 2003-2005Curt Hill Deletion of a key • The problem is collisions • If the key has no collisions just mark the bucket as empty • If it has collisions – All of them need to be rehashed – Finding them depends on the collision strategy Copyright © 2003-2005Curt Hill Two more terms • A perfect hash is one that has no collisions – Only occurs in situations with two conditions: – Keys are fixed and known in advance – The hash function is tailored to these keys • A minimal hash has no empty buckets Copyright © 2003-2005Curt Hill Data Skew • A good hash function spreads the keys uniformly among the integer range • How well does this work when the data is not uniformly distributed? • For example consider – Names - There are many more Smiths than Garnjobsts – Numbers – there are many more 101 courses than most other numbers • Can the hash function still give a good distribution? Copyright © 2003-2005Curt Hill Summary • Hashing works best when it works well – Stable index – Good distribution from the hash function • It is much harder to make dynamic than trees Copyright © 2003-2005Curt Hill