Data Structures
Brett Bernstein
Lecture 9: Hashtables and HashMap

Exercises

1. Explain a small change that could speed up the following code:

    Scanner in = new Scanner(System.in);
    int n = in.nextInt();
    ArrayList<String> al = new ArrayList<>();
    for (int i = 0; i < n; ++i) al.add(in.nextLine());
    Collections.sort(al);
    for (int i = 0; i < n; ++i) {
        process(al.get(i)); //does something
    }

2. What is the runtime for finding a particular value (not key) in a hashtable?

3. You have a hashtable whose keys are Integers, values are Doubles, and that uses chaining for collision handling. Suppose it has 5 buckets and that the hash function takes the absolute value of the integer mod 5. What does the hashtable look like after the following operations, in order:

    put(8,0.0); put(11,1.0); put(21,1.0); put(53,9.2); put(11,-7.9); remove(21);

4. Below is a Java interface for the Set ADT:

    Set.java

    //Stores elements but disregards duplicates (according to .equals).
    public interface Set<T> {
        //Adds t to the set
        void add(T t);
        //Determines if t is in the set
        boolean contains(T t);
        //Removes t from the set if it is in the set.
        //Otherwise does nothing.
        void remove(T t);
        //Returns the number of elements in the set
        int size();
    }

How could this be easily implemented using a Map?

5. Suppose we wanted a special Map that associated multiple values to a single key. How can this be done easily using a standard Map (without modifying its implementation)?

Solutions

1. By passing n to the ArrayList constructor (new ArrayList<>(n)) we can avoid the unnecessary copying and allocations made as we grow the ArrayList.

2. To find a value we will iterate through the bucket table, and for each bucket index, iterate through its chain. The runtime is thus Θ(n + B) where n is the number of keys and B is the number of buckets. Usually n and B will be pretty close. With extra memory we could improve this to Θ(n) by maintaining a separate linked list of all of the entries in the table.
The Java class LinkedHashMap maintains a linked list of all of the entries in order of their insertion into the map.

3. Keys 8 and 53 hash to bucket 3, and keys 11 and 21 hash to bucket 1. The second put with key 11 overwrites the value 1.0, and 21 is removed at the end. (The order of entries within a chain depends on whether new entries are added at the front or the back.)

    Bucket   Chain
    0        (empty)
    1        (11, -7.9)
    2        (empty)
    3        (8, 0.0), (53, 9.2)
    4        (empty)

4. Store the elements of the set as keys of a Map, and just use null for all the values. Since Maps force all keys to be distinct, this gives the desired behavior.

5. Use an ArrayList as the value type. For instance, we might use code like the following:

    HashMap<String,ArrayList<Integer>> map = new HashMap<>();
    //do stuff
    //Next we add 3 to the list of key "A"
    ArrayList<Integer> al = map.get("A");
    if (al == null) {
        al = new ArrayList<>();
        map.put("A",al);
    }
    al.add(3);

Hash Functions

What remains is to find a good hash function h that determines which bucket each key corresponds to. Recall that any hash function must take equal keys to the same hash code. Furthermore, we require it to be deterministic. That is, the hash of a fixed key will never change (e.g., we don't return a random answer each time). Let's see what a good hash function would also have:

1. Fast: A very slow hash function would slow down all hashtable operations.

2. Well distributed relative to keys: If S is your potential set of keys then the hash function should try to uniformly distribute S over the bucket indices. That is, each bucket should have roughly the same number of keys that correspond to it. This can be very dependent on your set S. For example, if your keys are Strings, then potential sets could be all English words, last names, social security numbers, or random strings.

3. Memory Efficient: Some hash functions use tables to help them compute their values. If used, these tables should be relatively small.

If a relatively good hash function is used, then all of the Map operations implemented in a hashtable have Θ(1) expected performance. If chaining is used, the average length of each chain is the load factor. The chain lengths will roughly have a Poisson distribution.

In many languages the computation of the hash function is broken into 2 parts.
In Java every Object has a hashCode method that returns an int. The default implementation is typically based on the address of the object in memory, but this is often overridden. A good hashCode usually has to be designed without knowing anything specific about the potential key set S. Thus the hashCode implementation should have properties that are good in a general sense: it should be efficient, use all possible integer values, and perform well (i.e., distribute the values evenly across the integers) on all commonly used key sets (if these exist). Hashtables will then take these ints and turn them into bucket addresses.

There are several recommended ways to find a bucket address once we have a hashCode. Below we assume k is the hashCode of your key, and N is the number of buckets.

1. Division Method: Force N to be a nice prime and compute h(k) = k mod N. If you are implementing this in Java you can use the code (int)(Integer.toUnsignedLong(k)%N) since simply writing k%N can give you a negative result. You can also use (k&0x7FFFFFFF)%N if you are willing to throw away the sign bit. Bit representation of integers will be discussed next lecture.

2. Fibonacci Hashing (using floating point arithmetic): h(k) = ⌊N(ϕk − ⌊ϕk⌋)⌋ where ϕ = (√5 − 1)/2. Here ⌊x⌋ means round x down to an integer.

As another alternative, Java's HashMap forces N to be a power of two, and tries to shuffle the bits of k a bit before modding by N. This avoids the inefficiency of modding, but can create more collisions. To deal with the increased collisions, HashMap uses a more complex chaining system that we may discuss later in the course.

Object's hashCode

As we saw above, the performance of HashMap is dependent on the quality of the hash function. Thus overriding hashCode is usually a good idea. If you override equals then you should definitely override hashCode since equal objects are expected to have the same hashCode. Moreover, if you override equals then your hashCode method should only use the fields that equals compares.
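To make the equals/hashCode contract above concrete, here is a minimal sketch. The Point class and the PointDemo helper are hypothetical names invented for this illustration (they are not from the lecture), and the multiply-by-31 combination is just one common way to mix two fields.

```java
import java.util.HashMap;

// Hypothetical key class, used only to illustrate the equals/hashCode contract.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    // Uses exactly the fields that equals compares, so equal Points
    // always produce the same hashCode.
    @Override
    public int hashCode() {
        return 31 * Integer.hashCode(x) + Integer.hashCode(y);
    }
}

class PointDemo {
    // Looks up a key that is equal to, but not the same object as, the stored key.
    static String lookup() {
        HashMap<Point, String> map = new HashMap<>();
        map.put(new Point(1, 2), "found");
        return map.get(new Point(1, 2));
    }

    public static void main(String[] args) {
        System.out.println(lookup());
    }
}
```

Because the two Point objects are equal and share a hashCode, the lookup succeeds. If hashCode were left at its default, the second Point would most likely land in a different bucket and the lookup would return null.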
Some good guidelines (recommended by Joshua Bloch in his great but slightly old book Effective Java) are:

1. To compute a hash code for a primitive type, look at the implementations of hashCode in the wrapper classes. For instance, to hash a double you can look at the API for the description of Double.hashCode().

2. To hash an array, use Arrays.hashCode.

3. To hash an Object with several fields a,b,c you can use Objects.hash(a,b,c) but this will be a bit inefficient (it allocates an array each time). To avoid the allocation you can just implement it yourself:

    int hash = a.hashCode();
    hash = 31*hash + b.hashCode();
    hash = 31*hash + c.hashCode();

This is almost correct, but it fails with a NullPointerException if any of the fields are null. We can write a bunch of if-statements, or we can use the helper method Objects.hashCode:

    int hash = Objects.hashCode(a);
    hash = 31*hash + Objects.hashCode(b);
    hash = 31*hash + Objects.hashCode(c);

Objects also has a helper method Objects.equals which compares objects for equality and handles nulls (could make containsValue on Homework 4 cleaner).

4. For collections you should treat each element like a field, and combine them as above. As an example, here is the calculation that is effectively used by Java lists:

    int hash = 0;
    for (E e : this)
        hash = 31 * hash + Objects.hashCode(e);
    return hash;

Eclipse will actually auto-generate equals and hashCode if you right-click on a Java file and select Source.

Varargs

In case you were interested, the Objects.hash method has the following declaration:

    public static int hash(Object... values)

This declaration uses varargs (a variable-length argument list). The function hash will take any number of arguments. The variable values just becomes an Object[] which has the arguments in it. To implement this Java simply turns hash(a,b,c) into hash(new Object[]{a,b,c}) so there is an implicit array allocated each time you call the function.
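The rewriting of a varargs call into an explicit array can be checked with a small sketch. VarargsDemo, count, and sameHash are hypothetical names chosen for this illustration. Only Objects.hash comes from the Java API.

```java
import java.util.Objects;

class VarargsDemo {
    // Hypothetical varargs method: inside the body, values is an ordinary Object[].
    static int count(Object... values) {
        return values.length;
    }

    // Compares a varargs call with the explicit-array call it is turned into.
    static boolean sameHash() {
        int viaVarargs = Objects.hash("a", 2, 3.0);
        int viaArray = Objects.hash(new Object[]{"a", 2, 3.0});
        return viaVarargs == viaArray;
    }

    public static void main(String[] args) {
        System.out.println(count());            // no arguments: an empty array is passed
        System.out.println(count("a", 2, 3.0)); // three arguments: an array of length 3
        System.out.println(sameHash());
    }
}
```

Here count() sees an empty array, count("a", 2, 3.0) sees an array of length 3, and both ways of calling Objects.hash give the same result.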
If you want to use varargs in your programs there are only two things to know (in addition to what I already said): the varargs must be the last parameter, and you can use any type before the ellipsis (including primitive types). For example,

    static void func(int a, double b, long... cs) {
        //Treat cs like a long[] here
    }

Object.hashCode Exercises

1. What optimization can be made when computing the hashCode of an immutable object?

2. Why do we force objects that are equal (with respect to .equals) to have the same hashCode?

3. Several times we have seen System.out.printf used in class. What do you think its arguments are?

4. What is wrong with the following potential hashCode implementations for String?
   (a) Use the first character's value as the hashCode (and 0 if the String is empty).
   (b) Use a uniform randomly generated int value as the hashCode.
   (c) Use the sum of all the characters' values.

Object.hashCode Solutions

1. Once computed you can store the hashCode, since it cannot change. The extra memory required is usually worth it for an object like a String (which does use this optimization) but not for the wrapper classes (which do not).

2. If they had different hashCodes, they could go into different buckets, and thus different chains. Then all of our hashtable operations would treat the keys as distinct when they should be treated as equal.

3. A format String followed by a varargs list of Objects. In java.io.PrintStream it is declared as:

    public PrintStream printf(String format, Object... args) {/*stuff*/}

4. (a) The main issue is that the maximum possible hashCode would be limited by the size of a character. Thus even if we had a large number of buckets only a small portion would ever get used.
   (b) Even though the generated hashCodes would be well distributed amongst the integers, there is no repeatability. That is, every time you rehash the same key you would get a different value.
   (c) Firstly, the hashCodes aren't large enough so we run into the same issues as the first part. Secondly, it ignores the ordering of the letters in the string.
   For example, note that anagrams all collide. In general, if a simple transformation of a key leads to a collision then the hashCode isn't as strong as we would like (as a general purpose hashCode).

Java HashMap

The HashMap class in the Java API supports all of the Map functionality we have discussed using a hashtable and chaining. It also provides several methods of iteration by using the methods keySet, values and entrySet as we show below.

    HashMapIteration.java

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Set;

    public class HashMapIteration {
        public static void main(String[] args) {
            HashMap<String,Integer> wordMap = new HashMap<>();
            wordMap.put("Hello", 3);
            wordMap.put("Great", 9);
            wordMap.put("Great2", 9);
            wordMap.put("Frank", 2);
            wordMap.put("hmm", -1);
            wordMap.put("wow", null);

            //Each of these is a view backed by wordMap, not a copy
            Set<String> keys = wordMap.keySet();
            Collection<Integer> values = wordMap.values();
            Set<HashMap.Entry<String,Integer>> entries = wordMap.entrySet();

            System.out.println("Keys:");
            for (String k : keys) System.out.print(k+" ");
            System.out.println();

            System.out.println(keys.contains("hmm"));
            keys.remove("Frank"); //Removes the entry with key "Frank" from wordMap
            values.remove(9);     //Removes one entry with value 9 (whichever is found first)
            System.out.println(keys.contains("Frank"));
            System.out.println(keys.contains("Great"));

            System.out.println("Keys:");
            for (Iterator<String> it = keys.iterator(); it.hasNext(); ) {
                String k = it.next();
                System.out.print(k+" ");
                if (k.equals("Hello")) it.remove(); //Safe removal during iteration
            }
            System.out.println();

            System.out.println("Values:");
            for (Integer d : values) System.out.print(d+" ");
            System.out.println();

            System.out.println("Entries:");
            for (HashMap.Entry<String,Integer> e : entries) {
                System.out.println(e.getKey()+","+e.getValue());
            }
        }
    }

The methods keySet, values and entrySet are all Θ(1) and use Θ(1) memory since instead of creating a new collection, they simply represent a view into the data in the HashMap. As seen above, removals from these collections cause removals from the HashMap.
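The view behavior also works in the other direction: a change made to the map after a view was obtained is visible through that view. The following is a minimal sketch of both directions. ViewDemo and its method names are hypothetical, chosen only for this illustration.

```java
import java.util.HashMap;
import java.util.Set;

class ViewDemo {
    // Returns true if a key added to the map after keySet() was called
    // is visible through the previously obtained view.
    static boolean viewSeesLaterPut() {
        HashMap<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        Set<String> keys = map.keySet(); // a view, not a copy
        map.put("b", 2);
        return keys.contains("b");
    }

    // Returns the map's size after removing a key through the view.
    static int sizeAfterViewRemoval() {
        HashMap<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.keySet().remove("a"); // also removes the entry from the map
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(viewSeesLaterPut());
        System.out.println(sizeAfterViewRemoval());
    }
}
```

If a snapshot that does not track the map is wanted instead, the view can be copied, for example with new ArrayList<>(map.keySet()), at a cost of Θ(n) time and memory.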
One important difference between the Java HashMap and our Map is that the methods get, remove, and containsKey all take type Object instead of the key type. In addition, containsValue takes type Object instead of the value type. The implementations only use hashCode and equals on the argument, so they don't really care what the type of the argument is. This decision was made to allow for greater flexibility, but gives less protection against unintentional programmer errors:

    HashMap<String,Integer> map = new HashMap<>();
    map.put("12",94);
    System.out.println(map.containsKey(12)); //Prints false: the Integer 12 is not equal to the String "12"