Data Structures Brett Bernstein
Lecture 9: Hashtables and HashMap
Exercises
1. Explain a small change that could speed up the following code:
Scanner in = new Scanner(System.in);
int n = in.nextInt();
ArrayList<String> al = new ArrayList<>();
for (int i = 0; i < n; ++i) al.add(in.nextLine());
Collections.sort(al);
for (int i = 0; i < n; ++i) {
process(al.get(i)); //does something
}
2. What is the runtime for finding a particular value (not key) in a hashtable?
3. You have a hashtable whose keys are Integers, values are Doubles, and that uses chaining for collision handling. Suppose it has 5 buckets and that the hash function takes
the absolute value of the integer mod 5. What does the hashtable look like after the
following operations, in order:
put(8,0.0); put(11,1.0); put(21,1.0); put(53,9.2); put(11,-7.9); remove(21);
4. Below is a Java interface for the Set ADT:
Set.java
//Stores elements but disregards duplicates (according to .equals).
public interface Set<T>
{
//Adds t to the set
void add(T t);
//Determines if t is in the set
boolean contains(T t);
//Removes t from the set if it is in the set.
//Otherwise does nothing.
void remove(T t);
//Returns the number of elements in the set
int size();
}
How could this be easily implemented using a Map?
5. Suppose we wanted a special Map that associated multiple values to a single key. How
can this be done easily using a standard Map (without modifying its implementation)?
Solutions
1. By plugging n into the constructor we can avoid the unnecessary copying and allocations made as we grow the ArrayList.
2. To find a value we will iterate through the bucket table, and for each bucket index,
iterate through each chain. The runtime is thus Θ(n + B) where n is the number of
keys and B is the number of buckets. Usually n and B will be pretty close. With extra
memory we could improve this to Θ(n) by maintaining a separate linked list of all of
the entries in the table. The Java class LinkedHashMap maintains a linked list of all
of the entries in order of their insertion into the map.
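3. After all six operations (the order of entries within a chain depends on the implementation):
Bucket  Chain of (key, value) entries
0       (empty)
1       (11, -7.9)
2       (empty)
3       (8, 0.0), (53, 9.2)
4       (empty)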
4. Store the elements of the set as keys of a Map, and just use null for all the values.
Since Maps force all keys to be distinct, this gives the desired behavior.
5. Use an ArrayList as the value type. For instance, we might use code like the following:
HashMap<String,ArrayList<Integer>> map = new HashMap<>();
//do stuff
//Next we add 3 to the list of key "A"
ArrayList<Integer> al = map.get("A");
if (al == null) {
al = new ArrayList<>();
map.put("A",al);
}
al.add(3);
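Solution 4 can be sketched in a few lines. The class name MapBackedSet is hypothetical, and the sketch leaves off "implements Set<T>" only so that it compiles on its own without the interface file from the exercise. Note that contains uses containsKey rather than get, since get returns null both for a missing key and for a key stored with a null value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class: a sketch of solution 4, not a reference answer.
// In the course setting it would also declare "implements Set<T>".
class MapBackedSet<T> {
    // Elements are stored as keys; every value is null and is never read.
    private final Map<T, Object> map = new HashMap<>();

    // Adds t to the set; adding an element twice just overwrites the key.
    public void add(T t) { map.put(t, null); }

    // Determines if t is in the set.
    public boolean contains(T t) { return map.containsKey(t); }

    // Removes t from the set if it is in the set.
    public void remove(T t) { map.remove(t); }

    // Returns the number of distinct elements.
    public int size() { return map.size(); }

    public static void main(String[] args) {
        MapBackedSet<String> s = new MapBackedSet<>();
        s.add("a");
        s.add("a"); // duplicate, ignored
        s.add("b");
        System.out.println(s.size()); // prints 2
    }
}
```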
Hash Functions
What remains is to find a good hash function h that determines which bucket each key
corresponds to. Recall that any hash function must take equal keys to the same hash code.
Furthermore, we require it to be deterministic. That is, the hash of a fixed key will never
change (e.g., we don't return a random answer each time). Let's see what a good hash
function would also have:
1. Fast: A very slow hash function would slow down all hashtable operations.
2. Well distributed relative to keys: If S is your potential set of keys then the hash
function should try to uniformly distribute S over the bucket indices. That is, each
bucket should have roughly the same number of keys that correspond to it. This can
be very dependent on your set S . For example, if your keys are Strings, then potential
sets could be all English words, last names, social security numbers, or random strings.
3. Memory Efficient: Some hash functions use tables to help them compute their values.
If used, these tables should be relatively small.
If a relatively good hash function is used, then all of the Map operations implemented in a
hashtable have Θ(1) expected performance. If chaining is used, the average length of each
chain is the load factor. The chain lengths will roughly have a Poisson distribution.
In many languages the computation of the hash function is broken into 2 parts. In Java
every Object has a hashCode method that returns an int. The default implementation uses
the address of the object in memory, but this is often overridden. A good hashCode usually
has to be designed without knowing anything specific about the potential key set S. Thus the
hashCode implementation should have properties that are good in a general sense: efficient,
use all possible integer values, perform well (i.e., distributes the values evenly across the
integers) on all commonly used key sets (if these exist).
Hashtables will then take these ints and turn them into bucket addresses. There are
several recommended ways to find a bucket address once we have a hashCode. Below we
assume k is the hashCode of your key, and N is the number of buckets.
1. Division Method: Force N to be a nice prime and compute h(k) = k mod N . If you
are implementing this in Java you can use the code (int)(Integer.toUnsignedLong(k)%N)
since simply writing k%N can give you a negative result. You can also use (k&0x7FFFFFFF)%N
if you are willing to throw away the sign bit. Bit representation of integers will be discussed next lecture.
2. Fibonacci Hashing (using floating point arithmetic):
$$h(k) = \lfloor N(\varphi k - \lfloor \varphi k \rfloor) \rfloor$$
where $\varphi = \frac{\sqrt{5} - 1}{2}$. Here $\lfloor x \rfloor$ means round x down to an integer.
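The two methods above can be sketched in Java as follows. The class and method names are hypothetical, and the Fibonacci version is a direct transcription of the formula using double arithmetic rather than a tuned implementation.

```java
// Hypothetical helper class: turns a hashCode k into a bucket index in [0, n).
class BucketIndex {
    // Division method: treat k as an unsigned 32-bit value, then mod by n.
    static int division(int k, int n) {
        return (int) (Integer.toUnsignedLong(k) % n);
    }

    // Division method that simply throws away the sign bit.
    static int divisionDropSign(int k, int n) {
        return (k & 0x7FFFFFFF) % n;
    }

    // Fibonacci hashing: scale the fractional part of phi*k up to [0, n).
    static int fibonacci(int k, int n) {
        double phi = (Math.sqrt(5) - 1) / 2;
        double x = phi * k;
        double frac = x - Math.floor(x); // in [0, 1), also for negative k
        return (int) Math.floor(n * frac);
    }

    public static void main(String[] args) {
        System.out.println(division(-7, 5));         // prints 4
        System.out.println(divisionDropSign(-7, 5)); // prints 1
        System.out.println(fibonacci(1, 10));        // prints 6
    }
}
```

Note that the two division variants can send the same negative k to different buckets. That is fine as long as a given table always uses the same one.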
As another alternative, Java's HashMap forces N to be a power of two, and tries to shuffle
the bits of k a bit before modding by N . This avoids the inefficiency of modding, but can
create more collisions. To deal with the increased collisions, HashMap uses a more complex
chaining system that we may discuss later in the course.
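To make the power-of-two idea concrete, here is a minimal sketch. The names are hypothetical, and while the XOR-shift mirrors the general shape of what OpenJDK's HashMap does, it should be read as an illustration rather than a copy of the library source. When N is a power of two, k mod N is just the low bits of k, which can be extracted with a mask instead of a division.

```java
// Hypothetical sketch of bucket selection when the bucket count is a power of two.
class PowerOfTwoIndex {
    // Fold the high 16 bits into the low 16 so that they can affect the index.
    static int spread(int k) {
        return k ^ (k >>> 16);
    }

    // Assumes n is a power of two, so (n - 1) is a mask of the low bits.
    static int index(int k, int n) {
        return spread(k) & (n - 1);
    }

    public static void main(String[] args) {
        // 0x10000 and 0x20000 agree in their low 16 bits, so plain masking
        // would put both in bucket 0; after spreading they differ.
        System.out.println(index(0x10000, 16)); // prints 1
        System.out.println(index(0x20000, 16)); // prints 2
    }
}
```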
Object's hashCode
As we saw above, the performance of HashMap is dependent on the quality of the hash
function. Thus overriding hashCode is usually a good idea. If you override equals then
you should definitely override hashCode since equal objects are expected to have the same
hashCode. Moreover, if you override equals then your hashCode method should only use the
fields that equals compares. Some good guidelines (recommended by Joshua Bloch in his
great but slightly old book Effective Java) are:
1. To compute a hash code for a primitive type, look at the implementations of hashCode
in the wrapper classes. For instance, to hash a double you can look at the API for the
description of Double.hashCode().
2. To hash an array, use Arrays.hashCode.
3. To hash an Object with several fields a,b,c you can use Objects.hash(a,b,c) but this
will be a bit inefficient (it allocates an array each time). To avoid the allocation you
can just implement it yourself:
int hash = a.hashCode();
hash = 31*hash + b.hashCode();
hash = 31*hash + c.hashCode();
This is almost correct: it fails with a NullPointerException if any of the fields are null. We can write a bunch
of if-statements, or we can use the helper method Objects.hashCode:
int hash = Objects.hashCode(a);
hash = 31*hash + Objects.hashCode(b);
hash = 31*hash + Objects.hashCode(c);
Objects also has a helper method Objects.equals which compares objects for equality
and handles nulls (could make containsValue on Homework 4 cleaner).
4. For collections you should treat each element like a field, and combine them as above.
As an example, here is the calculation that is effectively used by Java lists:
int hash = 1;
for (E e : this)
hash = 31 * hash + Objects.hashCode(e);
return hash;
Eclipse will actually auto-generate equals and hashCode if you right-click on a Java file and
select Source.
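Putting the guidelines together, here is a minimal sketch of a class that overrides both methods. The class Course and its fields are hypothetical. The point to notice is that title is ignored by equals, so it is also left out of hashCode.

```java
import java.util.HashSet;
import java.util.Objects;

// Hypothetical class illustrating a consistent equals/hashCode pair.
final class Course {
    private final String dept;    // may be null
    private final Integer number; // may be null
    private final String title;   // not compared by equals, so not hashed

    Course(String dept, Integer number, String title) {
        this.dept = dept;
        this.number = number;
        this.title = title;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Course)) return false;
        Course other = (Course) o;
        // Objects.equals handles null fields for us.
        return Objects.equals(dept, other.dept)
            && Objects.equals(number, other.number);
    }

    @Override
    public int hashCode() {
        // Same 31-multiplier pattern as above, restricted to the compared fields.
        int hash = Objects.hashCode(dept);
        hash = 31 * hash + Objects.hashCode(number);
        return hash;
    }

    public static void main(String[] args) {
        HashSet<Course> set = new HashSet<>();
        set.add(new Course("CS", 102, "Data Structures"));
        // Equal according to equals, so it hashes to the same bucket and is found.
        System.out.println(set.contains(new Course("CS", 102, "DS"))); // prints true
    }
}
```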
Varargs
In case you were interested, the Objects.hash method has the following declaration:
public static int hash(Object... values)
This declaration uses varargs (a variable length argument list). The function hash will take any
number of arguments. The variable values just becomes an Object[] which has the arguments
in it. To implement this Java simply turns hash(a,b,c) into hash(new Object[]{a,b,c}) so there
is an implicit array allocated each time you call the function. If you want to use varargs in
your programs there are only two things to know (in addition to what I already said): the
varargs must be the last parameter, and you can use any type before the ellipsis (including
primitive types). For example,
static void func(int a, double b, long... cs) {
//Treat cs like a long[] here
}
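To see the call side as well, here is a small runnable sketch. The class VarargsDemo and the method sum are hypothetical names chosen for illustration.

```java
// Hypothetical demo of declaring and calling a varargs method.
class VarargsDemo {
    // cs behaves exactly like a long[] inside the method.
    static long sum(int start, long... cs) {
        long total = start;
        for (long c : cs) total += c;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(1));                    // cs is an empty array; prints 1
        System.out.println(sum(1, 2L, 3L));            // cs is {2, 3}; prints 6
        System.out.println(sum(1, new long[]{4, 5}));  // an explicit array also works; prints 10
    }
}
```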
Object.hashCode Exercises
1. What optimization can be made when computing the hashCode of an immutable object?
2. Why do we force objects that are equal (with respect to .equals) to have the same
hashCode?
3. Several times we have seen System.out.printf used in class. What do you think its
arguments are?
4. What is wrong with the following potential hashCode implementations for String?
(a) Use the first character's value as the hashCode (and 0 if the String is empty).
(b) Use a uniform randomly generated int value as the hashCode.
(c) Use the sum of all the characters' values.
Object.hashCode Solutions
1. Once computed you can store the hashCode, since it cannot change. The extra memory
required is usually worth it for an object like a String (which does use this optimization)
but not for the wrapper classes (which do not).
2. If they had different hashCodes, they could go into different buckets, and thus different
chains. Then all of our hashtable operations would treat the keys as distinct when they
should be treated as equal.
3.
public PrintStream printf(String format, Object... args) {/*stuff*/}
4. (a) The main issue is that the maximum possible hashCode would be limited by the
size of a character. Thus even if we had a large number of buckets only a small
portion would ever get used.
(b) Even though the generated hashCodes would be well distributed amongst the
integers, there is no repeatability. That is, every time you rehash the same key
you would get a dierent value.
(c) Firstly, the hashCodes aren't large enough so we run into the same issues as the
first part. Secondly, it ignores the ordering of the letters in the string. For example, note that anagrams all collide. In general, if a simple transformation
of a key leads to a collision then the hashCode isn't as strong as we would like
(as a general purpose hashCode).
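The caching optimization from solution 1 can be sketched as follows. The class FullName is hypothetical. It uses 0 to mean "not computed yet", so an object whose true hash happens to be 0 is simply recomputed on each call, which is harmless.

```java
import java.util.Objects;

// Hypothetical immutable class that caches its hashCode after the first call.
final class FullName {
    private final String first;
    private final String last;
    private int hash; // 0 means "not computed yet"

    FullName(String first, String last) {
        this.first = first;
        this.last = last;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FullName)) return false;
        FullName other = (FullName) o;
        return Objects.equals(first, other.first)
            && Objects.equals(last, other.last);
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            // The fields never change, so this value stays valid forever.
            h = 31 * Objects.hashCode(first) + Objects.hashCode(last);
            hash = h;
        }
        return h;
    }

    public static void main(String[] args) {
        FullName n = new FullName("a", "b");
        System.out.println(n.hashCode()); // prints 3105 (31*97 + 98)
        System.out.println(n.hashCode()); // same value, served from the cache
    }
}
```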
Java HashMap
The HashMap class in the Java API supports all of the Map functionality we have discussed
using a hashtable and chaining. It also provides several methods of iteration by using the
methods keySet, values and entrySet as we show below.
HashMapIteration.java
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;
public class HashMapIteration
{
public static void main(String[] args)
{
HashMap<String,Integer> wordMap = new HashMap<>();
wordMap.put("Hello", 3);
wordMap.put("Great", 9);
wordMap.put("Great2", 9);
wordMap.put("Frank", 2);
wordMap.put("hmm", -1);
wordMap.put("wow", null);
Set<String> keys = wordMap.keySet();
Collection<Integer> values = wordMap.values();
Set<HashMap.Entry<String,Integer>> entries = wordMap.entrySet();
System.out.println("Keys:");
for (String k : keys) System.out.print(k+" ");
System.out.println();
System.out.println(keys.contains("hmm"));
keys.remove("Frank");
values.remove(9);
System.out.println(keys.contains("Frank"));
System.out.println(keys.contains("Great"));
System.out.println("Keys:");
for (Iterator<String> it = keys.iterator(); it.hasNext(); )
{
String k = it.next();
System.out.print(k+" ");
if (k.equals("Hello")) it.remove();
}
System.out.println();
System.out.println("Values:");
for (Integer d : values) System.out.print(d+" ");
System.out.println();
System.out.println("Entries:");
for (HashMap.Entry<String,Integer> e : entries)
{
System.out.println(e.getKey()+","+e.getValue());
}
}
}
The methods keySet, values and entrySet are all Θ(1) and use Θ(1) memory since instead
of creating a new collection, they simply represent a view into the data in the HashMap. As
seen above, removals from these collections cause removals from the HashMap.
One important difference between the Java HashMap and our Map is that get, remove,
and containsKey all take type Object instead of the key type. In addition, containsValue takes
type Object instead of the value type. The implementations only use hashCode and equals
on the argument, so they don't really care what the type of the argument is. This decision
was made to allow for greater flexibility, but gives less protection against unintentional
programmer errors:
HashMap<String,Integer> map = new HashMap<>();
map.put("12",94);
System.out.println(map.containsKey(12)); //Prints false: the Integer 12 is not equal to the String "12"