Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Sets, Maps and Hash Tables Sets • We have learned that different data structures have different advantages – and drawbacks • Choosing the proper data structure depends on typical usage patterns • Array- and list-oriented data structures are appropriate when the order of elements matter – but that is not always the case RHS – SOC 2 Sets • A Set is a data structure which can hold an unordered collection of elements • Not having to worry about ordering can improve performance of other operations • On a Set, we want to be able to – Insert an element – Delete an element – Check if a given element is in the Set RHS – SOC 3 Sets public interface Set<T> { void add(T element); void remove(); boolean contains(T element); Iterator<T> iterator(); } RHS – SOC 4 Sets • It turns out that insertion, deletion and check for containment can be done in O(log(n)), or even faster! • Depends on the underlying implementation of the interface • In Java, implementation is either – HashSet (based on Hash Tables) – TreeSet (based on Trees) RHS – SOC 5 Sets • A Set iterator is ”simpler” than e.g. a List iterator – Elements will occur in ”random” order – No add method – we just call add on the Set itself – No previous method – does not make sense • The Set iterator does however have a delete method (why?) RHS – SOC 6 Sets – Quality tip • When using a Set, we must choose a specific implementation (HashSet or TreeSet) • However, the definition should look like: Set<Car> cars = new HashSet<Car>(); RHS – SOC 7 Sets – Quality tip Set<Car> cars = new HashSet<Car>(); • Why…? We should in general only refer to the interface, not the implementation • Easy to switch implementation! RHS – SOC 8 Maps • A Map is a data structure which stores associations between – A collection of keys – A collection of values • All keys map to a value • Keys are unique (values are not) RHS – SOC 9 Maps K1 V1 K2 V2 K3 V3 K4 RHS – SOC 10 Map public interface Map<K,V> { void put(K key,V value); V get(K key); void remove(K key); Set<K> keySet(); } RHS – SOC 11 Map • The keySet method returns a Set containing all keys in the Map • You must then iterate through this Set, in order to get all values stored in the Map RHS – SOC 12 Map Map<String,Car> carMap = new HashMap<String,Car>(); ... Set<String> regNumbers = carMap.keySet(); for (String regNo : regNumbers) { Car aCar = carMap.get(regNo); ... // Do something with the Car object } RHS – SOC 13 Exercises • Review: R16.1, R16.4, R16.6 • Programming: P16.4, P16.12 RHS – SOC 14 Hash Tables • A Set and a Map are both abstract data types – we need a concrete implementation in order to use them • In the Java library, two implementations are available: – Sets: HashSet, TreeSet – Maps: HashMap, TreeMap RHS – SOC 15 Hash Tables • The implementations HashSet and HashMap are based on a Hash Table • A Hash Table is based on the below ideas: – Create an array of length N, which can store objects of some type T – Find a mapping from T to the interval [0; N-1] (a Hash Function f) – Store an object t of type T in the position f(t) RHS – SOC 16 Hash Tables f(Car1) = 3 Car3 Car1 0 f(Car2) = 0 Car2 1 f(Car3) = 2 2 3 RHS – SOC 4 17 Hash Tables • A Hash Table is thus ”almost” an array • Instead of having an index directly available, we must calculate it • If calculation can be done in constant time, then all basic operations (insert, delete, lookup) can be done in constant time! • Better than tree-based implementations, which have O(log(N)) RHS – SOC 18 Hash Tables • However, there are some issues: – How do we define a good mapping from the objects to [0; N-1]? – What happens if we try to store two objects at the same position? RHS – SOC 19 Hash Functions • Before finding a good mapping – i.e. a good hash function – we must consider the size of the array • For good performance, the array should at least be as large as the maximal number of objects stored • Rule of thumb is about 30 % larger • Size should be a prime number (???) RHS – SOC 20 Hash Functions • What if the expected number of objects is unknown in advance? • We can expand a hash table dynamically • If the hash table in running out of space, double the capacity • Start out with a reasonably large array (space is cheap…) RHS – SOC 21 Hash Functions • Having handled the choice of N, how do we define a proper hash function? • Properties of a hash function: – Must map all objects of type T to the interval [0; N-1] – Should map objects as uniformly as possible to the interval [0; N-1] RHS – SOC 22 Hash Functions • We can enforce the mapping to [0;N-1] by using the modulo operator: f(t) = g(t) % N • g(t) can then produce any integer value • How do we achieve a uniform distribution? • Theory for this is complicated, but there are some general rules to follow RHS – SOC 23 Hash Functions • A good hash function should be ”almost random”, but deterministic – ”Almost random” – values are well distributed in the interval – Deterministic – always produce the same output for the same input RHS – SOC 24 Hash Functions • In Java, all objects have a hashCode method – Defined in Object class – Can be overrided – Returns an integer (the Hash Code) – We must use modulo on the value ourselves RHS – SOC 25 Hash Functions • Hash function for integers: – The number itself… • Hash function for strings: final int HASH_MULTIPLIER = 31; int h = 0; for (int i = 0; i < s.length; i++) h = (HASH_MULTIPLIER * h) + s.charAt(i); RHS – SOC 26 Hash Functions • Hash code for an object can be calculated by combining hash codes for instance fields • Combine values in a way similar to the algorithm used to find string hash codes RHS – SOC 27 Hash Functions public int hashCode() { final int MULTIPLIER = 31; int h1 = regNo.hashCode(); int h2 = mileage; int h3 = model.hashCode(); int h = h1*MULTIPLIER + h2; h = h*MULTIPLIER + h3; return h; } RHS – SOC 28 Hash Functions • But wait…what about numeric overflow? • We multiply a ”random” integer value with a number…? • Does not really matter… • As long as the algorithm is deterministic, overflow is not a problem • Just helps ”scrambling” the value RHS – SOC 29 Hash Functions • Common pitfalls: – Remember to define a hashCode function – If you forget, the hashCode implementation in Object is used – Based solely on memory location of object – Two objects with the same value of instance fields will produce different hash codes… RHS – SOC 30 Hash Functions • Common pitfalls: – The hashCode function must be ”compatible” with your equals function – If a.equals(b) it must hold that a.hashCode() == b.hashCode() – If not, duplicates are allowed! – The reverse condition is not required; two different objects may have the same hash code RHS – SOC 31 Hash Functions • In general, you must remember to: – Either define the hashCode and the equals method – Or not define any of them! RHS – SOC 32 Handling collisions • Even with a good hash function, we will still experience collisions • Collision: two different objects t1 and t2 have the same hash code • We will then try to store both objects in the same position in the array • Now what…? RHS – SOC 33 Handling collisions • What we store in each position in the array is not the objects themselves, but a linked list of objects • Objects with the same hash code h are stored in the linked list in position h • With a good hash function, the average length of non-empty lists is less than 2 RHS – SOC 34 Handling collisions Car6 Car4 Car2 0 Car3 1 2 Car1 3 RHS – SOC Car5 4 35 Handling collisions • Basic operations (insert, delete, lookup) follow this structure: – Calculate hash code for the object – Find the corresponding position in the array • Insert: Insert element at the end of list • Delete/Lookup: Iterate through list until element is found, or end of list is reached RHS – SOC 36 Handling collisions • Basic operations are thus not done in truly constant time • However, if a proper hash function is used, running time is constant in practice • Use hash-based implementations unless special circumstances apply – Hard to define hash/equals function – More functionality required RHS – SOC 37 Exercises • Review: R16.8, R16.10 • Programming: P16.6 RHS – SOC 38