DATA STRUCTURES & FILES
HASHING TECHNIQUES
ΔΠΘ - ΜΠΔ: Data Structures & Files

HASHING
An efficient way to:
• store data
• retrieve data
Goal: search in O(1)

Basic operation
• We store the element with key k at position h(k) of the hash table.
• Hash function h:
  – Even when k is not an integer, h(k) is.
  – It takes values from 0 to N - 1.
• A collision occurs when k1 ≠ k2 and h(k1) = h(k2)
  – Different keys are mapped to the same position of the hash table.

Hash function
• It performs two tasks:
  – it converts the key value to an integer,
  – it restricts the integer computed in the previous step to the range [0..N-1]
• The most obvious method is the division method (mod N)
• An alternative method is h(k) = (a*k + b) mod N (MAD: Multiply, Add and Divide), where N is a prime number and a mod N ≠ 0.

Hash table
Let N be the size of the table, here N = 10000. The hash function used is "take the last 4 digits", i.e. h(k) = k mod N.
  Position   Key
  0          ∅
  1          025-612-0001
  2          ∅
  3          981-101-0003
  4          451-229-0004
  …
  9997       ∅
  9998       200-751-9998
  9999       ∅

Collision handling
• As long as there are no collisions, hashing runs in O(1) time
• Definition: load factor α (or λ) = n / N
  – n = number of elements that have been inserted
  – N = size of the table
• When collisions occur, the elements can be stored:
  – in another data structure outside the table
  – in alternative positions of the hash table

Questions to Ask When Analyzing Resolution Schemes
1. Are we guaranteed to find an empty cell if there is one?
2. Are we guaranteed we won't be checking the same cell twice during one insertion?
3. What should the load factor be to obtain O(1) average-case insert, search, and delete?

Three factors affecting the performance of hashing
1. The hash function
  – Ideally, it should distribute keys and entries evenly throughout the table
  – It should minimise collisions, where the position given by the hash function is already occupied
2. The size of the table
  – Too big will waste memory; too small will increase collisions and may eventually force rehashing (copying into a larger table)
  – Should be appropriate for the hash function used, and a prime number is best
3. The collision resolution strategy
  – Separate chaining: chain together several keys/entries in each position
  – Open addressing: store the key/entry in a different position

1. Choosing a hash function: turning a key into a table position
• Truncation
  – Ignore part of the key and use the rest as the array index (converting non-numeric parts)
  – A fast technique, but check for an even distribution throughout the table
• Folding
  – Partition the key into several parts and then combine them in any convenient way
  – Unlike truncation, uses information from the whole key
• Modular arithmetic (used by truncation & folding, and on its own)
  – To keep the calculated table position within the table, divide the position by the size of the table, and take the remainder as the new position

Examples of hash functions
• Truncation: If students have a 9-digit identification number, take the last 3 digits as the table position
  – e.g. 925371622 becomes 622
• Folding: Split a 9-digit number into three 3-digit numbers, and add them
  – e.g. 925371622 becomes 925 + 371 + 622 = 1918
• Modular arithmetic: If the table size is 1000, the first example always keeps within the table range, but the second example does not (it should be taken mod 1000)
  – e.g. 1918 mod 1000 = 918 (in C++: 1918 % 1000)

• Using a telephone number as a key
  – The area code is not random, so it will not spread the keys/entries evenly through the table (many collisions)
  – The last 3 digits are more random
• Using a name as a key
  – Use the full name rather than the surname (the surname is not particularly random)
  – Assign numbers to the characters (e.g. a = 1, b = 2; or use Unicode values)
  – Strategy 1: Add the resulting numbers. Bad for a large table size.
  – Strategy 2: Call the number of possible characters c (e.g. c = 54 for the alphabet in upper and lower case, plus space and hyphen). Then multiply each character in the name by increasing powers of c, and add the results together.

Modular arithmetic: Division
• The key is subject to modular (remainder) division by an integer, which is usually prime.
• This integer should be approximately equal to the desired size of the array. The result of the division, the remainder, determines which array location is used.
• Most commonly used, in combination with others

Modular arithmetic: Midsquare
• The key is squared and the digits in the middle are retained for the address. This works better with smaller hash values (sizes less than 10000)
EXAMPLE: number 9876
(9876)² = 975 353 76 (the middle digits, 353, are retained)

Modular arithmetic: Folding / Boundary Folding
• Social Security Number: 387-58-1505
  – Folding, hash as the sum of three integers: 387 + 58 + 1505 = 1950
  – Boundary folding, hash as the sum of three integers with the middle one reversed: 387 + 85 + 1505 = 1977

Bar Coding
• A bar code consists of 10 digits: 1 234 567 890
• Store bar codes in a hash table
• Suppose the total number of bar codes is less than 10,000…
  – Where and how do I store the codes?
  – What is the size of the hash table?
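The folding and boundary-folding arithmetic above can be sketched in a few lines of C++. This is only an illustration: the function names (`reverseDigits`, `foldShift`, `foldBoundary`, `toIndex`) are hypothetical and not taken from the slides, and the key is assumed to be already split into integer parts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reverse the decimal digits of a non-negative integer, e.g. 58 -> 85.
// Trailing zeros are lost (50 -> 5), which is acceptable for this sketch.
int reverseDigits(int value) {
    int reversed = 0;
    while (value > 0) {
        reversed = reversed * 10 + value % 10;
        value /= 10;
    }
    return reversed;
}

// Plain folding: add the parts of the key as they are.
int foldShift(const std::vector<int>& parts) {
    int sum = 0;
    for (int part : parts) sum += part;
    return sum;
}

// Boundary folding: reverse every second part (index 1, 3, ...) before adding.
int foldBoundary(const std::vector<int>& parts) {
    int sum = 0;
    for (std::size_t i = 0; i < parts.size(); ++i)
        sum += (i % 2 == 1) ? reverseDigits(parts[i]) : parts[i];
    return sum;
}

// Modular arithmetic: reduce a folded value to a valid index of the table.
int toIndex(int folded, int tableSize) {
    assert(tableSize > 0);
    return folded % tableSize;
}
```

With the Social Security Number from the slide, `foldShift({387, 58, 1505})` gives 1950 and `foldBoundary({387, 58, 1505})` gives 1977; `toIndex` then applies the final `mod` step for whatever table size is chosen.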

Modular arithmetic: Folding
• The key is divided into several parts, which are combined and processed to give an address. For example, if the bar code is 70662 11001:
• The hash table has 15000 entries
• Group into pairs: 70 66 21 10 01
• Multiply the first three pairs together: 70 x 66 x 21 = 97020
• Add this number to the last two pairs: 97020 + 10 + 01 = 97031
• Find the remainder of division by 14987 (15000 – 13): 97031 % 14987 = 7109

• Bar code: 66702 10110
  – Group into pairs: 66 70 21 01 10
  – 66 x 70 x 21 = 97020
  – 97020 + 1 + 10 = 97031
  – 97031 % 14987 = 7109
OOPS… same value as the last bar code!

Real problems
• Suppose we are storing the numeric IDs of customers, maybe 100,000 of them
• Now, we want to check if a person is delinquent; usually there are fewer than 400 such people.
• Use an array of size 1000 for the delinquents.
• Put each id in at position id mod tableSize.
  – id = 987567 gives table index 567
• Clearly fast for searching
• But what happens if entries collide?

• Suppose we are storing students by social security number
• How many students?
• How big should the table be?
• How do I store the students?

2. Choosing the table size to minimize collisions
• As the number of elements in the table increases, the likelihood of a collision increases, so make the table as large as practical
• If the table size is 100, and all the hashed keys are divisible by 10, there will be many collisions!
  – Particularly bad if the table size is a power of a small integer such as 2 or 10
• More generally, collisions may be more frequent if:
  – Greatest Common Divisor (hashed keys, table size) > 1
• Therefore, make the table size a prime number (GCD = 1)
• An excess of approximately 30% is typical.
  – This means that if s is the number of slots in the table and e is the number of elements, then s = a prime number >= 4/3 * e

3. The collision resolution strategy
• Collisions may still happen, so we need a collision resolution strategy.
• Two principal strategies (techniques):
  – Separate chaining
  – Open addressing

Open Addressing Strategy
• To insert a key K, compute h0(K). If the corresponding location of the hash array, T[h0(K)], is empty, insert it there. If a collision occurs, probe the alternative cells h1(K), h2(K), ... until an empty cell is found.
  hi(K) = (hash(K) + f(i)) mod m, with f(0) = 0
  – f: the collision resolution strategy

Probing: If the table position given by the hashed key is already occupied, increase the position by some amount, until an empty position is found
  – Linear probing: increase by 1 each time [mod table size!]
  – Quadratic probing: to the original position, add 1, 4, 9, 16, …
Use the collision resolution strategy when inserting and when finding (ensure that the search key and the found keys match)
Double hashing: the step of linear probing multiplied by the result of another hash function
With open addressing, the table size should be double the expected number of elements

• Linear Probing – f(i) = i
  • cells are probed sequentially (with wraparound)
  • hi(K) = (hash(K) + i) mod m
• Quadratic Probing – f(i) = i²
  • hi(K) = (hash(K) + i²) mod m
• Double Hashing – f(i) = i * hash2(K)
  • e.g. hash2(K) = R - (K mod R), where R is a prime smaller than m

Linear Probing
• If the table is fairly empty with many collisions, linear probing may cluster (group) keys/entries
  – This increases the time to insert and to find
[Diagram: a table with positions 1 to 8, some of them occupied]
For a table of size n, if the table is empty, the probability of the next entry going to any particular place is 1/n
In the diagram, the probability of position 2 getting filled next is 2/n (either a hash to 1 or to 2 fills it)
Once 2 is full, the probability of 4 being filled next is 4/n and then of 7 is 7/n (i.e. the probability of getting long strings steadily increases)
Linear probing suffers from primary clustering

Primary Clustering
• We call a block of contiguously occupied table entries a cluster
• On average, when we insert a new key K, we may hit the middle of a cluster. Therefore, the time to insert K would be proportional to half the size of a cluster. That is, the larger the cluster, the slower the performance.
• Linear probing has the following disadvantages:
  – Once h(K) falls into a cluster, this cluster will definitely grow in size by one. Thus, this may worsen the performance of future insertions.
  – If two clusters are separated by only one entry, then inserting one key into a cluster can merge the two clusters together. Thus, the cluster size can increase drastically with a single insertion, and so can the cost of later insertions.
  – Large clusters are easy targets for collisions.
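As a rough illustration of how linear probing builds clusters, the following C++ sketch inserts keys with hi(K) = (K mod m + i) mod m. The names `EMPTY`, `insertLinear` and `insertAll` are hypothetical, keys are assumed to be non-negative integers, and -1 is used as an assumed marker for a free bin.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // assumed sentinel for a free bin (keys are non-negative)

// Insert key with linear probing: try home, home + 1, home + 2, ... (mod m).
// Returns the bin that was used, or -1 if the table is full.
int insertLinear(std::vector<int>& table, int key) {
    assert(!table.empty() && key >= 0);
    const int m = static_cast<int>(table.size());
    const int home = key % m;  // division-method hash
    for (int i = 0; i < m; ++i) {
        const int bin = (home + i) % m;  // wraparound
        if (table[bin] == EMPTY) {
            table[bin] = key;
            return bin;
        }
    }
    return -1;
}

// Build a table of m bins by inserting the keys in the given order.
std::vector<int> insertAll(int m, const std::vector<int>& keys) {
    std::vector<int> table(m, EMPTY);
    for (int key : keys) insertLinear(table, key);
    return table;
}
```

Running `insertAll` with 25 bins on the key sequence of this deck's 25-bin primary clustering example (81, 70, 97, 63, 76, 38, ...) places 38 in bin 14 and 49 in bin 2, and leaves one long run of occupied bins from 5 to 16.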

Primary Clustering
Consider inserting the following entries:
81, 70, 97, 63, 76, 38, 85, 68, 21, 9, 55, 73, 57, 60, 72, 74, 85, 16, 61, 7, 49
Use the number modulo 25 to determine which bin it should occupy
  – The first five don't cause any collisions
Occupied bins (bin: key): 1: 76, 6: 81, 13: 63, 20: 70, 22: 97

Primary Clustering
Inserting 38 causes a collision in bin 13
The next seven do not cause any further collisions
Occupied bins (bin: key): 1: 76, 5: 55, 6: 81, 7: 57, 9: 9, 10: 85, 13: 63, 14: 38, 18: 68, 20: 70, 21: 21, 22: 97, 23: 73

Primary Clustering
The next four insertions cause collisions: 60 (bin 10), 72 (bin 22), 74 (bin 24), 85 (bin 10)
We can safely insert 16 into bin 16
Occupied bins (bin: key): 0: 74, 1: 76, 5: 55, 6: 81, 7: 57, 9: 9, 10: 85, 11: 60, 12: 85, 13: 63, 14: 38, 16: 16, 18: 68, 20: 70, 21: 21, 22: 97, 23: 73, 24: 72

Primary Clustering
The remaining insertions all cause collisions: 61 (bin 11), 7 (bin 7), 49 (bin 24)
Occupied bins (bin: key): 0: 74, 1: 76, 2: 49, 5: 55, 6: 81, 7: 57, 8: 7, 9: 9, 10: 85, 11: 60, 12: 85, 13: 63, 14: 38, 15: 61, 16: 16, 18: 68, 20: 70, 21: 21, 22: 97, 23: 73, 24: 72

Primary Clustering
The length of these chains will affect the number of probes required to perform insertions, accesses, or removals
It is possible to estimate the average number of probes for a successful search, where λ is the load factor:
  $\frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$
For example: if λ = 0.5, we require 1.5 probes on average

Primary Clustering
The number of probes for an unsuccessful search or for an insertion is higher:
  $\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$
For 0 ≤ λ ≤ 1, (1 – λ)² ≤ 1 – λ, and therefore the reciprocal will be larger
  – Again, if λ = 0.5 then we require 2.5 probes on average

Primary Clustering
[Plot: how the number of required probes increases with the load factor]

Primary Clustering
• Our goal was to keep all operations O(1)
• Unfortunately, as λ grows, so does the run time
• One solution is to keep the load factor under a given bound
• If we choose λ = 2/3, then the number of probes for a successful or an unsuccessful search is 2 and 5, respectively.
• Therefore, we have three choices:
  – Choose M large enough so that we will not pass this load factor
  – Double the number of bins if the chosen load factor is reached
  – Choose a different strategy from linear probing

Primary Clustering
• The first solution (choose M sufficiently large) is most useful if we know all the possible entries
• The second (doubling) is only useful if we have an environment where we can dynamically allocate memory
• For the third, we will look at quadratic probing and double hashing. Quadratic probing is the most common technique to avoid clustering.

Quadratic probing
• Quadratic probing is a solution to the primary clustering problem
  – Linear probing adds 1, 2, 3, etc. to the original hashed key
  – Quadratic probing adds 1², 2², 3², etc. to the original hashed key
• However, whereas linear probing guarantees that all empty positions will be examined if necessary, quadratic probing does not.
• Two keys with different home positions will have different probe sequences
  – e.g. m = 101, h(k1) = 30, h(k2) = 29
  – probe sequence for k1: 30, 30+1, 30+4, 30+9
  – probe sequence for k2: 29, 29+1, 29+4, 29+9

Example (1) - Quadratic Probing
Use quadratic probing to insert the following numbers into an initially empty hash table with 11 bins where the hash value of a number is the least-significant digit.
81, 70, 34, 49, 50, 64
Starting with an initially empty table:
  Bin:   0   1   2   3   4   5   6   7   8   9   10
  Key:   ∅   ∅   ∅   ∅   ∅   ∅   ∅   ∅   ∅   ∅   ∅
We insert 81 in bin 1,
  Bin:   0   1   2   3   4   5   6   7   8   9   10
  Key:   ∅   81  ∅   ∅   ∅   ∅   ∅   ∅   ∅   ∅   ∅
70 in bin 0,
  Bin:   0   1   2   3   4   5   6   7   8   9   10
  Key:   70  81  ∅   ∅   ∅   ∅   ∅   ∅   ∅   ∅   ∅
34 in bin 4,
  Bin:   0   1   2   3   4   5   6   7   8   9   10
  Key:   70  81  ∅   ∅   34  ∅   ∅   ∅   ∅   ∅   ∅
and 49 in bin 9.
  Bin:   0   1   2   3   4   5   6   7   8   9   10
  Key:   70  81  ∅   ∅   34  ∅   ∅   ∅   ∅   49  ∅

• Inserting 50, we note that bin 0 is occupied and therefore we check:
  – 0 + 1 ≡ 1 which is occupied,
  – 0 + 4 ≡ 4 which is occupied,
  – 0 + 9 ≡ 9 which is occupied, and
  – 0 + 16 ≡ 5 which is unoccupied.
• Thus, 50 goes into bin 5.
  Bin:   0   1   2   3   4   5   6   7   8   9   10
  Key:   70  81  ∅   ∅   34  50  ∅   ∅   ∅   49  ∅

• Inserting 64, we note that bin 4 is occupied, and therefore we check:
  – bin 4 + 1 ≡ 5 which is occupied, and
  – bin 4 + 4 ≡ 8 which is unoccupied.
• Thus, 64 goes into bin 8.
  Bin:   0   1   2   3   4   5   6   7   8   9   10
  Key:   70  81  ∅   ∅   34  50  ∅   ∅   64  49  ∅

Example (2) – Quadratic Probing
From the hash table in Example 1, search for the elements 60, 61, 62, 63, 64, 65, 66, 67, 68, and 69.
60. Searching for 60, we check 0, 0 + 1 ≡ 1, 0 + 4 ≡ 4, 0 + 9 ≡ 9, 0 + 16 ≡ 5, 0 + 25 ≡ 3, and 3 is empty, and therefore 60 is not in the hash table.
61. Searching for 61, we check 1 and 1 + 1 ≡ 2 and 2 is empty. Therefore 61 is not in the hash table.
62. 2 is empty, therefore 62 is not in the hash table.
63. 3 is empty, therefore 63 is not in the hash table.
64. Searching for 64, we check 4, 4 + 1 ≡ 5, and 4 + 4 ≡ 8, and 64 is located in 8, and therefore 64 is in the hash table.
65. Searching for 65, we check 5 and 5 + 1 ≡ 6 and 6 is empty. Therefore 65 is not in the hash table.
66. 6 is empty, therefore 66 is not in the hash table.
67. 7 is empty, therefore 67 is not in the hash table.
68. Searching for 68, we check 8, 8 + 1 ≡ 9, 8 + 4 ≡ 1, and 8 + 9 ≡ 6 and 6 is empty. Therefore 68 is not in the hash table.
69. Searching for 69, we check 9 and 9 + 1 ≡ 10 and 10 is empty. Therefore 69 is not located in the hash table.

Example (3) - Quadratic Probing
• For example, suppose an element was to be inserted in bin 23 in a hash table with 31 bins
• The sequence in which the bins would be checked is:
  23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0
• Even if two bins are initially close, the sequence in which subsequent bins are checked varies greatly
• Again, with M = 31 bins, compare the first 16 bins which are checked starting with 22 and 23:
  Start 22: 22, 23, 26, 0, 7, 16, 27, 9, 24, 10, 29, 19, 11, 5, 1, 30
  Start 23: 23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0

Quadratic Probing: Properties
• Thus, quadratic probing solves the problem of primary clustering
• Unfortunately, there is a second problem which must be dealt with
• Suppose we have M = 8 bins: 1² ≡ 1, 2² ≡ 4, 3² ≡ 1
• In this case, we are checking bin h + 1 twice having checked only one other bin

• Disadvantage of this method:
  – After a number of probes the sequence of steps repeats itself (remember that the step will be (probe number)² mod the size of the hash table). This repetition occurs when the probe number is roughly half the size of the hash table.
  – e.g. Table size 16 and original hashed key 3 gives the sequence: 3, 4, 7, 12, 3, 12, 7, 4…
• More generally, with quadratic probing, insertion may be impossible if the table is more than half-full!
• Need to rehash

• For any λ < 0.5, quadratic probing will find an empty slot; for larger λ, quadratic probing may find a slot
• If the table size is prime, then a new key can always be inserted if the table is at least half empty
• Keys that hash to the same home position will probe the same alternative cells
• Simulation results suggest that it generally causes less than an extra half probe per search
• Quadratic probing does not suffer from primary clustering: keys hashing to the same area of the table are not a problem

Secondary Clustering
• The phenomenon of primary clustering will not occur with quadratic probing
• Quadratic probing suffers from a milder form of clustering called secondary clustering (if multiple items all hash to the same initial bin, the same sequence of bins will be followed). The effect is less significant than that of primary clustering.
• As with linear probing, if two keys have the same initial probe position, then their probe sequences are the same, since h(k1,0) = h(k2,0) implies h(k1,1) = h(k2,1). So only m distinct probe sequences are used.
• Clustering can occur around the probe sequences.
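The repetition of the quadratic probe sequence can be checked with a small C++ sketch. The helper names `quadraticProbeSequence` and `distinctBinsReached` are hypothetical; the sketch only evaluates hi = (home + i²) mod m and is not a full table implementation.

```cpp
#include <cassert>
#include <set>
#include <vector>

// First `count` bins visited by quadratic probing from `home` in a table of m bins.
std::vector<int> quadraticProbeSequence(int home, int m, int count) {
    assert(m > 0 && home >= 0);
    std::vector<int> bins;
    for (int i = 0; i < count; ++i)
        bins.push_back((home + i * i) % m);  // h_i = (home + i^2) mod m
    return bins;
}

// Number of different bins reached during the first m probes.
int distinctBinsReached(int home, int m) {
    const std::vector<int> bins = quadraticProbeSequence(home, m, m);
    return static_cast<int>(std::set<int>(bins.begin(), bins.end()).size());
}
```

For a table of size 16 and home position 3, the first eight probes are 3, 4, 7, 12, 3, 12, 7, 4 and only 4 distinct bins are ever reached. With the prime size 11 the probes reach 6 distinct bins, which is consistent with the claim that an insertion succeeds while a prime-sized table is at least half empty.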

Secondary Clustering
• Secondary clustering may be a problem if the hash function does not produce an even distribution of entries
• To avoid secondary clustering, the probe sequence needs to be a function of the original key value, not the home position
• One solution to secondary clustering is double hashing: associating with each element an initial bin (defined by one hash function) and a skip (defined by a second hash function)

Review
Linear probing:
  – Look at bins k, k + 1, k + 2, k + 3, k + 4, …
  – Primary clustering
Quadratic probing:
  – Look at bins k, k + 1, k + 4, k + 9, k + 16, …
  – Secondary clustering (dangerous for poor hash functions)
  – Expensive:
    • Prime-sized arrays
    • Euclidean algorithm for calculating remainders

Double Hashing
An alternate solution
  – Give each object (with high probability) a different jump size
  – Associate with each object an initial bin and a jump size
      for ( int k = 0; k < M; ++k ) {
          bin = (initial + k*jump) % M;
      }
  – The jump size and the number of bins M must be relatively prime

Problem:
  – Will initial + k*jump step through all of the bins?
  – The output of:
      int M = 16;
      int initial = 5;
      int jump = 12;
      for ( int k = 0; k < M; ++k ) {
          cout << (initial + k*jump) % M << ' ';
      }
    is
      5 1 13 9 5 1 13 9 5 1 13 9 5 1 13 9

Problem:
  – Not all jump sizes will visit all bins
  – However, in this case, we visit all bins. The output of:
      int M = 16;
      int initial = 5;
      int jump = 7;
      for ( int k = 0; k < M; ++k ) {
          cout << (initial + k*jump) % M << ' ';
      }
    is
      5 12 3 10 1 8 15 6 13 4 11 2 9 0 7 14

Double Hashing
• Double hashing aims to avoid both primary and secondary clustering and is guaranteed to find a free element in a hash table as long as the table is not full. It achieves these goals by calculating the step value using a second hash function g.
  step(k) = k * g(key), where k is the probe number
• This new hash function g should:
  – be different from the original hash function (remember that it was the original hash function that resulted in the collision in the first place) and,
  – not result in zero (as original index + 0 = original index)
• One of the best methods available for open addressing because the permutations produced have many of the characteristics of randomly chosen permutations.

• The second hash function is usually chosen as follows: g(key) = q – (key % q), where q is a prime number with q < N (N is the size of the array).
• In order for the entire table to be searched, the value of the second hash function, g(k), must be relatively prime to the table size m.
• Remark: It is important that the size of the hash table is a prime number if double hashing is to be used. This guarantees that successive probes will (eventually) try every index in the hash table before an index is repeated (which would indicate that the hash table is full).
  – For other hashing methods (and for q) we want to use prime numbers to eliminate existing patterns in the data.

Double Hashing
• Probe sequence:
  0th probe = h(k) mod TableSize
  1st probe = (h(k) + g(k)) mod TableSize
  2nd probe = (h(k) + 2*g(k)) mod TableSize
  3rd probe = (h(k) + 3*g(k)) mod TableSize
  ...
  ith probe = (h(k) + i*g(k)) mod TableSize
• Performance of double hashing:
  – Much better than linear or quadratic probing because it eliminates both primary and secondary clustering.
  – BUT requires the computation of a second hash function.
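The probe sequence above can be turned into a short insertion routine. The following C++ sketch assumes h(k) = k mod m and g(k) = q - (k mod q); the names `EMPTY` and `insertDouble` are hypothetical, keys are assumed non-negative, and -1 marks a free bin.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // assumed sentinel for a free bin (keys are non-negative)

// Insert key using double hashing:
//   h(k) = k mod m,  g(k) = q - (k mod q),  i-th probe = (h(k) + i*g(k)) mod m.
// q should be a prime smaller than the (prime) table size m.
// Returns the bin that was used, or -1 if no free bin was found in m probes.
int insertDouble(std::vector<int>& table, int key, int q) {
    assert(!table.empty() && key >= 0 && q > 0);
    const int m = static_cast<int>(table.size());
    const int home = key % m;
    const int step = q - (key % q);  // in 1..q, so never zero
    for (int i = 0; i < m; ++i) {
        const int bin = (home + i * step) % m;
        if (table[bin] == EMPTY) {
            table[bin] = key;
            return bin;
        }
    }
    return -1;
}
```

With a table of 7 bins and q = 5, inserting 76, 93, 40, 47, 10, 55 in this order uses bins 6, 2, 5, 1, 3, 4; the keys 47 and 55 each need a second probe. Computing the step once per key, rather than on every probe, is a small design choice that keeps the extra cost of the second hash function constant.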

Example
h(k) = k mod 7 and g(k) = 5 – (k mod 5)
  Insert   Probes   Table after the insertion (bin: key)
  76       1        6: 76
  93       1        2: 93, 6: 76
  40       1        2: 93, 5: 40, 6: 76
  47       2        1: 47, 2: 93, 5: 40, 6: 76
  10       1        1: 47, 2: 93, 3: 10, 5: 40, 6: 76
  55       2        1: 47, 2: 93, 3: 10, 4: 55, 5: 40, 6: 76

Selecting a second hash function
• The second hash function g() must never evaluate to zero
• For any key K, g(K) must be relatively prime to the table size m. Otherwise, we will only be able to examine a fraction of the table entries.
  – E.g., if hash(K) = 0 and g(K) = m/2, then we can only examine the entries T[0], T[m/2], and nothing else!
• One solution is to make m prime, and choose R to be a prime smaller than m, and set g(K) = R – (K mod R)

Example
• Load the keys 18, 26, 35, 9, 64, 47, 96, 36, and 70, in this order, into an empty hash table of size 13
  (a) using double hashing with the first hash function h(key) = key % 13 and the second hash function g(key) = 1 + key % 12
  (b) using double hashing with the first hash function h(key) = key % 13 and the second hash function g(key) = 7 - key % 7
Show all computations.

(a) hi(key) = [h(key) + i*g(key)] % 13, with h(key) = key % 13 and g(key) = 1 + key % 12
  h0(18) = (18%13)%13 = 5
  h0(26) = (26%13)%13 = 0
  h0(35) = (35%13)%13 = 9
  h0(9) = (9%13)%13 = 9 collision
    g(9) = 1 + 9%12 = 10
    h1(9) = (9 + 1*10)%13 = 6
  h0(64) = (64%13)%13 = 12
  h0(47) = (47%13)%13 = 8
  h0(96) = (96%13)%13 = 5 collision
    g(96) = 1 + 96%12 = 1
    h1(96) = (5 + 1*1)%13 = 6 collision
    h2(96) = (5 + 2*1)%13 = 7
  h0(36) = (36%13)%13 = 10
  h0(70) = (70%13)%13 = 5 collision
    g(70) = 1 + 70%12 = 11
    h1(70) = (5 + 1*11)%13 = 3

(b) hi(key) = [h(key) + i*g(key)] % 13, with h(key) = key % 13 and g(key) = 7 - key % 7
  h0(18) = (18%13)%13 = 5
  h0(26) = (26%13)%13 = 0
  h0(35) = (35%13)%13 = 9
  h0(9) = (9%13)%13 = 9 collision
    g(9) = 7 - 9%7 = 5
    h1(9) = (9 + 1*5)%13 = 1
  h0(64) = (64%13)%13 = 12
  h0(47) = (47%13)%13 = 8
  h0(96) = (96%13)%13 = 5 collision
    g(96) = 7 - 96%7 = 2
    h1(96) = (5 + 1*2)%13 = 7
  h0(36) = (36%13)%13 = 10
  h0(70) = (70%13)%13 = 5 collision
    g(70) = 7 - 70%7 = 7
    h1(70) = (5 + 1*7)%13 = 12 collision
    h2(70) = (5 + 2*7)%13 = 6

Probing Techniques - review
[Figure: positions of the 0th, 1st, 2nd and 3rd tries, starting from home position i, for linear probing, quadratic probing and double hashing* (*determined by a second hash function)]

Rehashing
• If the load factor goes over the safe limit, we should increase the size of the hash table (as for dynamic arrays). This process is called rehashing.
• Comments:
  – we cannot just double the size of the table, as the size should be a prime number;
  – it will change the main hash function
  – it's not enough to just copy items
• Rehashing will take time O(N)

Rehashing
When the table gets too full, create a bigger table (usually 2x as large) and hash all the items from the original table into the new table.
• When to rehash?
  – when the load factor reaches some threshold (e.g., λ ≥ 0.5),
  – when an insertion fails
  – some other threshold
• Cost of rehashing?

Rehashing: enlarging the table
• To rehash:
  – Create a new table of double the size (adjusting until it is again prime)
  – Transfer the entries in the old table to the new table, by recomputing their positions (using the hash function)
• When should we rehash?
  – When the table is completely full
  – With quadratic probing, when the table is half-full or insertion fails
• Why double the size?
  – If n is the number of elements in the table, there must have been n/2 insertions since the previous rehash (if rehashing is done when the table is full)
  – So by making the table size 2n, a constant cost is added to each insertion

Rehashing Example
[Figure: a table with h(x) = x mod 7 and λ = 0.57; inserting 23 gives λ = 0.71; after rehashing with h(x) = x mod 17, λ = 0.29]

Applications (1)
• Compilers use hash tables to keep track of declared variables
• A hash table can be used for on-line spelling checkers: if misspelling detection (rather than correction) is important, an entire dictionary can be hashed and words checked in constant time
• Game playing programs use hash tables to store seen positions, thereby saving computation time if the position is encountered again
• Hash functions can be used to quickly check for inequality: if two elements hash to different values they must be different
• Storing sparse data
• Useful in applications when the input keys come in sorted order. This is a bad case for a binary search tree. AVL trees and B+-trees are harder to implement and they are not necessarily more efficient.
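Returning to the rehashing procedure described in the slides above, the steps (double the size, adjust to the next prime, re-insert every key with the new hash function) can be sketched in C++ as follows. Linear probing and h(x) = x mod tableSize are assumed, the helper names are hypothetical, and the keys used in the usage note are illustrative values that do not come from the deck.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // assumed sentinel for a free bin (keys are non-negative)

bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime that is >= n.
int nextPrime(int n) {
    while (!isPrime(n)) ++n;
    return n;
}

// Linear-probing insert with h(x) = x mod tableSize.
void insertLinear(std::vector<int>& table, int key) {
    assert(!table.empty() && key >= 0);
    const int m = static_cast<int>(table.size());
    for (int i = 0; i < m; ++i) {
        const int bin = (key % m + i) % m;
        if (table[bin] == EMPTY) {
            table[bin] = key;
            return;
        }
    }
}

// Rehash: allocate a table of about twice the size (next prime) and
// re-insert every key, because its position depends on the table size.
std::vector<int> rehash(const std::vector<int>& oldTable) {
    const int newSize = nextPrime(2 * static_cast<int>(oldTable.size()));
    std::vector<int> newTable(newSize, EMPTY);
    for (int key : oldTable)
        if (key != EMPTY) insertLinear(newTable, key);
    return newTable;
}
```

Rehashing a 7-bin table that holds 6, 15, 23, 24 and 13 produces a 17-bin table, since 17 is the first prime at or above 14. Every key is visited once, which is where the O(N) cost of a rehash comes from.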

Applications (2)
• Keeping track of customer account information at a bank
  – Search through records to check balances and perform transactions
• Keeping track of reservations on flights
  – Search to find empty seats, cancel/modify reservations
• Search engine
  – Looks for all documents containing a given word

Performance of Hashing
• The number of probes depends on the load factor (usually denoted by λ), which represents the ratio of entries present in the table to the number of positions in the array
• We also need to consider successful and unsuccessful searches separately
• For a chained hash table, the average number of probes for an unsuccessful search is λ and for a successful search is 1 + λ/2

Performance of Hashing (2)
• For open addressing, the formulae are more complicated but typical values are:
  Load factor             0.1     0.5     0.8     0.9     0.99
  Successful search
    Linear probes         1.05    1.6     3.4     6.2     21.3
    Quadratic probes      1.04    1.5     2.1     2.7     5.2
  Unsuccessful search
    Linear probes         1.13    2.7     15.4    59.8    430
    Quadratic probes      1.13    2.2     5.2     11.9    126
• Note that these do not depend on the size of the array or the number of entries present but only on the ratio (the load factor)

Performance of Hashing (3)
• In the worst case, searches, insertions and removals on a hash table take O(n) time
• The worst case occurs when all the keys inserted into the map collide
• The load factor λ = n/N affects the performance of a hash table
• Assuming that the hash values are like random numbers, it can be shown that the expected number of probes for an insertion with open addressing is 1 / (1 − λ)
• In practice, hashing is very fast provided the load factor is not close to 100%
• Applications of hash tables:
  – small databases
  – compilers
  – browser caches

Analysis of Hashing with Chaining
• Worst case
  – All keys hash into the same bucket, forming a single linked list.
  – insert, delete, find take O(n) time.
• Average case
  – Keys are uniformly distributed into buckets
  – O(1+N/B): N is the number of elements in a hash table, B is the number of buckets.
  – If N = O(B), then O(1) time per operation.
  – N/B is called the load factor of the hash table.

The Efficiency of Hashing
[Figure: the relative efficiency of four collision-resolution methods]

Commonly Used Hash Functions
Suppose that each key is a string. The following C++ function uses the division method to compute the address of the key:
    int hashFunction(char *key, int keyLength)
    {
        int sum = 0;
        // j < keyLength: key[keyLength] would be past the last character
        for (int j = 0; j < keyLength; j++)
            sum = sum + static_cast<int>(key[j]);
        return (sum % HTSize);  // HTSize: the table size, defined elsewhere
    } //end hashFunction
Note that the static_cast<int> operator converts key[j] to an integer.

Important Factors When Designing Hash Tables
To minimize collisions:
1. Distribute the elements evenly.
  – Use a hash function that distributes keys evenly
  – Make the table size, m, a prime number not near a power of two if using a division method hash function
2. Use a load factor, λ = n / m, that's appropriate for the implementation.
  – 1.0 or less for chaining (i.e., n ≤ m).
  – 0.5 or less for linear or quadratic probing or double hashing (i.e., n ≤ m / 2)

Hashing in the real world
• Search time varies, but on average it is great, no matter how much information!
• Udi Manber (chief scientist, Yahoo!): "The most important techniques behind Yahoo! are: hashing, hashing, hashing, and hashing".
• Lots of other critical applications in databases, search engines, algorithms, etc.
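To connect the chaining analysis with the string hash function shown earlier in this part of the deck, here is a minimal C++ sketch of a chained hash table for string keys. The class name `ChainedTable` and its members are hypothetical, and the hash (sum of character codes modulo the bucket count) is deliberately the simple division-method one, not a recommended production hash.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Minimal separate-chaining table for strings (a sketch, not a full container).
class ChainedTable {
public:
    explicit ChainedTable(std::size_t bucketCount) : buckets_(bucketCount), size_(0) {
        assert(bucketCount > 0);
    }

    // Division-method hash: sum of the character codes modulo the bucket count.
    std::size_t bucketOf(const std::string& key) const {
        std::size_t sum = 0;
        for (unsigned char c : key) sum += c;
        return sum % buckets_.size();
    }

    bool contains(const std::string& key) const {
        const std::list<std::string>& chain = buckets_[bucketOf(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    // Appends the key to its chain; returns false if it was already present.
    bool insert(const std::string& key) {
        if (contains(key)) return false;
        buckets_[bucketOf(key)].push_back(key);
        ++size_;
        return true;
    }

    std::size_t chainLength(std::size_t bucket) const { return buckets_[bucket].size(); }

    // Load factor N/B: stored elements divided by number of buckets.
    double loadFactor() const { return static_cast<double>(size_) / buckets_.size(); }

private:
    std::vector<std::list<std::string>> buckets_;
    std::size_t size_;
};
```

Because the hash only adds character codes, the anagrams "abc", "bca" and "cab" all land in the same bucket and form one chain of length 3. This is a small-scale version of the worst case in the analysis, where every operation degrades to a walk along a single linked list.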
ΔΠΘ - ΜΠΔ:ΔΟΜΕΣ ΔΕΔΟΜΕΝΩΝ & ΑΡΧΕΙΩΝ 74 Comparison : other representations versus hashing • Hash tables are very good if there is a need for many searches in a reasonably stable table • Hash tables are not so good if there are many insertions and deletions, or if table traversals are needed — in this case, AVL trees are better • If there are more data than available memory then use a B-tree • Also, hashing is very slow for any operations which require the entries to be sorted – e.g. Find the minimum key ΔΠΘ - ΜΠΔ:ΔΟΜΕΣ ΔΕΔΟΜΕΝΩΝ & ΑΡΧΕΙΩΝ 75 ΔΠΘ - ΜΠΔ:ΔΟΜΕΣ ΔΕΔΟΜΕΝΩΝ & ΑΡΧΕΙΩΝ 76 Παραδείγματα & Ασκήσεις ΔΠΘ-ΜΠΔ:ΔΟΜΕΣ ΔΕΔΟΜΕΝΩΝ & ΑΡΧΕΙΩΝ 77 Example: Collision Resolution Using the hash function: h (k) = 2k mod 10 Insert the numbers: 2, 20, 7, 15, 3, 0, 4, 14 into a one-dimensional array of 10 elements using linear probing to resolve collisions. ΔΠΘ - ΜΠΔ:ΔΟΜΕΣ ΔΕΔΟΜΕΝΩΝ & ΑΡΧΕΙΩΝ h (k) = 2k mod 10 2, 20, 7, 15, 3, 0, 4, 14 0 1 20 15 2 3 4 0 5 6 7 3 7 8 9 2 4 14 78 ΑΣΚΗΣΗ - 1 • Give the contents of the hash table that results when you insert items with the keys : EASYQUTION in that order into an initially empty table of M = 5 lists, using separate chaining with unordered lists. Use the hash function k mod M to transform the kth letter of the alphabet into a table index, e.g., hash(I) = hash(9) = 9 % 5 = 4. ΔΠΘ-ΜΠΔ:ΔΟΜΕΣ ΔΕΔΟΜΕΝΩΝ & ΑΡΧΕΙΩΝ 79 ΑΣΚΗΣΗ - 2 • Give the contents of the hash table that results when you insert items with the keys EASYQUTION in that order into an initially empty table of size M = 16 using linear probing. • Use the hash function 11k mod M to transform the kth letter of the alphabet into a table index. ΔΠΘ-ΜΠΔ:ΔΟΜΕΣ ΔΕΔΟΜΕΝΩΝ & ΑΡΧΕΙΩΝ 80 ΑΣΚΗΣΗ - 3 • The following eight strings are to be inserted into an initially empty static hash table that has 13 available addresses (0 to 12). Use linear probing to resolve collisions. The value returned when the hash function is applied to each string is shown in the column labelled Home address. 
String     Home address
sharks     2
dragons    7
eagles     1
knights    12
eels       0
raiders    1
broncos    9
warriors   12

• Draw the hash table after the strings are inserted in the order given.

SOLUTION to Exercise 3
Address  String
0        eels
1        eagles
2        sharks
3        raiders [home=1]
4        warriors [home=12]
5        -
6        -
7        dragons
8        -
9        broncos
10       -
11       -
12       knights

EXERCISE 4
• The following hash table with chaining is given (slot: keys in its chain):
0:
1:  23, 1, 56
2:  24
3:  36, 14
4:
5:  16
6:  17
7:  7, 29
8:
9:  31, 20, 42
10:
• Insert the number 53 into the table, using the hash function that the table uses.

Solution to Exercise 4
53 = 4 x 11 + 9, so 53 mod 11 = 9.
The key 53 is added to the chain at slot 9, which now holds 31, 20, 42 and 53. All other chains are unchanged.

EXERCISE 5
• What is the complexity of hashing with chaining?

EXERCISE 6
• The following hash table, built with linear probing, is given:
Index:  0   1   2   3   4   5   6   7   8   9   10
Key:    42  1   24  14  -   16  28  7   -   31  9
– Insert the number 12 into the table, using the hash function that the table uses.
– Then search for the number 15 (determine whether or not it is in the table).
– Finally, delete the number 9.

Linear Probing (insert 12)
12 = 1 x 11 + 1, so 12 mod 11 = 1. Slots 1, 2 and 3 are occupied, so 12 goes into slot 4.
Index:  0   1   2   3   4   5   6   7   8   9   10
Key:    42  1   24  14  12  16  28  7   -   31  9

Search with linear probing (search 15)
15 = 1 x 11 + 4, so 15 mod 11 = 4. Probing slots 4, 5, 6 and 7 finds 12, 16, 28 and 7; slot 8 is empty, so the search stops.
Index:  0   1   2   3   4   5   6   7   8   9   10
Key:    42  1   24  14  12  16  28  7   -   31  9
NOT FOUND!

Deletion with linear probing: delete 9
9 = 0 x 11 + 9, so 9 mod 11 = 9. Slot 9 holds 31; the next probe, slot 10, holds 9.
Index:  0   1   2   3   4   5   6   7   8   9   10
Key:    42  1   24  14  12  16  28  7   -   31  9
FOUND!
After the deletion, slot 10 is marked D (deleted):
Index:  0   1   2   3   4   5   6   7   8   9   10
Key:    42  1   24  14  12  16  28  7   -   31  D

EXERCISE 7
h(k, i) = (h(k) + i^2) mod m
Offsets: 0, 1, 4, …
With H = h(k), we try the following cells with wraparound: H, H + 1^2, H + 2^2, H + 3^2, …
Insert keys: 10, 23, 14, 9, 16, 25, 36, 44, 33
Table (initially empty), slots: 0 1 2 3 4 5 6 7 8 9

Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k, i) = (h1(k) + i * h2(k)) mod 13
• Insert key 14:
h(14, 0) = 14 mod 13 = 1
h(14, 1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5
h(14, 2) = (h1(14) + 2 h2(14)) mod 13 = (1 + 8) mod 13 = 9
Slots 1 and 5 are already occupied (by 79 and 98), so 14 is stored in slot 9:
Index:  0   1   2   3   4   5   6   7   8   9   10  11  12
Key:    -   79  -   -   69  98  -   72  -   14  -   50  -

EXERCISE 8
Table (initially empty), slots: 0 1 2 3 4 5 6 7 8 9
Hash functions:
H(K) = K mod M
G(K) = 1 + ((K/M) mod (M-1))
M =
Insert these values into the hash table in this order, resolving any collisions with double hashing: 13, 28, 33, 147, 43

Review: Open Addressing Facts
• In general, primes give the best table sizes.
• With any open addressing method of collision resolution, table performance can degrade severely as the table fills.
• Load factors between 0.6 and 0.7 are common.
• Load factors > 0.7 are undesirable.
• The expected search time depends only on the load factor, not on the table size.
• We can use the desired load factor to determine appropriate table size:
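A minimal sketch of one way to do this, assuming the definition λ = n / m used earlier in these slides: a table that must hold n keys at load factor at most λ needs m ≥ n / λ slots, and the result is rounded up to a prime as the first bullet recommends. The helper names `isPrime`, `tableSizeFor` and `insertDoubleHash` are hypothetical. The second hash is generalized here to h2(k) = 1 + (k mod (m - 2)), an assumption that reproduces the h2(k) = 1 + (k mod 11) of the double hashing example when m = 13. Keys are assumed to be non-negative and the table to have at least 3 slots.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Trial-division primality test; adequate for small table sizes.
bool isPrime(std::size_t x) {
    if (x < 2) return false;
    for (std::size_t d = 2; d * d <= x; ++d)
        if (x % d == 0) return false;
    return true;
}

// A prime m with n / m <= maxLoad (the smallest one, up to floating-point
// rounding). Assumes 0 < maxLoad <= 1.
std::size_t tableSizeFor(std::size_t n, double maxLoad) {
    auto m = static_cast<std::size_t>(
        std::ceil(static_cast<double>(n) / maxLoad));
    while (!isPrime(m)) ++m;
    return m;
}

constexpr int kEmpty = -1;  // sentinel for an unused slot (keys are >= 0)

// Double hashing insert with h1(k) = k mod m and h2(k) = 1 + (k mod (m - 2)).
// Requires table.size() >= 3. Returns the slot used, or -1 if no free slot
// was found within m probes.
int insertDoubleHash(std::vector<int>& table, int key) {
    const int m = static_cast<int>(table.size());
    const int h1 = key % m;
    const int h2 = 1 + key % (m - 2);
    for (int i = 0; i < m; ++i) {
        const int slot = (h1 + i * h2) % m;  // i-th probe position
        if (table[slot] == kEmpty) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

For the six keys of the double hashing example and a target load factor of 0.5, `tableSizeFor(6, 0.5)` gives 13, the table size used there. Inserting 79, 69, 72, 98, 50 and then 14 into a 13-slot table of `kEmpty` values (the order of the first five keys is an assumption, since the slide shows only the final table) places 14 in slot 9 after probing slots 1 and 5.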