Download hash 2 (x)

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Java ConcurrentMap wikipedia , lookup

Array data structure wikipedia , lookup

Control table wikipedia , lookup

Comparison of programming languages (associative array) wikipedia , lookup

Bloom filter wikipedia , lookup

Hash table wikipedia , lookup

Rainbow table wikipedia , lookup

Transcript
CE 221
Data Structures and Algorithms
Chapter 5: Hashing
Text: Read Weiss, §5.1-5.5
Izmir University of Economics
1
Introduction
• Search Tree ADT was disccussed.
• Now, Hash Table ADT
• Hashing is a technique used for performing
insertions and deletions in constant average
time.
• Thus, findMin, findMax and printAll are
not supported.
Izmir University of Economics
2
Tree Structures
find,
insert,
delete
worst
case
average
case
BST
N
log N
AVL
log N
log N
Izmir University of Economics
3
Goal
• Develop a structure that will allow users to
insert / delete / find records in
constant average time
–
–
–
–
structure will be a table (relatively small)
table completely contained in memory
implemented by an array
capitalizes on ability to access any element of
the array in constant time
Izmir University of Economics
4
Hash Function
• Determines position of a key in the array.
• Assume table (array) size is N (TableSize).
• Hash function hash(x) maps any key x to an int
between 0 and N−1
• For example, assume that N=15, that key x is a
non-negative integer between 0 and MAX_INT,
and hash function hash(x) = x % 15.
Izmir University of Economics
5
Choosing the Hash Functions (1)
• If keys are integers
key % TableSize ex:all keys=10*i,TableSize=10
So, It’s a good idea to make TableSize a prime
• If keys are strings, then ASCII codes of chars
keySize1
key[i] % TableSize, when TableSize=10,007 and

i 0
keys are at most 8 chars in length (127*8=1,016)
• The base 26+1=27 representation of the first 3
letters of the key
(key[0] +27* key[1] + 729* key[2] ) % TableSize
(263=17,576 but only 2,851 combinations in English)
Izmir University of Economics
6
Choosing the Hash Functions (2)
• Another hash
function involving
all characters
keySize 1
 key[keySize  i  1] * 37
i
i 0
• compute this by
Horner’s Rule
Izmir University of Economics
7
Hash Function
Let hash(x) = x % 15. Then,
if x = 25 129
35 2501
f(x) = 10
9
5
11
47
2
36
6
Storing the keys in the array is straightforward:
0
_
1
_
2
47
3
_
4
_
5
35
6
36
7
_
8
9
_ 129
10
11
25 2501
12
_
13
_
14
_
• Thus, delete and find can be done in O(1), and
also insert, except…
Izmir University of Economics
8
Hash Function
What happens when you try to insert: x = 65 ?
x
= 65
hash(x) = 5
0
_
1
_
2
47
3
_
4
_
5
6
35 36
65(?)
7
_
8
9
_ 129
10
11
25 2501
12
_
13
_
14
_
• If, when an element is inserted, it hashes to the same
value as an already inserted element, this is called a
collision.
Izmir University of Economics
9
Handling Collisions
• Separate Chaining
• Open Addressing
– Linear Probing
– Quadratic Probing
– Double Hashing
Izmir University of Economics
10
Handling Collisions
Separate Chaining
Izmir University of Economics
11
Separate Chaining
• Keep a list of elements
that hash to the same
value
• New elements can be
inserted at the front of
the list
• ex: x=i2 and
hash(x)=x%10
Izmir University of Economics
12
Performance of Separate Chaining
• Load factor of a hash table, λ
• λ = N/M (# of elements in the table/TableSize)
• So the average length of list is λ
• Search Time = Time to evaluate hash function +
the time to traverse the list
• Unsuccessful search= λ nodes are examined
• Successful search=1 + ½* (N-1)/M (the node searched +
half the expected # of other nodes) =1+1/2 *λ
• Observation: Table size is not important but load
factor is.
• For separate chaining make λ  1
Izmir University of Economics
13
Separate Chaining: Disadvantages
• Parts of the array might never be used.
• As chains get longer, search time increases
to O(N) in the worst case.
• Constructing new chain nodes is relatively
expensive (still constant time, but the
constant is high).
• Is there a way to use the “unused” space
in the array instead of using chains to
make more space?
Izmir University of Economics
14
Handling Collisions
Open Addressing
Izmir University of Economics
15
Handling Collisions
Linear Probing
An alternative to resolving collisions with linked
lists is to try alternative cells until an empty cell is
found.
Alternative Cells are h0(x), h1(x), h2(x),... where
hi(x)=(hash(x)+f(i)) % TableSize
with f(0)=0, f is the collision resolution strategy.
Because all the data go inside the table
For Linear probing, λ < 0.5
Izmir University of Economics
16
Linear Probing
In linear probing f(i)=i (Popular Choice)
Let key x be stored in element hash(x)=t of the array
0
1
2
47
3
4
5
6
35 36
65(?)
7
8
9
129
10
11
25 2501
12
13
14
What do you do in case of a collision?
• If the hash table is not full, attempt to store key in
the next array element (in this case (t+1)%N,
(t+2)%N, (t+3)%N …)
until you find an empty slot.
Izmir University of Economics
17
Linear Probing
Where do you store 65 ?
0
1
2
47
3
4
5
6
7
35 36 65



attempts
8
9
129
Izmir University of Economics
10
11
25 2501
12
13
14
18
Linear Probing Performance (1)
• If the table is relatively empty, blocks of
occupied cells start forming (primary
clustering)
• Expected # of probes
• for insertions and unsuccessful searches is
½(1+1/(1-λ)2)
• for successful searches is
½(1+1/(1-λ))
Izmir University of Economics
19
Linear Probing Performance (2)
• Assumptions: If clustering is not a problem, large table,
probes are independent of each other
• Expected # of probes for unsuccessful searches
(=expected # of probes until an empty cell is found)


• Expected # of probes for successful searches
i * (1   ) * i 1 
i 0
1
(fraction of empty cells = 1 -λ)
1 
(=expected # of probes when an element was inserted)
(=expected # of probes for an unsuccessful search) 1
• Average cost of an insertion
I ( ) 
1
1 


1
1
1
dx  ln
(earlier insertion are cheaper)
 1 x
 1 
0
Izmir University of Economics
20
Linear Probing
• Eliminates need for separate data structures
(chains), and the cost of constructing nodes.
• Leads to problem of clustering. Elements tend
to cluster in dense intervals in the array.



 
• Search efficiency problem remains.
• Deletion becomes trickier….
Izmir University of Economics
21
Deletion Problem -- SOLUTION
• Standard deletion cannot be performed in a probing hash
table, because the cell might have caused a collison to go
past it. “Lazy” deletion
• Each cell is in one of 3 possible states:
– active
– empty
– deleted
• For Find or Delete
– only stop search when EMPTY state detected (not DELETED)
Izmir University of Economics
22
Deletion-Aware Algorithms
• Insert
– call Find, if cell = active
– if Cell = deleted or empty
already there
insert, cell = active
• Find
– cell empty
– cell deleted
– cell active
NOT found
if key == key -> NOT FOUND
else H = (H + 1) mod TS
if key == key -> FOUND
else H = (H + 1) mod TS
• Delete
–
–
–
–
call Find,
cell active
cell deleted
cell empty
DELETE; cell=deleted
NOT found
NOT found
Izmir University of Economics
23
Handling Collisions
Quadratic Probing
Izmir University of Economics
24
Quadratic Probing
In quadratic probing f(i)=i2 (Popular Choice)
Let key x be stored in element hash(x)=t of the array
0
1
2
47
3
4
5
6
35 36
65(?)
7
8
9
129
10
11
25 2501
12
13
14
What do you do in case of a collision?
• If the hash table is not full, attempt to store key in
array elements (t+12)%N, (t+22)%N, (t+32)%N …
until you find an empty slot.
Izmir University of Economics
25
Quadratic Probing
Where do you store 65 ? hash(65)=t=5
0
1
2
47
3
4
5
6
7
35 36


t t+1
attempts
8
9
129

t+4
Izmir University of Economics
10
11
25 2501
12
13
14
65

t+9
26
Quadratic Probing
• Theorem: If quadratic probing is used, and the table size is prime,
then a new element can always be inserted if the table is at least
half empty.
• Proof: Let TableSize be an odd prime > 3,
-prove first TableSize / 2 alternative locations are all distinct.
-Assume two of these locations are the same and i  j . Then,
h( x)  i 2  h( x)  j 2 (mod TableSize)
i 2  j 2 (mod TableSize)
i 2  j 2  0 (mod TableSize)
(i  j )(i  j )  0 (mod TableSize)
• Since TableSize is prime and i and j are distinct (also less than
floor(TableSize)), this is not possible. It follows that the first M/2
alternative are all distinct, and an insertion must succeed if the
table is at least half full.
Izmir University of Economics
27
Quadratic Probing
• Tends to distribute keys better than linear probing,
alleviates the problem of clustering (primary
clustering.
• There remains the problem of secondary clustering
in which elements that hash to the same position will
probe the same alternative cells
• Runs the risk of an infinite loop on insertion, unless
precautions are taken.
• E.g., consider inserting the key 16 into a table of size
16, with positions 0, 1, 4 and 9 already occupied.
• Therefore, table size should be prime.
Izmir University of Economics
28
Handling Collisions
Double Hashing
Izmir University of Economics
29
Double Hashing
• f(i)=i*hash2(x) is a popular choice
• hash2(x)should never evaluate to zero
• Now the increment is a function of the key
– The slots visited by the hash function will vary even if
the initial slot was the same
– Avoids clustering
• Theoretically interesting, but in practice slower
than quadratic probing, because of the need to
evaluate a second hash function.
Izmir University of Economics
30
Double Hashing
• Typical second hash function
hash2(x)=R − ( x % R )
where R is a prime number, R < N
Izmir University of Economics
31
Double Hashing
Where do you store 99 ? hash(99)=t=9
Let hash2(x) = 11 − (x % 11), hash2(99)=d=11
Note: R=11, N=15
• Attempt to store key in array elements (t+d)%N,
(t+2d)%N, (t+3d)%N …
Array:
0
14
1
2
16 47

t+22
attempts
3
4
5
35

t+11
6
36
7
65
8
9
129

t
10
11
25 2501
12
99

t+33
13
14
29
Where would you store: 127?
Izmir University of Economics
32
Double Hashing
Let f2(x)= 11 − (x % 11)
hash2(127)=d=5
Array:
0
14
1
16
2
47

t+10
attempts
3
4
5
35
6
36
7
65

t
8
9
129
10
11
25 2501
12
99

t+5
13
14
29
Infinite loop!
Izmir University of Economics
33
Rehashing
• If the table gets too full, the running times for the
operations will start taking too long.
• When the load factor exceeds a threshold, double
the table size (smallest prime > 2 * old table size).
• Rehash each record in the old table into the new
table.
• Expensive: O(N) work done in copying.
• However, if the threshold is large (e.g., ½), then
we need to rehash only once per O(N) insertions,
so the cost is “amortized” constant-time.
Izmir University of Economics
34
Factors affecting efficiency
– Choice of hash function
– Collision resolution strategy
– Load Factor
• Hashing offers excellent performance for
insertion and retrieval of data.
Izmir University of Economics
35
Comparison of Hash Table & BST
Average Speed
Find Min/Max
Items in a range
Sorted Input
BST
O(log2N)
Yes
Yes
Very Bad
HashTable
O(1)
No
No
No problems
• Use HashTable if there is any suspicion of
SORTED input & NO ordering information is
required.
Izmir University of Economics
36
Homework Assignments
• 5.1, 5.2, 5.12, 5.14
• You are requested to study and solve the
exercises. Note that these are for you to
practice only. You are not to deliver the
results to me.
Izmir University of Economics
37