Dictionaries, Tables
Hashing
TCSS 342
1/51
The Dictionary ADT
• a dictionary (table) is an abstract model of a
database
• like a priority queue, a dictionary stores
key-element pairs
• the main operation supported by a
dictionary is searching by key
2/51
Examples
• Telephone directory
• Library catalogue
• Books in print: key ISBN
• FAT (File Allocation Table)
3/51
Main Issues
• Size
• Operations: search, insert, delete? Create reports? List?
• What will be stored in the dictionary?
• How will items be identified?
4/51
The Dictionary ADT
• simple container methods:
– size()
– isEmpty()
– elements()
• query methods:
– findElement(k)
– findAllElements(k)
5/51
The Dictionary ADT
• update methods:
– insertItem(k, e)
– removeElement(k)
– removeAllElements(k)
• special element
– NO_SUCH_KEY, returned by an unsuccessful
search
6/51
Implementing a Dictionary
with a Sequence
• unordered sequence
– searching and removing takes O(n) time
– inserting takes O(1) time
– applications to log files (frequent insertions,
rare searches and removals)
Example sequence: 34 14 12 22 18
7/51
Implementing a Dictionary
with a Sequence
• array-based ordered sequence
(assumes keys can be ordered)
- searching takes O(log n) time (binary search)
- inserting and removing takes O(n) time
- application to look-up tables
(frequent searches, rare insertions and removals)
Example sequence: 12 14 18 22 34
8/51
Binary Search
• narrow down the search range in stages
• “high-low” game
• findElement(22)
Array: 2 4 5 7 8 9 12 14 17 19 22 25 27 28 33 37
[Figure: low points to the first element, mid to the middle element, high to the last]
9/51
Binary Search
Array: 2 4 5 7 8 9 12 14 17 19 22 25 27 28 33 37
[Figure: three snapshots of the search for 22; low, mid, and high narrow the range in each stage until low = mid = high at 22]
10/51
Pseudocode for Binary Search
Algorithm BinarySearch(S, k, low, high)
  if low > high then
    return NO_SUCH_KEY
  else
    mid ← (low + high) / 2
    if k = key(mid) then
      return key(mid)
    else if k < key(mid) then
      return BinarySearch(S, k, low, mid - 1)
    else
      return BinarySearch(S, k, mid + 1, high)
11/51
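As a concrete illustration of the pseudocode above, here is a minimal Java sketch, assuming the dictionary's keys sit in a sorted int array and using -1 to stand in for NO_SUCH_KEY (both choices are assumptions made only for this example):

public class BinarySearchDemo {
    static final int NO_SUCH_KEY = -1;   // stand-in sentinel for an unsuccessful search

    static int binarySearch(int[] s, int k, int low, int high) {
        if (low > high) {
            return NO_SUCH_KEY;                            // range is empty
        }
        int mid = (low + high) / 2;                        // middle of the current range
        if (k == s[mid]) {
            return s[mid];                                 // key found
        } else if (k < s[mid]) {
            return binarySearch(s, k, low, mid - 1);       // search the lower half
        } else {
            return binarySearch(s, k, mid + 1, high);      // search the upper half
        }
    }

    public static void main(String[] args) {
        int[] s = {2, 4, 5, 7, 8, 9, 12, 14, 17, 19, 22, 25, 27, 28, 33, 37};
        System.out.println(binarySearch(s, 22, 0, s.length - 1));   // 22
        System.out.println(binarySearch(s, 23, 0, s.length - 1));   // -1
    }
}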
Running Time of Binary
Search
• The range of candidate items to be searched
is halved after each comparison
Comparison    Search Range
0             n
1             n/2
...           ...
i             n/2^i
log2 n        1
12/51
Running Time of Binary
Search
• In the array-based implementation, access
by rank takes O(1) time, thus binary search
runs in O(log n) time
• Binary Search is applicable only to Random
Access structures (Arrays, Vectors…)
13/51
Implementations
• Sorted? Unsorted?
• Elementary: arrays, vectors, linked lists
– Organization: none (log file), sorted, hashed
• Advanced: balanced trees
14/51
Skip Lists
• Simulate Binary Search on a linked list.
• A linked list allows easy insertion and
deletion.
• http://www.epaperpress.com/s_man.html
15/51
Hashing
• Place item with key k in position h(k).
• Hope: h(k) is 1-1.
• Requires: a unique key (unless multiple items are
allowed). The key must be protected from change
(use an abstract class that sets it only in its
constructor).
• Keys must be “comparable”.
16/51
Key class
public abstract class KeyID {
  private Comparable searchKey;
  public KeyID(Comparable m) {
    searchKey = m;
  } // Only one constructor
  public Comparable getSearchKey() {
    return searchKey;
  }
}
17/51
Hash Tables
• RT&T is a large phone company, and they
want to provide enhanced caller ID
capability:
– given a phone number, return the caller's name
– phone numbers are in the range 0 to R = 10^10 − 1
– n is the number of phone numbers used
– want to do this as efficiently as possible
18/51
Alternatives
• There are a few ways to design this dictionary:
• Balanced search tree (AVL, red-black, 2-4 trees,
B-trees) or a skip list with the phone number as
the key: O(log n) query time and O(n) space. Good
space usage and search time, but can we reduce the
search time to constant?
• A bucket array indexed by the phone number has
optimal O(1) query time, but there is a huge
amount of wasted space: O(n + R)
19/51
Bucket Array
• Each cell is thought of as a bucket or a container
– Holds key-element pairs
– In array A of size N, an element e with key k is inserted
in A[k].
Table operations without searches!
[Figure: bucket array indexed by phone number from 000-000-0000 to 999-999-9999; almost every cell is (null), and the cell for 401-863-7639 holds Roberto]
Note: we need 10,000,000,000 buckets!
20/51
Generalized indexing
• Hash table
– Data storage location associated with a key
– The key need not be an integer, but keys must
be comparable.
21/51
Hash Tables
• A data structure
• The location of an item is determined
– Directly as a function of the item itself
– Not by a sequence of trial and error comparisons
• Commonly used to provide faster searching.
• Comparison of search times:
– O(n) for linear search
– O(log n) for binary search
– O(1) for a hash table
22/51
Examples:
• A symbol table constructed by a compiler: stores identifiers and information about them in an array.
• File systems:
– i-node location of a file in a file system.
• Personal records:
– personal information retrieval based on key.
23/51
Hashing Engine
[Diagram: item key → hashing engine (position calculator) → table position]
24/51
Example
• Insert item (401-863-7639, Roberto) into a table of size 5:
• Calculate: 4018637639 mod 5 = 4, and insert item (401-863-7639, Roberto) in position 4 of the table (array, vector).
• A lookup uses the same process: use the hash engine to map the key to a position, then check the array cell at that position.
[Figure: table positions 0–4, with (401-863-7639, Roberto) stored in position 4]
25/51
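A tiny Java sketch of this calculation (the class name is made up for the example; note the phone number needs a long because it does not fit in an int):

public class PhoneHashDemo {
    public static void main(String[] args) {
        long phone = 4018637639L;              // 401-863-7639 as a number; needs a long
        int tableSize = 5;                     // table size used on the slide
        int position = (int) (phone % tableSize);
        System.out.println(position);          // prints 4, matching the slide
    }
}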
Chaining
• The expected search/insertion/removal time
is O(n/N), provided the indices are
uniformly distributed
• The performance of the data structure can
be fine-tuned by changing the table size N
26/51
From Keys to Indices
• The mapping of keys to indices of a hash table is
called a hash function
• A hash function is usually the composition of two
maps:
– hash code map: key → integer
– compression map: integer → [0, N − 1]
• An essential requirement of the hash function is to
map equal keys to equal indices.
• A “good” hash function is fast and minimizes the
probability of collisions
27/51
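The composition can be sketched directly in Java; hashCodeMap and compress below are illustrative stand-ins for the concrete maps discussed on the next slides:

public class HashFunctionDemo {
    static int hashCodeMap(String key) {              // hash code map: key -> integer
        return key.hashCode();
    }

    static int compress(int code, int n) {            // compression map: integer -> [0, N-1]
        return (code & 0x7fffffff) % n;               // clear the sign bit, then take mod
    }

    static int hash(String key, int n) {              // h(k) = compress(hashCode(k))
        return compress(hashCodeMap(key), n);
    }

    public static void main(String[] args) {
        System.out.println(hash("Roberto", 13));      // some index in 0..12
    }
}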
Perfect hash functions
• A perfect hash function maps each key to a
unique position.
• A perfect hash function can be constructed
if we know in advance all the keys to be
stored in the table (almost never…)
28/51
A good hash function
1. Be easy and fast to compute
2. Distribute items evenly throughout the
hash table
3. Allow efficient collision resolution.
29/51
Popular Hash-Code Maps
• Integer cast: for numeric types with 32 bits
or less, we can reinterpret the bits of the
number as an int
• Component sum: for numeric types with
more than 32 bits (e.g., long and double),
we can add the 32-bit components.
30/51
Sample of hash functions
• Digit selection:
h(2536924520) = 590
(select the 2nd, 5th, and last digits).
This is usually not a good hash function. It
will not distribute keys evenly.
A hash function should use every part of the
key.
31/51
Sample (continued)
• Folding: add all digits
• Modulo arithmetic:
h(key) = h(x) = x mod table_size.
Modulo arithmetic is a very popular basis
for hash functions. To improve the chance of
an even distribution, table_size should be a
prime number. If n is the number of items,
32/51
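A small Java sketch combining the two ideas, folding the decimal digits of a numeric key and then compressing with mod; the prime table size 13 is chosen only for illustration:

public class FoldAndModDemo {
    static int fold(long key) {                        // folding: add all decimal digits
        int sum = 0;
        for (long k = Math.abs(key); k > 0; k /= 10) {
            sum += (int) (k % 10);
        }
        return sum;
    }

    static int hash(long key, int tableSize) {         // modulo-arithmetic compression
        return fold(key) % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(hash(2536924520L, 13));     // digit sum 38, 38 mod 13 = 12
    }
}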
Popular Hash-Code Maps
• Polynomial accumulation: for strings of a
natural language, combine the character
values (ASCII or Unicode) a_0 a_1 ... a_(n-1) by
viewing them as the coefficients of a
polynomial: a_0 + a_1·x + ... + a_(n-1)·x^(n-1)
• For instance, choosing x = 33, 37, 39, or 41
gives at most 6 collisions on a vocabulary
of 50,000 English words.
33/51
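A sketch of polynomial accumulation for strings, accumulating the character codes Horner-style with x = 33 (the same scheme Java's String.hashCode uses, with x = 31); class and method names are illustrative:

public class PolynomialHashDemo {
    static int polyHash(String s, int x) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * x + s.charAt(i);       // accumulate the character codes Horner-style
        }
        return h;                          // overflow simply wraps around, which is fine here
    }

    public static void main(String[] args) {
        // Unlike a component sum, this distinguishes permutations of the same letters.
        System.out.println(polyHash("stop", 33));
        System.out.println(polyHash("pots", 33));
    }
}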
Popular Hash-Code Maps
• Why is the component-sum hash code bad
for strings? (It ignores character order, so
anagrams such as "stop", "tops", and "pots"
all get the same hash code.)
34/51
Popular Compression Maps
• Division: h(k) = |k| mod N
– the choice N = 2^k is bad because not all the bits are
taken into account
– the table size N is usually chosen as a prime
number
– certain patterns in the hash codes are propagated
• Multiply, Add, and Divide (MAD):
– h(k) = |ak + b| mod N
– eliminates patterns provided a mod N ≠ 0
– same formula used in linear congruential (pseudo)
random number generators
35/51
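A sketch of the MAD compression map; the constants a, b, and the prime N below are arbitrary illustrative choices satisfying a mod N ≠ 0:

public class MadCompressionDemo {
    static int compress(int hashCode, int a, int b, int n) {
        return (int) (Math.abs((long) a * hashCode + b) % n);   // h(k) = |a*k + b| mod N
    }

    public static void main(String[] args) {
        int n = 101;                       // prime table size (illustrative)
        int a = 7, b = 13;                 // illustrative constants; a mod n != 0
        System.out.println(compress(123456789, a, b, n));       // an index in 0..100
    }
}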
Java Hash
• Java provides a hashCode() method for the
Object class, which typically returns the 32-bit memory address of the object.
• This default hash code would work poorly
for Integer and String objects
• The hashCode() method should be suitably
redefined by classes.
36/51
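A sketch of what "suitably redefined" can look like: a small key class that overrides hashCode() together with equals() (the class and its field are invented for this example):

public final class PhoneKey {
    private final long number;             // illustrative field, e.g. 4018637639L

    public PhoneKey(long number) {
        this.number = number;
    }

    @Override
    public boolean equals(Object other) {  // kept consistent with hashCode()
        return other instanceof PhoneKey && ((PhoneKey) other).number == this.number;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(number);      // fold the 64-bit value into 32 bits
    }
}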
Collision
• A collision occurs when two distinct items
are mapped to the same position.
• Insert (401-863-9350, Andy): 4018639350 mod 5 = 0, so it goes to position 0.
• Now insert (401-863-2234, Devin):
4018632234 mod 5 = 4. We have a collision!
[Figure: table of size 5 with (401-863-9350, Andy) in position 0 and (401-863-7639, Roberto) in position 4]
37/51
Collision Resolution
• How to deal with two keys which map to
the same cell of the array?
• We need collision-handling policies and well-designed
hashing engines that minimize collisions.
38/51
Chaining I
• Use chaining
– Each position is viewed as a container of a list
of items, not a single item. All items in this list
share the same hash value.
39/51
Chaining II
[Figure: table positions 0–4, each holding a linked list of the items that hash to it]
40/51
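A minimal sketch of chaining in Java, assuming String keys and using LinkedList buckets; the class and method names are illustrative, not from the slides:

import java.util.LinkedList;

public class ChainedHashTable {
    static class Entry {                                        // a key-element pair
        final String key;
        final Object element;
        Entry(String key, Object element) { this.key = key; this.element = element; }
    }

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    private int index(String key) {                             // hash code + compression
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void insert(String key, Object element) {
        buckets[index(key)].add(new Entry(key, element));       // O(1) insertion
    }

    public Object find(String key) {
        for (Entry e : buckets[index(key)]) {                   // scan only one chain
            if (e.key.equals(key)) return e.element;
        }
        return null;                                            // unsuccessful search
    }
}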
Collisions resolution policies
• A key is mapped to an already occupied table
location
– what to do?!?
• Use a collision handling technique
– Chaining (may have fewer buckets than items)
– Open Addressing (load factor < 1)
• Linear Probing
• Quadratic Probing
• Double Hashing
41/51
Linear Probing
• If the current location is used, try the next
table location
• linear_probing_insert(K)
  if (table is full) error
  probe = h(K)
  while (table[probe] occupied)
    probe = (probe + 1) mod M
  table[probe] = K
42/51
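A Java rendering of the pseudocode above for integer keys, using null to mark empty cells (an assumption of this sketch); inserting the keys from the example slide below with M = 13 reproduces that table:

public class LinearProbingDemo {
    private final Integer[] table;                     // null marks an empty cell
    private int used = 0;

    public LinearProbingDemo(int m) { table = new Integer[m]; }

    private int h(int k) { return Math.floorMod(k, table.length); }

    public void insert(int k) {
        if (used == table.length) throw new IllegalStateException("table is full");
        int probe = h(k);
        while (table[probe] != null) {                 // current location occupied?
            probe = (probe + 1) % table.length;        // try the next table location
        }
        table[probe] = k;
        used++;
    }

    public boolean contains(int k) {
        int probe = h(k);
        for (int i = 0; i < table.length && table[probe] != null; i++) {
            if (table[probe] == k) return true;        // walked to the key
            probe = (probe + 1) % table.length;
        }
        return false;                                  // stopped at an empty slot
    }

    public static void main(String[] args) {
        LinearProbingDemo t = new LinearProbingDemo(13);          // M = 13, as in the example
        for (int k : new int[]{18, 41, 22, 44, 59, 32, 31, 73}) t.insert(k);
        System.out.println(t.contains(44));                       // true, found after probing
    }
}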
Linear Probing
• Lookups walk along the table until the key or an
empty slot is found
• Uses less memory than chaining
– don’t have to store all those links
• Slower than chaining
– may have to walk along the table for a long way
• Deletion is more complex
– either mark the deleted slot
– or fill in the slot by shifting some elements down
43/51
Linear Probing Example
• h(k) = k mod 13
• Insert keys:
• 18 41 22 44 59 32 31 73
Final table (h(k) = k mod 13):
index: 0  1  2  3  4  5  6  7  8  9  10 11 12
key:   -  -  41 -  -  18 44 59 32 22 31 73 -
44/51
Double Hashing
• Use two hash functions
• If M is prime, the probe sequence will eventually
examine every position in the table
• double_hash_insert(K)
  if (table is full) error
  probe = h1(K)
  offset = h2(K)
  while (table[probe] occupied)
    probe = (probe + offset) mod M
  table[probe] = K
45/51
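The same skeleton with a second hash supplying the probe offset; h1 and h2 below follow the example on the next slide (M = 13, h2(K) = 8 − K mod 8):

public class DoubleHashingDemo {
    private final Integer[] table = new Integer[13];           // M = 13, a prime
    private int used = 0;

    private int h1(int k) { return Math.floorMod(k, table.length); }
    private int h2(int k) { return 8 - Math.floorMod(k, 8); }  // offset to add, never 0

    public void insert(int k) {
        if (used == table.length) throw new IllegalStateException("table is full");
        int probe = h1(k);
        int offset = h2(k);
        while (table[probe] != null) {                 // occupied: jump ahead by the offset
            probe = (probe + offset) % table.length;
        }
        table[probe] = k;
        used++;
    }
}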
Double Hashing
• Many of the same (dis)advantages as linear
probing
• Distributes keys more uniformly than linear
probing does
46/51
Double Hashing Example
• h1(K) = K mod 13
• h2(K) = 8 - K mod 8
– we want h2 to be an offset to add
– 18 41 22 44 59 32 31 73
– h1(44) = 5 (occupied); h2(44) = 8 − (44 mod 8) = 4, so probe (5 + 4) mod 13 = 9 (occupied), then (9 + 4) mod 13 = 0: 44 goes to position 0.
Final table (keys 18 41 22 44 59 32 31 73):
index: 0  1  2  3  4  5  6  7  8  9  10 11 12
key:   44 -  41 73 -  18 32 59 31 22 -  -  -
47/51
Why so many Hash functions?
• "It's different strokes for different folks."
• We seldom know the nature of the object
that will be stored in our dictionary.
48/51
A FAT Example
• Directory: key: file name. Data: (time, date, size,
…) and the location of the file's first block in the FAT.
• If the first block is in physical location #23 (disk
block number), look up position #23 in the FAT. That
entry either marks the end of the file or holds the
next block number on disk.
• Example: directory entry: block #4
• FAT (indices 0–10): x x x F 5 6 10 x 23 25 3
The file occupies blocks 4, 5, 6, 10, 3.
49/51
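A small sketch of following the block chain in the FAT above, with −1 standing in for the end-of-file marker F and 0 for the unused entries marked x (both stand-ins are assumptions of this example):

import java.util.ArrayList;
import java.util.List;

public class FatDemo {
    static List<Integer> blocksOf(int firstBlock, int[] fat) {
        List<Integer> blocks = new ArrayList<>();
        int b = firstBlock;
        while (b != -1) {                      // -1 plays the role of the F (end-of-file) mark
            blocks.add(b);
            b = fat[b];                        // the FAT entry holds the next block number
        }
        return blocks;
    }

    public static void main(String[] args) {
        //          index:  0  1  2   3  4  5  6   7  8   9  10
        int[] fat =       { 0, 0, 0, -1, 5, 6, 10, 0, 23, 25, 3 };
        System.out.println(blocksOf(4, fat));  // prints [4, 5, 6, 10, 3]
    }
}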