Data Structures & Computational Complexity
ECE 244
2013
Vaughn Betz
Data Structure Complexity
Operation              | Unordered List                       | Vector                        | Binary Search Tree
Build (N items)        | O(N)                                 | O(N)                          | O(N log N)
Find item by value/key | O(N) in general; O(1) at head / tail | O(N)                          | O(log N)
Insert item            | O(1)                                 | O(N) in general; O(1) at back | O(log N)
Delete item            | O(N) in general; O(1) at head / tail | O(N) in general; O(1) at back | O(log N)
• BST: all operations reasonably fast
– Good match to database (find, insert, delete common)
• vector:
– Slow to insert/delete except at end/back
– Slow find by value/key: must search through everything to find entry
• Unordered List: fast insert, slow find and delete
– Good if you only find/delete at head and tail
– Stacks and queues
Important Vector Special Case
• If key is an integer
• AND range of key limited: [0 to N-1]
Can store in array/vector, use key as index
MyObject array[N];
int key;
MyObject object_with_key = array[key];
array: [MyObject 0, MyObject 1, MyObject 2, ..., MyObject N-2, MyObject N-1]
• Build: O(N)
• Find, Insert, Delete: O(1)
• Fastest data structure!
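The special case above can be sketched as a small direct-address table. This is only an illustration, assuming integer keys in [0, N-1]; the names `DirectTable`, `Record`, and the `present` flag are hypothetical and do not come from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type (the slides call it MyObject).
struct Record {
    bool present = false;   // true when this slot holds an item
    std::string name;
};

// Direct-address table: the key itself is the array index.
class DirectTable {
public:
    explicit DirectTable(std::size_t n) : slots(n) { assert(n > 0); }

    void insert(std::size_t key, const std::string &name) {   // O(1)
        Record &r = slots.at(key);   // at() throws if key is out of range
        r.present = true;
        r.name = name;
    }
    bool find(std::size_t key, std::string &name) const {     // O(1)
        const Record &r = slots.at(key);
        if (!r.present) return false;
        name = r.name;
        return true;
    }
    void remove(std::size_t key) {                             // O(1)
        slots.at(key).present = false;
    }

private:
    std::vector<Record> slots;   // creating N empty slots is the O(N) build
};
```

A table built as `DirectTable t(8)` answers `t.find(3, s)` by reading one array entry, which is why no search is needed.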
Hash Tables
Hash Table - Idea
• Stores data in a specially indexed array
• Enables O(1) find, insert, delete operations
– Very fast!
– Trade-off: wastes some space
• Each item has a unique key
– Key range large (e.g. student numbers: 900123456)
– Many more keys than items to store
– Can’t just make array[MAX_KEY]: too big
• Know roughly how many items to store
– E.g. number of students in ECE department
• Create array with more entries than we have items to store
• “hashing function” h(key) to map from key to array entry
• Reference: Introduction to Algorithms by Cormen et al., Chap. 11
Hash Table - Implementation
MyObject hash_array[M];
key (0 .. 999999999) → h(key) → index [0..M-1]
– Example:
– h(key) = key % M;
– M = 10
– key = 900200204
– h(key) = 4
hash_array: [MyObject 0, MyObject 1, ..., MyObject M-2, MyObject M-1]
Hashing Collisions
const int M = 10;
MyObject hash_array[M];
– h(key) = key % M;
– M = 10
– key0 = 789222308
– h(key0) = 8
– key1 = 900200205
– h(key1) = 5
– key2 = 988777335
– h(key2) = 5 (collision!)
• Problem: can’t overwrite data of another object/key
hash_array: [MyObject 0, MyObject 1, ..., MyObject 9]
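The collision in this example can be checked with a few lines of code. The sketch below uses the hash function from the slide, h(key) = key % M, with the table size passed as a parameter; the helper `collides` is a hypothetical name added for illustration.

```cpp
#include <cassert>

// Hash function from the slide: map a non-negative key to an index in [0, m-1].
int h(int key, int m) {
    assert(m > 0 && key >= 0);
    return key % m;
}

// Hypothetical helper: true when two different keys share the same home index.
bool collides(int key_a, int key_b, int m) {
    return key_a != key_b && h(key_a, m) == h(key_b, m);
}
```

With M = 10, `collides(900200205, 988777335, 10)` is true because both keys end in the digit 5, while key0 = 789222308 maps to index 8 and collides with neither.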
Solution 1: Hashing with Chaining
• Also known as “open hashing”
• Don’t store object at array entry i
• Instead store pointer to linked list of objects
– Every object with a key such that h(key) == i
ListofObjects hash_array[M];
key0 = 789222304
h(key0) = 4
key1 = 900200201
h(key1) = 1
key2 = 988777331
h(key2) = 1
hash_array (after inserting key0, key1, key2):
...
List 5 = NULL
List 4 → [key = 789222304, next = NULL]
List 3 = NULL
List 2 = NULL
List 1 → [key = 900200201] → [key = 988777331, next = NULL]
List 0 = NULL
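Chaining as described above might look like the following sketch. It stores bare integer keys in `std::list` chains; the class name `ChainedTable`, the `chain_length` accessor, and the duplicate check in `insert` are assumptions made for this example rather than part of the slides.

```cpp
#include <cassert>
#include <list>
#include <vector>

// Hypothetical chained ("open hashing") table of non-negative integer keys.
class ChainedTable {
public:
    explicit ChainedTable(int m) : buckets(m) { assert(m > 0); }

    bool insert(int key) {
        std::list<int> &chain = buckets[index_of(key)];
        for (int k : chain)
            if (k == key) return false;   // already stored
        chain.push_front(key);            // adding at the head is O(1)
        return true;
    }
    bool find(int key) const {
        for (int k : buckets[index_of(key)])
            if (k == key) return true;
        return false;
    }
    bool remove(int key) {
        std::list<int> &chain = buckets[index_of(key)];
        for (auto it = chain.begin(); it != chain.end(); ++it)
            if (*it == key) { chain.erase(it); return true; }
        return false;
    }
    int chain_length(int i) const { return (int) buckets[i].size(); }

private:
    int index_of(int key) const { return key % (int) buckets.size(); }
    std::vector<std::list<int>> buckets;  // one linked list per array entry
};
```

Inserting the three keys from the slide into a table of size 10 leaves a chain of length 2 at index 1 and a chain of length 1 at index 4. Because of the duplicate check, `insert` here walks the chain first; dropping that check would give the pure O(1) head insert.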
Open Hashing
• If hash function is “good”
– Spreads out keys across array evenly
• And if hash table (array) big enough
– More array locations than keys we want to insert
• Then average length of list is approx. 1
– Achieves O(1) access
Open Hashing & Hash Quality
• Bad hash function example:
– Instead of h(key) = key % M, could use h’(key) = key / 100000000
• E.g.
– key = 988777334
– h’(key) = 9
– If keys are student numbers, and many start with the same digit (entered university at same time):
• h’ will not map evenly across array
• Lists will be longer, and access slower
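The difference between the two hash functions can be seen by counting how many distinct indices a batch of keys lands on. In this sketch `h_mod` and `h_lead` stand for h and h’ from the slide, and `distinct_indices` is a hypothetical helper; any keys fed to it are made-up sample values.

```cpp
#include <cassert>
#include <vector>

int h_mod(int key, int m) { return key % m; }        // h(key) = key % M
int h_lead(int key) { return key / 100000000; }       // h'(key): leading digit of a 9-digit key

// Hypothetical helper: number of distinct table indices in a list of hash results.
int distinct_indices(const std::vector<int> &indices, int m) {
    assert(m > 0);
    std::vector<bool> used(m, false);
    int count = 0;
    for (int i : indices) {
        if (!used[i]) { used[i] = true; count++; }
    }
    return count;
}
```

For ten consecutive made-up student numbers 900200200 through 900200209 and M = 10, `h_mod` fills all ten indices while `h_lead` sends every key to index 9, so a chained table would hold all ten items in one list.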
Solution 2: Closed Hashing
• If h(key) == i, and location i occupied
– check location (i+1) % M in hash table
– store key/object there if empty
– else check location (i+2) % M
– …
const int M = 10;
MyObject hash_array[M];
key0 = 789222308
h(key0) = 8
key1 = 900200204
h(key1) = 4
key2 = 988777334
h(key2) = 4 (occupied, so key2 is stored at index 5)
hash_array:
9: <empty>
8: key=789222308
7: <empty>
6: <empty>
5: key=988777334
4: key=900200204
3 .. 0: <empty>
Closed Hashing
• Consider:
– h(key) == i
– i occupied
• Linear probing:
– Check (i+1) % M, (i+2) % M, …
– Until we find an empty slot
– Problem: used entries tend to cluster, even with a good hash function
• More probes per insert or find
• Slows down average insert and average find
• Quadratic probing:
– Check (i+1) % M, (i+4) % M, (i+9) % M, …
– Less clustering (find an empty slot faster)
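The two probe sequences can be written out directly. The functions below are an illustrative sketch (the names `linear_probes` and `quadratic_probes` are not from the slides); each returns the first `count` slots visited, starting with the home slot i itself.

```cpp
#include <cassert>
#include <vector>

// First `count` slots visited by linear probing from home slot i: i, i+1, i+2, ...
std::vector<int> linear_probes(int i, int m, int count) {
    assert(m > 0 && i >= 0 && i < m);
    std::vector<int> seq;
    for (int k = 0; k < count; k++)
        seq.push_back((i + k) % m);
    return seq;
}

// Same for quadratic probing: i, i+1, i+4, i+9, ...
std::vector<int> quadratic_probes(int i, int m, int count) {
    assert(m > 0 && i >= 0 && i < m);
    std::vector<int> seq;
    for (int k = 0; k < count; k++)
        seq.push_back((i + k * k) % m);
    return seq;
}
```

From home slot 4 in a table of size 10, linear probing visits 4, 5, 6, 7 while quadratic probing visits 4, 5, 8, 3, so keys that collide at slot 4 spread out instead of piling up next to it.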
Closed Hashing
• Double hashing (or re-hashing)
– Defines two hash functions:
– h(key) and h2(key)
– Example:
• h(key) = key % M
• h2(key) = (key * 7) % M
– First map to h(key)
– If collision, try [h(key) + h2(key)] % M
– Then [h(key) + 2 * h2(key)] % M
– Then [h(key) + 3 * h2(key)] % M
– …
– h2(key) must not be 0, otherwise every probe lands back on h(key)
• +ve: reduces clustering
• -ve: more time to compute hash functions
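A short sketch of the double-hashing probe sequence, using the example functions h and h2 from the slide. The function name `double_hash_probes` is hypothetical, keys are widened to `long long` so that `key * 7` cannot overflow for 9-digit keys, and the zero-step check is an added safeguard.

```cpp
#include <cassert>
#include <vector>

int h(long long key, int m)  { return (int) (key % m); }
int h2(long long key, int m) { return (int) ((key * 7) % m); }  // example step function from the slide

// First `count` slots visited for `key`: h, h + h2, h + 2*h2, ... (all mod m).
std::vector<int> double_hash_probes(long long key, int m, int count) {
    assert(m > 0 && key >= 0);
    int home = h(key, m);
    int step = h2(key, m);
    assert(step != 0);   // a zero step would revisit the home slot forever
    std::vector<int> seq;
    for (int k = 0; k < count; k++)
        seq.push_back((int) ((home + (long long) k * step) % m));
    return seq;
}
```

For key 900200204 and M = 11 the sequence starts 2, 5, 8, 0. With M = 10 the same key gives 4, 2, 0 and keeps visiting only even slots, because the step 8 shares a factor with 10; choosing a prime table size is one common way to avoid that.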
Closed Hashing: Class
// Storing Element type objects, using an integer key.
// Using closed hashing with linear probing in this class.
class HashTable {
private:
    Element *htable;               // [0..m_table_size-1]
    int m_table_size;
    int n_items;                   // Num items stored in table.
    int hash (int key);            // Helper function: the hash
    bool is_empty (int index);     // Helper functions
    void set_to_empty (int index);
public:
    HashTable (int table_size);
    ~HashTable ();
    bool insert (const Element &item);
    bool find (int key, Element &item);
    // "delete" is a reserved word in C++, so the delete operation is named remove.
    bool remove (const Element &item);
};
Closed Hashing: Constructor
HashTable::HashTable (int table_size) {
    n_items = 0;
    m_table_size = table_size;
    htable = new Element[m_table_size];
    // Need to store a special value for empty items,
    // so we can tell where we have spots to insert things
    // in the table.
    for (int i = 0; i < m_table_size; i++)
        set_to_empty (i);
}
Closed Hashing: Insert
bool HashTable::insert (const Element &item) {
    if (n_items == m_table_size - 1)
        return (false); // Table full! Need to resize.
    int index = hash (item.key); // "Home" spot for item.
    while (!is_empty(index)) {
        if (item.key == htable[index].key)
            return (false); // Already in table.
        // Look for free entry, using linear probing.
        index = (index + 1) % m_table_size;
    }
    // Found free entry. Store item.
    htable[index] = item;
    n_items++;
    return (true);
}
Closed Hashing: Find
bool HashTable::find (int key, Element &item) {
    int index = hash (key); // "Home" spot for item.
    while (!is_empty(index)) {
        if (key == htable[index].key) {
            // Found item!
            item = htable[index];
            return (true);
        }
        // Keep looking, using linear probing.
        index = (index + 1) % m_table_size;
    }
    // Hit an empty spot, without finding item with key.
    // Will have at least one empty spot, since insert
    // only allowed m_table_size - 1 items.
    return (false);
}
Closed Hashing: Delete
HashTable my_hash(10);
item0.key = 789222308;
my_hash.insert (item0);
item1.key = 900200205;
my_hash.insert (item1);
item2.key = 988777335;
my_hash.insert (item2);
htable (after the three inserts):
9: <empty>
8: key=789222308
7: <empty>
6: key=988777335
5: key=900200205
4 .. 0: <empty>
What happens next?
my_hash.remove (item1);  // "delete" is a C++ keyword; removes key 900200205
my_hash.find (988777335, found_item);
Closed Hashing: Delete
HashTable my_hash(10);
item0.key = 789222308;
my_hash.insert (item0);
item1.key = 900200205;
my_hash.insert (item1);
item2.key = 988777335;
my_hash.insert (item2);
my_hash.remove (item1);  // "delete" is a C++ keyword; removes key 900200205
my_hash.find (988777335, found_item);
htable (after the delete, with index 5 marked empty):
9: <empty>
8: key=789222308
7: <empty>
6: key=988777335
5: <empty> (was key=900200205)
4 .. 0: <empty>
Find will fail (return false)!
– find starts at home index 5, sees <empty>, and stops before reaching index 6
Problem: didn’t reproduce insert’s probe sequence
How can we fix?
Closed Hashing: Delete
Solution: can’t mark deleted items as empty.
Mark as “deleted” (another special value)
– OK for insert to re-use.
– But find must keep looking past them, as it probes.
– Ensures original (insert) probe sequence reproduced by find
my_hash.remove (item1);  // "delete" is a C++ keyword; removes key 900200205
my_hash.find (988777335, found_item);
htable (after the delete):
9: <empty>
8: key=789222308
7: <empty>
6: key=988777335
5: <deleted> (was key=900200205)
4 .. 0: <empty>
Now finds item in index 6
What do we need to change in insert & find code?
Closed Hashing: Fixed Insert
bool HashTable::insert (const Element &item) {
    int index;
    if (n_items == m_table_size - 1)
        return (false); // Table full! Need to resize.
    index = hash (item.key); // "Home" spot for item.
    while (!is_empty(index) && !is_deleted(index)) {
        if (item.key == htable[index].key)
            return (false); // Already in table.
        // Look for free entry, using linear probing.
        index = (index + 1) % m_table_size;
    }
    // Found free (empty or deleted) entry. Store item.
    // Note: stopping at a deleted entry skips the duplicate check
    // for items stored further along the probe sequence.
    htable[index] = item;
    n_items++;
    return (true);
}
Closed Hashing: Fixed Find
bool HashTable::find (int key, Element &item) {
    int index = hash (key); // "Home" spot for item.
    while (!is_empty(index)) {
        if (!is_deleted(index) && key == htable[index].key) {
            // Found item!
            item = htable[index];
            return (true);
        }
        // Keep looking, using linear probing.
        index = (index + 1) % m_table_size;
    }
    // Hit an empty spot, without finding item with key
    return (false); // Can't find key in table.
}
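The slides do not show the delete routine itself. The following self-contained sketch puts insert, find, and remove together with "deleted" markers. It is not the course's HashTable class: `ProbingTable`, the `State` enum, and `locate` are hypothetical names, it stores bare integer keys, and it makes two choices beyond the slide code: insert keeps probing past deleted slots so it can still detect duplicates, and every loop is bounded by the table size in case no empty slot remains.

```cpp
#include <cassert>
#include <vector>

// Hypothetical slot states; the slides use special "empty" and "deleted" values.
enum class State { Empty, Deleted, Used };

class ProbingTable {
public:
    explicit ProbingTable(int m) : keys(m, 0), state(m, State::Empty), n_items(0) {
        assert(m > 1);
    }

    bool insert(int key) {
        if (n_items == size() - 1) return false;   // table full: keep one slot free
        int free_slot = -1;                        // first deleted slot seen, if any
        int index = key % size();                  // "home" spot
        for (int probes = 0; probes < size() && state[index] != State::Empty; probes++) {
            if (state[index] == State::Used && keys[index] == key) return false;  // already in table
            if (state[index] == State::Deleted && free_slot < 0) free_slot = index;
            index = (index + 1) % size();          // linear probing
        }
        if (free_slot < 0) free_slot = index;      // no deleted slot: use the empty one
        keys[free_slot] = key;
        state[free_slot] = State::Used;
        n_items++;
        return true;
    }

    bool find(int key) const { return locate(key) >= 0; }

    bool remove(int key) {
        int index = locate(key);
        if (index < 0) return false;
        state[index] = State::Deleted;             // mark as deleted, never as empty
        n_items--;
        return true;
    }

private:
    int size() const { return (int) keys.size(); }

    // Index of `key`, or -1. Probes past deleted slots, stops at an empty one.
    int locate(int key) const {
        int index = key % size();
        for (int probes = 0; probes < size() && state[index] != State::Empty; probes++) {
            if (state[index] == State::Used && keys[index] == key) return index;
            index = (index + 1) % size();
        }
        return -1;
    }

    std::vector<int> keys;
    std::vector<State> state;
    int n_items;
};
```

Replaying the slide example (insert 789222308, 900200205, 988777335, then remove 900200205) leaves index 5 marked deleted, and `find(988777335)` still succeeds at index 6.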
String Hash Functions
• Can hash more than integers
• pair of ints, strings, …
string s = "Hash tables enable fast find operations!";
int hash_index = hash_func (s);
int hash_func (string s) {
    return (s[0] % m); // m is the hash table size
}
Good?
No!
1. Indices all from 0 to 255. If m = 5000, most indices never used.
2. Some letters (e.g. ‘e’) more likely than others
– Poor spreading over indices even from 0 to 255
3. Empty string will crash (no s[0])
Better String Hash Function
int hash_func (string s) {
    // unsigned: overflow wraps around instead of being undefined behaviour
    unsigned int hash = 0;
    int length = s.length();
    for (int i = 0; i < length; i++) {
        // Multiply by a non-power of 2 for "bit scrambling"
        hash = hash * 127 + (unsigned char) s[i];
    }
    hash = hash % m; // m is the hash table size
    return (hash);
}
1. Produces larger numbers: uses all indices
– With strings of length 4, indices up to 500 million possible
2. Each bit of hash is a scramble of the string characters
– Spreads various strings better across indices
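To compare the two string hash functions side by side, the sketch below restates both with the table size passed as a parameter instead of the global `m` used on the slides. The names `first_char_hash` and `poly_hash` are hypothetical, and returning 0 for an empty string is an assumption made here to avoid reading a missing first character.

```cpp
#include <cassert>
#include <string>

// Weak hash from the earlier slide: only the first character matters.
int first_char_hash(const std::string &s, int m) {
    assert(m > 0);
    if (s.empty()) return 0;               // assumed fallback for the empty string
    return (unsigned char) s[0] % m;
}

// Multiply-and-add hash in the style of the slide above.
int poly_hash(const std::string &s, int m) {
    assert(m > 0);
    unsigned int hash = 0;
    for (unsigned char c : s)
        hash = hash * 127 + c;             // unsigned arithmetic wraps on overflow
    return (int) (hash % (unsigned int) m);
}
```

With m = 5000, the strings "ab" and "ac" both map to index 97 under `first_char_hash`, but to 2417 and 2418 under `poly_hash`, since every character contributes to the result.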
Complexity Analysis – Open Hashing
• Let α = n / m
– n = items in table
– m = table size (array size)
– α is the “load factor”
• Worst case?
– All items could hash to one location
– Linked list with n items to search
– O(n) find and delete
– O(1) insert (at head)
Complexity Analysis – Open Hashing
• Average case
– Good hash function: all m locations equally likely
– Assume O(1) time to compute hash function
– Expected length of each linked list: n/m = α
– Find and delete are: O(1 + α)
• 1 to compute hash and go to “home” index
• α to look through a linked list of length α
– If α kept small (constant):
• O(1) find and delete
• Already had O(1) insert
– To keep α small, keep m within a constant factor of n
• Often want m >= n for highest speed
Complexity Analysis – Closed Hashing
• Worst case
– Insert, find, delete could all take O(n) time
• Average case
– Good hash function: all m locations equally likely
– Insert and unsuccessful find:
• O(1 / (1 − α))
– Successful find
• O((1 / α) ∗ ln(1 / (1 − α)))
– O(1) if α kept well below 1
• α = 0.5: average 2 probes per insert and 1.4 per successful find
– Degrades rapidly as α approaches 1
– Table fails (cannot insert) when α == 1
– Usually slightly faster than open hashing for small α
– Degrades more rapidly than open hashing for larger α
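The two average-case expressions can be evaluated directly to see how quickly they grow. This sketch simply codes the formulas 1 / (1 − α) and (1 / α) ln(1 / (1 − α)) from the slide; the function names are hypothetical, and the values are the idealized estimates for a hash function that spreads keys evenly, not measurements of a particular table.

```cpp
#include <cassert>
#include <cmath>

// Expected probes for an insert or unsuccessful find: 1 / (1 - alpha).
double probes_unsuccessful(double alpha) {
    assert(alpha >= 0.0 && alpha < 1.0);
    return 1.0 / (1.0 - alpha);
}

// Expected probes for a successful find: (1 / alpha) * ln(1 / (1 - alpha)).
double probes_successful(double alpha) {
    assert(alpha > 0.0 && alpha < 1.0);
    return (1.0 / alpha) * std::log(1.0 / (1.0 - alpha));
}
```

At α = 0.5 these give 2 and about 1.39 probes, matching the slide. At α = 0.9 they give about 10 and about 2.56, which shows the rapid degradation as the table fills.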
Average Probes vs. Load Factor
[Figure: probes (comparisons) per find, on a scale of 0 to 25, plotted against the load factor α from 0 to 1, with one curve for "Open hashing: Find" and one for "Closed hashing: Find".]