Array implementation of binary trees
The array implementation stores the elements of the tree by levels. It allows us
• to travel in the tree from a parent to a child,
• to travel from a child to a parent
(see the sketch below).

[Figure: a complete binary tree with nodes g, b, i, a, c, l, j, f stored by levels in an array.]

The array implementation is
• very efficient for complete binary trees,
• very inefficient for trees that are not close to complete,
• not a dynamic data structure.
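For concreteness, a minimal sketch of the position arithmetic, assuming the root is stored at index 0 (consistent with the heap code below, which treats n/2 - 1 as the last internal node):

int leftchild(int pos)  { return 2*pos + 1; }      // may be >= n if absent
int rightchild(int pos) { return 2*pos + 2; }      // may be >= n if absent
int parent(int pos)     { return (pos - 1) / 2; }  // for pos > 0
bool isLeaf(int pos, int n) { return pos >= n/2; } // positions n/2..n-1 are leaves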
Data structure Heap

A heap is
• a complete binary tree,
• with nodes partially sorted:

Max heap: for any node v, the value in v ≥ the value of any child of v.
Min heap: for any node v, the value in v ≤ the value of any child of v.

Heaps are used:
• as a priority queue,
• in heapsort.

We talk mostly about max heaps.

Heap ADT:

class heap {
private:
  ELEM* Heap;          // pointer to array
  int size;            // max size of heap
  int n;               // current size of heap
  void siftdown(int);  // helper for insert/delete
public:
  heap(ELEM*, int, int);          // constructor
  int heapsize() const;           // return current size
  bool isLeaf(int pos) const;     // true if pos is a leaf position
  int leftchild(int pos) const;   // return left child of pos
  int rightchild(int pos) const;  // return right child of pos
  int parent(int pos) const;      // return parent of pos
  void insert(const ELEM val);    // insert val
  ELEM removemax();               // remove and return the maximum
  ELEM remove(int pos);           // remove the element at pos
  void buildheap();               // arrange elements into a heap
};
Function siftdown places one element in its proper place in a heap:

void heap::siftdown(int pos) {
  while (!isLeaf(pos)) {
    // find j, the position of the larger of the children of pos
    int j = leftchild(pos);
    int rc = rightchild(pos);
    if ((rc < n) && (Heap[rc] > Heap[j])) j = rc;
    if (Heap[j] <= Heap[pos]) return;
    swap(Heap[j], Heap[pos]);
    pos = j;
  }
}

The time required is O(log2 n).

To insert an element into a heap:

void heap::insert(const ELEM val) {
  assert(n < size);
  int curr = n++;
  Heap[curr] = val;
  // go upwards to put it in the right place
  while ((curr != 0) && (Heap[curr] > Heap[parent(curr)])) {
    swap(Heap[curr], Heap[parent(curr)]);
    curr = parent(curr);
  }
}

The time required is O(log2 n).

void heap::buildheap() {
  for (int i = n/2 - 1; i >= 0; i--) siftdown(i);
}

The time required is O(n).
Deleting the maximal element from a heap:

ELEM heap::removemax() {
  assert(n > 0);
  swap(Heap[0], Heap[--n]); // put the max element in the last place
  if (n != 0) siftdown(0);  // restore the heap in locations 0..n-1
  return Heap[n];
}

The time required is O(log2 n).

A heap is very suitable for a priority queue:
a queue in which deletions are done in order of the priority of the elements.
Insertion and deletion are O(log2 n).
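A minimal usage sketch of the heap as a priority queue, assuming ELEM is int, that the constructor heap(ELEM*, int, int) takes the array, its current size and its maximum size, and a hypothetical consumer process():

int prio[10] = {3, 9, 2, 7}; // priorities of four waiting jobs
heap PQ(prio, 4, 10);        // heap over prio[0..3], capacity 10
PQ.buildheap();              // arrange the elements into a max-heap
PQ.insert(5);                // a new job with priority 5 arrives
while (PQ.heapsize() > 0)
  process(PQ.removemax());   // serve the jobs in decreasing priority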
AVL Trees

The height of a binary search tree depends greatly on the order in which the nodes are inserted.
If the seven nodes
50, 30, 40, 70, 20, 60, 80
are inserted in the above order, we obtain a complete binary tree T1 of height 3.
If, however, the nodes are inserted in the order
20, 30, 40, 50, 60, 70, 80
then we obtain a tree T2 of height 7.
In T2 a search for a node corresponds to a linear search.
It is important to create a binary tree so that it is "close" to a complete binary tree.
An AVL tree
(named after Adelson-Velskii and Landis),
or balanced binary search tree, is a tree on n nodes whose height is always O(log n).

Definition. We say that a binary search tree T is an AVL tree, or balanced tree, if for every node v of the tree the difference in heights of the left and right subtrees of v is at most one.

[Figure: two binary search trees on the keys 20, 30, 40, 60, 70, 80, 90: a) balanced; b) unbalanced at 80.]
An AVL tree is not necessarily the best possible search tree, but the AVL property guarantees that the height of the tree is not far from optimal:

Theorem. Let T be an AVL tree on n nodes. Then height(T) ≤ 1.44 log n.

We now discuss how to make an insertion of a node into an AVL tree so that
1. the tree remains a binary search tree,
2. the tree remains balanced,
3. a search in the tree is O(log n),
4. the insertion of a new node, after the search, requires O(1) operations.

We keep in each node of the tree the balance factor:
the difference of the heights of the two subtrees of the node.

[Figure: a tree with the balance factors at the nodes: 40 (−1), 20 (−1), 30 (0), 80 (1), 60 (−1), 90 (0), 70 (0).]
Insertion in an AVL tree

Initially, insert a node u into the AVL tree as in an ordinary binary search tree, as a leaf.
If the tree remains balanced, then the insertion terminates.
If the tree is not balanced, then a "rotation" is performed that makes the tree balanced again.

This "balancing" involves the node A such that
1. A is on the path from the root to the node u that was inserted,
2. bf(A) = 2 or −2,
3. A is closer to u than any other node with balance factor 2 or −2.

(The balance factor of A before the insertion was ≠ 0, i.e. ±1, since the insertion of one node can change any balance factor by at most ±1.)

Four different types of rotations may be performed:
1. bf(A) = 2 and bf(left child B) = 1:
   perform the right rotation around A.
2. bf(A) = −2 and bf(right child B) = −1:
   perform the left rotation around A.
3. bf(A) = 2 and bf(left child B) = −1:
   perform the left-right rotation around A.
4. bf(A) = −2 and bf(right child B) = 1:
   perform the right-left rotation around A.
[Figures: the rotations, with the balance factors before and after.
Single right rotation around A (bf(A) = 2, bf(B) = 1): B becomes the root of the subtree and A its right child; the subtrees T1, T2, T3 keep their order.
Double left-right rotation around A (bf(A) = 2, bf(B) = −1): C becomes the root of the subtree, with B and A as its children and the subtrees T1, T2, T3, T4 in their original order.
The left rotation and the right-left rotation are the mirror images.]
Any of the rotations is done by changing 2 pointers in a single rotation and by changing 4 pointers in a double rotation.
Any of these rotations is Θ(1).

Deletion of a node from an AVL tree can be done so that the tree remains balanced and so that O(log n) operations are needed for it.

Summary of important facts to remember:

1. A rotation is done at the node that
(a) has balance factor 2 or −2,
(b) is closest to where the new node was inserted.

2. In any rotation, the subtrees T1, T2, T3 in single rotations, or the subtrees T1, T2, T3, T4 in double rotations, remain in the same order.
We only change the positions of the nodes A and B in single rotations, or of the nodes A, B, C in double rotations.

3. The type of rotation depends on the signs of the balance factors.

Code for insertion and deletion of a node in AVL trees can be found in many data structure books.
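For illustration only, a minimal sketch of the single right rotation (case 1), with a hypothetical node type; this is not the notes' own code:

struct Node { int key; Node *left, *right; };

// Right rotation around A; returns the new root of the subtree (B).
// The subtrees T1 (B->left), T2 (B->right) and T3 (A->right)
// keep their left-to-right order.
Node* rotateRight(Node* A) {
  Node* B = A->left;
  A->left = B->right;  // T2 moves under A
  B->right = A;        // A becomes the right child of B
  return B;            // the caller re-attaches B in place of A
}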
Example of the construction of an AVL tree by inserting the following items in the given order:
20, 30, 40, 50, 60, 45

[Figure: the tree after each insertion, with the balance factors shown at the nodes:
a) insert 20;
b) insert 30;
c) insert 40: out of balance at 20, do a left rotation;
d) insert 50;
e) insert 60: out of balance at 40, do a left rotation;
f) insert 45: out of balance at 30, do a double right-left rotation.]
Sorting

We have a collection of records and we need to sort them by one field of the record, the sort key.
We will allow duplicate keys.

Example:
A collection of student records; each record contains id-number, name, specialization.
If we sort them by id-numbers, then the id-numbers are the sort keys.
If we sort them by names, then the names are the sort keys.

Internal sorting:
All records are available in the main memory, in an array.

External sorting:
All records are available in a file on a disk memory.

We will do mostly internal sorting.

Sorting is stable if the sorting does not change the original relative order of records with identical keys.

Internal sorting algorithms

Some trivial sorting algorithms:

Insertion sort
Each record is put in the proper place among the records already processed (see the sketch below).

Bubble sort
Keep on swapping any two adjacent records which are out of place.

Selection sort
In the ith pass, put the ith record in its proper place.
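For illustration, a minimal insertion sort sketch for integer keys (an assumption; the notes give no code for the trivial sorts):

void insertionSort(int A[], int n) {
  for (int i = 1; i < n; i++) {   // A[0..i-1] is already sorted
    int v = A[i];
    int j = i;
    while (j > 0 && A[j-1] > v) { // shift larger records to the right
      A[j] = A[j-1];
      j--;
    }
    A[j] = v;                     // put the record in its proper place
  }
}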
All trivial sorting algorithms require Θ(n2) average time to sort n records.

We will discuss better sorting algorithms that require O(n log2 n) time to sort n records.

If n = 10000,
n2 = 100000000, n log2 n ≈ 133000,
i.e. n log2 n is lower by a factor of ≈ 750.

If n = 100000,
n2 = 10000000000, n log2 n ≈ 1660000,
i.e. n log2 n is lower by a factor of ≈ 6000.

Heapsort

Basic idea:
1. Arrange the records of the ARRAY into a max-heap.
2. for (int i = n-1; i > 0; i--) {
     swap(ARRAY[0], ARRAY[i]);
     siftdown ARRAY[0] to the correct place,
     considering only the heap in locations 0 to i-1;
   }

Step 1 needs Θ(n) operations.
Siftdown needs at most log2 n operations.
The loop is repeated n times.
Heapsort needs Θ(n + n log2 n) = Θ(n log2 n) operations.
Using the heap member functions, we can code heapsort as follows:

void heapsort(ELEM* ARRAY, int n) {
  heap H(ARRAY, n, n);
  for (int i = n; i > 0; i--)
    H.removemax();
}

The best case, average case, and worst case need Θ(n log2 n) operations.
Heapsort only needs a constant number of additional memory locations.

Quicksort

The fastest known method on average. Very widely used.
(Invented by C.A.R. Hoare.)

Like binary search, it is a divide and conquer algorithm:
it repeatedly splits the problem of sorting an array into sorting two half-size arrays.

Outline:

qsort(ELEM* array, int i, int j) {
  // sort array from position i to j
  1. Select one value P from the array; P is the pivot.
  2. Partition the records so that first we have
     all records with keys < pivot,
     followed by the pivot,
     followed by the records with keys >= pivot;
     let k be the position of the pivot.
  3. if ((k-i) > 1) qsort(array, i, k-1);
     if ((j-k) > 1) qsort(array, k+1, j);
}
[Figure: a worked quicksort example on the array
25 13 40 17 63 5 81 33 10 18 55 11.
The pivot 25 is moved to the end, the array is partitioned into
11 13 18 17 10 5 | 25 | 33 63 40 55 81,
and the two parts are then sorted recursively, with pivots 11 and 33.]
Code for the quicksort:

void qsort(ELEM* array, int i, int j) {
  // sort array from position i to j
  int pivotindex = findpivot(array, i, j);
  swap(array[pivotindex], array[j]); // put the pivot at the end
  int k = partition(array, i-1, j, key(array[j]));
  // k is where the pivot should be placed
  swap(array[k], array[j]);          // place the pivot
  if ((k-i) > 1) qsort(array, i, k-1);
  if ((j-k) > 1) qsort(array, k+1, j);
}

[Figure: a trace of one partition on the subarray 5 10 18 17 13 11 with pivot 11: out-of-place records are swapped, the unnecessary last swap is undone, and the pivot is placed at its final position, giving 5 10 11 17 13 18.]
Find pivot: O(1) time.
Partitioning of an array with n records: O(n) time.

int partition(ELEM* array, int l, int r, KEY pivot) {
  do {
    while (key(array[++l]) < pivot);        // move l rightwards
    while (r && (key(array[--r]) > pivot)); // move r leftwards
    swap(array[l], array[r]);
  } while (l < r);
  swap(array[l], array[r]); // reverse the unnecessary last swap
  return l;                 // position for the pivot
}

If each pivot divides the array evenly, we have to do O(log2 n) stages.
Each stage involves in total partitioning of arrays with n records.
Total time: O(n log2 n) in the best and average case.

If each pivot divides the array very unevenly (one part could be empty), we have to do O(n) stages.
Total time: O(n2) in the worst case.
The choice of the pivot is important so that it breaks the array evenly.
There are many possibilities for the pivot choice.
A good possibility:
select the pivot as the median of the keys of three randomly chosen records (a sketch follows).

Quicksort can be implemented without recursion, by using a stack to store the indices of the parts of the array to be sorted later.
Quicksort needs additional space O(log2 n).

Quicksort and Heapsort are in-place sorting algorithms: the values of the records stay in the originally given array.
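The qsort code above calls findpivot without defining it. A minimal sketch of the median-of-three choice just described, assuming a key() accessor as in the other code and rand() from <cstdlib>; one common variant, not the notes' own code:

#include <cstdlib>

int findpivot(ELEM* array, int i, int j) {
  int a = i + rand() % (j - i + 1); // three randomly chosen positions
  int b = i + rand() % (j - i + 1);
  int c = i + rand() % (j - i + 1);
  // return the position holding the median of the three keys
  if (key(array[a]) < key(array[b])) {
    if (key(array[b]) < key(array[c])) return b;
    return (key(array[a]) < key(array[c])) ? c : a;
  } else {
    if (key(array[a]) < key(array[c])) return a;
    return (key(array[b]) < key(array[c])) ? c : b;
  }
}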
Mergesort

General idea:
Sort the left half of the array;
sort the right half of the array;
merge the two sorted subarrays.

It is not an in-place sorting algorithm.
It uses additional space of the same size as the array to be sorted.
void mergesort(ELEM* array, ELEM* temp, int left, int right) {
  if (left == right) return;
  int mid = (left+right)/2;
  mergesort(array, temp, left, mid);
  mergesort(array, temp, mid+1, right);
  // copy array into temp
  for (int i = left; i <= right; i++)
    temp[i] = array[i];
  // merge the sorted subarrays
  int i1 = left; int i2 = mid+1;
  int curr = left;
  while ((i1 <= mid) && (i2 <= right))
    if (key(temp[i1]) <= key(temp[i2]))
      array[curr++] = temp[i1++];
    else array[curr++] = temp[i2++];
  // copy the remaining records from the left part
  for (int i = i1; i <= mid; i++)
    array[curr++] = temp[i];
  // copy the remaining records from the right part
  for (int i = i2; i <= right; i++)
    array[curr++] = temp[i];
}
In the best, average, and worst case, Mergesort is Θ(n log2 n).
It is a stable sort. Heapsort and Quicksort are not stable.
It needs O(n) additional space.

Theorem. Any sorting algorithm that is based on comparisons of keys needs Ω(n log2 n) comparisons on average.

Radix Sort

This is a sorting method that was used on mechanical sorting machines before computers were available.

General idea:
(1) Sort all records by the last digit of the keys;
(2) sort all records by the second-from-last digit of the keys;
(3) sort all records by the third-from-last digit of the keys;
(4) sort all records by the fourth-from-last digit of the keys;
etc.

The sorting by a digit is done by placing the records into "bins" 0, 1, 2, 3, ..., 9 and then putting the records from the bins back into the array.
void radix(ELEM* A, ELEM* B, int n, int k, int r, int* count) {
  // Sort A of n records; each key contains k digits; we use radix r.
  // Variable rtok is r^i.
  for (int i = 0, rtok = 1; i < k; i++, rtok *= r) { // repeat for k digits
    for (int j = 0; j < r; j++) count[j] = 0;        // initialize counts
    // count the records in each bin
    for (int j = 0; j < n; j++) count[(key(A[j])/rtok) % r]++;
    // compute the bottom of each bin
    for (int j = 1; j < r; j++) count[j] = count[j-1] + count[j];
    // scan A from right to left and put the records
    // into the bins in right-to-left direction
    for (int j = n-1; j >= 0; j--) B[--count[(key(A[j])/rtok) % r]] = A[j];
    // copy B into A
    for (int j = 0; j < n; j++) A[j] = B[j];
  }
}

In our examples we use bin[0], bin[1], ..., bin[9], but on computers we usually use bins corresponding to groups of bits of the binary representation (e.g. 8 bins for 3 bits).
We put all bins together to save space and use count[i] to indicate the available place in bin[i].
We sort by the remainder, or radix, of each key.

Radix sort time depends on
(1) the number of records ... n,
(2) the number of digits in the keys ... k:
O(kn),
where k ≥ logr n, but it could be more (if some keys are very long).

It is not an in-place sorting method; it needs O(n) additional space.
It is a stable sorting algorithm.
Empirical comparison of sorting algorithms

PC computer:                 array size
Algorithm            100     10000     30000
Insert. sort         .18      1847     16544
bubble sort          .23      2274     20452
quicksort            .05       7.3        24
mergesort            .07      10.7        35
heapsort             .17        36       122
radix sort8          .06       6.0        17

DEC station:                 array size
Algorithm            100     10000    100000
Insert. sort        .017       168     23382
bubble sort         .026       257     41874
quicksort           .006        .9        12
mergesort           .013       2.3        30
heapsort            .017       3.5        49
radix sort8         .018       1.6        18

There are many other algorithms; we considered the most common and the best.

Excellent reference on sorting algorithms:
D.E. Knuth: Sorting and Searching, Addison-Wesley.
Search

We have a collection of compound objects
A1, A2, . . . , An
Each object contains one element that is used for search, the search key: Ai = (ki, Ii).
We assume that
Ai.key() gives ki,
Ai.info() gives Ii.

Search problem:
Given a key K, find an object Ai so that Ai.key() = K.

A search can be
successful, if an object containing K is found,
unsuccessful, if no object containing K is found.

Exact match query:
we search only for objects with Ai.key() = K.
Range query:
we search for objects with K − c ≤ Ai.key() ≤ K + c.

Different techniques are used when the objects are stored on a disk rather than in RAM.
Assume we have n objects. We have seen:

1. Lists:
(a) array implementation, unordered:
    sequential search O(n),
    insert, delete (after search) O(1)
(b) array implementation, ordered:
    binary search O(log2 n),
    insert, delete (after search) O(n)
(c) linked implementation, unordered:
    sequential search O(n),
    insert, delete (after search) O(1)
(d) linked implementation, ordered:
    sequential search O(n),
    insert, delete (after search) O(1)

2. Binary search trees, linked implementation:
    search O(height),
    insert, delete O(height)
The height can be made Θ(log2 n) with balanced search trees (AVL trees).

Any search based on comparisons of keys needs Ω(log2 n) comparisons.
A way to improve the binary search in an array:

Interpolation or Dictionary Search:
Given a key K, calculate an index of the array where the object might be.

left = 0;     // search the objects between
right = n-1;  // left and right, inclusive
while (left <= right) {
  if ((right - left) == 0) pos = left;
  else {
    q = float(K - A[left].key()) /
        float(A[right].key() - A[left].key());
    pos = left + int(q * (right - left));
  }
  if (K < A[pos].key()) right = pos - 1;
  else if (K > A[pos].key()) left = pos + 1;
  else return pos;
}

It can work well if the distribution of the key values is uniform.

Ways to improve sequential search

(1)
If we know the frequency of accesses of each element, we can store the elements in the list in order of their frequency of accesses, starting with the highest frequency.
The frequency is given as a number between 0 and 1:
a frequency of 0.75 for x means that 75% of the time we access x.
It can give a good performance in some cases, but the frequencies are often not known.

(2)
Self-Organizing Lists

Lists in which the order of the objects in the list is changed based on the searches which are done:
place the objects that are being searched at the head, or closer to the head, of the list.
This can speed up the search.

Basic strategies:
1. Count method: each object contains a count of the searches for the object. Order the objects by decreasing count.
2. Move to Front: any object found is placed first.
3. Transpose: any object found is interchanged with the preceding object.
4. There are other possibilities.

Example of the Count method:
Assume we have a list of keys and counts; we omit the information stored in the objects.
After some searches, we have the list:
(354, 3), (567, 2), (999, 2), (102, 1), (122, 1), (234, 1)
Search for 122: 5 comp.
(354, 3), (567, 2), (999, 2), (122, 2), (102, 1), (234, 1)
Search for 122: 4 comp.
(354, 3), (122, 3), (567, 2), (999, 2), (102, 1), (234, 1)
Search for 999: 4 comp.
(354, 3), (122, 3), (999, 3), (567, 2), (102, 1), (234, 1)
Search for 122: 2 comp.
(122, 4), (354, 3), (999, 3), (567, 2), (102, 1), (234, 1)
Total: 15 comparisons.
Example of Move to Front:
Assume we have a list of keys; we omit the information stored in the objects.
After some searches, we have the list:
354, 567, 999, 102, 122, 234, 189, 333
Search for 122: 5 comp.
122, 354, 567, 999, 102, 234, 189, 333
Search for 333: 8 comp.
333, 122, 354, 567, 999, 102, 234, 189
Search for 333: 1 comp.
333, 122, 354, 567, 999, 102, 234, 189
Search for 234: 7 comp.
234, 333, 122, 354, 567, 999, 102, 189
Total: 21 comparisons.

Example of Transpose:
Assume we have a list of keys; we omit the information stored in the objects.
After some searches, we have the list:
354, 567, 999, 102, 122, 234, 189, 333
Search for 122: 5 comp.
354, 567, 999, 122, 102, 234, 189, 333
Search for 122: 4 comp.
354, 567, 122, 999, 102, 234, 189, 333
Search for 999: 4 comp.
354, 567, 999, 122, 102, 234, 189, 333
Search for 122: 4 comp.
354, 567, 122, 999, 102, 234, 189, 333
Total: 17 comparisons.
The Count method is better when some objects are accessed very frequently, but these accesses are not grouped together. It does not react very dynamically to changes in the frequency of access patterns, and it needs additional space.

Move to Front is better when some accesses are clustered together. It reacts very dynamically to changes of access patterns. It is easy to implement for linked lists only.

Transpose is easy to implement for both arrays and linked lists.

There are always some access patterns that make any of the above rather bad.
A self-organizing list can perform well in some situations.
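For illustration, a minimal sketch of Move to Front on a linked list, using std::list<int> for the keys (an assumption; the notes give no code):

#include <list>

// Search for key; if found, unlink it and place it first.
bool mtf_search(std::list<int>& lst, int key) {
  for (std::list<int>::iterator it = lst.begin(); it != lst.end(); ++it)
    if (*it == key) {
      lst.erase(it);
      lst.push_front(key);
      return true;
    }
  return false;
}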
Hashing

It has, on average, and with high probability, a constant search and insertion time: O(1).

The objects are stored in an array Table with n locations
Table[0], Table[1], . . . , Table[n − 1]
where n is usually larger than the number of objects we deal with.

There is a hash function h; an object A is stored at the location
Table[h(A.key())]
whenever possible.
I.e., the location of the object containing a given key is calculated by h.
h only takes O(1) operations to calculate.
Example: store student information for a class.
Key: I.D. numbers.
Hash function:
int H(int id) { return id % 89; }
Table[0], Table[1], . . . , Table[88]
id’s:
2670860
3661156
3810151
3930688
3954765
4051254
4079469
4094603
4106091
4129245
4136977
4158601
4198948
3331423
3734927
3836991
3930947
4032292
4057279
4083733
4097890
4110226
4129350
4137876
4158695
4208072
3400034
3736660
3893715
3936007
4036417
4060695
4086899
4101901
4110625
4129911
4138104
4187490
4208668
3464121
3782239
3895440
3952002
4043316
4072626
4090985
4105370
4111265
4130928
4139739
4190238
4209389
3526100
3791238
3929388
3952746
4047958
4079124
4094026
4106032
4117182
4131037
4149882
4193830
4210344
2670860 mod 89 = 59,   3400034 mod 89 = 56,   3526100 mod 89 = 9,
3734927 mod 89 = 42,   3782239 mod 89 = 6,    3810151 mod 89 = 61,
3893715 mod 89 = 54,   3929388 mod 89 = 38,   3930947 mod 89 = 84,
3952002 mod 89 = 46,   3954765 mod 89 = 50,   4036417 mod 89 = 0,
4047958 mod 89 = 60,   4057279 mod 89 = 36,   4072626 mod 89 = 75,
4079469 mod 89 = 65,   4086899 mod 89 = 19,   4094026 mod 89 = 26,
4097890 mod 89 = 63,   4106032 mod 89 = 17,   4110226 mod 89 = 28,
4111265 mod 89 = 88,   4129245 mod 89 = 1,    4129911 mod 89 = 44,
4131037 mod 89 = 13,   4190238 mod 89 = 29,   4198948 mod 89 = 17,
4208668 mod 89 = 36,   4210344 mod 89 = 21,   4101901 mod 89 = 69,
4106091 mod 89 = 76,   4110625 mod 89 = 71,   4117182 mod 89 = 42,
4129350 mod 89 = 17,   4130928 mod 89 = 82,   4136977 mod 89 = 79,
4193830 mod 89 = 61,   4208072 mod 89 = 63,   4209389 mod 89 = 45,
3331423 mod 89 = 64,   3464121 mod 89 = 63,   3661156 mod 89 = 52,
3736660 mod 89 = 84,   3791238 mod 89 = 16,   3836991 mod 89 = 23,
3895440 mod 89 = 88,   3930688 mod 89 = 3,    3936007 mod 89 = 71,
3952746 mod 89 = 78,   4032292 mod 89 = 58,   4043316 mod 89 = 46,
4051254 mod 89 = 63,   4060695 mod 89 = 70,   4079124 mod 89 = 76,
4083733 mod 89 = 57,   4090985 mod 89 = 11,   4094603 mod 89 = 69, ...

Two keys that hash to the same location are said to collide:
4129350 mod 89 = 17
4198948 mod 89 = 17
4106032 mod 89 = 17
They collide.
We need to have a policy for how to resolve the collisions.
In total, we had:
3 keys hash to 17
3 keys hash to 36
2 keys hash to 42
2 keys hash to 46
2 keys hash to 61
3 keys hash to 63
2 keys hash to 69
2 keys hash to 71
2 keys hash to 76
2 keys hash to 84
2 keys hash to 88

The number of keys hashing to one location is typically small (like above).
Open hashing

Objects that collide are stored outside the main table.
A good open hashing strategy: store all objects hashed to the same location in a linked list.
This is called Separate Chaining.

(-1 ... no object stored in that slot)

 0: 4036417
 1: 4129245
 2: -1
 3: 3930688
 4: -1
 ...
16: 3791238
17: 4106032 -> 4198948 -> 4129350
 ...
36: 4057279 -> 4208668
 ...
88: 4111265 -> 3895440

The objects in the lists may be ordered:
(1) by keys,
(2) by the entry time,
(3) by frequency of access.
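For illustration, a minimal sketch of separate chaining for integer keys with table size 89, as in the example (an assumption; the notes give no code):

#include <list>
#include <vector>

struct ChainedTable {
  std::vector<std::list<int>> bins;
  ChainedTable() : bins(89) {}              // 89 slots, each a linked list
  int h(int key) const { return key % 89; }
  void insert(int key) { bins[h(key)].push_front(key); } // ordered by entry time
  bool search(int key) const {
    for (int k : bins[h(key)])              // walk the chain at the home slot
      if (k == key) return true;
    return false;
  }
};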
Separate Chaining is an excellent choice when we have additional space for the linked lists.
Disadvantage: the space in the array is not fully used, and there is no control on how much additional space is needed!

For hashing, the number of comparisons needed to find/insert an object does not depend on the number of objects we search among.
It depends on the load factor:
load factor = number of objects / table size.

In the example:
load factor = 65/89 = 0.73

Search for an object that is in the table:
3 objects need 3 key comparisons,
8+3 objects need 2 key comparisons,
51 objects need 1 key comparison.
Average number of comparisons:
(3·3 + 11·2 + 51·1)/65 = 82/65 ≈ 1.26

(A binary search tree would need on average about 5 comparisons.)

We often distinguish:
search for an object in the table,
search for an object not in the table.

Search for an object that is not in the table:
(3·3 + 8·2 + 78·1)/89 = 103/89 ≈ 1.16

Insert Ai = a search for an object not in the table; it stops when we reach an empty location that should contain Ai, and assigns Ai to that location.
Closed Hashing

All elements are stored in the hash table Table of size m.
(To avoid an infinite loop, we need to have at least one position that does not contain an object.)

The home position of an object Ai with key value Key = Ai.key() is h(Key).

Insert:
(1) h_pos = h(Key);
(2) if (Table[h_pos] == No_Entry)
      Table[h_pos] = Ai;  // store it at home
    else apply collision resolution;

Bucket Hashing

The hash table consists of B buckets, each bucket containing m/B slots, plus one overflow bucket following the table.
h_pos is the bucket address of Ai.

Insert:
store Ai in the first available slot of bucket h_pos, if a slot is available;
else store it in the first available slot of the overflow bucket.

Bucket hashing is good for hash tables on a disk: we read in the whole bucket.
It is bad for tables in RAM, as a lot of space is often wasted, and the search can be slow for items in the overflow bucket.

We now discuss different collision resolutions for closed hashing.
Linear Probing

The collision resolution puts the object in the next unoccupied slot going downward in the table (with a wrap-around).
The probe sequence is
Table[h_pos],
Table[(h_pos + 1) % table_size],
Table[(h_pos + 2) % table_size],
etc.

This has many drawbacks:
it tends to cluster objects together, which increases the probability of further collisions and larger clusters.
Large clusters imply a high probability of long searches.
(primary clustering)

Some improvements:

Quadratic probing; its probe sequence is
Table[h_pos],
Table[(h_pos + 1^2) % table_size],
Table[(h_pos + 2^2) % table_size],
Table[(h_pos + 3^2) % table_size],
Table[(h_pos + 4^2) % table_size],
etc.

Pseudo-random probing; its probe sequence is based on a fixed pseudo-random sequence r1, r2, r3, . . . :
Table[h_pos],
Table[(h_pos + r[1]) % table_size],
Table[(h_pos + r[2]) % table_size],
Table[(h_pos + r[3]) % table_size],
etc.

Problem: secondary clustering: two keys having the same home address have the same probe sequence.
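For illustration, a minimal sketch of a closed-hashing insert with linear probing (integer keys, NO_ENTRY marking an empty slot; assumptions, not the notes' code). Quadratic or pseudo-random probing would only change the offset h_pos + i to h_pos + i*i or h_pos + r[i]:

const int NO_ENTRY = -1;

bool insertLinear(int Table[], int table_size, int key, int h_pos) {
  for (int i = 0; i < table_size; i++) {
    int pos = (h_pos + i) % table_size; // the probe sequence, with wrap-around
    if (Table[pos] == NO_ENTRY) { Table[pos] = key; return true; }
  }
  return false; // no free slot found
}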
Further improvement: double hashing.

We have one more hash function h2 > 0, which is used to calculate the steps in the probe sequence.
With step = h2(Key), the probe sequence is
Table[h_pos],
Table[(h_pos + step) % table_size],
Table[(h_pos + 2*step) % table_size],
Table[(h_pos + 3*step) % table_size],
etc.

It avoids most problems like primary and secondary clustering.
Double hashing is a good method for closed hashing.

Example on closed hashing:

Main hash function: h(key) = key mod 13.
Hash function used for double hashing: h2(key) = key mod 7 + 1.

A table of the keys and their home positions:

key   home pos.            step for double hash
267   267 mod 13 = 7
133   133 mod 13 = 3
340   340 mod 13 = 2
246   246 mod 13 = 12
352   352 mod 13 = 1
166   166 mod 13 = 10
373   373 mod 13 = 9
275   275 mod 13 = 2      275 mod 7 + 1 = 3
178   178 mod 13 = 9      178 mod 7 + 1 = 4
479   479 mod 13 = 11
139   139 mod 13 = 9      139 mod 7 + 1 = 7
Tables obtained:

slot    linear      quadratic    double hash
  0     479 (3)     178 (3)      178 (2)
  1     352 (1)     352 (1)      352 (1)
  2     340 (1)     340 (1)      340 (1)
  3     133 (1)     133 (1)      133 (1)
  4     275 (3)     --           139 (4)
  5     139 (10)    139 (4)      275 (2)
  6     --          275 (3)      --
  7     267 (1)     267 (1)      267 (1)
  8     --          --           --
  9     373 (1)     373 (1)      373 (1)
 10     166 (1)     166 (1)      166 (1)
 11     178 (3)     479 (1)      479 (1)
 12     246 (1)     246 (1)      246 (1)

(In parentheses: the number of probes for a successful search of that key.)

Linear probing, number of probes for an unsuccessful search:
average = 69/13 = 5.3 comparisons.

In closed hash tables, deleted items are replaced by a tombstone.
Expected number of comparisons of keys
(S ... successful search, U ... unsuccessful search)

collision                       load factor
resolution            .25    .50    .75    .9     .95

Open Hashing
separate chaining  S  1.12   1.25   1.38   1.45   1.48
                   U  1.03   1.11   1.22   1.31   1.34

Closed Hashing
linear probing     S  1.17   1.50   2.50   5.50   10.50
                   U  1.39   2.50   8.50   50.5   200.5
double hashing     S  1.15   1.39   1.85   2.56   3.15
                   U  1.33   2.00   4.00   10.00  20.00
Some conclusions:
Open hashing with separate chaining works well even when the load is high.
Closed hashing should be used with double hashing and a load of not more than .9 to have good performance.

Important:
It is the load factor, and not the number of objects, that is important for good performance.
The numbers are valid on average; the worst case could be worse.
Hash Functions

(1) Mod method:
h(key) = key % n;
where n is the table size.
Used for integer keys.
Performs very well when the keys are random.
Does not perform very well when many keys end with the same suffix (sometimes the last digits in i.d. numbers indicate the area).
n must be a prime number.

Example:
If we need to have a table for 8000 items and closed hashing, choose n ≈ 1.12 ∗ 8000 = 8960.
We can choose n = 8963 from prime number tables.

(2) Mid-square method:
h(key) = ((key ∗ key) >> b) % 2^a;
Extracts the a bits at positions a + b, . . . , b + 1 (counting from the least significant bit).
Choose a, b so that we take the middle bits.
Performs very well; good for integer keys.
The table is of size 2^a.

Example:
If we need to have a table for 8000 items and open hashing, choose n = 2^13 = 8192.
Each key is of length 16 bits, so key^2 is of length 32 bits.
Choose a = 13, b = 10.
If key = 11234 = 0010101111100010,
key^2 = 000001111|0000101101100|1110000100,
h(key) = 0000101101100 = 364.
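A minimal sketch of the mid-square method with a = 13 and b = 10, as in the example (assuming 16-bit keys, so key*key fits in 32 bits):

unsigned h_midsquare(unsigned key) {
  const unsigned a = 13, b = 10;
  return ((key * key) >> b) % (1u << a); // keep the middle a bits
}
// h_midsquare(11234) == 364, matching the worked example above.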
(3) Folding method:
Used for strings.
h(key) = ((int)s[1] + (int)s[2] + · · · + (int)s[i]) % n;
Add all characters together and take the sum mod the table size.

Example:
n = 557, key = Smith, Jamie
83 + 109 + 105 + 116 + 104 + 44 + 32 + 74 + 97 + 109 + 105 + 101 = 1079,
1079 mod 557 = 522.

If the table size is bigger, add the numerical values of pairs of characters.

(4) Other variations of the previous methods; see the textbook.

When we know more about the keys, we can custom design a hash function that minimizes the collisions.

The above explains the use of the term hashing: chop the identifier into pieces and mix them up.
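A minimal sketch of the folding method for C strings (0-based indexing instead of the s[1..i] of the formula above; otherwise the same character-code summation):

int h_fold(const char* s, int n) {
  int sum = 0;
  for (int i = 0; s[i] != '\0'; i++)
    sum += (int)s[i]; // add all characters together
  return sum % n;     // take the sum mod the table size
}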
Evaluation of hash-based search techniques

An excellent method in many cases.
Search, insert, delete: O(1) on average.

Drawbacks:
Some space is wasted.
The performance O(1) cannot be guaranteed; the worst case could be O(n).
Not suitable for range queries.
Cannot print the items in order without sorting them first.

In many cases, the hash table does not contain the objects, but only the keys and pointers to the objects.

[Figure: the separate-chaining table from the earlier example, storing with each key a pointer to its object (obj. 1, obj. 2, . . .); -1 marks a slot with no object.]
Trees are not sufficient to represent or model some situations; sometimes we need a graph.

Example: course prerequisites.
[Figure: a graph of the course prerequisites among C228, C238, C248, C229, C239, C249, C335, C352, C326, C346, C353, C354; directed, no weights on the edges.]

A graph consists of
a set of vertices ... V,
a set of edges ... E;
each edge is a pair of vertices, E ⊆ V × V.

If an edge (u, v) is directed from u to v, we talk about a directed graph (digraph).
If an edge (u, v) is considered to be between u and v, we talk about an undirected graph.
Whether we use a directed or an undirected graph depends on the application.

A graph whose edges are assigned a cost or weight is called weighted.
We need to have a computer representation of the course prerequisites for students' registration.

Example: Air Canada network.
[Figure: a network of cities (Thunder Bay, Sudbury, St. John's, Quebec, Ottawa, Toronto, Montreal, London, Halifax) with distances on the edges; weighted, each weight represents a distance, undirected.]
We need to have a computer representation of an airline network to find flights between cities.

Review of some terminology:

path: a sequence of vertices v1, v2, . . . , vn where (v1, v2), (v2, v3), . . . , (vn−1, vn) are edges in the graph.

simple path: all vertices of the path are distinct.

cycle: a path that begins and ends in the same vertex.

subgraph of G: a subset S of the vertices of G and a subset E1 of the edges of G, E1 ⊆ S × S.
connected graph: there is a path between any two vertices.

component: a connected subgraph that cannot be made larger.

acyclic graph: a graph without a cycle.

DAG: a directed acyclic graph.

A tree can be viewed as a connected, undirected graph with no cycle.
The trees that we studied earlier are considered rooted trees, since one node is designated as the root.

Graph Representations

[Figure: an example graph on the vertices a, b, c, d, e, f, and its adjacency list: each vertex 0, . . . , 5 has a linked list of its adjacent vertices.]
The adjacency matrix is a 0-1 matrix in which 1 indicates an edge; we need one bit per entry.
For the example graph above it is:

     a   b   c   d   e   f
a   ne   1  ne   1   1   1
b    1  ne   1  ne   1  ne
c   ne   1  ne   1   1  ne
d    1  ne   1  ne   1  ne
e    1   1   1   1  ne   1
f    1  ne  ne  ne   1  ne

ne = no edge

If the graph is undirected, then the adjacency matrix is symmetric.

The space cost of each representation:
Adjacency matrix: Θ(|V|^2).
Adjacency list: Θ(|V| + |E|).
Notice that |E| in the worst case is Θ(|V|^2), but it is often only Θ(|V|) if the graph is sparse.
Graph ADT
class Graph {
public:
Graph(); // constructor
~Graph(); // destructor
int n(); // number of vertices
int e(); // number of edges
Edge first(int); // get the first edge for a vertex
bool isEdge(Edge); // true if this is an edge
Edge next(Edge); // get the next edge for a vertex
int v1(Edge); // return the origin vertex of edge
int v2(Edge); // return the target vertex of edge
int weight(int,int); // return weight of edge
int weight(Edge); // return weight of edge
};
typedef int* Edge;
class Graph {
// Adjacency matrix implementation
private:
  int* matrix;    // the edge matrix
  int numVertex;  // number of vertices
  int numEdge;    // number of edges
  bool* Mark;     // used for marking visited vertices
                  // in some algorithms
public:
  Graph();  // constructor
  ~Graph(); // destructor
  int n();  // number of vertices
  int e();  // number of edges
  Edge first(int);     // get the first edge for a vertex
  bool isEdge(Edge);   // true if this is an edge
  Edge next(Edge);     // get the next edge for a vertex
  int v1(Edge);        // return the origin vertex of an edge
  int v2(Edge);        // return the target vertex of an edge
  int weight(int,int); // return the weight of an edge
  int weight(Edge);    // return the weight of an edge
};

The matrix is a one-dimensional array of (numVertex)^2 entries;
NOEDGE indicates that there is no edge there.
The rows and columns are numbered 0, 1, . . . , numVertex − 1.
The element in row i, column j of the adjacency matrix is
matrix[i * numVertex + j].
An Edge is the address of an entry in the adjacency matrix.
Edge Graph::first(int v) { // get the first edge of v
  int stop = (v+1)*numVertex;
  for (int pos = v*numVertex; pos < stop; pos++)
    if (matrix[pos] != NOEDGE) return &matrix[pos];
  return NULL;
}

Edge Graph::next(Edge e) { // get the next edge of v1(e) after e
  int stop = (v1(e)+1)*numVertex;
  for (int pos = e - matrix + 1; // pos is one entry to the right of e
       pos < stop; pos++)
    if (matrix[pos] != NOEDGE) return &matrix[pos];
  return NULL;
}

int Graph::v1(Edge e) // return the origin of e
{ return (e - matrix) / numVertex; }
// find the row of the matrix containing e
Graph Traversal

Many algorithms require traversing all vertices of a graph.
Sometimes we want to traverse the vertices of a graph from a given vertex, using the edges of the graph, until some specific goal is achieved.

Examples of goals:
- visit all vertices in some order,
- reach another given vertex (shortest path problem),
- reach a vertex having some specific property.

We must avoid running in a cycle in the graph.
We use the Mark array, marking every vertex already visited.

To traverse all vertices:
repeat the traversal from a specific vertex until all vertices are visited.

void graph_traverse(Graph& G) {
  for (int v = 0; v < G.n(); v++)
    G.Mark[v] = UNVISITED; // initialize the marks
  for (int v = 0; v < G.n(); v++)
    if (G.Mark[v] == UNVISITED)
      do_traverse(G, v);
}

do_traverse visits all vertices reachable from v.
Depth-First Search
(like preorder)

When we visit a vertex v, do a depth-first search from each vertex adjacent to v:
recursively explore all vertices accessible from an edge before going sideways.

// Depth-First Search
void DFS(Graph& G, int v) {
  PreVisit(G, v);      // can print, change, etc.
  G.Mark[v] = VISITED; // avoid looping
  for (Edge e = G.first(v); G.isEdge(e); e = G.next(e))
    if (G.Mark[G.v2(e)] == UNVISITED)
      DFS(G, G.v2(e));
  PostVisit(G, v);     // can change, update, etc.
}

[Figure: an example graph on the vertices a, . . . , j.]
If we start from h with the edge (h, e), we get:
h, e, b, a, j, c, d, g, f, i
Breadth-First Search
(like level-order traversal for trees)

When we visit a vertex v, examine all vertices connected to v before going any further.
We use a queue for this traversal.

// Breadth-First Search
void BFS(Graph& G, int v) {
  Queue Q(G.n());      // create a queue of sufficient length
  Q.enqueue(v);
  G.Mark[v] = VISITED; // avoid looping
  while (!Q.isEmpty()) {
    int u = Q.dequeue();
    PreVisit(G, u);    // can print, change, etc.
    for (Edge e = G.first(u); G.isEdge(e); e = G.next(e))
      if (G.Mark[G.v2(e)] == UNVISITED) {
        G.Mark[G.v2(e)] = VISITED;
        Q.enqueue(G.v2(e));
      }
    PostVisit(G, u);   // can change, update, etc.
  }
}

[Figure: the same example graph.]
Starting from h, we get:
h, e, f, g, b, i, c, d, a, j
Topological Sort

Given a DAG G, find a linear order of all the vertices of G
u1, u2, u3, . . . , un
so that there is no edge (ui, uj) with j < i in G.

[Figure: the course prerequisite DAG from before.]

Possible orders:
c228, c238, c248, c229, c239, c249, c335, c352, c346, c353, c354, c326, or
c248, c238, c228, c239, c229, c249, c352, c335, c354, c346, c326, c353, or ...
Method 1: use DFS:
1. Do DFS.
2. In preVisit, do nothing.
3. In postVisit, print out the node.
4. It prints the nodes in the reverse of a topological sort.

Method 2: use a queue Q (a sketch follows the list):
1. Count the number of edges going into each node.
2. Enqueue in Q all nodes with no incoming edge.
3. While !isEmpty(Q):
(a) dequeue a node v,
(b) print the node v,
(c) decrease the count of every node pointed at by v,
(d) if the count of a node becomes 0, enqueue the node.
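For illustration, a minimal sketch of Method 2 using the notes' Graph ADT together with std::queue and std::vector (printTopo and the count array are assumptions, not the notes' code):

#include <iostream>
#include <queue>
#include <vector>

void printTopo(Graph& G) {
  std::vector<int> count(G.n(), 0);
  std::queue<int> Q;
  for (int v = 0; v < G.n(); v++)   // 1. count the incoming edges
    for (Edge e = G.first(v); G.isEdge(e); e = G.next(e))
      count[G.v2(e)]++;
  for (int v = 0; v < G.n(); v++)   // 2. enqueue nodes with no incoming edge
    if (count[v] == 0) Q.push(v);
  while (!Q.empty()) {              // 3. dequeue, print, decrease counts
    int v = Q.front(); Q.pop();
    std::cout << v << " ";
    for (Edge e = G.first(v); G.isEdge(e); e = G.next(e))
      if (--count[G.v2(e)] == 0) Q.push(G.v2(e)); // a new zero-count node
  }
}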
There are many other important graph algorithms:
• shortest path problem,
• minimum-cost spanning tree problem,
• Travelling Salesman problem,
• maximum flow problem,
• many other problems.
They all use the graph data structure.

Memory management

When a computer program is running, space must be allocated to
• each function call, for the variables associated with the function,
• each data item created by a call to new.
When a function terminates or we call delete, we have to reclaim the space for reuse.
This must be done dynamically, as the execution of the code proceeds.
This allocation and de-allocation of space is called dynamic storage allocation.
For the variables associated with each function call, we use a stack (discussed earlier).
The stack is not appropriate for dynamically created variables (to be discussed now).

We assume that each program being executed is allocated a contiguous segment of memory locations (in addition to the run-time stack).
From this space we assign segments for the dynamically created variables.
Traditionally, this space is called a heap.
Heap ≡ a messy pile of items of all sizes.
It is different from the data structure heap we used earlier.
(An example of name overloading; do not confuse the two different notions!)

Any operating system also contains a memory management system to handle the memory requests of each process and to deallocate the space of processes that terminate.

Since the allocation/deallocation is dynamic, the heap can look like the following:

[Figure: the heap as a sequence of occupied and free blocks of various sizes.]

Some blocks are not being used at the time: free blocks.
The status of the blocks changes dynamically. In C++:
each call of new ⇔ a call of the memory manager to create a new reserved block of appropriate size;
each call of delete ⇔ a call of the memory manager to turn a reserved block into a free block.
The memory manager must keep track of the blocks so that the creation of a reserved/free block can be done efficiently.
We see the heap as an array of memory locations. It is divided into blocks of various sizes; some blocks are allocated to variables (reserved blocks), the others are free blocks.

Memory request of size n: a call of the memory manager to allocate a block of n consecutive words.

Block allocation policy: the method used by the memory manager to select which free block, or which part of a free block, is given to a memory request.

Failure policy: the policy used by the memory manager when no suitable free block is available for a request.

A memory manager usually puts all free blocks into a linked list: the freelist.
The nodes in the freelist are of variable size.

To a request of size n, the manager must allocate a block of size m with m ≥ n.
If m is close enough to n, then the whole block is allocated.
This creates internal fragmentation: some free space is located inside allocated blocks.
If m is quite a bit larger than n, then the block is split: a block of size n is allocated, and a free block of size m − n is kept in the freelist.
This creates external fragmentation: many free blocks are too small for future requests.
Sequential Fit Method

The free blocks are organized into a doubly linked (circular) list.
Each allocated block contains 3 fields used by the memory manager:
a request of size n creates an occupied block of size n + 3.
These additional fields are used to minimize external fragmentation.
(We assume below that each field is one word, but in practice it can be less.)

Assume we allocate a block of size n to a request of size m:
if n − m > MINEXTRA, then split the free block in two parts:
the first part remains in the free list with a changed size;
the second part is marked allocated, and its address is given to the request;
if n − m < MINEXTRA, then allocate the whole block:
the node is removed from the linked list and marked allocated.

[Figure: the block layouts.
Free block of size m: Tag(+), Size m, Llink, Rlink, . . . , Size m, Tag(+).
Reserved block of size m: Tag(−), Size m, . . . , Size m, Tag(−).]

#define STARTTAG 0
#define SIZE 1
#define PREV 2
#define NEXT 3
#define ENDSIZE 4
#define ENDTAG 5
// The size field stores the actual number of free spaces,
// not including the maintenance fields.

int* allocate(int m) { // return a block of size >= m
  if (m < 3) m = 3; // must be big enough to be a free block later
  int* temp = pick_free_block(m);
  // the address of the block is in temp;
  // it must be at least m+3 units
  if (temp[SIZE] >= m + MINEXTRA) {
    // split the block, save the excess
    int start = temp[SIZE] - m + 3;
    // first unit of the reserved block
    temp[start] = temp[temp[SIZE] + ENDTAG] = RESERVED;
    temp[start+SIZE] = m;
    temp[SIZE] -= m+3; // this much was reserved
    temp[temp[SIZE] + ENDSIZE] = temp[SIZE];
    temp[temp[SIZE] + ENDTAG] = FREE;
    return &temp[start];
  }
  else { // give over the whole block,
         // remove it from the free list
    temp[STARTTAG] = temp[temp[SIZE] + ENDTAG] = RESERVED;
    temp[SIZE] += 3; // for the extra maintenance fields
    // Freelist pointers point directly to the positions
    // of the neighboring blocks in the array MemoryPool.
    MemoryPool[temp[PREV]] = temp[NEXT];
    MemoryPool[temp[NEXT]] = temp[PREV];
    return temp;
  }
}

De-allocation of a block:
If the de-allocated block at address start is adjacent to a free block on either side
(check the tags at temp[start − 1] and at temp[start + SIZE + 3]),
merge it with the one or two free blocks.
If the de-allocated block at address start is not adjacent to a free block on either side, then
change its tag fields and insert it into the list of free nodes as a new node.
Possible allocation strategies

Assume that we have a request of size m. We have to allocate a block of size n with n ≥ m.
(If n − m > MINEXTRA, then we leave in the list a free block of size n − m.)

• First fit: go down the list and allocate the space from the first free block whose size n is larger than m.
Possible disadvantage: it may allocate a large block that would be more suitable for subsequent requests.

• Best fit: go down the list and allocate the space from the free block whose size n is larger than m and for which n − m is the smallest among all free blocks.
Possible disadvantage: it must examine the whole list in most cases and can create bad external fragmentation.

• Worst fit: go down the list and allocate the space from the largest free block.
Possible disadvantage: it must examine the whole list and might fail when some requests are very large.

Which strategy is best? None; it depends on the expected sizes of the requests.
If the expected sizes are not known, first fit is used.
Buddy Method

Assumes the available memory is of size 2^k for some k.
Any free or reserved block is of size 2^i for some i ≤ k.
It keeps k lists of free blocks; list l_i keeps the free blocks of size 2^i.
Initially l_k contains one node; all the other lists are empty.

For a request of size m, we allocate a block of size 2^j where j = ⌈log2 m⌉:
if list l_j is not empty: allocate a block from it;
else if list l_(j+1) is not empty: {
  take a block from it;
  split it into two buddies;
  allocate one of them;
  put the other in l_j;
}
else repeat the process recursively with l_(j+2).

Notice that any two buddies are adjacent and of the same size.

De-allocation is the inverse of allocation.
To free a block of size 2^j:
inspect list l_j;
if the list does not contain the buddy of the block, insert the block in l_j;
else { remove the buddy from l_j;
  merge the two buddies into a block of size 2^(j+1);
  proceed recursively with list l_(j+1);
}

Advantage: less external fragmentation; easy merging of free blocks.
Disadvantage: more internal fragmentation.
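A small illustration of a standard buddy-system property (not from the notes): the buddy of the block at offset x of size 2^j within the pool is found by flipping bit j of x, which is what makes the buddy lookup in the "inspect list l_j" step cheap:

unsigned buddy(unsigned x, unsigned j) {
  return x ^ (1u << j); // the two buddies differ exactly in bit j
}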
Failure policies

A failure policy refers to the actions taken when a memory request cannot be satisfied.

Compaction: move the reserved blocks so that all free blocks are merged.
(We must take care to adjust the addresses of the data items that are moved.)

Garbage collection:
When all blocks are of the same size, the memory manager links all free nodes into a freelist, and for any request a node is detached from the freelist.
No attempt is made to return nodes to the freelist until the memory manager runs out of memory.
When out of memory, the memory manager collects all nodes that are not accessible through the variables in the program.
This is referred to as garbage collection.
It can recover the memory space lost through dangling pointers.

Two typical algorithms:

Reference count:
each node contains a count of the pointers pointing to the node.
When the count of a node is 0, the node can be recovered.
(It cannot handle recursive (circular) references.)
Mark and sweep:
each node contains a mark bit.
1. Visit sequentially all nodes and turn their mark bits off.
2. Turn the mark on in each node reachable through the variables in the program.
3. Visit sequentially all nodes and append all unmarked nodes to the freelist.
This can handle the most general cases, but can be slow.