BIN SORT ALGORITHM (Also called Radix Sort & Bucket Sort)
The algorithm below requires an array of linked lists. Assume doubly linked lists (DLLs) are used for all the
linked lists in this description. All the examples below use positive data. A mix of
positive and negative data seriously complicates the task of writing a correct Bin Sort.
The program begins with the unsorted data in a linked list. In this description, this will
be called the master list. In Java, this could be an array of a reference type instead.
Copies of the pointers in the array could be stored in the bins. Thus the data would
never be moved.
There will also be an array of ten bins 0..9. Each bin will be the head pointer of a
linked list. Thus there are eleven lists in all, the master list and the lists in the 10 bins.
The program begins by breaking apart the master list, and placing each node at the rear
of one of the bin lists. The master list is then reassembled from the 10 bins by linking
them head to tail. This breakdown and reassembly is guided by the data in the nodes,
and continues until the list is sorted.
Assume this is the initial content of the master list with 52 at the head and 40 at the tail.
Master: 52, 94, 56, 32, 78, 3, 17, 92, 47, 39, 8, 28, 80, 53, 25, 58, 82, 21, 12, 40
Part 1. While the master list is not empty:
    Detach the head node from the master list.
    Attach this node to the tail of the bin corresponding to the ones digit. (%10)
    (Move 52 to the 2 bin, then 94 to the 4 bin, and so on.)
Producing:
    Bin    List
    [0]    80, 40
    [1]    21
    [2]    52, 32, 92, 82, 12
    [3]    3, 53
    [4]    94
    [5]    25
    [6]    56
    [7]    17, 47
    [8]    78, 8, 28, 58
    [9]    39
Part 2. Rebuild the master list by attaching the bins head to tail producing:
Master: 80, 40, 21, 52, 32, 92, 82, 12, 3, 53, 94, 25, 56, 17, 47, 78, 8, 28, 58, 39
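One pass of Parts 1 and 2 can be sketched in Java roughly as follows. This is only a minimal sketch under stated assumptions, not the program the description has in mind: it uses java.util.ArrayDeque for the master list and the bins instead of hand-built DLLs, and the names OnePassSketch, onePass and divisor are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

final class OnePassSketch {
    // One distribute-and-reassemble pass over non-negative ints.
    // divisor selects the digit: 1 for the ones digit, 10 for the tens digit.
    static void onePass(ArrayDeque<Integer> master, int divisor) {
        List<ArrayDeque<Integer>> bins = new ArrayList<>();
        for (int b = 0; b < 10; b++) {
            bins.add(new ArrayDeque<>());
        }
        // Part 1: detach the head of the master list and attach it
        // to the rear of the bin chosen by the current digit.
        while (!master.isEmpty()) {
            int value = master.pollFirst();
            bins.get((value / divisor) % 10).addLast(value);
        }
        // Part 2: rebuild the master list from bin 0 through bin 9.
        for (ArrayDeque<Integer> bin : bins) {
            master.addAll(bin);
        }
    }

    public static void main(String[] args) {
        int[] data = {52, 94, 56, 32, 78, 3, 17, 92, 47, 39,
                      8, 28, 80, 53, 25, 58, 82, 21, 12, 40};
        ArrayDeque<Integer> master = new ArrayDeque<>();
        for (int v : data) {
            master.addLast(v);
        }
        onePass(master, 1);          // ones-digit pass
        System.out.println(master);  // starts with 80, 40, 21, 52, ...
    }
}
```

Because addAll copies elements, Part 2 in this sketch costs time proportional to N. Re-linking real list nodes head to tail avoids that copying.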
Repeat Parts 1 & 2 on this list, but use the tens digit (/10 then %10):
After part 1:
    Bin    List
    [0]    3, 8
    [1]    12, 17
    [2]    21, 25, 28
    [3]    32, 39
    [4]    40, 47
    [5]    52, 53, 56, 58
    [6]    (empty)
    [7]    78
    [8]    80, 82
    [9]    92, 94
And after part 2:
Master: 3, 8, 12, 17, 21, 25, 28, 32, 39, 40, 47, 52, 53, 56, 58, 78, 80, 82, 92, 94
Clearly for three digit numbers we would make three passes, etc. Inserting at the
rear of the bin maintains the ones digit ordering while imposing tens digit
ordering.
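The repeated passes can be sketched as a loop over the digits of the largest key. The Java below is a hedged illustration, not a definitive implementation: the class and method names are hypothetical, ArrayDeque stands in for the linked lists, and non-negative int data is assumed throughout.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

final class DecimalBinSortSketch {
    // Sorts non-negative ints by repeating Parts 1 and 2 once per decimal digit.
    static void sort(ArrayDeque<Integer> master) {
        int max = 0;
        for (int v : master) {
            if (v < 0) {
                throw new IllegalArgumentException("non-negative data only");
            }
            max = Math.max(max, v);
        }
        List<ArrayDeque<Integer>> bins = new ArrayList<>();
        for (int b = 0; b < 10; b++) {
            bins.add(new ArrayDeque<>());
        }
        // One pass per digit of the largest key: ones, tens, hundreds, ...
        // divisor is a long so that it cannot overflow for large int keys.
        for (long divisor = 1; max / divisor > 0; divisor *= 10) {
            // Part 1: inserting at the rear keeps the order set by earlier passes.
            while (!master.isEmpty()) {
                int v = master.pollFirst();
                bins.get((int) ((v / divisor) % 10)).addLast(v);
            }
            // Part 2: empty the bins back into the master list, bin 0 first.
            for (ArrayDeque<Integer> bin : bins) {
                while (!bin.isEmpty()) {
                    master.addLast(bin.pollFirst());
                }
            }
        }
    }
}
```

If a bin were filled by inserting at the front instead of the rear, the ordering from the previous pass would be reversed within each bin and the final list would not be sorted.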
Analysis of the Algorithm: (First, realize this is an address calculation sort.)
Part 1 - filling the bins while breaking apart the master list, takes time O(N) assuming
N items in the master list.
Part 2 - reassembling the master list from the bins, takes time O(B) assuming B bins, as
we do not have to scan the contents of the bins to re-link them. Watch out for
empty bins.
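The O(B) reassembly relies on each bin keeping both a head and a tail pointer. A minimal Java sketch of that re-linking, including the empty-bin check, might look like the following. The Node class and the distribute and relink names are assumptions made for illustration, ten bins are assumed, and singly linked nodes are used for brevity where the description assumes DLLs.

```java
final class RelinkSketch {
    static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    // Part 1, O(N): detach each node from the master list and attach it
    // to the rear of its bin. heads and tails must each have length 10.
    static void distribute(Node master, Node[] heads, Node[] tails, int divisor) {
        while (master != null) {
            Node node = master;
            master = master.next;
            node.next = null;
            int b = (node.value / divisor) % 10;
            if (heads[b] == null) {
                heads[b] = node;        // first node in this bin
            } else {
                tails[b].next = node;   // attach at the rear
            }
            tails[b] = node;
        }
    }

    // Part 2, O(B): join the bins head to tail without scanning their contents.
    // Returns the head of the rebuilt master list and leaves every bin empty.
    static Node relink(Node[] heads, Node[] tails) {
        Node masterHead = null;
        Node masterTail = null;
        for (int b = 0; b < heads.length; b++) {
            if (heads[b] == null) {
                continue;               // empty bin: nothing to link
            }
            if (masterHead == null) {
                masterHead = heads[b];
            } else {
                masterTail.next = heads[b];
            }
            masterTail = tails[b];
            heads[b] = null;
            tails[b] = null;
        }
        return masterHead;
    }
}
```

Skipping empty bins matters because an empty bin has no tail node to link from. Dereferencing its null tail is the usual bug in this step.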
The width of the key determines the number of times parts 1 and 2 must be performed.
Let us call it the number of passes (P). Thus the total time required for the Bin Sort is:
O { P(N+B) }
The list of N items is divided into B bins ( O(N) ), then reassembled
into one list ( O(B) ). These two steps are repeated P times.
The speed of the sort is dependent upon the sizes of P and B relative to N. Remember
the fast comparison sorts are O{ N * log(N) }, so if B is relatively small and P is on the
order of log(N) this sort can be very fast.
How many bins should you use?
P(N+B) can be your guide.
Notice first, there is a tradeoff between the number of bins and the number of
passes. Suppose you are writing a bin sort of 500 positive integers all less than
1,000,000. You could use 6 passes and 10 bins or perhaps 3 passes and 100 bins, or
perhaps 2 passes and 1000 bins or even 1 pass and 1,000,000 bins. Using our formula
of P(N+B) we get the following predictors (DSPP means digits sorted per pass):
    Plan                                 P(N+B)                              Predictor      DSPP
    N = 500, P = 6, B = 10               P(N+B) = 6(500+10)                  = 3,060        1
    N = 500, P = 3, B = 100              P(N+B) = 3(500+100)                 = 1,800        2
    N = 500, P = 2, B = 1,000            P(N+B) = 2(500+1,000)               = 3,000        3
    N = 500, P = 1, B = 1,000,000        P(N+B) = 1(500+1,000,000)           = 1,000,500    6
With only 500 numbers, 100 bins appears to be the best choice. The fourth option,
where B is much larger than N, shows what happens when the tail wags the dog.
On the other hand, assume there are now 10,000 numbers in the same range.
    Plan                                 P(N+B)                              Predictor
    N = 10,000, P = 6, B = 10            P(N+B) = 6(10,000+10)               = 60,060
    N = 10,000, P = 3, B = 100           P(N+B) = 3(10,000+100)              = 30,300
    N = 10,000, P = 2, B = 1,000         P(N+B) = 2(10,000+1,000)            = 22,000
    N = 10,000, P = 1, B = 1,000,000     P(N+B) = 1(10,000+1,000,000)        = 1,010,000
Now, the third alternative, with 1000 bins, looks to be the best. In general, the table
entry with the largest B less than N usually generates the best predictor.
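The decimal predictor tables can be reproduced with a few lines of Java. This is a sketch for experimenting with the tradeoff, not a timing model: the class name, the predictor method and its parameters are hypothetical, and keys are assumed to be non-negative integers below maxKeyExclusive.

```java
final class DecimalPlanSketch {
    // Predictor P(N + B) for sorting n keys in the range 0..maxKeyExclusive-1,
    // sorting digitsPerPass decimal digits on each pass.
    static long predictor(long n, long maxKeyExclusive, int digitsPerPass) {
        // Number of decimal digits in the largest possible key.
        int digits = Long.toString(maxKeyExclusive - 1).length();
        // Passes needed: ceiling of digits / digitsPerPass.
        long passes = (digits + digitsPerPass - 1) / digitsPerPass;
        // Bins needed: 10 raised to the power digitsPerPass.
        long bins = 1;
        for (int i = 0; i < digitsPerPass; i++) {
            bins *= 10;
        }
        return passes * (n + bins);
    }

    public static void main(String[] args) {
        for (long n : new long[] {500, 10_000}) {
            for (int dspp : new int[] {1, 2, 3, 6}) {
                System.out.println("N = " + n + ", DSPP = " + dspp
                        + ", P(N+B) = " + predictor(n, 1_000_000, dspp));
            }
        }
    }
}
```

For example, predictor(500, 1_000_000, 2) returns 1800 and predictor(10_000, 1_000_000, 3) returns 22000, the best entries for the two values of N considered above.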
Actually, you should make the number of bins a power of 2 in
most cases, as this allows the use of bitwise operations RIGHT
SHIFT and AND. They are faster than divide and modulo.
When the number of bins is not a power of 10, it is usually called a Radix
Sort, but the algorithm is the same. A power-of-2 bin count lets you use RIGHT SHIFT (>>) and
AND (&) operations rather than DIVIDE (/) and MODULO (%). Again suppose you
need to sort 500 integers each under a million. Integers are generally stored as 32 bits.
You could use 16 bins and sort 4 bits per pass for 8 passes. Alternatively, you could use
256 bins, and sort 8 bits per pass for 4 passes. Suppose you decide to use 16 bins and
sort 4 bits per pass. Here is how the bitwise operations work. Instead of
MODULO(%) you can mask off all but the lower four bits of an integer using the
AND(&) operation. Instead of DIVIDE(/) you can shift an integer over four bits using
the RIGHTSHIFT(>>) operation.
    int lower4bits  = theInt & 0x0000000f;
    int upper28bits = theInt >> 4;
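The same two operations generalize to any pass and any power-of-2 bin count. The helper below is a small Java sketch of that idea. The names BitDigitSketch, digit, pass and bitsPerPass are hypothetical, and it assumes bitsPerPass is less than 32 and pass * bitsPerPass is less than 32.

```java
final class BitDigitSketch {
    // Returns the group of bitsPerPass bits used on the given pass,
    // where pass 0 is the least significant group.
    static int digit(int key, int pass, int bitsPerPass) {
        int mask = (1 << bitsPerPass) - 1;              // e.g. 0xF for 4 bits
        return (key >>> (pass * bitsPerPass)) & mask;   // shift, then mask
    }

    public static void main(String[] args) {
        int key = 0xABCD;
        // With 4 bits per pass the bins chosen are D, C, B, A in that order.
        for (int pass = 0; pass < 4; pass++) {
            System.out.println(Integer.toHexString(digit(key, pass, 4)));
        }
    }
}
```

For non-negative keys this gives the same bin number as dividing by 16 raised to the pass number and then taking the remainder modulo 16.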
32 bit plans. 500 positive ints under 1M sorted using 16 and 256 bins (BSPP means bits sorted per pass):
    Plan                         P(N+B)                   Predictor    BSPP
    N = 500, P = 8, B = 16       P(N+B) = 8(500+16)       = 4,128      4
    N = 500, P = 4, B = 256      P(N+B) = 4(500+256)      = 3,024      8
There is nothing wrong with this analysis, but we might be able to do better still. As we
know all the numbers are positive and less than 1,000,000, and as 2^20 = 1,048,576, we
can produce a 20 bit plan. Any of the following will work: 5 passes over 16 bins
(sorting 4 bits per pass), 4 passes over 32 bins (sorting 5 bits per pass), or 2 passes over
1024 bins (sorting 10 bits per pass).
20 bit plans. 500 positive ints under 1M sorted using 16, 32, and 1024 bins:
    Plan                         P(N+B)                   Predictor    BSPP
    N = 500, P = 5, B = 16       P(N+B) = 5(500+16)       = 2,580      4
    N = 500, P = 4, B = 32       P(N+B) = 4(500+32)       = 2,128      5
    N = 500, P = 2, B = 1024     P(N+B) = 2(500+1024)     = 3,048      10
Maximum cleverness: a 21 bit plan. 3 passes over 128 bins:
    Plan                         P(N+B)                   Predictor    BSPP
    N = 500, P = 3, B = 128      P(N+B) = 3(500+128)      = 1,884      7
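The binary plans can be tabulated the same way. The Java below is a sketch under stated assumptions: the names BinaryPlanSketch, significantBits and predictor are hypothetical, and keys are assumed to be non-negative ints. It shows how the bit width of the largest key, rather than the 32-bit storage size, drives the number of passes.

```java
final class BinaryPlanSketch {
    // Number of bits needed to represent every key in 0..maxKey.
    static int significantBits(int maxKey) {
        return 32 - Integer.numberOfLeadingZeros(maxKey);
    }

    // Predictor P(N + B) when keyBits bits are sorted bitsPerPass at a time.
    static long predictor(long n, int keyBits, int bitsPerPass) {
        long passes = (keyBits + bitsPerPass - 1) / bitsPerPass; // ceiling
        long bins = 1L << bitsPerPass;                           // 2^bitsPerPass
        return passes * (n + bins);
    }

    public static void main(String[] args) {
        int bits = significantBits(999_999);   // 20, since 2^20 = 1,048,576
        for (int bspp : new int[] {4, 5, 7, 10}) {
            System.out.println("BSPP = " + bspp
                    + ", P(N+B) = " + predictor(500, bits, bspp));
        }
    }
}
```

Three passes of 7 bits cover 21 bits, one more than the 20 actually needed, which is why the 7-bit plan is described as a 21 bit plan.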
Clearly, exploiting the fact that 12 of the 32 bits are all zero can speed up the sort. The
worst option in the 20 bit table is essentially the same as the best option in the 32 bit
table. Note that comparing the two binary tables is reasonable, but comparing a binary
table to a decimal table is not. If you look only at the best choices in the tables above,
you get 1800 from the DECIMAL table and 1884 from the BINARY tables. You might
conclude from this, erroneously, that the DECIMAL scheme is a little better.
Remember the DECIMAL scheme requires DIVIDE and MOD operations, while the
BINARY scheme requires RIGHT SHIFT and AND operations, which are much faster
operations on most computers. Thus, these predictor tables are to be considered a valid
guide only within the same table or table radix. They are not a direct measure of how
long it will take to sort the data.
Can we sort an array of records rather than a linked list? Use a linked list of pointers.
Clearly, the Bin Sort works best if sorting a linked list of records as the data is
never moved. If the data is stored instead in an array of records, then create a linked list
of pointers to those records, and perform the Bin Sort on that list instead. This is quite
straightforward in Java as an array of any reference type is already an array of pointers.
The data can then be output through the rearranged pointer list, or the array can be
rearranged using the pointer list. I saw one book that performed a bin sort on an array
of records using 10 bins each as big as the original array, but that is not reasonable.
There is a file on the website describing the construction of a pointer sort. There are
also example programs on the website that perform pointer sorts.
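A pointer sort of this kind might be sketched in Java as follows. It is not one of the website programs mentioned above. The Rec type, its fields and the sortedView method are hypothetical, the bins are java.util.ArrayList objects holding references rather than hand-built linked lists, and 16 bins with 4 bits per pass are assumed. Keys are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.List;

final class PointerSortSketch {
    // Hypothetical record type; only the key takes part in the sort.
    static final class Rec {
        final int key;
        final String payload;
        Rec(int key, String payload) {
            this.key = key;
            this.payload = payload;
        }
    }

    // Returns references to the same Rec objects in ascending key order.
    // The records array is not modified and no record is copied.
    static Rec[] sortedView(Rec[] records) {
        Rec[] view = records.clone();          // copies references only
        int max = 0;
        for (Rec r : view) {
            if (r.key < 0) {
                throw new IllegalArgumentException("non-negative keys only");
            }
            max = Math.max(max, r.key);
        }
        List<List<Rec>> bins = new ArrayList<>();
        for (int b = 0; b < 16; b++) {
            bins.add(new ArrayList<>());
        }
        // One pass per 4-bit group, stopping once the remaining bits are all zero.
        for (int shift = 0; shift < 32 && (max >>> shift) != 0; shift += 4) {
            for (Rec r : view) {
                bins.get((r.key >>> shift) & 0xF).add(r);   // rear insert
            }
            int i = 0;
            for (List<Rec> bin : bins) {
                for (Rec r : bin) {
                    view[i++] = r;
                }
                bin.clear();
            }
        }
        return view;
    }
}
```

The data can then be output by walking the returned array of references, leaving the original array in its original order.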
Can strings be sorted using the Bin Sort?
Yes, but there is a caveat.
A 12 character name could be processed in 12 passes, but there must be at least
26 bins (more if the name may contain non-letters). Processing two characters at a time
would require at least 676 bins (26*26). We have seen that passes over integers in the
bin sort move from the least significant digit to the most significant. This means that a
bin sort of strings must move right to left using the same index on each string. Shorter
strings must be filled out with NULLs to reach the length of the longest, or the extractor
must simulate this so that all the strings appear to be the same length.
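The right-to-left string version, with an extractor that simulates NULL padding, can be sketched in Java like this. It is a minimal illustration under stated assumptions: every character is assumed to have a code below 256, so 256 bins are used rather than 26, bin 0 receives the simulated padding, and the names StringBinSortSketch, charAt and sort are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class StringBinSortSketch {
    // The extractor: the character at index i, or 0 (NULL) when the
    // string is too short, so all strings appear to be the same length.
    static int charAt(String s, int i) {
        return i < s.length() ? s.charAt(i) : 0;
    }

    // Sorts the array in place, one pass per character position,
    // moving from the rightmost position to the leftmost.
    static void sort(String[] a) {
        int width = 0;
        for (String s : a) {
            width = Math.max(width, s.length());
        }
        List<List<String>> bins = new ArrayList<>();
        for (int b = 0; b < 256; b++) {
            bins.add(new ArrayList<>());
        }
        for (int pos = width - 1; pos >= 0; pos--) {
            for (String s : a) {
                bins.get(charAt(s, pos)).add(s);   // rear insert keeps earlier order
            }
            int i = 0;
            for (List<String> bin : bins) {
                for (String s : bin) {
                    a[i++] = s;
                }
                bin.clear();
            }
        }
    }
}
```

With N = 6 short names and B = 256 the predictor P(N+B) is dominated by B, so this layout only pays off when there are many more strings than bins.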