BIN SORT ALGORITHM (Also called Radix Sort & Bucket Sort)

The algorithm below requires an array of linked lists. Assume DLLs are used for all the linked lists in this description. All the examples below use positive data. A mix of positive and negative data seriously complicates the task of writing a correct Bin Sort.

The program begins with the unsorted data in a linked list. In this description, this will be called the master list. In Java, this could be an array of a reference type instead. Copies of the pointers in the array could be stored in the bins. Thus the data would never be moved.

There will also be an array of ten bins 0..9. Each bin will be the head pointer of a linked list. Thus there are eleven lists in all: the master list and the lists in the 10 bins.

The program begins by breaking apart the master list and placing each node at the rear of one of the bin lists. The master list is then reassembled from the 10 bins by linking them head to tail. This breakdown and reassembly is guided by the data in the nodes, and continues until the list is sorted.

Assume this is the initial content of the master list, with 52 at the head and 40 at the tail.

Master: 52, 94, 56, 32, 78, 3, 17, 92, 47, 39, 8, 28, 80, 53, 25, 58, 82, 21, 12, 40

Part 1.
While the master list is not empty:
    Detach the head node from the master list.
    Attach this node to the tail of the bin corresponding to the ones digit (%10).
(Move 52 to the 2 bin, then 94 to the 4 bin, and so on.)

Producing:

Bin   List
[0]   80, 40
[1]   21
[2]   52, 32, 92, 82, 12
[3]   3, 53
[4]   94
[5]   25
[6]   56
[7]   17, 47
[8]   78, 8, 28, 58
[9]   39

Part 2.
Rebuild the master list by attaching the bins head to tail, producing:

Master: 80, 40, 21, 52, 32, 92, 82, 12, 3, 53, 94, 25, 56, 17, 47, 78, 8, 28, 58, 39

Repeat Parts 1 & 2 on this list, but use the tens digit (/10 then %10):

After part 1:

Bin   List
[0]   3, 8
[1]   12, 17
[2]   21, 25, 28
[3]   32, 39
[4]   40, 47
[5]   52, 53, 56, 58
[6]   (empty)
[7]   78
[8]   80, 82
[9]   92, 94

And after part 2:

Master: 3, 8, 12, 17, 21, 25, 28, 32, 39, 40, 47, 52, 53, 56, 58, 78, 80, 82, 92, 94

Clearly for three digit numbers we would make three passes, etc. Inserting at the rear of the bin maintains the ones digit ordering while imposing tens digit ordering.

Analysis of the Algorithm:

(First, realize this is an address calculation sort.)

Part 1, filling the bins while breaking apart the master list, takes time O(N), assuming N items in the master list.

Part 2, reassembling the master list from the bins, takes time O(B), assuming B bins, as we do not have to scan the contents of the bins to re-link them. Watch out for empty bins.

The width of the key determines the number of times parts 1 and 2 must be performed. Let us call it the number of passes (P). Thus the total time required for the Bin Sort is:

O(P(N+B))

The list of N items is divided into B bins (O(N)), then reassembled into one list (O(B)). These two steps are repeated P times.

The speed of the sort is dependent upon the sizes of P and B relative to N. Remember the fast comparison sorts are O(N log(N)), so if B is relatively small and P is at most on the order of log(N), this sort can be very fast.

How many bins should you use?

P(N+B) can be your guide. Notice first, there is a tradeoff between the number of bins and the number of passes. Suppose you are writing a bin sort of 500 positive integers all less than 1,000,000.
You could use 6 passes and 10 bins, or perhaps 3 passes and 100 bins, or perhaps 2 passes and 1,000 bins, or even 1 pass and 1,000,000 bins. Using our formula of P(N+B) we get the following predictors (DSPP means digits sorted per pass):

DSPP   N     P   B           P(N+B)
1      500   6   10          6(500+10)        = 3,060
2      500   3   100         3(500+100)       = 1,800
3      500   2   1,000       2(500+1,000)     = 3,000
6      500   1   1,000,000   1(500+1,000,000) = 1,000,500

With only 500 numbers, 100 bins appears to be the best choice. The fourth option, where B is much larger than N, shows what happens when the tail wags the dog.

On the other hand, assume there are now 10,000 numbers in the same range.

N        P   B           P(N+B)
10,000   6   10          6(10,000+10)        = 60,060
10,000   3   100         3(10,000+100)       = 30,300
10,000   2   1,000       2(10,000+1,000)     = 22,000
10,000   1   1,000,000   1(10,000+1,000,000) = 1,010,000

Now the third alternative, with 1,000 bins, looks to be the best. In general, the table entry with the largest B less than N usually generates the best predictor.

Actually, you should make the number of bins a power of 2 in most cases, as this allows you to use the bitwise operations RIGHT SHIFT(>>) and AND(&) rather than DIVIDE(/) and MODULO(%). They are faster than divide and modulo. When the number of bins is not a power of 10, it is usually called a Radix Sort, but the algorithm is the same.

Again suppose you need to sort 500 integers each under a million. Integers are generally stored as 32 bits. You could use 16 bins and sort 4 bits per pass for 8 passes. Alternatively, you could use 256 bins and sort 8 bits per pass for 4 passes.

Suppose you decide to use 16 bins and sort 4 bits per pass. Here is how the bitwise operations work. Instead of MODULO(%) you can mask off all but the lower four bits of an integer using the AND(&) operation.
Instead of DIVIDE(/) you can shift an integer over four bits using the RIGHT SHIFT(>>) operation.

    int lower4bits  = theInt & 0x0000000f;
    int upper28bits = theInt >> 4;

32 bit plans. 500 positive integers under 1M sorted using 16 and 256 bins (BSPP means bits sorted per pass):

BSPP   N     P   B     P(N+B)
4      500   8   16    8(500+16)  = 4,128
8      500   4   256   4(500+256) = 3,024

There is nothing wrong with this analysis, but we might be able to do better still. As we know all the numbers are positive and less than 1,000,000, and as 2^20 = 1,048,576, we can produce a 20 bit plan. Any of the following will work: 5 passes over 16 bins (sorting 4 bits per pass), 4 passes over 32 bins (sorting 5 bits per pass), or 2 passes over 1024 bins (sorting 10 bits per pass).

20 bit plans. 500 positive integers under 1M sorted using 16, 32, and 1024 bins:

BSPP   N     P   B      P(N+B)
4      500   5   16     5(500+16)   = 2,580
5      500   4   32     4(500+32)   = 2,128
10     500   2   1024   2(500+1024) = 3,048

Maximum cleverness: a 21 bit plan. 3 passes over 128 bins:

BSPP   N     P   B     P(N+B)
7      500   3   128   3(500+128) = 1,884

Clearly, exploiting the fact that 12 of the 32 bits are all zero can speed up the sort. The worst option in the 20 bit table is essentially the same as the best option in the 32 bit table.

Note that comparing the two binary tables is reasonable, but comparing a binary table to a decimal table is not. If you look only at the best choices in the tables above, you get 1,800 from the DECIMAL table and 1,884 from the BINARY tables. You might conclude from this, erroneously, that the DECIMAL scheme is a little better. Remember the DECIMAL scheme requires DIVIDE and MOD operations, while the BINARY scheme requires RIGHT SHIFT and AND operations, which are much faster operations on most computers. Thus, these predictor tables are to be considered a valid guide only within the same table or table radix. They are not a direct measure of how long it will take to sort the data.
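As an illustration of the binary scheme, here is a minimal Java sketch of the two-part bin sort using RIGHT SHIFT and AND to pick the bin. It is only one possible arrangement, not the course's reference program. The names BinSortSketch, Node, binSort, sortArray and predictor are invented for this example. It assumes non-negative data that fits in bitsPerPass * passes bits (at most 31 bits in total), and it uses singly linked nodes with a tail pointer per bin rather than DLLs, which is enough to relink the bins without scanning them.

```java
import java.util.Arrays;

/** Sketch only: binary bin (radix) sort over a linked master list. */
class BinSortSketch {
    /** Minimal list node (hypothetical name, for illustration). */
    static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    /** The P(N+B) predictor from the tables; a guide, not a timing. */
    static long predictor(long n, long passes, long bins) {
        return passes * (n + bins);
    }

    /** Assumes non-negative values that fit in bitsPerPass * passes bits. */
    static Node binSort(Node master, int bitsPerPass, int passes) {
        int binCount = 1 << bitsPerPass;      // B, a power of two
        int mask = binCount - 1;              // AND with this replaces MODULO
        Node[] heads = new Node[binCount];
        Node[] tails = new Node[binCount];

        for (int pass = 0; pass < passes; pass++) {
            int shift = pass * bitsPerPass;   // RIGHT SHIFT replaces DIVIDE

            // Part 1: detach each head node, attach it at the rear of its bin. O(N)
            while (master != null) {
                Node node = master;
                master = master.next;
                node.next = null;
                int bin = (node.value >> shift) & mask;
                if (heads[bin] == null) {
                    heads[bin] = node;
                } else {
                    tails[bin].next = node;
                }
                tails[bin] = node;
            }

            // Part 2: relink the bins head to tail, skipping empty bins. O(B)
            Node masterTail = null;
            for (int bin = 0; bin < binCount; bin++) {
                if (heads[bin] == null) continue;
                if (master == null) {
                    master = heads[bin];
                } else {
                    masterTail.next = heads[bin];
                }
                masterTail = tails[bin];
                heads[bin] = null;
                tails[bin] = null;
            }
        }
        return master;
    }

    /** Convenience wrapper: build a master list from an array, sort, copy out. */
    static int[] sortArray(int[] data, int bitsPerPass, int passes) {
        Node head = null;
        Node tail = null;
        for (int v : data) {
            Node n = new Node(v);
            if (head == null) head = n; else tail.next = n;
            tail = n;
        }
        head = binSort(head, bitsPerPass, passes);
        int[] out = new int[data.length];
        int i = 0;
        for (Node n = head; n != null; n = n.next) out[i++] = n.value;
        return out;
    }

    public static void main(String[] args) {
        int[] master = {52, 94, 56, 32, 78, 3, 17, 92, 47, 39,
                        8, 28, 80, 53, 25, 58, 82, 21, 12, 40};
        // The 21 bit plan: 3 passes over 128 bins (7 bits per pass).
        System.out.println(Arrays.toString(sortArray(master, 7, 3)));
        System.out.println(predictor(500, 3, 128));
    }
}
```

Calling sortArray(data, 4, 5) or sortArray(data, 10, 2) would exercise the 20 bit plans instead. Negative values are not handled, in keeping with the positive-data assumption used throughout this description.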
Can we sort an array of records rather than a linked list? Use a linked list of pointers.

Clearly, the Bin Sort works best if sorting a linked list of records, as the data is never moved. If the data is stored instead in an array of records, then create a linked list of pointers to those records, and perform the Bin Sort on that list instead. This is quite straightforward in Java, as an array of any reference type is already an array of pointers. The data can then be output through the rearranged pointer list, or the array can be rearranged using the pointer list. I saw one book that performed a bin sort on an array of records using 10 bins each as big as the original array, but that is not reasonable.

There is a file on the website describing the construction of a pointer sort. There are also example programs on the website that perform pointer sorts.

Can strings be sorted using the Bin Sort?

Yes, but there is a caveat. A 12 character name could be processed in 12 passes, but there must be at least 26 bins (more if the name may contain non-letters). Processing two characters at a time would require at least 676 bins (26*26).

We have seen that passes over integers in the bin sort move from the least significant digit to the most significant. This means that a bin sort of strings must move right to left, using the same index on each string. Shorter strings must be filled out with NULLs to reach the length of the longest, or the extractor must simulate this so that all the strings appear to be the same length.
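The right-to-left string pass, with the extractor simulating the NULL padding, might look like the following Java sketch. Everything named here (StringBinSortSketch, binFor, binSort) is invented for illustration. It assumes the strings contain only the lowercase letters 'a'..'z', and it uses 27 bins: the 26 letter bins plus one extra bin 0 for the simulated pad, so that a shorter string sorts ahead of any longer string sharing its prefix. For brevity the bins are ArrayLists, so reassembly copies references instead of relinking nodes in O(B).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: bin sort of lowercase strings, one character per pass. */
class StringBinSortSketch {
    // Bin 0 holds the simulated NULL pad; bins 1..26 hold 'a'..'z'.
    static final int BINS = 27;

    /** The extractor: pretends every string is as long as the longest one. */
    static int binFor(String s, int index) {
        if (index >= s.length()) return 0;            // simulated NULL padding
        char c = s.charAt(index);
        if (c < 'a' || c > 'z') {
            throw new IllegalArgumentException("only a..z handled in this sketch");
        }
        return c - 'a' + 1;
    }

    static List<String> binSort(List<String> master) {
        int maxLen = 0;
        for (String s : master) maxLen = Math.max(maxLen, s.length());

        List<List<String>> bins = new ArrayList<>();
        for (int b = 0; b < BINS; b++) bins.add(new ArrayList<>());

        List<String> current = new ArrayList<>(master);
        // Move right to left, using the same index on every string.
        for (int index = maxLen - 1; index >= 0; index--) {
            // Part 1: append each string to the rear of its bin.
            for (String s : current) bins.get(binFor(s, index)).add(s);
            // Part 2: rebuild the master list from the bins in order.
            current.clear();
            for (List<String> bin : bins) {
                current.addAll(bin);
                bin.clear();
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("bob", "al", "alice", "bo", "carol", "a", "alan");
        System.out.println(binSort(names));
    }
}
```

Names containing uppercase letters, spaces or punctuation would need more bins and a different extractor, as noted above.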