SEARCHING
=========

Searching is an information retrieval process. The information may be required for inspection, or it may be an integral part of the process of manipulating a data structure. In either case, searching can be a time-consuming part of any algorithm, and emphasis should be placed on providing effective searching mechanisms. Three major methods for searching will be considered:

1. sequential searching
2. binary searching
3. searching by hashing, which will be dealt with near the end of the course

Searching usually focuses on finding a specific item which identifies the information required. This item is called a key, and it identifies the information required, such as a record, an array, etc. Since large databases commonly have multiple records, the key is used to identify a specific record -- a process much simpler than searching the records themselves to find the information required. If the records reside in memory, the search is internal; if they are too extensive, they may reside on disk or tape, in which case the search is an external one. This discussion will concern itself with internal searches.

Sequential Searching
====================

A sequential search is one in which the search begins at the start of the data and moves sequentially through the list. At each step, the key in the data is compared to the key to be found. The search ceases when the key is found or the data is exhausted. The following steps are therefore involved (a code sketch appears at the end of this section):

- locate the first data element
- compare the key to the appropriate part of the data
- if this is the desired element, return the position of the element
- if this is not the desired element, obtain the next one if it exists
- if the desired element does not exist in the data, return a default value

The complexity depends on the size of the list. The worst case for the search is that the key is not found; at that point the whole list has been traversed, so if the list has n elements, the algorithm is O(n). The best case is also simple: the key is found in the first element, so it is O(1). The average case is n/2. What this says is that, on average, the key will be found in the first half of the list -- not an unreasonable expectation. NOTE that the performance of the algorithm does not depend on the order of the keys in the list. On average, a key will be found in n/2 tries, and it will take n tries to discover that the key is not in the list.

If some information is known about the data and about which keys are used most often, then some improvements can be made. For example, if the search is usually made for a key with a large value, and the list is in decreasing order, then the chances of finding the key early in the search are improved. However, if the key does not occur, the whole list must still be searched. For ordered lists, an additional comparison can be made to see if the key value has been passed. In this case, the search efficiency can be improved because the list is only processed while there is still a chance of finding the key. However, if the key doesn't exist, this version can make up to 2n comparisons instead of n. In other words, the savings are not necessarily worth the effort unless most of the searches are for keys in a range close to the beginning of the list. The worst case is still O(n) in all cases, even though the number of comparisons could be doubled. (A sketch of this early-exit variant also appears below.)
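As a concrete illustration of the steps listed above, here is a minimal C sketch of a sequential search over an array of integer keys. The function name and parameters are illustrative only; they are not part of the original notes.

    /* Sequential search: return the position of key in x[0..n-1],
       or -1 (a default value) if the key is not present. */
    int seq_search(const int x[], int n, int key)
    {
        for (int i = 0; i < n; i++)   /* move through the list in order */
            if (x[i] == key)
                return i;             /* desired element found */
        return -1;                    /* data exhausted: key not found */
    }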
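The early-exit refinement for ordered lists might look like the following sketch (again, the name is hypothetical). Note the second comparison on each element; this is where the doubled comparison count comes from.

    /* Sequential search on a list in ascending order: stop as soon as
       the stored keys pass the search key, since the key can no longer
       occur further down the list. */
    int seq_search_ordered(const int x[], int n, int key)
    {
        for (int i = 0; i < n; i++) {
            if (x[i] == key) return i;   /* found */
            if (x[i] > key)  break;      /* passed where the key would be */
        }
        return -1;                       /* key is not in the list */
    }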
Notice that we have looked at two features -- the number of items to be accessed and the number of comparisons that must be made. This will be a usual procedure when we look at algorithms and data structures. We may also be interested in other features as well, but these two are common to most analyses.

Binary Searches
===============

When the list is ordered, a sequential search may become more efficient, but this does not decrease the computational complexity, nor does it guarantee that the search will be more efficient. In practice, most data structures are in some order; in fact, ordering can become one of the requirements for using certain types of searches. This is true of the binary search, which can be much more efficient than a sequential search.

A binary search is one in which the search is on an ordered list, either in ascending or descending order (although this must be known before doing the search). The list is successively divided in half so that after each division the key, if present, is in the portion which remains. This effectively cuts the search space in half at each comparison.

There are two major ways in which binary search is done. The first is called a dumb algorithm. The algorithm consists of defining top and bottom and finding the midpoint. If the key is greater than the midpoint element, then the bottom becomes the midpoint plus 1; otherwise the top becomes the midpoint. This continues until only one element remains, and that element is checked to see if it is equal to the key. Typical code for an array implementation is shown below:

    int binary_search(const int x[], int n, int key)
    {
        int top = n - 1, bot = 0, mid;
        if (top < 0) return -1;       /* make sure the array has some elements! */
        while (top > bot) {
            mid = (top + bot) / 2;
            if (x[mid] < key) bot = mid + 1;
            else top = mid;
        }
        if (x[top] == key) return top;
        else return -1;
    }

A second method uses the same procedure to successively reduce the size of the list being searched, but with one difference: when a midpoint is chosen, it might itself be equal to the key, so the algorithm checks for this. This version can terminate sooner than the first if a midpoint equals the key, rather than continuing to reduce the search list until only one element remains. On the other hand, each time the list is divided, an additional check must be done.

    int binary_search_check(const int x[], int n, int key)
    {
        int top = n - 1, bot = 0, mid;
        if (top < 0) return -1;       /* make sure the array has some elements! */
        while (top > bot) {
            mid = (top + bot) / 2;
            if (x[mid] == key) return mid;   /* extra check: may stop early */
            if (x[mid] < key) bot = mid + 1;
            else top = mid;
        }
        if (x[top] == key) return top;
        else return -1;
    }

The analysis of the binary search algorithm is not as obvious as that of sequential search. Essentially, the while loop is the critical feature of the code, since it determines how many iterations there will be; everything else is O(1). The variable mid is the critical part: its function is to divide the list in half, because we only consider one of the two remaining portions, and the midpoint is the point which cuts the total into two equal pieces. Halving the list repeatedly until one element remains is a log2 n operation, so the binary search has complexity O(log2 n). UNLESS OTHERWISE STATED, ALL LOGS WILL BE ASSUMED TO BE BASE 2 LOGS AND WILL BE WRITTEN AS LG or log2.
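As a quick check that both versions behave as described, here is a minimal test driver; it is a sketch only, assuming the function names used in the fragments above (binary_search and binary_search_check).

    #include <stdio.h>

    /* binary_search and binary_search_check as defined above */

    int main(void)
    {
        int x[] = {2, 3, 5, 7, 11, 13, 17, 19};   /* must be ordered */
        int n = sizeof x / sizeof x[0];

        printf("%d\n", binary_search(x, n, 11));        /* prints 4 */
        printf("%d\n", binary_search_check(x, n, 11));  /* prints 4 */
        printf("%d\n", binary_search(x, n, 6));         /* prints -1: absent */
        return 0;
    }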
In comparing the two methods, the only difference is whether one check or two are made at each choice of a new midpoint. In the first case, the algorithm continues through the whole process even when the first midpoint is the required element, so the best, average, and worst cases are all O(log2 n). In the second algorithm, the first midpoint chosen may be the required one, so the best case is O(1). The worst case is still O(log2 n), when the required element is the last midpoint chosen, but the number of comparisons has been doubled. The average case is close to log2 n, which means that most of the time you will be doubling the number of comparisons almost lg n times.

It is worth looking at what happens with a list of 1 million items. The chance that you will hit the item you're looking for on the first try is 1 in 1,000,000; then it is 1 in 500,000, 1 in 250,000, 1 in 125,000, and so on. It is not until the list is small, say 4 elements long, that the odds become really good. In fact, with 2 items left in the list you still have only a 1 in 2 chance of getting the right element. If we consider just how quickly large amounts of data are reduced by a log2 n algorithm, it really doesn't make much sense to put in the extra checks and make the algorithm longer. The idea is good but the pay-off is poor.

Searches can be made on arrays or on linked lists, but in the case of linked lists, a binary search is not a reasonable approach. Notice that we never get something for nothing. The binary search method requires that the list be ordered, and if we use dynamic allocation for each item, the binary search effectively becomes a sequential search. Provided that we know how many items there are in the list, finding the midpoint in an array implementation is a simple calculation (midpoint = bottom + (top - bottom) div 2). In the linked representation, the task is different. Even if we assume that we know how many elements there are and we have pointers to the first and last elements, finding the midpoint is not done through a simple calculation. Suppose that the key is in the last element. To find the first midpoint we must traverse total div 2 elements. The next midpoint is total div 4 beyond the current midpoint, the next is total div 8 further down the list, and so on. Therefore, by the time we get to the end we have traversed all n elements, so the binary search is actually O(n). With the array implementation, only lg n elements were examined, but in the linked list n elements may have to be examined, and even in the best case n/2 elements will have to be examined if the search quits when the midpoint equals the key. A good algorithm (like a binary search) combined with the wrong data structure produces a poor implementation; the sketch below makes the cost concrete.
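The following sketch illustrates why binary search degenerates on a linked list. The node type, the helper, and the traversal counter are hypothetical illustrations, not part of the original notes; the point is simply that each halving still walks half of the remaining sublist, so the links followed add up to roughly n.

    #include <stdio.h>
    #include <stdlib.h>

    struct node { int key; struct node *next; };

    static long links_followed = 0;   /* counts every pointer we chase */

    /* Walk 'steps' links forward -- this is the hidden cost on a list. */
    static struct node *advance(struct node *p, long steps)
    {
        while (steps-- > 0) { p = p->next; links_followed++; }
        return p;
    }

    /* Binary search over a sorted singly linked list of length n. */
    static struct node *list_bsearch(struct node *head, long n, int key)
    {
        struct node *bot = head;
        while (n > 1) {
            long half = n / 2;
            struct node *mid = advance(bot, half);   /* walk to the midpoint */
            if (mid->key <= key) { bot = mid; n -= half; }
            else n = half;
        }
        return (bot != NULL && bot->key == key) ? bot : NULL;
    }

    int main(void)
    {
        const long n = 1000;                      /* modest stand-in for 1 million */
        struct node *head = NULL, *tail = NULL;
        for (long i = 0; i < n; i++) {            /* build a sorted list 0..n-1 */
            struct node *p = malloc(sizeof *p);
            p->key = (int)i;
            p->next = NULL;
            if (tail) tail->next = p; else head = p;
            tail = p;
        }
        struct node *hit = list_bsearch(head, n, (int)(n - 1));  /* key in last node */
        printf("found=%d  links followed=%ld  (n=%ld, lg n is about 10)\n",
               hit != NULL, links_followed, n);
        return 0;   /* cleanup of the list omitted for brevity */
    }

Searching for the last key forces the search into the upper half each time, so the links followed come to n/2 + n/4 + n/8 + ..., which is close to n; an array implementation would examine only about lg n elements.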