SEARCHING
Searching is an information retrieval process. The information may be
required for inspection, or it may be an integral part of the process of
manipulating a data structure. In either case, this can be a time-consuming
part of any algorithm and emphasis should be placed on providing effective
searching mechanisms. Three major methods for searching will be considered:
1. sequential searching
2. binary searching
3. searching by hashing, which will be dealt with near the end of the course
Searching usually focuses on finding a specific item which identifies the
information required. This item is called a key and can identify
information such as a record, an array, etc. Since large databases
commonly have multiple records, the key is used to identify a specific
record -- a process much simpler than searching whole records to find the
information required. If the records exist in memory, the search is
internal; if they are too extensive, they may exist on disk or tape, in which
case the search is an external one. This discussion will concern itself
with internal searches.
Sequential Searching
====================
A sequential search is one in which the search begins at the beginning of
the data and moves sequentially through the list. At each step, the key in
the data is compared to the key to be found. The search ceases when the key
is found or the data is exhausted. The following steps are therefore
involved (a code sketch follows the list):
- locate the first data element
- compare the key to the appropriate part of the data
- if this is the desired element, return the position of the element
- if this is not the desired element, obtain the next one if it exists
- if the desired element does not exist in the data, return a default value
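A minimal sketch of these steps, using the same array x of n elements and
the key value that the binary search fragments below assume (the loop
index i is introduced only for illustration):
// sequential search: examine x[0], x[1], ... until the key is found or
// the data is exhausted
for (int i = 0; i < n; i++)
{
    if (x[i] == key)
        return i;          // desired element found: return its position
}
return -1;                 // key not in the data: return a default value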
The complexity depends on the size of the list. The worst case
for the search is when the key is not found: at that point the whole list has
been traversed, so if the list has n elements, the algorithm is O(n). The
best case is also simple: the key is found in the first element, so it is
O(1). The average case is about n/2 comparisons. What this says is that, on
average, the key will be found in the first half of the list -- not an
unreasonable expectation. NOTE that the performance of the algorithm does not
depend on the order of the keys in the list. On average, a key will be found
in n/2 tries, and it will take n tries to discover that the key is not in the
list.
If some information is known about the data and the keys which are used most
often, then some improvements can be made. For example, if the search
is usually made for a key with a large value, then if the list is in
decreasing order, the chances of finding the key early in the search are
improved. However, if the key does not occur, the whole list must still be
searched. For ordered lists, an additional comparison can be made to see if
the key value has been passed. In this case, the search efficiency can be
improved because the list is only processed while there is a chance of finding
the key. However, if the key doesn't exist, it may make up to 2n comparisons
instead of n. In other words, the savings are not necessarily worth the
effort unless most of the searches are for numbers in a range that is close
to the beginning of the list. The worst case is still O(n) in all cases, even
though the number of comparisons could be doubled.
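As a rough sketch of that extra comparison on a list in ascending order,
again using the x, n, and key names assumed by the other fragments:
// ordered sequential search: stop as soon as the key's value has been passed
for (int i = 0; i < n; i++)
{
    if (x[i] == key)
        return i;          // first comparison: the key has been found
    if (x[i] > key)
        return -1;         // second comparison: the key's value has been
                           // passed, so it cannot occur later in the list
}
return -1;                 // key is larger than every element in the list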
Notice that we have looked at two features -- the number of items to be
accessed and the number of comparisons that must be made. This will be a
usual procedure when we look at algorithms and data structures. We may also
be interested in other features as well but these two are common to most
analyses.
Binary Searches
===============
When the list is ordered, a sequential search may become more efficient, but
this does not decrease the computational complexity, nor does it guarantee
that the search will be more efficient. In practice, most data structures are
in some order; in fact, ordering can become one of the requirements for using
certain types of searches. This is true of the binary search, which can
become much more efficient than a sequential search.
A binary search is one in which the search is on an ordered list, either in
ascending or descending order (although this must be known before doing the
search). The list is successively divided in half so that, at each division,
the portion which remains is the one that could contain the key. This
effectively cuts the search space in half at each comparison.
There are two major ways in which binary search is done. The first is
called a dumb algorithm. The algorithm consists of defining top and
bottom, and finding the mid-point. If the key is greater than the element at
the mid-point, then the bottom becomes the mid-point plus 1; otherwise the
top becomes the mid-point. This continues until only one element remains,
and this is checked to see if it is equal to the key.
Typical code for an array implementation is shown below:
int top = n - 1, bot = 0, mid;
// make sure the array has some elements!
if (top < 0) return -1;
while (top > bot)
{
    mid = (top + bot) / 2;
    if (x[mid] < key)
        bot = mid + 1;     // key, if present, lies above the mid-point
    else
        top = mid;         // key, if present, lies at or below the mid-point
}
if (x[top] == key)
    return top;
else
    return -1;
A second method uses the same procedure to successively reduce the size of
the list being searched, but there is one difference. When a mid-point is
chosen, it might be equal to the key. Therefore, the algorithm checks to
see if the key is equal to the mid-point element. This algorithm has the
potential of terminating sooner than the other, if some mid-point is equal
to the key, rather than continuing to reduce the search list until only one
element remains. On the other hand, each time the list is divided, an
additional check must be done.
int top = n - 1, bot = 0, mid;
// make sure the array has some elements!
if (top < 0) return -1;
while (top > bot)
{
    mid = (top + bot) / 2;
    if (x[mid] == key)
        return mid;        // extra check: the mid-point may already be the key
    if (x[mid] < key)
        bot = mid + 1;     // key, if present, lies above the mid-point
    else
        top = mid;         // key, if present, lies at or below the mid-point
}
if (x[top] == key)
    return top;
else
    return -1;
Analysis of the binary search algorithm is not as obvious as that of the
sequential search. Essentially, the while loop is the critical feature of the
code, since it determines how many iterations there will be when all other
features are O(1). To determine what happens, the variable mid is the
critical part. The function of mid is to divide the list in half, because we
only keep one of the two portions remaining and the mid-point is the point
which cuts the total into two equal pieces. After k iterations roughly n/2^k
elements remain, and the loop stops when this reaches 1, which happens after
about log2 n iterations; so the binary search has a complexity of O(log2 n).
UNLESS OTHERWISE STATED, ALL LOGS WILL BE ASSUMED TO BE
BASE 2 LOGS AND WILL BE WRITTEN AS LG or log2.
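As a quick illustration of the halving argument (this fragment only counts
halvings; it is not part of the search itself):
int size = 1000000, halvings = 0;
while (size > 1)
{
    size = size / 2;       // one iteration of the search loop per halving
    halvings++;
}
// halvings ends up at 19, since 2^19 = 524288 <= 1000000 < 2^20 = 1048576;
// in other words, about lg 1,000,000 (roughly 20) iterations.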
In comparing the two methods, the only difference is whether one check or
two are made at each choice of a new mid-point. For the first case, the
algorithm will continue through the whole process even when the first
mid-point is the required element, so the best, average, and worst cases are
all O(log2 n). For the second algorithm, the first mid-point chosen may be the
required one, so the best case is O(1). The worst case is still O(log2 n),
when the required element is the last mid-point chosen, but you have doubled
the number of comparisons. The average case is close to log2 n. That means
that you will be doubling the number of comparisons almost lg n times most of
the time. It is worth looking at what happens with a list of 1 million
items. The chance that you will hit the item you're looking for on the
first try is 1 in 1 million; then it is 1 in 500000, 1 in 250000, 1 in
125000, and so on. It is not until the list is small, say 4 long, that the
odds are really very good. In fact, with 2 items left in the list you only
have a 1 in 2 chance of getting the right element. If we consider just how
quickly large amounts of data are reduced by a log2 n algorithm, it really
doesn't make much sense to put in the extra checks and make the algorithm
longer. The idea is good but the pay-off is poor.
Searches can be made on arrays or linked lists, but in the case of linked
lists, a binary search is not a reasonable approach. Notice that we never
get something for nothing. The binary search method requires that the list
be ordered, and if we use dynamic allocation for each item, the binary search
actually behaves more like a sequential search. Provided that we know how
many items there are in the list, finding the mid-point in an array
implementation is a simple calculation (midpoint = bottom + (top - bottom)
div 2). In the linked representation, the task is different. Even if we
assume that we know how many elements there are and we have pointers to the
first and last elements, finding the mid-point is not done through a simple
calculation. Suppose that the key is in the last element. To find the
first mid-point we must traverse total div 2 elements. The next mid-point is
total div 4 beyond the current mid-point, the next is total div 8 further
down the list, and so on. Therefore, the traversals sum to roughly
n/2 + n/4 + n/8 + ..., which is about n, so by the time we get to the end we
have traversed nearly all n elements and the binary search is actually O(n).
With the array implementation, only lg n elements were examined, but in the
linked list n elements may have to be examined, and even in the best case n/2
elements will have to be examined if the search quits when the mid-point
equals the key.
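As a minimal sketch of why the mid-point is expensive in a linked
representation (the Node structure and the advance function here are purely
hypothetical, for illustration only):
// a hypothetical singly linked node holding an integer key
struct Node
{
    int key;
    Node* next;
};
// moving 'steps' positions down the list costs one pointer move per element
// skipped, so every mid-point calculation becomes a traversal
Node* advance(Node* p, int steps)
{
    while (steps-- > 0 && p != nullptr)
        p = p->next;
    return p;
}
Each halving step then costs as many pointer moves as the size of the half
being skipped, and as noted above these moves sum to roughly n.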
A good algorithm (like a binary search) when combined with the wrong data
structure produces a poor implementation.