Algorithms and data structures, FER, 6.5.2017.
Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 (http://creativecommons.org/licenses/by-nc-sa/3.0/hr/): you are free to share and adapt the material, provided that you give appropriate credit, do not use it for commercial purposes, and distribute derivatives under the same licence.

Addressing techniques
- Basics
- Retrieval procedures
- Hashing

Basics
Knowing the key of some record, the question arises how to find this record.
Primary key
- Defines a record uniquely.
- Concatenated (composite) keys are necessary for unique identification of some types of records.
- E.g. StudentID; or StudentID & CourseCode & ExamDate, which uniquely define a record of an examination (a possible examination before a committee, due to a student's complaint, is neglected!).
Secondary key
- Need not uniquely define the record, but points at some attribute value.
- E.g.
YearOfStudy in the record with course data.

Sequential search
- Searching the file record by record is the most primitive way.
- It is used for sequential files, where all the records have to be read anyhow.
- Other terms: linear search, serial search.
- The records need not be sorted.

  Repeat for all the records
    If the current record equals the searched one
      Record is found
      Leave the loop

- On average n/2 records are read.
- Complexity: O(n); best case O(1), worst case O(n). (IspisiTrazi / PrintSearch)

Sequential search of sorted records
How to improve the sequential search? Sort the records according to a key!

  Sort the records
  Repeat for all records
    If the current record equals the searched one
      Record is found
      Leave the loop
    If the current record is larger than the searched one
      Record does not exist
      Leave the loop

What are the complexities in the best, worst and average case when searching sorted records?

Questions and exercises
1. A colleague tells you that s/he wrote a sequential search algorithm with complexity O(log n). Should you congratulate him or laugh at him?
2. In the best case, the record is found after the minimum number of comparisons. Where was this record located?
3. Where was the record found after the maximum number of comparisons?

Block-wise reading
- In direct (random) access files (all records of the same length!) sorted by the primary key, it is not necessary to check all the records.
- E.g. only every hundredth record is examined; when the block that must contain the searched key is located, that block is searched sequentially.
- How to find the optimal block size?
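As a sketch, the two sequential-search variants above could look like this in C (the names are illustrative, not the course's IspisiTrazi code; the element type is assumed to be int):

```c
#include <stddef.h>

/* Sequential search in an unsorted array: O(n), returns index or -1. */
int seq_search(const int *a, size_t n, int key) {
    for (size_t i = 0; i < n; i++)
        if (a[i] == key)
            return (int)i;
    return -1;
}

/* Sequential search in a sorted array: stops early once a larger
   element is seen, so an unsuccessful search no longer costs a full pass. */
int seq_search_sorted(const int *a, size_t n, int key) {
    for (size_t i = 0; i < n; i++) {
        if (a[i] == key)
            return (int)i;
        if (a[i] > key)   /* key cannot appear later in a sorted array */
            return -1;
    }
    return -1;
}
```

The sorted variant only improves the unsuccessful case; a successful search still reads n/2 records on average.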
Example – places in Croatia
- We are looking for the town of Malinska in a list of F = 6935 places; each page contains B = 60 places.
- There are ⌈F / B⌉ = 116 leading records (and corresponding pages).
- [Figure: pages 1, 2, ..., 60, 61, ..., 115, 116 with their leading records, from "Ada" on page 1 to "Živković Kosa" on page 116; Malinska is found on page 60.] (CitanjePoBlokovima / ReadingBlockWise)

Optimal block size
- In case of F records and block size B, there are F / B leading records and blocks.
- During the search by blocks, on average half of the leading records have to be read, i.e. the searched leading record is found after (F / B) / 2 = F / (2B) readings.
- Within the located block there are B records, so the searched record is found after on average B / 2 (sequential!) readings within that block.
- The total expected number of readings is therefore F / (2B) + B / 2.
- Setting the derivative with respect to B to zero yields the optimal block size: B = √F.
- What is the optimal block size for the list of places in Croatia?

Binary search
- The binary search starts at the half of the file/array and continues by repeatedly halving the interval where the searched record could be found. Prerequisite: the data are sorted!
- Average number of search steps: log2 n.
- The procedure is inappropriate for disk memories with direct access, due to the time-consuming positioning of the reading heads; it is recommendable for core (main) memory.
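The cost formula F/(2B) + B/2 above can be checked numerically; a small sketch that brute-forces the best integer block size (the value 6935 is taken from the Croatia example):

```c
/* Expected number of reads for block-wise search:
   F/(2B) leading records on average, plus B/2 within the located block. */
double expected_reads(double F, double B) {
    return F / (2.0 * B) + B / 2.0;
}

/* Brute-force the best integer block size for F records;
   the result should land near sqrt(F), as the derivation predicts. */
int best_block_size(int F) {
    int best = 1;
    for (int B = 2; B <= F; B++)
        if (expected_reads(F, B) < expected_reads(F, best))
            best = B;
    return best;
}
```

For F = 6935 the brute force gives B = 83, matching √6935 ≈ 83.3.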
- The fact that the array is sorted is exploited: in each step the search area is halved.
- For n elements, complexity is O(log2 n); number of search steps ≈ log2 n.

Example of binary search
Looking for 25 in: 2 5 6 8 9 12 15 21 23 25 31 39

Algorithm for binary search
  lower_bound = 0
  upper_bound = total_elements_count
  Repeat
    Find the middle record
    If the middle record equals the searched one
      Record is found
      Leave the loop
    If the lower bound is greater than or equal to the upper bound
      Record is not found
      Leave the loop
    If the middle record is smaller than the searched one
      Set the lower bound to the position of the current record + 1
    If the middle record is larger than the searched one
      Set the upper bound to the position of the current record - 1
(BinarnoPretrazivanje / BinarySearch)

Questions
1. What are the execution times of binary search of n records in the best, worst and average case? The average case takes just one step fewer than the worst case; a proof can be seen at http://www.mcs.sdsmt.edu/ecorwin/cs251/binavg/binavg.htm (March 24th, 2014).
2. For the search of places in Croatia (a list of 6935 places), what is the maximum number of steps necessary to locate the searched place?
3. Will binary search always be faster than sequential search, even for a large set of data?

Problem
Suppose that n unsorted data in a set can be sorted in time O(n log2 n), and that you have to perform n searches in this data set. What is better: to use sequential search, or to sort the data and then apply binary search?
Solution: it makes more sense to sort and then search binarily!
- n sequential searches: n * O(n) = O(n^2)
- sort + binary: O(n log2 n) + n * O(log2 n) = O(n log2 n)
- O(n log2 n) < O(n^2)
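One common way to realise the pseudocode above in C is the inclusive-bounds variant below (a sketch; the array and key type are assumed to be int):

```c
/* Binary search in a sorted array of n elements:
   returns the index of key, or -1 if the key is absent. */
int binary_search(const int *a, int n, int key) {
    int lower = 0;
    int upper = n - 1;
    while (lower <= upper) {
        int mid = lower + (upper - lower) / 2;  /* avoids overflow of lower+upper */
        if (a[mid] == key)
            return mid;          /* record is found */
        if (a[mid] < key)
            lower = mid + 1;     /* continue in the upper half */
        else
            upper = mid - 1;     /* continue in the lower half */
    }
    return -1;                   /* record is not found */
}
```

Each iteration halves the remaining interval, so at most about log2 n iterations are performed.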
Index-sequential files
- Every record contains a key as a unique identifier.
- If a file is sorted by the key, it is appropriate to form a look-up table: the input is the key of the searched record, the output is information about the more precise location of that record. Such a table is called an index.
- The index need not point to each record but only to a block. In the places-in-Croatia example, it was shown that the optimal block size is √F.
- For large files there are indices on multiple levels; with the optimal organisation, in the worst case the same number of records is read in each of the indices and in the data file.
- Optimal sizes of indices on 2 levels are F^(1/3) and F^(2/3), giving O(F^(1/3)) reads.
- For k levels: F^(1/(k+1)), F^(2/(k+1)), F^(3/(k+1)), ..., F^((k-1)/(k+1)), F^(k/(k+1)), giving O(F^(1/(k+1))) reads.

Index-sequential files: insertion and deletion
- In a traditional sequential file (magnetic tape), insertion and deletion are possible only by copying the whole file, with addition and/or deletion of records on the way.
- In direct (random) access files, deletion is performed logically, i.e. a tag is written to mark a deleted record.
- After a certain number of revisions, the file has to be reorganised: the data are rewritten sequentially by key value and the indexing is repeated (so-called file maintenance).

Search procedures: index non-sequential files
- If search by multiple keys is required, or if addition and deletion of records are frequent (volatile files), it is very difficult (in the first case practically impossible) to maintain the requirement that the records stay sorted and contained within their initial block.
- In that case, the index should contain the address (relative or absolute) of each single record.

The key contains the address
- The simplest case is to form the key so that some part of it contains the record address.
- E.g.
At some entrance examination for enrolment, the application number can serve as the key and simultaneously as the ordinal number of the record in a direct access file.
- Very often such a simple procedure is not possible, because the coding scheme cannot be adapted to every single application.

The idea of hashing
Problem: a company employs about a hundred thousand employees; every person has his or her own unique identifier ID, generated from the interval [0, 1 000 000]. Record reads and writes must be fast. How to organise the file?
- A direct file with the key equal to the ID? 1 000 000 x 4 bytes ≈ 4 MB, and 90% of the space is unused!
- It is possible to devise procedures that transform the key into an address or, even better, into some ordinal number under which the position of the record is stored. This modification improves the flexibility.

Hashing
- Suppose M buckets (blocks of records with the same starting address of the block) are available.
- A pseudo-random number from the interval 0 to M-1 is calculated from the key value using a hash function. This number is the address of a group of data (a bucket) holding all the keys that are transformed into that same pseudo-random number.
- A collision happens when two different keys are transformed into the same address.
- If a bucket is full, it is possible to insert a pointer to an overflow area, or insertion is attempted in the next neighbouring bucket ("bad neighbour" policy).
- In hashing the following parameters can vary: bucket capacity and packing density.

Example
Store the names into a hash table with buckets 0-3; hash function = (sum of ASCII codes) % (number of buckets).
Names to store: Vanja, Matija, Andrea, Doris, Saša, Alex, Sandi, Perica, Iva. Into which bucket does each name go, and where do collisions occur?
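The toy hash function from the example can be sketched in C as follows (a sketch; for non-ASCII letters such as "š" it simply sums the byte values of the encoding):

```c
#include <stddef.h>

/* Hash function from the example:
   sum of the character codes, modulo the number of buckets. */
unsigned ascii_sum_hash(const char *name, unsigned buckets) {
    unsigned sum = 0;
    for (size_t i = 0; name[i] != '\0'; i++)
        sum += (unsigned char)name[i];   /* treat each byte as unsigned */
    return sum % buckets;
}
```

For example, "Alex" sums to 65+108+101+120 = 394, which lands in bucket 394 % 4 = 2.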
Bucket capacity
- A pseudo-random number is generated by transforming the key, yielding the bucket address.
- If the bucket capacity equals 1, overflow is frequent.
- With increasing bucket size, the probability of overflow decreases, but reading a single bucket becomes more time-consuming and the amount of sequential search within a bucket increases.
- It is recommendable to match the bucket size with the physical size of the record block on external memory (disk block).

Packing density
- After the bucket size has been chosen, the packing density can be selected, i.e. the number of buckets to store the foreseen number of records.
- To reduce the number of overflows, a somewhat larger capacity is chosen.
- With N = number of records to be stored, M = number of buckets, C = number of records within a bucket:
  packing density = N / (M * C)

How to deal with overflow?
- Using the primary area: if a bucket is full, use the next one, etc.; after the last one comes the first one. Efficient if the bucket size exceeds 10.
- Separate chaining: buckets are organised as linear lists.

Statistics of hashing
- Let M be the number of buckets and N the amount of input data. The probability of directing exactly x records into a given bucket obeys the binomial distribution:
  P(x) = N! / (x! * (N - x)!) * (1/M)^x * (1 - 1/M)^(N - x)
- The probability of exactly Y overflows from a bucket of capacity C is P(C + Y), so the expected number of overflows from a given bucket is
  s = sum over Y >= 1 of Y * P(C + Y)
- The total expected percentage of overflows is 100 * s * M / N.
- The average number of records entered into the hash table before the first collision is ~ 1.25 √M.
- The average total number of records entered before every bucket contains at least one record is M ln M.
Transformation of the key into address
Generally, the key is transformed into the address in 3 steps:
1. If the key is not numeric, it is transformed into a number, preferably without loss of information.
2. An algorithm is applied to transform the key, as uniformly as possible, into a pseudo-random number of the order of magnitude of the bucket count.
3. The result is multiplied by a constant ≤ 1 for transformation into the interval of relative addresses, whose size equals the number of buckets.
- Relative addresses are converted into absolute ones on a concrete physical unit; as a rule, that is the task of system programs.
- An ideal transformation: the probability of transforming 2 different keys into the same address in a table of size M is 1/M.

Characteristics of a good transformation
- The output value depends only on the input data. If it depended also on some other variable, that variable's value would have to be known at search time as well.
- The function uses all the information from the input data. If it did not, a small variation of the input data would yield a large number of equal outputs, and the distribution would depart from the wished-for one.
- It distributes the output values uniformly; otherwise efficiency decreases.
- For similar input data it produces very different output values. In reality the input data are often very similar, while a uniform distribution at the output is required.
Which of these requirements are not fulfilled by our hash function from the previous example?

Usage of hashing
When is it appropriate?
- Compilers use it for recording declared variables.
- For spelling checkers and dictionaries.
- In games, to store the positions of players.
- For equality checks (in information security): if two elements result in different hash values, they must be different.
- When quick and frequent search is required.
When is it not appropriate?
- When records are searched by a non-key attribute value.
- When the data need to be sorted, e.g. to find the minimum value of the key.

Example: transformation of key into address
6-digit key, 7000 buckets; key: 172148. Method: middle digits of the key squared.
- The key squared yields a 12-digit number; digits from the 5th to the 8th are used:
  172148^2 = 029634933904 → 3493
- The middle 4 digits should be transformed into the interval [0, 6999]. As the pseudo-random number takes values from the interval [0, 9999], while the bucket addresses can be from the interval [0, 6999], the multiplication factor is 6999/9999 ≈ 0.7.
- Bucket address = 3493 * 0.7 = 2445
- The result corresponds to the behaviour of a roulette wheel.

Methods for transformation of key into address
Root from the middle digits of the key squared
- Like in the previous example, but after squaring, the root of the 8 middle digits is calculated and cut off to obtain a four-digit integer: sqrt(96349339) = 9815.
Division
- The key is divided by a prime number slightly less than or equal to the number of buckets (e.g. 6997); the remainder after division is the bucket address.
- Bucket address = 172148 mod 6997 = 4220
- Keys forming a sequence of values are well distributed.
Shifting of digits and addition
- E.g. key = 17207359: 1720 + 7359 = 9079
Overlapping
- Overlapping resembles shifting, but is more appropriate for long keys; the outer digit groups are reversed before the addition.
- E.g. key = 172407359: 407 + 953 + 271 = 1631
Change of the counting base
- The number is read as if written in another counting base B.
- E.g.
B = 11, key = 172148:
  1*11^5 + 7*11^4 + 2*11^3 + 1*11^2 + 4*11^1 + 8*11^0 = 266373
- The necessary number of least significant digits is selected and transformed into the address interval: bucket address = 6373 * 0.7 = 4461.
- The best method can be chosen after simulation for a concrete application. Generally, division gives the best results.

Determination of parameters
Example: there are 350 students enrolled in a study. Their ID – an identification number (11 characters) – and family name (14 characters) should be stored, under the requirement to retrieve the records fast using the ID.
- Remark: the ID contains 11 digits. The last digit is a control digit and it can, but need not, be stored if the rule for calculating it from the remaining digits is known.

Solution
- A record contains 11+1 + 14+1 = 27 bytes.
- Let the physical block on the disk be 512 bytes; the bucket size should be equal to or less than that. Since 512/27 = 18.963, a bucket holds data for 18 students, with 26 bytes of unused space.
- A somewhat larger table capacity, e.g. 30% more, is provided to reduce the number of expected overflows: 350/18 * 1.3 = 25 buckets.
- The ID should therefore be transformed into a bucket address from the interval [0, 24].
- The ID is rather long, so overlapping can be considered. Methods can be combined: after overlapping, division can be performed.
- The address is calculated by division with a prime number close to the bucket count, e.g. 23.

Writing of records into buckets of the hash table
[Figure: a record (ID, FamilyName) is passed through the HASH function to obtain a bucket address from the interval [0, M-1]; each bucket corresponds to a BLOCK on disk.]
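Several of the transformation methods above can be sketched in C and checked against the slides' worked values (a sketch; the scaling by buckets/10000 mirrors the slides' factor of 0.7 for 7000 buckets):

```c
/* Division method: remainder after dividing by a prime close to
   the number of buckets gives the bucket address. */
long division_address(long key, long prime) {
    return key % prime;
}

/* Mid-square method for a 6-digit key: square the key (up to 12 digits),
   take the 4 middle digits, then scale into [0, buckets-1]. */
long midsquare_address(long key, long buckets) {
    long long sq = (long long)key * key;
    long middle = (long)((sq / 10000) % 10000);       /* middle 4 digits */
    return (long)(middle * ((double)buckets / 10000.0)); /* factor <= 1 */
}

/* Change of counting base: reinterpret the decimal digits of the key
   as digits in base B. */
long base_change(long key, int B) {
    long digits[20];
    int n = 0;
    while (key > 0) { digits[n++] = key % 10; key /= 10; }
    long value = 0;
    for (int i = n - 1; i >= 0; i--)      /* Horner's scheme, MSB first */
        value = value * B + digits[i];
    return value;
}
```

With key 172148 these reproduce the slides' results: division by 6997 gives 4220, the mid-square method with 7000 buckets gives 2445, and base 11 gives 266373.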
Examples of key transformation (overlapping with reversed outer groups, then division by 23)
  ID = 5702926036x → 075 + 2926 + 630 = 3631; 3631 mod 23 = 20
  ID = 6702926036x → 076 + 2926 + 630 = 3632; 3632 mod 23 = 21
  ID = 6702926037x → 076 + 2926 + 730 = 3732; 3732 mod 23 = 6
  ID = 5702926037x → 075 + 2926 + 730 = 3731; 3731 mod 23 = 5
- If the bucket is full, entering into the next bucket is attempted cyclically (bad neighbour policy).

Algorithm
  Create an empty table on a disk
  Read ID and family name sequentially, while there are data
    If the control digit is not correct
      "Incorrect ID"
    Else
      Set the tag that the record is not written
      Calculate the bucket address
      Remember it as the initial address
      Repeat
        Read the existing records from the bucket
        Repeat for all the records in the bucket
          If the record is not empty
            If the already written ID is equal to the input one
              "Record already exists"
              Set the tag that the record is written
              Leave the loop
          Else
            Write the input record
            Set the tag that the record is written
            Leave the loop
        If the record is not written
          Increment the bucket address by 1, modulo the bucket count
          If the achieved address equals the initial one
            Table is full
      Until record written or table full
  End Hash

Exercises
- Update Hash so that instead of JMBG (13-character ID) it uses OIB (11-character ID). Update the function "Kontrola" for checking the control digit.
- Implement deletion by ID.
- Let products be the name of an unformatted file organised using hashing. Each record consists of ID (int), name (50+1 char), quantity (int) and price (float). A record is considered empty if its ID equals zero. Count the number of non-empty records in the file. The size of a disk block is BLOCK, and the expected maximum number of records is MAXREC; these parameters are contained in parameters.h.
- Let an unformatted file be organised using hashing.
Each record of the file consists of name (50+1 char), quantity (int) and price (float). A record is considered empty if its quantity equals zero. The size of a disk block is BLOCK, which is contained in parameters.h. Write the function that will find the packing density. The prototype of the function is:
  float density(const char *file_name);

Exercises
- Write the function for insertion of an ID (int) and a name (char 20+1) into a memory-resident hash table with 500 buckets. Each bucket consists of a single record. If the bucket is full, the next bucket is used (cyclically). The input arguments are the bucket address (previously computed), the ID and the name. The function returns 1 if the insertion is successful, 0 if the record already exists, and -1 if the table is full so the record cannot be inserted.
- Write the function that will find an ID (int) and a company name (30+1 char) in a memory-resident hash table with 200 buckets. Each bucket consists of a single record. If a bucket is full and does not contain the searched key value, the next bucket is used (cyclically). The input arguments are the bucket address (previously computed) and the ID; the output argument is the company name. The function returns 1 if the record is found and 0 if it is not.

Exercises
- Let a key be a 7-digit telephone number. Write the function for transformation of the key into an address. The hash table consists of M buckets; the division method should be used. The prototype of the function is:
  int address(int m, long teleph);
- Let products be the name of an unformatted file organised using hashing. A record consists of ID (4-digit integer), name (up to 30 char) and price (float). A record is considered empty if its ID equals zero. Write the function for emptying the file. The size of a disk block is BLOCK, and the expected maximum number of records is MAXREC.
These parameters are contained in parameters.h.
- Write the function that will compute the bucket address in a table of 500 buckets. The key is a 4-digit ID, and the method is the root from the central digits of the key squared.
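A minimal sketch of the memory-resident insertion described in the exercises (the 500-bucket size and the return convention come from the exercise; the struct layout and the "ID 0 means empty" convention are assumptions):

```c
#include <string.h>

#define BUCKETS 500

/* One-record bucket; ID 0 marks an empty slot (assumed convention). */
struct record {
    int  id;
    char name[21];   /* 20 + 1 characters, as in the exercise */
};

static struct record table[BUCKETS];

/* Insert (id, name) starting from the precomputed bucket address.
   Returns 1 on success, 0 if the ID already exists, -1 if the table
   is full (the probe came back to the initial address). */
int insert(int address, int id, const char *name) {
    int pos = address;
    do {
        if (table[pos].id == 0) {              /* empty bucket: write here */
            table[pos].id = id;
            strncpy(table[pos].name, name, 20);
            table[pos].name[20] = '\0';
            return 1;
        }
        if (table[pos].id == id)               /* record already exists */
            return 0;
        pos = (pos + 1) % BUCKETS;             /* bad neighbour policy */
    } while (pos != address);
    return -1;                                 /* table is full */
}
```

The `do`/`while (pos != address)` loop is the cyclic probing from the Algorithm slides: the search for a free bucket stops either at success, at a duplicate, or when the probe returns to the initial address.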