SELECTED METHODS OF DATA ACCESS

prepared by M. Bach / R. Tutajewicz, adapted by M. Kuczora / September 2002

1. INTRODUCTION

Nowadays the quality of computer memory is very high. Thanks to the larger capacity and the better access time of new memory models it is possible to store and process larger amounts of information at the same time. Database systems are one of the basic applications of computer science. Efficient data organisation requires a special kind of software. The following paragraphs present the concepts and problems connected with collecting and processing computer data.

The first concept that has to be defined is the database. A database is a set of elements that has a specified structure, is stored in auxiliary computer memory and is able to satisfy the requirements of many users who use it selectively at any time (this definition was created in 1962/1963). The software that allows the user to communicate with the database is called a Database Management System (DBMS). The main functions of this kind of software are the organisation of data in memory and the management of its processing.

Three main classes of data models can be distinguished according to the different kinds of connections between data: the hierarchical, the network and the relational model. The most common is the relational model, where the dependences between data are represented by data records containing adequate keys.

In general it is necessary to consider three basic problems:

1. The record layout problem: give an algorithm A which, on the basis of the key K_i included in the record R_i, allocates a place (an address) for the given record in the specified structure S.
2. Finding the record R_i inside the structure S: calculate the address of this record on the basis of the key K_i and the algorithm A.
3. The search problem: it occurs when a suitable record has to be found on the basis of attributes other than the key. The solution relies on testing the fields of all records and comparing their contents with the given attribute value.

2. DESCRIPTION OF SELECTED DATA STRUCTURES

Existing data structures can in general be divided into two groups: static structures, which are independent of the elements collected inside them, and dynamic structures, which depend on the elements collected inside them. By static structures we mean vectors, matrices, records and tables; by dynamic structures we mean linear lists, non-linear lists and nets.

Let's begin with the table. A table is a homogeneous structure (it consists of components of the same basic type) and a free-access structure (all its elements can be selected in any order and are accessible in the same way). A one-dimensional table of size n is denoted T[n]. A two-dimensional table is similar to a rectangular matrix; it can be viewed as m serially situated tables of size n. A record differs from a table in that it may contain elements of different types. The structures described above are often called basic, because they are the elements of which more complicated structures consist. Defining a data type and declaring variables of this type means that the accepted range of values, and at the same time the memory layout, are fixed once and for all. Variables defined this way are called static. Nevertheless there are many problems that require more complicated data structures.

One of the features of such data types, which distinguishes them from the basic structures (tables, records), is the possibility of changing their size. There is therefore no need to allocate the whole memory such a structure may ever occupy in advance, and as a result the compiler cannot assign fixed memory addresses to the particular items that are part of such variables. Structures of this type are called dynamic. Their main representatives are lists, trees and nets. Every record included in a structure of this type has one or more additional fields, called connecting fields or pointer fields; in practice these fields contain the addresses of the elements connected with the given record.
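This can be illustrated with a short sketch. The following Python fragment (the names Node, key and next are illustrative, not taken from the text) shows a record extended with a pointer field and linked into a dynamic structure whose size is not fixed in advance:

    # A record with a key field and a pointer (connecting) field.
    class Node:
        def __init__(self, key, data=None):
            self.key = key      # the key K_i identifying the record R_i
            self.data = data    # the remaining fields of the record
            self.next = None    # pointer field: address of the connected record

    # The structure grows at run time, so the compiler cannot assign
    # fixed memory addresses to its items in advance.
    head = Node(10)
    head.next = Node(20)
    head.next.next = Node(30)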
A set in which each element has at most one predecessor and at most one successor is called a linear list. Particular kinds of linear lists are: the cyclic list, in which it is not possible to distinguish the first or the last element, and lists that require special methods of access, such as the stack, where adding and deleting elements is possible only at one end, and the queue, where records are inserted at one end and deleted at the other (i.e. only the oldest elements can be erased). A structure whose elements can have many successors, but only one predecessor, is called a non-linear list or tree. A tree whose nodes have zero, one or two successors is called a binary tree. If there are elements that have many successors and many predecessors, the structure is called a net.

3. DESCRIPTION OF SELECTED DATA ACCESS METHODS

After this short description of the most common data structures, the efficiency of different data access methods must be discussed. The organisation of data in computer memory is connected with the chosen search method, and the choice of the method depends strongly on the manner in which the data set is processed. For example, if the processing requires frequent browsing of the whole set, a sequential search method and a linear organisation are sufficient. If, in another case, chosen single records have to be looked up very often, an index method, a binary search tree method or a hash function method must be considered, together with the corresponding organisation of the data sets.

In order to compare the described methods, a comparison parameter has to be established. It will be the average number of comparisons that must be made in order to find any record, in other words the average search time. This parameter is expressed by the formula

L = \sum_{i=1}^{n} c_i p_i

where L is the average search time, c_i is the number of tests executed in order to find record i, and p_i is the probability of access to record i. Assuming that the probability of access to each record is the same (p_i = p = 1/n = const), the above formula can be written as

L = p \sum_{i=1}^{n} c_i

3.1. Sequential search

As was said before, when the whole set has to be browsed often, a sequential search method and a linear data organisation are to be applied. The algorithm of sequential search reduces to testing consecutive records one by one and checking whether the record whose key is identical to the given one has been found. The average search time of this method is

L = p \sum_{i=1}^{n} c_i, with c_i = i and p = 1/n,

so as a result of the above:

L = \frac{1}{n} \sum_{i=1}^{n} i = \frac{n+1}{2}
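As a minimal sketch (Python; the function and variable names are illustrative), the sequential search described above can be written as follows:

    def sequential_search(keys, x):
        # Test consecutive records one by one; finding the record at
        # position i costs c_i = i + 1 comparisons.
        for i, k in enumerate(keys):
            if k == x:
                return i        # record found
        return None             # the whole set was browsed without success

For a key drawn uniformly from the n stored keys the loop makes (n+1)/2 comparisons on average, in agreement with the formula above.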
3.2. Binary tree search method

If the organisation of a tree guarantees that for each node t_i all keys in the left sub-tree are smaller than the key of node t_i, and all keys in the right sub-tree are greater, then such a tree is called a search tree. In such a tree a specified key can be found by moving forward along a search path, starting from the root and descending into the left or the right sub-tree of a given node depending only on the key value in that node. Using this recursive method (going down the tree), after every key test approximately half of the remaining elements (the "wrong" half of the tree) is discarded. This algorithm is more efficient than linear lists and the sequential search method.

After some reallocation of the elements, which are identified by keys, every binary tree may become a binary search tree; the statement also concerns balanced trees. A tree is perfectly (precisely) balanced if for every node the numbers of nodes in its left and right sub-trees differ by at most one. The average number of tests necessary to locate a given key in a precisely balanced tree with n nodes amounts approximately to h = log2(n) - 1. Unfortunately not every search tree is balanced, so let's calculate the average search time for a tree of arbitrary shape. Assuming that each key may become the root, the probability that key i is the root equals 1/n (where n is the number of records in the set). Then the left sub-tree contains i-1 keys and the right one n-i keys. Denote by:

a_{i-1} – the average path length in the left sub-tree,
a_{n-i} – the average path length in the right sub-tree,

and assume that all permutations created from the other n-1 keys are equally probable. The average path length in the n-node tree is the sum over all nodes of the products of each node's path length and the probability of access to it:

a_n = \frac{1}{n} \sum_{i=1}^{n} p_i

where p_i is the path length for node i. The nodes of the tree can be divided into three classes:

1. the i-1 nodes in the left sub-tree have the average path length a_{i-1} + 1;
2. the root's path length equals 1;
3. the n-i nodes in the right sub-tree have the average path length a_{n-i} + 1.

According to the above, the previous equation can be presented as the sum of three terms:

a_n^{(i)} = (a_{i-1} + 1) \frac{i-1}{n} + \frac{1}{n} + (a_{n-i} + 1) \frac{n-i}{n}

After some transformations using the properties of the harmonic function it is possible to obtain

a_n \approx 2(\ln(n) + \gamma) - 3 = 2 \ln(n) - c

where n is the number of keys, \gamma \approx 0.577 is the Euler constant and c = 3 - 2\gamma. Since the average path length in the precisely balanced tree amounts approximately to

a'_n = log2(n) - 1,

we get

\lim_{n \to \infty} \frac{a_n}{a'_n} = \frac{2 \ln(n)}{\log_2(n)} = 2 \ln 2 \approx 1.386

The result presented above shows that the average search path in a randomly grown tree is about 39% longer than in a precisely balanced one; this is therefore the expected profit from the work connected with creating a precisely balanced tree. It is very important to stress the word "average", because the profit can be much greater, while in a disadvantageous case the tree degenerates into a list. Such a case takes place if the tree structure is used for data that is already sorted (in either direction). For example, if the elements 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 are inserted into the tree one after another, the structure becomes not a tree (as desired) but a linear list, because every time the new element (the biggest one) is linked as the right successor of the rightmost element.
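The following Python sketch (illustrative names, no balancing) implements the search tree described above and demonstrates how sorted input degenerates it:

    class TreeNode:
        def __init__(self, key):
            self.key = key
            self.left = None     # sub-tree with keys smaller than self.key
            self.right = None    # sub-tree with keys greater than self.key

    def insert(root, key):
        if root is None:
            return TreeNode(key)
        if key < root.key:
            root.left = insert(root.left, key)
        elif key > root.key:
            root.right = insert(root.right, key)
        return root

    def search(root, key):
        # Each test discards (approximately) the "wrong" half of the tree.
        while root is not None and root.key != key:
            root = root.left if key < root.key else root.right
        return root

    # Inserting the already sorted keys 1..10 links every new (biggest)
    # element as the right successor of the rightmost node, so the
    # structure degenerates into a linear list:
    root = None
    for k in range(1, 11):
        root = insert(root, k)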
3.3. How do we organise B-tree data access?

The method of binary trees is applied mainly when the entire searched set is in RAM. If the set is stored in some external memory, the most significant factor is the number of data transmissions between RAM and the external memory; it is estimated that this operation requires more than 10^4 times more time than a comparison operation. Let's assume that the tree nodes are stored in auxiliary memory (e.g. on disk). Using a binary tree for, say, a million objects requires on average log2(10^6) ≈ 20 search steps. As each step requires a disk access (including its time delay), it is better to have a memory organisation that reduces the number of such calls. The best solution to this problem is the multidirectional tree: when a single object in the auxiliary memory is accessed, it is possible (without any additional cost) to read a whole group of data. This suggests dividing the tree into units that are accessed together; these subtrees will be called pages.

[Drawing: a binary tree divided into pages of 3 nodes each.]
[Drawing: the multidirectional tree created by grouping the nodes of the previous drawing.]

The idea described in the above example led to the concept of the B-tree, applied to looking up records in databases. A tree is called a B-tree of class t(h,m) if it is an empty tree (h = 0) or the following conditions are fulfilled:

1. All paths from the tree root to the leaves have the same length h, called the height of the tree.
2. Each node, except for the root and the leaves, has at least m + 1 descendants; the root either is a leaf or has at least 2 descendants.
3. Each node has at most 2m + 1 descendants.

Maximum filling of a B-tree of class t(h,m) occurs when 2m elements are stored in each node; in contrast, minimal filling occurs when m elements are stored in the indirect nodes and leaves and only 1 element is stored in the root. Considering these two extreme possibilities, for an index of N elements the number h of levels of an index organised in accordance with the structure of a B-tree of class t(h,m) is limited as follows:

\log_{2m+1}(N+1) \le h \le 1 + \log_{m+1}\frac{N+1}{2}

3.3.1. How do we search B-trees?

If the value of the key x is known, the search problem can be treated as finding the page where the element with the key equal to x is situated. The page looked for can be the root, an indirect node or a leaf. The search is executed in accordance with the algorithm presented below (a Python sketch follows it):

1. Assign the identifier of the root page to the variable p.
2. Is p different from null? If so, go to step 3. If p is equal to null, then such an element does not exist in the index (end of the algorithm).
3. Download the page pointed to by p.
4. Is x smaller than the smallest key included in page p? If so, assign the extreme left descendant pointer to p and go to step 2. If not, go to step 5.
5. Does page p contain the key of value x? If so, the algorithm ends: the element looked for has been found. If not, go to step 6.
6. Does page p contain a key whose value is greater than x? If so, assign to p the descendant pointer standing between the greatest key smaller than x and the smallest key greater than x on the current page, and go to step 2. If not, assign the extreme right descendant pointer to p and go to step 2.

Since in the search process the number of 'downloaded' pages is not greater than the height of the B-tree, the time complexity of the above algorithm is O(log_m N).
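A minimal Python sketch of this algorithm is given below. A page is assumed to be an object with a sorted list keys and a list children holding len(keys) + 1 descendant pointers (empty for a leaf); these names are an illustrative assumption, not prescribed by the text:

    def btree_search(root, x):
        p = root                             # step 1: start from the root page
        while p is not None:                 # step 2: is p different from null?
            # step 3: the page pointed to by p has been 'downloaded'
            if x in p.keys:                  # step 5: page p contains x
                return p
            # steps 4 and 6: choose the descendant pointer standing between
            # the greatest key smaller than x and the smallest key greater than x
            i = 0
            while i < len(p.keys) and p.keys[i] < x:
                i += 1
            p = p.children[i] if p.children else None
        return None                          # x does not occur in the index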
3.3.2. Adding an element to the index

Adding an element is preceded by the search algorithm; a new value needs to be inserted only when the search gives no result. At that point the leaf page to which the new element is to be added is known. The addition can be collision-free or can cause an overflow of the page (if 2m elements are already stored in it). In the first case the new element is added in such a way that the ascending order of key values on the page is kept; in the second case the compensation or the division method is applied.

The compensation method can be applied when one of the pages neighbouring the overflowed one includes fewer than 2m elements. Then the elements of both pages are sorted, taking into account the inserted element as well as the 'father' element (i.e. the element whose pointers on both sides point to the pages taking part in the compensation). The 'middle' element of this string becomes the new 'father'; all elements smaller than the father are inserted into one of the compensated pages, the bigger ones into the other (the order on each page is kept).

If compensation is not possible, the method of dividing the overflowed page into two pages should be applied. First all elements of the considered page are sorted in ascending order (taking the new element into account); as a result a string of 2m + 1 elements is obtained. The 'middle' element is added to the 'father' page; the smaller elements create one descendant page and the bigger ones the other (a sketch of this step is given below). During the insertion of the middle element the 'father' page can itself overflow (if it was full); then that page has to be divided first, which may cause a division on the next higher level, and so on up to the root. The consequence of a root division is the creation of 2 new pages and an increase of the tree height.

In order to keep the tree height as low as possible, when a new element is added the compensation method must be applied first; only when compensation is no longer possible may the division method be applied.
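The division step can be sketched as follows (Python; illustrative names, assuming a page stores at most 2m keys):

    def split_page(page_keys, new_key, m):
        # Sort the 2m keys of the overflowed page together with the new
        # one; the resulting string has 2m + 1 elements. The 'middle'
        # element goes up to the 'father' page, the smaller elements
        # form one descendant page, the bigger ones the other.
        s = sorted(page_keys + [new_key])
        return s[m], s[:m], s[m + 1:]   # (element for father, left page, right page)

    # Example with m = 2: inserting 25 into the full page [10, 20, 30, 40]
    # gives split_page([10, 20, 30, 40], 25, 2) == (25, [10, 20], [30, 40]).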
3.3.3. Erasing an element from the index

Erasing is preceded by the search algorithm and makes sense only on condition that the element looked for has been found. Then a variable points to the page containing the element intended for erasure. If this page is a leaf, the element is simply erased. This may cause a shortfall, i.e. the number of elements on the page may drop below m. If the page is not a leaf, the erased element is replaced by the element E_min that has the smallest value in the subtree indicated by the pointer standing immediately to the right of the erased element. As E_min is inserted in place of the erased element and deleted from its leaf page, it may cause a shortfall there as well. In case of a shortfall one of two possible methods is applied: linking or compensation.

Linking. Let's assume that there is a shortfall on page s, which means that it includes j < m elements, and that at the same time one of its neighbouring pages (denoted s1) contains k elements with j + k < 2m. Then the method of linking the pages s and s1 can be applied: all elements are 'transmitted' onto page s (the order is preserved), and the element of the father page that separates the pages s and s1 is erased from the father and transmitted to the linked page as well. Excluding this element from the father page may also cause a shortfall there; its elimination may in turn cause a shortfall one level higher, and in the extreme case the process reaches up to the B-tree root. The linking method is the only one that allows the tree height to be reduced.

Compensation. If among the neighbours of the page s on which the shortfall occurred there is no candidate to be linked with it (the required conditions are not fulfilled), the compensation method is applied. This method is similar to the one already described for insertion, but no new element is added.

3.4. B*-trees

The concept of the B*-tree is similar to that of the B-tree. The main difference between them is that B-trees are 'node-oriented', while B*-trees are 'leaf-oriented': the nodes that are not leaves contain only key values, and the corresponding data is stored only in the leaves.

[Drawing: an example of a B*-tree of class t*(3,2,2) with links between the leaves.]

Each B*-tree is characterised by the three following parameters:

h* – the height of the B*-tree, i.e. the distance (length of the path) between the root and the leaves;
m* – the number of possible key values in the B*-tree root and in the nodes that are not leaves (the root contains from 1 to 2m* key values, and each indirect node contains from m* to 2m* key values);
m – the number of elements in the leaves (a leaf contains from m to 2m index elements).

The drawing shown above presents an example of the organisation of a B*-tree. Let us note that the essential difference between B*-trees and B-trees relies on the fact that in B*-trees all the data is in the tree leaves (the keys situated in the root and the indirect nodes are duplicated in the leaves). The algorithms of searching, adding and erasing for B*-trees are almost identical to the B-tree algorithms, so there is no need to describe them in detail. An interesting application worth mentioning is the use of B*-trees for the sequential browsing of data sorted according to the key values (as well as filtering by key). To make this possible it is enough to add to the B*-tree structure a two-directional list that joins neighbouring leaves. An example of such a structure is shown in the drawing above; a sketch of such a scan is given below.
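The scan along the linked leaves can be sketched in Python as follows (a Leaf with the fields elements and next is an assumption made for the illustration):

    def scan_in_key_order(first_leaf, keep=lambda e: True):
        # Walk along the list joining neighbouring leaves, yielding the
        # index elements in ascending key order, optionally filtered by key.
        leaf = first_leaf
        while leaf is not None:
            for element in leaf.elements:   # elements are sorted inside a leaf
                if keep(element):
                    yield element
            leaf = leaf.next                # follow the link to the next leaf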
3.5. Hash function method

The next very often used structure is the table. For an unsorted table the average number of tests needed to find an element is n/2, but for a sorted one, if the dichotomic (divide and conquer) method is applied, the number of tests is reduced to about log2(n). The method just mentioned relies on testing the element situated in the middle of the current range; if this element is not the one looked for, the half of the range that can still contain the key becomes the new range and the next test is made.

Another method applied in the case of tables is the hash function method. The main task of a hash function is the transformation of keys K into table indexes. The main difficulty in using key transformation is the fact that the set of possible key values is much bigger than the set of addresses (table indexes). For a given key K, the first step of the operation is the calculation of the corresponding index; the second step checks whether the object with key K is really under the address appointed by the function h. Every situation in which a key other than the required one is found under a given address is called a collision. The algorithm finding an alternative index is called a collision resolving algorithm.

If the number of possible addresses is N (the base capacity), then a hit at every address ADR_i from the range [1, N] should be equally probable, i.e. for each ADR_i:

P{h(K_i) = ADR_i} = 1/N

Let's calculate the number of tests that have to be made in order to place the (l+1)-th record in an n-element table:

L_{l+1} = \sum_{j=1}^{l+1} c_j p_j

where L_{l+1} is the number of tests that must be executed in order to place (find) the (l+1)-th record, c_j = j is the number of the subsequently made test, and p_j is the probability that the record is placed (found) at the j-th test:

p_1 = \frac{n-l}{n}
p_2 = \frac{l}{n} \cdot \frac{n-l}{n-1}
p_3 = \frac{l}{n} \cdot \frac{l-1}{n-1} \cdot \frac{n-l}{n-2}
...
p_j = \frac{l}{n} \cdot \frac{l-1}{n-1} \cdots \frac{l-j+2}{n-j+2} \cdot \frac{n-l}{n-j+1}

so:

L_{l+1} = \frac{n+1}{n-l+1}

Due to the fact that the number of tests made during placing is exactly the same as in the case of searching, the above value can be used to calculate the average number of tests needed to find any key in the table:

L = \frac{1}{l} \sum_{k=1}^{l} L_k = \frac{n+1}{l} \sum_{k=1}^{l} \frac{1}{n-k+2} = \frac{n+1}{l} (H_{n+1} - H_{n-l+1})

where H_n = 1 + 1/2 + ... + 1/n is the harmonic function, approximately equal to ln(n) + γ, and γ ≈ 0.577 is the Euler constant. Using α = l/(n+1):

L \approx -\frac{1}{\alpha} \ln(1-\alpha)

α is the quotient of the number of occupied addresses and the number of all addresses and is called the fill factor: α = 0 means that the table is empty, α = n/(n+1) ≈ 1 means that the table is full.

If we compare the number of tests needed to search or insert, hash coding is more effective than the binary search tree; nevertheless it must be taken into consideration that the test (comparison) operation may require a completely different amount of time in each method. For example, the hash function method may require on average two tests and the tree six, but this does not mean that hashing is three times faster; it may even happen that using the binary search tree is faster, e.g. because the hash function has a very big computational complexity. In comparison with methods of dynamic memory allocation, a disadvantage is the constant size of the table, which makes it impossible to fit it to the current needs. Therefore it is essential to specify a priori the approximate number of objects in order to avoid inefficient memory usage or weak efficiency (sometimes table overflow). Even if the number of objects is exactly known (a very rare case), concern for high efficiency suggests making the table slightly bigger (by about 10%). A second very important disadvantage is the difficulty that appears when deleting an element from the table. To sum up, it has to be admitted that a tree organisation is recommended whenever the number of data items is hard to determine or changes very often.

Examples of hash functions:

1. Cutting out 3 digits from the key and normalising to the memory range: h(k) = (digits 4, 5 and 6 of the key) · M/1000
2. Partition, linking and division by the biggest prime number ≤ M: h(k) = (left 12 bits of k + right 12 bits of k) mod M
3. Division by the biggest prime number ≤ M: h(k) = k mod M
4. Multiplicative Fibonacci hashing: h(k) = ⌊M · ((0.618034 · k) mod 1)⌋

Algorithms for collision resolving:

1. Linear probing with a step equal to 1: a_i = (a_{i-1} + 1) mod M
2. Linear probing with a step equal to 7: a_i = (a_{i-1} + 7) mod M
3. Double independent hashing (probing with a step depending on the value of h(k)): a_i = (a_{i-1} + h2(a_0)) mod M, where the probing step is h2(a_0) = 1 when a_0 = 0 and h2(a_0) = M - a_0 when a_0 ≠ 0
4. Double independent hashing (probing with a step depending on k): a_i = (a_{i-1} + h2(k)) mod M, where the probing step is h2(k) = 1 + (k mod (M-2))

where: k – the key, M – the table size, a_i – the i-th address tested.
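As a minimal sketch (Python), hash function no. 3 combined with collision resolving algorithm no. 1 (linear probing with a step of 1) can be written as follows; the names are illustrative, and the sketch assumes the table never becomes completely full:

    M = 13                      # table size: ideally the biggest prime <= the planned capacity
    table = [None] * M

    def h(k):
        return k % M            # hash function no. 3: division by M

    def insert(k):
        a = h(k)                # a_0: the address appointed by h
        while table[a] is not None and table[a] != k:
            a = (a + 1) % M     # collision: a_i = (a_{i-1} + 1) mod M
        table[a] = k            # a free address (or the key itself) was reached
        return a

    def find(k):
        a = h(k)
        while table[a] is not None:
            if table[a] == k:
                return a        # the key is under address a
            a = (a + 1) % M     # retrace the same probing sequence
        return None             # a free address was reached: k is absent

Counting the probes made by insert for different table sizes and hash functions is exactly the experiment asked for in task 9 below.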
4. TASKS TO BE DONE

1. Prepare a set of test data.
2. Carry out experiments using sequential search on random data and on sorted data. Compare the obtained results. As the result there should be shown a table containing the data, the number of search tests, and the average, minimum and maximum number of tests per experiment.
3. Carry out the same tests for the method of dichotomic division. Compare the obtained results.
4. Carry out experiments using binary trees. Compare the results obtained for 'ordinary' trees (without balancing) and perfectly balanced trees. Does the average number of comparisons differ substantially in the two cases?
5. Find an example order of input of 31 numbers into the tree that produces a perfectly balanced binary tree. Draw a conclusion about how the input data should be organised.
6. Carry out experiments on B-trees for different page sizes. Determine the optimal page size taking the search time into account.
7. Calculate the minimal and maximal number of elements that can be put into a B-tree of class t(h,m).
8. Carry out experiments on B*-trees for different page sizes. Determine the optimal page size taking the search time into account. Compare the results for B-trees and B*-trees.
9. Carry out experiments concerning data input into a hash table. Check the influence of the table size, the choice of the hash function and the applied collision resolving function on the number of collisions occurring in the process of data input.
10. Compare the obtained results with the theoretical predictions.