Some Data Structures
Chapter 5

Agenda
• Arrays, stacks and queues
• Records and pointers
• Lists
• Graphs
• Trees
• Associative tables
• Heaps
• Binomial heaps

Introduction
• The use of well-chosen data structures is often a crucial factor in the design of efficient algorithms.
• However, this course is not intended to be an elementary data structures course.
• This chapter gives a brief review of the algorithmically important aspects of the elementary data structures:
• arrays, records, stacks, queues, heaps, associative tables, disjoint sets, etc.

Arrays
• An array is a data structure consisting of a fixed number of items (elements) of the same type.
• Array elements are stored in contiguous memory locations.
• In a one-dimensional array, access to any particular item is obtained by specifying a single subscript or index.
• E.g. tab: array [1..50] of integers
• Here tab is an array of 50 integers indexed from 1 to 50;
• tab[1] refers to the first item of the array, tab[50] to the last.

Arrays
• An array has an essential property: we can calculate the address of any given element in constant time.
• E.g., if we know that the array tab starts at address 5000, and that integers occupy 4 bytes of storage each, then the element with index k is located at memory address 5000 + 4k.
• Even if we think it worthwhile to check that k does indeed lie between 1 and 50, the time required to compute the address can still be bounded by a constant.
• i.e. the time required to read the value of a single element is in O(1);
• likewise, the time required to modify such a value is in O(1).
• i.e. we can treat such operations as elementary.
• An elementary operation is one whose execution time can be bounded above by a constant depending only on the particular implementation used: the machine, the programming language, and so on.
• The constant does not depend on the size of the instance (e.g.
push and pop operations).

Arrays
• On the other hand, any operation that involves all the items of an array will tend to take longer as the size of the array increases.
• Suppose an array has n elements (i.e. array size = n):
• then an operation such as initializing every item, or finding the largest item, usually takes a time proportional to the number of items to be examined.
• Such operations take a time in Θ(n); i.e. they are non-elementary.

Arrays
• Another common situation is where we want to keep the values of successive items of the array in order (ascending, descending, etc.).
• Now whenever we decide to insert a new value, we have to open up a space in the correct position,
• either by copying all the higher values one position to the right, or else by copying all the lower values one position to the left.
• Whichever policy we adopt, in the worst case we may have to shift n/2 elements; i.e. a non-elementary operation.
• Similarly, deleting an element may require us to move all or most of the remaining items in the array.
• Such operations take a time in Θ(n); i.e. they are non-elementary.

Stack
• A one-dimensional array provides an efficient way to implement the data structure called a stack.
• In a stack, items are added and removed on a last-in, first-out (LIFO) basis.
• The situation can be represented using an array called stack, whose index runs from 1 to the maximum required size of the stack, along with a counter.
• To set the stack empty, the counter is given the value zero.
• To add an item to the stack, the counter is incremented and then the new item is written into stack[counter] (the push operation).
• To remove an item from the stack, the value of stack[counter] is read out and then the counter is decremented (the pop operation).
• Tests can be added to ensure that no more items are placed on the stack than were provided for, and that items are not removed from an empty stack.
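The push and pop mechanics described above can be sketched in Python; the class name and fixed capacity are illustrative, and the 0-based indices mean the new item is written first and the counter incremented afterwards (the reverse order of the 1-based pseudocode).

```python
class Stack:
    """Fixed-capacity stack backed by an array, as described above."""

    def __init__(self, maxsize):
        self.items = [None] * maxsize   # space reserved at the outset
        self.counter = 0                # counter == 0 means the stack is empty

    def push(self, x):
        if self.counter == len(self.items):
            raise OverflowError("stack is full")
        self.items[self.counter] = x    # write the new item ...
        self.counter += 1               # ... then bump the counter

    def pop(self):
        if self.counter == 0:
            raise IndexError("pop from empty stack")
        self.counter -= 1               # step the counter back ...
        return self.items[self.counter] # ... and read the item out

s = Stack(3)
s.push(1); s.push(2); s.push(3)
assert s.pop() == 3                     # last in, first out
```

Both operations touch a bounded number of cells regardless of how many items are stored, which is why the slides treat push and pop as elementary operations.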
Queue
• The data structure called a queue can also be implemented quite efficiently in a one-dimensional array.
• In a queue, items are added and removed on a first-in, first-out (FIFO) basis.
• Adding an item is called an enqueue operation, while removing one is called a dequeue.
• For both stacks and queues, one disadvantage of an array implementation is that space must be allocated at the outset for the maximum number of items envisaged;
• if ever this space is not sufficient, it is difficult to add more,
• while if too much space is allocated, the result is wasted memory.

Arrays
• The items of an array can be of any fixed-length type:
• this is so that the address of any particular item can be calculated easily.
• Generally the array index is an integer.
• However, other so-called ordinal types can be used. E.g.
• lettertab: array [‘a’..‘z’] of value
• is one possible way of declaring an array of 26 values, indexed by the letters from ‘a’ to ‘z’.
• It is not permissible to index an array using real numbers, strings, structures, etc.
• If such indexes were allowed, accessing an array item could no longer be considered an elementary operation.
• However, a more general data structure called an associative table does allow such indexes.

Matrix
• The examples seen so far have all involved one-dimensional arrays, whose items are accessed using just one index.
• Arrays with two or more indices can be declared in a similar way, e.g.
• matrix: array [1..20, 1..20] of complex
• is one way to declare an array containing 400 elements of type complex.
• A reference to any particular item, such as matrix[7,9], now requires two indices.
• However, the essential point remains: we can calculate the address of any given element in constant time, so reading or modifying its value can be taken to be an elementary operation.
• Obviously, if both dimensions of a two-dimensional array depend on some parameter n, as in the case of an n × n matrix, then operations such as initializing every item of the array, or finding the largest item, now take a time in Θ(n²).

Records & Pointers
• While an array has a fixed number of items of the same type,
• a record is a data structure comprising a fixed number of items, often called fields, that are possibly of different types.
• E.g., if the information about a person that interests us consists of their name, age, weight and gender, we might want to use a data structure of the following form:
• type person = record
      name: string
      age: integer
      weight: real
      male: Boolean

Records & Pointers
• Now if Ramesh is a variable of this type, we use the dot notation to refer to the fields; e.g. Ramesh.name is a string and Ramesh.age is an integer.
• An array can appear as an item of a record, and records can be kept in arrays. E.g. with
• class: array [1..50] of person
• the array class holds 50 records.
• The attributes of the seventh member of the class are referred to as class[7].name, class[7].age, and so on.

Records & Pointers
• As with arrays, provided a record holds only fixed-length items, the address of any particular field can be calculated in constant time,
• so consulting or modifying the value of a field may be considered an elementary operation.
• Many programming languages allow records to be created and destroyed dynamically.
• This is one reason why records are commonly used in conjunction with pointers.
• A declaration such as
• type boss = ↑person
• says that boss is a pointer to a record whose type is person. Such a record can be created dynamically by a statement such as
• boss ← new person

Records & Pointers
• Now boss↑ means “the record that boss points to”.
• To refer to the fields of this record, we use boss↑.name, boss↑.age, and so on.
• If a pointer has the special value nil, it is not currently pointing to any record.
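The person record and a pointer to it might be sketched in Python as follows, where object references play the role of pointers and None plays the role of nil; the field names follow the slide, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    """The 'person' record from the slide: fixed fields of different types."""
    name: str
    age: int
    weight: float
    male: bool

# boss is a pointer to a person; nil (None) means it points at no record yet
boss: Optional[Person] = None

# boss <- new person: create the record dynamically
boss = Person(name="Ramesh", age=40, weight=72.5, male=True)

# boss^.name and boss^.age in the slide's notation
assert boss.name == "Ramesh"
assert boss.age == 40
```

As with the pseudocode record, each field sits at a fixed offset inside the object, so consulting or modifying a field is a constant-time operation.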
Lists
• A list is a collection of items of information arranged in a certain order.
• Unlike arrays and records, the number of items in a list is generally not fixed, nor is it usually bounded in advance.
• The corresponding data structure should allow us to determine
• which is the first item in the structure,
• which is the last, and
• which are the predecessor and the successor (if they exist) of any given item.
• On a machine, the storage corresponding to any given item of information is called a node.
• (figure: a list of the nodes alpha, beta, gamma, delta)

Lists
• Lists are subject to a number of operations; we might want to
• insert an additional node,
• delete a node,
• copy a list,
• count the number of elements it contains, and so on.
• The various computer implementations commonly used differ in the quantity of storage required, and in the ease of carrying out certain operations.
• Here we content ourselves with mentioning the best-known techniques.
• A list may be implemented as an array by the declaration
• type list = record
      counter: 0..maxlength
      value: array [1..maxlength] of information
• The items of the list occupy the slots value[1] to value[counter], and the order of the items is the same as the order of their indices in the array.

Lists
• Using this implementation, we can find the first and the last items of the list rapidly.
• On the other hand, inserting a new item or deleting one of the existing items requires a worst-case number of operations in the order of the current size of the list.
• However, this implementation is particularly efficient for the important structure known as the stack,
• which can be considered as a kind of list where addition and deletion of items are allowed only at one designated end of the list.
• Despite this, such an implementation of a stack has the major disadvantage of requiring that all the storage potentially needed be reserved throughout the life of the program.
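A minimal sketch of the counter-plus-array list just described; inserting at a given position shifts every later item one slot to the right, which is exactly the linear-time cost noted above. The class and method names are illustrative.

```python
class ArrayList:
    """List stored as a counter plus a fixed array, as in the slide."""

    def __init__(self, maxlength):
        self.value = [None] * maxlength
        self.counter = 0                 # items occupy value[0..counter-1]

    def insert_sorted(self, x):
        """Insert x keeping the items in ascending order: Θ(n) worst case."""
        if self.counter == len(self.value):
            raise OverflowError("list is full")
        i = self.counter
        while i > 0 and self.value[i - 1] > x:
            self.value[i] = self.value[i - 1]   # shift larger items right
            i -= 1
        self.value[i] = x                       # drop x into the opened slot
        self.counter += 1

lst = ArrayList(10)
for x in [5, 2, 8, 1]:
    lst.insert_sorted(x)
assert lst.value[:lst.counter] == [1, 2, 5, 8]
```

Finding the first or last item is O(1) here, but every insertion or deletion in the middle pays for the shifting, which is why the slides call these operations non-elementary.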
Lists
• On the other hand, if pointers are used to implement a list structure, the nodes are usually records with a form similar to the following:
• type list = ↑node
• type node = record
      value: information
      next: ↑node
• where each node except the last includes an explicit pointer to its successor.
• The pointer in the last node has the special value nil, indicating that it points nowhere.
• To examine the kth item, we have to follow k pointers, which takes a time in O(k).

Graphs
• A graph is a set of nodes joined by a set of lines or arrows.
• There are two types: directed and undirected.
• In a directed graph, the nodes are joined by arrows called edges.
• In the figure, there exists an edge from alpha to gamma and another from gamma to alpha.
• In an undirected graph, the nodes are joined by lines with no direction indicated.
• In both directed and undirected graphs, sequences of edges may form paths and cycles.
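The pointer-based list nodes described above (a value plus a next pointer) can be sketched in Python; the helper function kth is hypothetical, and illustrates why reaching the kth item costs O(k).

```python
class Node:
    """One list node: a value and a pointer to the successor (None = nil)."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

# build the list alpha -> beta -> gamma -> delta from the earlier figure
head = Node("alpha", Node("beta", Node("gamma", Node("delta"))))

def kth(lst, k):
    """Return the kth node (1-based) by following k - 1 pointers: O(k)."""
    node = lst
    for _ in range(k - 1):
        node = node.next
    return node

assert kth(head, 1).value == "alpha"
assert kth(head, 3).value == "gamma"
```

Insertion after a known node only rewires two pointers, so it is O(1); the cost has moved from shifting items (the array implementation) to walking the chain.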
(Note: a set is an unordered collection of elements) • E.g.-figure 5.3 is an informal representation of the graph G=<N,A> where Graphs • There are at least two obvious ways to represent a graph on a computer. • The first way uses an adjacency matrix. type adjgraph = record value : array [1..nbnodes] of information adjacent : array [1..nbnodes, 1..nbnodes] of Boolean • If the graph includes an edge from node i to node j, then • adjacent[i,j] = true; • Otherwise adjacent[i,j] = false; • In case of undirected graph, the matrix is necessarily symmetric. • With this representation, it is easy to see whether or not there is an edge between two given nodes: • to look up a value in an array takes a constant time. Graphs • On the other hand, should we wish to examine all the nodes connected to some given node, • we have to scan a complete row of the matrix, in the case of an undirected graph, or • both a complete row and a complete column, in the case of a directed graph. • This takes a time in Θ(nbnodes), the number of nodes in the graph, • independent of the number of edges that enter or leave this particular node. Graphs • There are at least two obvious ways to represent a graph on a computer. • The second way of graph representation is: type lisgraph = array [1..nbnodes] of record value : information neighbours : list • Here we attach to each node i a list of its neighbors, • i.e. of those nodes j such that an edge from i to j (in the case of a directed graph), • or between i and j (in case of an undirected graph), exists. • If the number of edges in the graph is small, this representation uses less storage than the one given previously. • It may also be possible in this case to examine all the neighbours of a given node in less than nbnodes operations on the average. 
Graphs
• On the other hand, determining whether a direct connection exists between two given nodes i and j
• requires us to scan the list of neighbours of node i (and possibly of node j too, in the case of a directed graph),
• which is less efficient than looking up a Boolean value in an array.

Trees
• A tree (sometimes called a free tree) is an acyclic, connected, undirected graph.
• Equivalently, a tree may be defined as an undirected graph in which there exists exactly one path between any given pair of nodes.
• Since a tree is a kind of graph, the same representations used to implement graphs can be used to implement trees.

Trees
• Figure 5.4(a) shows two trees, each of which has four nodes.
• You can easily verify that these are the only distinct trees with four nodes.
• Properties:
• A tree with n nodes has exactly n − 1 edges.
• If a single edge is added to a tree, the resulting graph contains exactly one cycle.
• If a single edge is removed from a tree, the resulting graph is no longer connected.

Trees
• A rooted tree is a tree in which one node, called the root, is special.
• When drawing a rooted tree, it is customary to put the root at the top, like a family tree, with the other nodes coming down from it.
• Figure 5.4(b) shows four rooted trees, each with four nodes;
• these are the only rooted trees with four nodes that exist.
• From here on we say ‘tree’ for ‘rooted tree’.

Trees
• On a computer, any rooted tree may be represented using nodes of the following type:
• type treenode1 = record
      value: information
      eldest-child, next-sibling: ↑treenode1

Trees
• Another representation suitable for any rooted tree uses nodes of the type
• type treenode2 = record
      value: information
      parent: ↑treenode2
• where now each node contains only a single pointer leading to its parent.
• This representation is about as economical with storage space, but
• it is inefficient unless all the operations on the tree involve starting from a node and going up, never down.
• Moreover, it does not represent the order of siblings.

Binary trees
• In a binary tree, each node can have 0, 1 or 2 children.
• In fact, a node has two pointers, one to its left child and one to its right child, either of which can be nil.
• The two trees in figure 5.8 are different.

k-ary trees

Binary search tree
• A binary tree is a search tree if
• the value contained in every internal node is greater than or equal to the values contained in its left child or any of that child’s descendants, and
• less than or equal to the values contained in its right child or any of that child’s descendants.

Binary search tree: height, depth and level
• The height of a node is the number of edges in the longest path from the node in question to a leaf.
• The depth of a node is the number of edges in the path from the root to the node in question.
• The level of a node is equal to the height of the root of the tree minus the depth of the node concerned.
• The height of the tree is the height of its root;
• this is also the depth of the deepest leaf and the level of the root.

Associative Tables
• An associative table is just like an array, except that the index is not restricted to lie between two prespecified bounds.
• For instance, if T is a table, you may use T[1] and then T[10^6] with no need to reserve one million storage cells for the table.
• It is also valid to use strings rather than numbers for the index, e.g. T[“Fred”].
• Ideally, a table should not take much more space than is needed to write down the indexes used so far, together with space for the values stored in those table locations.
• The convenience of tables comes at a price:
• unlike arrays, tables cannot be implemented so that each access is guaranteed to take constant time.

Associative Tables
• The easiest way to implement a table is with a list.
• With this implementation, access to T[“Fred”] is accomplished by
• marching through the list until “Fred” is found in the index field of a node, or
• until the end of the list is reached.
• In the first case, the associated information is in the value field of the same node;
• in the second case, we know that “Fred” is missing.
• New entries are created when needed.

Associative Tables
• The previous implementation is very inefficient.
• In the worst case, each request concerns a missing element, forcing us to explore the entire list at each access.
• A sequence of n accesses therefore requires a time in Ω(n²) in the worst case.
• If up to m distinct indexes are used in those n accesses, m ≤ n,
• and each request is equally likely to access any of those indexes,
• the average-case performance of this implementation is as bad as its worst case: Ω(mn).
• Provided the index fields can be compared with one another, and with the requested index, in unit time,
• balanced trees can be used to reduce this time to O(n log m) in the worst case.
• This is better, but still not good enough for one of the main applications of associative tables, namely compilers.

Associative Tables
• Just about every compiler uses an associative table to implement the symbol table.
• This table holds all the identifiers used in the program to be compiled.
• If Fred is an identifier, the compiler must be able to access relevant information such as its type and the level at which it is defined.
• Using tables, this information is simply stored in T[“Fred”].
• Either implementation outlined so far would slow a compiler down unacceptably when it dealt with programs containing a large number of identifiers.
• Instead, most compilers use a technique known as hash coding, or simply hashing. (Hashing performs reasonably well most of the time.)
• Q: why are the list and balanced-tree implementations of associative tables not preferred in compilers?
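The list implementation just described can be sketched as follows; each access marches down the list of entries, which is the linear cost that makes this approach too slow for a compiler's symbol table. The class name is illustrative.

```python
class ListTable:
    """Associative table kept as a linear list of (index, value) pairs."""

    def __init__(self):
        self.entries = []                    # the list of nodes

    def __setitem__(self, index, value):
        for i, (k, _) in enumerate(self.entries):
            if k == index:                   # index already present: update
                self.entries[i] = (index, value)
                return
        self.entries.append((index, value))  # new entry created when needed

    def __getitem__(self, index):
        for k, v in self.entries:            # march through the list
            if k == index:
                return v                     # found: value field of the node
        raise KeyError(index)                # end of list reached: missing

T = ListTable()
T["Fred"] = "integer"                        # e.g. the identifier's type
T[1] = 3.14                                  # numeric indexes work too
assert T["Fred"] == "integer"
```

A miss costs a full scan of all m entries, so n accesses can cost Ω(mn) overall; this is the inefficiency that the hashing technique of the next slides addresses.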
Hashing
• Let U be the universe of potential indexes for the associative table to be implemented, and
• let N << │U│ be a parameter chosen as described below.
• A hash function is a function h : U → {0, 1, 2, …, N−1}.
• This function should efficiently disperse all the probable indexes: h(x) should be different from h(y) for most of the pairs x ≠ y likely to be used simultaneously.
• For instance, h(x) = x mod N
• is reasonably good for compiler purposes, provided N is prime and
• x is obtained from the identifier’s name using any standard integer representation of character strings, such as ASCII.
• When x ≠ y but h(x) = h(y), we say there is a collision between x and y.
• If collisions were very unlikely, we could implement the associative table T with an ordinary array A[0..N−1],
• using A[h(x)] each time T[x] is logically needed.

Hashing
• The problem is that T[x] and T[y] are confused whenever x and y collide.
• Unfortunately, this cannot be tolerated, because the collision probability cannot be neglected unless N >> m², where m is the number of distinct indexes actually used.
• Many solutions to this difficulty have been proposed.
• The simplest is list hashing, or chaining:
• each entry of the array A[0..N−1] is of type table_list; A[i] contains the list of all indexes that hash to value i, together with their relevant information.

Hashing
• Figure 5.11 illustrates the situation after four requests to the associative table T.

Hashing
• The load factor of the table is m/N, where
• m is the number of distinct indexes used in the table, and
• N is the size of the array used to implement it.
• If we suppose that every index, every value stored in the table and every pointer occupies a constant amount of space, the table takes space in Θ(N + m), and
• the average length of the lists is equal to the load factor.
• Thus increasing N reduces the average list length, but
• increases the space occupied by the table.
• If the load factor is kept between ½ and 1,
• the table occupies a space in Θ(m),
• which is optimal up to a small constant factor, and
• the average list length is less than 1,
• which is likely to imply efficient access to the table.
• Q: why is hashing preferred in compilers for dealing with symbol tables?

Heaps
• A heap is a special kind of rooted tree that can be implemented efficiently in an array without any explicit pointers.
• This interesting structure lends itself to numerous applications,
• including a remarkable sorting technique called heapsort (presented later in this section).
• It can also be used for the efficient representation of certain dynamic priority lists,
• such as the event list in a simulation or the list of tasks to be scheduled by an operating system.

Heaps
• A binary tree is essentially complete if each internal node, with the possible exception of one special node, has exactly two children.
• The special node, if there is one, is situated on level 1;
• it has a left child but no right child.
• Moreover, either all the leaves are on level 0, or
• else they are on levels 0 and 1, and
• no leaf on level 1 is to the left of an internal node at the same level.
• Intuitively, an essentially complete tree is one where the internal nodes are pushed up the tree as high as possible,
• the internal nodes on the last level being pushed over to the left;
• the leaves fill the last level containing internal nodes, if there is still any room, and then spill over onto the left of level 0.
• (figure: an essentially complete binary tree)

Heaps
• If an essentially complete binary tree has height k, then
• there is one node (the root) on level k,
• there are two nodes on level k−1, and so on;
• there are 2^(k−1) nodes on level 1, and
• at least 1 and not more than 2^k on level 0.
• If the tree contains n nodes in all, counting both internal nodes and leaves, it follows that 2^k ≤ n < 2^(k+1).
• Equivalently, the height of a tree containing n nodes is k = ⌊lg n⌋ (we shall use this result later).

Heaps
• This kind of tree can be represented in an array T by
• putting the nodes of depth k, from left to right, in the positions T[2^k], T[2^k + 1], …, T[2^(k+1) − 1],
• with the possible exception of level 0, which may be incomplete.
• The figure indicates which array element corresponds to each node of the tree.
• Using this representation,
• the parent of the node represented in T[i] is found in T[i ÷ 2] for i > 1 (the root T[1] does not have a parent), and
• the children of the node represented in T[i] are found in T[2i] and T[2i + 1] (if they exist).
• The subtree whose root is in T[i] is also easy to identify.

Heaps
• Now, a heap is an essentially complete binary tree,
• each of whose nodes includes an element of information called the value of the node,
• and which has the property that the value of each internal node is greater than or equal to the values of its children.
• This is called the heap property.
• Figure 5.13 shows an example of a heap with 10 nodes.
• The heap property can easily be checked:
• e.g., the node whose value is 9 has two children whose values are 5 and 2;
• both children have values less than the value of their parent.
• This same heap can be represented by the following array:
• 10 7 9 4 7 5 2 2 1 6

Heaps
• Since the value of each internal node is greater than or equal to the values of its children,
• which in turn have values greater than or equal to the values of their children, and so on,
• the heap property ensures that the value of each internal node is greater than or equal to the values of all the nodes that lie in the subtrees below it.
• In particular, the value of the root is greater than or equal to the values of all the other nodes in the heap.

Heaps
• The heap property can be restored efficiently if the value of a node is modified.
• If the value of a node increases to the extent that it becomes greater than the value of its parent,
• it suffices to exchange these two values, and then
• to continue the same process upwards in the tree, if necessary, until the heap property is restored.
• We say that the modified value has been percolated up to its new position in the heap (this operation is often called sifting up).
• Example of percolating up (next slide).

Heaps – sifting up (percolate)
• E.g., if the value 1 in figure 5.13 is modified so that it becomes 8,
• we can restore the heap property by exchanging the 8 with its parent 4, and
• then exchanging it again with its new parent 7,
• obtaining the result shown in figure 5.14.

Special Types of Trees
• A full binary tree is a binary tree in which each node is either a leaf or has degree exactly 2.
• A complete binary tree is a binary tree in which all leaves are on the same level and all internal nodes have degree 2.
• (figure: examples of a full binary tree and a complete binary tree)

Definitions
• Height of a node = the number of edges on the longest simple path from the node down to a leaf.
• Level of a node = the length of the path from the root to the node.
• Height of the tree = the height of the root node.
• (figure: a tree of height 3, with node heights and levels annotated)

The Heap Data Structure
• A heap is a nearly complete binary tree with the following two properties:
• Structural property: all levels are full, except possibly the last one, which is filled from left to right.
• Order (heap) property: for any node x, Parent(x) ≥ x.
• From the heap property, it follows that “the root is the maximum element of the heap!”
• A heap is a binary tree that is filled in order.

Array Representation of Heaps
• A heap can be stored as an array A:
• the root of the tree is A[1],
• the left child of A[i] is A[2i],
• the right child of A[i] is A[2i + 1],
• the parent of A[i] is A[⌊i/2⌋], and
• heapsize[A] ≤ length[A].
• The elements in the subarray A[(n/2 + 1) ..
n] are leaves.

Adding/Deleting Nodes
• New nodes are always inserted at the bottom level (left to right).
• Nodes are removed from the bottom level (right to left).

Heaps
• The heap property can be restored efficiently if the value of a node is modified.
• If, on the contrary, the value of a node is decreased so that it becomes less than the value of its children,
• it suffices to exchange the modified value with the larger of the values in the children, and then
• to continue the same process downwards in the tree, if necessary, until the heap property is restored.
• We say that the modified value has been sifted down to its new position.
• Example of sifting down (next slide).

Maintaining the Heap Property
• Suppose a node is smaller than a child, while the left and right subtrees of i are max-heaps.
• To eliminate the violation:
• exchange the node with its larger child,
• move down the tree, and
• continue until the node is not smaller than its children.
• (figure: A[2] violates the heap property and is exchanged with A[4]; then A[4] violates it and is exchanged with A[9]; the heap property is restored)

Maintaining the Heap Property
• Assumptions:
• the left and right subtrees of i are max-heaps;
• A[i] may be smaller than its children.
• MAX-HEAPIFY(A, i, n)
  1. l ← LEFT(i)
  2. r ← RIGHT(i)
  3. if l ≤ n and A[l] > A[i]
  4.    then largest ← l
  5.    else largest ← i
  6. if r ≤ n and A[r] > A[largest]
  7.    then largest ← r
  8. if largest ≠ i
  9.    then exchange A[i] ↔ A[largest]
  10.
     MAX-HEAPIFY(A, largest, n)

MAX-HEAPIFY Running Time
• Intuitively, the node may be exchanged down a path of length at most h, with constant work per level: O(h).
• The running time of MAX-HEAPIFY is O(lg n).
• It can be written in terms of the height h of the heap as O(h),
• since the height of the heap is ⌊lg n⌋.

Heaps – sifting down
• E.g., if the value 10 in the root of figure 5.14 is modified to 3,
• we can restore the heap property by exchanging the 3 with its larger child 9, and
• then exchanging it again with the larger of its new children, namely 5,
• obtaining the result shown in figure 5.15.

Heaps – manipulation
• The heap is an ideal data structure for
• finding the largest element of a set,
• removing it,
• adding a new node, or
• modifying a node.
• These are exactly the operations we need to implement dynamic priority lists efficiently:
• the value of a node gives the priority of the corresponding event,
• the event with highest priority is always found at the root of the heap, and
• the priority of an event can be changed dynamically at any time.
• This is useful in computer simulations and in the design of schedulers for an operating system.

Heaps
• To find the maximum node of the heap: (figure)
• To delete the maximum node from the heap: (figure)
• To insert a node into the heap: (figure)

Heaps
• To create a heap from an array T[1..n] of elements in an undefined order:
• the obvious solution is to start with an empty heap and to add the elements one by one, but this is inefficient;
• an alternative algorithm constructs the heap in linear time.

Heapsort
• Williams invented the heap to serve as the underlying data structure for the following sorting algorithm,
• which takes a time in O(n log n) to sort n elements.

References
• Textbook: Fundamentals of Algorithmics by Gilles Brassard & Paul Bratley.
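As a closing sketch tying the heap material together, the sift-down procedure (MAX-HEAPIFY) and heapsort can be written in Python; with 0-based lists the children of i sit at 2i + 1 and 2i + 2 rather than 2i and 2i + 1, and the recursion of the pseudocode is replaced here by a loop.

```python
def max_heapify(a, i, n):
    """Sift a[i] down within a[0:n] until the heap property is restored."""
    while True:
        l, r = 2 * i + 1, 2 * i + 2          # children of i (0-based)
        largest = i
        if l < n and a[l] > a[largest]:
            largest = l
        if r < n and a[r] > a[largest]:
            largest = r
        if largest == i:                     # heap property holds: done
            return
        a[i], a[largest] = a[largest], a[i]  # exchange with the larger child
        i = largest                          # continue downwards

def heapsort(a):
    """Williams's heapsort: build a max-heap, then extract maxima."""
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):      # build the heap bottom-up
        max_heapify(a, i, n)                 # (the linear-time construction)
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]          # move current maximum to the end
        max_heapify(a, 0, end)               # restore the heap on the rest

data = [10, 7, 9, 4, 7, 5, 2, 2, 1, 6]       # the array of figure 5.13
heapsort(data)
assert data == sorted([10, 7, 9, 4, 7, 5, 2, 2, 1, 6])
```

Each of the n extractions does at most O(lg n) work in max_heapify, giving the O(n log n) total stated above.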