CS503: Fourteenth Lecture, Fall 2008
Amortized Analysis, Sets
Michael Barnathan
A Preliminary Note
• The withdrawal deadline is Nov. 4.
• THAT IS ONE WEEK FROM TODAY.
• It’s also election day, so if you wish to both
vote and withdraw from a course, you may
wish to plan ahead.
Here’s what we’ll be learning:
• Theory:
– Amortized Analysis
– More complex recurrences.
• Data Structures:
– Disjoint sets and unions (very quick overview).
• Java:
– Sets and Multisets.
Traditional Asymptotic Analysis
• Looks at the behavior of one operation.
– One insertion takes O(n) time…
– One search takes O(log n) time…
– One, one, one.
• If every operation takes the same amount of time, this is
perfectly fine; we can figure out the cost of the sequence.
– What is the total complexity of n operations, each taking O(n) time?
• However, this is not always the case.
– What about Vectors, which increase in size when they are filled?
– Each insertion at the end takes O(1) time, until the array is full,
at which point the next insertion takes O(n) time.
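The doubling strategy can be sketched as a tiny Vector-like class (a hypothetical IntVector, not Java's actual Vector):

```java
// A minimal sketch of a doubling array, illustrating why a full
// insertion costs O(n): every existing element must be copied.
public class IntVector {
    private int[] data = new int[1];
    private int size = 0;

    public void add(int value) {
        if (size == data.length) {
            // Array is full: allocate an array of twice the size
            // and copy every existing element -- an O(n) step.
            int[] bigger = new int[data.length * 2];
            System.arraycopy(data, 0, bigger, 0, size);
            data = bigger;
        }
        data[size++] = value; // the O(1) common case
    }

    public int size() { return size; }
    public int capacity() { return data.length; }
}
```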
Vector Doubling in Traditional Analysis
• Suppose we perform n insertions on a vector that employs the
doubling strategy.
• In traditional analysis, every operation has the same cost. So
what is the worst case cost of insertion into an array?
– O(n), because in the worst case, we double.
– This is despite the fact that most insertions take O(1) time, because
the majority do not double.
• We perform n insertions, each taking O(n) time.
• What is the bound upon the complexity?
– n * O(n) = O(n2).
• This is clearly not a tight bound.
Amortization
• Amortized analysis analyzes worst-case performance over a
sequence of operations.
• It is not an average case analysis; it is an “average of worst
cases”.
• Going back to Vector insertion:
– If we perform 6 insertions into a Vector of size 5, 5 of those insertions
will take 1 unit of time. The sixth will take 6 units. (Both in the worst
case)
– Since all 6 insertions will take 11 units of time, one insertion
contributes roughly 2 time units, not on average, but in the worst case.
– 2 is a constant. We would expect constant time behavior on insert.
– An individual insertion may take longer (the sixth insertion takes 6
units of time, for example), but it will make up for it by preparing
subsequent insertions to run quickly (by doubling the array).
Methods of Amortization
• There are three commonly employed amortized analysis
techniques.
• From least to most formal:
– The Aggregate Method:
• Count the number of time units across a sequence of operations and
divide by the number of operations.
– The Accounting Method:
• Each operation “deposits” time, which is then used to “pay for” expensive
operations.
– The Potential Method:
• A “potential function” φ is defined based on the change in state brought
about by each operation and the difference in potential is added to the
total cost (this difference may be negative).
• We won’t go into too much detail on this method.
• Each method has its limitations.
The Aggregate Method
• This is the simplest method of analysis.
• Simply add the worst-case cost of each operation in a
sequence up, then divide by the total number of operations in
the sequence:

AmortizedCost = [ Cost(1) + Cost(2) + … + Cost(n) ] / n
• The cost of each operation is very often defined
asymptotically, not as a number.
• But that’s OK; O(n) means “a linear function of n”.
• So O(n) / n = O(1), O(n2) / n = O(n), and so forth.
The Aggregate Method – Example:
Data: 1    2    3
Cost: O(1) O(1) O(1)

So far, the amortized cost is [n * O(1)] / n = O(1).

Data: 1    2    3    4    5    6    7
Cost: O(1) O(1) O(1) O(n) O(1) O(1) O(1)

The fourth insertion doubles the array, an O(n) operation.
Now the amortized cost is [(n-1) * O(1) + O(n)] / n = O(1) + O(1) = O(1).
Caveats
• The lack of formalism in the aggregate method
has some consequences.
• Specifically, when using the aggregate
method, be careful with your asymptotics!
• O(n) at the 4th insertion is very different from
O(n) at the 32768th insertion!
• It is thus sometimes useful to define the
elementary cost of inserting without doubling
as simply “1” and to use exact numbers.
Again, with numbers.
Data: 1  2  3
Cost: 1  1  1

So far, the amortized cost is [3 * 1] / 3 = 1.

Data: 1  2  3  4  5  6  7
Cost: 1  1  1  4  1  1  1

The fourth insertion doubles the array; with exact numbers it costs 4
units (3 copies plus 1 insertion).
Now the amortized cost is [6 * 1 + 4] / 7 = 10 / 7 ≈ 1.43.
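This unit-cost bookkeeping is easy to check empirically; a sketch (names are ours) that counts 1 unit per insertion plus one unit per element copied at each doubling:

```java
// Counting exact unit costs over a sequence of insertions, as in the
// aggregate method: 1 unit per ordinary insert, plus (current size)
// units of copying whenever the array doubles.
public class AggregateCount {
    public static double amortizedCost(int n) {
        long total = 0;
        int capacity = 1, size = 0;
        for (int i = 0; i < n; i++) {
            if (size == capacity) {
                total += size;      // copying on a double costs 'size' units
                capacity *= 2;
            }
            total += 1;             // the insertion itself costs 1 unit
            size++;
        }
        return (double) total / n;  // aggregate: total cost / #operations
    }

    public static void main(String[] args) {
        System.out.println(amortizedCost(1_000_000)); // stays below 3
    }
}
```

However many insertions we try, the result never exceeds the constant 3 derived on the next slide.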
How it Converges
• It turns out that this is always constant-time, no matter how
many insertions you do:
• For a sequence of n operations, the algorithm will double at n,
n/2, n/4, n/8, …
• The number of elements in the array at each double is the cost
of that doubling step (because it’s O(n)).
• So the cost is defined by a convergent series:
  Cost  =  n  +  (n + n/2 + n/4 + n/8 + …)
       (insertions)       (doublings)

  Cost / n  =  1 + (1 + 1/2 + 1/4 + 1/8 + …)  ≤  1 + 2  =  3
• So, at worst, it will take you thrice as long to use a doubling
array as a preallocated one.
• 3 * O(1) = O(1); this is still constant-time.
The Accounting Method
• The accounting method begins by assigning each elementary
operation a cost of $1.
– The cost your analysis returns, of course, is then in terms of how long
those operations take.
• Each operation will then pay for:
– The actual cost of the operation, and
– The future cost of keeping that element maintained (for example,
copying it in the array).
• We save in advance so we have something to “spend” when
we double.
• We call the saved money “the bank”.
– The bank balance never goes negative; there are no subprime loans.
• This is somewhat difficult because it requires looking ahead to
see what happens when we double.
What’s the cost?
• Each element we insert costs $1 immediately.
• When doubling:
– We will have to move each element to a new array; each
move costs $1.
– We will have to create a new element for each existing
element (because we’re doubling the size). Something is
eventually going to fill this spot as well. This will cost $1.
• So the total cost is $3 per insertion.
– $1 for now, $2 for the future.
• This is the same answer we received using the aggregate
method, but it requires more careful inspection to arrive at.
Does it work?
• Remember, the bank must never go negative.
• Doubling costs $n+1: $n to copy the n
elements, $1 for the insertion that
immediately follows.
• Each insertion pays $3 and costs $1, so $2
goes into the bank at each non-doubling step.
• And each doubling costs $n+1.
Yes, it does.
 i | size | Deposit | Cost | Profit | Bank
 1 |   1  |   $3    |  $1  |   $2   |  $2
 2*|   2  |   $3    |  $2  |   $1   |  $3
 3*|   4  |   $3    |  $3  |   $0   |  $3
 4 |   4  |   $3    |  $1  |   $2   |  $5
 5*|   8  |   $3    |  $5  |  -$2   |  $3
 6 |   8  |   $3    |  $1  |   $2   |  $5

Starred rows (red fields in the original slides) represent insertions
that cause the array to double.
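This bookkeeping is easy to simulate; a sketch (names are ours) that tracks the bank balance under the $3-per-insertion scheme and reports the lowest balance ever reached:

```java
// Simulating the accounting method: each insertion deposits $3;
// a doubling at size k spends $k for the copies, and every insertion
// spends $1 for itself. The bank balance must never go negative.
public class BankSim {
    public static int minBankOver(int n) {
        int bank = 0, capacity = 1, size = 0, minBank = 0;
        for (int i = 0; i < n; i++) {
            bank += 3;              // deposit $3 per insertion
            if (size == capacity) {
                bank -= size;       // $1 per element copied
                capacity *= 2;
            }
            bank -= 1;              // $1 for the insertion itself
            size++;
            minBank = Math.min(minBank, bank);
        }
        return minBank;
    }

    public static void main(String[] args) {
        System.out.println(minBankOver(1_000_000)); // never dips below 0
    }
}
```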
Potential Method
• Instead of “saving” and paying the cost later, the
potential method measures the “potential difference”
between two adjacent operations.
• This is defined by a potential function φ.
– φ(0) = 0.
– φ(i) ≥ 0 for all i.
• The amortized cost of operation i is determined by the
actual cost plus the difference in potential:
– ac_i = c_i + [φ(i) − φ(i−1)]
• The total cost is the sum of these individual costs:

  Σ ac_i = Σ (c_i + [φ(i) − φ(i−1)]) = Σ c_i + [φ(n) − φ(0)]   (sums over i = 1..n)

  (the potential terms telescope; only the first and last remain)
Potential Method Example
• For the array doubling problem, φ(i) = 2i − 2^⌈lg i⌉.
• That is, the difference between twice the current size (the amount
the array would double to) and the least power of 2 that is at least
the current size.
• If i−1 is a power of 2 (the i-th insertion doubles the array), the
actual cost is i, φ(i) − φ(i−1) = 2 − (i−1), and ac_i = i + 2 − (i−1) = 3.
• If i−1 is not a power of 2, ⌈lg i⌉ = ⌈lg(i−1)⌉, the power-of-2 terms
cancel, φ(i) − φ(i−1) = 2, and ac_i = 1 + 2 = 3.
• So we get the same answer as in the other methods.
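This case analysis can be checked mechanically; a sketch assuming a doubling insertion i has actual cost i (i−1 copies plus the insertion) and every other insertion costs 1:

```java
// Verifying that c_i + phi(i) - phi(i-1) = 3 for every insertion i,
// with phi(i) = 2i - 2^ceil(lg i).
public class PotentialCheck {
    static int ceilLg(int i) { // smallest k with 2^k >= i, for i >= 1
        return i == 1 ? 0 : 32 - Integer.numberOfLeadingZeros(i - 1);
    }
    static int phi(int i) { return 2 * i - (1 << ceilLg(i)); }
    static boolean isPow2(int k) { return k > 0 && (k & (k - 1)) == 0; }

    public static boolean allThree(int n) {
        for (int i = 2; i <= n; i++) {
            // Assumed cost model: a doubling copies i-1 elements and inserts one.
            int actual = isPow2(i - 1) ? i : 1;
            int amortized = actual + phi(i) - phi(i - 1);
            if (amortized != 3) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allThree(100_000)); // true: always exactly 3
    }
}
```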
Sets
• A set is a data structure that can be used to
store records in which the key is the same as
the value, keeping all elements in the set
unique.
• There are two types in Java:
– TreeSets: Sorted Unique Containers.
– HashSets: Unsorted Unique Containers.
• A multiset, or bag, is like a set, but without
the uniqueness constraint.
Special Properties
• Elements in a set class are guaranteed unique.
• Attempting to insert an element that already exists will not modify the set
at all and will cause add() to return false.
• Sets can be split and merged.
– You can get the entire set of elements greater than or less than a target, for
example.
– Or you can merge two disjoint sets together.
• This is called a union operation.
• TreeSets are implemented using balanced binary search trees (red-black
trees) in Java, providing O(log n) insertion, access, and deletion and
guaranteeing sorted order (remember to implement Comparable in your classes).
• HashSets are implemented using hash tables, providing average-case O(1)
insertion, access, and deletion, but not guaranteeing sorted order.
• They are thus appropriate data structures to use for operations such as
picking out the unique words in a book and outputting them in sorted
order.
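For example, picking out the unique words of a string in sorted order takes only a few lines with a TreeSet (a sketch; the tokenizing regex is our own choice):

```java
import java.util.TreeSet;

public class UniqueWords {
    public static TreeSet<String> uniqueSorted(String text) {
        TreeSet<String> words = new TreeSet<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) words.add(w); // duplicates are simply ignored
        }
        return words;
    }

    public static void main(String[] args) {
        // Iterating a TreeSet yields its elements in sorted order.
        System.out.println(uniqueSorted("the quick brown fox jumps over the lazy dog"));
        // [brown, dog, fox, jumps, lazy, over, quick, the]
    }
}
```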
Methods
• Has some of the usual ones: add(), remove(), size(),
addAll()…
• But also some exotic ones that return elements or
subsets greater than or less than an element:
– higher(Object): Returns the least element > Object.
– lower(Object): Returns the greatest element < Object.
– floor(Object): Returns the greatest element ≤ Object.
– ceiling(Object): Returns the least element ≥ Object.
– headSet(Object): Returns the subset of elements < Object.
– tailSet(Object): Returns the subset of elements ≥ Object.
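A quick demonstration of these navigation methods on Java's TreeSet (which implements NavigableSet):

```java
import java.util.Arrays;
import java.util.TreeSet;

public class NavigationDemo {
    public static void main(String[] args) {
        TreeSet<Integer> s = new TreeSet<>(Arrays.asList(1, 3, 5, 7, 9));
        System.out.println(s.higher(5));   // 7  (least element > 5)
        System.out.println(s.lower(5));    // 3  (greatest element < 5)
        System.out.println(s.floor(6));    // 5  (greatest element <= 6)
        System.out.println(s.ceiling(6));  // 7  (least element >= 6)
        System.out.println(s.headSet(5));  // [1, 3]     (elements < 5)
        System.out.println(s.tailSet(5));  // [5, 7, 9]  (elements >= 5)
    }
}
```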
Special Set Operations
• Sets are usually represented as trees.
• The union algorithm merges two sets by attaching the smaller
set/tree to the larger tree. This is determined by the set’s
rank.
• The rank of the set, also defined as the Horton-Strahler
number of its tree, is as follows:
– A set with one element has a rank of 0.
– The result of a union between two sets of the same rank is a set of
rank r+1.
• The optimal find algorithm utilizes a strategy called path
compression, which traverses up the tree and makes each
node it encounters a child of the root. The amortized running
time of this approach is O(α(n)), where α(n) is the inverse
Ackermann function. This function grows extraordinarily
slowly, so it effectively runs in amortized constant time.
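A minimal sketch of such a disjoint-set structure, with union by rank and path compression (class and method names are our own):

```java
// Disjoint-set (union-find) over elements 0..n-1.
public class DisjointSet {
    private final int[] parent, rank;

    public DisjointSet(int n) {
        parent = new int[n];
        rank = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i; // each element is its own set
    }

    public int find(int x) {
        if (parent[x] != x)
            parent[x] = find(parent[x]); // path compression: repoint at the root
        return parent[x];
    }

    public void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (rank[ra] < rank[rb]) parent[ra] = rb;       // attach lower-rank tree
        else if (rank[ra] > rank[rb]) parent[rb] = ra;
        else { parent[rb] = ra; rank[ra]++; }           // equal ranks: rank grows by 1
    }
}
```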
Recurrences, part 2.
• If we have a recurrence of the form T(n) =
a*T(n/b) + f(n), the solution can be found
using the Master Method.
• However, what if we have something like T(n)
= T(n/3) + T(n/4) + O(1)?
• Then we need to use a different method.
Solving Complex Recurrences
• Isolate the recursive terms:
– T(n) = T(n/3) + T(n/4).
– This part of the recurrence is called the homogeneous recurrence.
• Guess a general-form solution for this part of the recurrence.
– Generally, these recurrences will be polynomial, so guess c*n^a.
• Plug your guess into the homogeneous recurrence:
– c*n^a = c*(n/3)^a + c*(n/4)^a.
• Solve for a (or at least get a bound on a):
– c*n^a = c*n^a/3^a + c*n^a/4^a.
– 1 = 1/3^a + 1/4^a.
– Does a = 1 work? 1 > 1/3 + 1/4, so it’s too high.
– a = 0.5 is too small, but close.
– a ≈ 0.56 works.
• So the solution is O(n^0.56).
• If the solution were of the same order as the driving function, we
would still need to multiply by log n.
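The exponent can also be found numerically; a sketch (names are ours) that bisects on 1 = 1/3^a + 1/4^a:

```java
// Finding the exponent a with 1/3^a + 1/4^a = 1 by bisection.
public class ExponentSolve {
    static double f(double a) {
        // Decreasing in a; zero exactly at the exponent we want.
        return Math.pow(3, -a) + Math.pow(4, -a) - 1;
    }

    public static double solve() {
        double lo = 0, hi = 1; // f(0) = 1 > 0, f(1) = 1/3 + 1/4 - 1 < 0
        for (int iter = 0; iter < 60; iter++) {
            double mid = (lo + hi) / 2;
            if (f(mid) > 0) lo = mid; else hi = mid; // root is above mid iff f(mid) > 0
        }
        return (lo + hi) / 2;
    }

    public static void main(String[] args) {
        System.out.println(solve()); // roughly 0.56
    }
}
```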
Performance on a Sequence
• We covered amortized analysis and sets today,
plus a bit on recurrences.
• Next time, we will discuss graphs – the root
data structure from which most others derive.
• We will also have a somewhat theoretical
assignment on analyzing the performance of
hashes next time.
• The lesson:
– Plan for the future. Plan your current actions to
make your future efforts easier.