
Final Exam: 15-853 Algorithm in the real and virtual world

Name: Qi He
Email: [email protected]

Problem 1: (Compression: move to front)

A) The initial symbol order, the probability model, and the compressed data, together with the length of the uncompressed message, need to be sent.

B) Let the initial symbol order be {A, B, C}. If ABCAB repeats over and over, the move-to-front heuristic produces the output below. Each round continues from the order left by the previous round, and the order listed under a symbol is the order after that symbol has been moved to the front.

Round   Message      Ordered symbols (initially ABC)   Output
1       A B C A B    ABC BAC CBA ACB BAC               0 1 2 2 2
2       A B C A B    ABC BAC CBA ACB BAC               1 1 2 2 2
3       A B C A B    ABC BAC CBA ACB BAC               1 1 2 2 2
4       A B C A B    ABC BAC CBA ACB BAC               1 1 2 2 2

Table 1. Converting the repeated message "ABCAB" to numbers for compression.

According to Table 1, after the first repetition of ABCAB the generated numbers fall into the pattern "11222". Asymptotically, the probability that 1 appears is 0.4 (= 2/5) and the probability that 2 appears is 0.6 (= 3/5). Their self-information is log2(1/0.4) ≈ 1.32 bits and log2(1/0.6) ≈ 0.74 bits, respectively. The number of bits per symbol used to store the string is therefore:

(2/5) log2(1/0.4) + (3/5) log2(1/0.6) ≈ 0.97

C) For the repeated message "ABCDE" with initial order ABCDE:

Round   Message      Ordered symbols (initially ABCDE)    Output
1       A B C D E    ABCDE BACDE CBADE DCBAE EDCBA        0 1 2 3 4
2       A B C D E    AEDCB BAEDC CBAED DCBAE EDCBA        4 4 4 4 4
3       A B C D E    AEDCB BAEDC CBAED DCBAE EDCBA        4 4 4 4 4

Table 2. Converting the repeated message "ABCDE" to numbers for compression.

According to Table 2, asymptotically the probability that 4 appears is 1.0. Similar to B), the number of bits per symbol that the move-to-front heuristic would use to store the string is:

1 × log2(1/1) = 0

C) Yes, the move-to-front heuristic can adapt well to changes in types of text within a single "document". The main reason is that a single "document" usually has one concentrated topic, so the related words appear more frequently and stay near the head of the symbol list (move-to-front), which makes the output numbers fall into a pattern with high probability.
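As a rough check of Tables 1 and 2, the sketch below applies move-to-front to the two repeated messages and measures the zeroth-order empirical entropy of the output after the first round. The function names `move_to_front` and `entropy_bits` are illustrative choices, not part of the exam, and the entropy is taken under the same simple model used above (independent output numbers).

```python
from collections import Counter
from math import log2


def move_to_front(message, alphabet):
    """Return the move-to-front indices of message, starting from alphabet order."""
    order = list(alphabet)
    out = []
    for sym in message:
        idx = order.index(sym)           # position of the symbol in the current list
        out.append(idx)
        order.insert(0, order.pop(idx))  # move the symbol to the front
    return out


def entropy_bits(values):
    """Zeroth-order empirical entropy, in bits per value, of a sequence."""
    counts = Counter(values)
    total = len(values)
    return sum(c / total * log2(total / c) for c in counts.values())


if __name__ == "__main__":
    print(move_to_front("ABCAB" * 2, "ABC"))
    # Drop the first round so that only the steady-state pattern is measured.
    steady_abc = move_to_front("ABCAB" * 1000, "ABC")[5:]
    steady_abcde = move_to_front("ABCDE" * 1000, "ABCDE")[5:]
    print(round(entropy_bits(steady_abc), 2), entropy_bits(steady_abcde))
```

Dropping the first five outputs mirrors the "asymptotically" argument: the initial 0 and 1 2 outputs of round one do not recur.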
Because the output numbers concentrate on a few values, the entropy is reduced.

D) Yes, coding pairs of symbols can help get better compression for long messages, assuming we are compressing the English language. The main reason is that in English many pairs of letters appear together with very high probability, such as "er", "or", "th", etc. Coding pairs shortens the sequence of numbers to be compressed and increases the information carried by each coded symbol (due to the property of the move-to-front method that the most frequent symbols stay near the front of the symbol list). For example, assume the pair "ab" appears very often in the language, e.g. three times in the message "abxyabyab". If we treat "a" and "b" separately:

Message symbol   Ordered symbols   Output
(start)          abxy
a                abxy              0
b                baxy              1
x                xbay              2
y                yxba              3
a                ayxb              3
b                bayx              3
y                ybax              2
a                aybx              2
b                bayx              2

Table 3. Treating the symbols of the pair separately.

The sequence of numbers to be compressed has length 9, and the probabilities are more evenly distributed. If we code "ab" together, the list holds the three symbols ab, x and y:

Message symbol   Ordered symbols   Output
(start)          ab x y
ab               ab x y            0
x                x ab y            1
y                y x ab            2
ab               ab y x            2
y                y ab x            1
ab               ab y x            1

Table 4. Coding the symbols of the pair together.

The sequence is then shortened to 6 numbers and the probabilities are less evenly distributed, which reduces the entropy and makes the compression more efficient.

Problem 2: (Cryptography: McEliece Cryptosystem)

A) Public key: G' = SGP, where G is a k×n generator matrix of a Goppa code, P is an n×n permutation matrix, and S is a k×k nonsingular matrix. The value t is also public.

Private key: { S, G, P }

Encryption: c = mG' + e, where e is a random n-bit vector of weight t.

Decryption:
1. Compute c' = cP^-1 = mSG + eP^-1. Note: since P is a permutation matrix, eP^-1 also has weight t, so d_H(mSG, mSG + eP^-1) ≤ t [1].
2. Using the decoding algorithm for the Goppa code, find m' such that d_H(m'G, c') ≤ t.
3. Finally, compute m = m'S^-1.
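The three decryption steps can be traced on a toy instance. The sketch below swaps the Goppa code for the Hamming(7,4) code, which corrects t = 1 error, purely because it is small enough to write out; it is not secure, and the particular S, the permutation `perm`, and all function names are hypothetical choices made for illustration.

```python
# Toy McEliece over GF(2) with Hamming(7,4) standing in for a Goppa code (t = 1).
A = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
G = [[int(i == j) for j in range(4)] + A[i] for i in range(4)]      # G = [I | A]
H = [[A[i][r] for i in range(4)] + [int(r == j) for j in range(3)]  # H = [A^T | I]
     for r in range(3)]


def vec_mat(v, M):
    """Row vector times matrix over GF(2)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) % 2 for j in range(len(M[0]))]


def mat_mul(X, Y):
    return [vec_mat(row, Y) for row in X]


def inverse(M):
    """Invert a nonsingular square matrix over GF(2) by Gauss-Jordan elimination."""
    n = len(M)
    aug = [list(M[i]) + [int(i == j) for j in range(n)] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [a ^ b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Private key (hypothetical values): nonsingular S and permutation matrix P.
S = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
perm = [3, 0, 6, 1, 5, 2, 4]
P = [[int(perm[j] == i) for j in range(7)] for i in range(7)]
S_inv = inverse(S)
P_inv = [list(col) for col in zip(*P)]   # inverse of a permutation matrix is its transpose
G_pub = mat_mul(mat_mul(S, G), P)        # public key G' = S G P


def encrypt(m, error_pos):
    """c = m G' + e, where e has weight 1 with its single 1 at error_pos."""
    c = vec_mat(m, G_pub)
    c[error_pos] ^= 1
    return c


def decrypt(c):
    c1 = vec_mat(c, P_inv)               # step 1: c' = c P^-1 = mSG + e P^-1
    syndrome = [sum(h * x for h, x in zip(row, c1)) % 2 for row in H]
    if any(syndrome):                    # step 2: syndrome equals the column of H at the error
        columns = [[H[r][j] for r in range(3)] for j in range(7)]
        c1[columns.index(syndrome)] ^= 1
    m1 = c1[:4]                          # G is systematic, so m' = mS is the first 4 bits
    return vec_mat(m1, S_inv)            # step 3: m = m' S^-1


if __name__ == "__main__":
    message = [1, 0, 1, 1]
    cipher = encrypt(message, error_pos=2)
    print(cipher, decrypt(cipher))
```

A real instance differs mainly in step 2, where the Goppa decoding algorithm replaces the syndrome lookup.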
B) Recall the encryption operation c = mG' + e. Moving e to the left side (the arithmetic is over GF(2)) gives c + e = mG', namely

cP^-1 + eP^-1 = mSG    (i)

Below is the procedure of a brute-force chosen-plaintext attack:
1. Choose a plaintext m and a vector e of weight t, and compute the ciphertext c by the encryption operation c = mG' + e.
2. For each possible permutation P', compute P'^-1 and test whether cP'^-1 + eP'^-1 is a legitimate codeword. When it is, P'^-1 = P^-1 has been obtained, and with it SG, since SG = G'P^-1.
3. Form the matrix (I, A), where A = SG. Multiply it by elementary matrices E1, …, El to transform (I, A) into the form (S', I, H), where S' and I are k×k and H is k×(n-k). The accumulated row operations satisfy S'·SG = (I, H), so when G is in the systematic form (I, H) we have S' = S^-1, and S follows by inverting S'.

Based on this pre-processing (the brute-force chosen-plaintext attack), Mallory can decode messages in the same way as the intended recipient, as described in A). The complexity is very high: the dominant cost is in step 2, O(n! x ct), where c is a constant.

[1] d_H(x, y): the Hamming distance between x and y.

C) Again, recall the encryption operation c = mG' + e. We may restrict it to k coordinates:

c_k = mG'_k + e_k

where c_k denotes any k components of c, e_k denotes the corresponding k components of e, and G'_k is the square matrix consisting of columns i1, i2, …, ik of G'. Thus we get:

c_k + e_k = mG'_k

If the k components of e_k are all 0, the equation above reduces to

c_k = mG'_k

Thus Mallory can recover the sender's message without decoding, by solving this linear system whenever G'_k is nonsingular (c and G' are known). If the error vector e is an n-bit vector with t 1s and n - t 0s, the probability of choosing (without replacement) k components of e that are all 0 is:

$P = \binom{n-t}{k} \Big/ \binom{n}{k}$

Consider a [1024, 524, 101] Goppa code, which can correct up to 50 errors, and assume it is used to correct up to 10 bit errors as well as for encryption. Then e will have 40 1s and 1024 - 40 = 984 0s.
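To put numbers on P for the [1024, 524, 101] example, the following sketch evaluates the formula for t = 40 and t = 50. The function name `prob_error_free` is an illustrative choice; the second function is only a cross-check that rewrites the same ratio as a product of t factors.

```python
from math import comb, prod


def prob_error_free(n, k, t):
    """Probability that k positions drawn without replacement avoid all t error positions."""
    return comb(n - t, k) / comb(n, k)


def prob_error_free_product(n, k, t):
    """Same quantity written as a product: each error position must miss the k chosen ones."""
    return prod((n - k - i) / (n - i) for i in range(t))


if __name__ == "__main__":
    p40 = prob_error_free(1024, 524, 40)
    p50 = prob_error_free(1024, 524, 50)
    print(p40, p50, p40 / p50)
```

Both probabilities are tiny, but the ratio between them shows how much each error bit given up to channel correction costs in security.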
If no channel error actually occurs, the ciphertext carries only these 40 error bits, and the probability that m can be recovered by the method described above is much higher than when e has 50 1s. Therefore this seriously reduces the security strength.

D) Advantages: McEliece is two to three orders of magnitude faster than RSA.

Disadvantages:
1. The data expansion is large: the ciphertext is twice as long as the plaintext.
2. The public key is large.
3. It possibly has a certain similarity to the Knapsack Cryptosystem, because both use the same design principle: decryption is an easy special case of an NP-complete problem, disguised so that it looks like a general instance of the problem.

Problem 3: (Linear Programming: Minimum-cost network flow)

A) (Proof by contradiction) Assume the edges corresponding to a set of basis variables in the simplex method for the matrix A include a cycle.

1. If there is a cycle, there must be a smallest cycle in which no vertex appears more than once. Thus we only need to handle the case of such a smallest cycle.
2. As we know, each column of A has exactly two nonzero entries, 1 and -1 (if one is 1, the other must be -1).
3. By the definition of a cycle, namely a closed chain, every row of the sub-matrix formed by the basis columns on the cycle that has a nonzero entry has exactly two nonzero entries, each 1 or -1.
4. If the edges in the chain all point in the same direction, the two nonzero entries in each such row are 1 and -1. Thus the sum of the columns of the sub-matrix equals 0, which means that these columns, a subset of the basis columns in the simplex method, are linearly dependent. This is a contradiction. Therefore, the edges corresponding to a set of basis variables in the simplex method for the matrix A cannot include a cycle.
5. If the edges in the chain do not all point in the same direction, then some row contains two 1s or two -1s.
In this case, choose -1 as the coefficient of each column (basis variable) whose edge points against the direction in which the cycle is traversed, and 1 for the others. By the property of a cycle, the algebraic sum of the columns of the sub-matrix again equals 0, which again gives a contradiction. Therefore, the edges corresponding to a set of basis variables in the simplex method for the matrix A cannot include a cycle.

B) No. If there were two distinct paths with nonzero flow between a source (positive b) and a sink (negative b), the two paths together would contain a cycle, leaving the source along one path and returning to it along the other, which contradicts A).

C) In terms of the graph, one step of the simplex method corresponds to one step of generating a minimum spanning tree: add an edge (with a smaller weight) to the tree and delete one edge (with a bigger weight) from the tree, so that no cycle is formed.

Problem 4: (Integer Programming: Graph Coloring)

Let G be a graph.
V = { vi | vi ∈ G } is the set of vertices in G.
E = { eij | <vi, vj> ∈ G } is the set of edges in G.
C = { ci | i = 1, 2, …, n }, with n ≤ |V|, is the set of colors.

The variables are:

$x_{ij} = \begin{cases} 1 & \text{if } v_i \text{ is labeled with the } j\text{th color} \\ 0 & \text{otherwise} \end{cases} \qquad 1 \le i \le |V|,\ 1 \le j \le |C|$    (1)

Every vertex is labeled with exactly one color:

$\sum_{j=1}^{n} x_{ij} = 1, \qquad 1 \le i \le |V|$    (2)

No two neighboring vertices have the same color:

$x_{ij} + x_{kj} \le 1 \ \text{ if } e_{ik} \in E, \qquad 1 \le i \le |V|,\ 1 \le j \le n$    (3)

Thus the "Graph Coloring" problem is formulated as an "Integer Program": given a graph G, find the smallest integer n for which the constraints (1), (2), (3) can be satisfied.

Let IP(xij, n) be an algorithm for "Integer Program" that solves the problem given by (1), (2), and (3). Since a coloring with n colors is also a coloring with n + 1 colors, feasibility is monotone in n, and the minimum set of colors for G can be found with IP by binary search as follows:

Graph-Coloring-Algorithm (G) {
    lo = 1; hi = |V|;                // |V| colors always suffice
    while (lo < hi) {
        n = floor((lo + hi) / 2);
        build (1), (2), (3) based on G and n;
        if IP(xij, n) has a feasible solution, then hi = n;
        else lo = n + 1;
    }
    return the solution xij of IP(xij, lo), 1 ≤ i ≤ |V|, 1 ≤ j ≤ lo;
}

Complexity

Assume the complexity of algorithm IP is O(F(|V|)). The algorithm calls IP O(log2 |V|) times, so the complexity of Graph-Coloring-Algorithm is O(F(|V|) log2 |V|).
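The search over the number of colors can be sketched as follows. No integer-programming solver is assumed here: a small backtracking routine, `colorable`, stands in for IP(xij, n) and returns an assignment satisfying constraints (2) and (3) or `None`. The graph is given as a hypothetical adjacency list `adj`, and the search is a plain binary search on n, which relies on feasibility being monotone in n.

```python
def colorable(adj, n):
    """Stand-in for IP(xij, n): a list color[v] in 0..n-1 with no equal neighbors, or None."""
    color = [None] * len(adj)

    def place(v):
        if v == len(adj):
            return True
        for c in range(n):
            # Constraint (3): no neighbor of v may already carry color c.
            if all(color[u] != c for u in adj[v]):
                color[v] = c
                if place(v + 1):
                    return True
        color[v] = None
        return False

    return color if place(0) else None


def min_coloring(adj):
    """Binary search for the smallest n with a feasible coloring."""
    lo, hi = 1, max(1, len(adj))
    best = colorable(adj, hi)          # |V| colors always suffice
    while lo < hi:
        mid = (lo + hi) // 2
        solution = colorable(adj, mid)
        if solution is not None:
            hi, best = mid, solution
        else:
            lo = mid + 1
    return lo, best


if __name__ == "__main__":
    triangle = [[1, 2], [0, 2], [0, 1]]
    square = [[1, 3], [0, 2], [1, 3], [0, 2]]
    print(min_coloring(triangle), min_coloring(square))
```

The backtracking stand-in takes exponential time in the worst case, as expected for an NP-hard problem; only the outer loop reflects the logarithmic number of calls.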
The number of equations is O(|V| + |V||E|): |V| of them come from constraints (2), and |V||E| of them come from constraints (3). The number of variables is O(|V|^2).

Problem 5: (CG Islands)

Hidden Markov Model for the CG Islands problem:

Σ = { A, C, G, T }.
Q = { A+, A-, C+, C-, G+, G-, T+, T- }.
A = (Pij), a |Q|×|Q| matrix, where Pij specifies the probability of transitioning from state i to state j, i, j ∈ Q.
e(q, σ) is a function Q×Σ → {0, 1}, where q ∈ Q and σ ∈ Σ:

       A+  A-  C+  C-  G+  G-  T+  T-
A      1   1   0   0   0   0   0   0
C      0   0   1   1   0   0   0   0
G      0   0   0   0   1   1   0   0
T      0   0   0   0   0   0   1   1

A path π = q1, q2, …, qn is a sequence of states, qi ∈ Q. The probability that a sequence x = x1, …, xn was generated by the path is:

$P(x \mid \pi) = \prod_{i=1}^{n} e(q_i, x_i) \cdot P_{q_i, q_{i+1}}$

where e(qi, xi) is the emission probability P(xi | qi) and the transition factor is taken as 1 for i = n.

Problem: find an optimal path:

$\pi^* = \arg\max_{\pi} P(x \mid \pi)$

Define sq(i) as the probability of the most probable path for the prefix x1, …, xi that ends with state q, where q ∈ Q and 1 ≤ i ≤ n - 1. Then the recurrence relation is:

$s_q(i+1) = \max_{k \in Q} \{ e(q, x_{i+1}) \cdot s_k(i) \cdot P_{kq} \}$

Viterbi Algorithm for CG Islands

sq(1) = Pq e(q, x1), where q ∈ Q and Pq is the initial probability of state q;
πq(1) = q, where q ∈ Q;
for i = 1 to n-1 {
    for q ∈ Q {
        sq(i+1) = max over k ∈ Q of { e(q, xi+1) x sk(i) x Pkq };
        k* = arg max over k ∈ Q of { e(q, xi+1) x sk(i) x Pkq };
        πq(i+1) = πk*(i) || q;
    }
}
q* = arg max over q ∈ Q of { sq(n) };
return πq*(n);

Complexity: for each of the n positions and each of the |Q| states, the algorithm takes a maximum over |Q| predecessors, so the Viterbi algorithm runs in O(n|Q|^2) time. Since |Q| = 8 is a constant, this algorithm has linear complexity, O(n).

Numerical problem: this algorithm suffers from underflow, because it multiplies many probabilities. To avoid the problem, we can run the algorithm on logarithmic scores Sq(i) = log sq(i). The recurrence relation is converted to:

$S_q(i+1) = \max_{k \in Q} \{ \log e(q, x_{i+1}) + S_k(i) + \log P_{kq} \}$

with log 0 taken as -∞.

Problem 6: (A String Alignment Problem)

The algorithm proceeds by aligning S[1]…S[i] with T[1]…T[j]. For these prefixes of S and T, define the following variables:
1.
U(i, j) is the value of an optimal alignment of S[1]…S[i] and T[1]…T[j].
2. V(i, j) is the value of an optimal alignment of S[1]…S[i] and T[1]…T[j] whose last pair matches S[i] with T[j].
3. W(i, j) is the value of an optimal alignment of S[1]…S[i] and T[1]…T[j] whose last pair matches S[i] with a space.
4. X(i, j) is the value of an optimal alignment of S[1]…S[i] and T[1]…T[j] whose last pair matches a space with T[j].

Basis:
V(i, 0) = i
V(0, j) = j
W(0, j) = -∞
X(i, 0) = -∞

Recurrence: for i > 0 and j > 0,

U(i, j) = max { V(i, j), W(i, j), X(i, j) }

V(i, j) = max { V(i-1, j-1) + σ(S[i], T[j]),
                W(i-1, j-1) + σ(S[i], T[j]) + ,
                X(i-1, j-1) + σ(S[i], T[j]) +  }

W(i, j) = max { V(i-1, j) + σ(S[i], -) + ,
                W(i-1, j) + σ(S[i], -) +  } [2]

X(i, j) = max { V(i, j-1) + σ(-, T[j]) + ,
                X(i, j-1) + σ(-, T[j]) +  } [2]

The algorithm based on the recurrence has two nested loops: one over i (0 ≤ i ≤ n) and one over j (0 ≤ j ≤ m). Thus the complexity of the algorithm is O(mn).

Problem 7: (Link Queries)

A)
1. Generate the link-by-doc matrix A = (aij), where aij = 1 if URL i is pointed to by web page j.
2. Compute SVD(A), generating U, Σ, and V.
3. Retain only the first k terms of U, Σ, and V, and thereby reconstruct Ak.

The truncated SVD Ak describes the documents in k-d space rather than in the n-d space of A. Since k is much smaller than the number of URLs, Ak ignores minor differences in URLs. Similarly, a doc-by-doc matrix is generated and tracked as below.

B)
1. Let S be a query. Generate a vector q = (qi), where qi is the number of pages in S that point to URL i. This vector is transformed into a vector q' in k-d space according to the equation

q' = q^T Uk Σk

The projected query vector q' is compared with the document vectors, and the documents are ranked by their proximity to q'. Use the cosine as the measure of nearness, and return a document set S' whose cosine with the query documents exceeds some threshold.
Calculate the frequency of the URLs in S and S', and rank them according to frequency. Return the most frequently pointed-to URLs.

C) O(kn)

D) Yes. The procedure is as follows: given S, generate a query vector Q = (qi), where qi is the URL of a page in S. Project the query vector into k-d space to get q'. Then compare q' with the link vectors (columns of Ak) and rank them by their proximity to q'. Use the cosine as the measure of nearness, and return a document set T whose cosine with the projected query vector exceeds some threshold, which means we have found a set of documents, T, pointing to the given documents, S. The preprocessing can be used as before.

E)
1. Compute T1, a set of documents that are pointed to by S or S' (the set of pages similar to S); see also B.
2. Compute T2 (⊆ T1) [3], a set of documents that point to S or S'; see also D.
3. Replace the elements aij ∈ {0, 1} of the matrix A with nij (∈ Z), the number of terms that the page represented by URL i shares with document j.

Then T is the set of documents that are similar to the given set S in terms of a fixed linear combination of outgoing links, incoming links, and words in the document. Because of the replacement of the elements of A, you can set this up using a single SVD.

[2] Besides the charge for each space in a gap, we also add a penalty (or negative cost) for matching a single character with a space, since that penalty may not equal zero. This penalty could be omitted, as in the class and lecture notes.

[3] This constraint is added to reduce the computing load.
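The ranking step shared by B) and D) can be sketched as follows, assuming the query and the documents have already been projected into k-d space. The SVD itself is not shown, and the names `cosine`, `similar_documents`, and the sample vectors are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_documents(q_proj, doc_vectors, threshold):
    """Return (document id, cosine) pairs whose cosine exceeds threshold, best first."""
    scored = [(doc, cosine(q_proj, vec)) for doc, vec in doc_vectors.items()]
    return sorted((pair for pair in scored if pair[1] > threshold),
                  key=lambda pair: -pair[1])


if __name__ == "__main__":
    # Hypothetical 2-d projections of three documents and one query.
    docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
    print(similar_documents([1.0, 0.0], docs, threshold=0.5))
```

The threshold is a tuning parameter: a lower value returns a larger set S' or T at the price of weaker similarity.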
