Final Exam: 15-853 Algorithm in the real and virtual world
Name: Qi He
Email: [email protected]
Problem 1: (Compression: move to front)
A) The initial symbol order, the probability model, and the compressed data, along with
the length of the uncompressed message, need to be sent.
B) Let the ordered symbols be {A, B, C}. If ABCAB repeats over and over, then according
to the move-to-front heuristic the output during compression is as follows:
message   ordered symbols (after move)   output
(start)   ABC
A         ABC                            0
B         BAC                            1
C         CBA                            2
A         ACB                            2
B         BAC                            2
A         ABC                            1
B         BAC                            1
C         CBA                            2
A         ACB                            2
B         BAC                            2
A         ABC                            1
B         BAC                            1
C         CBA                            2
A         ACB                            2
B         BAC                            2
A         ABC                            1
B         BAC                            1
C         CBA                            2
A         ACB                            2
B         BAC                            2
Table 1. Converting the repeated message “ABCAB” to numbers for compression
According to Table 1, after the first round of the repetition of ABCAB the generated
numbers fall into the repeating pattern “11222”. Asymptotically, the probability that 1
appears is 0.4 (= 2/5) and the probability that 2 appears is 0.6 (= 3/5). The
corresponding weighted self-information terms are 0.4 log2(1/0.4) and 0.6 log2(1/0.6).
The number of bits per symbol used to store the string is:
(2/5) log2(1/0.4) + (3/5) log2(1/0.6) = 0.97.
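The numbers above can be checked with a short script. The following is a minimal Python sketch and not part of the original answer; the helper names `move_to_front` and `entropy_bits` are illustrative, and the entropy is the zeroth-order entropy of the output stream once the first round has been skipped.

```python
import math
from collections import Counter

def move_to_front(message, alphabet):
    """Encode each symbol as its current list position, then move it to the front."""
    order = list(alphabet)
    output = []
    for symbol in message:
        index = order.index(symbol)
        output.append(index)
        order.insert(0, order.pop(index))
    return output

def entropy_bits(values):
    """Zeroth-order entropy of a sequence, in bits per symbol."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

codes = move_to_front("ABCAB" * 4, "ABC")
print(codes)  # [0, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2]

# Drop the first round so that only the repeating pattern "11222" remains.
steady_state = move_to_front("ABCAB" * 200, "ABC")[5:]
print(round(entropy_bits(steady_state), 2))  # 0.97
```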
C)
message   ordered symbols (after move)   output
(start)   ABCDE
A         ABCDE                          0
B         BACDE                          1
C         CBADE                          2
D         DCBAE                          3
E         EDCBA                          4
A         AEDCB                          4
B         BAEDC                          4
C         CBAED                          4
D         DCBAE                          4
E         EDCBA                          4
A         AEDCB                          4
B         BAEDC                          4
C         CBAED                          4
D         DCBAE                          4
E         EDCBA                          4
Table 2. Converting the repeated message “ABCDE” to numbers for compression
According to Table 2, asymptotically the probability that 4 appears is 1.0. As in B),
the number of bits per symbol that the move-to-front heuristic uses to store the string is:
1 · log2(1/1) = 0
C) Yes, the move-to-front heuristic can adapt well to changes in the type of text within a
single “document”. The main reason is that a single “document” usually has one
concentrated topic, so the related words appear more frequently and stay
near the head of the symbol list (move-to-front), which makes the
output numbers fall into a pattern with high probability. Therefore the entropy
is reduced.
D) Yes, coding pairs of symbols can give better compression for long messages,
assuming we are compressing the English language. The main reason is that in English
many pairs of letters appear together with very high probability, such as
“er”, “or”, “th”, etc. Coding pairs shortens the sequence of numbers to be compressed
and increases the information carried by each coded symbol (because
move-to-front keeps the most frequent symbols near
the front of the symbol list). For example, assume the pair
“ab” appears very often in the language, e.g. three times in the message
“abxyabyab”. If we treat “a” and “b” separately:
message   ordered symbols (after move)   output
(start)   abxy
a         abxy                           0
b         baxy                           1
x         xbay                           2
y         yxba                           3
a         ayxb                           3
b         bayx                           3
y         ybax                           2
a         aybx                           2
b         bayx                           2
Table 3. Treating the symbols of a pair separately.
The length of the number sequence to be compressed is 9, and the probabilities are more
evenly distributed. If we code “ab” together:
message   ordered symbols (after move)   output
(start)   ab x y
ab        ab x y                         0
x         x ab y                         1
y         y x ab                         2
ab        ab y x                         2
y         y ab x                         1
ab        ab y x                         1
Table 4. Coding the symbols of a pair together.
Then the length of the number sequence is shortened to 6 and the probabilities are less
evenly distributed, which reduces the entropy and makes the compression more efficient.
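The comparison between coding “a” and “b” separately and coding “ab” as one symbol can be reproduced with a small experiment. This is a hedged Python sketch: `tokenize` is a hypothetical greedy splitter introduced only for illustration, and the pair “ab” is assumed to be known to both sender and receiver.

```python
def move_to_front(tokens, alphabet):
    """Encode each token as its current list position, then move it to the front."""
    order = list(alphabet)
    output = []
    for token in tokens:
        index = order.index(token)
        output.append(index)
        order.insert(0, order.pop(index))
    return output

def tokenize(message, pair):
    """Greedily split message into tokens, treating `pair` as a single symbol."""
    tokens, i = [], 0
    while i < len(message):
        if message.startswith(pair, i):
            tokens.append(pair)
            i += len(pair)
        else:
            tokens.append(message[i])
            i += 1
    return tokens

message = "abxyabyab"
separate = move_to_front(list(message), ["a", "b", "x", "y"])
paired = move_to_front(tokenize(message, "ab"), ["ab", "x", "y"])
print(len(separate), separate)  # nine codes, one per character
print(len(paired), paired)      # six codes, one per token
```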
Problem 2: (Cryptography: McEliece Cryptosystem)
A)
Public Key: $G' = S_{k \times k} G_{k \times n} P_{n \times n}$, where G is a generator matrix of a Goppa code, P is a
permutation matrix, and S is a nonsingular matrix. Also t is public.
Private Key: { $S_{k \times k}$, $G_{k \times n}$, $P_{n \times n}$ }
Encryption: $c = mG' + e$, where e is a random n-bit vector of weight t.
Decryption:
1. Compute: $c' = cP^{-1} = mSG + eP^{-1}$
Note: since P is a permutation matrix, $d_H(mSG,\ mSG + eP^{-1}) \le t$ (see footnote 1).
2. Using the decoding algorithm for the Goppa code, find m’ such that $d_H(m'G,\ c') \le t$.
3. Finally, compute $m = m'S^{-1}$.
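The three decryption steps can be traced on a toy instance. The sketch below is only an illustration under loud assumptions: it swaps the Goppa code for a Hamming [7,4] code (so t = 1), and the matrices `S` and `perm` are arbitrary hypothetical choices, not values from the exam. It offers no security whatsoever.

```python
def mat_mul(A, B):
    """Multiply two matrices over GF(2)."""
    return [[sum(a & b for a, b in zip(row, col)) % 2 for col in zip(*B)]
            for row in A]

def gf2_inverse(M):
    """Invert a square GF(2) matrix by Gauss-Jordan elimination."""
    n = len(M)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [x ^ y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Assumption: Hamming [7,4] code stands in for the Goppa code.
# Systematic form G = (I | A), parity check H = (A^T | I); corrects t = 1 error.
A = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
G = [[int(i == j) for j in range(4)] + A[i] for i in range(4)]
H = [[A[i][r] for i in range(4)] + [int(r == j) for j in range(3)] for r in range(3)]

S = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]  # hypothetical nonsingular S
perm = [2, 0, 3, 6, 1, 5, 4]                                  # hypothetical permutation
P = [[int(perm[i] == j) for j in range(7)] for i in range(7)]
P_inv = [list(col) for col in zip(*P)]                        # P^-1 = P^T
S_inv = gf2_inverse(S)
G_pub = mat_mul(mat_mul(S, G), P)                             # public key G' = SGP

def encrypt(m, e):
    """c = mG' + e over GF(2), with e of weight t = 1."""
    return [x ^ y for x, y in zip(mat_mul([m], G_pub)[0], e)]

def decrypt(c):
    c1 = mat_mul([c], P_inv)[0]                               # step 1: c' = cP^-1
    syndrome = [sum(h & x for h, x in zip(row, c1)) % 2 for row in H]
    if any(syndrome):                                         # step 2: correct the single error
        columns = [list(col) for col in zip(*H)]
        c1[columns.index(syndrome)] ^= 1
    m1 = c1[:4]                                               # systematic code, so m' = mS
    return mat_mul([m1], S_inv)[0]                            # step 3: m = m'S^-1

m = [1, 0, 1, 1]
e = [0, 0, 0, 0, 1, 0, 0]
print(decrypt(encrypt(m, e)) == m)  # True
```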
B)
Recall the encryption operation $c = mG' + e$. Moving e to the left side (addition is over GF(2)) gives:
$c + e = mG'$
namely,
$cP^{-1} + eP^{-1} = mSG$ ……………………………………………( i )
Below is the procedure of a brute-force chosen-plaintext attack:
1. Choose a plaintext m and a vector e of weight t, and compute the ciphertext c by the
encryption operation $c = mG' + e$.
2. For each possible permutation P’, compute $P'^{-1}$ and try the candidate codeword $cP'^{-1} + eP'^{-1}$
until the candidate obtained is a legitimate codeword. At that point $P'^{-1} = P^{-1}$
has been found. Accordingly, we get SG, since $SG = G'P^{-1}$.
3. Construct the matrix (I, A) from I and SG, where A = SG. Multiply it by
elementary matrices E1, …, El to transform $(I_{k \times k}, A)$ into the form:
$(S'_{k \times k},\ I_{k \times k},\ H_{k \times (n-k)})$
Since the same row operations that reduce SG to (I, H) are applied to I, this gives
$S' = S^{-1}$ (taking G = (I, H) in systematic form), and S follows by inversion.
Based on this preprocessing (the brute-force chosen-plaintext attack), Mallory can
decode messages in the same way as the intended recipient of the message does,
as described in A).
The complexity is very high: the dominant cost is in step 2, O(n! x ct), where c is a constant.
Footnote 1: $d_H(x, y)$ denotes the Hamming distance between x and y, here between mSG and $mSG + eP^{-1}$.
C)
Again, recall the encryption operation $c = mG' + e$. Restricting it to k coordinates gives
$c_k = m G'_k + e_k$
where $c_k$ denotes any k components of c, $e_k$ denotes the corresponding k components of e, and
$G'_k$ is the square matrix consisting of columns i1, i2, …, ik of G’. Thus we get:
$c_k + e_k = m G'_k$
If the k components of $e_k$ are all “0”s, the equation above reduces to
$c_k = m G'_k$
Thus, if $G'_k$ is invertible, Mallory can recover the sender’s message as $m = c_k (G'_k)^{-1}$ without decoding (since c and G’ are
known).
If the error vector e is an n-bit vector with t “1”s and n – t “0”s, the probability of choosing
(without replacement) k “0” components from e is:
$P = \binom{n-t}{k} \Big/ \binom{n}{k}$
Given a [1024, 524, 101] Goppa code (which can correct up to 50 errors), suppose it is used to correct up to 10
transmission bit errors as well as for encryption. Then e has 40 “1”s and 1024 – 40 = 984 “0”s. If
no transmission error happens, the probability that m can be recovered by the method
described above is much higher than when e has 50 “1”s. Therefore this seriously
reduces the security strength.
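To get a feel for how much the attack gains when e has 40 ones instead of 50, the probability of picking k error-free coordinates can be evaluated directly. A minimal Python sketch, assuming the [1024, 524] parameters quoted above; the function name `zero_error_probability` is illustrative.

```python
from math import comb

def zero_error_probability(n, k, t):
    """Probability that k coordinates drawn without replacement avoid all t error positions."""
    return comb(n - t, k) / comb(n, k)

p40 = zero_error_probability(1024, 524, 40)
p50 = zero_error_probability(1024, 524, 50)
print(f"t = 40: {p40:.3e}")
print(f"t = 50: {p50:.3e}")
print(f"ratio : {p40 / p50:.1f}")  # how many times likelier a single guess succeeds
```

Both probabilities are tiny, so the attacker still needs many guesses; the ratio shows how much cheaper each successful guess becomes with the lighter error vector.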
D)
Advantages: McEliece is two to three orders of magnitude faster than RSA.
Disadvantages:
1. The data expansion is large: the ciphertext is twice as long as the plaintext.
2. Public key size is large.
3. Possibly a certain similarity to the Knapsack Cryptosystem because they use the
same design principle: decryption is an easy special case of an NP-Complete
problem, disguised so that it looks like a general instance of the problem.
Problem 3: (Linear Programming: Minimum-cost network flow)
A) (Proof by contradiction) Assume that the edges corresponding to a set of basis
variables in the simplex method for the matrix A include a cycle.
1. If there is a cycle, there must be a smallest cycle in which no vertex appears
more than once. Thus we only need to consider a smallest cycle.
2. As we know, each column of A has exactly two non-zero entries,
“1” and “–1” (if one is “1”, the other must be “–1”).
3. Consider the sub-matrix formed by the columns of the basis variables whose edges
form the cycle. By the definition of a cycle (a closed chain), every row of this
sub-matrix that has a non-zero entry has exactly two non-zero entries, each “1” or
“–1”.
4. If all edges in the chain point in the same direction, the two non-zero entries
in each row are 1 and –1. Thus the sum of the columns of the sub-matrix equals
0, which means that these columns, a subset of the basis columns in the
simplex method, are linearly dependent. This is a contradiction. Therefore, the edges
corresponding to a set of basis variables in the simplex method for the matrix A
cannot include a cycle.
5. If the edges in the chain do not all point in the same direction, then some row
contains two “1”s or two “–1”s. In this case, choose –1 as the coefficient of every
column whose edge points against the direction of traversal, and 1 for the others.
By the property of a cycle, this signed sum of the columns
of the sub-matrix equals 0. Again we have a contradiction. Therefore, the
edges corresponding to a set of basis variables in the simplex method for the
matrix A cannot include a cycle.
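The signed-sum argument in steps 4 and 5 can be checked numerically on a small example. This is an illustrative Python sketch: the 4-vertex cycle and its edge orientations are made up, and the convention that a column has +1 at the tail and –1 at the head of its edge is an assumption (the opposite convention works the same way).

```python
def incidence_column(n_vertices, edge):
    """Column of the node-edge incidence matrix: +1 at the tail, -1 at the head (assumed convention)."""
    tail, head = edge
    column = [0] * n_vertices
    column[tail] = 1
    column[head] = -1
    return column

def signed_cycle_sum(n_vertices, cycle, edges):
    """Sum the columns with coefficient +1 for edges along the traversal and -1 for edges against it.

    cycle lists the vertices in traversal order; edges[i] joins cycle[i] and cycle[i + 1].
    """
    total = [0] * n_vertices
    for i, edge in enumerate(edges):
        a, b = cycle[i], cycle[(i + 1) % len(cycle)]
        coefficient = 1 if edge == (a, b) else -1
        column = incidence_column(n_vertices, edge)
        total = [t + coefficient * c for t, c in zip(total, column)]
    return total

# Hypothetical cycle 0-1-2-3-0 whose edges do not all point the same way.
cycle = [0, 1, 2, 3]
edges = [(0, 1), (2, 1), (2, 3), (0, 3)]
print(signed_cycle_sum(4, cycle, edges))  # [0, 0, 0, 0]
```

A nontrivial combination of the columns equal to the zero vector is exactly the linear dependence used in the proof.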
B) No. If there were two distinct paths between a source (positive b) and a sink (negative
b) with nonzero flow, the two paths together would contain a cycle of basis edges, starting
from the source and coming back to it, which contradicts A).
C) In terms of the graph, one step of the simplex method corresponds to one step of
generating a minimum spanning tree: add an edge (with a smaller weight) to the tree
and delete one edge (with a bigger weight) from the tree, so that no cycle is formed.
Problem 4: (Integer Programming: Graph Coloring)
Let G be a graph. $V = \{v_i \mid v_i \in G\}$ is the set of vertices in G. $E = \{e_{ij} \mid \langle v_i, v_j \rangle \in G\}$ is
the set of edges in G. $C = \{c_i \mid i = 1, 2, \ldots, n \le |V|\}$ is the set of colors.
$x_{ij} = \begin{cases} 1 & \text{if } v_i \text{ is labeled with the } j\text{th color} \\ 0 & \text{otherwise} \end{cases}$, for $0 < i \le |V|$, $0 < j \le |C|$ ………………………………….(1)
Every vertex is labeled with exactly one color:
$\sum_{j=1}^{n} x_{ij} = 1$, where $0 < i \le |V|$ ……………………………………(2)
No two neighboring vertices have the same color:
$x_{ij} + x_{kj} \le 1$ if $e_{ik} \in E$, for $0 < i, k \le |V|$, $0 < j \le |C|$ …….………………… (3)
Thus the “Graph Coloring” problem is formulated as an “Integer Program”: given a
graph G, find the smallest integer n for which the constraints (1), (2), (3) can be satisfied.
Let IP(xij, n) be an algorithm for “Integer Program” that solves the problem given by (1), (2),
and (3). Then the algorithm to find a minimum set of colors for G can be implemented
with IP as follows:
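Graph-Coloring-Algorithm (G){
    build (1), (2), (3) based on G;
    lo = 1; hi = |V|;              // |V| colors are always feasible
    while (lo < hi){               // binary search for the smallest feasible n
        n = floor((lo + hi) / 2);
        if IP(xij, n) has a feasible solution, then hi = n;
        else lo = n + 1;
    }
    return the solution xij of IP(xij, lo), 1 ≤ i ≤ |V|, 1 ≤ j ≤ lo;
}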
Complexity
Assume the complexity of algorithm IP is O(F(|V|)). The algorithm calls IP O(log2 |V|) times, so the complexity of Graph-Coloring-Algorithm is O(F(|V|) · log2 |V|).
The number of constraints is O(|V| + |V||E|): |V| of them come from
constraints (2), and at most |V||E| come from constraints (3). The
number of variables is O(|V|^2).
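The search over n can be tried out on small graphs. The following Python sketch is not an integer-programming solver: as a stated assumption, a simple backtracking routine `feasible` stands in for IP(xij, n), and `min_coloring` does a binary search over n, relying on the fact that feasibility is monotone in the number of colors. Both names are hypothetical.

```python
def feasible(n_vertices, edges, n_colors):
    """Stand-in for IP(x, n): return a coloring satisfying (2) and (3), or None."""
    neighbors = [[] for _ in range(n_vertices)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    color = [None] * n_vertices

    def assign(v):
        if v == n_vertices:
            return True
        for c in range(n_colors):
            # constraint (3): no neighbor may already use color c
            if all(color[w] != c for w in neighbors[v]):
                color[v] = c
                if assign(v + 1):
                    return True
        color[v] = None
        return False

    return color[:] if assign(0) else None

def min_coloring(n_vertices, edges):
    """Binary search for the smallest n with a feasible coloring (needs n_vertices >= 1)."""
    lo, hi = 1, n_vertices          # n_vertices colors are always enough
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(n_vertices, edges, mid) is not None:
            hi = mid
        else:
            lo = mid + 1
    return lo, feasible(n_vertices, edges, lo)

triangle = [(0, 1), (1, 2), (0, 2)]
print(min_coloring(3, triangle)[0])  # 3
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(min_coloring(4, square)[0])    # 2
```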
Problem 5: (CG Islands)
Hidden Markov Model for the CG Islands problem:
$\Sigma = \{A, C, G, T\}$.
$Q = \{A^+, A^-, C^+, C^-, G^+, G^-, T^+, T^-\}$.
$A = (P_{ij})_{|Q| \times |Q|}$, where $P_{ij}$ specifies the probability of transitioning from state i to state j,
with $i, j \in Q$.
$e(q, \sigma)$ is a function $Q \times \Sigma \to \{0, 1\}$, where $q \in Q$ and $\sigma \in \Sigma$:
     A+  A-  C+  C-  G+  G-  T+  T-
A    1   1   0   0   0   0   0   0
C    0   0   1   1   0   0   0   0
G    0   0   0   0   1   1   0   0
T    0   0   0   0   0   0   1   1
A path $\pi = q_1, q_2, \ldots, q_n$ is a sequence of states, with $q_i \in Q$.
The probability that a sequence $x = x_1, \ldots, x_n$ was generated by the path $\pi$ is:
$P(x \mid \pi) = \prod_{i=1}^{n} P(x_i \mid q_i) \times P_{q_i, q_{i+1}} \times e(q_i, x_i)$
Problem: find an optimal path:
$\pi^* \triangleq \arg\max_{\pi} P(x \mid \pi)$
Define $s_q(i)$ as the probability of the most probable path for the prefix $x_1, \ldots, x_i$ that ends
with state q, where $q \in Q$ and $1 \le i \le n - 1$. Then the recurrence relation is:
$s_q(i+1) = \max_{k \in Q} \{ e(q, x_{i+1}) \times s_k(i) \times P_{kq} \}$
Viterbi Algorithm for CG Islands
s_q(1) = P_q e(q, x_1), for q ∈ Q, where P_q is the probability of starting in state q;
π_q(1) = q, for q ∈ Q;
for i = 1 to n-1 {
    for q ∈ Q {
        s_q(i+1) = max over k ∈ Q of { e(q, x_{i+1}) x s_k(i) x P_kq };
        k* = arg max over k ∈ Q of { e(q, x_{i+1}) x s_k(i) x P_kq };
        π_q(i+1) = π_{k*}(i) || q;
    }
}
q* = arg max over q ∈ Q of { s_q(n) };
return π_{q*}(n);
Complexity:
The Viterbi algorithm runs in O(n|Q|^2) time, since for each of the n positions and each state q
it maximizes over all |Q| predecessor states k. Since |Q| = 8 is a constant, this algorithm has
linear complexity, O(n).
Numerical problem:
This algorithm has a numerical problem: underflow, because the product of many probabilities
quickly becomes too small to represent. To avoid the problem, the algorithm can work with
logarithmic scores $S_k(i) = \log s_k(i)$. The recurrence relation is
converted to:
$S_q(i+1) = \max_{k \in Q} \{ \log e(q, x_{i+1}) + S_k(i) + \log P_{kq} \}$
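The log-space recurrence can be sketched in Python as follows. Every number here is a hypothetical placeholder chosen only so that the example runs: `BASE_PROB`, `STAY`, and the uniform 0.5 start weight are assumptions, not parameters from the exam, and back-pointers are stored instead of concatenating whole paths.

```python
import math

NEG_INF = float("-inf")
BASES = "ACGT"
STATES = [base + sign for base in BASES for sign in "+-"]

# Hypothetical parameters: inside an island ("+") C and G are frequent, outside ("-") rare.
BASE_PROB = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
             "-": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
STAY = 0.9  # assumed probability of keeping the same sign between neighbors

def log_trans(k, q):
    """log P_kq for the hypothetical transition matrix."""
    sign_prob = STAY if k[1] == q[1] else 1 - STAY
    return math.log(sign_prob * BASE_PROB[q[1]][q[0]])

def log_emit(q, symbol):
    """log e(q, symbol): 0 if state q emits the symbol, -infinity otherwise."""
    return 0.0 if q[0] == symbol else NEG_INF

def viterbi(x):
    """Return the most probable state path for a non-empty sequence x."""
    score = {q: math.log(0.5 * BASE_PROB[q[1]][q[0]]) + log_emit(q, x[0]) for q in STATES}
    back_pointers = []
    for symbol in x[1:]:
        new_score, pointers = {}, {}
        for q in STATES:
            best_k = max(STATES, key=lambda k: score[k] + log_trans(k, q))
            new_score[q] = log_emit(q, symbol) + score[best_k] + log_trans(best_k, q)
            pointers[q] = best_k
        score = new_score
        back_pointers.append(pointers)
    q = max(STATES, key=lambda s: score[s])
    path = [q]
    for pointers in reversed(back_pointers):
        q = pointers[q]
        path.append(q)
    return path[::-1]

print(viterbi("CGCGCGCG"))  # every state carries "+"
print(viterbi("ATATATAT"))  # every state carries "-"
```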
Problem 6: (A String Alignment Problem)
The algorithm proceeds by aligning S[1]…S[i] with T[1]…T[j]. For these prefixes of S
and T, define the following variables:
1. U(i, j) is the value of an optimal alignment of S[1]…S[i] and T[1]…T[j].
2. V(i, j) is the value of an optimal alignment of S[1]…S[i] and T[1]…T[j], whose
last pair matches S[i] with T[j];
3. W(i, j) is the value of an optimal alignment of S[1]…S[i] and T[1]…T[j], whose
last pair matches S[i] with a space.
4. X(i, j) is the value of an optimal alignment of S[1]…S[i] and T[1]…T[j]whose last
pair matches a space with T[j].
Basis:
V(i, 0) = i
V(0, j) = j,
W(0, j) = - ∞,
X(i, 0) = - ∞ ,
Recurrence: for i > 0 and j > 0,
U(i, j) = max {V(i, j), W(i, j), X(i, j)}
V(i, j) = max {V(i-1, j-1) + σ ( S[ i ], T[ j ]),
W(i-1, j-1) + σ ( S[ i ], T[ j ]) + ,
X(i-1, j-1) + σ ( S[i], T[j] ) + }
W(i, j) = max {V(i-1, j) + σ ( S[ i ], - ) + ,
W(i-1, j) + σ ( S[ i ], - ) + } 2
X(i, j) = max {V(i, j-1) + σ ( -, T[j] ) + ,
W(i, j-1) + σ ( -, T[j] ) + } 2
The algorithm based on the recurrence has a two-level loop: one over i (0 ≤ i ≤ n) and
one over j (0 ≤ j ≤ m). Thus the complexity of the algorithm is O(mn).
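For comparison, here is a Python sketch of a three-table dynamic program of the same shape. It is not the recurrence written above: it swaps in the standard affine-gap formulation, and the scores (`match`, `mismatch`) and the penalties (`gap_open`, `gap_extend`) are hypothetical parameters introduced for illustration. The tables V, W, X play the same roles as in the definitions above.

```python
NEG_INF = float("-inf")

def affine_gap_score(s, t, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Best global alignment score; a gap of length L costs gap_open + L * gap_extend."""
    n, m = len(s), len(t)
    # V: last pair aligns s[i] with t[j]; W: s[i] with a space; X: a space with t[j].
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    W = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    for i in range(1, n + 1):
        W[i][0] = -gap_open - i * gap_extend
    for j in range(1, m + 1):
        X[0][j] = -gap_open - j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1], W[i - 1][j - 1], X[i - 1][j - 1]) + sigma
            # Opening a gap pays gap_open; extending the same kind of gap does not.
            W[i][j] = max(V[i - 1][j] - gap_open, W[i - 1][j], X[i - 1][j] - gap_open) - gap_extend
            X[i][j] = max(V[i][j - 1] - gap_open, X[i][j - 1], W[i][j - 1] - gap_open) - gap_extend
    return max(V[n][m], W[n][m], X[n][m])

print(affine_gap_score("ACGT", "ACGT"))  # 4
print(affine_gap_score("ACGT", "AGT"))   # 0: three matches minus one gap of length 1
```

The two nested loops fill three (n+1) x (m+1) tables with constant work per cell, which matches the O(mn) bound.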
Problem 7: (Link Queries)
A)
1. Generate the link-by-doc matrix A = (a_ij), where a_ij = 1 if URL i is pointed to by
web page j.
2. Compute SVD(A), generating U, Σ, and V.
3. Retain only the first k terms of U, Σ, and V, and thereby reconstruct A_k.
The truncated SVD A_k describes the documents in k-d space rather than the n-d space
of A. Since k is much smaller than the number of URLs, A_k
ignores minor differences in URLs. Similarly, a doc-by-doc matrix is generated and
tracked as below:
B)
1. Let S be a query. Generate a vector q = (q_i), where q_i is the number of pages in S
that point to URL i. This vector is transformed into a vector q’ in k-d space
according to the equation
$q' = q^T U_k \Sigma_k$
The projected query vector q’ is compared with the document vectors, and
documents are ranked by their proximity to q’. Use the cosine measure as the
measure of nearness, and return the document set S’ whose cosine with the query
documents exceeds some threshold.
2. Calculate the frequency of URLs in S and S’, and rank them according to their
frequency. Return the most frequently pointed-to URLs.
Footnote 2: Besides the charge for a space within a gap, we also add a penalty (or negative
cost) for matching a single character with a space, since that penalty may not equal zero.
This term could be omitted, as in the class and lecture notes.
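The cosine-threshold step of the query can be sketched in a few lines of Python. This sketch assumes the query and the documents have already been projected into k-d space (the SVD itself is not shown), and the page names, vectors, and threshold are made-up illustrative values.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is the zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_documents(query_vec, doc_vecs, threshold):
    """Return document names whose cosine with the query exceeds threshold, nearest first."""
    scored = [(cosine(query_vec, vec), name) for name, vec in doc_vecs.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > threshold]

# Hypothetical 2-d projections of three pages and of a query.
docs = {"page1": [1.0, 0.0], "page2": [0.9, 0.1], "page3": [0.0, 1.0]}
result = similar_documents([1.0, 0.05], docs, threshold=0.9)
print(result)  # ['page1', 'page2']
```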
C) O(kn)
D) Yes. The procedure is as follows: given S, generate a query vector Q = (q_i), where q_i is
the URL of a page in S. Project the query vector into k-d space to get q’. Then compare q’
with the link vectors (columns of A_k), ranking them by their proximity to q’.
Use the cosine measure as the measure of nearness, and return the document set T whose
cosine with the projected query vector exceeds some threshold; this is a set
of documents, T, pointing to the given documents, S. The preprocessing can be used as
before.
E)
1. Compute T1, a set of documents which are pointed to by S or S’ (the set of pages similar to
S); see also B.
2. Compute T2 (⊆ T1; see footnote 3), a set of documents which point to S or S’; see also D.
3. Replace the elements a_ij ∈ {0, 1} of the matrix A with n_ij (∈ Z), the number of
terms shared between the page represented by URL i and document j. T is
then the set of documents that are similar to the given set S in terms of a fixed
linear combination of outgoing links, incoming links, and words in the document.
Because of the replacement of the elements of A, you can set this up using a single SVD.
Footnote 3: This constraint is added to reduce the computing load.