INFORMATION THEORY and Communication
Wayne Lawton
Department of Mathematics, National University of Singapore
S14-04-04, [email protected]
http://math.nus.edu.sg/~matwml

CHOICE CONCEPT
We are confronted by choices every day: we may choose to purchase an apple, a banana, or a coconut, or to withdraw k < N dollars from our bank account. Choice has two aspects. We may need to make the decision ("what fruit to purchase", "how many dollars to withdraw"); we will ignore this aspect, which is the concern of Decision Theory. We may also need to communicate this choice to our food seller or our bank; this is the concern of Information Theory.

INCREASING CHOICE?
The number of choices increases if more elements are added to the set of existing choices. For example, suppose a shopper is to choose one fruit from a store that carries Apples, Bananas, and Coconuts, and then discovers that the store has added Durians and Elderberries. The number of choices increases from 3 to 5 by the addition of 2 extra choices:

  {A, B, C} ∪ {D, E} = {A, B, C, D, E}

The number of choices also increases if two or more sets of choices are combined. A shopper may have the choice of one fruit from {A, B, C} on Monday and one fruit from {D, E} on Thursday. Compare with the case above:

  {A, B, C} × {D, E} = {(A,D), (A,E), (B,D), (B,E), (C,D), (C,E)}

MEASURING CHOICE?
For any set X let #X denote the number of elements in X. Then

  #(X × Y) = #X · #Y

We require that the information needed to specify the choice of an element from a set be an additive function, therefore

  H(X × Y) = H(X) + H(Y)

Theorem 1. The logarithm functions measure information:

  H(X) = log_a(#X)

where the logarithm is taken to a base a > 0; the units are called bits if a = 2 and nats if a = e.

FACTORIAL FUNCTION
For any positive integer n define n factorial by

  n! = n · (n−1) · (n−2) · ... · 2 · 1,   0! = 1

Often the choices within different sets are mutually constrained.
If a shopper is to purchase a different fruit from the set {A, B, C, D, E} on each of 5 days, then there are 5 choices on the first day but only 4 choices on the second day, etc., so the total number of choices equals

  5 · 4 · 3 · 2 · 1 = 5! = 120

STIRLING'S APPROXIMATION
Theorem 2.

  log_e n! = (n + 1/2) log_e n − n + constant

Proof: [K], pages 111-115.

COMBINATORICS
Theorem 2 (Binomial).

  (s + t)^n = Σ_{k=0}^{n} C(n, k) s^k t^(n−k)

where C(n, k), read "n choose k", is the number of ways of choosing k elements from a set with n elements. Furthermore,

  C(n, k) = n! / (k! (n − k)!)

Proof. Consider that (s + t)^n = (s + t)(s + t)···(s + t) is the product of n factors, and it equals the sum of 2^n terms; each term is obtained by specifying a choice of s or t from each factor, and the number of terms equal to s^k t^(n−k) is exactly the number of ways of choosing which k of the n factors contribute an s.

MULTINOMIAL COEFFICIENTS
Theorem 3. The number of sequences with n_i ≥ 0 symbols of the i-th type, i = 1, ..., M, equals

  N! / (n_1! n_2! ··· n_M!)   where N = n_1 + ··· + n_M

SHANNON'S FORMULA
Theorem 4. For large N, the average information per symbol of a string of length N containing M symbols with probabilities p_1, ..., p_M is

  H(p_1, ..., p_M) = − Σ_{i=1}^{M} p_i log_2 p_i   bits

Proof. The law of large numbers says that the i-th symbol will occur approximately n_i ≈ N p_i, i = 1, ..., M times, so the result follows from Theorem 3 and Stirling's Approximation.
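Theorems 1 and 4 can be illustrated numerically. The following is a sketch in Python (the notes' own program later on is in MATLAB): it checks that log2 of the multinomial coefficient, divided by N, approaches H(p), and that for uniform probabilities H reduces to Theorem 1's log of the number of choices.

```python
import math

def shannon_H(probs):
    """Theorem 4: average information per symbol, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def log2_multinomial(counts):
    """log2( N! / (n1! ... nM!) ), via lgamma to avoid huge factorials."""
    N = sum(counts)
    return (math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)) / math.log(2)

# Uniform probabilities recover Theorem 1: H = log2(#choices).
assert abs(shannon_H([1/3, 1/3, 1/3]) - math.log2(3)) < 1e-12

# Shannon's Formula: bits per symbol of typical strings approaches H(p).
p = (0.5, 0.25, 0.25)                          # H(p) = 1.5 bits
for N in (40, 400, 4000):
    counts = [round(N * pi) for pi in p]
    print(N, round(log2_multinomial(counts) / N, 4))  # tends to 1.5 as N grows
```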
ENTROPY OF A RANDOM VARIABLE
Let X denote a random variable with values in a set A = {a_1, ..., a_m} with probabilities p_1, ..., p_m. We define the entropy H(X) by

  H(X) = H(p_1, ..., p_m) = − Σ_{i=1}^{m} p_i log_2 p_i   bits

Recall that for a large integer N, N·H(X) equals the log of the number of strings of length N from A whose relative frequencies of letters are these probabilities. Hence the entropy of a random variable equals the average information required to describe the values that it takes: it takes 1000 bits to describe 1000 flips of a fair coin, but we can describe the loaded-coin sequence HHHHHHTHHHHHHHHHTT by its run lengths, 6H1T9H2T.

JOINT DISTRIBUTION
Let X be as above and let Y be a random variable with values in a set B = {b_1, ..., b_n} with probabilities q_1, ..., q_n. Then X × Y is a random variable with values in A × B = {(a_i, b_j) | i = 1, ..., m; j = 1, ..., n} whose probabilities r_ij satisfy the marginal equations (m + n − 1 of them independent):

  Σ_{i=1}^{m} r_ij = q_j,   Σ_{j=1}^{n} r_ij = p_i

MUTUAL INFORMATION
The joint entropy of X and Y,

  H(X × Y) = − Σ_{i=1}^{m} Σ_{j=1}^{n} r_ij log_2 r_ij

satisfies

  H(X × Y) ≤ H(X) + H(Y)

Equality holds if and only if X and Y are independent, which means r_ij = p_i q_j. The mutual information of X and Y, defined by

  I(X, Y) = H(X) + H(Y) − H(X × Y)

satisfies 0 ≤ I(X, Y) ≤ min{H(X), H(Y)} and 0 ≤ I(X, X) = H(X).

CHANNELS AND THEIR CAPACITY
A channel is a relationship between the transmitted message X and the received message Y. Typically this relationship does not determine Y as a function of X but only determines the statistics of Y given the value of X; this determines the joint distribution of X and Y. The channel capacity is defined as

  C = max I(X, Y)

where the maximum is over the distribution of X. Example: a binary channel with a 10% bit error rate and prob{X = 0} = p_1 has joint probabilities

           Y = 0           Y = 1
  X = 0    .9 p_1          .1 p_1
  X = 1    .1 (1 − p_1)    .9 (1 − p_1)

and max I(X, Y) = .531 bits, attained at p_1 = .5.

SHANNON'S THEOREM
If a channel has capacity C then it is possible to send information over that channel at a rate arbitrarily close to, but never more than, C with a probability of error arbitrarily small. Shannon showed that this was possible by proving that there exists a sequence of codes whose rates approach C and whose probabilities of error approach zero. His masterpiece, called the Channel Coding Theorem, never actually constructed any specific codes, and thus provided jobs for thousands of engineers, mathematicians, and scientists.

LANGUAGE AS A CODE
During my first visit to Indonesia I ate a curious small fruit. Back in Singapore I went to a store and asked for a small fruit with the skin of a dark brown snake, more bitter than any gourd. Now I ask for Salak — a far more efficient, if less descriptive, code to specify my choice of fruit. When I specify the number of dollars that I want to withdraw from my bank account I use positional notation (in base 10), a code to specify nonnegative integers that was invented in Babylonia (now Iraq) about 4000 years ago (in base 60). Digital computers, in contrast to analog computers, represent numbers using positional notation in base 2 (shouldn't they be called binary computers?). Is this because they can't count further than 1? These lectures will explore this and other intriguing mysteries.

WHAT IS THE BEST BASE?
A base-B code of length L uses an ordered sequence of symbols from a set of B symbols to represent B × B × ... × B = B^L (read "B to the power L") choices. Physically, this is represented using L devices, each of which can exist in one of B states. The cost is L times the cost of each device, and the cost of each device is proportional to B, since physical material is required to represent each of the B − 1 'inactive' states of each of the L devices, one per position. The efficiency of base B is therefore the information in a base-B sequence of length L divided by its cost BL:

  Eff = (L log_e B) / (BL) = (log_e B) / B

IS THE SKY BLUE?
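The binary-channel example above can be checked numerically. Here is a sketch in Python (an assumption of these edits; the notes use MATLAB elsewhere): it scans input distributions p_1 and confirms that I(X, Y) peaks at about .531 bits when p_1 = .5.

```python
import math

def H(probs):
    """Entropy in bits of a probability vector (0 log 0 taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_info(p1, err=0.1):
    """I(X,Y) for the example channel: prob{X=0} = p1, 10% bit error rate."""
    joint = [p1 * (1 - err), p1 * err,              # (X=0,Y=0), (X=0,Y=1)
             (1 - p1) * err, (1 - p1) * (1 - err)]  # (X=1,Y=0), (X=1,Y=1)
    px = [p1, 1 - p1]
    py = [joint[0] + joint[2], joint[1] + joint[3]]
    return H(px) + H(py) - H(joint)

# Scan input distributions; the capacity is the maximum of I(X,Y).
best = max(mutual_info(p / 1000) for p in range(1, 1000))
print(round(best, 3))              # 0.531, attained at p1 = 0.5
print(round(mutual_info(0.5), 3))  # 0.531 = 1 - H(0.1, 0.9)
```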
If I use base 2 positional notation to specify that I want to withdraw d < 8 dollars from my bank account, then

  000 → d = 0    100 → d = 4
  001 → d = 1    101 → d = 5
  010 → d = 2    110 → d = 6
  011 → d = 3    111 → d = 7

Positional notation is great for computing, but if I decide to withdraw 2 rather than 1 (or 4 rather than 3) dollars I must change my code by 2 (or 3) bits. Consider the Gray code:

  000 → d = 0    110 → d = 4
  001 → d = 1    111 → d = 5
  011 → d = 2    101 → d = 6
  010 → d = 3    100 → d = 7

What's different?

GRAY CODE GEOMETRY
How many binary Gray codes of length 3 are there? And how can we construct them? Cube geometry gives the answers: the bits in a code word are the Cartesian coordinates of the vertices of the unit cube, and the d-th and (d+1)-th vertices share a common edge. [Slide: the cube with vertices 000 through 111, labelled with the Gray code table above.] Answer the questions.

PROBLEMS
1. Derive Theorem 1. Hint: review properties of logarithms.
2. Write and run a simple computer program to demonstrate Stirling's Approximation.
3. Derive the formula for n choose k by induction and then try to find another derivation. Then use the other derivation to derive the multinomial formula.
4. Complete the details of the second half of the derivation of Shannon's Formula for information.
5. How many binary Gray codes are there of length 3?

ERROR CORRECTING CODES
How many bits of information can be sent reliably by sending 3 bits if one of those bits may be corrupted? Consider the 3-dimensional binary hypercube H = {binary sequences of length 3}; H has 8 sequences. A code C is a subset of H. The Hamming distance d(x, y) between x and y in H is the number of bits in which they differ. Hence d(010, 111) = 2.
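The Gray code table and the Hamming distance can be sketched together in Python (an assumption of these edits; the reflected construction g(d) = d XOR (d >> 1) is the standard one and reproduces the table above):

```python
def gray(d, width=3):
    """Reflected Gray code word for d, as a binary string of the given width."""
    return format(d ^ (d >> 1), f"0{width}b")

def hamming(x, y):
    """Hamming distance: number of bit positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

codes = [gray(d) for d in range(8)]
print(codes)  # ['000', '001', '011', '010', '110', '111', '101', '100']

# Consecutive code words (d and d+1) differ in exactly one bit --
# i.e. the corresponding cube vertices share a common edge.
assert all(hamming(codes[d], codes[d + 1]) == 1 for d in range(7))
assert hamming("010", "111") == 2   # the example from the text
```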
The minimal distance d(C) of a code C is min {d(x, y) | x, y in C, x ≠ y}. A code C can correct 1 error bit if and only if d(C) ≥ 3. So we can send 1 bit reliably with the code C = {(000), (111)}.

PARITY CODES
If we wanted to send 4 bits reliably (to correct up to 1 bit error) then we could send each of these bits three times; this code consists of a set C of 16 sequences of length 12, and the code rate is only 33% since 12 bits are used to communicate 4 bits. However, it is possible to send 4 bits reliably using only 8 bits (rate 50%) by arranging the four bits in a 2 × 2 square and assigning 4 parity bits, one for each row and each column. To send a sequence abcd, transmit (the subscript 2 means mod 2):

  a          b          (a + b)_2
  c          d          (c + d)_2
  (a + c)_2  (b + d)_2

For example, abcd = 1101 is sent as rows 1 1 0 and 0 1 1 with column parities 1 0. Note: a single bit error in a, b, c, or d results in odd parity in its row and in its column, which locates the error. Ref: see rectangular and triangle codes in [H].

HAMMING CODES
The following [7,4,3] Hamming code can send 4 bits reliably using only 7 bits; it has d(C) = 3. Its 16 codewords are:

  1101000  0110100  1011100  1110010
  0011010  1100101  0001101  0111001
  1000110  0010111  0100011  0101110
  1010001  1001011  1111111  0000000

OTHER CODES
Hamming codes are examples of cyclic group codes — why? BCH (Bose-Chaudhuri-Hocquenghem) codes are another class of cyclic group codes, generated by the coefficient sequences of certain irreducible polynomials over a finite field. Reed-Solomon codes were the first class of BCH codes to be discovered. They were first used by NASA for space communications and are now used for error correction in CDs. Other codes include: Convolutional, Goethals, Golay, Goppa, Hadamard, Julin, Justesen, Nordstrom-Robinson, Pless double circulant, Preparata, Quadratic Residue, Rao-Reddy, Reed-Muller, t-designs and Steiner systems, Sugiyama, Trellis, Turbo, and Weldon codes. There are many waiting to be discovered, and the number of open problems is huge.
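The claimed properties of the [7,4,3] Hamming code above — 16 codewords, minimal distance 3, single-error correction — can be verified directly. A sketch in Python (an assumption of these edits), using brute-force nearest-codeword decoding:

```python
from itertools import combinations

# The 16 codewords of the [7,4,3] Hamming code listed in the text.
CODE = ("1101000 0110100 1011100 1110010 0011010 1100101 0001101 0111001 "
        "1000110 0010111 0100011 0101110 1010001 1001011 1111111 0000000").split()

def d(x, y):
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(x, y))

# Minimal distance d(C) = 3, so the code can correct any single bit error.
print(min(d(x, y) for x, y in combinations(CODE, 2)))  # 3

def decode(word):
    """Nearest-codeword decoding: corrects up to one flipped bit."""
    return min(CODE, key=lambda c: d(c, word))

sent = "1110010"
received = "1010010"            # one bit flipped in transit
assert decode(received) == sent
```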
COUNTING STRINGS
Let n and m_1, ..., m_n be positive integers and let T_1 ≤ T_2 ≤ ... ≤ T_n be positive real numbers. Let A be an alphabet with m_1 symbols of time duration T_1, m_2 symbols of time duration T_2, ..., m_n symbols of time duration T_n. For t ≥ 0 let

  CS(t, m_1, m_2, ..., m_n, T_1, T_2, ..., T_n)

be the number of (nonempty) strings, made from the letters of A, whose time duration is ≤ t.

MORSE CODE MESSAGES
A = {dot, dash} = {·, −}, n = 2, m_1 = m_2 = 1, T_1 = time duration of a dot, T_2 = time duration of a dash. CS(t, 1,1, T_1, T_2) = number of messages whose duration is ≤ t. Examples: if T_1 = 1, T_2 = 2, then

  CS(1, 1,1, 1,2) = 1    (·)
  CS(2, 1,1, 1,2) = 3    (·, ··, −)
  CS(3, 1,1, 1,2) = 6    (add ···, ·−, −·)
  CS(4, 1,1, 1,2) = 11   (add ····, ··−, ·−·, −··, −−)

PROTEINS
n = 20, m_i = 1, i = 1, ..., n. A = {amino acids} = {Glycine, Alanine, Valine, Phenylalanine, Proline, Methionine, Isoleucine, Leucine, Aspartic Acid, Glutamic Acid, Lysine, Arginine, Serine, Threonine, Tyrosine, Histidine, Cysteine, Asparagine, Glutamine, Tryptophan}. T_i = weight of the peptide unit of the i-th amino acid, arranged from lightest (i = 1) to heaviest (i = 20). [Slide: diagram of a single-chain protein with three peptide units, each unit carrying an amino acid residue R.] Then CS(t, ..., ...) = number of single-chain proteins with weight ≤ t + weight(H2O).

RECURSIVE ALGORITHM

  CS(t, m_1, ..., m_n, T_1, ..., T_n) =
    0                                            if 0 ≤ t < T_1
    Σ_{i=1}^{k} m_i (1 + CS(t − T_i, ...))       if T_k ≤ t < T_{k+1}
    Σ_{i=1}^{n} m_i (1 + CS(t − T_i, ...))       if t ≥ T_n

MATLAB PROGRAM

  function N = cs(t,m,T)
  % function N = cs(t,m,T)
  % Wayne Lawton 19/1/2003
  % Inputs: t = time,
  %   m = array of n positive integers
  %   T = array of n increasing positive numbers
  % Outputs: N = number of strings composed out of these symbols,
  %   m(i) symbols of duration T(i), i = 1,...,n, having duration <= t
  k = sum(T <= t);
  N = 0;
  if k > 0
    for j = 1:k
      N = N + m(j)*(1+cs(t-T(j),m,T));
    end
  end

ASYMPTOTIC GROWTH
Theorem 5. For large t,

  CS(t, m_1, m_2, ..., m_n, T_1, T_2, ..., T_n) ≈ c X^t

where X is the unique real root of the equation

  Σ_{i=1}^{n} m_i X^(−T_i) = 1

and c is some constant. Example:

  CS(t, 1,1, 1,2) ≈ c ((1 + √5)/2)^t = c (1.618...)^t

Proof. For integer T's a proof based on linear algebra works, and X is the largest eigenvalue of a matrix that represents the recursion (difference equation) for CS. Otherwise the Laplace transform is required; see the Mathematical Appendix. We discovered a new proof based on Information Theory.

INFORMATION THEORY PROOF
We choose probabilities p_i / m_i, i = 1, ..., n, for the symbols having time duration T_i, so as to maximize H / T̄, where

  H = − Σ_{i=1}^{n} p_i log_2 (p_i / m_i)

is the Shannon information (or entropy) per symbol and

  T̄ = Σ_{i=1}^{n} p_i T_i

is the average duration per symbol. Clearly H / T̄ is the average information per unit time. Since there is the constraint Σ_{i=1}^{n} p_i = 1, for some Lagrange multiplier λ,

  p_j = m_j e^(−λ T_j) / Z(λ),   j = 1, ..., n

where the denominator, called the partition function, is the sum of the numerators (why?):

  Z(λ) = Σ_{i=1}^{n} m_i e^(−λ T_i)

Substituting these probabilities into H / T̄ and maximizing over λ gives Z(λ) = 1, hence X = e^λ satisfies the root condition in Theorem 5. The proof is complete since the probabilities that maximize information are the ones that occur in the set of all typical sequences.

MORE PROBLEMS
6. Compute H(1/2, 1/3, 1/6).
7. Show that H(X × Y) is maximal when X and Y are independent.
8. Read [H] and explain what a triangular parity code is.
9.
Compute all Morse code sequences of duration ≤ 5, if dots have duration 1 and dashes have duration 2.
10. Compute the smallest molecular weight W such that at least 100 single-strand proteins have weight ≤ W.

REFERENCES
[BT] Carl Brandon and John Tooze, Introduction to Protein Structure, Garland Publishing, Inc., New York, 1991.
[BC] James Ward Brown and Ruel V. Churchill, Complex Variables and Applications, McGraw-Hill, New York, 1996.
[CS] J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups, Springer, New York, 1993.
[Ham] R. W. Hamming, Coding and Information Theory, Prentice-Hall, New Jersey, 1980.
[H] Sharon Heumann, Coding theory and its application to the study of sphere-packing, Course Notes, October 1998, http://www.mdstud.chalmers.se/~md7sharo/coding/main/main.html
[K] Donald E. Knuth, The Art of Computer Programming, Volume 1: Fundamental Algorithms, Addison-Wesley, Reading, 1997.
[SW] Claude E. Shannon and Warren Weaver, The Mathematical Theory of Communication, Univ. of Illinois Press, Urbana, 1949.

MATHEMATICAL APPENDIX
If M = Σ_{i=1}^{n} m_i, then every string counted by f(t) = CS(t, m_1, m_2, ..., m_n, T_1, T_2, ..., T_n) has at most ⌊t/T_1⌋ symbols, so f(t) ≤ M + M² + ··· + M^⌊t/T_1⌋ and its Laplace transform

  F(s) = ∫_0^∞ f(t) e^(−st) dt

exists for complex s with real part sufficiently large. The recursion for f implies

  F(s) = G(s) / P(s),   P(s) = 1 − Σ_{i=1}^{n} m_i e^(−s T_i)

where, with G_1(s) = M e^(−s T_n) / s,

  G(s) = ∫_0^{T_n} e^(−st) f(t) dt + G_1(s) − Σ_{i=1}^{n} m_i e^(−s T_i) ∫_0^{T_n − T_i} e^(−st) f(t) dt

This allows F to be defined as a meromorphic function with singularities to the left of a line Re s = γ. Therefore f is given by a Bromwich integral that can be computed by a contour integral using the method of residues, see page 235 of [BC]:

  f(t) = (1 / 2πi) P.V. ∫_{γ−i∞}^{γ+i∞} F(s) e^(st) ds,   t > 0

  f(t) = Σ_{j≥1} Res_{s = s_j} [F(s) e^(st)]

where the s_j, j ≥ 1, are the singularities of F. [Slide: diagram of the contour Re s = γ with the poles s_1, s_2, s_3, s_4, ... of F lying to its left.] The unique real singularity is s_1 = log_e X, and for large t

  f(t) ≈ c X^t

thus proving Theorem 5.
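As a numerical sanity check of Theorem 5: for the Morse alphabet the ratio CS(t+1)/CS(t) should approach the root X = (1 + √5)/2 of X^(−1) + X^(−2) = 1. A sketch in Python (an assumption of these edits; cs below is a transcription of the MATLAB program in the notes):

```python
import math

def cs(t, m, T):
    """Recursive string count from the lecture (Python version of the MATLAB cs):
    strings with m[i] symbols of duration T[i] and total duration <= t."""
    return sum(mi * (1 + cs(t - Ti, m, T)) for mi, Ti in zip(m, T) if Ti <= t)

phi = (1 + math.sqrt(5)) / 2               # the root of x**-1 + x**-2 == 1
assert abs(phi**-1 + phi**-2 - 1) < 1e-12

counts = [cs(t, [1, 1], [1, 2]) for t in range(1, 25)]
ratios = [b / a for a, b in zip(counts, counts[1:])]
print(counts[:4])                          # [1, 3, 6, 11], as in the examples
print(round(ratios[-1], 3), round(phi, 3)) # both approximately 1.618
```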