INFORMATION THEORY
and Communication
Wayne Lawton
Department of Mathematics
National University of Singapore
S14-04-04, [email protected]
http://math.nus.edu.sg/~matwml
CHOICE CONCEPT
We are confronted by choices every day, for instance we
may choose to purchase an apple, a banana, or a coconut;
or to withdraw k < N dollars from our bank account.
Choice has two aspects: we may need to make the decision
“what fruit to purchase” or “how many dollars to withdraw”. We
will ignore this aspect, which is the concern of Decision Theory.
We may also need to communicate this choice to our food
seller or our bank. This is the concern of Information Theory.
INCREASING CHOICE?
The number of choices increases if more elements are added
to the set of existing choices. For example, a shopper who is to
choose one fruit from a store that carries Apples, Bananas,
and Coconuts may discover that the store has added Durians
and Elderberries. The number of choices increases from 3 to
5 by the addition of 2 extra choices.
{A, B, C} ∪ {D, E} = {A, B, C, D, E}
The number of choices also increases if two or more sets of
choices are combined. A shopper may have the choice
of choosing one fruit from {A, B, C} on Monday and choosing
one fruit from {D, E} on Thursday. Compare with the case above.
{A, B, C} × {D, E}
= {(A, D), (A, E), (B, D), (B, E), (C, D), (C, E)}
MEASURING CHOICE?
For any set X let #X denote the number of elements in X. Then
#(X × Y) = #X · #Y
We require that the information required to specify the choice
of an element from a set be an additive function, therefore
H(X × Y) = H(X) + H(Y)
Theorem 1. The logarithm functions measure information:
H(X) = log_a #(X)
log to the base a, where a > 1; called bits if a = 2, nats if a = e
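A minimal MATLAB check (my addition, not part of the original slides); the set sizes are assumptions chosen to match the fruit example above.
% Check that H(X) = log2(#X) is additive over Cartesian products,
% i.e. H(X x Y) = H(X) + H(Y), using #X = 3 and #Y = 2.
nX = 3; nY = 2;
H_product = log2(nX*nY);          % information in one choice from X x Y
H_sum     = log2(nX) + log2(nY);  % information in a choice from X plus one from Y
disp([H_product H_sum])           % both print 2.5850 bits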
FACTORIAL FUNCTION
For any positive integer n define n factorial:
n! = n · (n - 1) · (n - 2) · ··· · 2 · 1,   0! = 1
Often the choices within different sets are mutually constrained.
If a shopper is to purchase a different fruit from the set
{A, B, C, D, E} on each of 5 days, then there are 5 choices on the
first day but only 4 choices on the second day, etc., so the total
number of choices equals
5 · 4 · 3 · 2 · 1 = 5! = 120
STIRLING’S APPROXIMATION
Theorem 2.
log_e n! ≈ (n + 1/2) log_e n - n + constant
Proof [K] pages 111-115
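A short MATLAB sketch of Problem 2 later in these notes (my addition); it assumes the unspecified constant is (1/2)·log_e(2π), the usual value in Stirling's formula.
% Compare log_e(n!) with Stirling's Approximation for a few n.
n = (10:10:100)';
exact    = gammaln(n+1);                         % log_e(n!) computed without overflow
stirling = (n+0.5).*log(n) - n + 0.5*log(2*pi);  % (n + 1/2) log_e n - n + constant
disp([n exact-stirling])                         % the gap shrinks roughly like 1/(12n)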
COMBINATORICS
Theorem 2 (Binomial).
(s + t)^n = Σ_{k=0}^{n} C(n, k) s^k t^(n-k)
where C(n, k), called “n choose k”, is the number of ways of
choosing k elements from a set with n elements.
Furthermore, n choose k equals
C(n, k) = n! / (k! (n - k)!)
Proof. Consider that (s + t)^n = (s + t)(s + t) ··· (s + t)
is the product of n factors, and it equals the sum of 2^n
terms; each term is obtained by specifying a choice
of s or t from each factor, and the number of terms equal to
s^k t^(n-k) is exactly the number of ways of choosing which k
of the n factors contribute s.
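A quick numerical illustration (my addition) of the Binomial Theorem; the values of s, t, and n are arbitrary.
% Expand (s + t)^n term by term using nchoosek and compare with the direct power.
s = 1.3; t = 0.7; n = 5;
terms = zeros(1, n+1);
for k = 0:n
    terms(k+1) = nchoosek(n,k) * s^k * t^(n-k);   % C(n,k) s^k t^(n-k)
end
disp([sum(terms) (s+t)^n])                        % both equal 2^5 = 32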
MULTINOMIAL COEFFICIENTS
Theorem 3. The number of sequences with n_i ≥ 0 symbols
of the i-th type, i = 1, ..., M, equals
N! / (n_1! n_2! ··· n_M!)
where
N = n_1 + ··· + n_M
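A small brute-force check of Theorem 3 (my addition); the multiset {1,1,2,3} is an arbitrary example.
% Count the distinct orderings of the multiset {1,1,2,3} two ways:
% by the formula N!/(n1! n2! n3!) and by listing all permutations.
seq = [1 1 2 3];  counts = [2 1 1];  N = numel(seq);
byFormula = factorial(N) / prod(factorial(counts));
byListing = size(unique(perms(seq), 'rows'), 1);
disp([byFormula byListing])     % both equal 12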
SHANNON’S FORMULA
Theorem 4. For large N, the average information per symbol
of a string of length N containing M symbols with probabilities
p_1, ..., p_M is
H(p_1, ..., p_M) = -Σ_{i=1}^{M} p_i log_2 p_i   bits
Proof. The law of large numbers says that the i-th symbol will
occur approximately
n_i ≈ N p_i,   i = 1, ..., M
times, so the result follows from Stirling’s Approximation.
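A hedged MATLAB sketch (my addition) of the counting argument behind Theorem 4; the probabilities and the string length N are assumptions chosen so that the expected counts N·p_i are integers.
% Compare Shannon's formula with the exact per-symbol information
% log2( N!/(n1! ... nM!) ) / N for a long string.
p  = [0.5 0.25 0.25];  N = 1200;  ni = N*p;
H  = -sum(p.*log2(p));                                  % 1.5 bits per symbol
Hc = (gammaln(N+1) - sum(gammaln(ni+1))) / (N*log(2));  % exact count, converted to bits
disp([H Hc])                                            % Hc approaches H as N grows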
ENTROPY OF A RANDOM VARIABLE
Let X denote a random variable with values in a set
A = {a_1, ..., a_m} with probabilities p_1, ..., p_m.
We define the entropy H(X) by
H(X) = H(p_1, ..., p_m) = -Σ_{i=1}^{m} p_i log_2 p_i   bits
Recall that for a large integer N, N·H(X) equals the
log of the number of strings of length N from A whose
relative frequencies of letters are these probabilities.
Hence the entropy of a random variable equals the
average information required to describe the
values that it takes: it takes 1000 bits to describe
1000 flips of a fair coin, but we can describe the
loaded coin sequence HHHHHHTHHHHHHHHHTT
by its run lengths 6H1T9H2T
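A tiny illustration (my addition) of the fair-coin versus loaded-coin remark; the loaded-coin probability 15/18 is simply read off the sample sequence above.
% Entropy of a fair coin versus a loaded coin with prob(H) = 15/18.
Hfair   = -sum([0.5 0.5].*log2([0.5 0.5]));     % 1 bit per flip
p       = 15/18;
Hloaded = -(p*log2(p) + (1-p)*log2(1-p));       % about 0.65 bits per flip
disp([Hfair Hloaded])                           % the loaded coin needs fewer bits on average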
JOINT DISTRIBUTION
Let X denote a random variable with values in a set
A = {a_1, ..., a_m} with probabilities p_1, ..., p_m,
and let Y be a random variable with values in a set
B = {b_1, ..., b_n} with probabilities q_1, ..., q_n.
Then X × Y is a random variable with values in
A × B = {(a_i, b_j) | i = 1, ..., m; j = 1, ..., n}
whose probabilities {r_ij | i = 1, ..., m; j = 1, ..., n}
satisfy the marginal equations (m + n - 1 independent)
Σ_{i=1}^{m} r_ij = q_j,     Σ_{j=1}^{n} r_ij = p_i
MUTUAL INFORMATION
The joint entropy of X and Y,
H(X × Y) = -Σ_{i=1}^{m} Σ_{j=1}^{n} r_ij log_2 r_ij,
satisfies
H(X × Y) ≤ H(X) + H(Y)
Equality holds if and only if X and Y are
independent; this means that
r_ij = p_i q_j
The mutual information of X and Y, defined by
I(X, Y) = H(X) + H(Y) - H(X × Y),
satisfies 0 ≤ I(X, Y) ≤ min{H(X), H(Y)}
and
I(X, X) = H(X)
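A minimal sketch (my addition) of these quantities for an assumed 2 × 2 joint distribution r_ij; any nonnegative table summing to 1 will do.
% Compute H(X), H(Y), H(X x Y) and I(X,Y) from a joint probability table.
r  = [0.4 0.1; 0.2 0.3];                  % assumed joint probabilities r_ij
p  = sum(r,2);  q = sum(r,1);             % marginals p_i (rows) and q_j (columns)
Hb = @(v) -sum(v(v>0).*log2(v(v>0)));     % entropy of any probability vector
HX = Hb(p);  HY = Hb(q);  HXY = Hb(r(:));
I  = HX + HY - HXY;                       % nonnegative; zero iff r_ij = p_i*q_j
disp([HX HY HXY I])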
CHANNELS AND THEIR CAPACITY
A channel is a relationship between the transmitted
message X and the received message Y.
Typically, this relationship does not determine Y
as a function of X but only determines the
statistics of Y given the value of X; this
determines the joint distribution of X and Y.
The channel capacity is defined as C = max I(X, Y),
where the maximum is taken over the distribution of X.
Example: a binary channel with a 10% bit error rate
and Prob{X = 0} = p_1 has joint probabilities

          Y = 0           Y = 1
X = 0     .9 p_1          .1 p_1
X = 1     .1 (1 - p_1)    .9 (1 - p_1)

Max I(X, Y) = .531 bits for p_1 = .5
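A sketch (my addition) that reproduces the 0.531-bit figure by sweeping p_1 over a grid; the grid spacing is an arbitrary choice.
% Maximize I(X,Y) over prob{X = 0} = p1 for the 10% bit-error channel above.
Hb = @(v) -sum(v(v>0).*log2(v(v>0)));
p1 = 0.01:0.01:0.99;  I = zeros(size(p1));
for k = 1:numel(p1)
    r = [0.9*p1(k)     0.1*p1(k); ...        % joint probabilities from the table above
         0.1*(1-p1(k)) 0.9*(1-p1(k))];
    I(k) = Hb(sum(r,2)) + Hb(sum(r,1)) - Hb(r(:));
end
[C, kmax] = max(I);
disp([C p1(kmax)])                           % approximately 0.531 bits at p1 = 0.5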
SHANNON’S THEOREM
If a channel has capacity C then it is possible to
send information over that channel at a rate
arbitrarily close to, but never more than, C with
a probability of error arbitrarily small.
Shannon showed that this was possible by
proving that there exists a sequence of codes
whose rates approach C and whose probabilities
of error approach zero.
His masterpiece, called the Channel Coding
Theorem, never actually constructed any specific
codes, and thus provided jobs for thousands
of engineers, mathematicians, and scientists.
LANGUAGE AS A CODE
During my first visit to Indonesia I ate a curious small fruit.
Back in Singapore I went to a store and asked for a small fruit
with the skin of a dark brown snake and more bitter than any
gourd. Now I ask for Salak – a far more efficient, if less
descriptive, code to specify my choice of fruit.
When I specify the number of dollars that I want to withdraw
from my bank account I use positional notation (in base 10),
a code to specify nonnegative integers that was invented in
Babylonia (now Iraq) about 4000 years ago (in base 60).
Digital Computers, in contrast to analog computers,
represent numbers using positional notation in base 2
(shouldn’t they be called binary computers?). Is this
because they can’t count further than 1? These lectures
will explore this and other intriguing mysteries.
WHAT IS THE BEST BASE?
A base-B code of length L uses an ordered sequence of L
symbols from a set of B symbols to represent
B × B × ··· × B = B^L (read ‘B to the power L’) choices.
Physically, this is represented using L devices, each of
which can exist in one of B states.
The cost is L times the cost of each device, and the cost of
each device is proportional to B since physical material is
required to represent each of the B-1 ‘inactive states’ for
each of the L devices that correspond to each position.
The efficiency of base B is therefore the ratio of the information
in a base-B sequence of length L divided by the cost B·L, therefore
Eff = log_e(B^L) / (B·L) = (log_e B) / B
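A one-line MATLAB check (my addition) of the efficiency (log_e B)/B for integer bases; over the reals the maximizer is B = e, so base 3 wins among integers.
% Tabulate the efficiency (log_e B)/B for B = 2,...,10.
B = 2:10;
disp([B; log(B)./B])    % the maximum is at B = 3 (0.3662); B = 2 and B = 4 tie at 0.3466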
IS THE SKY BLUE?
If I use base 2 positional notation to specify that I want to
withdraw d < 8 dollars from my bank account then
000 ↔ d = 0   001 ↔ d = 1   010 ↔ d = 2   011 ↔ d = 3
100 ↔ d = 4   101 ↔ d = 5   110 ↔ d = 6   111 ↔ d = 7
Positional notation is great for computing, but if I decide to
withdraw 2 rather than 1 (or 4 rather than 3) dollars I must
change my code by 2 (or 3) bits. Consider the Gray code:
000 ↔ d = 0   001 ↔ d = 1   011 ↔ d = 2   010 ↔ d = 3
110 ↔ d = 4   111 ↔ d = 5   101 ↔ d = 6   100 ↔ d = 7
What’s different?
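A sketch (my addition) of one standard construction, the reflected binary Gray code g(d) = d XOR floor(d/2); it reproduces the table above, though it is not the only length-3 Gray code.
% Generate the reflected binary Gray code for d = 0,...,7.
d = (0:7)';
g = bitxor(d, bitshift(d,-1));      % consecutive values of d differ in exactly one bit of g
disp([dec2bin(d,3) repmat('  ->  ',8,1) dec2bin(g,3)])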
GRAY CODE GEOMETRY
How many binary Gray codes of length 3 are there? And how
can we construct them? Cube geometry gives the answers.
[Figure: a cube whose 8 vertices are labeled with the binary
triples 000, 001, 010, 011, 100, 101, 110, 111.]
The bits in a code word are the Cartesian coordinates
of the vertices. The d-th and (d+1)-th vertices
share a common edge. Answer the questions.
000 ↔ d = 0   001 ↔ d = 1   011 ↔ d = 2   010 ↔ d = 3
110 ↔ d = 4   111 ↔ d = 5   101 ↔ d = 6   100 ↔ d = 7
PROBLEMS?
1. Derive Theorem 1. Hint: review properties of logarithms.
2. Write and run a simple computer program to
demonstrate Stirling’s Approximation.
3. Derive the formula for n choose k by induction and then
try to find another derivation. Then use the other derivation
to derive the multinomial formula.
4. Complete the details for the second half of the derivation
of Shannon’s Formula for Information.
5. How many binary Gray codes are there of length 3?
ERROR CORRECTING CODES
How many bits of information can be sent reliably by sending 3 bits if one of
those bits may be corrupted? Consider the 3-dimensional binary hypercube.
[Figure: the same cube, with vertices labeled by the binary triples.]
H = {binary sequences of length 3}; H has 8 sequences.
A code C is a subset of H.
The Hamming distance d(x, y) between x and y in H is the
number of bits in which they differ. Hence d(010, 111) = 2.
The minimal distance d(C) of a code C is min {d(x, y) | x ≠ y in C}.
A code C can correct 1 error bit if and only if d(C) ≥ 3.
So we can send 1 bit reliably with the code C = {(000), (111)}
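A two-line MATLAB sketch (my addition) of the Hamming distance on 0/1 row vectors, checking the two examples above.
% Hamming distance: the number of positions in which two binary words differ.
hd = @(x,y) sum(x ~= y);
disp(hd([0 1 0], [1 1 1]))     % d(010,111) = 2
disp(hd([0 0 0], [1 1 1]))     % d(C) = 3 for C = {000,111}, so C corrects 1 error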
PARITY CODES
If we wanted to send 4 bits reliably (to correct up to 1 bit error)
then we could send each of these bits three times – this code
consists of a set C of 16 sequences having length 12 – the code
rate is only 4/12 ≈ 33% since 12 bits are used to communicate 4 bits.
However, it is possible to send 4 bits reliably using only 8 bits by
arranging the four bits in a 2 × 2 square and assigning 4
parity bits, one for each row and each column.
To send a sequence abcd, transmit
a          b          (a + b)_2
c          d          (c + d)_2
(a + c)_2  (b + d)_2
where the subscript 2 means mod 2.
For example, abcd = 1101 is transmitted as
1 1 0
0 1 1
1 0
Note: a single bit error in a, b, c, d results in odd parity
in its row and column.
Ref. See rectangular and triangle codes in [H]
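A sketch (my addition) of the 2 × 2 parity encoder above and of how a single flipped data bit is located by its odd row and column parities; the corrupted position is an arbitrary choice.
% Encode abcd = 1101 as in the example, corrupt bit b, and locate the error.
data = [1 1; 0 1];                          % a b ; c d
rowP = mod(sum(data,2), 2);                 % (a+b)_2 and (c+d)_2
colP = mod(sum(data,1), 2);                 % (a+c)_2 and (b+d)_2
recv = [data rowP; colP 0];                 % transmitted array (corner entry unused)
recv(1,2) = 1 - recv(1,2);                  % a single bit error in b
badRow = find(mod(sum(recv(1:2,1:2),2),2) ~= recv(1:2,3));
badCol = find(mod(sum(recv(1:2,1:2),1),2) ~= recv(3,1:2));
fprintf('flip the data bit in row %d, column %d\n', badRow, badCol)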
HAMMING CODES
The following [7,4,3] Hamming Code can send 4 bits reliably
using only 7 bits; it has d(C) = 3. Its 16 codewords are:
1 1 0 1 0 0 0
0 1 1 0 1 0 0
1 0 1 1 1 0 0
1 1 1 0 0 1 0
0 0 1 1 0 1 0
1 1 0 0 1 0 1
0 0 0 1 1 0 1
0 1 1 1 0 0 1
1 0 0 0 1 1 0
0 0 1 0 1 1 1
0 1 0 0 0 1 1
0 1 0 1 1 1 0
1 0 1 0 0 0 1
1 0 0 1 0 1 1
1 1 1 1 1 1 1
0 0 0 0 0 0 0
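A brute-force MATLAB check (my addition) that the 16 codewords listed above have minimum pairwise Hamming distance 3.
% Minimum pairwise Hamming distance of the listed [7,4,3] code.
C = [1 1 0 1 0 0 0; 0 1 1 0 1 0 0; 1 0 1 1 1 0 0; 1 1 1 0 0 1 0;
     0 0 1 1 0 1 0; 1 1 0 0 1 0 1; 0 0 0 1 1 0 1; 0 1 1 1 0 0 1;
     1 0 0 0 1 1 0; 0 0 1 0 1 1 1; 0 1 0 0 0 1 1; 0 1 0 1 1 1 0;
     1 0 1 0 0 0 1; 1 0 0 1 0 1 1; 1 1 1 1 1 1 1; 0 0 0 0 0 0 0];
dmin = 7;
for i = 1:15
    for j = i+1:16
        dmin = min(dmin, sum(C(i,:) ~= C(j,:)));
    end
end
disp(dmin)      % prints 3, so d(C) = 3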
OTHER CODES
Hamming Codes are examples of cyclic group codes – why?
BCH (Bose-Chaudhuri-Hocquenghem) codes are another
class of cyclic group codes, generated by the coefficient
sequences of certain irreducible polynomials over a finite field.
Reed-Solomon Codes were the first class of BCH codes to be
discovered. They were first used by NASA for space
communications and are now used for error correction in CDs.
Other codes include: Convolutional, Goethals, Golay, Goppa,
Hadamard, Julin, Justesen, Nordstrom-Robinson, Pless
double circulant, Preparata, Quadratic Residue, Rao-Reddy,
Reed-Muller, t-designs and Steiner systems, Sugiyama,
Trellis, Turbo, and Weldon codes. There are many waiting
to be discovered and the number of open problems is huge.
COUNTING STRINGS
Let n and m_1, ..., m_n be positive integers
and let T_1 ≤ T_2 ≤ ··· ≤ T_n be positive real numbers.
Let A be an alphabet with
m_1 symbols of time duration T_1,
m_2 symbols of time duration T_2,
...
m_n symbols of time duration T_n.
Let CS(t, (m_1, m_2, ..., m_n), (T_1, T_2, ..., T_n)),  t ≥ 0,
be the number of strings, made from the letters of A,
whose time duration is ≤ t.
MORSE CODE MESSAGES
n = 2,  m_1 = m_2 = 1,  A = {dot, dash} = { . , - }
T_1 = time duration of a dot,  T_2 = time duration of a dash
CS(t, (1,1), (T_1, T_2)) = # messages whose duration ≤ t
Examples: If T_1 = 1, T_2 = 2 then
CS(1, (1,1), (1,2)) = 1 :   .
CS(2, (1,1), (1,2)) = 3 :   .   ..   -
CS(3, (1,1), (1,2)) = 6 :   .   ..   ...   -   .-   -.
CS(4, (1,1), (1,2)) = 11 :  the above plus   ....   ..-   .-.   -..   --
PROTEINS
n = 20,  m_i = 1,  i = 1, ..., n
A = {amino acids} = {Glycine, Alanine, Valine, Phenylalanine,
Proline, Methionine, Isoleucine, Leucine, Aspartic Acid,
Glutamic Acid, Lysine, Arginine, Serine, Threonine, Tyrosine,
Histidine, Cysteine, Asparagine, Glutamine, Tryptophan}
T_i = weight of the corresponding Peptide Unit of the i-th amino
acid, arranged from lightest (i = 1) to heaviest (i = 20)
[Figure: a single-chain protein with three peptide units; each unit
N-C(H)(R)-C(=O) carries an amino acid residue R, with H at one end
of the chain and OH at the other.]
CS(t, ..., ...) = # Single Chain Proteins
with weight ≤ t + weight(H2O)
RECURSIVE ALGORITHM
CS(t, (m_1, m_2, ..., m_n), (T_1, T_2, ..., T_n)) =
   0                                          if t < T_1
   Σ_{i=1}^{k} m_i (1 + CS(t - T_i, ...))     if T_k ≤ t < T_{k+1},  k < n
   Σ_{i=1}^{n} m_i (1 + CS(t - T_i, ...))     if T_n ≤ t
MATLAB PROGRAM
function N = cs(t,m,T)
% function N = cs(t,m,T)
% Wayne Lawton 19 / 1 / 2003
% Inputs: t = time,
%         m = array of n positive integers
%         T = array of n increasing positive numbers
% Outputs: N = number of strings composed out of these m(i) symbols
%          of duration T(i), i = 1,...,n and having duration <= t
%
k = sum(T <= t);
N = 0;
if k > 0
    for j = 1:k
        N = N + m(j)*(1+cs(t-T(j),m,T));
    end
end
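As a usage check (my addition), the program can be run against the Morse-code counts computed earlier:
cs(3, [1 1], [1 2])     % returns 6, matching CS(3, (1,1), (1,2))
cs(4, [1 1], [1 2])     % returns 11, matching CS(4, (1,1), (1,2))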
ASYMPTOTIC GROWTH
Theorem 5. For large t,
CS(t, (m_1, m_2, ..., m_n), (T_1, T_2, ..., T_n)) ≈ c X^t
where X is the unique real root of the equation
Σ_{i=1}^{n} m_i X^(-T_i) = 1
and c is some constant.
Example:
CS(t, (1,1), (1,2)) ≈ c ((1 + √5)/2)^t ≈ c (1.618...)^t
Proof. For integer T’s a proof based on linear algebra
works, and X is the largest eigenvalue of a matrix that
represents the recursion or difference equation for CS.
Otherwise the Laplace Transform is required. We
discovered a new proof based on Information Theory.
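A numerical sketch (my addition) of the root condition in Theorem 5 for the Morse example; the starting point for fzero is an arbitrary guess.
% Solve sum_i m_i X^(-T_i) = 1 for the Morse alphabet and compare with the golden ratio.
m = [1 1];  T = [1 2];
root = @(X) sum(m.*X.^(-T)) - 1;
X = fzero(root, 2);
disp([X (1+sqrt(5))/2])      % both approximately 1.6180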
INFORMATION THEORY PROOF
We choose probabilities p_i / m_i, i = 1, ..., n,
for the symbols having time duration T_i (each of the m_i symbols
of duration T_i gets probability p_i / m_i) so as to maximize
λ = H / T
where
H = H(p_1/m_1, ..., p_1/m_1, p_2/m_2, ..., p_2/m_2, ..., p_n/m_n, ..., p_n/m_n)
  = Σ_{i=1}^{n} ( -p_i log_2 p_i + p_i log_2 m_i )
is the Shannon information (or entropy) per symbol and
T = Σ_{i=1}^{n} p_i T_i
is the average duration per symbol.
Clearly λ is the average information per time.
INFORMATION THEORY PROOF
Since there is the constraint Σ_{i=1}^{n} p_i = 1,
for some Lagrange multiplier μ we have
∂/∂p_j ( λ - μ Σ_{i=1}^{n} p_i ) = 0,   j = 1, ..., n
hence
p_j = m_j e^(-λ T_j) / Z(λ),   j = 1, ..., n
where the denominator Z(λ), called the partition function,
is the sum of the numerators (why?).
Substituting these probabilities into λ = H / T
gives Z(λ) = 1, hence X = e^λ satisfies the root
condition in Theorem 5. The proof is complete
since the probabilities that maximize information
are the ones that occur in the set of all sequences.
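A numerical sketch (my addition) of the maximizing probabilities for the Morse example, using X equal to the golden ratio from Theorem 5.
% With p_j = m_j X^(-T_j) the probabilities sum to 1 (Z = 1) and the
% information rate H/T equals log2(X) bits per unit time.
X = (1 + sqrt(5))/2;  m = [1 1];  T = [1 2];
p    = m.*X.^(-T);                        % [0.6180 0.3820]
H    = sum(-p.*log2(p) + p.*log2(m));     % entropy per symbol, in bits
Tbar = sum(p.*T);                         % average duration per symbol
disp([sum(p) H/Tbar log2(X)])             % 1.0000, 0.6942, 0.6942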
MORE PROBLEMS?
6. Compute H(1/2,1/3,1/6)
7. Show that H(X,Y) is maximal when X and Y are independent
8. Read [H] and explain what a triangular parity code is.
9. Compute all Morse Code sequences of duration <= 5,
if dots have duration 1 and dashes have duration 2.
10. Compute the smallest molecular weight W such that
at least 100 single chain proteins have weight <= W.
REFERENCES
[BT] Carl Brandon and John Tooze, Introduction to Protein
Structure, Garland Publishing, Inc., New York, 1991.
[BC] James Ward Brown and Ruel V. Churchill, Complex
Variables and Applications, McGraw-Hill, New York, 1996.
[CS] J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices
and Groups, Springer, New York, 1993.
[Ham] R. W. Hamming, Coding and Information Theory,
Prentice-Hall, New Jersey, 1980.
[H] Sharon Heumann, Coding theory and its application to the
study of sphere-packing, Course Notes, October 1998,
http://www.mdstud.chalmers.se/~md7sharo/coding/main/main.html
[K] Donald E. Knuth, The Art of Computer Programming, Volume
1: Fundamental Algorithms, Addison-Wesley, Reading, 1997.
[SW] Claude E. Shannon and Warren Weaver, The Mathematical
Theory of Communication, Univ. of Illinois Press, Urbana, 1949.
MATHEMATICAL APPENDIX
If M = Σ_{i=1}^{n} m_i and ⌊t⌋ = largest integer ≤ t,
then
f(t) = CS(t, (m_1, m_2, ..., m_n), (T_1, T_2, ..., T_n))
satisfies f(t) ≤ M^⌊t/T_1⌋, so its Laplace Transform
F(s) = ∫_0^∞ e^(-st) f(t) dt
exists for complex s with Re s > (log_e M) / T_1.
The recursion for f implies that
F(s) = G(s) / P(s)
where
P(s) = 1 - Σ_{i=1}^{n} m_i e^(-s T_i)
G(s) = ∫_0^{T_n} e^(-st) f(t) dt + G_1(s)
G_1(s) = M e^(-s T_n) / s - Σ_{i=1}^{n} m_i e^(-s T_i) ∫_0^{T_n - T_i} e^(-st) f(t) dt
MATHEMATICAL APPENDIX
This allows F to be defined as a meromorphic function
with singularities to the left of a line Re s = γ.
Therefore, f is given by a Bromwich integral that
can be computed by a contour integral using the
method of residues, see page 235 of [BC]:
f(t) = (1 / (2πi)) P.V. ∫_{γ - i∞}^{γ + i∞} F(s) e^(st) ds,   t > 0
f(t) = Σ_{j ≥ 1} Res_{s = s_j} [ F(s) e^(st) ]
where s_j, j ≥ 1, are the singularities of F.
[Figure: the Bromwich contour, the vertical line from γ - iR to γ + iR closed
by a large arc to the left, enclosing the singularities s_1, s_2, s_3, s_4, ... of F.]
The unique real singularity is s_1 = log_e X,
and for large t
f(t) ≈ c X^t, thus proving Theorem 5.