Information Theory
Modern digital communication depends on Information Theory, which was invented in the 1940s by Claude E. Shannon. Shannon published A Mathematical Theory of Communication in 1948; it provides a mathematical model for communication.
Information Sources
An information source is a system that outputs symbols from a fixed set of n symbols {x1 .. xn} in a sequence at some rate (see Fig. 1). In the simplest case, each symbol that might be output from the system is equally likely. The letter i will stand for some given output symbol from the set {x1 .. xn}. If all symbols are equally likely, then the probability that symbol i will be the one produced is pi = P = 1/n, no matter which symbol we have in mind. For example, if the information source can produce four equally likely symbols (A, B, C, and D), then each symbol has a probability of 0.25 (that is, 25% or 1/4).
Fig. 1. Information Source and Observer
An observer is uncertain which of the n symbols will be output. Once a given symbol xi is observed, the observer has obtained information from the source, and the observer's uncertainty is reduced. The amount of information obtained can be measured because the number of possible symbols is known, and the unit of measure is binary digits, or bits. The unit of measure depends on the base of the logarithm. Most of the time, Information Theory uses the base 2 logarithm (log2). Any other logarithm base would work; if we used base 10, the unit of measure would be decimal digits.
If a system can output any of 16 possible symbols, then for each symbol observed the observer receives 4 bits of information. That is, each observation reduces the observer's uncertainty by 4 bits (see Fig. 2).
Fig. 2. One of 16 Possible Symbols is Observed (all symbols X1 .. X16 have equal probability; observing, say, X5 reduces uncertainty by I = log2 16 = 4 bits)
Entropies Defined, and Why They are Measures of Information
The amount of information about an event is closely related to its probability of occurrence. To put this in mathematical form: in general, any one of n equiprobable messages contains log2 n bits of information. Because we have assumed all n messages to be equiprobable, the probability of occurrence of each one is pi = 1/n, and the information associated with each message is then

I = log2 n bits    (1)

The information content Ii of a single event or message is defined as the base-2 logarithm of the reciprocal of its probability pi:

Ii = log2 (1/pi) bits    (2)

or, equivalently,

Ii = -log2 pi bits    (3)
Example 1
The four symbols A, B, C, D occur with probabilities 1/2, 1/4, 1/8, 1/8, respectively. Compute the information in the three-symbol message X = BDA, assuming that the symbols are statistically independent.
Solution: Because the symbols are independent, the measure of information is additive, so using Eq. (2) we can write
IX = log2 4 + log2 8 + log2 2
IX = 2 + 3 + 1
IX = 6 bits
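The same arithmetic can be checked with a few lines of Python (an illustrative sketch; the dictionary of probabilities simply restates the values given in Example 1):

import math

# Information in the message X = BDA from Example 1, with independent symbols.
p = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}   # symbol probabilities given above

# Information adds over independent symbols: I_i = log2(1/p_i) per symbol.
I_x = sum(math.log2(1 / p[s]) for s in "BDA")
print(I_x)   # 6.0 bits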
The above results define our measure of information for the somewhat special case in which all messages are equally likely. To generalize, we define an average information, which is called the Entropy H:

H = Σi pi Ii = Σi pi log2 (1/pi) bits/symbol

Entropy can be regarded intuitively as "uncertainty" or "disorder"; to gain information is to lose uncertainty by the same amount. No information is gained (no uncertainty is lost) by the appearance of an event or the receipt of a message that was completely certain anyway (p = 1, so I = 0). Intuitively, the more improbable an event is, the more informative it is, and so the monotonic behavior of Eq. (2) seems appropriate.
A Note on Logarithms:
In information theory we often wish to compute base-2 logarithms, but most calculators offer only decimal (base-10) logarithms, so the following conversion is useful:
log2 X = 3.322 log10 X
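For instance, a short Python check of this conversion (the value of X below is an arbitrary example):

import math

x = 37.0                               # arbitrary example value
print(math.log10(x) / math.log10(2))   # exact base conversion
print(3.322 * math.log10(x))           # the approximation quoted above
print(math.log2(x))                    # direct computation, for comparison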
Entropy of Ensembles
We now move from considering the information content of a single event or
message, to that of an ensemble. An ensemble is the set of outcomes of one or
more random variables. The outcomes have probabilities attached to them. In
general these probabilities are non-uniform, with event i having probability pi, but
they must sum to 1 because all possible outcomes are included; hence they form a
probability distribution:
Σi pi = 1    (4)
The entropy of an ensemble is simply the average information content of all the elements in it. We can compute this average by weighting each of the log2 (1/pi) contributions by its probability pi:

H = Σi pi log2 (1/pi)    (5)

Eq. (5) allows us to speak of the information content or the entropy of a random variable, from knowledge of the probability distribution that it obeys. (Entropy does not depend upon the actual values taken by the random variable, only upon their relative probabilities.)
Example 2
Determine and graph the entropy (i.e., the average information) of a binary code in which the probabilities of occurrence of the two symbols are p and q = (1 - p).
Solution
Using Eq. (5) we have

H = p log2 (1/p) + q log2 (1/q) bits/symbol

A plot of H versus p shows that the maximum entropy is one bit/symbol and occurs for the equiprobable case (p = q = 1/2).
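A short Python sketch (illustrative only) that tabulates this binary entropy function at a few values of p, confirming the maximum of 1 bit/symbol at p = 1/2:

import math

# Binary entropy from Example 2: H(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)).
def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    q = 1 - p
    return p * math.log2(1 / p) + q * math.log2(1 / q)

# Tabulate a few points; the maximum of 1 bit/symbol occurs at p = 1/2.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(p, round(binary_entropy(p), 3))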
Joint entropy of X and Y

H(X, Y) = Σx,y P(x, y) log2 [1 / P(x, y)]    (6)

From this definition, it follows that joint entropy is additive if X and Y are independent random variables:

H(X, Y) = H(X) + H(Y)    if    P(x, y) = P(x) P(y)
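The additivity property can be checked numerically. In the Python sketch below, the marginal distributions are arbitrary illustrative choices, and the joint distribution is formed by assuming independence:

import math

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Illustrative marginals for two independent variables X and Y.
p_x = [1/2, 1/2]
p_y = [1/4, 1/4, 1/2]

# Independence: P(x, y) = P(x) P(y), so Eq. (6) gives H(X,Y) = H(X) + H(Y).
p_xy = [px * py for px in p_x for py in p_y]

print(entropy(p_xy))                  # H(X, Y)
print(entropy(p_x) + entropy(p_y))    # H(X) + H(Y): the same value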
Conditional entropy of an ensemble X, given that y = bj
This measures the uncertainty remaining about random variable X after specifying that random variable Y has taken on the particular value y = bj. It is defined naturally as the entropy of the probability distribution p(x | y = bj):

H(X | y = bj) = Σx P(x | y = bj) log2 [1 / P(x | y = bj)]

If we now consider the above quantity averaged over all possible outcomes that Y might have, each weighted by its probability p(y), then we arrive at the conditional entropy H(X | Y).
Conditional and Joint Entropy
If X and Y are random variables representing the input and output of a channel,
respectively, then the conditional entropy (meaning the average uncertainty of the
received symbol given that X was transmitted) is:
H(Y | X) = -∑ i, j p(xi, yj) log2 p(yj | xi),
the joint entropy (meaning the average uncertainty of the total information system)
is:
H(X, Y) = -∑ i, j p(xi, yj) log2 p(xi, yj),
and the equivocation entropy (meaning the average uncertainty of the transmitted
symbol after a symbol is received) is:
H(X | Y) = -∑ i, j p(xi, yj) log2 p(xi | yj).
The notation p(A, B) means the probability of A and B both occurring, while
p(A | B) means the probability of A occurring given that B has occurred.
An important relationship is:
H(X, Y) = H(X | Y) + H(Y) = H(Y | X) + H(X).
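This relationship can be verified numerically; the Python sketch below uses an arbitrary 2x2 joint distribution chosen only for illustration:

import math

# Check H(X,Y) = H(X|Y) + H(Y) = H(Y|X) + H(X) on a small illustrative
# joint distribution; p_xy[i][j] = p(x_i, y_j) (the numbers are arbitrary).
p_xy = [[1/4, 1/8],
        [1/8, 1/2]]

p_x = [sum(row) for row in p_xy]            # marginals p(x_i)
p_y = [sum(col) for col in zip(*p_xy)]      # marginals p(y_j)

def H(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

H_joint = H([p for row in p_xy for p in row])
H_x_given_y = -sum(p_xy[i][j] * math.log2(p_xy[i][j] / p_y[j])
                   for i in range(2) for j in range(2) if p_xy[i][j] > 0)
H_y_given_x = -sum(p_xy[i][j] * math.log2(p_xy[i][j] / p_x[i])
                   for i in range(2) for j in range(2) if p_xy[i][j] > 0)

print(H_joint)                   # H(X, Y)
print(H_x_given_y + H(p_y))      # H(X|Y) + H(Y): equals H(X, Y)
print(H_y_given_x + H(p_x))      # H(Y|X) + H(X): equals H(X, Y)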
Mutual Information between X and Y
The mutual information between two random variables measures the amount of
information that one conveys about the other. Equivalently, it measures the average
reduction in uncertainty about X that results from learning about Y. It is defined:

I(X; Y) = Σx,y p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]    (9)
Clearly X says as much about Y as Y says about X. Note that if X and Y are independent random variables, the numerator inside the logarithm equals the denominator; then the log term vanishes, and the mutual information equals zero, as one should expect.
Non-negativity: mutual information is always ≥ 0. In the event that the two random variables are perfectly correlated, their mutual information is the entropy of either one alone. (Another way to say this is: I(X; X) = H(X); the mutual information of a random variable with itself is just its entropy. For this reason, the entropy H(X) of a random variable X is sometimes referred to as its self-information.)
These properties are reflected in three equivalent definitions for the mutual information between X and Y:

I(X; Y) = H(X) - H(X | Y)    (10)
I(X; Y) = H(Y) - H(Y | X) = I(Y; X)    (11)
I(X; Y) = H(X) + H(Y) - H(X, Y)    (12)
In a sense the mutual information I(X; Y) is the intersection between H(X) and H(Y), since it represents their statistical dependence. In the Venn diagram, the portion of H(X) that does not lie within I(X; Y) is just H(X | Y), and the portion of H(Y) that does not lie within I(X; Y) is just H(Y | X).
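As a closing check, the Python sketch below computes I(X; Y) both directly from Eq. (9) and from the entropies as in Eq. (12), using the same arbitrary 2x2 joint distribution as before; the two results agree:

import math

# Mutual information computed two equivalent ways (Eqs. (9) and (12)) on the
# same small illustrative joint distribution used above.
p_xy = [[1/4, 1/8],
        [1/8, 1/2]]
p_x = [sum(row) for row in p_xy]
p_y = [sum(col) for col in zip(*p_xy)]

def H(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Eq. (9): I(X;Y) = sum_{x,y} p(x,y) * log2[ p(x,y) / (p(x) p(y)) ].
I_direct = sum(p_xy[i][j] * math.log2(p_xy[i][j] / (p_x[i] * p_y[j]))
               for i in range(2) for j in range(2) if p_xy[i][j] > 0)

# Eq. (12): I(X;Y) = H(X) + H(Y) - H(X,Y).
I_from_entropies = H(p_x) + H(p_y) - H([p for row in p_xy for p in row])

print(I_direct)            # nonnegative; zero only if X and Y are independent
print(I_from_entropies)    # the same value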