Information Theory

Modern digital communication depends on Information Theory, which was developed in the 1940s by Claude E. Shannon. Shannon published "A Mathematical Theory of Communication" in 1948, and it provides a mathematical model for communication.

Information Sources

An information source is a system that outputs symbols from a fixed set of n symbols {x1 .. xn} in a sequence at some rate (see Fig. 1). In the simplest case, each symbol that might be output from the system is equally likely. The letter i will stand for some given output symbol from the set {x1 .. xn}. If all symbols are equally likely, then the probability that symbol i will be the one produced is pi = 1/n, no matter which symbol we have in mind. For example, if the information source can produce four equally likely symbols (A, B, C, and D), then each symbol has a probability of .25 (that is, 25% or 1/4).

Fig. 1. Information Source and Observer

An observer is uncertain which of the n symbols will be output. Once a given symbol xi is observed, the observer has obtained information from the source: the observer's uncertainty is reduced. The amount of information obtained can be measured because the number of possible symbols is known, and the unit of measure is binary digits, or bits. The unit of measure depends on the base of the logarithm. Most of the time, Information Theory uses the base-2 logarithm (log2); any other logarithm base would work. If we used base 10, then the unit of measure would be decimal digits. If a system can output any of 16 possible symbols, then for each symbol observed the observer receives 4 bits of information. That is, each observation reduces the observer's uncertainty by 4 bits (see Fig. 2).

Fig. 2. One of 16 Possible Symbols is Observed (all symbols equally probable; the uncertainty is reduced by I = log2 16 = 4 bits)

Entropies Defined, and Why They are Measures of Information

The amount of information about an event is closely related to its probability of occurrence. To formulate this mathematically: in general, any one of n equiprobable messages contains log2 n bits of information,

I = log2 n bits -----(1)

Because we have assumed all n messages to be equiprobable, the probability of occurrence of each one is pi = 1/n, so the information content Ii of a single event or message is the base-2 logarithm of the reciprocal of its probability pi:

Ii = log2 (1/pi) = -log2 pi bits -----(2)

For statistically independent events, the information is additive:

I(x1, x2) = I(x1) + I(x2) -----(3)

Example 1
The four symbols A, B, C, D occur with probabilities 1/2, 1/4, 1/8, 1/8, respectively. Compute the information in the three-symbol message X = BDA, assuming that the symbols are statistically independent.

Solution: Because the symbols are independent, the measure of information is additive, and using eq. (2) we can write

Ix = log2 4 + log2 8 + log2 2 = 2 + 3 + 1 = 6 bits

The above results define our measure of information for the somewhat special case in which all messages are equally likely. To generalize, we define an average information, which is called the entropy H. Entropy can be regarded intuitively as "uncertainty" or "disorder". To gain information is to lose uncertainty by the same amount. No information is gained (no uncertainty is lost) by the appearance of an event or the receipt of a message that was completely certain anyway (p = 1, so I = 0). Intuitively, the more improbable an event is, the more informative it is, and so the monotonic behavior of Eqn. (2) seems appropriate.
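As a quick numerical illustration of Eqns. (2) and (3), the following minimal Python sketch reproduces the 6-bit answer of Example 1. The function name information_bits and the symbol table are our own illustrative choices, not part of the original text.

```python
import math

def information_bits(p):
    """Information content I = -log2(p), in bits, of an event with probability p (Eqn. 2)."""
    return -math.log2(p)

# Example 1: symbols A, B, C, D occur with probabilities 1/2, 1/4, 1/8, 1/8.
probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}

# Information in the message X = BDA; independence makes the measure additive (Eqn. 3).
message = "BDA"
total = sum(information_bits(probs[s]) for s in message)
print(total)  # 2 + 3 + 1 = 6.0 bits
```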
A Note on Logarithms: In information theory we often wish to compute the base-2 logarithms of quantities, but most calculators only offer decimal (base-10) logarithms. So the following conversion is useful: log2 X ≈ 3.322 log10 X.

Entropy of Ensembles

We now move from considering the information content of a single event or message to that of an ensemble. An ensemble is the set of outcomes of one or more random variables. The outcomes have probabilities attached to them. In general these probabilities are non-uniform, with event i having probability pi, but they must sum to 1 because all possible outcomes are included; hence they form a probability distribution:

Σi pi = 1 -----(4)

The entropy of an ensemble is simply the average information of all the elements in it. We can compute this average by weighting each of the log2 (1/pi) contributions by its probability pi:

H = Σi pi log2 (1/pi) = -Σi pi log2 pi bits -----(5)

Eqn. (5) allows us to speak of the information content or the entropy of a random variable, from knowledge of the probability distribution that it obeys. (Entropy does not depend upon the actual values taken by the random variable, only upon their relative probabilities.)

Example 2
Determine and graph the entropy (i.e., average information) of the binary code in which the probabilities of occurrence of the two symbols are p and q = 1 - p.

Solution: Using eq. (5) we have

H = -p log2 p - (1 - p) log2 (1 - p)

A plot of H versus p shows that the maximum entropy is one bit/symbol and occurs for the equiprobable case (p = q = 1/2).

Joint entropy of X and Y

H(X, Y) = Σx,y P(x, y) log2 [1 / P(x, y)] -----(6)

From this definition, it follows that joint entropy is additive if X and Y are independent random variables:

H(X, Y) = H(X) + H(Y) if P(x, y) = P(x) P(y)

The conditional entropy of an ensemble X, given that y = bj, measures the uncertainty remaining about random variable X after specifying that random variable Y has taken on the particular value y = bj. It is defined naturally as the entropy of the probability distribution p(x | y = bj):

H(X | y = bj) = Σx p(x | y = bj) log2 [1 / p(x | y = bj)] -----(7)

If we now consider the above quantity averaged over all possible outcomes that Y might have, each weighted by its probability p(y), then we arrive at the conditional entropy of X given Y:

H(X | Y) = Σy p(y) H(X | y) = Σx,y p(x, y) log2 [1 / p(x | y)] -----(8)

Conditional and Joint Entropy

If X and Y are random variables representing the input and output of a channel, respectively, then the conditional entropy (meaning the average uncertainty of the received symbol given that X was transmitted) is

H(Y | X) = -Σi,j p(xi, yj) log2 p(yj | xi),

the joint entropy (meaning the average uncertainty of the total information system) is

H(X, Y) = -Σi,j p(xi, yj) log2 p(xi, yj),

and the equivocation entropy (meaning the average uncertainty of the transmitted symbol after a symbol is received) is

H(X | Y) = -Σi,j p(xi, yj) log2 p(xi | yj).

The notation p(A, B) means the probability of A and B both occurring, while p(A | B) means the probability of A occurring given that B has occurred. An important relationship is

H(X, Y) = H(X | Y) + H(Y) = H(Y | X) + H(X).

Mutual Information between X and Y

The mutual information between two random variables measures the amount of information that one conveys about the other. Equivalently, it measures the average reduction in uncertainty about X that results from learning about Y. It is defined as

I(X; Y) = Σx,y p(x, y) log2 [ p(x, y) / (p(x) p(y)) ] -----(9)

Clearly X says as much about Y as Y says about X. Note that in case X and Y are independent random variables, the numerator inside the logarithm equals the denominator.
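To make Eqns. (5)-(8) concrete, here is a minimal Python sketch that evaluates the binary entropy of Example 2 and checks the chain rule H(X, Y) = H(Y | X) + H(X) on a small channel. The entropy helper and the joint distribution numbers are illustrative assumptions, not taken from the text.

```python
import math

def entropy(probs):
    """H = -sum p * log2 p over a probability distribution (Eqn. 5), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example 2: binary source with symbol probabilities p and q = 1 - p.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(p, entropy([p, 1 - p]))   # maximum of 1 bit/symbol at p = 0.5

# Joint and conditional entropy from a hypothetical joint distribution p(x, y)
# for a channel with input X and output Y.  Rows index x, columns index y.
joint = [[0.30, 0.10],
         [0.05, 0.55]]

p_x = [sum(row) for row in joint]                       # marginal p(x)
H_X = entropy(p_x)
H_XY = entropy([pxy for row in joint for pxy in row])   # H(X, Y), Eqn. (6)

# H(Y | X) = -sum p(x, y) log2 p(y | x), with p(y | x) = p(x, y) / p(x)
H_Y_given_X = -sum(pxy * math.log2(pxy / p_x[i])
                   for i, row in enumerate(joint) for pxy in row if pxy > 0)

# Chain rule check: H(X, Y) = H(Y | X) + H(X)
print(H_XY, H_Y_given_X + H_X)
```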
Then the log term vanishes, and the mutual information equals zero, as one should expect.

Non-negativity: mutual information is always ≥ 0. In the event that the two random variables are perfectly correlated, their mutual information is the entropy of either one alone. (Another way to say this is: I(X; X) = H(X); the mutual information of a random variable with itself is just its entropy. For this reason, the entropy H(X) of a random variable X is sometimes referred to as its self-information.)

These properties are reflected in three equivalent definitions for the mutual information between X and Y:

I(X; Y) = H(X) - H(X | Y) -----(10)
I(X; Y) = H(Y) - H(Y | X) = I(Y; X) -----(11)
I(X; Y) = H(X) + H(Y) - H(X, Y) -----(12)

In a sense the mutual information I(X; Y) is the intersection between H(X) and H(Y), since it represents their statistical dependence. In a Venn diagram, the portion of H(X) that does not lie within I(X; Y) is just H(X | Y), and the portion of H(Y) that does not lie within I(X; Y) is just H(Y | X).
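The agreement between Eqns. (9) and (12), and the vanishing of mutual information under independence, can be checked numerically. The following Python sketch reuses a hypothetical joint distribution; the numbers are illustrative assumptions, not values from the text.

```python
import math

def entropy(probs):
    """H = -sum p * log2 p, in bits (Eqn. 5)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X; Y) = sum p(x, y) log2 [p(x, y) / (p(x) p(y))]  (Eqn. 9)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log2(pxy / (p_x[i] * p_y[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

# Hypothetical joint distribution p(x, y) for a noisy channel.
joint = [[0.30, 0.10],
         [0.05, 0.55]]

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]
H_X, H_Y = entropy(p_x), entropy(p_y)
H_XY = entropy([p for row in joint for p in row])

# Eqn. (9) agrees with Eqn. (12): I(X; Y) = H(X) + H(Y) - H(X, Y)
print(mutual_information(joint), H_X + H_Y - H_XY)

# Independent variables: p(x, y) = p(x) p(y), so I(X; Y) = 0.
independent = [[px * py for py in p_y] for px in p_x]
print(mutual_information(independent))  # 0.0 (up to rounding)
```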