JOSE C. PRINCIPE, U. of Florida, EEL 6935: Information Theory
352-392-2662, [email protected], copyright 2009

Claude Shannon was interested in the optimal design of communication systems. He realized that the solution lay not in the many physical components of the system, but in how to quantify and constrain the flow of messages from transmitter to receiver so that, for a given power level, the channel noise corrupts the electrical signals as little as possible.

[Block diagram: pool of all possible messages, Encoder (messages to bits), Modulation s(t), Channel with noise v(t), received x(t), Demodulation, Decoder (bits to messages), Receiver.]

Instead of using the concept of optimal filtering championed by Norbert Wiener, his idea was to "distort" (code) the messages to make them robust to the noise, and to recover them (decode) at the receiver (he borrowed from Wiener the idea that messages are stochastic).

How to do this optimally was anyone's best guess! The following questions required answers:
• What do you preserve in messages, and how do you quantify it?
• How do you best distort a signal to withstand noise?
• How do you select the best transmission rate?
• How do you optimally recover signals corrupted by noise?

Shannon presented answers to ALL of the above:
• He derived entropy to describe the uncertainty in a message set.
• From entropy he enunciated the Source Coding Theorem.
• He proposed mutual information (a kind of entropy for pairs of variables) and from it derived the Channel Capacity Theorem.
• He used mutual information again to derive the Rate Distortion Theorem, which creates signals optimally within a given distortion.

[Figure 1.4. An information theoretic view of optimal communications: the source X with entropy H(X) is compressed (rate distortion, min I(X, X̂)), channel-coded for error correction (channel capacity, max I(X̂, Y)), transmitted over the channel, decoded, and decompressed back to the source X.]

So he basically solved, single-handedly, the optimal design of communication systems! Note the fundamental role played by entropy and mutual information in this theory. We will be concentrating on these definitions in this course.

The concept of entropy

Messages convey information; therefore what is needed is a way to describe information mathematically. It was clear that information should be additive for independent events, but what else? An early attempt by Hartley equated information with the number of binary questions necessary to single out one message univocally in a set of N messages, which translates to

I_H = \log_2 N

He called I_H the amount of information. Note that there are no probabilities involved in Hartley's definition, just counting.

Shannon picked up on this idea and brought probability into the picture. He reasoned that the probabilities of the events mattered. Let us define a random variable X with a set of possible outcomes S_X = {s_1, ..., s_N} having probabilities p = {p_1, ..., p_N}, with p(x = s_i) = p_i, p_i \ge 0, \sum_i p_i = 1. So he defined the amount of information of the i-th outcome as

I_i = \log_2 \frac{1}{p_i}

which is a random variable that measures the degree of "surprise" carried by the message (i.e., when you expect an event, P(x) \approx 1, there is really no information, but when P(x) \approx 0 the information is high).
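To make the two definitions concrete, here is a minimal NumPy sketch (the probabilities are illustrative values chosen for this example, not taken from the course notes) contrasting Hartley's count-based measure with Shannon's probability-weighted surprise.

```python
import numpy as np

# Hartley: information needed to single out one of N equally likely messages.
N = 8
I_hartley = np.log2(N)   # 3 bits: three binary questions identify one of 8 messages

# Shannon: the "surprise" of each individual outcome depends on its probability.
p = np.array([0.5, 0.25, 0.125, 0.125])   # illustrative PMF over four messages
surprise = np.log2(1.0 / p)               # I_i = log2(1 / p_i), in bits

print(f"Hartley information for N = {N}: {I_hartley:.1f} bits")
for pi, Ii in zip(p, surprise):
    print(f"  p = {pi:.3f}  ->  surprise = {Ii:.2f} bits")
```

For a uniform PMF over N messages the two notions coincide, since every outcome then carries a surprise of exactly log2 N bits.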
When we are interested in a single scalar that translates the information of a data set, we should weight the amount of information by the probabilities, i.e.

H(X) = -\sum_x p(x) \log_2 p(x) = -E[\log_2 p(X)]

He called this quantity ENTROPY, and it measures the uncertainty in the ensemble. It is the combination of the unexpectedness of a message weighted by its probability that makes up the concept of uncertainty.

[Plot: the surprise I(p) and the entropy H(p) of a binary variable versus p; the entropy reaches its maximum at p = 0.5 (maximum entropy).]

So we can say that not all random variables are "equally" random. In fact, if we draw realizations from a r.v. with a peaky PMF, we will see many repeated values, while if we draw realizations from a flat PMF, we will hardly see any repeated values.

[Plot: two PMFs f_X(x), one peaky and one flat, with sample realizations {0, 0.1, -0.2, 0, 0, ...} and {-1, 2, 0, 1.1, -0.1, ...} respectively.]

The single number that captures this concept is entropy, and it is probably the most interesting scalar that describes the PMF (or PDF). But it is not the only one; we normally also use the mean and variance.

Defining entropy by axioms: Shannon entropy is the only function that obeys the following axioms (in the usual formulation: continuity, symmetry, maximality at the uniform distribution, and recursivity). Entropy is a concave function. The 4th property, called recursivity, singles out Shannon's entropy from the others.

Why is entropy so interesting?

Because it quantifies remarkably well the effective "volume" of the data set in high-dimensional spaces.

Asymptotic equipartition principle: For an ensemble of N i.i.d. random variables X = (X_1, ..., X_N) with N sufficiently large, the outcome x = (x_1, ..., x_N) is almost certain to belong to a subset of 2^{N H(X)} members, all having probability close to 2^{-N H(X)}.

This means that there is a subset of elements (whose size is specified by H(X)), the typical set, that captures the probability of the ensemble and controls the behavior of the distribution. This fact is the fundamental piece of source coding.

[Plot: convergence of the probability of the typical sequences (Bernoulli, p = 0.2); -(1/n) log P versus sequence length, from 50 to 600.]
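The convergence shown in the figure can be reproduced with a short Monte Carlo sketch (my own illustration, assuming the same Bernoulli source with p = 0.2): draw one sequence for each length n, evaluate -(1/n) log2 of its probability under the true model, and compare against H(0.2) ≈ 0.722 bits.

```python
import numpy as np

def bernoulli_entropy(p):
    """Entropy of a Bernoulli(p) source, in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

rng = np.random.default_rng(0)
p = 0.2
H = bernoulli_entropy(p)   # approximately 0.722 bits

for n in (50, 100, 200, 400, 600):
    x = rng.random(n) < p                                    # one Bernoulli(p) sequence of length n
    k = int(x.sum())                                         # number of ones in the sequence
    log2_prob = k * np.log2(p) + (n - k) * np.log2(1 - p)    # log2 P(x_1, ..., x_n)
    print(f"n = {n:3d}:  -(1/n) log2 P = {-log2_prob / n:.3f}   (H(X) = {H:.3f})")
```

By the law of large numbers -(1/n) log2 P(X_1, ..., X_n) converges to H(X), which is exactly why nearly all of the probability concentrates on roughly 2^{N H(X)} typical sequences.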
The concept of mutual information

Mutual information is an extension of entropy (uncertainty) to pairs of variables (it is the key concept in the transmission of information over a channel). The question is the following: we have uncertain messages X being sent through the channel, which is noisy, leading to an observation Y (the noise creates a joint distribution P(X, Y)). Since by observing y_i the probability of x_k is p(x_k | y_i), the uncertainty left in x_k becomes \log(1 / p(x_k \mid y_i)). Therefore the decrease in uncertainty about x_k obtained by observing y_i is

I(x_k, y_i) \equiv \log_2 \frac{1}{p(x_k)} - \log_2 \frac{1}{p(x_k \mid y_i)} = \log_2 \frac{p(x_k \mid y_i)}{p(x_k)}

and can also be thought of as a gain of information. Note that if x_k and y_i are independent, p(x_k | y_i) = p(x_k) and I(x_k, y_i) = 0, which means we learn nothing about x_k by observing y_i. On the other hand, if they are perfectly correlated, we know everything about x_k (the conditional probability is 1) and the mutual information defaults to the uncertainty of x_k, which on average is H(X).

Mutual information is defined as the expected value of this quantity:

I(X, Y) = E[I(x_k, y_i)] = \sum_i \sum_k p(x_k, y_i) \log_2 \frac{p(x_k \mid y_i)}{p(x_k)} = \sum_i \sum_k p(x_k, y_i) \log_2 \frac{p(x_k, y_i)}{p(x_k)\, p(y_i)}

The joint entropy of a pair of random variables X and Y is defined as

H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y) = -E_{X,Y}[\log p(X, Y)]

and the conditional entropy as

H(X \mid Y) = -\sum_x \sum_y p(x, y) \log p(x \mid y) = -E_{X,Y}[\log p(X \mid Y)]

Therefore mutual information can also be formally defined as

I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)

and visualized in a Venn diagram.

[Venn diagram: H(X, Y) is the union of H(X) and H(Y); H(X|Y) and H(Y|X) are the non-overlapping parts, and I(X, Y) is the overlap.]

The Kullback-Leibler divergence

In statistics, Kullback proposed a dissimilarity measure between two PMFs (or PDFs) p(x) and q(x):

D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = E_p\left[\log \frac{p(X)}{q(X)}\right]

This is not a distance because, of the three properties of a distance (positivity, symmetry, and the triangle inequality), it only obeys positivity; it is therefore called a divergence, or directed distance. Jeffreys proposed the following positive and symmetric divergence:

J_{div}(p, q) = \frac{1}{2} D_{KL}(p \| q) + \frac{1}{2} D_{KL}(q \| p)

[Plots: the Euclidean distance, D(p||q), D(q||p), and J_div between two distributions, shown side by side for comparison.]

The K-L divergence also has an information-theoretic interpretation as relative entropy. In fact, it tells us the gain of information obtained by replacing the (a priori) distribution q(x) with the posterior distribution p(x). It is perhaps the fundamental concept in information theory, because entropy can be derived from it and it can also be defined for incomplete distributions, as done by Renyi. The relative entropy has very important properties:
• It is invariant to a change of variables (reparameterization) because of its close ties to the Fisher-Rao metric.
• It is monotonically decreasing under stochastic maps (like Markov chains): D_{KL}(TX \| TY) \le D_{KL}(X \| Y).
• It is doubly convex (jointly convex in its two arguments).
• Mutual information is the K-L divergence between the joint distribution and the product of the marginals, and it happens to be symmetric: D_{KL}(p(X, Y) \| p(X)\, p(Y)) = I(X, Y).

Extensions to continuous variables

Although the extension of entropy and relative entropy to continuous variables is nontrivial, it has been carried out mathematically, and the expressions become

H(X) = -\int p(x) \log p(x)\, dx = -E_X[\log p(X)]

I(X, Y) = \int\!\!\int I(x, y)\, p(x, y)\, dx\, dy = E_{X,Y}[I(x, y)]

D_{KL}(f \| g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx = E_X\left[\log \frac{f(X)}{g(X)}\right]

Entropy for continuous variables may be negative.
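A small numerical sketch (my own illustration with a made-up 2x2 joint PMF) verifies the identities above, I(X, Y) = H(X) - H(X|Y) and I(X, Y) = D_KL(p(x, y) || p(x)p(y)), and evaluates the closed-form differential entropy of a narrow Gaussian to confirm that continuous entropy can indeed be negative.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (in bits) of a PMF given as an array; zero entries are ignored."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint PMF of two binary variables X (rows) and Y (columns).
pxy = np.array([[0.30, 0.10],
                [0.05, 0.55]])
px = pxy.sum(axis=1)   # marginal of X
py = pxy.sum(axis=0)   # marginal of Y

# Conditional entropy via the chain rule: H(X|Y) = H(X, Y) - H(Y).
H_x_given_y = entropy_bits(pxy.ravel()) - entropy_bits(py)

# Mutual information two ways.
I_from_entropies = entropy_bits(px) - H_x_given_y
I_as_kl = np.sum(pxy * np.log2(pxy / np.outer(px, py)))
print(f"I = H(X) - H(X|Y)        = {I_from_entropies:.4f} bits")
print(f"I = D_KL(joint||product) = {I_as_kl:.4f} bits")

# Differential entropy of a Gaussian N(0, sigma^2) is 0.5 * log2(2*pi*e*sigma^2).
sigma = 0.1
h_gauss = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(f"h(N(0, {sigma}^2)) = {h_gauss:.2f} bits (negative for small enough variance)")
```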
Generalized definitions of entropy and divergence

Many generalizations have been proposed throughout the years; the first was attempted by Alfred Renyi, which we will study in Chapter 2. Salicru defined the (h, φ) entropies as

H_\varphi^h(X) = h\left( \int \varphi(f(x))\, dx \right)

where φ is a continuous concave (convex) real function and h is a differentiable and increasing (decreasing) real function; most of the definitions in the literature can be derived from this form for specific choices of h and φ.

Likewise, Csiszar proposed the φ divergences, later extended by Menendez to the (h, φ) divergences:

D_\varphi^h(f, g) = h\left( D_\varphi(f, g) \right), \quad D_\varphi(f, g) = \int g(x)\, \varphi\left( \frac{f(x)}{g(x)} \right) dx

The current view is that entropy quantifies properties of the class of concave functions, so the subject is very abstract and open to many possible refinements for specific functions that translate the problem at hand well. Likewise, divergence measures dissimilarity in probability spaces, and depending upon the problem many possible measures can be derived.

Information beyond Communication theory

In statistical signal processing, learning theory, and statistics, information can play a major role, because it is a "macroscopic" descriptor of the data that possesses properties beyond the moment expansions. However, in learning theory we have to realize that the communication setting is not rich enough. In communications, the receiver knows exactly the joint PDF of the received and transmitted signals. In learning, the system does not know the joint PDF of the input and the desired response (or the PDF of the input, if unsupervised); the problem is exactly how to best estimate these quantities.

Shannon left "meaning" out of the formulation on purpose, but in learning systems (biological or man-made) meaning is crucial, because the state of the system evolves in time, so the amount of information contained in messages changes with the state. This is now being recognized in active learning and may bring information theory back into the mainstream of statistical thinking.
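To close with a concrete look at the generalized definitions above, here is a short sketch (my own illustration, using discrete sums in place of the integrals) showing how specific choices of h and φ recover familiar quantities: φ(t) = -t log2 t with h(u) = u gives Shannon entropy, φ(t) = t^α with h(u) = log2(u)/(1 - α) gives Renyi entropy of order α, and the Csiszar φ-divergence with φ(t) = t log2 t reduces to the Kullback-Leibler divergence.

```python
import numpy as np

def h_phi_entropy(p, phi, h):
    """Discrete analogue of Salicru's (h, phi) entropy: h( sum_i phi(p_i) )."""
    return h(np.sum(phi(p)))

def csiszar_divergence(p, q, phi):
    """Discrete Csiszar phi-divergence: sum_i q_i * phi(p_i / q_i)."""
    return np.sum(q * phi(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
alpha = 2.0

# Shannon entropy: phi(t) = -t*log2(t), h(u) = u.
H_shannon = h_phi_entropy(p, lambda t: -t * np.log2(t), lambda u: u)

# Renyi entropy of order alpha: phi(t) = t**alpha, h(u) = log2(u) / (1 - alpha).
H_renyi = h_phi_entropy(p, lambda t: t**alpha, lambda u: np.log2(u) / (1 - alpha))

# Kullback-Leibler divergence from the phi-divergence with phi(t) = t*log2(t).
D_kl = csiszar_divergence(p, q, lambda t: t * np.log2(t))

print(f"Shannon H(p)  = {H_shannon:.4f} bits")
print(f"Renyi H_2(p)  = {H_renyi:.4f} bits")
print(f"KL D(p || q)  = {D_kl:.4f} bits")
```

Note that for α > 1 the function φ(t) = t^α is convex and h is decreasing, while for α < 1 φ is concave and h is increasing, matching the concave/convex and increasing/decreasing pairing in Salicru's definition.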