Information Theory
EEL 6935
JOSE C. PRINCIPE
U. OF FLORIDA
352-392-2662
[email protected]
copyright 2009
Claude Shannon was interested in the optimal design of communication systems. But he realized that the solution lay not in addressing the many physical components of the system, but in how to quantify and constrain the flow of messages from the transmitter to the receiver in such a way that the noise in the channel would corrupt the electrical signals as little as possible, given a certain power level.
[Block diagram: a pool of all possible messages feeds the transmitter (encoder → bits → modulation → s(t)); the channel adds noise v(t); the receiver observes x(t) and recovers the messages (demodulation → bits → decoder).]
Instead of using the concept of optimal filtering championed by
Norbert Wiener, his idea was to “distort” (code) the messages to
make them robust to the noise, and recover them (decode) at the
receiver (he borrowed from Wiener the idea that messages are stochastic).
How to do this optimally was anyone’s best guess!
The following questions required answers:
• What do you preserve in messages, and how do you quantify it?
• How do you best distort a signal to withstand noise?
• How do you select the best transmission rate?
• How do you optimally recover signals corrupted by noise?
Shannon presented answers to ALL of the above:
• He derived entropy to describe the uncertainty in a message set
• From entropy he enunciated the Source Coding Theorem
• He proposed mutual information (a kind of entropy for pairs of variables) and from it derived the Channel Capacity Theorem
• He used mutual information again to derive the Rate Distortion Theorem, to optimally create signals within a given distortion.
[Figure 1.4. An information theoretic view of optimal communications: the source X with entropy H(X) is compressed (rate distortion: min I(X, X̂)), coded for error correction over the channel (channel capacity: max I(X̂, Y)), then decoded and decompressed to recover the source.]
So he basically single-handedly solved the optimal design of communication systems!
Note the fundamental role played by entropy and mutual information in this theory. We will be concentrating on these definitions in this course.
The concept of entropy
Messages convey information, so what is needed is a way to describe information mathematically. It was clear that information should be additive for independent events, but what else?
An early attempt by Hartley equated information with the number of binary questions necessary to uniquely identify one message in a set of N messages, which translates to

$I_H = \log_2 N$

He called $I_H$ the amount of information. Note that there are no probabilities involved in Hartley’s definition, just counting.
Shannon picked up on this idea and brought probability into the picture. He reasoned that the probabilities of the events mattered. Let us define a random variable $x$ with a set of possible outcomes $S_X = \{s_1, \dots, s_N\}$ having probabilities $p = \{p_1, \dots, p_N\}$, with $p(x = s_i) = p_i$, $p_i \ge 0$, $\sum_i p_i = 1$. So he defined the amount of information as

$I_i = \log_2 \frac{1}{p_i}$
which is a random variable that measures the degree of “surprise” carried by the message (i.e., when you expect an event, $P(x) \approx 1$, there is really no information, but when $P(x) \approx 0$ the information is high).
When we are interested in a scalar that summarizes the information of a data set, we should weight the amount of information by the probabilities, i.e.

$H(X) = -\sum_x p(x) \log_2 p(x) = -E[\log_2 p(X)]$

He called this quantity ENTROPY, and it measures the uncertainty in the ensemble. It is the combination of the unexpectedness of a message weighted by its probability that makes up the concept of uncertainty.
[Plot: the amount of information I(p) and the entropy H(p) of a binary variable as a function of p; the entropy is maximum at p = 0.5.]
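Since the course works with these definitions throughout, here is a minimal numerical sketch (my own addition, in Python with NumPy only) of the discrete entropy formula above, reproducing the shape of the H(p) curve for a binary variable.

```python
import numpy as np

def entropy(pmf, base=2.0):
    """Shannon entropy H(X) = -sum_i p_i log(p_i) of a discrete PMF."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # by convention 0 log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

# Entropy of a Bernoulli(p) variable for a few values of p
for p in [0.01, 0.1, 0.3, 0.5, 0.9]:
    print(f"p = {p:4.2f}  H = {entropy([p, 1 - p]):.4f} bits")
# H is close to 0 bits for very peaked PMFs and reaches 1 bit at p = 0.5,
# the maximum-entropy point in the figure above.
```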
So we can say that not all random variables are “equally” random. In fact, if we draw realizations from an r.v. that has a peaky PMF, we will have many repeated values, while if we draw realizations from a flat PMF, we will hardly have repeated values.
[Plots of two distributions f_X(x): a peaky one, whose realizations repeat often (e.g. {0, 0.1, -0.2, 0, 0, ...}), and a flat one, whose realizations hardly repeat (e.g. {-1, 2, 0, 1.1, -0.1, ...}).]
The single number that captures this concept is entropy, and it is probably the most interesting scalar that describes the PMF (or PDF). But it is not the only one, e.g. we normally use the mean and variance.
Defining entropy by axioms:
Shannon entropy is the only function that obeys the following axioms: continuity in the probabilities, symmetry, maximality at the uniform distribution, and recursivity.
Entropy is a concave function.
The 4th property, recursivity, singles out Shannon’s entropy from the others.
Why is entropy so interesting?
Because it quantifies remarkably well the effective “volume” of the data set in high dimensional spaces.
Asymptotic equipartition principle:
For an ensemble of N i.i.d. random variables X = (X1, ..., XN) with N sufficiently large, the outcome x = (x1, ..., xN) is almost certain to belong to a subset of $2^{NH(X)}$ members, all having probability close to $2^{-NH(X)}$.
This means that there is a subset of elements (specified by H(X)), the typical set, that captures the probability of the set and controls the behavior of the distribution. This fact is the fundamental piece for source coding.
[Plot: convergence of the probability of the typical sequences (Bernoulli, p = 0.2); -1/n log P approaches H(X) ≈ 0.72 as the sequence length grows from 50 to 600.]
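A minimal simulation (my addition, not from the notes) mirrors the plot: for Bernoulli(p = 0.2) sequences of growing length n, the quantity -1/n log2 P(x1, ..., xn) concentrates around H(X) ≈ 0.72 bits, which is the asymptotic equipartition principle at work.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.2
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))    # entropy of Bernoulli(0.2), ~0.7219 bits

for n in [50, 100, 200, 400, 600]:
    x = rng.random(n) < p                            # one Bernoulli(p) sequence of length n
    k = x.sum()                                      # number of ones observed
    log_p_x = k * np.log2(p) + (n - k) * np.log2(1 - p)   # log2 P(x1, ..., xn)
    print(f"n = {n:3d}   -1/n log2 P(x) = {-log_p_x / n:.4f}   (H = {H:.4f})")
# The per-symbol log-probability concentrates around H(X): long sequences are typical.
```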
The concept of mutual information
Mutual information is an extension of entropy (uncertainty) to pairs of variables (it is the key concept in the transmission of information over a channel). The question is the following:
We have uncertain messages X being sent through the channel, which is noisy, leading to an observation Y (the noise creates a joint distribution P(X,Y)). Since by observing y_i the probability of x_k is p(x_k|y_i), the uncertainty left in x_k becomes log(1/p(x_k|y_i)). Therefore the decrease in uncertainty about x_k by observing y_i is
$I(x_k, y_i) \equiv \log_2 \frac{1}{p(x_k)} - \log_2 \frac{1}{p(x_k \mid y_i)} = \log_2 \frac{p(x_k \mid y_i)}{p(x_k)}$
and can also be thought of as the gain of information. Note that if x_k and y_i are independent, p(x_k|y_i) = p(x_k) and I(x_k, y_i) = 0, meaning we learn nothing about x_k by observing y_i. On the other hand, if they are perfectly dependent (the conditional probability is 1), the mutual information defaults to the uncertainty of x_k, log2(1/p(x_k)), which averages to H(X). Mutual information is defined as
$I(X,Y) = E[I(x_k, y_i)] = \sum_i \sum_k p(x_k, y_i) \log_2 \frac{p(x_k \mid y_i)}{p(x_k)} = \sum_i \sum_k p(x_k, y_i) \log_2 \frac{p(x_k, y_i)}{p(x_k)\, p(y_i)}$
The joint entropy of a pair of random variables X and Y is defined as

$H(X,Y) = -\sum_x \sum_y p(x,y) \log p(x,y) = -E_{X,Y}[\log p(X,Y)]$

and the conditional entropy as

$H(X \mid Y) = -\sum_x \sum_y p(x,y) \log p(x \mid y) = -E_{X,Y}[\log p(X \mid Y)]$

(and similarly for $H(Y \mid X)$),
therefore mutual information can also be formally defined as

$I(X,Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
and visualized in the Venn diagram
[Venn diagram: H(X,Y) is the union of H(X) and H(Y); H(X|Y) and H(Y|X) are the non-overlapping parts, and I(X,Y) is the overlap.]
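To make the two equivalent expressions concrete, here is a small Python check (my own sketch; the 2x2 joint PMF is made up): mutual information computed from the joint/marginal ratio matches H(X) + H(Y) - H(X,Y), i.e., H(X) - H(X|Y).

```python
import numpy as np

# Hypothetical 2x2 joint PMF p(x, y) (rows index x, columns index y)
pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])
px = pxy.sum(axis=1)                  # marginal p(x)
py = pxy.sum(axis=0)                  # marginal p(y)

def H(p):
    """Shannon entropy in bits of a PMF given as an array."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Direct definition: I(X;Y) = sum_xy p(x,y) log2[ p(x,y) / (p(x) p(y)) ]
I_direct = np.sum(pxy * np.log2(pxy / np.outer(px, py)))

# Entropy identity: I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y)
I_entropy = H(px) + H(py) - H(pxy)

print(I_direct, I_entropy)            # both ~0.18 bits
```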
The Kullback-Leibler divergence
In statistics, Kullback proposed a dissimilarity measure between two PMFs (or PDFs) p(x) and q(x) as

$D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = E_p\!\left[\log \frac{p(X)}{q(X)}\right]$
This is not a distance because, of the three properties of a distance (positivity, symmetry and the triangle inequality), it only obeys positivity; it is therefore called a divergence or directed distance. Jeffreys proposed the following positive and symmetric divergence

$J_{div}(p, q) = \tfrac{1}{2} D_{KL}(p \| q) + \tfrac{1}{2} D_{KL}(q \| p)$
[Figure: surface plots labeled Euclidean, D(p||q), D(q||p), and Jdiv, illustrating the asymmetry of the KL divergence and the symmetry of Jdiv.]
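As an illustration (my own sketch with made-up PMFs p and q), the asymmetry of the KL divergence and the symmetry of the Jeffreys divergence can be checked in a few lines:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) = sum_x p(x) log2(p(x) / q(x)), assuming q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.1, 0.2, 0.7])
q = np.array([0.3, 0.4, 0.3])

print("D(p||q) =", kl(p, q))                         # differs from D(q||p): not symmetric
print("D(q||p) =", kl(q, p))
print("Jdiv    =", 0.5 * kl(p, q) + 0.5 * kl(q, p))  # Jeffreys: symmetric by construction
```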
The K-L divergence also has an information-theoretic interpretation as relative entropy. In fact, it tells us the gain of information obtained by replacing the (a priori) distribution q(x) by the posterior distribution p(x). It is perhaps the fundamental concept in information theory, because entropy can be derived from it and it can also be defined for incomplete distributions, as done by Rényi.
The relative entropy has very important properties:
• It is invariant to a change of variables (reparameterization) because of its close ties to the Fisher-Rao metric.
• It is monotonically non-increasing under stochastic maps (like Markov chains):
  $D_{KL}(TX \| TY) \le D_{KL}(X \| Y)$
• It is doubly convex (convex in both of its arguments).
• Mutual information is the K-L divergence between the joint and the product of the marginals, and happens to be symmetric (checked numerically in the sketch below):
  $D_{KL}(p(X,Y) \| p(X)\,p(Y)) = I(X,Y)$
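The last property can be verified directly; here is a short sketch (my addition, reusing the made-up 2x2 joint PMF from the mutual-information example above).

```python
import numpy as np

# Same hypothetical 2x2 joint PMF used in the mutual-information sketch above
pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

joint = pxy.ravel()
product = np.outer(px, py).ravel()                 # product of the marginals p(x) p(y)

D = np.sum(joint * np.log2(joint / product))       # D_KL(p(X,Y) || p(X) p(Y))
print(D)                                           # equals I(X;Y), ~0.18 bits
```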
Extensions to continuous variables
Although the extensions of entropy and relative entropy to continuous variables are nontrivial, they can be carried out mathematically, and the expressions become

$H(X) = -\int p(x) \log p(x)\, dx = -E_X[\log p(X)]$

$I(X,Y) = \int\!\!\int I(x,y)\, p(x,y)\, dx\, dy = E_{X,Y}[I(x,y)]$

$D_{KL}(f \| g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx = E_f\!\left[\log \frac{f(X)}{g(X)}\right]$

Entropy for continuous variables (differential entropy) may be negative.
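As a quick illustration of the last point (my addition, using the standard closed form for a Gaussian density), the differential entropy 0.5 log2(2 pi e sigma^2) turns negative once sigma is small enough:

```python
import numpy as np

def gaussian_diff_entropy(sigma):
    """Differential entropy of N(0, sigma^2) in bits: 0.5 * log2(2 * pi * e * sigma^2)."""
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

for sigma in [1.0, 0.5, 0.1, 0.01]:
    print(f"sigma = {sigma:5.2f}   h(X) = {gaussian_diff_entropy(sigma):7.3f} bits")
# sigma = 1 gives ~2.05 bits; for small enough sigma the differential entropy is negative.
```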
Generalized definitions of entropy and divergence
Many generalizations have been proposed over the years; the first was by Alfréd Rényi, which we will study in Chapter 2.
Salicrú defined the (h, φ) entropies as

$H_\phi^h(X) = h\!\left(\int \phi(f(x))\, dx\right)$

where φ is a continuous concave (convex) real function and h is a differentiable and increasing (decreasing) real function; most of the definitions in the literature can be derived from this form for specific choices of h and φ.
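For concreteness, a discrete sketch (my own; the specific choices of h and φ are standard assumptions, not from the notes) showing how the Shannon and Rényi entropies fall out of the (h, φ) form:

```python
import numpy as np

def h_phi_entropy(p, phi, h):
    """Discrete analogue of the Salicru (h, phi)-entropy: H = h( sum_i phi(p_i) )."""
    p = np.asarray(p, float)
    return h(np.sum(phi(p)))

p = np.array([0.1, 0.2, 0.3, 0.4])

# Shannon entropy: phi(t) = -t log t, h(x) = x
shannon = h_phi_entropy(p, lambda t: -t * np.log(t), lambda x: x)

# Renyi entropy of order alpha: phi(t) = t**alpha, h(x) = log(x) / (1 - alpha)
alpha = 2.0
renyi = h_phi_entropy(p, lambda t: t**alpha, lambda x: np.log(x) / (1 - alpha))

print(shannon, renyi)   # both in nats; the Renyi-2 value is -log(sum_i p_i^2)
```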
Likewise, Csiszár proposed the φ-divergences, extended later by Menéndez to the (h, φ) divergences:

$D_\phi^h(f, g) = h\!\left(D_\phi(f, g)\right), \qquad D_\phi(f, g) = \int g(x)\, \phi\!\left(\frac{f(x)}{g(x)}\right) dx$
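Again as a discrete sketch (my addition), choosing φ(t) = t log t in the Csiszár form recovers the Kullback-Leibler divergence:

```python
import numpy as np

def csiszar_divergence(p, q, phi):
    """Discrete Csiszar phi-divergence: D_phi(p, q) = sum_x q(x) phi( p(x) / q(x) )."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(q * phi(p / q))

p = np.array([0.1, 0.2, 0.7])
q = np.array([0.3, 0.4, 0.3])

# phi(t) = t log t gives sum_x q (p/q) log(p/q) = sum_x p log(p/q) = D_KL(p || q) in nats
kl_from_phi = csiszar_divergence(p, q, lambda t: t * np.log(t))
kl_direct = np.sum(p * np.log(p / q))
print(kl_from_phi, kl_direct)   # identical
```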
The current view is that entropy quantifies properties of the class of concave functions, so the subject is very abstract and open to many possible refinements for specific functions that suit the problem at hand.
Likewise, divergence measures dissimilarity in probability spaces, and depending upon the problem many possible measures can be derived.
Information beyond Communication theory
In statistical signal processing, learning theory and statistics, information can play a major role, because it is a “macroscopic” descriptor of the data that possesses properties beyond the moment expansions.
However, in learning theory we have to realize that the communication setting is not rich enough. In fact, in communications the receiver knows exactly the joint PDF of the received and transmitted signals. In learning, the system does not know the joint PDF of the input and the desired response (or the PDF of the input, if unsupervised). The problem is precisely how to best estimate these quantities.
Shannon left “meaning” out of the formulation on purpose, but in learning systems (biological or man-made) meaning is crucial, because the state of the system evolves in time, so the amount of information contained in messages changes with the state. This is now being recognized in active learning and may bring information theory back into the mainstream of statistical thinking.