Chapter 3: Asymptotic Equipartition Property
University of Illinois at Chicago ECE 534, Fall 2009, Natasha Devroye
Chapter 3 outline
• Strong vs. Weak Typicality
• Convergence
• Asymptotic Equipartition Property Theorem
• High-probability sets and the “typical set”
• Consequences of the AEP: data compression
Strong versus Weak Typicality
• Intuition behind typicality?
Another example of Typicality
• Bit-sequences of length n = 8, prob(1) = p (prob(0) = (1-p))
• Strong typicality?
• Weak typicality?
• What if p=0.5?
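A small sketch (not from the slides; the threshold eps and helper names are illustrative) that classifies all length-8 bit-strings as weakly or strongly typical for a Bernoulli(p) source:

```python
# Sketch (not from the slides): classify length-n bit-strings as weakly or
# strongly typical for a Bernoulli(p) source; eps is an illustrative threshold.
from itertools import product
from math import log2

def classify(n=8, p=0.1, eps=0.1):
    H = -p * log2(p) - (1 - p) * log2(1 - p)        # entropy per bit
    weak, strong = [], []
    for bits in product([0, 1], repeat=n):
        k = sum(bits)                               # number of ones
        logp = k * log2(p) + (n - k) * log2(1 - p)  # log2 of the string's probability
        if abs(-logp / n - H) <= eps:               # weak: empirical log-likelihood near H
            weak.append(bits)
        if abs(k / n - p) <= eps:                   # strong: empirical frequency near p
            strong.append(bits)
    return weak, strong

weak, strong = classify()
print(len(weak), len(strong))
# If p = 0.5, every string has probability 2**-n, so -log2(P)/n equals H exactly
# and all 2**n strings are weakly typical; strong typicality still requires the
# fraction of ones to be within eps of 0.5.
```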
University of Illinois at Chicago ECE 534, Fall 2009, Natasha Devroye
Convergence of random variables
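For reference, the mode of convergence used throughout this chapter is convergence in probability (a standard definition, not transcribed from the slide): a sequence of random variables X_1, X_2, . . . converges to X in probability if, for every ε > 0,

Pr{ |X_n − X| > ε } → 0   as n → ∞.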
Weak Law of Large Numbers + the AEP
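For reference, the two statements the slide title pairs up, in their standard forms (Cover & Thomas, Theorem 3.1.1; not transcribed from the slide): the weak law of large numbers says that for i.i.d. random variables, (1/n) Σ_i X_i → E[X] in probability. Applying it to the i.i.d. random variables log(1/p(X_i)) gives the AEP: if X_1, X_2, . . . are i.i.d. with distribution p(x), then

−(1/n) log p(X_1, X_2, . . . , X_n) → H(X)   in probability.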
Typical sets intuition
• What’s the point?
• Consider i.i.d. bit-strings of length N = 100, with prob(1) = p_1 = 0.1
• Probability of a given string x with r ones: P(x) = p_1^r (1 − p_1)^{N−r}
• Number of strings with r ones: n(r) = \binom{N}{r}
• Distribution of r, the number of ones in a string of length N: P(r) = \binom{N}{r} p_1^r (1 − p_1)^{N−r}
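A quick numerical check of these three quantities for the N = 100, p_1 = 0.1 example that follows (a sketch, not from the slides):

```python
# Sketch: evaluate P(x), n(r), and P(r) for N = 100, p1 = 0.1.
from math import comb, log2, sqrt

N, p1 = 100, 0.1
H2 = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)   # binary entropy, about 0.469 bits

def P_x(r):            # probability of one particular string with r ones
    return p1**r * (1 - p1)**(N - r)

def n_r(r):            # number of strings with r ones
    return comb(N, r)

def P_r(r):            # probability that a random string has r ones
    return n_r(r) * P_x(r)

print(N * p1, sqrt(N * p1 * (1 - p1)))   # mean and std of r: 10.0 and 3.0
print(N * H2)                            # N * H2(p1), about 46.9 bits
print(-log2(P_x(10)))                    # a string with 10 ones: about 46.9 bits too
```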
[Figure 4.11, Mackay: anatomy of the typical set T. For p_1 = 0.1 and N = 100 and N = 1000, the graphs show n(r), the number of strings containing r 1s; the probability P(x) of a single string that contains r 1s; the same probability on a log scale; and the total probability n(r)P(x) of all strings that contain r 1s, with r on the horizontal axis. The plot of log_2 P(x) also shows by a dotted line the mean value of log_2 P(x) = −N H_2(p_1), which equals −46.9 when N = 100 and −469 when N = 1000. The typical set includes only the strings that have log_2 P(x) close to this value. The range marked T shows the set T_{Nβ} (as defined in Mackay section 4.4) for N = 100, β = 0.29 (left) and N = 1000, β = 0.09 (right).]
Typical sets intuition
• What’s the point?
• Consider i.i.d. bit-strings of length N = 100, with prob(1) = p_1 = 0.1
x (100-bit samples; '.' = 0, '1' = 1)                                                                 log2 P(x)
...1...................1.....1....1.1.......1........1...........1.....................1.......11...   −50.1
......................1.....1.....1.......1....1.........1.....................................1....   −37.3
........1....1..1...1....11..1.1.........11.........................1...1.1..1...1................1.   −65.9
1.1...1................1.......................11.1..1............................1.....1..1.11.....   −56.4
...11...........1...1.....1.1......1..........1....1...1.....1............1.........................   −53.2
..............1......1.........1.1.......1..........1............1...1......................1.......   −43.7
.....1........1.......1...1............1............1...........1......1..11........................   −46.8
.....1..1..1...............111...................1...............1.........1.1...1...1.............1   −56.4
.........1..........1.....1......1..........1....1..............................................1...   −37.3
......1........................1..............1.....1..1.1.1..1...................................1.   −43.7
1.......................1..........1...1...................1....1....1........1..11..1.1...1........   −56.4
...........11.1.........1................1......1.....................1.............................   −37.3
.1..........1...1.1.............1.......11...........1.1...1..............1.............11..........   −56.4
......1...1..1.....1..11.1.1.1...1.....................1............1.............1..1..............   −59.5
............11.1......1....1..1............................1.......1..............1.......1.........   −46.8
....................................................................................................   −15.2
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111   −332.1
…except for tails close to δ = 0 and 1. As long as we are allowed a tiny probability of error δ, compression down to N H bits is possible. Even if we are allowed a large probability of error, we still can compress only down to N H bits. This is the source coding theorem.
Theorem 4.1 (Shannon's source coding theorem). Let X be an ensemble with entropy H(X) = H bits. Given ε > 0 and 0 < δ < 1, there exists a positive integer N_0 such that for N > N_0,

| (1/N) H_δ(X^N) − H | < ε.   (4.21)
Definition: weak typicality
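For reference, the standard (Cover & Thomas) form of the definition that the equations (3.12)–(3.16) below rely on: the weakly typical set A_ε^(n) with respect to p(x) is the set of sequences (x_1, . . . , x_n) ∈ X^n such that

2^{−n(H(X)+ε)} ≤ p(x_1, x_2, . . . , x_n) ≤ 2^{−n(H(X)−ε)},

or equivalently | −(1/n) log p(x_1, . . . , x_n) − H(X) | ≤ ε. Strong typicality instead requires every symbol's empirical frequency to be close to its true probability.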
4.4 Typicality
Why does increasing N help? Let's examine long strings from X^N. Table 4.10 shows fifteen samples from X^N for N = 100 and p_1 = 0.1. The probability of a string x that contains r 1s and N − r 0s is

P(x) = p_1^r (1 − p_1)^{N−r}.   (4.22)

The number of strings that contain r 1s is

n(r) = \binom{N}{r}.   (4.23)

So the number of 1s, r, has a binomial distribution:

P(r) = \binom{N}{r} p_1^r (1 − p_1)^{N−r}.   (4.24)

These functions are shown in figure 4.11. The mean of r is N p_1, and its standard deviation is \sqrt{N p_1 (1 − p_1)}. If N is 100 then

r ∼ N p_1 ± \sqrt{N p_1 (1 − p_1)} ≈ 10 ± 3.   (4.25)

Figure 4.10. The top 15 strings are samples from X^100, where p_1 = 0.1 and p_0 = 0.9. The bottom two are the most and least probable strings in this ensemble. The final column shows the log-probabilities of the random strings, which may be compared with the entropy H(X^100) = 46.9 bits.
[Mackay textbook, pg. 78]
Properties of the typical set

|A_ε^(n)| ≤ 2^{n(H(X)+ε)}.   (3.12)

Finally, for sufficiently large n, Pr{A_ε^(n)} > 1 − ε, so that

1 − ε < Pr{A_ε^(n)}   (3.13)
      ≤ Σ_{x ∈ A_ε^(n)} 2^{−n(H(X)−ε)}   (3.14)
      = 2^{−n(H(X)−ε)} |A_ε^(n)|,   (3.15)

where the second inequality follows from (3.6). Hence,

|A_ε^(n)| ≥ (1 − ε) 2^{n(H(X)−ε)},   (3.16)

which completes the proof of the properties of A_ε^(n).

The typical set visually

[Figure 4.12, Mackay pg. 81: schematic diagram showing all strings in the ensemble X^N ranked by their probability, and the typical set T_{Nβ}. The vertical axis is log_2 P(x), with the level −N H(X) and the range T_{Nβ} marked; bit sequences of length 100, prob(1) = 0.1. Sample strings marked on the axis:
1111111111110. . . 11111110111
0000100000010. . . 00001000010
0100000001000. . . 00010000000
0001000000000. . . 00000000000
0000000000000. . . 00000000000]

Most + least likely sequences NOT in typical set!!

The ‘asymptotic equipartition’ principle is equivalent to:

Shannon's source coding theorem (verbal statement). N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; conversely if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost.

These two theorems are equivalent because we can define a compression algorithm that gives a distinct name of length N H(X) bits to each x in the typical set. [Mackay pg. 81]

Proofs

4.5 Proofs
This section may be skipped if found tough going. [Mackay pg. 81]
The law of large numbers

Our proof of the source coding theorem uses the law of large numbers.

Mean and variance of a real random variable are E[u] = ū = Σ_u P(u) u and var(u) = σ_u^2 = E[(u − ū)^2] = Σ_u P(u)(u − ū)^2.

Technical note: strictly, I am assuming here that u is a function u(x) of a sample x from a finite discrete ensemble X. Then the summations Σ_u P(u) f(u) should be written Σ_x P(x) f(u(x)). This means that P(u) is a finite sum of delta functions. This restriction guarantees that the mean and variance of u do exist, which is not necessarily the case for general P(u).

Chebyshev's inequality 1. Let t be a non-negative real random variable, and let α be a positive real number. Then

P(t ≥ α) ≤ t̄ / α.   (4.30)

Proof: P(t ≥ α) = Σ_{t ≥ α} P(t). We multiply each term by t/α ≥ 1 and obtain P(t ≥ α) ≤ Σ_{t ≥ α} P(t) t/α. We add the (non-negative) missing terms and obtain P(t ≥ α) ≤ Σ_t P(t) t/α = t̄/α.
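Before turning to compression, a small simulation sketch (not from the slides; the parameters are illustrative) that checks the typical-set properties (3.12)–(3.16) numerically for a Bernoulli source:

```python
# Sketch: empirically check Pr{A_eps} and the size bounds (3.12)/(3.16)
# for a Bernoulli(p) source; p, eps, n, trials are illustrative choices.
import random
from math import comb, log2

p, eps, n, trials = 0.1, 0.1, 500, 2000
H = -p * log2(p) - (1 - p) * log2(1 - p)

def in_typical_set(k):                         # is a sequence with k ones in A_eps?
    logp = k * log2(p) + (n - k) * log2(1 - p)
    return abs(-logp / n - H) <= eps

# Estimate Pr{A_eps} by sampling sequences (approaches 1 as n grows)
hits = sum(in_typical_set(sum(random.random() < p for _ in range(n)))
           for _ in range(trials))
print("estimated Pr{A_eps}:", hits / trials)

# Exact size of A_eps, obtained by summing counts of sequences with k ones
size = sum(comb(n, k) for k in range(n + 1) if in_typical_set(k))
print("size bounds hold:",
      (1 - eps) * 2**(n * (H - eps)) <= size <= 2**(n * (H + eps)))
```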
Consequences of the AEP
3.2 CONSEQUENCES OF THE AEP: DATA COMPRESSION

Let X_1, X_2, . . . , X_n be independent, identically distributed random variables drawn from the probability mass function p(x). We wish to find short descriptions for such sequences of random variables. We divide all sequences in X^n into two sets: the typical set A_ε^(n) and its complement, as shown in Figure 3.1.

Typical set contains almost all the probability!

[Figure 3.1, Cover+Thomas pg. 60: typical sets and source coding. The set of all sequences X^n (|X|^n elements) is split into the non-typical set and the typical set A_ε^(n) (≤ 2^{n(H+ε)} elements).]

How many are in this set? Useful for source coding (compression)!

[Figure 3.2, Cover+Thomas: source code using the typical set. Non-typical set: description n log|X| + 2 bits. Typical set: description n(H + ε) + 2 bits.]
We order all elements in each set according to some order (e.g., lexicographic order). Then we can represent each sequence of A_ε^(n) by giving the index of the sequence in the set. Since there are ≤ 2^{n(H+ε)} sequences in A_ε^(n), the indexing requires no more than n(H + ε) + 1 bits. [The extra bit may be necessary because n(H + ε) may not be an integer.] We prefix all these sequences by a 0, giving a total length of ≤ n(H + ε) + 2 bits to represent each sequence in A_ε^(n) (see Figure 3.2). Similarly, we can index each sequence not in A_ε^(n) by using not more than n log|X| + 1 bits. Prefixing these indices by 1, we have a code for all the sequences in X^n.
Note the following features of the above coding scheme:
• The code is one-to-one and easily decodable. The initial bit acts as a flag bit to indicate the length of the codeword that follows.
• We have used a brute-force enumeration of the atypical set A_ε^{(n)c} without taking into account the fact that the number of elements in A_ε^{(n)c} is less than the number of elements in X^n. Surprisingly, this is good enough to yield an efficient description.
• The typical sequences have short descriptions of length ≈ nH.

By enumeration!

We use the notation x^n to denote a sequence x_1, x_2, . . . , x_n. Let l(x^n) be the length of the codeword corresponding to x^n. If n is sufficiently large so that Pr{A_ε^(n)} ≥ 1 − ε, the expected length of the codeword is

E(l(X^n)) = Σ_{x^n} p(x^n) l(x^n)   (3.17)
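A minimal sketch of the two-part code just described (not the textbook's code; n, p, and eps are illustrative, and n is kept tiny so brute-force enumeration is feasible; the ≈ nH behaviour behind eq. (3.17) only emerges for much larger n):

```python
# Sketch of the two-part typical-set code: '0' + index within A_eps for typical
# sequences, '1' + index within X^n otherwise. Brute force, illustrative only.
from itertools import product
from math import ceil, log2

def typical_set_code(n=12, p=0.1, eps=0.35):
    H = -p * log2(p) - (1 - p) * log2(1 - p)
    seqs = list(product([0, 1], repeat=n))
    def prob(x):
        k = sum(x)
        return p**k * (1 - p)**(n - k)
    typical = [x for x in seqs
               if abs(-log2(prob(x)) / n - H) <= eps]        # A_eps^(n)
    t_index = {x: i for i, x in enumerate(sorted(typical))}  # lexicographic index
    a_index = {x: i for i, x in enumerate(seqs)}             # fallback index
    t_bits = max(1, ceil(log2(max(len(typical), 1))))        # <= n(H + eps) + 1
    a_bits = n                                  # = n log2|X| for a binary alphabet
    def encode(x):
        if x in t_index:
            return '0' + format(t_index[x], f'0{t_bits}b')
        return '1' + format(a_index[x], f'0{a_bits}b')
    # Expected codeword length, as in eq. (3.17)
    return sum(prob(x) * len(encode(x)) for x in seqs), n * H

print(typical_set_code())   # (expected length in bits, n*H in bits)
```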
AEP and data compression
``High-probability set’’ vs. ``typical set’’
• Typical set: small number of outcomes that contain most of the probability
• Is it the smallest such set?
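One way to make the question concrete (a sketch, not from the slides; n, delta, and eps are illustrative): greedily build the smallest set that captures probability 1 − δ by taking the most probable sequences first, and compare it with the weakly typical set.

```python
# Sketch: smallest set with probability >= 1 - delta (greedy over the most
# probable sequences) versus the weakly typical set, for a short Bernoulli source.
from itertools import product
from math import log2

n, p, delta, eps = 14, 0.1, 0.1, 0.2
H = -p * log2(p) - (1 - p) * log2(1 - p)

seqs = list(product([0, 1], repeat=n))
prob = {x: p**sum(x) * (1 - p)**(n - sum(x)) for x in seqs}

# Smallest high-probability set: add sequences in decreasing probability
# until the accumulated probability reaches 1 - delta.
smallest, total = [], 0.0
for x in sorted(seqs, key=prob.get, reverse=True):
    if total >= 1 - delta:
        break
    smallest.append(x)
    total += prob[x]

typical = [x for x in seqs if abs(-log2(prob[x]) / n - H) <= eps]

print(len(smallest), total)                              # high-probability set
print(len(typical), sum(prob[x] for x in typical))       # typical set
# At this tiny n the two sets differ noticeably, and the typical set excludes the
# single most probable (all-zeros) sequence; the AEP statement is asymptotic:
# for large n both contain roughly 2^{nH} sequences to first order in the exponent.
```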
Some notation
What’s the difference?
• Why use the ``typical set'' rather than the ``high-probability set''?