Chapter 7
Mathematical Foundations
Notions of Probability Theory
• Probability theory deals with predicting how likely
it is that something will happen.
• The process by which an observation is made is
called an experiment or a trial.
• The collection of basic outcomes (or sample points)
for our experiment is called the sample space Ω (Omega).
• An event is a subset of the sample space.
• Probabilities are numbers between 0 and 1, where
0 indicates impossibility and 1 certainty.
• A probability function/distribution distributes a
probability mass of 1 throughout the sample space.
P: F → [0, 1]
P(Ω) = 1
Example
• A fair coin is tossed 3 times (an experiment). What is the chance of 2 heads (the event of interest)?
– sample space: Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– probability function: uniform distribution, P(basic outcome) = 1/8
• The chance of getting 2 heads when tossing 3 coins:
A = {HHT, HTH, THH}, a subset of Ω
P(A) = |A| / |Ω| = 3/8
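To make the example concrete, here is a minimal sketch (plain Python; the variable names are ours, not from the slides) that enumerates the sample space and counts the outcomes in A:

```python
from itertools import product

# Sample space for three tosses of a fair coin: 8 equally likely outcomes.
omega = list(product("HT", repeat=3))

# Event A: exactly two heads.
A = [outcome for outcome in omega if outcome.count("H") == 2]

p_A = len(A) / len(omega)
print(sorted("".join(o) for o in A))  # ['HHT', 'HTH', 'THH']
print(p_A)                            # 0.375 = 3/8
```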
Conditional Probability
• Conditional probabilities measure the
probability of events given some knowledge.
• Prior probabilities measure the
probabilities of events before we consider
our additional knowledge.
• Posterior probabilities are probabilities that
result from using our additional knowledge.
Example (toss a fair coin 3 times):
Event B: the 1st toss is H
Event A: 2 Hs among the 1st, 2nd and 3rd tosses
P(B) = 4/8
P(A∩B) = 2/8
⇒ P(A|B) = 2/4
P(A  B)
A | B 
P(B)

A
B
AB
The multiplication rule
P(A  B)  P(B)P(A | B)  P(A)P(B | A)
The chain rule
P(A1  A 2  ...  A n )  P(A1 )P(A 2 | A1 )P(A 3 | A1  A 2 )    P(A n | in11 A i )
(used in Markov models, …)
7-5
Independence
• The chain rule relates intersection with
conditionalization (important to NLP)
• Independence and conditional independence
of events are two very important notions in
statistics
– independence:
P(A) = P(A|B), i.e., P(A∩B) = P(A) P(B)
– conditional independence:
P(A∩B|C) = P(A|C) P(B|C)
Bayes’ Theorem
• Bayes’ Theorem lets us swap the order of
dependence between events. This is
important when the former quantity is
difficult to determine.
P(B  A) P(A | B) (B)
B | A  

P(A)
P(A)
• P(A) is a normalizing constant.
7-7
An Application
• Pick the best conclusion c given some evidence e
– (1) evaluate the probability P(c|e)  <--- unknown
– (2) select the c with the largest P(c|e).
P(c|e) = P(c) P(e|c) / P(e)
P(c) and P(e|c) are known.
• Example
– P(c): relative probability of a disease
– P(e|c): how often a symptom is associated with that disease
Bayes’ Theorem
arg max_B P(A|B) P(B) / P(A) = arg max_B P(A|B) P(B)
(Venn diagram: events A and B overlapping in A∩B)
P(A) = P(A∩B) + P(A∩¬B) = P(A|B) P(B) + P(A|¬B) P(¬B)
Bayes’ Theorem
If A = ∪_{i=1}^{n} Bi, P(A) > 0, and Bi∩Bj = ∅ (i ≠ j), i.e., the sets Bi partition A, then
P(Bj|A) = P(A|Bj) P(Bj) / P(A) = P(A|Bj) P(Bj) / Σ_{i=1}^{n} P(A|Bi) P(Bi)
(used in the noisy channel model)
An Example
• A parasitic gap occurs once in 100,000 sentences.
• A complicated pattern matcher attempts to identify
sentences with parasitic gaps.
• The answer is positive with probability 0.95 when a
sentence has a parasitic gap, and the answer is positive
with probability 0.005 when it has no parasitic gap.
• When the test says that a sentence contains a parasitic gap, what is the probability that this is true?
P(G) = 0.00001, P(¬G) = 0.99999, P(T|G) = 0.95, P(T|¬G) = 0.005
P(G|T) = P(T|G) P(G) / [P(T|G) P(G) + P(T|¬G) P(¬G)]
       = (0.95 × 0.00001) / (0.95 × 0.00001 + 0.005 × 0.99999)
       ≈ 0.002
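The posterior is easy to reproduce numerically; a minimal sketch (our own variable names) applying Bayes' theorem with the figures above:

```python
# Prior and likelihoods from the slide.
p_g = 0.00001             # P(G): a sentence has a parasitic gap
p_not_g = 1 - p_g         # P(not G)
p_t_given_g = 0.95        # P(T | G): test positive given a gap
p_t_given_not_g = 0.005   # P(T | not G): false positive rate

# Bayes: P(G | T) = P(T | G)P(G) / [P(T | G)P(G) + P(T | not G)P(not G)]
p_t = p_t_given_g * p_g + p_t_given_not_g * p_not_g
p_g_given_t = p_t_given_g * p_g / p_t
print(round(p_g_given_t, 4))  # ~0.0019, i.e. about 0.002
```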
Random Variables
• A random variable is a function X: Ω (sample space) → Rⁿ
– it lets us talk about probabilities of numeric values that are tied to the event space
• A discrete random variable is a function X: Ω (sample space) → S,
where S is a countable subset of R.
• If X: Ω (sample space) → {0, 1}, then X is called a Bernoulli trial.
• The probability mass function (pmf) for a random variable X gives the probability that the random variable takes different numeric values:
pmf: p(x) = P(X = x) = P(A_x), where A_x = {ω ∈ Ω : X(ω) = x}
Events: toss two dice and sum their faces
Ω = {(1,1), (1,2), …, (1,6), (2,1), (2,2), …, (6,1), (6,2), …, (6,6)}
S = {2, 3, 4, …, 12}
X: Ω → S

Sum of the two faces (rows: first die, columns: second die):
       1   2   3   4   5   6
  1    2   3   4   5   6   7
  2    3   4   5   6   7   8
  3    4   5   6   7   8   9
  4    5   6   7   8   9  10
  5    6   7   8   9  10  11
  6    7   8   9  10  11  12

pmf:
  x        2     3     4     5     6     7     8     9     10    11    12
  p(X=x)  1/36  1/18  1/12  1/9   5/36  1/6   5/36  1/9   1/12  1/18  1/36

Example: p(3) = p(X=3) = P(A3) = P({(1,2), (2,1)}) = 2/36,
where A3 = {ω: X(ω) = 3} = {(1,2), (2,1)}
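The pmf in the table can be checked by brute force over the 36 equally likely outcomes. A small sketch, assuming nothing beyond the standard library (the helper names are ours):

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# All 36 equally likely outcomes of two dice, mapped through X = sum of faces.
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)           # 2 1/36, 3 1/18, ..., 7 1/6, ..., 12 1/36
print(sum(pmf.values()))  # 1
```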
Expectation
• The expectation is the mean (μ, mu) or average of a random variable:
E(X) = Σ_x x p(x)
• Example
– Roll one die and let Y be the value on its face:
E(Y) = Σ_y y p(y) = (1/6) Σ_{y=1}^{6} y = 21/6 = 3 1/2
• In general, E(g(Y)) = Σ_y g(y) p(y)
E(aY + b) = aE(Y) + b
E(X + Y) = E(X) + E(Y)
Variance
• The variance (σ², sigma squared) of a random variable is a measure of whether the values of the random variable tend to be consistent over trials or to vary a lot.
Var(X) = E((X − E(X))²) = E(X²) − E²(X)
• standard deviation (σ)
– square root of the variance
X: a random variable that is the sum of the numbers on two dice, i.e., X = Y1 + Y2,
where Y1, Y2 are the values of the first and second die
(pmf of X as in the table above)

E(X) = 2·(1/36) + 3·(1/18) + 4·(1/12) + 5·(1/9) + 6·(5/36) + 7·(1/6)
     + 8·(5/36) + 9·(1/9) + 10·(1/12) + 11·(1/18) + 12·(1/36) = 7

E(X) = E(Y1 + Y2) = E(Y1) + E(Y2) = 3 1/2 + 3 1/2 = 7

Var(X) = E((X − E(X))²) = Σ_x p(x)(x − E(X))² = 35/6 ≈ 5.83
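Continuing in the same spirit, a self-contained sketch (our own code) that recomputes E(X) and Var(X) exactly from the pmf:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

# E(X) = sum_x x p(x)
e_x = sum(x * p for x, p in pmf.items())
# Var(X) = E((X - E(X))^2)
var_x = sum(p * (x - e_x) ** 2 for x, p in pmf.items())

print(e_x)                   # 7
print(var_x, float(var_x))   # 35/6 ~ 5.83
```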
Joint and Conditional Distributions
• More than one random variable can be defined over
a sample space. In this case, we talk about a joint or
multivariate probability distribution.
• The joint probability mass function for two discrete
random variables X and Y is:
p(x, y)=P(X=x, Y=y)
• The marginal probability mass function totals up
the probability masses for the values of each
variable separately.
p_X(x) = Σ_y p(x, y)        p_Y(y) = Σ_x p(x, y)
• If X and Y are independent, then p(x, y) = p_X(x) p_Y(y)
Joint and Conditional Distributions
• Similar intersection rules hold for joint
distributions as for events.
p_{X|Y}(x|y) = p(x, y) / p_Y(y)
• Chain rule in terms of random variables
p( w, x, y, z )  p( w) p( x | w) p( y | w, x) p( z | w, x, y )
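To make the marginal and conditional definitions concrete, here is a small sketch with a made-up joint pmf (the table is purely illustrative, not from the slides):

```python
from fractions import Fraction as F

# A made-up joint pmf p(x, y) over X in {0, 1} and Y in {'a', 'b'}.
joint = {(0, 'a'): F(1, 4), (0, 'b'): F(1, 4),
         (1, 'a'): F(3, 8), (1, 'b'): F(1, 8)}

# Marginals: p_X(x) = sum_y p(x, y), p_Y(y) = sum_x p(x, y)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

# Conditional: p_{X|Y}(x | y) = p(x, y) / p_Y(y)
p_x_given_y = {(x, y): p / p_y[y] for (x, y), p in joint.items()}

print(p_x)                    # {0: 1/2, 1: 1/2}
print(p_y)                    # {'a': 5/8, 'b': 3/8}
print(p_x_given_y[(1, 'a')])  # (3/8) / (5/8) = 3/5
```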
Estimating Probability Functions
• What is the probability that the sentence “The cow chewed its cud” will be uttered? Unknown ⇒ P must be estimated from a sample of data.
• An important measure for estimating P is the relative frequency of the outcome, i.e., the proportion of times a certain outcome occurs:
f_u = C(u) / N
• Assuming that certain aspects of language can be modeled by one of the well-known distributions is called using a parametric approach.
• If no such assumption can be made, we must use a non-parametric or distribution-free approach.
Parametric Approach
• Select an explicit probabilistic model
• Specify a few parameters to determine a particular probability distribution
• The amount of training data required to make good probability estimates is not great, and can be calculated
Standard Distributions
• In practice, one commonly finds the same basic
form of a probability mass function, but with
different constants employed.
• Families of pmfs are called distributions and the
constants that define the different possible pmfs in
one family are called parameters.
• Discrete Distributions: the binomial distribution,
the multinomial distribution, the Poisson
distribution.
• Continuous Distributions: the normal distribution,
the standard normal distribution.
Binomial distribution
• A series of trials with only two outcomes, each trial independent of all the others
• The number r of successes out of n trials, given that the probability of success in any single trial is p:
b(r; n, p) = C(n, r) p^r (1 − p)^(n−r),   0 ≤ r ≤ n,   where C(n, r) = n! / ((n − r)! r!)
(r is the variable; n and p are the parameters)
expectation: np        variance: np(1 − p)
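A minimal sketch of the binomial pmf and its moments (the function name is ours):

```python
from math import comb

def binomial_pmf(r, n, p):
    """b(r; n, p) = C(n, r) * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 10, 0.3
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]
mean = sum(r * q for r, q in enumerate(pmf))
var = sum((r - mean) ** 2 * q for r, q in enumerate(pmf))

print(round(sum(pmf), 10))                          # 1.0 (a proper distribution)
print(round(mean, 10), n * p)                       # expectation np = 3.0
print(round(var, 10), round(n * p * (1 - p), 10))   # variance np(1-p) = 2.1
```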
The normal distribution
• Two parameters, the mean μ and the standard deviation σ; the curve is given by
n(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))
• standard normal distribution
– μ = 0, σ = 1
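A sketch of the normal density exactly as written above (the function name is ours):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """n(x; mu, sigma) = 1 / (sqrt(2*pi)*sigma) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Standard normal (mu=0, sigma=1): the density at the mean is 1/sqrt(2*pi).
print(round(normal_pdf(0.0), 4))   # 0.3989
print(round(normal_pdf(1.0), 4))   # 0.242
```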
Bayesian Statistics I: Bayesian Updating
• frequentist statistics vs. Bayesian statistics
– Toss a coin 10 times and get 8 heads ⇒ 8/10 (maximum likelihood estimate)
– 8 heads out of 10 just happens sometimes given a small sample
• Assume that the data are coming in sequentially and are independent.
• Given an a priori probability distribution, we can update our beliefs when a new datum comes in by calculating the Maximum A Posteriori (MAP) distribution.
• The MAP probability becomes the new prior and the process repeats on each new datum.
Frequentist Statistics
μ_m : the model that asserts P(head) = m
s : a particular sequence of observations yielding i heads and j tails
P(s | μ_m) = m^i (1 − m)^j
Find the MLE (maximum likelihood estimate)
arg max_m P(s | μ_m)
by differentiating the above polynomial:
f(m) = m^i (1 − m)^j
f′(m) = i·m^(i−1)·(1 − m)^j − j·m^i·(1 − m)^(j−1)
      = (i·m^(i−1)·(1 − m) − j·m^i)·(1 − m)^(j−1)
      = (i·m^(i−1) − i·m^i − j·m^i)·(1 − m)^(j−1)
      = (i − (i + j)m)·m^(i−1)·(1 − m)^(j−1)
Setting f′(m) = 0 gives m = i / (i + j).
For 8 heads and 2 tails: 8/(8+2) = 0.8
Bayesian Statistics
Belief in the fairness of a coin (that it is a regular, fair one):
a priori probability distribution: P(μ_m) = 6m(1 − m)
P(s | μ_m) = m^i (1 − m)^j for a particular sequence of observations s: i heads and j tails
New belief in the fairness of the coin?
P(μ_m | s) = P(s | μ_m) P(μ_m) / P(s)
           = m^i (1 − m)^j · 6m(1 − m) / P(s)
           = 6 m^(i+1) (1 − m)^(j+1) / P(s)
arg max_m P(μ_m | s):
f(m) = 6 m^(i+1) (1 − m)^(j+1)
f′(m) = 6(i+1) m^i (1 − m)^(j+1) − 6(j+1) m^(i+1) (1 − m)^j
      = 6 m^i (1 − m)^j [(i+1)(1 − m) − (j+1)m]
Setting f′(m) = 0 gives m = (i + 1) / (i + j + 2), which becomes the new prior.
arg max_m P(μ_m | s) = 3/4   (when i = 8, j = 2)
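The MAP calculation can be checked numerically. The sketch below (our own code) maximizes the unnormalized posterior 6 m^(i+1) (1 − m)^(j+1) over a grid and compares the result with the closed form (i+1)/(i+j+2) and with the MLE i/(i+j):

```python
def map_estimate(i, j, grid=100001):
    """Grid-maximize the unnormalized posterior 6 m^(i+1) (1-m)^(j+1)."""
    best_m, best_val = 0.0, -1.0
    for k in range(grid):
        m = k / (grid - 1)
        val = 6 * m ** (i + 1) * (1 - m) ** (j + 1)
        if val > best_val:
            best_m, best_val = m, val
    return best_m

i, j = 8, 2
print(map_estimate(i, j))      # ~0.75
print((i + 1) / (i + j + 2))   # 0.75  (closed-form MAP)
print(i / (i + j))             # 0.8   (MLE)
```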
Bayesian Statistics II:
Bayesian Decision Theory
• Bayesian Statistics can be used to evaluate
which model or family of models better
explains some data.
• We define two different models of the event
and calculate the likelihood ratio between
these two models.
Entropy
• The entropy is the average uncertainty of a single
random variable.
• Let p(x) = P(X = x), where x ∈ X.
Example: toss two coins and count the number of heads:
p(0) = 1/4, p(1) = 1/2, p(2) = 1/4
• H(p) = H(X) = − Σ_{x∈X} p(x) log₂ p(x)
• In other words, entropy measures the amount of information in a random variable. It is normally measured in bits.
Example
• Roll an 8-sided die
H(X) = − Σ_{i=1}^{8} p(i) log p(i) = − Σ_{i=1}^{8} (1/8) log (1/8) = − log (1/8) = log 8 = 3 bits
• Entropy (another view)
– the average length of the message needed to transmit an outcome of that variable
outcome:  1    2    3    4    5    6    7    8
code:     001  010  011  100  101  110  111  000
– Optimal code to send a message of probability p(i): − log p(i) bits
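A small entropy helper (our own function) reproduces both numbers: 3 bits for the 8-sided die and 1.5 bits for the earlier "number of heads in two coin tosses" variable:

```python
from math import log2

def entropy(probs):
    """H(p) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1/8] * 8))         # 3.0 bits (8-sided die)
print(entropy([1/4, 1/2, 1/4]))   # 1.5 bits (number of heads in two tosses)
```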
Example
• Problem:
send a friend a message that is a number from 0 to 3.
How long a message must you send? (in terms of number of bits)
• Example: watch a house with two occupants.

Case  Situation       Probability1  Probability2
0     no occupants    0.25          0.5
1     1st occupant    0.25          0.125
2     2nd occupant    0.25          0.125
3     both occupants  0.25          0.25

Case  Probability1  Code  Probability2  Code
0     0.25          00    0.5           0
1     0.25          01    0.125         110
2     0.25          10    0.125         111
3     0.25          11    0.25          10
• Variable-length encoding
0-No occupants
10-both occupants
110-First occupant
111-Second occupant
• Code tree:
– (1) all messages are handled.
– (2) it is clear when one message ends and the next starts (the code is prefix-free).
Expected length: (1/2)·1 bit + (1/4)·2 bits + (1/8)·3 bits + (1/8)·3 bits = 1.75 bits (< 2 bits)
– Fewer bits for more frequent messages, more bits for less frequent messages.
W : random variable for a message
V(W) : possible messages
P : probability distribution
• lower bound on the number of bits needed to
encode such a message:
H(W) = − Σ_{w∈V(W)} P(w) log P(w)
(the entropy of the random variable)
H(W_occ) = −((1/2) log (1/2) + (1/4) log (1/4) + (1/8) log (1/8) + (1/8) log (1/8))
         = −((1/2)(−1) + (1/4)(−2) + (1/8)(−3) + (1/8)(−3))
         = 1.75
         = Σ_{w∈V(W)} P(w) · bits_required(w)
         = Σ_{w∈V(W)} P(w) (− log P(w))
• (1) entropy of a message:
– lower bound for the average number of bits needed to
transmit that message.
• (2) encoding method:
– using -log P(w) bits
• entropy (another view)
– a measure of the uncertainty about what a message says.
– fewer bits for a more certain message, more bits for a less certain message
H(p) = H(X) = − Σ_{x∈X} p(x) log₂ p(x)
H(X) = Σ_{x∈X} p(x) log (1 / p(x))
Recall E(X) = Σ_x x p(x), so
H(X) = E( log (1 / p(X)) )
Simplified Polynesian
• Letter probabilities:
p: 1/8   t: 1/4   k: 1/8   a: 1/4   i: 1/8   u: 1/8
• per-letter entropy
H(P) = − Σ_{i∈{p,t,k,a,i,u}} P(i) log P(i)
     = − [4 × (1/8) log (1/8) + 2 × (1/4) log (1/4)]
     = 2 1/2 bits
• A code that uses fewer bits to send more frequent letters:
p: 100   t: 00   k: 101   a: 01   i: 110   u: 111
Joint Entropy and Conditional Entropy
• The joint entropy of a pair of discrete random variables
X, Y ~ p(x,y) is the amount of information needed on
average to specify both their values.
H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y)
• The conditional entropy of a discrete random variable
Y given another X, for X, Y ~ p(x,y), expresses how
much extra information you still need to supply on
average to communicate Y given that the other party
knows X.
H(Y | X) = Σ_{x∈X} p(x) H(Y | X = x)
         = Σ_{x∈X} p(x) [ − Σ_{y∈Y} p(y|x) log p(y|x) ]
         = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)
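Both quantities can be computed directly from a joint pmf. A sketch with our own helper functions and a made-up joint distribution, also checking that H(X, Y) = H(X) + H(Y | X):

```python
from math import log2

def entropy(probs):
    """H = -sum p log2 p, skipping zero-probability entries."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Made-up joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# Marginal p_X(x) = sum_y p(x, y)
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Joint entropy H(X, Y) and conditional entropy H(Y | X) = -sum p(x,y) log2 p(y|x)
h_xy = entropy(joint.values())
h_y_given_x = -sum(p * log2(p / p_x[x]) for (x, _), p in joint.items() if p > 0)

print(h_xy)                                  # 1.75 bits
print(entropy(p_x.values()) + h_y_given_x)   # equals H(X, Y) by the chain rule
```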
Chain rule for entropy
H(X, Y) = H(X) + H(Y | X)
H(X1, ..., Xn) = H(X1) + H(X2 | X1) + ... + H(Xn | X1, ..., Xn−1)
Proof:
H(X, Y) = −E_{p(x,y)}(log p(x, y))
        = −E_{p(x,y)}(log (p(x) p(y|x)))
        = −E_{p(x,y)}(log p(x) + log p(y|x))
        = −E_{p(x,y)}(log p(x)) − E_{p(x,y)}(log p(y|x))
        = H(X) + H(Y | X)
Simplified Polynesian revisited
• Distinction between model and reality
• Simplified Polynesian has syllable structure
• All words consist of sequences of CV (consonant-vowel) syllables.
• A better model uses two random variables C and V.
Joint distribution P(V, C) (rows: vowel, columns: consonant), with marginals:

          p      t      k      P(·,V)
  a      1/16   3/8    1/16    1/2
  i      1/16   3/16    0      1/4
  u       0     3/16   1/16    1/4
  P(C,·)  1/8   3/4    1/8

H(C) = − ((1/8) log (1/8) + (3/4) log (3/4) + (1/8) log (1/8))
     = 2 × (1/8) × 3 + (3/4)(2 − log 3)
     = 9/4 − (3/4) log 3 ≈ 1.061 bits

H(V | C) = Σ_{c∈{p,t,k}} p(C = c) H(V | C = c)
         = (1/8) H(1/2, 1/2, 0) + (3/4) H(1/2, 1/4, 1/4) + (1/8) H(1/2, 0, 1/2)
         = (1/8)(1) + (3/4)(3/2) + (1/8)(1)
         = 11/8 bits = 1.375 bits

H(C, V) = H(C) + H(V | C) ≈ 1.061 bits + 1.375 bits ≈ 2.44 bits
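The numbers on this slide are easy to verify in code. A sketch (our own code) that rebuilds the joint P(V, C) table and recomputes H(C), H(V | C) and H(C, V):

```python
from math import log2

# Joint distribution P(V, C) for Simplified Polynesian syllables.
joint = {('a', 'p'): 1/16, ('a', 't'): 3/8,  ('a', 'k'): 1/16,
         ('i', 'p'): 1/16, ('i', 't'): 3/16, ('i', 'k'): 0.0,
         ('u', 'p'): 0.0,  ('u', 't'): 3/16, ('u', 'k'): 1/16}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Marginal over consonants: P(C=c) = sum_v P(v, c)
p_c = {}
for (v, c), p in joint.items():
    p_c[c] = p_c.get(c, 0.0) + p

h_c = H(p_c.values())
# H(V | C) = sum_c P(c) H(V | C=c), with P(v | c) = P(v, c) / P(c)
h_v_given_c = sum(p_c[c] * H([joint[(v, c)] / p_c[c] for v in 'aiu'])
                  for c in 'ptk')
print(round(h_c, 3), round(h_v_given_c, 3), round(h_c + h_v_given_c, 2))
# 1.061 1.375 2.44
```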
Per-letter marginal distribution under the syllable model:
p: 1/16   t: 3/8   k: 1/16   a: 1/4   i: 1/8   u: 1/8

H(P) = − ((1/16) log (1/16) + (3/8) log (3/8) + (1/16) log (1/16)
        + (1/4) log (1/4) + (1/8) log (1/8) + (1/8) log (1/8))
     = 8/16 + (3/8)(3 − log 3) + 2/4 + 6/8
     ≈ 2.28 bits

Under the original per-letter model:
p: 1/8   t: 1/4   k: 1/8   a: 1/4   i: 1/8   u: 1/8
H(P) = 2 1/2 bits, so two letters (one CV syllable) cost 2 × 2 1/2 bits = 5 bits.
Short Summary
• Better understanding means much less uncertainty (2.44 bits < 5 bits)
• Incorrect model’s cross entropy is larger
than that of the correct model
H(W_{1,n}) ≤ H(W_{1,n}, M)
H(W_{1,n}, M) ≝ − Σ_{w_{1,n}} P(w_{1,n}) log M(w_{1,n})
P(W_{1,n}): correct model
M(W_{1,n}): approximate model
Entropy rate:
per-letter/word entropy
• The amount of information contained in a
message depends on the length of the
message
H_rate = (1/n) H(X_{1n}) = − (1/n) Σ_{x_{1n}} p(x_{1n}) log p(x_{1n})
• The entropy of a human language L:
H_rate(L) = lim_{n→∞} (1/n) H(X1, X2, ..., Xn)
Mutual Information
• By the chain rule for entropy, we have
H(X,Y) = H(X)+ H(Y|X) = H(Y)+H(X|Y)
• Therefore, H(X)-H(X|Y)=H(Y)-H(Y|X)
• This difference is called the mutual information
between X and Y.
• It is the reduction in uncertainty of one random
variable due to knowing about another, or, in other
words, the amount of information one random
variable contains about another.
(Figure: Venn-diagram view of H(X,Y), split into H(X|Y), I(X;Y) and H(Y|X) inside H(X) and H(Y))

I(X; Y) = H(X) − H(X | Y)
        = H(X) + H(Y) − H(X, Y)
        = Σ_x p(x) log (1/p(x)) + Σ_y p(y) log (1/p(y)) + Σ_{x,y} p(x, y) log p(x, y)
        = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
(using the marginals p_X(x) = Σ_y p(x, y) and p_Y(y) = Σ_x p(x, y))
It is 0 only when the two variables are independent; for two dependent variables, mutual information grows not only with the degree of dependence, but also with the entropy of the variables.
Conditional mutual information:
I(X; Y) = H(X) − H(X | Y)
I(X; Y | Z) = I((X; Y) | Z) = H(X | Z) − H(X | Y, Z)
Chain rule for mutual information:
I(X_{1n}; Y) = I(X1; Y) + I(X2; Y | X1) + ... + I(Xn; Y | X1, ..., Xn−1)
             = Σ_{i=1}^{n} I(Xi; Y | X1, ..., Xi−1)
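Mutual information can be computed either from the three entropies or directly from the joint distribution. A sketch (our own code, with a made-up joint table) showing that the two routes agree:

```python
from math import log2

# Made-up joint distribution p(x, y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

p_x = {x: sum(p for (x2, _), p in joint.items() if x2 == x) for x in (0, 1)}
p_y = {y: sum(p for (_, y2), p in joint.items() if y2 == y) for y in (0, 1)}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# I(X; Y) = H(X) + H(Y) - H(X, Y)
mi_from_entropies = H(p_x.values()) + H(p_y.values()) - H(joint.values())

# I(X; Y) = sum_{x,y} p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]
mi_direct = sum(p * log2(p / (p_x[x] * p_y[y]))
                for (x, y), p in joint.items() if p > 0)

print(round(mi_from_entropies, 6), round(mi_direct, 6))  # the two agree
```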
Pointwise Mutual Information
I(x, y) = log [ p(x, y) / (p(x) p(y)) ]
Applications:
– clustering words
– word sense disambiguation
• Clustering by Next Word (Brown et al., 1992)
1. Each word wi was characterized by the word that immediately follows it:
c(wi) = ⟨|w1|, |w2|, ..., |wW|⟩,
where |wj| is the total number of times wi wj occurs in the corpus.
2. Define a distance measure on such vectors, based on the pointwise mutual information I(x, y): the amount of information one outcome gives us about the other.
I(x, y) ≝ (− log P(x)) − (− log P(x|y))   [the uncertainty of x, minus the uncertainty of x once y is known]
         = log [ P(x, y) / (P(x) P(y)) ]
• Example: how much information does the word “pancake” give us about the following word “syrup”?
I(pancake; syrup) = log [ P(pancake, syrup) / (P(pancake) P(syrup)) ]
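In practice these probabilities are estimated by relative frequencies from a corpus, so pointwise MI is computed from counts. A sketch with made-up counts (the numbers are purely illustrative):

```python
from math import log2

# Made-up corpus counts (illustrative only).
N = 1_000_000                # total number of bigram positions
count_pancake = 200          # C(pancake)
count_syrup = 150            # C(syrup)
count_pancake_syrup = 40     # C(pancake syrup) as a bigram

p_x = count_pancake / N
p_y = count_syrup / N
p_xy = count_pancake_syrup / N

# I(pancake; syrup) = log2 [ P(pancake, syrup) / (P(pancake) P(syrup)) ]
pmi = log2(p_xy / (p_x * p_y))
print(round(pmi, 2))   # strongly positive: the two words attract each other
```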
Physical meaning of MI
(1) wi and wj have no particular relation to each other:
P(wj | wi) = P(wj), i.e., x and y give each other no information.
I(wi; wj) = log [ P(wi, wj) / (P(wi) P(wj)) ] = log [ P(wj | wi) / P(wj) ] = log [ P(wj) / P(wj) ] = 0
I(x; y) ≈ 0

(2) wi and wj are perfectly coordinated:
I(wi; wj) = log [ P(wi, wj) / (P(wi) P(wj)) ] = log [ P(wi) / (P(wi) P(wj)) ] = log [ 1 / P(wj) ] = a very large number
I(x; y) >> 0: knowing y provides a lot of information about x.

(3) wi and wj are negatively correlated (they almost never occur together):
I(x; y) << 0: a very small (negative) number.
The Noisy Channel Model
• Assuming that you want to communicate
messages over a channel of restricted capacity,
optimize (in terms of throughput and accuracy) the
communication in the presence of noise in the
channel.
(Figure: the noisy channel. W, a message from a finite alphabet → Encoder → X, the input to the channel → Channel p(y|x) → Y, the output from the channel → Decoder → Ŵ, an attempt to reconstruct the message based on the output)

• A channel's capacity can be reached by designing an input code that maximizes the mutual information between the input and output over all possible input distributions:
C = max_{p(X)} I(X; Y)
A binary symmetric channel: a 1 or 0 in the input gets flipped on transmission with probability p.
(Figure: each input symbol is transmitted correctly with probability 1 − p and flipped with probability p)

I(X; Y) = H(Y) − H(Y | X) = H(Y) − H(p)
C = max_{p(X)} I(X; Y) = 1 − H(p),   where H(p) = −p log₂ p − (1 − p) log₂ (1 − p)

The channel capacity is 1 bit only if the entropy H(p) is 0, i.e.:
– if p = 0 the channel reliably transmits a 0 as 0 and a 1 as 1;
– if p = 1 it always flips bits.
The channel capacity is 0 when 0s and 1s are equally likely to come out as 0s and as 1s (i.e., p = 1/2):
H(1/2) = −(1/2) log (1/2) − (1/2) log (1/2) = 1  ⇒  a completely noisy binary channel.
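A sketch (our own code) of the capacity C = 1 − H(p) of the binary symmetric channel, checking the limiting cases discussed above:

```python
from math import log2

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with flip probability p."""
    return 1.0 - binary_entropy(p)

for p in (0.0, 0.1, 0.5, 1.0):
    print(p, round(bsc_capacity(p), 3))
# p=0 or p=1: capacity 1 bit (deterministic channel);
# p=0.5: capacity 0 (completely noisy channel).
```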
The noisy channel model in linguistics
(Figure: I → Noisy Channel p(o|i) → O → Decoder → Î)

Î = arg max_i p(i | o) = arg max_i p(i) p(o | i) / p(o) = arg max_i p(i) p(o | i)

p(i): language model
p(o|i): channel probability
Speech Recognition
• Find the sequence of words W_{1,n} that maximizes
P(W_{1,n} = w_{1,n} | Speech Signal) = P(W_{1,n}) P(Speech Signal | W_{1,n}) / P(Speech Signal)
• Maximize P(W_{1,n}) P(Speech Signal | W_{1,n})
– P(W_{1,n}): language model
– P(Speech Signal | W_{1,n}): acoustic aspects of the speech signal
(Figure: a lattice of word positions W1 W2 W3 with candidate hypotheses a1/a2/a3, b1/b2, c1–c4; e.g., compare P(a2, b1, c4) and P(a2, b1, c4 | Speech Signal))
Example: “The {big | pig} dog ...”
Assume P(big | the) = P(pig | the).
P(the big dog) = P(the) P(big | the) P(dog | the big)
P(the pig dog) = P(the) P(pig | the) P(dog | the pig)
Since P(dog | the big) > P(dog | the pig),
“the big dog” is selected, i.e., “dog” selects “big”.
Relative Entropy or
Kullback-Leibler Divergence
• For 2 pmfs, p(x) and q(x), their relative entropy is:
D(p || q) = Σ_{x∈X} p(x) log [ p(x) / q(x) ]
• The relative entropy (also known as the Kullback-Leibler divergence) is a measure of how different two probability distributions (over the same event space) are.
• The KL divergence between p and q can also be seen as the average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q.
D(p || q) = 0 when p = q, i.e., no bits are wasted.
D(p || q) = E_p ( log [ p(X) / q(X) ] )
(recall D(p || q) = Σ_{x∈X} p(x) log [ p(x) / q(x) ] and E(X) = Σ_x x p(x))

A measure of how far a joint distribution is from independence:
I(X; Y) = D( p(x, y) || p(x) p(y) ) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]

Conditional relative entropy:
D( p(y|x) || q(y|x) ) = Σ_x p(x) Σ_y p(y|x) log [ p(y|x) / q(y|x) ]

Chain rule for relative entropy:
D( p(x, y) || q(x, y) ) = D( p(x) || q(x) ) + D( p(y|x) || q(y|x) )

Application: measuring selectional preferences.
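A sketch of D(p || q) as defined above (our own helper), illustrating that it is zero when p = q, positive otherwise, and not symmetric; it assumes q(x) > 0 wherever p(x) > 0:

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2 (p(x)/q(x)); assumes q(x) > 0 where p(x) > 0."""
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q_same = [0.5, 0.25, 0.25]
q_uniform = [1/3, 1/3, 1/3]

print(kl_divergence(p, q_same))               # 0.0: no bits wasted when p = q
print(round(kl_divergence(p, q_uniform), 3))  # > 0: extra bits per symbol
print(round(kl_divergence(q_uniform, p), 3))  # different: not symmetric in general
```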
The Relation to Language: Cross-Entropy
• Entropy can be thought of as a matter of how
surprised we will be to see the next word given
previous words we already saw.
• The cross entropy between a random variable X
with true probability distribution p(x) and another
pmf q (normally a model of p) is given by:
H(X, q) = H(X) + D(p || q)
        = − Σ_x p(x) log q(x)
        = E_p ( log (1 / q(x)) )
since
H(X) + D(p || q) = − Σ_x p(x) log p(x) + Σ_x p(x) log [ p(x) / q(x) ]
                 = Σ_x p(x) ( log (1 / p(x)) + log [ p(x) / q(x) ] )
                 = Σ_x p(x) log (1 / q(x))
• Cross-entropy can help us find out what our average surprise for the next word is.
Cross Entropy
• Cross entropy of a language L = (Xi) ~ p(x):
H(L, m) = − lim_{n→∞} (1/n) Σ_{x_{1n}} p(x_{1n}) log m(x_{1n})
• If the language is “nice” (stationary and ergodic):
H(L, m) = − lim_{n→∞} (1/n) log m(x_{1n})
• With a large body of utterances available, we approximate:
H(L, m) ≈ − (1/n) log m(x_{1n})
How much good does the approximate model do?
H(W_{1,n}) ≤ H(W_{1,n}, M)
H(W_{1,n}, M) ≝ − Σ_{w_{1,n}} P(w_{1,n}) log M(w_{1,n})
P(W_{1,n}): correct model
M(W_{1,n}): approximate model
Proof that H(W_{1,n}) ≤ H(W_{1,n}, M):
Use log_e x ≤ x − 1, with Σ_{i=1}^{n} xi = 1 (correct model) … (1) and Σ_{i=1}^{n} yi ≤ 1 (approximate model) … (2):

Σ_{i=1}^{n} xi log₂ (yi / xi) = (1 / log_e 2) Σ_{i=1}^{n} xi log_e (yi / xi)
                             ≤ (1 / log_e 2) Σ_{i=1}^{n} xi (yi / xi − 1)
                             = (1 / log_e 2) Σ_{i=1}^{n} (yi − xi)
                             ≤ 0
(Figure: the line y = x − 1 lies above the curve y = log_e x)

Since Σ_i xi log₂ (yi / xi) ≤ 0,
Σ_i xi log₂ yi − Σ_i xi log₂ xi ≤ 0
⇒ Σ_i xi log₂ yi ≤ Σ_i xi log₂ xi
⇒ − Σ_i xi log₂ xi ≤ − Σ_i xi log₂ yi
⇒ H(W_{1,n}) ≤ H(W_{1,n}, M)

Cross entropy of a language L given a model M:
H(L, P_M) ≝ − lim_{n→∞} (1/n) Σ_{w_{1,n}} P(w_{1,n}) log P_M(w_{1,n})
P(w_{1,n})? (“all” English text) is infinite, so use representative samples of English text:
H(L, P_M) = − lim_{n→∞} (1/n) Σ P(W_{1,n}) log P_M(W_{1,n}) = − lim_{n→∞} (1/n) log P_M(W_{1,n})
Cross Entropy as a Model Evaluator
• Example:
– Find the best model to produce messages of 20 words.
• Correct probabilistic model:

Message      M1    M2    M3    M4    M5    M6    M7    M8
Probability  0.05  0.05  0.05  0.10  0.10  0.20  0.20  0.25
• Per-word cross entropy:
− (1/20) × (0.05 log P(M1) + 0.05 log P(M2) + 0.05 log P(M3) + 0.10 log P(M4)
          + 0.10 log P(M5) + 0.20 log P(M6) + 0.20 log P(M7) + 0.25 log P(M8))
• Approximate model: 100 samples
M1 ×5, M2 ×5, M3 ×5, M4 ×10, M5 ×10, M6 ×20, M7 ×20, M8 ×25
Each message is independent of the next.
P_M = P(M1) × P(M1) × … × P(M1)   (5 times)
    × P(M2) × P(M2) × … × P(M2)   (5 times)
    × …
    × P(M8) × P(M8) × … × P(M8)   (25 times)

log(P_M) = 5 log P(M1) + 5 log P(M2) + 5 log P(M3) + 10 log P(M4) + 10 log P(M5)
         + 20 log P(M6) + 20 log P(M7) + 25 log P(M8)
• per-word cross entropy:
− (1/(20 × 100)) log P_M
  = − (1/20) ( (5/100) log P(M1) + (5/100) log P(M2) + (5/100) log P(M3)
             + (10/100) log P(M4) + (10/100) log P(M5) + (20/100) log P(M6)
             + (20/100) log P(M7) + (25/100) log P(M8) )
• Here the test suite of 100 examples was exactly indicative of the probabilistic model.
• Measuring the cross entropy of a model is valid only if the test sequence has not been used by the model builder:
– closed test (inside test) vs. open test (outside test)
• The Composition of the Brown and LOB Corpora
• Brown Corpus:
– Brown University Standard Corpus of Present-day American English.
• LOB Corpus:
– Lancaster-Oslo/Bergen Corpus of British English.
Text Category                                      Number of Texts
                                                   Brown   LOB
A  Press: reportage                                 44      44
B  Press: editorial                                 27      27
C  Press: reviews                                   17      17
D  Religion                                         17      17
E  Skills and hobbies                               36      38
F  Popular lore                                     48      44
G  Belles lettres, biography, memoirs, etc.         75      77
H  Miscellaneous (mainly government documents)      30      30
J  Learned (including science and technology)       80      80
K  General fiction                                  29      29
L  Mystery and detective                            24      24
M  Science fiction                                   6       6
N  Adventure and western fiction                    29      29
P  Romance and love story                           29      29
R  Humour                                            9       9
The Entropy of English
• We can model English using n-gram models (also known as Markov chains).
• These models assume limited memory, i.e., we assume that the next word depends only on the previous k ones [kth order Markov approximation]:
P(Xn = xn | Xn−1 = xn−1, ..., X1 = x1) = P(Xn = xn | Xn−1 = xn−1, ..., Xn−k = xn−k)
P(W1  w1 ,W2  w2 ,...,Wn  wn )
 P(W1,n  w1,n ) P(w1,n)
 P( w1 ) P( w2 | w1 ) P( w3 | w1, 2 )...P( wn | w1,n 1 )
bigram:
P(W1  w1 ,W2  w2 ,...,Wn  wn )
 P( w1 ) P( w2 | w1 ) P( w3 | w2 )...P( wn | wn 1 )
trigram:
P(W1  w1 ,W2  w2 ,...,Wn  wn )
 P( w1 ) P( w2 | w1 ) P( w3 | w1, 2 )...P( wn | wn 2,n 1 )
7 - 76
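A minimal sketch (our own code, toy corpus) of a bigram model estimated by relative frequencies and used to score a sentence with the factorization above; a real model would also need smoothing for unseen bigrams:

```python
from collections import Counter

corpus = "the big dog barks the big dog sleeps the pig sleeps".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Relative-frequency estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(words):
    """P(w1) P(w2|w1) ... P(wn|w_{n-1}); P(w1) estimated by unigram frequency."""
    p = unigrams[words[0]] / len(corpus)
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("the big dog sleeps".split()))
print(sentence_prob("the pig dog sleeps".split()))  # 0: bigram "pig dog" unseen
```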
The Entropy of English
• What is the Entropy of English?

Model                  Cross entropy (bits)
zeroth order           4.76   (uniform model, so log 27)
first order            4.07
second order           2.8
Shannon's experiment   1.3 (1.34)

(Cover and Thomas 1991:140)
Perplexity
• A measure related to the notion of cross-entropy
and used in the speech recognition community is
called the perplexity.
perplexity(x_{1n}, m) = 2^{H(x_{1n}, m)} = m(x_{1n})^{−1/n}
• A perplexity of k means that you are as surprised
on average as you would have been if you had had
to guess between k equiprobable choices at each
step.
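Perplexity is just the exponentiated per-word cross entropy, so it is a few lines of code. A sketch (our own code, illustrative probabilities):

```python
from math import log2

def perplexity(word_probs):
    """perplexity = 2^H = (prod m(w))^(-1/n), with H = -(1/n) sum log2 m(w)."""
    n = len(word_probs)
    cross_entropy = -sum(log2(p) for p in word_probs) / n
    return 2 ** cross_entropy

# Model probabilities assigned to each word of a test sequence (made up).
probs = [0.1, 0.25, 0.05, 0.2]
print(round(perplexity(probs), 2))

# If every word had probability 1/k, the perplexity would be exactly k:
print(perplexity([1/8] * 10))   # 8.0
```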