Statistical methods in NLP
Course 2
Diana Trandabăț
2015-2016
Quick Recap
• Out of three prisoners, one, randomly selected without their knowledge, will be executed, and the other two will be released.
• One of the prisoners asks the guard to show him which of the other two will be released (at least one will be released anyway). If the guard answers, will the prisoner have more information than before?
Essential Information Theory
• Developed by Shannon in the 40s
• Maximizing the amount of information that
can be transmitted over an imperfect
communication channel
• Data compression (entropy)
• Transmission rate (channel capacity)
Probability mass function
• The probability that a random variable X takes on different numeric values
• p(x) = P(X = x) = P(A_x), where A_x = {ω ∈ Ω : X(ω) = x}
• Example: the probability of the number of heads when flipping 2 coins
• p(nr_heads = 0) = ¼
• p(nr_heads = 1) = ½
• p(nr_heads = 2) = ¼
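A minimal sketch of this PMF in Python (the two-coin example above; `nr_heads` counts heads per outcome):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Sample space of two fair coin flips
outcomes = list(product("HT", repeat=2))           # HH, HT, TH, TT
counts = Counter(o.count("H") for o in outcomes)   # nr_heads per outcome

# p(nr_heads = x) = |A_x| / |Omega|
pmf = {x: Fraction(c, len(outcomes)) for x, c in counts.items()}
print(pmf)  # {2: Fraction(1, 4), 1: Fraction(1, 2), 0: Fraction(1, 4)}
```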
Expectation
• The expectation E(X) = Σ_x x p(x) is the mean or average value of a random variable.
• Example: if Y is the value of the face shown when rolling one die, then E(Y) = (1 + 2 + ... + 6) / 6 = 3.5
• E(X + Y) = E(X) + E(Y)
• E(XY) = E(X) E(Y) if X and Y are independent
Variance
• The variance of a random variable is a
measure of whether the values of the variable
tend to be consistent over trials or to vary a
lot.
• Var(X) = E((X − E(X))²) = E(X²) − E²(X)
• The commonly used standard deviation σ is
the square root of variance.
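A small numeric check of the expectation and variance formulas, using a fair six-sided die (plain Python only):

```python
from fractions import Fraction

# Fair die: p(x) = 1/6 for x = 1..6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

E_X = sum(x * p for x, p in pmf.items())       # E(X)
E_X2 = sum(x**2 * p for x, p in pmf.items())   # E(X^2)
var_X = E_X2 - E_X**2                          # Var(X) = E(X^2) - E^2(X)

print(E_X, var_X, float(var_X))  # 7/2 35/12 2.9166...
```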
Entropy
• X: discrete random variable;
• p(x) = probability mass function of the random
variable X
• Entropy (or self-information):
H(p) = H(X) = − Σ_{x∈X} p(x) log2 p(x)
• Entropy measures the amount of information in a
random variable
• It is the average length of the message needed to
transmit an outcome of that variable using the
optimal code (in bits)
Entropy (cont)
H(X) = − Σ_{x∈X} p(x) log2 p(x)
     = Σ_{x∈X} p(x) log2 (1 / p(x))
     = E[ log2 (1 / p(X)) ]
• H(X) ≥ 0
• H(X) = 0 ⟺ p(X) = 1, i.e. when the value of X is determinate, hence providing no new information
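A minimal sketch of this definition in Python (the distribution is just a list of probabilities; terms with p(x) = 0 are skipped, following the convention 0 · log 0 = 0):

```python
from math import log2

def entropy(probs):
    """H(X) = - sum_x p(x) log2 p(x), in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.5, 0.25]))  # 1.5 bits (number of heads in 2 coin flips)
```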
Exercise
• Compute the Entropy of tossing a coin
Exercise
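A worked check for this exercise, assuming a fair coin (a biased coin would give less than 1 bit):

```python
from math import log2

# Fair coin: p(heads) = p(tails) = 1/2
H = -(0.5 * log2(0.5) + 0.5 * log2(0.5))
print(H)  # 1.0 bit
```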
Exercise 2
• Example: Entropy of rolling an 8-sided die.
Exercise 2
• Example: Entropy of rolling an 8-sided die.
• Outcomes:  1    2    3    4    5    6    7    8
• Codes:    001  010  011  100  101  110  111  000
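A short check of the 3-bit answer (a fair 8-sided die, i.e. a uniform distribution over 8 outcomes):

```python
from math import log2

# Uniform distribution over 8 outcomes
H = -sum((1 / 8) * log2(1 / 8) for _ in range(8))
print(H)  # 3.0 bits = log2(8), matching the 3-bit codes above
```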
Exercise 3
• Entropy of a biased die
• P(X=1) = 1/2
• P(X=2) = 1/4
• P(X=3) = 0
• P(X=4) = 0
• P(X=5) = 1/8
• P(X=6) = 1/8
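A worked check for the biased die (zero-probability faces contribute nothing, by the 0 · log 0 = 0 convention):

```python
from math import log2

probs = [1/2, 1/4, 0, 0, 1/8, 1/8]   # P(X=1), ..., P(X=6)
H = -sum(p * log2(p) for p in probs if p > 0)
print(H)  # 1.75 bits
```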
Simplified Polynesian
– letter frequencies:
  p: 1/8   t: 1/4   k: 1/8   a: 1/4   i: 1/8   u: 1/8
– per-letter entropy:
  H(P) = − Σ_{i∈{p,t,k,a,i,u}} P(i) log2 P(i) = 2.5 bits
– coding:
  p: 100   t: 00   k: 101   a: 01   i: 110   u: 111
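A quick verification that the per-letter entropy is 2.5 bits and that the code above achieves it on average:

```python
from math import log2

freqs = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
code  = {"p": "100", "t": "00", "k": "101", "a": "01", "i": "110", "u": "111"}

H = -sum(p * log2(p) for p in freqs.values())            # per-letter entropy
avg_len = sum(freqs[c] * len(code[c]) for c in freqs)    # expected code length
print(H, avg_len)  # 2.5 2.5
```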
Joint Entropy
• The joint entropy of 2 random variables X, Y is the amount of information needed on average to specify both their values
H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y)
Conditional Entropy
• The conditional entropy of a random variable Y given another X expresses how much extra information one still needs to supply on average to communicate Y, given that the other party knows X
H(Y | X) = Σ_{x∈X} p(x) H(Y | X = x)
         = − Σ_{x∈X} p(x) Σ_{y∈Y} p(y | x) log p(y | x)
         = Σ_{x∈X} Σ_{y∈Y} p(x, y) log ( p(x) / p(x, y) )
         = E[ − log p(y | x) ]
Chain Rule
H(X, Y) = H(X) + H(Y | X)
H(X1, ..., Xn) = H(X1) + H(X2 | X1) + ... + H(Xn | X1, ..., Xn−1)
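A small numeric check of H(X, Y) = H(X) + H(Y | X); the joint distribution below is an arbitrary illustration, not one from the slides:

```python
from math import log2

# Example joint distribution p(x, y)
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Marginal p(x)
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0) + p

H_XY = H(p_xy.values())
H_X = H(p_x.values())
# H(Y | X) = sum_x p(x) H(Y | X = x)
H_Y_given_X = sum(
    p_x[x] * H([p / p_x[x] for (x2, _), p in p_xy.items() if x2 == x])
    for x in p_x
)
print(H_XY, H_X + H_Y_given_X)  # both approximately 1.75
```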
Simplified Polynesian Revisited
– syllable structure
• All words consist of sequences of CV syllables.
• C: consonant, V: vowel
– joint distribution p(C, V), with consonant marginals in the bottom row and vowel marginals in the right column:

         p       t       k
  a     1/16    3/8     1/16   | 1/2
  i     1/16    3/16     0     | 1/4
  u      0      3/16    1/16   | 1/4
        1/8     3/4     1/8

H(C) = H(1/8, 3/4, 1/8) = 1.061 bits
H(V | C) = Σ_{c∈{p,t,k}} p(C = c) H(V | C = c)
         = 1/8 · H(1/2, 1/2, 0) + 3/4 · H(1/2, 1/4, 1/4) + 1/8 · H(1/2, 0, 1/2)
         = 1.375 bits
H(C, V) = H(C) + H(V | C) = 2.44 bits
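A numeric check of H(C), H(V | C), and H(C, V) from the joint table above:

```python
from math import log2

# Joint distribution p(C, V); consonants p, t, k; vowels a, i, u
p_cv = {
    ("p", "a"): 1/16, ("t", "a"): 6/16, ("k", "a"): 1/16,
    ("p", "i"): 1/16, ("t", "i"): 3/16, ("k", "i"): 0,
    ("p", "u"): 0,    ("t", "u"): 3/16, ("k", "u"): 1/16,
}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Consonant marginal p(c)
p_c = {c: sum(p for (c2, _), p in p_cv.items() if c2 == c) for c in "ptk"}

H_C = H(p_c.values())
H_V_given_C = sum(
    p_c[c] * H([p / p_c[c] for (c2, _), p in p_cv.items() if c2 == c])
    for c in "ptk"
)
print(round(H_C, 3), round(H_V_given_C, 3), round(H_C + H_V_given_C, 2))
# 1.061 1.375 2.44
```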
More on entropy
• Entropy rate (per-word / per-letter entropy):
  H_rate = (1/n) H(X_{1n}) = − (1/n) Σ_{x_{1n}} p(x_{1n}) log p(x_{1n})
• Entropy of a language L:
  H_rate(L) = lim_{n→∞} (1/n) H(X_1, X_2, ..., X_n)
Mutual Information
H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)
H(X) − H(X | Y) = H(Y) − H(Y | X) = I(X, Y)
• I(X, Y) is the mutual information between X and Y.
• It is a measure of the dependence between two random variables, or the amount of information one random variable contains about the other.
Mutual Information (cont)
I(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)
        = Σ_x Σ_y p(x, y) log ( p(x, y) / (p(x) p(y)) )
• I(X, Y) = 0 if and only if X and Y are independent, i.e. H(X | Y) = H(X)
• H(X) = H(X) − H(X | X) = I(X, X): entropy is the self-information
More on Mutual Information
• Conditional Mutual Information:
  I(X; Y | Z) = I((X; Y) | Z) = H(X | Z) − H(X | Y, Z)
• Chain Rule:
  I(X_{1n}; Y) = I(X_1; Y) + ... + I(X_n; Y | X_1, ..., X_{n−1}) = Σ_{i=1..n} I(X_i; Y | X_1, ..., X_{i−1})
• Pointwise Mutual Information:
  I(x, y) = log ( p(x, y) / (p(x) p(y)) )
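A minimal sketch of pointwise mutual information; the bigram counts below are hypothetical, purely for illustration:

```python
from math import log2

N = 1_000_000       # total number of bigrams in a (hypothetical) corpus
count_xy = 150      # count of the pair (x, y)
count_x = 2_000     # count of x
count_y = 500       # count of y

# I(x, y) = log2( p(x, y) / (p(x) p(y)) )
pmi = log2((count_xy / N) / ((count_x / N) * (count_y / N)))
print(round(pmi, 2))  # 7.23
```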
Exercise 4
• Let p(x, y) be given by:

  X \ Y    0      1
    0     1/3    1/3
    1      0     1/3

• Find:
  (a) H(X), H(Y)
  (b) H(X | Y), H(Y | X)
  (c) H(X, Y)
  (d) I(X, Y)
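A sketch for checking the answers numerically after working the exercise by hand (floats rather than exact fractions, for brevity):

```python
from math import log2

p_xy = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 0, (1, 1): 1/3}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p_x = {x: sum(p for (x2, _), p in p_xy.items() if x2 == x) for x in (0, 1)}
p_y = {y: sum(p for (_, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}

H_X, H_Y = H(p_x.values()), H(p_y.values())
H_XY = H(p_xy.values())
H_X_given_Y = H_XY - H_Y   # chain rule
H_Y_given_X = H_XY - H_X
I_XY = H_X - H_X_given_Y
print(H_X, H_Y, H_XY, H_X_given_Y, H_Y_given_X, I_XY)
# ≈ 0.918  0.918  1.585  0.667  0.667  0.252
```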
Entropy and Linguistics
• Entropy is a measure of uncertainty: the more we know about something, the lower the entropy.
• If a language model captures more of the structure of the language, then its entropy should be lower.
• We can use entropy as a measure of the quality of our models.
Entropy and Linguistics
• Relative entropy (Kullback-Leibler divergence): a measure of how different two probability distributions are
• The average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q
• Noisy channel => next class!
Great!
See you next time!