Foundations of Statistical Natural Language Processing
2. Mathematical Foundations
2001. 7. 10.
Artificial Intelligence Laboratory
성경희
Contents – Part 1
1. Elementary Probability Theory
– Conditional probability
– Bayes’ theorem
– Random variable
– Joint and conditional distributions
– Standard distributions
2
Conditional probability (1/2)
 P(A) : the probability of the event A
 Ex1> A coin is tossed 3 times.
W = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
A = {HHT, HTH, THH} : 2 heads, P(A)=3/8
B = {HHH, HHT, HTH, HTT} : first toss is heads, P(B)=1/2
P(A  B)
 P(A | B) 
P(B)
: conditional probability
B
A
A B
W
3
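A minimal Python sketch (not part of the original slides) that checks Ex1 by enumerating the sample space; the event names A and B follow the slide.

```python
from itertools import product
from fractions import Fraction

# Enumerate the sample space W for three coin tosses (Ex1).
W = [''.join(t) for t in product('HT', repeat=3)]

A = {w for w in W if w.count('H') == 2}   # exactly two heads
B = {w for w in W if w[0] == 'H'}         # first toss is heads

def P(E):
    """Probability of event E under the uniform distribution on W."""
    return Fraction(len(E), len(W))

print(P(A), P(B), P(A & B) / P(B))        # 3/8, 1/2, and P(A | B) = 1/2
```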
Conditional probability (2/2)
 Multiplication rule
P(A  B)  P(B)P(A | B)  P(A)P(B | A)
 Chain rule
P(A1 ∩ … ∩ An) = P(A1) P(A2 | A1) P(A3 | A1 ∩ A2) … P(An | A1 ∩ … ∩ An-1)
 Two events A, B are independent
– P(A  B)  P(A)P(B)
– If P(B)  0, P(A)  P(A | B)
4
Bayes’ theorem (1/2)
 P(A) = P(A ∩ B̄) + P(A ∩ B) = P(A | B) P(B) + P(A | B̄) P(B̄)
Generally, if A ⊆ ∪_{i=1..n} Bi and the Bi are disjoint (Bi ∩ Bj = ∅ for i ≠ j):
P(A) = Σ_i P(A | Bi) P(Bi)
 Bayes' theorem:
P(B | A) = P(B ∩ A) / P(A) = P(A | B) P(B) / P(A)
For such a partition {Bi}:
P(Bj | A) = P(A | Bj) P(Bj) / P(A) = P(A | Bj) P(Bj) / Σ_{i=1..n} P(A | Bi) P(Bi)
5
Bayes’ theorem (2/2)
 Ex2> G : the event of the sentence having a parasitic gap
T : the event of the test being positive
P(G | T) = P(T | G) P(G) / [ P(T | G) P(G) + P(T | Ḡ) P(Ḡ) ]
         = (0.95 × 0.00001) / (0.95 × 0.00001 + 0.005 × 0.99999)
         ≈ 0.002
where P(T | G) = 0.95, P(T | Ḡ) = 0.005, and the prior P(G) = 0.00001.
 This poor result comes about because the prior probability
of a sentence containing a parasitic gap is so low.
6
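A small sketch (values taken from the figures on this slide) that reproduces the posterior in Ex2 via Bayes' theorem:

```python
# Ex2: P(G | T) by Bayes' theorem, using the figures from the slide.
p_g = 0.00001            # prior P(G): sentence has a parasitic gap
p_t_given_g = 0.95       # P(T | G): test positive given a gap
p_t_given_not_g = 0.005  # false-positive rate of the test

p_t = p_t_given_g * p_g + p_t_given_not_g * (1 - p_g)   # total probability P(T)
print(p_t_given_g * p_g / p_t)                           # ~0.0019, i.e. about 0.002
```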
Random variable
 Ex3> Random variable X for the sum of two dice.
X = sum of the first and second die:

First die \ Second die    1    2    3    4    5    6
                    1     2    3    4    5    6    7
                    2     3    4    5    6    7    8
                    3     4    5    6    7    8    9
                    4     5    6    7    8    9   10
                    5     6    7    8    9   10   11
                    6     7    8    9   10   11   12

S = {2, …, 12}

x        2     3     4     5    6     7    8     9    10    11    12
p(X=x)   1/36  1/18  1/12  1/9  5/36  1/6  5/36  1/9  1/12  1/18  1/36

probability mass function (pmf) : p(x) = P(X = x), X ~ p(x)
Σ_x p(x) = P(W) = 1
Expectation : E(X) = Σ_x x p(x)
Variance : Var(X) = E((X − E(X))²) = E(X²) − E²(X)
If X : W → {0, 1}, then X is called an indicator random variable or a Bernoulli trial.
7
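A short sketch (not in the original) that builds the pmf of Ex3 and checks E(X) and Var(X) = E(X²) − E²(X):

```python
from itertools import product
from fractions import Fraction

# Ex3: X = sum of two fair dice.
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
pmf = {x: Fraction(sums.count(x), 36) for x in range(2, 13)}   # p(X = x)

E = sum(x * p for x, p in pmf.items())        # E(X) = sum of x p(x)
E2 = sum(x * x * p for x, p in pmf.items())   # E(X^2)
print(E, E2 - E * E)                          # 7 and Var(X) = 35/6
```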
Joint and conditional distributions
 The joint pmf for two discrete random variables X, Y
– p(x, y) = P(X = x, Y = y)
 Marginal pmfs, which total up the probability mass for the
values of each variable separately.
– p_X(x) = Σ_y p(x, y),  p_Y(y) = Σ_x p(x, y)
 Conditional pmf
– p_{X|Y}(x | y) = p(x, y) / p_Y(y), for y such that p_Y(y) > 0
8
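A sketch with a made-up joint pmf (the values are hypothetical, for illustration only) showing marginal and conditional pmfs:

```python
from fractions import Fraction as F

# Hypothetical joint pmf p(x, y) over X, Y in {0, 1}.
joint = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 2), (1, 1): F(0, 1)}

xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})
p_X = {x: sum(joint[x, y] for y in ys) for x in xs}   # marginal: sum over y
p_Y = {y: sum(joint[x, y] for x in xs) for y in ys}   # marginal: sum over x

def p_x_given_y(x, y):
    """Conditional pmf p(x | y) = p(x, y) / p_Y(y), defined when p_Y(y) > 0."""
    return joint[x, y] / p_Y[y]

print(p_X, p_Y, p_x_given_y(0, 1))   # marginals, then p(0 | 1) = 1
```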
Standard distributions (1/3)
 Discrete distributions: The binomial distribution
– When one has a series of trials with only two outcomes, each trial being independent of all the others.
– The number r of successes out of n trials, given that the probability of success in any trial is p:
P(R = r) = b(r; n, p) = C(n, r) p^r (1 − p)^(n−r),   0 ≤ r ≤ n
where C(n, r) = n! / ((n − r)! r!) is the binomial coefficient (n choose r)
– Expectation : np, variance : np(1-p)
9
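A sketch of b(r; n, p) using math.comb, with a quick check of the stated mean np and variance np(1 − p):

```python
from math import comb

def binom_pmf(r, n, p):
    """b(r; n, p) = C(n, r) p^r (1 - p)^(n - r), for 0 <= r <= n."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.5
mean = sum(r * binom_pmf(r, n, p) for r in range(n + 1))
var = sum(r * r * binom_pmf(r, n, p) for r in range(n + 1)) - mean**2
print(mean, var)   # ~5.0 and ~2.5, i.e. np and np(1 - p)
```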
Standard distributions (2/3)
 Discrete distributions: The binomial distribution
[Plot: binomial pmfs b(r; n, 0.5) and b(r; n, 0.7) for n = 10, 20, 40; x-axis: count r from 0 to 40, y-axis: probability from 0 to 0.3]
10
Standard distributions (3/3)
 Continuous distributions: The normal distribution
– For the mean μ and the standard deviation σ, the probability density function (pdf) is:
n(x; μ, σ) = 1 / (√(2π) σ) · exp( −(x − μ)² / (2σ²) )
[Plot: normal pdfs N(0, 0.7), N(0, 1), N(1.5, 2); x-axis: value from −5 to 5, y-axis: density from 0 to 0.6]
11
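A sketch of the pdf n(x; μ, σ) as written above:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """n(x; mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

print(normal_pdf(0, 0, 1), normal_pdf(0, 0, 0.7))   # ~0.399 and ~0.570 at the peaks
```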
Contents – Part 2
2. Essential Information Theory
– Entropy
– Joint entropy and conditional entropy
– Mutual information
– The noisy channel model
– Relative entropy or Kullback-Leibler divergence
12
Shannon’s Information Theory
 The goal is to maximize the amount of information that one can
transmit over an imperfect communication channel such as
a noisy phone line.
 Theoretical maxima for data compression
– Entropy H
 Theoretical maxima for the transmission rate
– Channel Capacity
13
Entropy (1/4)
 The entropy H (or self-information) is the average
uncertainty of a single random variable X.
H(p) = H(X) = − Σ_{x∈X} p(x) log₂ p(x),  where p(x) is the pmf of X
 Entropy is a measure of uncertainty.
– The more we know about something, the lower the entropy will be.
– We can use entropy as a measure of the quality of our models.
 Entropy measures the amount of information in a random
variable (measured in bits).
14
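A minimal entropy function (added here, not from the slides) matching the definition above:

```python
from math import log2

def entropy(pmf):
    """H(X) = -sum p(x) log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return -sum(p * log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy([1/8] * 8))    # 3.0 bits: the 8-sided die of Ex7 (later slide)
```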
Entropy (2/4)
 The entropy of a weighted coin. The horizontal axis shows the
probability of the weighted coin coming up heads. The vertical axis
shows the entropy of tossing the corresponding coin once.
H(p) = − p log₂ p − (1 − p) log₂ (1 − p)
[Plot: H(p) as a function of p; referenced again on slide 23]
15
Entropy (3/4)
 Ex7> The result of rolling an 8-sided die. (uniform distribution)
– H(X) = − Σ_{i=1}^{8} p(i) log₂ p(i) = − Σ_{i=1}^{8} (1/8) log₂ (1/8) = − log₂ (1/8) = 3 bits

outcome   1    2    3    4    5    6    7    8
code      001  010  011  100  101  110  111  000
– Entropy : The average length of the message needed to transmit an
outcome of that variable.
 In terms of the expectation E:  H(X) = E[ log₂ (1 / p(X)) ]
16
Entropy (4/4)
 Ex8> Simplified Polynesian
letter        p    t    k    a    i    u
probability   1/8  1/4  1/8  1/4  1/8  1/8

– H(P) = − Σ_{i∈{p,t,k,a,i,u}} P(i) log₂ P(i) = 2½ bits
– We can design a code that on average takes 2½ bits to transmit a letter:

letter   p    t   k    a   i    u
code     100  00  101  01  110  111
– Entropy can be interpreted as a measure of the size of the ‘search
space’ consisting of the possible values of a random variable.
17
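A sketch checking Ex8: the entropy of the letter distribution and the expected length of the code in the table both come to 2.5 bits.

```python
from math import log2

# Ex8: simplified Polynesian letter probabilities and the code from the slide.
prob = {'p': 1/8, 't': 1/4, 'k': 1/8, 'a': 1/4, 'i': 1/8, 'u': 1/8}
code = {'p': '100', 't': '00', 'k': '101', 'a': '01', 'i': '110', 'u': '111'}

H = -sum(q * log2(q) for q in prob.values())          # entropy of P
avg_len = sum(prob[c] * len(code[c]) for c in prob)   # expected code length
print(H, avg_len)                                     # 2.5 2.5
```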
Joint entropy and conditional entropy (1/3)
 The joint entropy of a pair of discrete random variables X, Y ~ p(x, y)
– H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log₂ p(x, y)
 The conditional entropy
–
H


(Y | X)    p(x)H(Y | X  x)    p(x)   p(y | x)log 2 p(y | x) 
xX
xX
 yY

   p(x, y)log 2 p(y | x)
xX yY
 The chain rule for entropy
– H(X, Y) = H(X) + H(Y | X)
– H(X1, ..., Xn) = H(X1) + H(X2 | X1) + … + H(Xn | X1, ..., Xn-1)
18
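A sketch (with a hypothetical joint pmf, for illustration only) computing H(X, Y) directly and H(Y | X) via the chain rule H(Y | X) = H(X, Y) − H(X):

```python
from math import log2

def H_from_probs(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def joint_and_conditional(pxy):
    """Return H(X, Y) and H(Y | X) = H(X, Y) - H(X) for a joint pmf p(x, y)."""
    h_xy = H_from_probs(pxy.values())
    px = {}
    for (x, _), p in pxy.items():
        px[x] = px.get(x, 0) + p
    return h_xy, h_xy - H_from_probs(px.values())

pxy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.25}   # hypothetical p(x, y)
print(joint_and_conditional(pxy))                  # (1.5, ~0.689)
```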
Joint entropy and conditional entropy (2/3)
 Ex9> Simplified Polynesian revisited
– All words consist of sequences of CV (consonant-vowel) syllables
Joint pmf p(C, V) with marginal probabilities (per-syllable basis):

         p      t      k
   a    1/16   3/8    1/16     1/2
   i    1/16   3/16   0        1/4
   u    0      3/16   1/16     1/4
        1/8    3/4    1/8

Per-letter basis probabilities (see slide 8):

letter        p     t    k     a    i    u
probability   1/16  3/8  1/16  1/4  1/8  1/8

(doubling the per-letter figures gives the per-syllable marginals above)
19
Joint entropy and conditional entropy (3/3)
– H(C) = 2 · (1/8) · log₂ 8 + (3/4) · log₂ (4/3) = 9/4 − (3/4) log₂ 3 bits ≈ 1.061 bits
– H(V | C) = Σ_{c∈{p,t,k}} p(C = c) H(V | C = c)
           = (1/8) H(1/2, 1/2, 0) + (3/4) H(1/2, 1/4, 1/4) + (1/8) H(1/2, 0, 1/2)
           = 11/8 bits = 1.375 bits
– H(C, V) = H(C) + H(V | C) = 29/8 − (3/4) log₂ 3 bits ≈ 2.44 bits
  (computed from the joint distribution on the previous slide)
20
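A sketch verifying the three figures above from the joint distribution of Ex9 (the values are those in the table reconstructed on the previous slide):

```python
from math import log2
from fractions import Fraction as F

# Ex9: joint pmf p(C, V) for simplified Polynesian syllables.
joint = {('p', 'a'): F(1, 16), ('t', 'a'): F(3, 8),  ('k', 'a'): F(1, 16),
         ('p', 'i'): F(1, 16), ('t', 'i'): F(3, 16), ('k', 'i'): F(0, 1),
         ('p', 'u'): F(0, 1),  ('t', 'u'): F(3, 16), ('k', 'u'): F(1, 16)}

def H(probs):
    return -sum(float(p) * log2(float(p)) for p in probs if p > 0)

p_C = {c: sum(p for (c2, _), p in joint.items() if c2 == c) for c in 'ptk'}
H_C, H_CV = H(p_C.values()), H(joint.values())
print(round(H_C, 3), round(H_CV - H_C, 3), round(H_CV, 2))   # 1.061 1.375 2.44
```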
Mutual information (1/2)
 By the chain rule for entropy
– H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)
– H(X) − H(X | Y) = H(Y) − H(Y | X) : mutual information
 Mutual information between X and Y
– The amount of information one random variable contains about
another. (symmetric, non-negative)
– It is 0 only when two variables are independent.
– It grows not only with the degree of dependence, but also
according to the entropy of the variables.
– It is actually better to think of it as a measure of independence.
21
Mutual information (2/2)
– I(X; Y) = H(X) − H(X | Y) = H(X) + H(Y) − H(X, Y)
          = Σ_x p(x) log₂ (1/p(x)) + Σ_y p(y) log₂ (1/p(y)) + Σ_{x,y} p(x, y) log₂ p(x, y)
          = Σ_{x,y} p(x, y) log₂ [ p(x, y) / (p(x) p(y)) ]
  (the summand log₂ [ p(x, y) / (p(x) p(y)) ] = I(x, y), the pointwise MI)
– Since H(X | X) = 0, H(X) = H(X) − H(X | X) = I(X; X)
  (entropy is called self-information)
[Diagram: H(X) and H(Y) as overlapping regions inside H(X, Y); the overlap is I(X; Y), the non-overlapping parts are H(X | Y) and H(Y | X)]
– Conditional MI and a chain rule
  I(X; Y | Z) = I((X; Y) | Z) = H(X | Z) − H(X | Y, Z)
  I(X1, ..., Xn; Y) = I(X1; Y) + … + I(Xn; Y | X1, ..., Xn-1) = Σ_{i=1}^{n} I(Xi; Y | X1, ..., Xi-1)
22
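A sketch computing I(X; Y) = H(X) + H(Y) − H(X, Y) from a joint pmf; the two example joints are hypothetical.

```python
from math import log2

def mutual_information(pxy):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), from a joint pmf p(x, y)."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    H = lambda probs: -sum(p * log2(p) for p in probs if p > 0)
    return H(px.values()) + H(py.values()) - H(pxy.values())

fully_dependent = {(0, 0): 0.5, (1, 1): 0.5}                   # Y = X
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}   # X, Y independent
print(mutual_information(fully_dependent), mutual_information(independent))  # 1.0 0.0
```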
Noisy channel model
W (message from a finite alphabet) → Encoder → X (input to channel) → Channel p(y|x) → Y (output from channel) → Decoder → Ŵ (attempt to reconstruct the message based on the output)
 Channel capacity : the rate at which one can transmit information through the channel (with an optimal input distribution)
– C = max_{p(X)} I(X; Y)
 Binary symmetric channel (see the entropy curve H(p) on slide 15)
[Diagram: binary symmetric channel; 0 → 0 and 1 → 1 with probability 1 − p, 0 → 1 and 1 → 0 with probability p]
– I(X; Y) = H(Y) − H(Y | X) = H(Y) − H(p)
– max_{p(X)} I(X; Y) = 1 − H(p)
since entropy is non-negative, C ≤ 1
23
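A sketch of the binary symmetric channel capacity C = 1 − H(p), with H(p) the weighted-coin entropy from slide 15:

```python
from math import log2

def H(p):
    """Entropy of a weighted coin: -p log2 p - (1 - p) log2 (1 - p)."""
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1 - H(p)

print(bsc_capacity(0.0), bsc_capacity(0.1), bsc_capacity(0.5))   # 1.0 ~0.53 0.0
```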
Relative entropy or Kullback-Leibler divergence
 Relative entropy for two pmfs, p(x), q(x)
– A measure of how close two pmfs are.
– Non-negative, and D(p||q)=0 if p=q
D(p || q) = Σ_x p(x) log [ p(x) / q(x) ] = E_p[ log (p(X) / q(X)) ]
 I(X; Y) = D(p(x, y) || p(x) p(y))
– Conditional relative entropy and chain rule
D(p(y | x) || q(y | x)) = Σ_x p(x) Σ_y p(y | x) log [ p(y | x) / q(y | x) ]
D(p(x, y) || q(x, y)) = D(p(x) || q(x)) + D(p(y | x) || q(y | x))
24
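A sketch of D(p || q) for two pmfs over the same outcomes (the example distributions are made up):

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2 (p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}
q = {'a': 0.25, 'b': 0.25, 'c': 0.5}
print(kl_divergence(p, q), kl_divergence(p, p))   # 0.25 0.0
```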