Maximum Entropy Modeling and
its application to NLP
Utpal Garain
Indian Statistical Institute, Kolkata
http://www.isical.ac.in/~utpal
Language Engineering in Daily Life
In Our Daily Life
• Message, Email
– I can now type my message in my own language/script
• Oftentimes I need not write the full text
– My mobile understands what I intend to write!!
– I ha reac saf
– I have reached safely
• Even if I am afraid of typing in my own language (so
many letters, spellings are so difficult.. Uffss!!)
– I type my language in “English” and my computer or my
mobile types it in my language!!
– mera bharat..
In Our Daily Life
• I say “maa…” to my cell
– and my mother’s number is called!
• I have gone back to my old ways and stopped typing on the
computer/mobile
– I just write on a piece of paper or scribble on the screen
– My letters are typed!!
• Those days were so boring…
– If you are an existing customer press 1, otherwise press 2
– If you remember your customer ID press 1, otherwise press 2
– So on and so on..
– Now I just say “1”, “service”, “cricket” and the telephone understands what I
want!!
• My grandma can’t read English but she told me she found her name
written in Hindi in the Railway reservation chart
– Do Railway staff type so many names in Hindi every day?
– NO!! A computer does this
In Our Daily Life
• Cross Lingual Information Search
– I wanted to know what exactly happened that created
such a big inter-community problem in UP
– My friend told me to read a UP newspaper
– I don’t know Hindi :-(
– I gave query in the net in my language
– I got news articles from a UP local newspaper translated into
my language!! Unbelievable!!!
• Translation
– I don’t know French
– Still I can chat with my French friend :-)
In Our Daily Life
• I had a problem drawing the diagram for this
– ABCD is a parallelogram, DC is extended to E such that BCE is
an equilateral triangle.
– I gave it to my computer and it drew the diagram, showing
the steps!!
• I got three history books for my son and couldn’t
decide which one would be good for him
– My computer suggested Book 2 as it has better readability
for a grade-V student
– Later on, I found it is right!!!
• I type questions in the net and get answers (oftentimes
they are correct!!)
– How does it happen?!!!
Language
• Language is key to culture
– Communication
– Power and Influence
– Identity
– Cultural records
• The multilingual character of Indian society
– Need to preserve this character to move successfully
towards closer cooperation at a political, economic, and
social level
• Language is both the basis for communication and
a barrier
Role of Language
Courtesy: Simpkins and Diver
Language Engineering
• Application of knowledge of language to the
development of computer systems
– That can understand, interpret and generate human
language in all its forms
• Comprises a set of
– Techniques and
– Language resources
Components of Language Engg.
• Get material
– Speech, typed/printed/handwritten text, image, video
• Recognize the language and validate it
– Encoding scheme, distinguishing separate words..
• Build an understanding of the meaning
– Depending on the application you target
• Build the application
– Speech to text
• Generate and present the results
– Use monitor, printer, plotter, speaker, telephone…
Language Resources
• Lexicons
– Repository of words and knowledge about them
• Specialist lexicons
– Proper names, Terminology
– Wordnets
• Grammars
• Corpora
– Language sample
– Text, speech
– Helps to train a machine
NLP vs. Speech
• Consider these two types of problems:
– Problem set-1
• “I teach NLP at M.Tech. CS” => what’s this in Bengali?
• Scan newspapers, pick out the stories dealing with
forest fires, fill up a database with relevant information
– Problem set-2
• In someone’s utterance you might have difficulty
distinguishing “merry” from “very” or “pan” from
“ban”
• Context often overcomes this
– Please give me the ??? (pan/ban)
– The choice you made was ??? good.
NLP
• NLU community is more concerned about
– Parsing sentences
– Assigning semantic relations to the parts of a
sentence
– etc…
• Speech recognition community
– Predicting the next word on the basis of the words so
far
• Extracting the most likely words from the signal
• Deciding among these possibilities using knowledge
about the language
NLP
• NLU demands “understanding”
– Requires a lot of human effort
• Speech people rely on statistical technology
– Absence of any understanding limits its ability
• A combination of these two techniques is desirable
NLP
• Understanding
– Rule based
• POS tagging
• Tag using rule base
– I am going to make some tea
– I dislike the make of this shirt
– Use grammatical rules
– Statistical
• Use probability
• Probability of sequence/path
– I/PN  am going/VG  to/PREP  make/V-or-N?  some/ADJ  tea/N
– I/PN  dislike/V  the/ART  make/V-or-N?  of/PP  this shirt/N
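To make the statistical view concrete, here is a minimal sketch of scoring the two readings of the ambiguous word “make”. All probabilities below are invented toy numbers, not estimates from a real corpus:

```python
# Transition P(tag | previous tag) and emission P(word | tag), assumed values.
trans = {("PREP", "V"): 0.5, ("PREP", "N"): 0.1,
         ("ART", "V"): 0.05, ("ART", "N"): 0.6}
emit = {("make", "V"): 0.01, ("make", "N"): 0.002}

def score(prev_tag, tag, word):
    """Probability contribution of tagging `word` with `tag` after `prev_tag`."""
    return trans.get((prev_tag, tag), 0.0) * emit.get((word, tag), 0.0)

# "to make" (PREP before): the verb reading wins
print(score("PREP", "V", "make"), score("PREP", "N", "make"))
# "the make" (ART before): the noun reading wins
print(score("ART", "V", "make"), score("ART", "N", "make"))
```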
Basic Probability
Probability Theory
• X: random variable
– Uncertain outcome of some event
• V(X): outcome
– Example event: open to some page of an English book and
X is the word you pointed to
– V(X) ranges over all possible words of English
• If x is a possible outcome of X, i.e. x ∈ V(X)
– P(X=x) or P(x)
• wi is the i-th word; the probability of picking the i-th word is
$P(w = w_i) = \frac{|w_i|}{\sum_j |w_j|}$
• if U denotes the universe of all possible outcomes then the
denominator is |U|.
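A minimal sketch of this estimate, counting over a tiny in-line sample in place of a real corpus:

```python
from collections import Counter

sample = "the cat ate and the dog slept near the cat".split()
counts = Counter(sample)
total = sum(counts.values())   # |U|, the number of outcomes observed

def p(word):
    return counts[word] / total

print(p("the"))   # 3/10 = 0.3
print(p("cat"))   # 2/10 = 0.2
```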
Conditional Probability
• Pick two words that occur in a row -> w1 and w2
– Or, given the first word, guess the second word
– Choice of w1 changes things
$P(w_2 = w_j \mid w_1 = w_i) = \frac{|w_1 = w_i,\ w_2 = w_j|}{|w_1 = w_i|}$
• Bayes’ law: P(x|y) = P(x) * P(y|x)/P(y)
– In counts: |x,y|/|y| = (|x|/|U|) · (|x,y|/|x|) / (|y|/|U|)
• Given some evidence e, we want to pick the conclusion c
that maximizes P(c|e)… this can be done
– if we know P(c|e) = P(c) * P(e|c) / P(e)
• Once evidence is fixed then the denominator stays
the same for all conclusions.
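Since the denominator is fixed, picking the best conclusion reduces to maximizing P(c) * P(e|c). A minimal sketch, with invented toy numbers for the prior and likelihood:

```python
prior = {"c1": 0.7, "c2": 0.3}          # P(c), assumed
likelihood = {"c1": 0.1, "c2": 0.4}     # P(e|c) for the observed e, assumed

best = max(prior, key=lambda c: prior[c] * likelihood[c])
print(best)   # c2: 0.3*0.4 = 0.12 beats 0.7*0.1 = 0.07
```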
Conditional Probability
• P(w,x|y,z) = P(w,x) P(y,z|w,x) / P(y,z)
• Generalization:
– P(w1,w2,…,wn) = P(w1) P(w2|w1) P(w3|w1,w2)
…. P(wn|w1,w2,…,wn-1)
– P(w1,w2,…,wn|x) = P(w1|x) P(w2|w1,x)
P(w3|w1,w2,x) …. P(wn|w1,w2,…,wn-1,x)
– P(W1,n = w1,n)
• Example:
– John went to ?? (hospital, pink, number, if)
Conditional Probability
• P(w1,n|speech signal)
= P(w1,n) P(signal|w1,n)/P(signal)
• Say, the candidate words at three positions are (a1,a2,a3), (b1,b2),
(c1,c2,c3,c4)
• Consider P(a2,b1,c4|signal) and P(a2,b1,c4)
• P(a2,b1,c4|signal)
∝ P(a2,b1,c4) * P(signal|a2,b1,c4), since P(signal) is fixed
• Example:
– The {big / pig} dog
– P(the big dog) = P(the) P(big|the) P(dog|the big)
– P(the pig dog) = P(the) P(pig|the) P(dog|the pig)
Application Building
• Predictive Text Entry
– Tod => Today
– I => I
– ha => have
– a => a
– ta => take, tal => talk
– Tod => today, tod, toddler
• Techniques
– Probability of the word
– Probability of the word at position “x”
– Conditional probability
• What is the probability of writing “have” after writing the two words
“today” and “I”?
• Resource
– Language corpus
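A minimal sketch of this technique: rank completions of a prefix by P(word | previous two words), counted over a toy corpus (a real system would use a large language corpus):

```python
from collections import Counter

corpus = ("today i have a meeting . today i have reached safely . "
          "today i have a talk").split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def complete(w1, w2, prefix):
    """Candidates starting with `prefix`, ranked by trigram count after (w1, w2)."""
    cands = {w3: n for (a, b, w3), n in trigrams.items()
             if a == w1 and b == w2 and w3.startswith(prefix)}
    return sorted(cands, key=cands.get, reverse=True)

print(complete("today", "i", "ha"))   # ['have']
```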
Application Building
• Transliteration
– Kamal => কমল
• Indian Railways did it before
– Rule based
• Kazi => কাজী
• Ka => ক or কা
– Difficult to extend it to other languages
• Statistical model
– N-gram modeling
– Kamal => ka am ma al; kazi => ka az zi
– কমল => কম মল; কাজী => কা াজ জী
– Alignment of pairs (a difficult computational problem)
Transliteration
• Probability is computed for
– P(ka=>ক), P(ক), P(কমল ), etc.
• The most probable word is the output
• Advantage:
– Easily extendable to any language pairs
– Multiple choices are given (according to rank)
• Resource needed
– Name pairs
– Language model
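A minimal sketch of scoring a candidate transliteration from character-bigram pairs. The pair probabilities here are invented toy numbers; learning them from aligned name pairs is the hard alignment problem noted above:

```python
# P(target bigram | source bigram), assumed values
pair_prob = {("ka", "ক"): 0.6, ("am", "কম"): 0.7,
             ("ma", "ম"): 0.5, ("al", "মল"): 0.6}

def score(src_bigrams, tgt_bigrams):
    """Product of pair probabilities; 0 if any pair is unseen."""
    p = 1.0
    for s, t in zip(src_bigrams, tgt_bigrams):
        p *= pair_prob.get((s, t), 0.0)
    return p

print(score(["ka", "am", "ma", "al"], ["ক", "কম", "ম", "মল"]))   # 0.126
```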
Statistical Models and Methods
Statistical models and methods
• Intuition to make crude probability judgments
• Entropy
– Situation:  No occurrence   1st occurrence   2nd occurrence   Both
– Prob.:      0.5             0.125            0.125            0.25
• Expected code length: [1*(1/2) + 2*(1/4) + 3*(1/8 + 1/8)] bits = 1.75 bits
• A random variable W takes on one of several values V(W); entropy:
$H(W) = -\sum_{w \in V(W)} P(w) \log P(w)$
• -log P(w) bits are required to code w
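A minimal sketch of this computation, applied to the distribution in the table above; the expected value is 1.75 bits:

```python
import math

probs = {"no_occurrence": 0.5, "first_only": 0.125,
         "second_only": 0.125, "both": 0.25}

H = -sum(p * math.log2(p) for p in probs.values())
print(H)   # 1.75
```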
Use in Speech
• {the, a, cat, dog, ate, slept, here, there}
• If each word is used equally often and independently
• Then the entropy of the language is
-P(the) log P(the) - P(a) log P(a) - …
= 8 · (-(1/8) log(1/8))
= 3 bits/word
– In general, $H(L) = \lim_{n\to\infty} -\frac{1}{n} \sum_{w_{1,n}} P(w_{1,n}) \log P(w_{1,n})$
Markov Chain
• If we remove the numbers then it’s a finite state automaton,
which is an acceptor as well as a generator
• Adding the probabilities, we make it a probabilistic finite state
automaton => Markov Chain
• Assuming all states are accepting states (Markov Process), we
can compute the prob. of generating a given string
• Product of probabilities of the arcs traversed in generating
the string.
Cross entropy
• Per-word entropy of the previous model is
– -[0.5 log(1/2) + 0.5 log(1/2)] = 1
– At each state there are only two equiprobable choices, so
• H(p) = 1 bit/word
• If we consider each word equiprobable then
H(pm) = 3 bits/word
• Cross Entropy
– The cross entropy of a set of random variables W1,n, where the
correct model is P(w1,n) but the probabilities are estimated
using the model Pm(w1,n), is
$H(W_{1,n}, P_m) = -\sum_{w_{1,n}} P(w_{1,n}) \log P_m(w_{1,n})$
Cross entropy
• Per-word cross entropy is $\frac{1}{n} H(W_{1,n}, P_m)$
• Per-word entropy of the given Markov Chain: 1
• If we slightly change the model:
– Outgoing probabilities are 0.75 and 0.25
– per-word cross entropy becomes
• -[1/2 log(3/4) + 1/2 log(1/4)]
= -(1/2)[log 3 - log 4 + log 1 - log 4]
= -(1/2)[log 3 - 4]
= 2 - (log 3)/2 ≈ 2 - 1.585/2 ≈ 1.2
• For an incorrect model:
– H(W1,n) ≤ H(W1,n, PM)
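A minimal sketch of the two-choice case above: the true distribution is (0.5, 0.5), the changed model is (0.75, 0.25):

```python
import math

p  = [0.5, 0.5]      # correct model
pm = [0.75, 0.25]    # estimated model

H_cross = -sum(pi * math.log2(qi) for pi, qi in zip(p, pm))
print(H_cross)   # about 1.21 bits/word, versus 1 bit under the true model
```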
Cross entropy
• Cross entropy of a language
– A stochastic process is ergodic if its statistical
properties can be computed from a single, sufficiently
large sample of the process.
– Assuming L is an ergodic language,
– the cross entropy of L is
$H(L, P_M) = \lim_{n\to\infty} -\frac{1}{n} \sum_{w_{1,n}} P(w_{1,n}) \log P_M(w_{1,n}) = \lim_{n\to\infty} -\frac{1}{n} \log P_M(w_{1,n})$
Corpus
• Brown corpus
– Coverage
• 500 text segments of 2,000 words each
• Press, reportage etc.: 44
• Press, editorial etc.: 27
• Press, reviews: 17
• Religion books, periodicals..: 17
• …
Trigram Model
Trigram models
• N-gram model…
• P(wn|w1…wn-1) ≈ P(wn|wn-2 wn-1)
• P(w1,n) = P(w1) P(w2|w1) P(w3|w1w2) .. P(wn|w1,n-1)
≈ P(w1) P(w2|w1) P(w3|w1w2) .. P(wn|wn-2,n-1)
• With pseudo words w-1 and w0:
$P(w_{1,n}) = \prod_{i=1}^{n} P(w_i \mid w_{i-2,i-1})$
• Estimated from counts:
$P_e(w_i \mid w_{i-2,i-1}) = \frac{C(w_{i-2,i})}{C(w_{i-2,i-1})}$
• Example: “to create such”
– #(to create such) = ?
– #(to create) = ?
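A minimal sketch of this count-based estimate, using the “to create such” example over a toy corpus:

```python
from collections import Counter

corpus = "we want to create such a model and tried to create it again".split()

tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
bi  = Counter(zip(corpus, corpus[1:]))

def p_trigram(w1, w2, w3):
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

print(p_trigram("to", "create", "such"))   # #(to create such) / #(to create) = 1/2
```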
Trigram as Markov Chain
• It is not possible to determine the state of the
machine simply on the basis of the last output
(the last two outputs are needed)
• Markov chain of order 2
Problem of sparse data
– Jelinek studied
– 1,500,000 word corpus
– Extracted trigrams
– Applied to 300,000 words
– 25% trigram types were missing
Maximum Entropy Model
An example
• Machine translation
– Star in English
• Translation in Hindi:
– सितारा, तारा, तारक, प्रसिद्ध अभिनेता, भाग्य
• First statistics of this process
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध
अभिनेता) + p(भाग्य) = 1
• There are an infinite number of models p for
which this identity holds
• One model
– p(सितारा) = 1
– This model always predicts सितारा
• Another model
– p(तारा) = ½
– p(प्रसिद्ध अभिनेता) = ½
• These models offend our sensibilities
– The expert always chose from the five choices
– How can we justify either of these probability
distributions?
– These models make bold assumptions without empirical
justification
• What we know
– Experts chose exclusively from these five words
• The most intuitively appealing model is
– p(सितारा) = 1/5
– p(तारा) = 1/5
– p(तारक) = 1/5
– p(प्रसिद्ध अभिनेता) = 1/5
– p(भाग्य) = 1/5
• The most uniform model subject to our
knowledge
• Suppose we notice that the expert chose
either सितारा or तारा 30% of the time
• We apply this knowledge to update our model
– p(सितारा) + p(तारा) = 3/10
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध
अभिनेता) + p(भाग्य) = 1
• Many probability distributions are consistent with
the above constraints
• A reasonable choice for p is again the most
uniform
• i.e. the distribution which allocates its
probability as evenly as possible, subject to
the constraints
– p(सितारा) = 3/20
– p(तारा) = 3/20
– p(तारक) = 7/30
– p(प्रसिद्ध अभिनेता) = 7/30
– p(भाग्य) = 7/30
• Say we inspect the data once more and notice
another interesting fact
– In half the cases, the expert chose either सितारा or
प्रसिद्ध अभिनेता
• So we add a third constraint
– p(सितारा) + p(तारा) = 3/10
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध
अभिनेता) + p(भाग्य) = 1
– p(सितारा) + p(प्रसिद्ध अभिनेता) = ½
• Now if we want to look for the most uniform p
satisfying the constraints the choice is not as
obvious
• As complexity is added, we face two difficulties
– What is meant by “uniform”, and how can we measure
the uniformity of a model?
– How will we find the most uniform model subject to a
set of constraints?
• The maximum entropy method (E. T. Jaynes) answers
both of these questions
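A minimal sketch of the second question, solved numerically: find the most uniform p over the five translations under the three constraints above, using a generic constrained optimizer (scipy is assumed to be available; the index order stands in for the five Hindi words):

```python
import numpy as np
from scipy.optimize import minimize

# order: [sitara, tara, tarak, prasiddh abhineta, bhagya]
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(sitara)+p(tara) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[3] - 0.5},  # p(sitara)+p(pr. abhineta) = 1/2
]
res = minimize(neg_entropy, np.full(5, 0.2), method="SLSQP",
               constraints=constraints, bounds=[(0, 1)] * 5)
print(res.x.round(3))   # the maximum entropy solution, no longer obvious by hand
```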
Maximum Entropy Modeling
• Consider a random process that produces an output value
y (a member of a finite set, Y)
• For the translation example just considered, the process
generates a translation of the word star, and the output y
can be any word in the set {सितारा, तारा, तारक, प्रसिद्ध
अभिनेता, भाग्य}.
• In generating y, the process may be influenced by some
contextual information x, a member of a finite set X.
• In the present example, this information could include the
words in the English sentence surrounding star.
• Our task is to construct a stochastic model that accurately
represents the behavior of the random process.
Maximum Entropy Modeling
• Such a model is a method of estimating the
conditional probability that, given a context x,
the process will output c.
• We will denote by p(c|x) the probability that the
model assigns to c in context x.
• We will denote by P the set of all conditional
probability distributions. Thus a model p(c|x) is,
by definition, just an element of P.
Training Data
• A large number of samples
– (x1,c1), (x2, c2) . . . (xN, cN).
• Each sample would consist of a phrase x
containing the words surrounding star, together
with the translation c of star that the process
produced.
• Empirical probability distribution p̃:
$\tilde{p}(x, c) = \frac{1}{N} \times \text{number of times } (x, c) \text{ occurs in the sample}$
Features
• The features fi are binary functions that can be
used to characterize any property of a pair (ẋ, c),
• where ẋ is a vector representing an input element and c
is the class label
• Example: f(ẋ, c) = 1 if c = प्रसिद्ध अभिनेता and star
follows cinema; otherwise f(ẋ, c) = 0
Features
• We have two things in hand
– Empirical distribution
– The model p(c|x)
• The expected value of f with respect to the
empirical distribution is
$\tilde{p}(f) = \sum_{x,c} \tilde{p}(x, c) f(x, c)$
• The expected value of f with respect to the
model p(c|x) is
$p(f) = \sum_{x,c} \tilde{p}(x)\, p(c \mid x)\, f(x, c)$
• Our constraint is
$p(f) = \tilde{p}(f)$
Classification
• For a given ẋ we need to know its class label c
– p(ẋ, c)
• Loglinear models
– General and very important class of models for
classification of categorical variables
– Logistic regression is another example
$p(x, c) = \frac{1}{Z} \prod_{i=1}^{K} \alpha_i^{f_i(x, c)}$
$\log p(x, c) = -\log Z + \sum_{i=1}^{K} f_i(x, c) \log \alpha_i$
– K is the number of features, αi is the weight for the
feature fi, and Z is a normalizing constant used to
ensure that a probability distribution results.
An example
• Text classification
• ẋ consists of a single element, indicating presence
or absence of the word profit in the article
• Classes, c
– two classes; earnings or not
• Features
– Two features
• f1: 1 if and only if the article is “earnings” and the word
profit is in it
• f2: a filler feature (fK+1), defined as
$f_{K+1}(x, c) = C - \sum_{i=1}^{K} f_i(x, c)$
– C is the greatest possible feature sum
An example
ẋ (profit)   c (“earnings”)   f1   f2   ∏i αi^fi(ẋ,c)
0            0                0    1    2
0            1                0    1    2
1            0                0    1    2
1            1                1    0    4
• Parameters:
– log α1 = 2.0, log α2 = 1.0 (i.e. α1 = 4, α2 = 2)
• Z = 2 + 2 + 2 + 4 = 10
• p(0,0) = p(0,1) = p(1,0) = 2/10 = 0.2
• p(1,1) = 4/10 = 0.4
• A data set that follows the same empirical
distribution
– ((0,0), (0,1), (1,0), (1,1), (1,1))
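A minimal sketch that reproduces the worked example above (α1 = 4, α2 = 2, so log α1 = 2 and log α2 = 1 in base 2):

```python
alphas = [4.0, 2.0]

def features(x, c):
    f1 = 1 if (x == 1 and c == 1) else 0   # profit present and class "earnings"
    f2 = 1 - f1                            # filler feature, C = 1
    return [f1, f2]

def unnorm(x, c):
    """prod_i alpha_i ** f_i(x, c), the unnormalized score."""
    out = 1.0
    for a, f in zip(alphas, features(x, c)):
        out *= a ** f
    return out

events = [(x, c) for x in (0, 1) for c in (0, 1)]
Z = sum(unnorm(x, c) for x, c in events)       # 2 + 2 + 2 + 4 = 10
for x, c in events:
    print((x, c), unnorm(x, c) / Z)            # 0.2, 0.2, 0.2, 0.4
```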
Computation of i and Z
• We search for a model p* such that
E p* f i  E ~p f i
• Empirical expectation


1
~
~
E p f i   p ( x , c) f i ( x , c) 

N
x ,c
N
 f i ( x j , c)
j 1
• In general Epfi cannot be computed efficiently
as it would require summing over all possible
combinations of ẋ and c, a huge or infinite set.
• The following approximation is used instead:
$E_p f_i \approx \sum_{x,c} \tilde{p}(x)\, p(c \mid x)\, f_i(x, c) = \frac{1}{N} \sum_{j=1}^{N} \sum_{c} p(c \mid x_j)\, f_i(x_j, c)$
Generalized Iterative Scaling Algo
• Step 1
– For all i = 1, …, K+1, initialize αi(1).
– Compute the empirical expectations Ep̃ fi
– Set n = 1
• Step 2
– Compute p(n)(x,c) for the distribution p(n) given by
the {αj(n)} for each element (x,c) in the training
set:
$p^{(n)}(x, c) = \frac{1}{Z} \prod_{i=1}^{K+1} \left(\alpha_i^{(n)}\right)^{f_i(x, c)}, \quad Z = \sum_{x,c} \prod_{i=1}^{K+1} \left(\alpha_i^{(n)}\right)^{f_i(x, c)}$
Generalized Iterative Scaling Algo
• Step 3
– Compute Ep(n) fi for all i = 1, …, K+1 according to the
formula shown before
• Step 4
– Update the parameters αi:
$\alpha_i^{(n+1)} = \alpha_i^{(n)} \left( \frac{E_{\tilde{p}} f_i}{E_{p^{(n)}} f_i} \right)^{1/C}$
• Step 5
– If the parameters have converged, stop; otherwise
increment n and go to Step 2.
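A minimal sketch of these steps on the earnings example above: two features whose sum is always C = 1, the five training samples, and the multiplicative update. A real implementation handles many features and numerical stability, which this toy version ignores:

```python
data = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]   # (x, c) samples
classes = (0, 1)
N = len(data)
C = 1.0

def features(x, c):
    f1 = 1 if (x == 1 and c == 1) else 0
    return [f1, 1 - f1]          # the filler feature keeps the sum at C = 1

def p_cond(alphas, c, x):
    """Model p(c | x) induced by the current alphas."""
    def unnorm(cc):
        out = 1.0
        for a, f in zip(alphas, features(x, cc)):
            out *= a ** f
        return out
    return unnorm(c) / sum(unnorm(cc) for cc in classes)

# Step 1: initialize and compute empirical expectations E_emp f_i
alphas = [1.0, 1.0]
emp = [sum(features(x, c)[i] for x, c in data) / N for i in range(2)]

for n in range(100):
    # Steps 2-3: E_model f_i = (1/N) sum_j sum_c p(c|x_j) f_i(x_j, c)
    model = [0.0, 0.0]
    for x, _ in data:
        for c in classes:
            pc = p_cond(alphas, c, x)
            fs = features(x, c)
            for i in range(2):
                model[i] += pc * fs[i] / N
    # Step 4: multiplicative update
    alphas = [a * (e / m) ** (1.0 / C)
              for a, e, m in zip(alphas, emp, model)]

print(alphas)
print(p_cond(alphas, 1, 1))   # converges to the empirical value 2/3
```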
Application of MaxEnt in NLP
• POS tagger
– Stanford tagger
• At our lab
– Honorific information
– Use of this information for Anaphora Resolution
– BioNLP
• Entity tagger
• Stanford Univ. has open source code for MaxEnt
• You can also use their implementation for your
own task.
HMM, MaxEnt and CRF
• HMM
– Jointly models the observation and class sequences
• MaxEnt
– Makes local, independent decisions
• CRF
– Combines the strengths of HMM and MaxEnt