Maximum Entropy Modeling and
its application to NLP
Utpal Garain
Indian Statistical Institute, Kolkata
http://www.isical.ac.in/~utpal
Language Engineering in Daily Life
In Our Daily Life
• Message, Email
– I can now type my message in my own language/script
• Oftentimes I need not write the full text
– My mobile understands what I intend to write!!
– I ha reac saf
– I have reached safely
• Even if I am afraid of typing in my own language (so
many letters, spellings are so difficult.. Uffss!!)
– I type my language in “English” and my computer or my
mobile types it in my language!!
– mera bharat..
In Our Daily Life
• I say “maa…” to my cell
– and my mother’s number is called!
• I have gone back to my old ways and stopped typing on the
computer/mobile
– I just write on a piece of paper or scribble on the screen
– My letters are typed!!
• Those days were so boring…
– If you are an existing customer press 1, otherwise press 2
– If you remember your customer ID press 1, otherwise press 2
– So on and so on..
– Now I just say “1”, “service”, “cricket” and the telephone understands what I
want!!
• My grandma can’t read English but she told me she found her name
written in Hindi in the Railway reservation chart
– Do Railway staff type so many names in Hindi every day?
– NO!! A computer does this
In Our Daily Life
• Cross Lingual Information Search
– I wanted to know what exactly happened that created
such a big inter-community problem in UP
– My friend told me to read a UP newspaper
– I don’t know Hindi :-(
– I gave query in the net in my language
– I got news articles from a UP local newspaper translated into
my language!! Unbelievable!!!
• Translation
– I don’t know French
– Still I can chat with my French friend :-)
In Our Daily Life
• I had a problem drawing the diagram for this
– ABCD is a parallelogram, DC is extended to E such that BCE is
an equilateral triangle.
– I gave it to my computer and it drew the diagram, showing
the steps!!
• I got three history books for my son and couldn’t
decide which one would be good for him
– My computer suggested Book 2 as it has better readability
for a grade-V student
– Later on, I found it is right!!!
• I type questions in the net and get answers (oftentimes
they are correct!!)
– How does it happen?!!!
Language
• Language is key to culture
– Communication
– Power and Influence
– Identity
– Cultural records
• The multilingual character of Indian society
– Need to preserve this character to move successfully
towards closer cooperation at a political, economic, and
social level
• Language is both the basis for communication and
a barrier
Role of Language
Courtesy: Simpkins and Diver
Language Engineering
• Application of knowledge of language to the
development of computer systems
– That can understand, interpret and generate human
language in all its forms
• Comprises a set of
– Techniques and
– Language resources
Components of Language Engg.
• Get material
– Speech, typed/printed/handwritten text, image, video
• Recognize the language and validate it
– Encoding scheme, distinguishing separate words..
• Build an understanding of the meaning
– Depending on the application you target
• Build the application
– Speech to text
• Generate and present the results
– Use monitor, printer, plotter, speaker, telephone…
Language Resources
• Lexicons
– Repository of words and knowledge about them
• Specialist lexicons
– Proper names, Terminology
– Wordnets
• Grammars
• Corpora
– Language sample
– Text, speech
– Helps to train a machine
NLP vs. Speech
• Consider these two types of problems:
– Problem set-1
• “I teach NLP at M.Tech. CS” => what’s this in Bengali?
• Scan newspapers, pick out the stories dealing with
forest fires, fill up a database with relevant information
– Problem set-2
• In someone’s utterance you might have difficulty
distinguishing “merry” from “very” or “pan” from
“ban”
• Context often overcomes this
– Please give me the ??? (pan/ban)
– The choice you made was ??? good.
NLP
• NLU community is more concerned about
– Parsing sentences
– Assigning semantic relations to the parts of a
sentence
– etc…
• Speech recognition community
– Predicting the next word on the basis of the words so
far
• Extracting the most likely words from the signal
• Deciding among these possibilities using knowledge
about the language
NLP
• NLU demands “understanding”
– Requires a lot of human effort
• Speech people rely on statistical technology
– Absence of any understanding limits its ability
• A combination of these two techniques is desirable
NLP
• Understanding
– Rule based
• POS tagging
• Tag using rule base
– I am going to make some tea
– I dislike the make of this shirt
– Use grammatical rules
– Statistical
• Use probability
• Probability of sequence/path
– I/PN  am going/VG  to/PREP  make/V-or-N?  some/ADJ  tea/N
– I/PN  dislike/V  the/ART  make/V-or-N?  of/PP  this shirt/N
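To make the statistical view concrete, here is a minimal sketch of scoring the two readings of the ambiguous word “make”. All probabilities below are invented toy numbers, not estimates from a real corpus:

```python
# Transition P(tag | previous tag) and emission P(word | tag), assumed values.
trans = {("PREP", "V"): 0.5, ("PREP", "N"): 0.1,
         ("ART", "V"): 0.05, ("ART", "N"): 0.6}
emit = {("make", "V"): 0.01, ("make", "N"): 0.002}

def score(prev_tag, tag, word):
    """Probability contribution of tagging `word` with `tag` after `prev_tag`."""
    return trans.get((prev_tag, tag), 0.0) * emit.get((word, tag), 0.0)

# "to make" (PREP before): the verb reading wins
print(score("PREP", "V", "make"), score("PREP", "N", "make"))
# "the make" (ART before): the noun reading wins
print(score("ART", "V", "make"), score("ART", "N", "make"))
```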
Basic Probability
Probability Theory
• X: random variable
– Uncertain outcome of some event
• V(X): outcome
– Example event: open to some page of an English book and
X is the word you pointed to
– V(X) ranges over all possible words of English
• If x is a possible outcome of X, i.e. x ∈ V(X)
– P(X=x) or P(x)
• wi is the i-th word; the probability of picking the i-th word is
$P(w = w_i) = \frac{|w_i|}{\sum_j |w_j|}$
• if U denotes the universe of all possible outcomes then the
denominator is |U|.
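A minimal sketch of this estimate, counting over a tiny in-line sample in place of a real corpus:

```python
from collections import Counter

sample = "the cat ate and the dog slept near the cat".split()
counts = Counter(sample)
total = sum(counts.values())   # |U|, the number of outcomes observed

def p(word):
    return counts[word] / total

print(p("the"))   # 3/10 = 0.3
print(p("cat"))   # 2/10 = 0.2
```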
Conditional Probability
• Pick two words that occur in a row -> w1 and w2
– Or, given the first word, guess the second word
– Choice of w1 changes things
$P(w_2 = w_j \mid w_1 = w_i) = \frac{|w_1 = w_i,\ w_2 = w_j|}{|w_1 = w_i|}$
• Bayes’ law: P(x|y) = P(x) * P(y|x)/P(y)
– In counts: |x,y|/|y| = (|x|/|U|) · (|x,y|/|x|) / (|y|/|U|)
• Given some evidence e, we want to pick the conclusion c
that maximizes P(c|e)… this can be done
– if we know P(c|e) = P(c) * P(e|c) / P(e)
• Once evidence is fixed then the denominator stays
the same for all conclusions.
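Since the denominator is fixed, picking the best conclusion reduces to maximizing P(c) * P(e|c). A minimal sketch, with invented toy numbers for the prior and likelihood:

```python
prior = {"c1": 0.7, "c2": 0.3}          # P(c), assumed
likelihood = {"c1": 0.1, "c2": 0.4}     # P(e|c) for the observed e, assumed

best = max(prior, key=lambda c: prior[c] * likelihood[c])
print(best)   # c2: 0.3*0.4 = 0.12 beats 0.7*0.1 = 0.07
```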
Conditional Probability
• P(w,x|y,z) = P(w,x) P(y,z|w,x) / P(y,z)
• Generalization:
– P(w1,w2,…,wn) = P(w1) P(w2|w1) P(w3|w1,w2)
…. P(wn|w1,w2,…,wn-1)
– P(w1,w2,…,wn|x) = P(w1|x) P(w2|w1,x)
P(w3|w1,w2,x) …. P(wn|w1,w2,…,wn-1,x)
– P(W1,n = w1,n)
• Example:
– John went to ?? (hospital, pink, number, if)
Conditional Probability
• P(w1,n|speech signal)
= P(w1,n) P(signal|w1,n)/P(signal)
• Say, the candidate words at three positions are (a1,a2,a3), (b1,b2),
(c1,c2,c3,c4)
• Consider P(a2,b1,c4|signal) and P(a2,b1,c4)
• P(a2,b1,c4|signal)
∝ P(a2,b1,c4) * P(signal|a2,b1,c4), since P(signal) is fixed
• Example:
– The {big / pig} dog
– P(the big dog) = P(the) P(big|the) P(dog|the big)
– P(the pig dog) = P(the) P(pig|the) P(dog|the pig)
Application Building
• Predictive Text Entry
– Tod => Today
– I => I
– ha => have
– a => a
– ta => take, tal => talk
– Tod => today, tod, toddler
• Techniques
– Probability of the word
– Probability of the word at position “x”
– Conditional probability
• What is the probability of writing “have” after writing the two words
“today” and “I”?
• Resource
– Language corpus
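A minimal sketch of this technique: rank completions of a prefix by P(word | previous two words), counted over a toy corpus (a real system would use a large language corpus):

```python
from collections import Counter

corpus = ("today i have a meeting . today i have reached safely . "
          "today i have a talk").split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def complete(w1, w2, prefix):
    """Candidates starting with `prefix`, ranked by trigram count after (w1, w2)."""
    cands = {w3: n for (a, b, w3), n in trigrams.items()
             if a == w1 and b == w2 and w3.startswith(prefix)}
    return sorted(cands, key=cands.get, reverse=True)

print(complete("today", "i", "ha"))   # ['have']
```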
Application Building
• Transliteration
– Kamal => কমল
• Indian Railways did it before
– Rule based
• Kazi => কাজী
• Ka => ক or কা
– Difficult to extend it to other languages
• Statistical model
– N-gram modeling
– Kamal => ka am ma al; kazi => ka az zi
– কমল => কম মল; কাজী => কা াজ জী
– Alignment of pairs (a difficult computational problem)
Transliteration
• Probability is computed for
– P(ka=>ক), P(ক), P(কমল ), etc.
• The most probable word is the output
• Advantage:
– Easily extendable to any language pairs
– Multiple choices are given (according to rank)
• Resource needed
– Name pairs
– Language model
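A minimal sketch of scoring a candidate transliteration from character-bigram pairs. The pair probabilities here are invented toy numbers; learning them from aligned name pairs is the hard alignment problem noted above:

```python
# P(target bigram | source bigram), assumed values
pair_prob = {("ka", "ক"): 0.6, ("am", "কম"): 0.7,
             ("ma", "ম"): 0.5, ("al", "মল"): 0.6}

def score(src_bigrams, tgt_bigrams):
    """Product of pair probabilities; 0 if any pair is unseen."""
    p = 1.0
    for s, t in zip(src_bigrams, tgt_bigrams):
        p *= pair_prob.get((s, t), 0.0)
    return p

print(score(["ka", "am", "ma", "al"], ["ক", "কম", "ম", "মল"]))   # 0.126
```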
Statistical Models and Methods
Statistical models and methods
• Intuition to make crude probability judgments
• Entropy
– Situation:  No occurrence   1st occurrence   2nd occurrence   Both
– Prob.:      0.5             0.125            0.125            0.25
• Expected code length: [1*(1/2) + 2*(1/4) + 3*(1/8 + 1/8)] bits = 1.75 bits
• A random variable W takes on one of several values V(W); entropy:
$H(W) = -\sum_{w \in V(W)} P(w) \log P(w)$
• -log P(w) bits are required to code w
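A minimal sketch of this computation, applied to the distribution in the table above; the expected value is 1.75 bits:

```python
import math

probs = {"no_occurrence": 0.5, "first_only": 0.125,
         "second_only": 0.125, "both": 0.25}

H = -sum(p * math.log2(p) for p in probs.values())
print(H)   # 1.75
```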
Use in Speech
• {the, a, cat, dog, ate, slept, here, there}
• If each word is used equally often and independently
• Then the entropy of the language is
-P(the) log P(the) - P(a) log P(a) - …
= 8 · (-(1/8) log(1/8))
= 3 bits/word
– In general, $H(L) = \lim_{n\to\infty} -\frac{1}{n} \sum_{w_{1,n}} P(w_{1,n}) \log P(w_{1,n})$
Markov Chain
• If we remove the numbers then it’s a finite state automaton,
which is an acceptor as well as a generator
• Adding the probabilities, we make it a probabilistic finite state
automaton => Markov Chain
• Assuming all states are accepting states (Markov Process), we
can compute the prob. of generating a given string
• Product of probabilities of the arcs traversed in generating
the string.
Cross entropy
• Per-word entropy of the previous model is
– -[0.5 log(1/2) + 0.5 log(1/2)] = 1
– At each state there are only two equiprobable choices, so
• H(p) = 1 bit/word
• If we consider each word equiprobable then
H(pm) = 3 bits/word
• Cross Entropy
– The cross entropy of a set of random variables W1,n, where the
correct model is P(w1,n) but the probabilities are estimated
using the model Pm(w1,n), is
$H(W_{1,n}, P_m) = -\sum_{w_{1,n}} P(w_{1,n}) \log P_m(w_{1,n})$
Cross entropy
• Per-word cross entropy is $\frac{1}{n} H(W_{1,n}, P_m)$
• Per-word entropy of the given Markov Chain: 1
• If we slightly change the model:
– Outgoing probabilities are 0.75 and 0.25
– per-word cross entropy becomes
• -[1/2 log(3/4) + 1/2 log(1/4)]
= -(1/2)[log 3 - log 4 + log 1 - log 4]
= -(1/2)[log 3 - 4]
= 2 - (log 3)/2 ≈ 2 - 1.585/2 ≈ 1.2
• For an incorrect model:
– H(W1,n) ≤ H(W1,n, PM)
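A minimal sketch of the two-choice case above: the true distribution is (0.5, 0.5), the changed model is (0.75, 0.25):

```python
import math

p  = [0.5, 0.5]      # correct model
pm = [0.75, 0.25]    # estimated model

H_cross = -sum(pi * math.log2(qi) for pi, qi in zip(p, pm))
print(H_cross)   # about 1.21 bits/word, versus 1 bit under the true model
```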
Cross entropy
• Cross entropy of a language
– A stochastic process is ergodic if its statistical
properties can be computed from a single, sufficiently
large sample of the process.
– Assuming L is an ergodic language,
– the cross entropy of L is
$H(L, P_M) = \lim_{n\to\infty} -\frac{1}{n} \sum_{w_{1,n}} P(w_{1,n}) \log P_M(w_{1,n}) = \lim_{n\to\infty} -\frac{1}{n} \log P_M(w_{1,n})$
Corpus
• Brown corpus
– Coverage
• 500 text segments of 2,000 words each
• Press, reportage etc.: 44
• Press, editorial etc.: 27
• Press, reviews: 17
• Religion books, periodicals..: 17
• …
Trigram Model
Trigram models
• N-gram model…
• P(wn|w1…wn-1) ≈ P(wn|wn-2 wn-1)
• P(w1,n) = P(w1) P(w2|w1) P(w3|w1w2) .. P(wn|w1,n-1)
≈ P(w1) P(w2|w1) P(w3|w1w2) .. P(wn|wn-2,n-1)
• With pseudo words w-1 and w0:
$P(w_{1,n}) = \prod_{i=1}^{n} P(w_i \mid w_{i-2,i-1})$
• Estimated from counts:
$P_e(w_i \mid w_{i-2,i-1}) = \frac{C(w_{i-2,i})}{C(w_{i-2,i-1})}$
• Example: “to create such”
– #(to create such) = ?
– #(to create) = ?
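A minimal sketch of this count-based estimate, using the “to create such” example over a toy corpus:

```python
from collections import Counter

corpus = "we want to create such a model and tried to create it again".split()

tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
bi  = Counter(zip(corpus, corpus[1:]))

def p_trigram(w1, w2, w3):
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

print(p_trigram("to", "create", "such"))   # #(to create such) / #(to create) = 1/2
```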
Trigram as Markov Chain
• It is not possible to determine the state of the
machine simply on the basis of the last output
(the last two outputs are needed)
• Markov chain of order 2
Problem of sparse data
– Jelinek studied
– 1,500,000 word corpus
– Extracted trigrams
– Applied to 300,000 words
– 25% trigram types were missing
Maximum Entropy Model
An example
• Machine translation
– Star in English
• Translation in Hindi:
– सितारा, तारा, तारक, प्रसिद्ध अभिनेता, भाग्य
• First statistics of this process
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध
अभिनेता) + p(भाग्य) = 1
• There are an infinite number of models p for
which this identity holds
• One model
– p(सितारा) = 1
– This model always predicts सितारा
• Another model
– p(तारा) = ½
– p(प्रसिद्ध अभिनेता) = ½
• These models offend our sensibilities
– The expert always chose from the five choices
– How can we justify either of these probability
distributions?
– These models make bold assumptions without empirical
justification
• What we know
– Experts chose exclusively from these five words
• The most intuitively appealing model is
– p(सितारा) = 1/5
– p(तारा) = 1/5
– p(तारक) = 1/5
– p(प्रसिद्ध अभिनेता) = 1/5
– p(भाग्य) = 1/5
• The most uniform model subject to our
knowledge
• Suppose we notice that the expert chose
either सितारा or तारा 30% of the time
• We apply this knowledge to update our model
– p(सितारा) + p(तारा) = 3/10
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध
अभिनेता) + p(भाग्य) = 1
• Many probability distributions are consistent with
the above constraints
• A reasonable choice for p is again the most
uniform
• i.e. the distribution which allocates its
probability as evenly as possible, subject to
the constraints
– p(सितारा) = 3/20
– p(तारा) = 3/20
– p(तारक) = 7/30
– p(प्रसिद्ध अभिनेता) = 7/30
– p(भाग्य) = 7/30
• Say we inspect the data once more and notice
another interesting fact
– In half the cases, the expert chose either सितारा or
प्रसिद्ध अभिनेता
• So we add a third constraint
– p(सितारा) + p(तारा) = 3/10
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध
अभिनेता) + p(भाग्य) = 1
– p(सितारा) + p(प्रसिद्ध अभिनेता) = ½
• Now if we want to look for the most uniform p
satisfying the constraints the choice is not as
obvious
• As complexity is added, we face two difficulties
– What is meant by “uniform”, and how can we measure
the uniformity of a model?
– How will we find the most uniform model subject to a
set of constraints?
• The maximum entropy method (E. T. Jaynes) answers
both of these questions
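A minimal sketch of the second question, solved numerically: find the most uniform p over the five translations under the three constraints above, using a generic constrained optimizer (scipy is assumed to be available; the index order stands in for the five Hindi words):

```python
import numpy as np
from scipy.optimize import minimize

# order: [sitara, tara, tarak, prasiddh abhineta, bhagya]
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(sitara)+p(tara) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[3] - 0.5},  # p(sitara)+p(pr. abhineta) = 1/2
]
res = minimize(neg_entropy, np.full(5, 0.2), method="SLSQP",
               constraints=constraints, bounds=[(0, 1)] * 5)
print(res.x.round(3))   # the maximum entropy solution, no longer obvious by hand
```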
Maximum Entropy Modeling
• Consider a random process that produces an output value
y (a member of a finite set, Y)
• For the translation example just considered, the process
generates a translation of the word star, and the output y
can be any word in the set {सितारा, तारा, तारक, प्रसिद्ध
अभिनेता, भाग्य}.
• In generating y, the process may be influenced by some
contextual information x, a member of a finite set X.
• In the present example, this information could include the
words in the English sentence surrounding star.
• Our task is to construct a stochastic model that accurately
represents the behavior of the random process.
Maximum Entropy Modeling
• Such a model is a method of estimating the
conditional probability that, given a context x,
the process will output c.
• We will denote by p(c|x) the probability that the
model assigns to c in context x.
• We will denote by P the set of all conditional
probability distributions. Thus a model p(c|x) is,
by definition, just an element of P.
Training Data
• A large number of samples
– (x1,c1), (x2, c2) . . . (xN, cN).
• Each sample would consist of a phrase x
containing the words surrounding star, together
with the translation c of star that the process
produced.
• Empirical probability distribution p̃:
$\tilde{p}(x, c) = \frac{1}{N} \times \text{number of times } (x, c) \text{ occurs in the sample}$
Features
• The features fi are binary functions that can be
used to characterize any property of a pair (ẋ, c),
• where ẋ is a vector representing an input element and c
is the class label
• Example: f(ẋ, c) = 1 if c = प्रसिद्ध अभिनेता and star
follows cinema; otherwise f(ẋ, c) = 0
Features
• We have two things in hand
– Empirical distribution
– The model p(c|x)
• The expected value of f with respect to the
empirical distribution is
$\tilde{p}(f) = \sum_{x,c} \tilde{p}(x, c) f(x, c)$
• The expected value of f with respect to the
model p(c|x) is
$p(f) = \sum_{x,c} \tilde{p}(x)\, p(c \mid x)\, f(x, c)$
• Our constraint is
$p(f) = \tilde{p}(f)$
Classification
• For a given ẋ we need to know its class label c
– p(ẋ, c)
• Loglinear models
– General and very important class of models for
classification of categorical variables
– Logistic regression is another example
$p(x, c) = \frac{1}{Z} \prod_{i=1}^{K} \alpha_i^{f_i(x, c)}$
$\log p(x, c) = -\log Z + \sum_{i=1}^{K} f_i(x, c) \log \alpha_i$
– K is the number of features, αi is the weight for the
feature fi, and Z is a normalizing constant used to
ensure that a probability distribution results.
An example
• Text classification
• ẋ consists of a single element, indicating presence
or absence of the word profit in the article
• Classes, c
– two classes; earnings or not
• Features
– Two features
• f1: 1 if and only if the article is “earnings” and the word
profit is in it
• f2: a filler feature (fK+1), defined as
$f_{K+1}(x, c) = C - \sum_{i=1}^{K} f_i(x, c)$
– C is the greatest possible feature sum
An example
ẋ (profit)   c (“earnings”)   f1   f2   ∏i αi^fi(ẋ,c)
0            0                0    1    2
0            1                0    1    2
1            0                0    1    2
1            1                1    0    4
• Parameters:
– log α1 = 2.0, log α2 = 1.0 (i.e. α1 = 4, α2 = 2)
• Z = 2 + 2 + 2 + 4 = 10
• p(0,0) = p(0,1) = p(1,0) = 2/10 = 0.2
• p(1,1) = 4/10 = 0.4
• A data set that follows the same empirical
distribution
– ((0,0), (0,1), (1,0), (1,1), (1,1))
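A minimal sketch that reproduces the worked example above (α1 = 4, α2 = 2, so log α1 = 2 and log α2 = 1 in base 2):

```python
alphas = [4.0, 2.0]

def features(x, c):
    f1 = 1 if (x == 1 and c == 1) else 0   # profit present and class "earnings"
    f2 = 1 - f1                            # filler feature, C = 1
    return [f1, f2]

def unnorm(x, c):
    """prod_i alpha_i ** f_i(x, c), the unnormalized score."""
    out = 1.0
    for a, f in zip(alphas, features(x, c)):
        out *= a ** f
    return out

events = [(x, c) for x in (0, 1) for c in (0, 1)]
Z = sum(unnorm(x, c) for x, c in events)       # 2 + 2 + 2 + 4 = 10
for x, c in events:
    print((x, c), unnorm(x, c) / Z)            # 0.2, 0.2, 0.2, 0.4
```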
Computation of i and Z
• We search for a model p* such that
E p* f i  E ~p f i
• Empirical expectation


1
~
~
E p f i   p ( x , c) f i ( x , c) 

N
x ,c
N
 f i ( x j , c)
j 1
• In general Epfi cannot be computed efficiently
as it would require summing over all possible
combinations of ẋ and c, a huge or infinite set.
• The following approximation is used instead:
$E_p f_i \approx \sum_{x,c} \tilde{p}(x)\, p(c \mid x)\, f_i(x, c) = \frac{1}{N} \sum_{j=1}^{N} \sum_{c} p(c \mid x_j)\, f_i(x_j, c)$
Generalized Iterative Scaling Algo
• Step 1
– For all i = 1, …, K+1, initialize αi(1).
– Compute the empirical expectations Ep̃ fi
– Set n = 1
• Step 2
– Compute p(n)(x,c) for the distribution p(n) given by
the {αj(n)} for each element (x,c) in the training
set:
$p^{(n)}(x, c) = \frac{1}{Z} \prod_{i=1}^{K+1} \left(\alpha_i^{(n)}\right)^{f_i(x, c)}, \quad Z = \sum_{x,c} \prod_{i=1}^{K+1} \left(\alpha_i^{(n)}\right)^{f_i(x, c)}$
Generalized Iterative Scaling Algo
• Step 3
– Compute Ep(n) fi for all i = 1, …, K+1 according to the
formula shown before
• Step 4
– Update the parameters αi:
$\alpha_i^{(n+1)} = \alpha_i^{(n)} \left( \frac{E_{\tilde{p}} f_i}{E_{p^{(n)}} f_i} \right)^{1/C}$
• Step 5
– If the parameters have converged, stop; otherwise
increment n and go to Step 2.
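A minimal sketch of these steps on the earnings example above: two features whose sum is always C = 1, the five training samples, and the multiplicative update. A real implementation handles many features and numerical stability, which this toy version ignores:

```python
data = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]   # (x, c) samples
classes = (0, 1)
N = len(data)
C = 1.0

def features(x, c):
    f1 = 1 if (x == 1 and c == 1) else 0
    return [f1, 1 - f1]          # the filler feature keeps the sum at C = 1

def p_cond(alphas, c, x):
    """Model p(c | x) induced by the current alphas."""
    def unnorm(cc):
        out = 1.0
        for a, f in zip(alphas, features(x, cc)):
            out *= a ** f
        return out
    return unnorm(c) / sum(unnorm(cc) for cc in classes)

# Step 1: initialize and compute empirical expectations E_emp f_i
alphas = [1.0, 1.0]
emp = [sum(features(x, c)[i] for x, c in data) / N for i in range(2)]

for n in range(100):
    # Steps 2-3: E_model f_i = (1/N) sum_j sum_c p(c|x_j) f_i(x_j, c)
    model = [0.0, 0.0]
    for x, _ in data:
        for c in classes:
            pc = p_cond(alphas, c, x)
            fs = features(x, c)
            for i in range(2):
                model[i] += pc * fs[i] / N
    # Step 4: multiplicative update
    alphas = [a * (e / m) ** (1.0 / C)
              for a, e, m in zip(alphas, emp, model)]

print(alphas)
print(p_cond(alphas, 1, 1))   # converges to the empirical value 2/3
```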
Application of MaxEnt in NLP
• POS tagger
– Stanford tagger
• At our lab
– Honorific information
– Use of this information for Anaphora Resolution
– BioNLP
• Entity tagger
• Stanford Univ. has open source code for MaxEnt
• You can also use their implementation for your
own task.
HMM, MaxEnt and CRF
• HMM
– Jointly models the observation and class sequences
• MaxEnt
– Makes local, independent decisions
• CRF
– Combines the strengths of HMM and MaxEnt