CS60057 Speech & Natural Language Processing, Autumn 2007
Lecture 8, 9 August 2007

POS Tagging
Task: assign the right part-of-speech tag (e.g. noun, verb, conjunction) to a word in context.
POS taggers need to be fast in order to process large corpora:
- full parsing is slow: e.g. context-free parsing takes O(n^3) time, where n is the length of the sentence
- tagging should take no more than time linear in the size of the corpus
- so POS taggers try to assign the correct tag without actually parsing the sentence

POS Tagging: Components
A dictionary of words:
- an exhaustive list of closed-class items, e.g. the, a, an (determiners); from, to, of, by (prepositions); and, or (coordinating conjunctions)
- a large set of open-class items (e.g. nouns, verbs, adjectives) with frequency information
A mechanism to assign tags:
- context-free: by frequency
- context-dependent: bigram, trigram, HMM, hand-coded rules; example: Det Noun/*Verb, "the walk…"
A mechanism to handle unknown (extra-dictionary) words:
- capitalization
- morphology: -ed, -tion

Words often have more than one POS: back
- The back door = JJ
- On my back = NN
- Win the voters back = RB
- Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word. (These examples are from Dekang Lin.)

How hard is POS tagging? Measuring ambiguity: in the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags); worse, 40% of the tokens are ambiguous.

Problem Setup
- there are M types of POS tags: tag set {t1, …, tM}
- the word vocabulary size is V: vocabulary set {w1, …, wV}
- we have a word sequence of length n: W = w1, w2, …, wn
- we want to find the best sequence of POS tags T = t1, t2, …, tn:
$$T_{best} = \arg\max_T \Pr(T \mid W)$$

Information sources for tagging
All techniques are based on the same observations:
- some tag sequences are more probable than others: ART+ADJ+N is more probable than ART+ADJ+VB
- lexical information: knowing the word to be tagged gives a lot of information about the correct tag; "table" can be {noun, verb} but not {adj, prep, …}; "rose" can be {noun, adj, verb} but not {prep, …}

Algorithms for POS Tagging
Why can't we just look tags up in a dictionary?
- some words aren't in the dictionary, e.g. names in news stories: http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc
- one idea: P(ti | wi) = the probability that a random hapax legomenon in the corpus has tag ti; nouns are more likely than verbs, which are more likely than pronouns
- another idea: use morphology

Algorithms for POS Tagging: Knowledge
- dictionary
- morphological rules, e.g. _____-tion, _____-ly, capitalization
- N-gram frequencies, e.g. to _____, DET _____ N
- but what about rare words, e.g. smelt (two verb forms, melt and the past tense of smell, and one noun form, a small fish)?
- combining these, e.g. V _____-ing: compare "I was gracking" vs. "Gracking is fun."
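A minimal sketch of how these knowledge sources might be combined to guess tags for unknown words: dictionary lookup first, then morphological cues. The suffix rules, tag names, and the toy dictionary below are illustrative assumptions, not a description of any particular tagger.

```python
# Hypothetical unknown-word heuristic: dictionary first, then morphology.
def guess_tags(word, dictionary):
    if word.lower() in dictionary:            # known word: use its listed tags
        return dictionary[word.lower()]
    if word[0].isupper():                     # capitalization suggests a proper noun
        return ["NNP"]
    if word.endswith("tion"):                 # -tion words are usually nouns
        return ["NN"]
    if word.endswith("ly"):                   # -ly words are usually adverbs
        return ["RB"]
    if word.endswith("ing"):                  # -ing: gerund/participle or adjective
        return ["VBG", "JJ"]
    if word.endswith("ed"):                   # -ed: past tense or past participle
        return ["VBD", "VBN"]
    return ["NN"]                             # default to the largest open class

toy_dict = {"the": ["DT"], "walk": ["NN", "VB"], "smelt": ["VB", "VBD", "NN"]}
print(guess_tags("gracking", toy_dict))       # ['VBG', 'JJ'], cf. "I was gracking"
```

Note that a sentence-initial "Gracking" would trip the capitalization cue, which is exactly why such cues are combined rather than used alone.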
POS Tagging: Approaches
- rule-based tagging (ENGTWOL)
- stochastic (= probabilistic) tagging: HMM (Hidden Markov Model) tagging
- transformation-based tagging: the Brill tagger
Design questions: do we return one best answer, or several answers and let later steps decide? How does the requisite knowledge get entered?

3 methods for POS tagging: 1. Rule-based tagging
Example: the Karlsson (1995) EngCG tagger, based on the Constraint Grammar architecture and the ENGTWOL lexicon.
Basic idea:
- assign all possible tags to words (a morphological analyzer is used)
- remove wrong tags according to a set of constraint rules (typically more than 1000 hand-written constraint rules, though they may also be machine-learned)

Sample rules
N-IP rule: a tag N (noun) cannot be followed by a tag IP (interrogative pronoun).
- "… man who …": man: {N}; who: {RP, IP} --> {RP} (relative pronoun)
ART-V rule: a tag ART (article) cannot be followed by a tag V (verb).
- "… the book …": the: {ART}; book: {N, V} --> {N}

After the first stage
Example: "He had a book." After the first stage:
- he: he/pronoun
- had: have/verb-past, have/auxiliary-past
- a: a/article
- book: book/noun, book/verb
Tagging rules:
- Rule 1: if the previous tag is an article, then eliminate all verb tags
- Rule 2: if the next tag is a verb, then eliminate all verb tags

Rule-based POS tagging
The ENGTWOL tagger (now ENGCG-2): http://www.lingsoft.fi/cgi-bin/engcg

3 methods for POS tagging: 2. Transformation-based tagging
Example: the Brill (1995) tagger, a combination of the rule-based and stochastic (probabilistic) tagging methodologies.
Basic idea:
- start with a tagged corpus plus a dictionary (with the most frequent tags)
- set the most probable tag for each word as a start value
- change tags according to rules of the type "if word-1 is a determiner and word is a verb, then change the tag to noun", applied in a specific order (like rule-based taggers)
- machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach)

An example
1. Assign to words their most likely tag: P(NN|race) = .98, P(VB|race) = .02
2. Change some tags by applying transformation rules:
- NN -> VB (noun -> base-form verb); trigger: the previous tag is the preposition to; example: go to sleep(VB), go to school(VB)
- VBP -> VB (present tense -> base form); trigger: one of the previous 3 tags is a modal (MD); example: you may cut(VB)
- JJR -> RBR (comparative adjective -> comparative adverb); trigger: the next tag is an adjective (JJ); example: a more(RBR) valuable
- VBP -> VB (present tense -> base form); trigger: one of the previous 2 words is "n't"; example: should n't (VB)

Types of context: lots of latitude
A transformation can be:
- tag-triggered: the preceding/following word is tagged this way; the word two before/after is tagged this way; …
- word-triggered: the preceding/following word is this word; …
- morphology-triggered: the preceding/following word finishes with an s; …
- a combination of the above: e.g. the preceding word is tagged this way AND the following word is this word

Learning the transformation rules
Input: a corpus in which each word is (a) correctly tagged (for reference) and (b) tagged with its most frequent tag (C0).
Output: a bag of transformation rules.
Algorithm: instantiate a small set of hand-written templates (generic rules) by comparing the reference corpus to C0. The templates have the form "change tag a to tag b when…":
- the preceding/following word is tagged z
- the word two before/after is tagged z
- one of the 2 preceding/following words is tagged z
- one of the 2 preceding words is z
- …

Learning the transformation rules (con't)
- run the initial tagger and compile the types of errors, as triples <incorrect tag, desired tag, # of occurrences>
- for each error type, instantiate all templates to generate candidate transformations
- apply each candidate transformation to the corpus and count the number of corrections and errors that it produces
- save the transformation that yields the greatest improvement
- stop when no transformation can reduce the error rate by a predetermined threshold

Example
- if the initial tagger mistags 159 words as verbs instead of nouns, create the error triple <verb, noun, 159>
- suppose template #3 is instantiated as the rule: change the tag from <verb> to <noun> if one of the two preceding words is tagged as a determiner
- when this template is applied to the corpus, it corrects 98 of the 159 errors but also creates 18 new errors, so the error reduction is 98 - 18 = 80

Learning the best transformations
Input: a corpus in which each word is correctly tagged (for reference) and tagged with its most frequent tag (C0), plus a bag of unordered transformation rules.
Output: an ordering of the best transformation rules.
Let:
- E(Ck) = the number of words incorrectly tagged in the corpus at iteration k
- v(C) = the corpus obtained after applying rule v to corpus C
- ε = the minimum error reduction desired

for k := 0 step 1 do
  bt := argmin_t E(t(Ck))          // find the transformation t that minimizes the error rate
  if (E(Ck) - E(bt(Ck))) < ε then  // if bt does not improve the tagging significantly
    goto finished
  Ck+1 := bt(Ck)                   // apply rule bt to the current corpus
  Tk+1 := bt                       // keep bt as the current transformation rule
end
finished: the sequence T1 T2 … Tk is the ordered list of transformation rules

Strengths of transformation-based tagging
- exploits a wider range of lexical and syntactic regularities
- can look at a wider context: tags are conditioned on the preceding/following words, not just the preceding tags, so it can use more context than a bigram or trigram model
- transformation rules are easier to understand than matrices of probabilities

How TBL rules are applied
Before the rules are applied, the tagger labels every word with its most likely tag; we get these most likely tags from a tagged corpus.
Example: "He is expected to race tomorrow" -> he/PRN is/VBZ expected/VBN to/TO race/NN tomorrow/NN
After selecting the most likely tags, we apply transformation rules, e.g. "change NN to VB when the previous tag is TO"; this rule converts race/NN into race/VB.
This may not work in every case: in "… according to race …", race follows to/TO but should remain a noun.
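To make the two-stage process concrete, here is a minimal sketch of TBL application, under an assumed unigram tag table for the example sentence; the rule representation is simplified to the single "previous tag" template.

```python
# Sketch: assign most-likely tags, then apply one learned transformation.
most_likely = {"he": "PRN", "is": "VBZ", "expected": "VBN",
               "to": "TO", "race": "NN", "tomorrow": "NN"}   # toy unigram table

def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous tag is prev_tag."""
    return [to_tag if t == from_tag and i > 0 and tags[i - 1] == prev_tag else t
            for i, t in enumerate(tags)]

words = "he is expected to race tomorrow".split()
tags = [most_likely[w] for w in words]        # initial-state tagging
tags = apply_rule(tags, "NN", "VB", "TO")     # change NN to VB after TO
print(list(zip(words, tags)))                 # race/NN becomes race/VB
```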
How TBL rules are learned
We assume that we have a tagged corpus. Brill's TBL algorithm has three major steps:
1. tag the corpus with the most likely tag for each word (unigram model)
2. choose a transformation that deterministically replaces an existing tag with a new tag, such that the resulting tagged training corpus has the lowest error rate out of all transformations
3. apply the transformation to the training corpus
These steps are repeated until a stopping criterion is reached. The result (which will be our tagger): first tag using the most likely tags, then apply the learned transformations.

Transformations
A transformation is selected from a small set of templates: change tag a to tag b when
- the preceding (following) word is tagged z
- the word two before (after) is tagged z
- one of the two preceding (following) words is tagged z
- one of the three preceding (following) words is tagged z
- the preceding word is tagged z and the following word is tagged w
- the preceding (following) word is tagged z and the word two before (after) is tagged w

3 methods for POS tagging: 3. Stochastic (= probabilistic) tagging
- assume that a word's tag depends only on the previous tags (not on the following ones)
- use a training set (a manually tagged corpus) to learn the regularities of tag sequences and the possible tags for a word, and model this information through a language model (n-gram)
Example: HMM (Hidden Markov Model) tagging: a training corpus is used to compute the probability (frequency) of a given word having a given POS tag in a given context.

Topics
- probability: conditional probability, independence, Bayes rule
- HMM tagging: Markov chains, Hidden Markov Models

Introduction to probability
- experiment (trial): a repeatable procedure with well-defined possible outcomes
- sample space (S): the set of all possible outcomes, finite or infinite
- example: coin toss experiment; possible outcomes S = {heads, tails}
- example: die toss experiment; possible outcomes S = {1, 2, 3, 4, 5, 6}
The definition of the sample space depends on what we are asking:
- example: die toss experiment for whether the number is even or odd; the possible outcomes are {even, odd}, not {1, 2, 3, 4, 5, 6}

More definitions: events
- an event is any subset of outcomes from the sample space
- example: die toss experiment; let A be the event that the outcome is divisible by 3: A = {3, 6}, a subset of the sample space S = {1, 2, 3, 4, 5, 6}
- events can be simple or compound; example:
the deck-of-cards draw experiment:
- suppose the sample space is S = {heart, spade, club, diamond} (the four suits)
- let A be the event of drawing a heart and B the event of drawing a red card
- A = {heart} (a simple event); B = {heart} ∪ {diamond} = {heart, diamond} (a compound event)
- a compound event can be expressed as a set union of simple events
- with the alternative sample space S = the set of 52 cards, A and B would both be compound events

Counting
- if an operation oi can be performed in ni ways, a set of k operations o1 o2 … ok can be performed in n1 · n2 · … · nk ways
- example: a dice toss experiment has 6 possible outcomes; if two dice are thrown at the same time, the number of sample points in the sample space is 6 · 6 = 36

Definition of probability
- the probability law assigns to an event A a nonnegative number P(A), also called the probability of A, that encodes our knowledge or belief about the collective likelihood of all the elements of A
- the probability law must satisfy certain properties:

Probability axioms
- nonnegativity: P(A) ≥ 0 for every event A
- additivity: if A and B are two disjoint events, then the probability of their union satisfies P(A ∪ B) = P(A) + P(B)
- normalization: the probability of the entire sample space S is equal to 1, i.e. P(S) = 1

An example
- an experiment involving a single coin toss: two possible outcomes, H and T, so the sample space S is {H, T}
- if the coin is fair, we should assign equal probabilities to the two outcomes, since they have to sum to 1: P({H}) = 0.5, P({T}) = 0.5, P({H,T}) = P({H}) + P({T}) = 1.0

Another example
- an experiment involving 3 coin tosses: the outcome is a 3-long string of H or T, so S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} (8 outcomes in total)
- assume each outcome is equiprobable (a "uniform distribution")
- what is the probability of the event that exactly 2 heads occur? A = {HHT, HTH, THH}: 3 outcomes
- by additivity (the union of the individual events), P(A) = P({HHT}) + P({HTH}) + P({THH}) = 1/8 + 1/8 + 1/8 = 3/8

Probability definitions
In summary: the probability of drawing a spade from 52 well-shuffled playing cards is 13/52 = 1/4 = .25.

Moving toward language
- what's the probability of drawing a 2 from a deck of 52 cards with four 2s? $P(\text{drawing a two}) = \frac{4}{52} = \frac{1}{13} \approx .077$
- what's the probability of a random word (from a random dictionary page) being a verb? $P(\text{drawing a verb}) = \frac{\#\text{ of ways to get a verb}}{\text{all words}}$

Probability and part-of-speech tags
- how to compute each of these: "all words" = just count all the words in the dictionary; "# of ways to get a verb" = the number of words which are verbs
- if a dictionary has 50,000 entries, and 10,000 of them are verbs, then P(V) = 10000/50000 = 1/5 = .20
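The counting definitions above are easy to sanity-check in a few lines; the die event and the 50,000-entry dictionary figures come straight from the examples.

```python
# Event probability by counting outcomes in a uniform sample space.
sample_space = [1, 2, 3, 4, 5, 6]
A = [x for x in sample_space if x % 3 == 0]   # event: divisible by 3 -> {3, 6}
print(len(A) / len(sample_space))             # 2/6 = 0.333...

# Probability of a random dictionary word being a verb (figures from the slide).
n_entries, n_verbs = 50_000, 10_000
print(n_verbs / n_entries)                    # P(V) = 0.20
```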
Conditional probability
A way to reason about the outcome of an experiment based on partial information:
- in a word guessing game, the first letter of the word is a "t"; what is the likelihood that the second letter is an "h"?
- how likely is it that a person has a disease, given that a medical test was negative?
- a spot shows up on a radar screen; how likely is it that it corresponds to an aircraft?

More precisely
- given an experiment, a corresponding sample space S, and a probability law
- suppose we know that the outcome is some event B
- we want to quantify the likelihood that the outcome also belongs to some other event A
- we need a new probability law that gives us the conditional probability of A given B: P(A|B)

An intuition
- let A be "it's raining", and say P(A) in Kharagpur is 0.2
- let B be "it was sunny ten minutes ago"
- P(A|B) means "what is the probability of it raining now, given that it was sunny 10 minutes ago"
- P(A|B) is probably way less than P(A); perhaps P(A|B) is .0001
- intuition: the knowledge about B should change our estimate of the probability of A

Conditional probability
- let A and B be events in the sample space; P(A|B) is the conditional probability of event A occurring given that the fixed event B occurs
- definition: $P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A, B)}{P(B)}$
- note: P(A,B) = P(A|B) · P(B); also P(A,B) = P(B,A)

Independence
- what is P(A,B) if A and B are independent? P(A,B) = P(A) · P(B) iff A and B are independent
- example: P(heads, tails) = P(heads) · P(tails) = .5 · .5 = .25
- note: P(A|B) = P(A) iff A, B independent; also P(B|A) = P(B) iff A, B independent

Bayes theorem
$$P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$$
Idea: the probability of an event A conditional on another event B is generally different from the probability of B conditional on A, but there is a definite relationship between the two.
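A quick numerical check of these identities. The joint counts below are invented for illustration; the point is only that the definitions and Bayes rule agree on any such table.

```python
# counts[(a, b)]: number of outcomes with A=a, B=b (illustrative numbers).
counts = {(True, True): 10, (True, False): 30,
          (False, True): 15, (False, False): 45}
total = sum(counts.values())

p_a  = (counts[(True, True)] + counts[(True, False)]) / total
p_b  = (counts[(True, True)] + counts[(False, True)]) / total
p_ab = counts[(True, True)] / total            # P(A, B)

p_a_given_b = p_ab / p_b                       # P(A|B) = P(A,B) / P(B)
p_b_given_a = p_ab / p_a                       # P(B|A) = P(A,B) / P(A)
# Bayes rule: P(A|B) = P(B|A) P(A) / P(B)
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12
print(p_a_given_b, p_b_given_a)                # 0.4 0.25
```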
Deriving Bayes rule
The probability of event A given event B is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, and the probability of event B given event A is $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$.
So $P(A \mid B)\,P(B) = P(A \cap B)$ and $P(B \mid A)\,P(A) = P(A \cap B)$, hence $P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$, which gives Bayes rule:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
The theorem may be paraphrased as: conditional/posterior probability = (LIKELIHOOD multiplied by PRIOR) divided by NORMALIZING CONSTANT.

Hidden Markov Model (HMM) tagging
- using an HMM to do POS tagging
- the HMM is a special case of Bayesian inference
- it is also related to the "noisy channel" model in ASR (Automatic Speech Recognition)

Hidden Markov Model (HMM) taggers
Goal: maximize P(word|tag) × P(tag|previous n tags), i.e. lexical information × syntagmatic information.
- P(word|tag): the word/lexical likelihood, the probability that given this tag we have this word (NOT the probability that this word has this tag); modeled through a language model (word-tag matrix)
- P(tag|previous n tags): the tag sequence likelihood, the probability that this tag follows these previous tags; modeled through a language model (tag-tag matrix)

POS tagging as a sequence classification task
- we are given a sentence (an "observation" or "sequence of observations"), e.g. "Secretariat is expected to race tomorrow": a sequence of n words w1…wn
- what is the best sequence of tags which corresponds to this sequence of observations?
- the probabilistic/Bayesian view: consider all possible sequences of tags, and out of this universe of sequences choose the tag sequence which is most probable given the observation sequence of n words w1…wn

Getting to HMM
- let T = t1,t2,…,tn and W = w1,w2,…,wn
- goal: out of all sequences of tags t1…tn, get the most probable sequence of POS tags T underlying the observed sequence of words w1,w2,…,wn:
$$\hat{T} = \arg\max_T P(T \mid W)$$
- the hat ^ means "our estimate of the best, i.e. most probable, tag sequence"
- argmax_x f(x) means "the x such that f(x) is maximized"
- this equation is guaranteed to give us the best tag sequence; but how do we make it operational, i.e. how do we compute this value?
- intuition of Bayesian classification: use Bayes rule to transform the expression into a set of other probabilities that are easier to compute
- (Thomas Bayes: British mathematician, 1702-1761)

Bayes rule
Bayes rule breaks down any conditional probability P(x|y), the conditional probability of an event x assuming that y has occurred, into three other probabilities:
$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$$
Applied to tagging:
$$\hat{T} = \arg\max_T \frac{P(W \mid T)\,P(T)}{P(W)} = \arg\max_T P(W \mid T)\,P(T)$$
We can drop the denominator P(W): it does not change for each tag sequence, since we are looking for the best tag sequence for the same observation, the same fixed set of words.

Likelihood and prior
P(W|T) is the likelihood and P(T) the prior. Further simplifications:
1. the probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it:
$$P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$$
For example: WORDS: the koala put the keys on the table; TAGS: N, V, P, DET.
2. the BIGRAM assumption: the probability of a tag appearing depends only on the previous tag:
$$P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$
Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-grams and are used as the basis for simple statistical analysis of text. The bigram assumption is related to the first-order Markov assumption.
3. the most probable tag sequence estimated by the bigram tagger:
$$\hat{T} = \arg\max_T \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

Two kinds of probabilities
(1) Tag transition probabilities p(ti|ti-1)
- determiners are likely to precede adjectives and nouns (That/DT flight/NN, The/DT yellow/JJ hat/NN), so we expect P(NN|DT) and P(JJ|DT) to be high, but P(DT|JJ) to be low
- compute P(NN|DT) by counting, in a labeled corpus, the number of times DT is followed by NN:
$$P(NN \mid DT) = \frac{C(DT, NN)}{C(DT)}$$
(2) Word likelihood probabilities p(wi|ti)
- P(is|VBZ) = the probability of a VBZ (3sg present verb) being "is": if we were expecting a third person singular verb, how likely is it that this verb would be is?
- compute P(is|VBZ) by counting in a labeled corpus:
$$P(is \mid VBZ) = \frac{C(VBZ, is)}{C(VBZ)}$$

An example: the verb "race"
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
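Both kinds of probabilities reduce to counting in a labeled corpus. A minimal sketch, on a toy tagged corpus (real estimates would of course come from something like the Brown corpus):

```python
from collections import Counter

# Toy labeled corpus: (word, tag) pairs.
tagged = [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN"),
          ("the", "DT"), ("yellow", "JJ"), ("hat", "NN")]

tags = [t for _, t in tagged]
tag_bigrams = Counter(zip(tags, tags[1:]))     # C(t_{i-1}, t_i)
tag_counts  = Counter(tags)                    # C(t)
word_tag    = Counter(tagged)                  # C(w, t)

# Tag transition probability: P(NN|DT) = C(DT, NN) / C(DT)
print(tag_bigrams[("DT", "NN")] / tag_counts["DT"])   # 0.5
# Word likelihood: P(race|NN) = C(race, NN) / C(NN)
print(word_tag[("race", "NN")] / tag_counts["NN"])    # 0.5
```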
Disambiguating "race"
- P(NN|TO) = .00047, P(VB|TO) = .83: the tag transition probabilities P(NN|TO) and P(VB|TO) answer the question "how likely are we to expect a verb/noun given the previous tag TO?"
- P(race|NN) = .00057, P(race|VB) = .00012: lexical likelihoods from the Brown corpus for "race", given the POS tag NN or VB
- P(NR|VB) = .0027, P(NR|NN) = .0012: tag sequence probabilities for the likelihood of an adverb occurring given the previous tag verb or noun
Multiply the lexical likelihoods with the tag sequence probabilities:
- P(VB|TO) P(NR|VB) P(race|VB) = .00000027
- P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
The verb wins.

Hidden Markov Models
What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM). Let's spend a bit of time tying this into the model. In order to define the HMM, we will first introduce the Markov chain, or observable Markov model.

Definitions
- a weighted finite-state automaton adds probabilities to the arcs; the probabilities on the arcs leaving any state must sum to one
- a Markov chain is a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through
- Markov chains can't represent inherently ambiguous problems, but they are useful for assigning probabilities to unambiguous sequences

Markov chain = "first-order observable Markov model"
- a set of states Q = q1, q2 … qN; the state at time t is qt
- a set of transition probabilities A = a01, a02, …, an1, …, ann, where each aij represents the probability of transitioning from state i to state j; the set of these is the transition probability matrix A:
$$a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N \qquad \sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$$
- distinguished start and end states, and a special initial probability vector π: each πi expresses the probability p(qi|START) that the model will start in state i

Markov chain for weather: example 1
- three types of weather: sunny, rainy, foggy
- we want to find conditional probabilities of the form P(qn|qn-1, qn-2, …, q1), i.e. the probability of the unknown weather on day n, depending on the (known) weather of the preceding days
- we could infer this probability from the relative frequency (the statistics) of past observations of weather sequences
- problem: the larger n is, the more observations we must collect; suppose that n = 6, then we have to collect statistics for 3^(6-1) = 243 past histories
- therefore we make a simplifying assumption, called the (first-order) Markov assumption: for a sequence of observations q1, …, qn, the current state depends only on the previous state, so the joint probability of certain past and current observations is
$$P(q_1, \ldots, q_n) = \prod_{i=1}^{n} P(q_i \mid q_{i-1})$$
Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy?
Using the Markov assumption and the probabilities in table 1, this translates into:
$$P(q_2 = \text{sunny}, q_3 = \text{rainy} \mid q_1 = \text{sunny}) = P(q_2 = \text{sunny} \mid q_1 = \text{sunny}) \cdot P(q_3 = \text{rainy} \mid q_2 = \text{sunny})$$

(Figure: the weather Markov chain as a state diagram; Markov chain for weather, example 2.)

Markov chain for weather
What is the probability of 4 consecutive rainy days?
- the sequence is rainy-rainy-rainy-rainy, i.e. the state sequence is 3-3-3-3
- $P(3,3,3,3) = \pi_3\, a_{33}\, a_{33}\, a_{33} = 0.2 \times (0.6)^3 = 0.0432$

Hidden Markov Model
- for Markov chains, the output symbols are the same as the states: if we see sunny weather, we're in state sunny
- but in part-of-speech tagging (and other things), the output symbols are words and the hidden states are part-of-speech tags
- so we need an extension: a Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states; this means we don't know which state we are in
- observed events: words; hidden events: tags

Hidden Markov Models
- states Q = q1, q2 … qN
- observations O = o1, o2 … oN; each observation is a symbol from a vocabulary V = {v1, v2, … vV}
- transition probabilities (the prior): a transition probability matrix A = {aij}
- observation likelihoods (the likelihood): an output probability matrix B = {bi(ot)}, a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i (the emission probabilities)
- a special initial probability vector π: each πi expresses the probability p(qi|START) that the HMM will start in state i

Assumptions
- Markov assumption: the probability of a particular state depends only on the previous state: $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$
- output-independence assumption: the probability of an output observation depends only on the state that produced that observation

HMM for ice cream
- you are a climatologist in the year 2799, studying global warming
- you can't find any records of the weather in Boston, MA for the summer of 2007
- but you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
- our job: figure out how hot it was
The task: given an ice cream observation sequence 1,2,3,2,2,2,3,… (compare: output symbols), produce a weather sequence C,C,H,C,C,C,H,… (compare: hidden states, the causing states)
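Before moving from chains to HMMs, the Markov-chain arithmetic above is worth making concrete. In the sketch below, the rainy self-loop (0.6) and the rainy initial probability (0.2) match the rainy-days example; the remaining transition values are assumed purely for illustration.

```python
# First-order Markov chain: P(q1..qn) = pi(q1) * prod_i a[q_{i-1}][q_i].
pi = {"sunny": 0.5, "rainy": 0.2, "foggy": 0.3}          # assumed except rainy
a = {"sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
     "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
     "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5}}

def seq_prob(states):
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= a[prev][cur]        # Markov assumption: depends only on prev state
    return p

print(seq_prob(["rainy"] * 4))   # 0.2 * 0.6**3 = 0.0432, as on the slide
```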
(Figure: the HMM for ice cream.)

Different types of HMM structure
- Bakis = left-to-right; ergodic = fully connected

HMM taggers
Two kinds of probabilities:
- A: transition probabilities (the PRIOR)
- B: observation likelihoods (the LIKELIHOOD)
HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability.
(Figures: the weighted FSM corresponding to the hidden states of the HMM, showing the A probabilities; the B observation likelihoods for the POS HMM; the A matrix for the POS HMM; the B matrix for the POS HMM.)

HMM taggers
- the probabilities are trained on hand-labeled training corpora (the training set), combining different N-gram levels
- taggers are evaluated by comparing their output on a test set to human labels for that test set (the Gold Standard)

The Viterbi algorithm
- what is the best tag sequence for "John likes to fish in the sea"?
- Viterbi efficiently computes the most likely state sequence given a particular output sequence, based on dynamic programming

A smaller example
(Figure: a two-state automaton. The start state goes to state q with probability 1.0; q loops back to q with probability 0.3 and goes to r with 0.7; r loops back to r with 0.5 and returns to q with 0.5; q emits a with probability 0.4 and b with 0.6; r emits a with 0.2 and b with 0.8.)
What is the best sequence of states for the input string "bbba"? Computing all possible paths and finding the one with the maximum probability is exponential.

A smaller example (con't)
For each state, store the most likely sequence that could lead to it (and its probability). The path probability matrix is an array of states versus time (tags versus words) that stores the probability of being in each state at each time, in terms of the probabilities of being in each state at the preceding time.
- input ε --> b: best path to q: ε --> q, 0.6 (1.0 × 0.6); best path to r: ε --> r, 0 (0 × 0.8)
- input b --> b: to q, coming from q: q --> q, 0.108 (0.6 × 0.3 × 0.6); coming from r: r --> q, 0 (0 × 0.5 × 0.6); to r, from q: q --> r, 0.336 (0.6 × 0.7 × 0.8); from r: r --> r, 0 (0 × 0.5 × 0.8)
- input bb --> b: to q: qq --> q, 0.01944 (0.108 × 0.3 × 0.6); qr --> q, 0.1008 (0.336 × 0.5 × 0.6); to r: qq --> r, 0.0648 (0.108 × 0.7 × 0.8); qr --> r, 0.1344 (0.336 × 0.5 × 0.8)
- input bbb --> a: to q: qrq --> q, 0.012096 (0.1008 × 0.3 × 0.4); qrr --> q, 0.02688 (0.1344 × 0.5 × 0.4); to r: qrq --> r, 0.014112 (0.1008 × 0.7 × 0.2); qrr --> r, 0.01344 (0.1344 × 0.5 × 0.2)
The best final path is therefore q-r-r-q, with probability 0.02688.

Viterbi intuition: we are looking for the best "path"
(Figure: a trellis of candidate tags JJ, DT, VB, NNP, NN, RB, VBN, TO, VBD over the words "promised to back the bill"; slide from Dekang Lin.)

The Viterbi algorithm (figure).
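As a concrete companion to the worked example, here is a minimal Viterbi decoder for the two-state automaton above. The probabilities are exactly those of the example, and the decoded best path q-r-r-q with probability 0.02688 matches the last column of the table.

```python
start = {"q": 1.0, "r": 0.0}
trans = {"q": {"q": 0.3, "r": 0.7}, "r": {"q": 0.5, "r": 0.5}}
emit  = {"q": {"a": 0.4, "b": 0.6}, "r": {"a": 0.2, "b": 0.8}}

def viterbi(obs, states):
    # v[t][s]: probability of the best path ending in state s after t+1 symbols
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({}); back.append({})
        for s in states:
            prev = max(states, key=lambda p: v[t - 1][p] * trans[p][s])
            back[t][s] = prev                   # remember the best predecessor
            v[t][s] = v[t - 1][prev] * trans[prev][s] * emit[s][obs[t]]
    best = max(states, key=lambda s: v[-1][s])  # best final state
    path = [best]
    for t in range(len(obs) - 1, 0, -1):        # follow the backpointers
        path.append(back[t][path[-1]])
    return path[::-1], v[-1][best]

print(viterbi("bbba", ["q", "r"]))              # (['q', 'r', 'r', 'q'], ~0.02688)
```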
Intuition: the value in each cell is computed by taking the MAX over all paths that lead to this cell. An extension of a path from state i at time t-1 is computed by multiplying:
- the previous path probability from the previous cell, viterbi[t-1, i]
- the transition probability aij from previous state i to current state j
- the observation likelihood bj(ot) that current state j matches observation symbol ot
(Figure: Viterbi example.)

Smoothing of probabilities
Data sparseness is a problem when estimating probabilities based on corpus data.
- the "add one" smoothing technique:
$$P(w_{1,n}) = \frac{C(w_{1,n}) + 1}{N + B}$$
where C is the absolute frequency, N the number of training instances, and B the number of different types
- linear interpolation methods can compensate for data sparseness with higher-order models; a common method is interpolating trigrams, bigrams and unigrams:
$$P(t_i \mid t_{1,i-1}) = \lambda_1 P_1(t_i) + \lambda_2 P_2(t_i \mid t_{i-1}) + \lambda_3 P_3(t_i \mid t_{i-1,i-2}), \qquad \lambda_i \ge 0, \quad \sum_i \lambda_i = 1$$
- the lambda values are automatically determined using a variant of the Expectation Maximization algorithm

Possible improvements
- in bigram POS tagging, we condition a tag only on the preceding tag; why not use more context (e.g. a trigram model)?
- this is more precise: "is clearly marked" --> verb, past participle; "he clearly marked" --> verb, past tense
- combine trigram, bigram and unigram models
- we could condition on words too, but with an n-gram approach this is too costly (too many parameters to model)

Further issues with Markov model tagging
- unknown words are a problem, since we don't have the required probabilities; possible solutions: assign word probabilities based on the corpus-wide distribution of POS, or use morphological cues (capitalization, suffix) to assign a more calculated guess
- using higher-order Markov models: a trigram model captures more context, but data sparseness is much more of a problem

TnT
An efficient statistical POS tagger developed by Thorsten Brants (ANLP-2000). The underlying trigram model:
$$T^{*} = \arg\max_{t_1 \ldots t_T} \left[ \prod_{i=1}^{T} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i) \right] P(t_{T+1} \mid t_T)$$
- the probability of a POS depends only on its two preceding POS tags
- the probability of a word appearing at a particular position, given that its POS occurs at that position, is independent of everything else

Training
Maximum likelihood estimates:
$$\text{unigrams: } \hat{P}(t_3) = \frac{c(t_3)}{N} \qquad \text{bigrams: } \hat{P}(t_3 \mid t_2) = \frac{c(t_2, t_3)}{c(t_2)}$$
$$\text{trigrams: } \hat{P}(t_3 \mid t_1, t_2) = \frac{c(t_1, t_2, t_3)}{c(t_1, t_2)} \qquad \text{lexical: } \hat{P}(w_3 \mid t_3) = \frac{c(w_3, t_3)}{c(t_3)}$$
Smoothing: a context-independent variant of linear interpolation:
$$P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2)$$

Smoothing algorithm
- set λ1 = λ2 = λ3 = 0
- for each trigram t1 t2 t3 with f(t1,t2,t3) > 0, depending on which of the following three values is the maximum:
  - (f(t1,t2,t3) - 1) / (f(t1,t2) - 1): increment λ3 by f(t1,t2,t3)
  - (f(t2,t3) - 1) / (f(t2) - 1): increment λ2 by f(t1,t2,t3)
  - (f(t3) - 1) / (N - 1): increment λ1 by f(t1,t2,t3)
- normalize the λi
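A sketch of this λ-estimation loop (deleted interpolation, after Brants 2000). The count tables are assumed to be consistent dicts of n-gram frequencies; following the published algorithm, the deleted counts subtract 1 in both numerator and denominator.

```python
def estimate_lambdas(tri_f, bi_f, uni_f, N):
    """tri_f, bi_f, uni_f: frequency dicts keyed by tag tuples/tags; N: corpus size."""
    lam = [0.0, 0.0, 0.0]                      # lam[0]=unigram, [1]=bigram, [2]=trigram
    for (t1, t2, t3), f in tri_f.items():
        c3 = (f - 1) / (bi_f[(t1, t2)] - 1) if bi_f[(t1, t2)] > 1 else 0.0
        c2 = (bi_f[(t2, t3)] - 1) / (uni_f[t2] - 1) if uni_f[t2] > 1 else 0.0
        c1 = (uni_f[t3] - 1) / (N - 1)
        best = max(c1, c2, c3)                 # which estimate is most reliable?
        if best == c3:   lam[2] += f
        elif best == c2: lam[1] += f
        else:            lam[0] += f
    total = sum(lam)
    return [x / total for x in lam]            # normalize so the lambdas sum to 1
```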
Evaluation of POS taggers
- compared with a gold standard of human performance
- metric: accuracy = % of tags that are identical to the gold standard
- accuracy must be compared to a ceiling (the best possible results): how do human annotators score compared to each other? (96-97%); so systems are not bad at all!
- and to a baseline (the worst acceptable results): what if we take the most likely tag (unigram model), regardless of previous tags? (90-91%); so anything less is really bad

More on tagger accuracy
Is 95% good? That's 5 mistakes every 100 words; if a sentence is 20 words on average, that's 1 mistake per sentence. When comparing tagger accuracy, beware of:
- the size of the training corpus: the bigger, the better the results
- the difference between training and testing corpora (genre, domain…): the closer, the better the results
- the size of the tag set
- prediction versus classification
- unknown words: the more unknown words (not in the dictionary), the worse the results

Error analysis
- look at a confusion matrix (contingency table), e.g. 4.4% of the total errors are caused by mistagging VBD as VBN
- see what errors are causing problems: noun (NN) vs. proper noun (NNP) vs. adjective (JJ); adverb (RB) vs. particle (RP) vs. preposition (IN); preterite (VBD) vs. participle (VBN) vs. adjective (JJ)
- ERROR ANALYSIS IS ESSENTIAL!!!

Tag indeterminacy (figure).

Major difficulties in POS tagging
- unknown words (proper names): we do not know the set of tags a word can take, and knowing this takes you a long way (cf. the baseline POS tagger); possible solutions: assign all possible tags, with a probability distribution identical to the lexicon as a whole, or use morphological cues to infer the possible tags (e.g. words ending in -ed are likely to be past tense verbs or past participles)
- frequently confused tag pairs: preposition vs. particle (<running> <up> a hill (prep) / <running up> a bill (particle)); verb past tense vs. past participle vs. adjective

Unknown words
- the most-frequent-tag approach: but what about words that don't appear in the training set?
- suffix analysis: the probability distribution for a particular suffix is generated from all words in the training set that share the same suffix
- suffix estimation: calculate the probability of a tag t given the last i letters of an n-letter word
- smoothing: successive abstraction through sequences of increasingly more general contexts (i.e. omit more and more characters of the suffix)
- use a morphological analyzer to get the restrictions on the possible tags
(Figure: unknown words.)

Alternative graphical models for part-of-speech tagging
Different models for POS tagging: HMMs, Maximum Entropy Markov Models, Conditional Random Fields.

Hidden Markov Model (HMM): generative modeling
A source model over labels y and a noisy channel producing observations x:
$$P(\mathbf{y}) = \prod_i P(y_i \mid y_{i-1}) \qquad P(\mathbf{x} \mid \mathbf{y}) = \prod_i P(x_i \mid y_i)$$
Dependency (1st order): each state Yk depends on the previous state through P(Yk | Yk-1), and each observation Xk depends on its state through P(Xk | Yk).

Disadvantage of HMMs (1): no rich feature information
- rich information is often required; example: in POS tagging, how do we evaluate P(wk|tk) for unknown words wk?
- useful features, needed when xk is complex or the data for xk is sparse: suffixes (e.g. -ed, -tion, -ing, etc.), capitalization
- generative model parameter estimation: maximize the joint likelihood of the training examples,
$$\sum_{(x,y) \in T} \log_2 P(X = x, Y = y)$$

Generative models
- hidden Markov models (HMMs) and stochastic grammars assign a joint probability to paired observation and label sequences
- the parameters are typically trained to maximize the joint likelihood of the training examples

Generative models (cont'd): difficulties and disadvantages
- need to enumerate all possible observation sequences
- not practical to represent multiple interacting features or long-range dependencies of the observations
- very strict independence assumptions on the observations

A better approach
A discriminative model, which models P(y|x) directly and maximizes the conditional likelihood of the training examples:
$$\sum_{(x,y) \in T} \log_2 P(Y = y \mid X = x)$$

Maximum entropy modeling
- an N-gram model makes probabilities depend on the previous few tokens
- we may identify a more heterogeneous set of features which contribute in some way to the choice of the current word (whether it is the first word in a story, whether the next word is to, whether one of the last 5 words is a preposition, etc.)
- maxent combines these features in a probabilistic model; the given features provide constraints on the model
- we would like a probability distribution which, outside of these constraints, is as uniform as possible, i.e. has the maximum entropy among all models that satisfy the constraints

Maximum Entropy Markov Model
- discriminative sub-models: unify the two parameters of the generative model, the source-model parameter P(yk | yk-1) and the noisy-channel parameter P(xk | yk), into the single conditional model parameter P(yk | xk, yk-1)
- employ the maximum entropy principle:
$$P(\mathbf{y} \mid \mathbf{x}) = \prod_i P(y_i \mid y_{i-1}, x_i)$$

General maximum entropy principle
- model the distribution P(Y|X) with a set of features {f1, f2, …, fl} defined on X and Y
- idea: collect information about the features from the training data
- principle: model what is known and assume nothing else; take the flattest distribution, the distribution with the maximum entropy

Example (Berger et al., 1996)
Model the translation of the word "in" from English to French, i.e. model P(wordFrench). Constraints:
1. the possible translations are dans, en, à, au cours de, pendant
2. "dans" or "en" are used 30% of the time
3. "dans" or "à" are used 50% of the time

Features
Features are 0-1 indicator functions: 1 if (x, y) satisfies a predefined condition, 0 if not. Example for POS tagging:
$$f_1(x, y) = \begin{cases} 1 & \text{if } x \text{ ends with -tion and } y \text{ is NN} \\ 0 & \text{otherwise} \end{cases} \qquad f_2(x, y) = \begin{cases} 1 & \text{if } x \text{ starts with a capital letter and } y \text{ is NNP} \\ 0 & \text{otherwise} \end{cases}$$

Constraints
- empirical information, i.e. statistics from the training data T:
$$\hat{P}(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} f_i(x, y)$$
- the expected value under the distribution P(Y|X) that we want to model:
$$P(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} \sum_{y' \in D(Y)} P(Y = y' \mid X = x)\, f_i(x, y')$$
- constraints: $\hat{P}(f_i) = P(f_i)$

Maximum entropy: objective
Entropy:
$$I = -\sum_{x} \hat{P}(x) \sum_{y} P(Y = y \mid X = x) \log_2 P(Y = y \mid X = x)$$
Maximization problem:
$$\max_{P(Y|X)} I \quad \text{subject to} \quad \hat{P}(f_i) = P(f_i)$$
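The slide's two indicator features translate directly into code, and the empirical expectation above is just their average over the training pairs. The four training pairs below are made up for illustration.

```python
def f1(x, y):   # 1 if the word ends with -tion and is tagged NN
    return 1 if x.endswith("tion") and y == "NN" else 0

def f2(x, y):   # 1 if the word is capitalized and is tagged NNP
    return 1 if x[0].isupper() and y == "NNP" else 0

T = [("station", "NN"), ("Boston", "NNP"), ("run", "VB"), ("motion", "NN")]

def empirical_expectation(f, T):
    return sum(f(x, y) for x, y in T) / len(T)   # (1/|T|) sum of f_i(x, y)

print(empirical_expectation(f1, T))   # 0.5
print(empirical_expectation(f2, T))   # 0.25
```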
Dual problem
- the dual problem yields the conditional model
$$P(Y = y \mid X = x) \propto \exp\left( \sum_{i=1}^{l} \lambda_i f_i(x, y) \right)$$
- and maximum likelihood estimation on the conditional data:
$$\max_{\lambda_1, \ldots, \lambda_l} \sum_{(x,y) \in T} \log_2 P(Y = y \mid X = x)$$
- solutions: improved iterative scaling (IIS) (Berger et al., 1996); generalized iterative scaling (GIS) (McCallum et al., 2000)

Maximum Entropy Markov Model
- use the maximum entropy approach to model the 1st-order distribution P(Yk = yk | Xk = xk, Yk-1 = yk-1)
- basic features (like the parameters of an HMM): bigram (1st order) or trigram (2nd order) features from the source model, and state-output pair features (Xk = xk, Yk = yk)
- advantage: we can incorporate other advanced features of xk and yk

HMM vs MEMM (1st order)
- HMM: two directed dependencies per position, P(Xk | Yk) and P(Yk | Yk-1)
- MEMM: a single conditional distribution, P(Yk | Xk, Yk-1)

Performance in POS tagging
- data set: WSJ; features: HMM features plus spelling features (like -ed, -tion, -s, -ing, etc.)
- results (Lafferty et al. 2001): 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy; 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy

ME applications
- part-of-speech (POS) tagging (Ratnaparkhi, 1996): model P(POS tag | context); information sources: word window (4), word features (prefix, suffix, capitalization), previous POS tags
- abbreviation expansion (Pakhomov, 2002); information sources: word window (4), document title
- word sense disambiguation (WSD) (Chao & Dyer, 2002); information sources: word window (4), structurally related words (4)
- sentence boundary detection (Reynar & Ratnaparkhi, 1997); information sources: token features (prefix, suffix, capitalization, abbreviation), word window (2)
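Before turning to global optimization, here is a minimal sketch of the MEMM's local model from above: the distribution over the next tag is a softmax over weighted features of (previous tag, current word), normalized per state. The features and weights are invented for illustration; a real MEMM would learn the weights with IIS or GIS as noted earlier.

```python
import math

TAGS = ["NN", "VB", "DT"]

def features(prev_tag, word, tag):
    return {f"prev={prev_tag},tag={tag}": 1.0,
            f"suffix=ing,tag={tag}": 1.0 if word.endswith("ing") else 0.0}

weights = {"prev=TO,tag=VB": 2.0, "suffix=ing,tag=VB": 1.5}   # illustrative

def p_next_tag(prev_tag, word):
    scores = {t: math.exp(sum(weights.get(name, 0.0) * val
                              for name, val in features(prev_tag, word, t).items()))
              for t in TAGS}
    z = sum(scores.values())        # per-state normalization (source of label bias)
    return {t: s / z for t, s in scores.items()}

print(p_next_tag("TO", "race"))     # VB gets most of the probability mass
```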
Solution: global optimization
- optimize the parameters in a global model simultaneously, not in sub-models separately
- alternatives: conditional random fields; applications of the perceptron algorithm

Why ME?
Advantages: ME combines multiple knowledge sources:
- local: word prefix, suffix, capitalization (POS: Ratnaparkhi, 1996); word POS, POS class, suffix (WSD: Chao & Dyer, 2002); token prefix, suffix, capitalization, abbreviation (sentence boundary: Reynar & Ratnaparkhi, 1997)
- global: N-grams (Rosenfeld, 1997); word window; document title (Pakhomov, 2002); structurally related words (Chao & Dyer, 2002); sentence length, conventional lexicon (Och & Ney, 2002)
- and it combines dependent knowledge sources

Why ME? (cont'd)
- further advantages: additional knowledge sources are easy to add; implicit smoothing
- disadvantages: computational cost (the expected values and the normalizing constant must be computed at each iteration); overfitting, addressed by feature selection (cutoffs; basic feature selection, Berger et al., 1996)

Conditional models
- model the conditional probability P(label sequence y | observation sequence x) rather than the joint probability P(y, x)
- specify the probability of possible label sequences given an observation sequence
- allow arbitrary, non-independent features on the observation sequence X
- the probability of a transition between labels may depend on past and future observations
- relax the strong independence assumptions made in generative models

Discriminative models: Maximum Entropy Markov Models (MEMMs)
- exponential model
- given a training set X with label sequence Y, train a model θ that maximizes P(Y|X, θ)
- for a new data sequence x, the predicted label y maximizes P(y|x, θ)
- notice the per-state normalization

MEMMs (cont'd)
- MEMMs have all the advantages of conditional models
- per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states ("conservation of score mass")
- they are subject to the label bias problem: a bias toward states with fewer outgoing transitions

Label bias problem
Consider an MEMM in which, in the training data, label value 2 is the only label value observed after label value 1, so P(2|1) = 1 and hence P(2 | 1 and x) = 1 for all x. Then:
- P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
- P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
- since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
- however, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)
- per-state normalization does not allow the required expectation

Solving the label bias problem
- change the state-transition structure of the model; but it is not always practical to change the set of states
- start with a fully-connected model and let the training procedure figure out a good structure; but this precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction)
(Figure: random field.)

Conditional Random Fields (CRFs)
- CRFs have all the advantages of MEMMs without the label bias problem
- an MEMM uses per-state exponential models for the conditional probabilities of next states given the current state; a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
- undirected acyclic graph
- CRFs allow some transitions to "vote" more strongly than others, depending on the corresponding observations

Definition of CRFs
X is a random variable over the data sequences to be labeled; Y is a random variable over the corresponding label sequences.
(Figures: an example of CRFs; a graphical comparison among HMMs, MEMMs and CRFs.)

Conditional distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y given X = x is, by the fundamental theorem of random fields:
$$p_\theta(\mathbf{y} \mid \mathbf{x}) \propto \exp\left( \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \right)$$
where:
- x is a data sequence and y a label sequence
- v is a vertex from the vertex set V (the set of label random variables); e is an edge from the edge set E over V
- fk and gk are given and fixed: gk is a Boolean vertex feature and fk is a Boolean edge feature; k ranges over the features
- θ = (λ1, λ2, …, λn; μ1, μ2, …, μn): the λk and μk are the parameters to be estimated
- y|e is the set of components of y defined by edge e; y|v is the set of components of y defined by vertex v

Conditional distribution (cont'd)
CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
$$p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \right)$$
where Z(x) is a normalization constant over the data sequence x.
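To see what the global normalization buys, here is a brute-force sketch for a tiny linear chain: every full label sequence is scored, and Z(x) sums over all of them, so transitions compete globally rather than per state. The features and weights are invented; real implementations compute Z(x) with dynamic programming rather than enumeration, as the next slides show.

```python
import itertools, math

LABELS = ["N", "V"]

def score(y, x):
    # Sum of weighted edge/observation features over the whole chain.
    s = 0.0
    for i in range(1, len(y)):
        if y[i - 1] == "N" and y[i] == "V":
            s += 1.0                 # favour N -> V transitions
        if x[i].endswith("s") and y[i] == "V":
            s += 0.5                 # a simple observation-dependent feature
    return s

def p_y_given_x(y, x):
    z = sum(math.exp(score(yp, x))   # Z(x): sum over ALL label sequences
            for yp in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / z

x = ["dog", "runs"]
for y in itertools.product(LABELS, repeat=len(x)):
    print(y, round(p_y_given_x(y, x), 3))   # the four probabilities sum to 1
```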
Parameter estimation for CRFs
- the original paper provided iterative scaling algorithms, which turn out to be very inefficient
- Prof. Dietterich's group applied a gradient descent algorithm, which is quite efficient

Training of CRFs (from Prof. Dietterich)
First, take the log of the conditional distribution:
$$\log p_\theta(\mathbf{y} \mid \mathbf{x}) = \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) - \log Z(\mathbf{x})$$
then take the derivative of this expression with respect to the parameters. For training, the first two terms are easy to get: for each k, fk over the training sequence is a sequence of Boolean values, such as 00101110100111, and $\sum_e f_k(e, \mathbf{y}|_e, \mathbf{x})$ is just the total number of 1's in the sequence. The hardest part is calculating Z(x).

Training of CRFs (from Prof. Dietterich) (cont'd)
Maximal cliques: for the chain y1 - y2 - y3 - y4, the maximal cliques are c1 = (y1, y2), c2 = (y2, y3), c3 = (y3, y4), with potentials (writing f for the edge features and g for the vertex features):
$$\Phi_{c_1}(y_1, y_2, \mathbf{x}) = \exp\big(g(y_1, \mathbf{x}) + g(y_2, \mathbf{x}) + f(y_1, y_2, \mathbf{x})\big)$$
$$\Phi_{c_2}(y_2, y_3, \mathbf{x}) = \exp\big(g(y_3, \mathbf{x}) + f(y_2, y_3, \mathbf{x})\big) \qquad \Phi_{c_3}(y_3, y_4, \mathbf{x}) = \exp\big(g(y_4, \mathbf{x}) + f(y_3, y_4, \mathbf{x})\big)$$
Then
$$Z(\mathbf{x}) = \sum_{y_1, y_2, y_3, y_4} \Phi_{c_1}(y_1, y_2, \mathbf{x})\, \Phi_{c_2}(y_2, y_3, \mathbf{x})\, \Phi_{c_3}(y_3, y_4, \mathbf{x}) = \sum_{y_1, y_2} \Phi_{c_1}(y_1, y_2, \mathbf{x}) \sum_{y_3} \Phi_{c_2}(y_2, y_3, \mathbf{x}) \sum_{y_4} \Phi_{c_3}(y_3, y_4, \mathbf{x})$$
so Z(x) can be computed efficiently by pushing the sums inward.

POS tagging experiments
- compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging; each word in a given input sentence must be labeled with one of 45 syntactic tags
- added a small set of orthographic features: whether a spelling begins with a number or an upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
- oov = out-of-vocabulary (not observed in the training set)
(Figure: POS tagging experiment results.)

Summary
- locally normalized discriminative models such as MEMMs are prone to the label bias problem
- CRFs provide the benefits of discriminative models
- CRFs solve the label bias problem well, and demonstrate good performance