CS60057 Speech & Natural Language Processing
Autumn 2007, Lecture 8 (9 August 2007)

Why Do We Care about Parts of Speech?
• Pronunciation: "Hand me the lead pipe."
• Predicting what words can be expected next: a personal pronoun (e.g., I, she) ____________
• Stemming: -s means singular for verbs, plural for nouns.
• As the basis for syntactic parsing and then meaning extraction: "I will lead the group into the lead smelter."
• Machine translation:
  (E) content +N = (F) contenu +N
  (E) content +Adj = (F) content +Adj or satisfait +Adj

What is a Part of Speech?
Is this a semantic distinction? For example, maybe Noun is the class of words for people, places and things, and maybe Adjective is the class of words for properties of nouns.
Consider "green book": book is a Noun, green is an Adjective.
Now consider "book worm" and "This green is very soothing." The same word can play different roles, so the distinction is distributional rather than purely semantic.

How Many Parts of Speech Are There?
A first cut at the easy distinctions:
Open classes:
• nouns, verbs, adjectives, adverbs
Closed classes (function words):
• conjunctions: and, or, but
• pronouns: I, she, him
• prepositions: with, on
• determiners: the, a, an

Part of Speech Tagging
• 8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
• This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.).
• Called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS. We'll use POS most frequently.
• I'll assume that you all know what these are.

POS Examples
N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those

Tagsets
• Brown corpus tagset (87 tags): http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
• Penn Treebank tagset (45 tags): http://www.cs.colorado.edu/~martin/SLP/Figures/ (8.6)
• C7 tagset (146 tags): http://www.comp.lancs.ac.uk/ucrel/claws7tags.html

POS Tagging: Definition
The process of assigning a part-of-speech or lexical class marker to each word in a corpus, e.g., assigning each word of "the koala put the keys on the table" a tag from {DET, N, V, P}.

POS Tagging Example
WORD    TAG
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N

POS Tagging: Choosing a Tagset
• There are many parts of speech and many potential distinctions we can draw.
• To do POS tagging, we need to choose a standard set of tags to work with.
• We could pick a very coarse tagset: N, V, Adj, Adv.
• The more commonly used set is finer grained: the Penn Treebank tagset, 45 tags (PRP$, WRB, WP$, VBG, ...).
• Even more fine-grained tagsets exist.

Penn Treebank POS Tag Set
[Figure in the original slides: the table of 45 Penn Treebank tags.]

Using the Penn Treebank Tagset
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• Prepositions and subordinating conjunctions are marked IN ("although/IN I/PRP ...").
• Exception: the preposition/complementizer "to" gets its own tag, TO.
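As a concrete supplement (not part of the original slides), here is a minimal sketch of tagging this same sentence with Penn Treebank tags via NLTK; it assumes NLTK is installed and that its tokenizer and tagger resources have been downloaded.

```python
# Minimal sketch (not from the lecture): Penn Treebank tagging with NLTK.
# Assumes: pip install nltk, plus the tokenizer ('punkt') and tagger
# ('averaged_perceptron_tagger') resources fetched via nltk.download().
import nltk

sentence = "The grand jury commented on a number of other topics."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # list of (word, Penn-Treebank-tag) pairs
print(tagged)
# Expected shape of output (exact tags may vary by model version):
# [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ...]
```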
POS Tagging: Ambiguity
Words often have more than one POS. Consider "back":
• The back door = JJ
• On my back = NN
• Win the voters back = RB
• Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word. (These examples are from Dekang Lin.)

How Hard is POS Tagging? Measuring Ambiguity
• In the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags). Worse, 40% of the tokens are ambiguous.

Algorithms for POS Tagging
Why can't we just look words up in a dictionary?
• Some words aren't in the dictionary, e.g., words in news stories: http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc
• One idea: P(ti | wi) = the probability that a random hapax legomenon in the corpus has tag ti. Nouns are more likely than verbs, which are more likely than pronouns.
• Another idea: use morphology.

Algorithms for POS Tagging: Knowledge
• Dictionary
• Morphological rules, e.g.:
  _____-tion
  _____-ly
  capitalization
• N-gram frequencies, e.g.:
  to _____
  DET _____ N
• But what about rare words, e.g., "smelt" (two verb readings, to melt ore and the past tense of smell, and one noun reading, a small fish)?
• Combining these, e.g. V _____-ing: "I was gracking" vs. "Gracking is fun."
(A code sketch of a morphology-based guesser follows at the end of this section.)

POS Tagging: Approaches
• Rule-based tagging (ENGTWOL)
• Stochastic (= probabilistic) tagging: HMM (Hidden Markov Model) tagging
• Transformation-based tagging: Brill tagger
Two design questions: Do we return one best answer, or several answers and let later steps decide? How does the requisite knowledge get entered?

Three Methods for POS Tagging
1. Rule-based tagging
   Example: the Karlsson (1995) EngCG tagger, based on the Constraint Grammar architecture and the ENGTWOL lexicon.
   Basic idea: assign all possible tags to words (using a morphological analyzer), then remove wrong tags according to a set of constraint rules (typically more than 1000 hand-written constraint rules, though they may be machine-learned).
2. Transformation-based tagging
   Example: the Brill (1995) tagger, a combination of rule-based and stochastic (probabilistic) tagging methodologies.
   Basic idea: start with a tagged corpus plus a dictionary (with most frequent tags). Set the most probable tag for each word as a start value, then change tags according to rules of the type "if word-1 is a determiner and word is a verb, then change the tag to noun", applied in a specific order (like rule-based taggers). Machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach).
3. Stochastic (= probabilistic) tagging
   Example: HMM (Hidden Markov Model) tagging. A training corpus is used to compute the probability (frequency) of a given word having a given POS tag in a given context.

Topics
• Probability, Conditional Probability, Independence, Bayes Rule
• HMM tagging: Markov Chains, Hidden Markov Models
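To make the morphological cues above concrete, here is a small illustrative sketch of an unknown-word tag guesser. The specific rules, their ordering, and the default tag are assumptions for illustration only, not the lecture's actual system.

```python
# Illustrative sketch: guess a POS tag for an out-of-dictionary word using
# the morphological cues from the slide. Rules and their priority order
# are assumptions, not a real tagger's rule set.
def guess_tag(word, sentence_initial=False):
    if word.endswith("tion"):
        return "NN"    # _____-tion suggests a noun
    if word.endswith("ly"):
        return "RB"    # _____-ly suggests an adverb
    if word.endswith("ing"):
        return "VBG"   # V _____-ing contexts: "I was gracking"
    if word[0].isupper() and not sentence_initial:
        return "NNP"   # mid-sentence capitalization suggests a proper noun
    return "NN"        # default: nouns are the most likely guess for hapaxes

for w in ["gracking", "corporation", "slowly", "Kharagpur"]:
    print(w, "->", guess_tag(w))
```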
Introduction to Probability
• Experiment (trial): a repeatable procedure with well-defined possible outcomes.
• Sample space (S): the set of all possible outcomes; it may be finite or infinite.
  Example: coin toss experiment; possible outcomes S = {heads, tails}.
  Example: die toss experiment; possible outcomes S = {1,2,3,4,5,6}.
• The definition of the sample space depends on what we are asking.
  Example: a die toss experiment asking whether the number is even or odd; possible outcomes are {even, odd}, not {1,2,3,4,5,6}.

More Definitions: Events
• An event is any subset of outcomes from the sample space.
  Example: die toss experiment. Let A be the event that the outcome of the toss is divisible by 3. Then A = {3,6}, a subset of the sample space S = {1,2,3,4,5,6}.
• Simple and compound events.
  Example: a card-drawing experiment. Suppose the sample space is S = {heart, spade, club, diamond} (the four suits). Let A be the event of drawing a heart and B the event of drawing a red card. Then A = {heart} (a simple event) and B = {heart} ∪ {diamond} = {heart, diamond} (a compound event). A compound event can be expressed as a set union of simple events.
  Example: with the alternative sample space S = the set of 52 cards, A and B would both be compound events.

Counting
• If an operation oi can be performed in ni ways, a sequence of k operations o1 o2 ... ok can be performed in n1 · n2 · ... · nk ways.
  Example: a die toss has 6 possible outcomes. If two dice are thrown at the same time, the number of sample points in the sample space is 6 × 6 = 36.

Definition of Probability
• The probability law assigns to an event A a nonnegative number, called P(A), also called the probability of A, that encodes our knowledge or belief about the collective likelihood of all the elements of A.
• The probability law must satisfy certain properties.

Probability Axioms
• Nonnegativity: P(A) ≥ 0 for every event A.
• Additivity: if A and B are two disjoint events, then the probability of their union satisfies P(A ∪ B) = P(A) + P(B).
• Normalization: the probability of the entire sample space S is equal to 1, i.e., P(S) = 1.

An Example
• An experiment involving a single coin toss. There are two possible outcomes, H and T, so the sample space is S = {H, T}.
• If the coin is fair, we should assign equal probabilities to the two outcomes, since they have to sum to 1:
  P({H}) = 0.5, P({T}) = 0.5, P({H,T}) = P({H}) + P({T}) = 1.0
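As a quick sanity check of the counting rule and the axioms (an addition, not from the slides), this sketch enumerates the two-dice sample space from the counting example and verifies normalization and additivity.

```python
# Sketch: enumerate the two-dice sample space and check the axioms
# under the uniform probability law.
from itertools import product
from fractions import Fraction

sample_space = list(product(range(1, 7), repeat=2))  # all (die1, die2) pairs
assert len(sample_space) == 6 * 6                    # counting rule: 36 points

p = Fraction(1, len(sample_space))                   # uniform law
print(sum(p for _ in sample_space))                  # normalization: prints 1

# Additivity on disjoint events: "doubles" and "sum equals 3" never overlap.
A = {s for s in sample_space if s[0] == s[1]}
B = {s for s in sample_space if sum(s) == 3}
assert A.isdisjoint(B)
print(len(A) * p + len(B) * p == len(A | B) * p)     # P(A)+P(B) == P(A u B): True
```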
Another Example
• An experiment involving 3 coin tosses. The outcome is a 3-long string of H or T:
  S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} (8 outcomes in total)
• Assume each outcome is equiprobable ("uniform distribution").
• What is the probability of the event that exactly 2 heads occur?
  A = {HHT, HTH, THH} (3 outcomes)
  By additivity, the probability of the union is the sum of the probabilities of the individual outcomes:
  P(A) = P({HHT}) + P({HTH}) + P({THH}) = 1/8 + 1/8 + 1/8 = 3/8

Probability Definitions
In summary: the probability of drawing a spade from 52 well-shuffled playing cards is 13/52 = 1/4 = 0.25.

Moving Toward Language
• What's the probability of drawing a 2 from a deck of 52 cards with four 2s?
  P(drawing a two) = 4/52 = 1/13 ≈ .077
• What's the probability of a random word (from a random dictionary page) being a verb?
  P(drawing a verb) = (# of ways to get a verb) / (all words)

Probability and Part-of-Speech Tags
• How to compute each of these:
  All words = just count all the words in the dictionary.
  # of ways to get a verb = # of words which are verbs.
• If a dictionary has 50,000 entries, and 10,000 are verbs, then P(V) = 10000/50000 = 1/5 = .20

Conditional Probability
A way to reason about the outcome of an experiment based on partial information:
• In a word-guessing game, the first letter of the word is a "t". What is the likelihood that the second letter is an "h"?
• How likely is it that a person has a disease given that a medical test was negative?
• A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?

More Precisely
• Given an experiment, a corresponding sample space S, and a probability law, suppose we know that the outcome is some event B.
• We want to quantify the likelihood that the outcome also belongs to some other event A.
• We need a new probability law that gives us the conditional probability of A given B: P(A|B).

An Intuition
• Let's say A is "it's raining", and P(A) in dry Florida is .01.
• Let's say B is "it was sunny ten minutes ago".
• P(A|B) means "what is the probability of it raining now, given that it was sunny 10 minutes ago?"
• P(A|B) is probably way less than P(A); perhaps P(A|B) is .0001.
• Intuition: the knowledge about B should change our estimate of the probability of A.

Conditional Probability: Definition
• Let A and B be events in the sample space. P(A|B) is the conditional probability of event A occurring given that event B occurs.
• Definition: P(A|B) = P(A ∩ B) / P(B)
• Writing P(A,B) for P(A ∩ B): P(A|B) = P(A,B) / P(B)
• Note: P(A,B) = P(A|B) · P(B), and also P(A,B) = P(B,A).

Independence
• What is P(A,B) if A and B are independent?
• P(A,B) = P(A) · P(B) iff A and B are independent.
  Example: P(heads, tails) = P(heads) · P(tails) = .5 · .5 = .25
• Note: P(A|B) = P(A) iff A, B independent; also P(B|A) = P(B) iff A, B independent.

Bayes Theorem
P(A|B) P(B) = P(B|A) P(A)
• Idea: the probability of an event A conditional on another event B is generally different from the probability of B conditional on A, but there is a definite relationship between the two.
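The card-drawing events defined earlier can be used to check these identities mechanically. The following sketch (an addition, not from the slides) verifies the conditional-probability definition and the Bayes identity on the 52-card sample space, with A = "heart" and B = "red card".

```python
# Sketch: verify P(A|B) = P(A n B)/P(B) and P(A|B)P(B) = P(B|A)P(A)
# on the 52-card sample space from the slides.
from fractions import Fraction

suits = ["heart", "diamond", "spade", "club"]
deck = [(rank, suit) for suit in suits for rank in range(1, 14)]  # 52 cards

def P(event):                                          # uniform probability law
    return Fraction(len(event), len(deck))

A = {c for c in deck if c[1] == "heart"}               # event: draw a heart
B = {c for c in deck if c[1] in ("heart", "diamond")}  # event: draw a red card

cond_A_given_B = P(A & B) / P(B)                       # definition of P(A|B)
cond_B_given_A = P(A & B) / P(A)
print(cond_A_given_B)                                  # 1/2: half of red cards are hearts
print(cond_A_given_B * P(B) == cond_B_given_A * P(A))  # Bayes identity: True
```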
Deriving Bayes Rule
• The probability of event A given event B is P(A|B) = P(A ∩ B) / P(B).
• The probability of event B given event A is P(B|A) = P(A ∩ B) / P(A).
• Multiplying out: P(A|B) P(B) = P(A ∩ B) and P(B|A) P(A) = P(A ∩ B).
• Setting the two right-hand sides equal gives P(A|B) P(B) = P(B|A) P(A), so
  P(A|B) = P(B|A) P(A) / P(B)
• The theorem may be paraphrased as:
  conditional/posterior probability = (LIKELIHOOD × PRIOR) / NORMALIZING CONSTANT

Hidden Markov Model (HMM) Tagging
• We will use an HMM to do POS tagging.
• HMM tagging is a special case of Bayesian inference.
• It is also related to the "noisy channel" model in ASR (Automatic Speech Recognition).

POS Tagging as a Sequence Classification Task
• We are given a sentence (an "observation" or "sequence of observations"):
  Secretariat is expected to race tomorrow
  i.e., a sequence of n words w1…wn.
• What is the best sequence of tags which corresponds to this sequence of observations?
• Probabilistic/Bayesian view: consider all possible sequences of tags. Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

Getting to HMM
• Let T = t1,t2,…,tn and W = w1,w2,…,wn.
• Goal: out of all sequences of tags t1…tn, get the most probable sequence of POS tags T underlying the observed sequence of words w1,w2,…,wn:
  T^ = argmax_T P(T | W)
• The hat ^ means "our estimate of the best, i.e., the most probable, tag sequence"; argmax_x f(x) means "the x such that f(x) is maximized", so the right-hand side maximizes our estimate of the best tag sequence.
• This equation is guaranteed to give us the best tag sequence, but how do we make it operational? How do we compute this value?
• Intuition of Bayesian classification: use Bayes rule to transform the equation into a set of other probabilities that are easier to compute.
  (Thomas Bayes: British mathematician, 1702-1761.)

Bayes Rule for Tagging
• Bayes rule breaks down any conditional probability P(x|y), the conditional probability of an event x assuming that y has occurred, into three other probabilities. Applied here:
  P(T | W) = P(W | T) P(T) / P(W)
• We can drop the denominator P(W): it does not change across tag sequences, since we are looking for the best tag sequence for the same observation, the same fixed set of words:
  T^ = argmax_T P(W | T) P(T)

Likelihood and Prior
P(W|T) is the likelihood and P(T) is the prior. Two further simplifications make these computable (each is illustrated in the next section):
1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it:
   P(W | T) ≈ Π(i=1..n) P(wi | ti)
2. The BIGRAM assumption: the probability of a tag appearing depends only on the previous tag:
   P(T) ≈ Π(i=1..n) P(ti | ti-1)
3. The most probable tag sequence estimated by the bigram tagger is then:
   T^ = argmax_T Π(i=1..n) P(wi | ti) P(ti | ti-1)
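As an illustration of this final equation (an addition, not from the lecture), the sketch below scores every candidate tag sequence by brute force for a toy three-word sentence. All transition and emission probabilities here are invented placeholder numbers, not corpus estimates.

```python
# Sketch: brute-force T^ = argmax_T prod_i P(w_i|t_i) * P(t_i|t_i-1).
# The tiny transition/emission tables are invented for illustration.
from itertools import product

tags = ["DET", "N", "V"]
trans = {                                  # P(t_i | t_i-1); <s> = sentence start
    ("<s>", "DET"): 0.8, ("<s>", "N"): 0.1, ("<s>", "V"): 0.1,
    ("DET", "N"): 0.9,  ("DET", "V"): 0.05, ("DET", "DET"): 0.05,
    ("N", "V"): 0.6,    ("N", "N"): 0.3,    ("N", "DET"): 0.1,
    ("V", "DET"): 0.6,  ("V", "N"): 0.3,    ("V", "V"): 0.1,
}
emit = {                                   # P(w_i | t_i)
    ("the", "DET"): 0.7, ("koala", "N"): 0.01, ("eats", "V"): 0.02,
}

def score(words, tag_seq):
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tag_seq):
        p *= trans.get((prev, t), 0.0) * emit.get((w, t), 0.0)
        prev = t
    return p

words = ["the", "koala", "eats"]
best = max(product(tags, repeat=len(words)), key=lambda ts: score(words, ts))
print(best, score(words, best))            # ('DET', 'N', 'V') wins
```

Enumerating all tag sequences is exponential in sentence length; the Viterbi algorithm previewed at the end of this lecture computes the same argmax efficiently.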
Likelihood and Prior: the Simplifications Illustrated
1. Each word depends only on its own POS tag, independent of the words around it. In the koala sentence, each word is generated from its tag alone: the/DET, koala/N, put/V, the/DET, keys/N, on/P, the/DET, table/N.
2. The BIGRAM assumption. Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-grams, and are used as the basis for simple statistical analysis of text. The bigram assumption is related to the first-order Markov assumption.
3. Under the bigram assumption, the most probable tag sequence is estimated by the product of word likelihoods and tag transition probabilities, as in the equation above.

Two Kinds of Probabilities
(1) Tag transition probabilities P(ti|ti-1)
• Determiners are likely to precede adjectives and nouns:
  That/DT flight/NN
  The/DT yellow/JJ hat/NN
• So we expect P(NN|DT) and P(JJ|DT) to be high, but P(DT|JJ) to be low.
• Compute P(NN|DT) by counting in a labeled corpus:
  P(NN|DT) = Count(DT followed by NN) / Count(DT)

(2) Word likelihood probabilities P(wi|ti)
• P(is|VBZ) = the probability of a VBZ (3sg present verb) being "is": if we were expecting a third person singular verb, how likely is it that this verb would be "is"?
• Compute P(is|VBZ) by counting in a labeled corpus:
  P(is|VBZ) = Count(VBZ tokens that are "is") / Count(VBZ)

An Example: the Verb "race"
• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• How do we pick the right tag?

Disambiguating "race"
• The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question "how likely are we to expect a verb/noun given the previous tag TO?":
  P(NN|TO) = .00047    P(VB|TO) = .83
• Lexical likelihoods from the Brown corpus for "race" given a POS tag NN or VB:
  P(race|NN) = .00057    P(race|VB) = .00012
• Tag sequence probabilities for the likelihood of NR occurring given the previous tag verb or noun:
  P(NR|VB) = .0027    P(NR|NN) = .0012
• Multiply the lexical likelihoods with the tag sequence probabilities:
  P(VB|TO) · P(NR|VB) · P(race|VB) = .00000027
  P(NN|TO) · P(NR|NN) · P(race|NN) = .00000000032
• The verb wins.
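Plugging the slide's Brown-corpus estimates into the bigram model reproduces this comparison; a short sketch:

```python
# Sketch: reproduce the "race" disambiguation with the probabilities
# quoted on the slide (Brown corpus estimates).
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"verb reading: {p_vb:.8f}")  # ~0.00000027
print(f"noun reading: {p_nn:.11f}") # ~0.00000000032
print("winner:", "VB" if p_vb > p_nn else "NN")
```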
Hidden Markov Models
• What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM). Let's spend a bit of time tying this into the model.
• In order to define an HMM, we will first introduce the Markov chain, or observable Markov Model.

Definitions
• A weighted finite-state automaton adds probabilities to the arcs; the probabilities on the arcs leaving any state must sum to one.
• A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through.
• Markov chains can't represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.

Markov Chain = "First-Order Observable Markov Model"
• A set of states Q = q1, q2 … qN; the state at time t is qt.
• A set of transition probabilities A = a01, a02, … an1, … ann, where each aij represents the probability of transitioning from state i to state j. The set of these is the transition probability matrix A:
  aij = P(qt = j | qt-1 = i),  1 ≤ i, j ≤ N
  Σ(j=1..N) aij = 1,  1 ≤ i ≤ N
• Distinguished start and end states; alternatively, a special initial probability vector π, where each πi expresses the probability that the model will start in state i: πi = P(qi | START).

Markov Chain for Weather: Example 1
• Three types of weather: sunny, rainy, foggy.
• We want to find the conditional probabilities P(qn | qn-1, qn-2, …, q1), i.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days.
• We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences.
• Problem: the larger n is, the more observations we must collect. Suppose n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories.

The (First-Order) Markov Assumption
• Therefore, we make a simplifying assumption, called the (first-order) Markov assumption: for a sequence of observations q1 … qn, the current state depends only on the previous state,
  P(qn | qn-1, …, q1) ≈ P(qn | qn-1)
• The joint probability of past and current observations then factors as
  P(q1, …, qn) = P(q1) · Π(i=2..n) P(qi | qi-1)
• Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy? Using the Markov assumption and the probabilities in the weather table (a figure in the original slides), this translates into:
  P(q2 = sunny, q3 = rainy | q1 = sunny) = P(sunny | sunny) · P(rainy | sunny)

The Weather Figure: Example 2
[Figure in the original slides: the weather state-transition diagram with specific probabilities.]

Markov Chain for Weather
• What is the probability of 4 consecutive rainy days?
• The sequence is rainy-rainy-rainy-rainy, i.e., state sequence 3-3-3-3:
  P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)³ = 0.0432
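The four-rainy-days computation generalizes to any state sequence: the initial probability times a product of transitions. In the sketch below, π(rainy) = 0.2 and a(rainy, rainy) = 0.6 come from the slide; the remaining table entries are placeholder assumptions standing in for the weather figure that did not survive extraction.

```python
# Sketch: probability of a state sequence under a first-order Markov chain.
# pi["rainy"] = 0.2 and a["rainy"]["rainy"] = 0.6 are from the slide;
# all other entries are assumed stand-ins for the missing weather figure.
pi = {"sunny": 0.5, "rainy": 0.2, "foggy": 0.3}
a = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

def sequence_prob(seq):
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= a[prev][cur]        # first-order Markov assumption
    return p

print(sequence_prob(["rainy"] * 4))   # 0.2 * 0.6**3 = 0.0432
```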
Hidden Markov Model
• For Markov chains, the output symbols are the same as the states: if we see sunny weather, we're in state sunny.
• But in part-of-speech tagging (and other things), the output symbols are words while the hidden states are part-of-speech tags. So we need an extension!
• A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states. This means we don't know which state we are in.

Markov Chain for Weather
[Figure in the original slides: the weather state-transition diagram.]

Markov Chain for Words
• Observed events: words. Hidden events: tags.

Hidden Markov Models
• States Q = q1, q2 … qN.
• Observations O = o1, o2 … oN; each observation is a symbol drawn from a vocabulary V = {v1, v2, … vV}.
• Transition probabilities (the prior): a transition probability matrix A = {aij}.
• Observation likelihoods (the likelihood): an output probability matrix B = {bi(ot)}, a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i (the emission probabilities).
• A special initial probability vector π: each πi expresses the probability that the HMM will start in state i, πi = P(qi | START).

Assumptions
• Markov assumption: the probability of a particular state depends only on the previous state,
  P(qi | q1 … qi-1) = P(qi | qi-1)
• Output-independence assumption: the probability of an output observation depends only on the state that produced that observation.

HMM for Ice Cream
• You are a climatologist in the year 2799, studying global warming.
• You can't find any records of the weather in Boston, MA for the summer of 2007.
• But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer.
• Our job: figure out how hot it was.

The Task
• Given the ice-cream observation sequence 1,2,3,2,2,2,3,… (cf. the output symbols),
• Produce the weather sequence C,C,H,C,C,C,H,… (cf. the hidden, causing states).

HMM for Ice Cream
[Figure in the original slides: the two-state (Hot/Cold) ice-cream HMM with emission probabilities.]

Different Types of HMM Structure
• Bakis = left-to-right
• Ergodic = fully-connected

HMM Taggers
• Two kinds of probabilities: A, the transition probabilities (the PRIOR), and B, the observation likelihoods (the LIKELIHOOD).
• HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability.

Weighted FSM Corresponding to the Hidden States of the HMM, Showing the A Probabilities
[Figure in the original slides.]

B Observation Likelihoods for the POS HMM
[Figure in the original slides.]

HMM Taggers: Training and Evaluation
• The probabilities are trained on hand-labeled training corpora (the training set).
• Different N-gram levels can be combined.
• Taggers are evaluated by comparing their output on a test set to human labels for that test set (the gold standard).

Next Time
• Minimum edit distance, a "dynamic programming" algorithm.
• A probabilistic version of this, called "Viterbi", is a key part of the Hidden Markov Model! (A sketch follows below.)

Remaining Topics
• Evaluation
• Error analysis
• Tag indeterminacy
• Unknown words
[These closing slides contained only figures in the original.]
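Since the "Next Time" slide points to Viterbi as the decoder for exactly this kind of model, here is a compact sketch of Viterbi decoding on the ice-cream HMM. The Hot/Cold transition and emission numbers are illustrative assumptions, not Eisner's actual figures, which were in the slide images.

```python
# Sketch: Viterbi decoding for the ice-cream HMM. All probabilities are
# illustrative assumptions; the slides' actual values were figures.
states = ["H", "C"]                                   # Hot, Cold
pi = {"H": 0.5, "C": 0.5}
a = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
b = {                                                 # P(# ice creams | weather)
    "H": {1: 0.1, 2: 0.3, 3: 0.6},
    "C": {1: 0.6, 2: 0.3, 3: 0.1},
}

def viterbi(obs):
    # v[t][s]: probability of the best path ending in state s at time t
    v = [{s: pi[s] * b[s][obs[0]] for s in states}]
    back = []                                         # backpointers
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: v[-1][r] * a[r][s])
            col[s] = v[-1][prev] * a[prev][s] * b[s][o]
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    # follow backpointers from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi([1, 2, 3, 2, 2, 2, 3]))   # e.g. ['C', 'C', 'H', ...] under these numbers
```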