Download ppt

Probabilistic Parsing and Treebanks L545 Spring 2000 Page 1 Motivation and Outline  Previously, we used CFGs to parse with, but: - Some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse - We did not discuss how to obtain such grammars  Where we’re going: - Probabilistic Context-Free Grammars (PCFGs) + some discussion of treebanks - Lexicalized PCFGs  We’ll only cover PCFGs in very broad strokes - L645 offers Page 2 more details Statistical Parsing  Basic idea - Start with a treebank  a collection of sentences with syntactic annotation, i.e., already-parsed sentences - Examine which parse trees occur frequently - Extract grammar rules corresponding to those parse trees, estimating the probability of the grammar rule based on its frequency  Result: a CFG augmented with probabilities Page 3 Probabilistic Context-Free Grammars (PCFGs)  Definition of a CFG (review): - Set of non-terminals (N) - Set of terminals (T) - Set of rules/productions (P), of the form Α  β - Designated start symbol (S)  Definition of a PCFG: - Same as a CFG, but with one more function, D - D assigns probabilities to each rule in P Page 4 Probabilities  The function D gives probabilities for a non-terminal A to be expanded to a sequence β. - Written as P(A  β) - or as P(A  β|A)  Idea: given A as the mother non-terminal (LHS), what is the likelihood that β is the correct RHS? - Note that Σi (A  βi | A) = 1  i.e., these are all the ways of expanding A - For example, we would augment a CFG with these probabilities:  P(S  NP VP | S) = .80  P(S  Aux NP VP | S) = .15 Page 5  P(S  VP | S) = .05 Estimating Probabilities using a Treebank  Given a corpus of sentences annotated with syntactic annotation (e.g., the Penn Treebank) - Consider all parse trees - (1) Each time you have a rule of the form Aβ applied in a parse tree, increment a counter for that rule - (2) Also count the number of times A is on the left hand side of a rule - Divide (1) by (2)  P(Aβ|A) = Count(Aβ)/Count(A)  If you don’t have annotated data, parse the corpus (as we’ll Page 6 describe next) and estimate the probabilities … which are then Using Probabilities to Parse  P(T): probability of a particular parse tree  P(T) = ΠnєT p(r(n)) i.e., the product of the probabilities of the rules r used to expand each node n in the parse tree Page 7 Computing probabilities  We have the following rules and probabilities (adapted from Figure 14.1): - S  VP .05 - VP  V NP .40 - NP  Det N .20 - V  book .30 - Det  that .05 - N  flight .25  P(T) = P(SVP)*P(VPV NP)*…*P(Nflight) = .05*.40*.20*.30*.05*.25 = .000015, or 1.5 x 10-5 Page 8 Using probabilities  So, the probability for that parse is 0.000015. this mean? What’ does - Probabilities are useful for comparing with other probabilities  Whereas we couldn’t decide between two parses using a regular CFG, we now can.  For example, TWA flights is ambiguous between being two separate NPs (cf. I gave [NP John] [NP money]) or one NP: - A: [book [TWA] [flights]] - B: [book [TWA flights]]  Probabilities allows us to choose choice B Page 9 Obtaining the best parse  Call the best parse T(S), where S is your sentence - Get the tree which has the highest probability, i.e. - T(S) = argmaxTєparse-trees(S) P(T)  Can use the Cocke-Younger-Kasami (CYK) algorithm to calculate best parse Page 10 The CYK algorithm  Base case - Add words to the chart - Store P(A  w_i) for every category A in the chart  Recursive case - Get the probability for A at this node by multiplying the probabilities for B and for C by P(A  BC)  P(B)*P(C)*P(A  BC) - Only calculate this once - Rules must be of the form A  BC, i.e., exactly two items on the RHS (Chomsky Normal Form (CNF))  For a given A, only keep the maximum probability Page 11 - Previously, we kept A, but without any probabilities Problems with PCFGs  It’s still only a CFG, so dependencies on non-CFG info not captured - e.g., Pronouns are more likely to be subjects than objects: - P[(NPPronoun) | NP=subj] >> P[(NPPronoun) | NP =obj]  Ignores lexical information (statistics), which is usually crucial for disambiguation - (T1) America sent [[250,000 soldiers] [into Iraq]] - (T2) America sent [250,000 soldiers] [into Iraq]  send with into-PP is the correct analysis (T2) because they “go well” together Page 12  To handle lexical information, we’ll turn to lexicalized PCFGs Lexicalized Grammars  Remember how head information is passed up in a syntactic analysis? - e.g., VP[head [1]]  V[head [1]] NP - If you follow this down all the way to the bottom of a tree, you wind up with a head word  In some sense, we can say that Book that flight is not just an S, but an S rooted in book - Thus, book is the headword of the whole sentence  By adding headword information to nonterminals, we wind up with a lexicalized grammar Page 13 Lexicalized PCFGs  Lexicalized Parse Trees - Each PCFG rule in a tree is augmented to identify one RHS constituent to be the head daughter - The headword for a node is set to the head word of its head daughter Page 14 [book] [book] [flight] [flight] Incorporating Head Probabilities: Wrong Way  Simply adding headword w to node won’t work: - So, the node A becomes A[w] - e.g., P(A[w]β|A) =Count(A[w]β)/Count(A)  The probabilities are too small, i.e., we don’t have a big enough corpus to calculate these probabilities - VP(dumped)  VBD(dumped) NP(sacks) PP(into) 3x10-10 - VP(dumped)  VBD(dumped) NP(cats) PP(into) 8x10-11  These probabilities are tiny, and others will never occur Page 15 Incorporating head probabilities: Right way  Previously, we conditioned on the mother node (A): - P(Aβ|A)  Now, we can condition on the mother node and the headword of A (h(A)): - P(Aβ|A, h(A))  We’re no longer conditioning on simply the mother category A, but on the mother category when h(A) is the head - e.g., P(VPVBD NP PP | VP, dumped) - The likelihood of VP expanding to VBD NP PP when dumped is the head is different than with ate Page 16 Calculating rule probabilities  We’ll write the probability more generally as: - P(r(n) | n, h(n)) - where n = node, r = rule, and h = headword  We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appear in total: P(VP  VBD NP PP | VP, dumped) = C(VP(dumped)  VBD NP PP)/ Σβ C(VP(dumped)  β) Page 17 Adding info about word-word dependencies  We want to take into account one other factor: the probability of being a head word (in a given context) - P(h(n)=word | …)  We condition this probability on two things: 1. the category of the node (n), and 2. the headword of the mother (h(m(n))) - P(h(n)=word | n, h(m(n))), shortened as: P(h(n) | n, h(m(n))) - P(sacks | NP, dumped)  What we’re really doing is factoring in how words relate to each other  These are dependency relations: sacks is dependent on dumped, in this case Page 18 Putting it all together  See sec. 14.6 for an example lexicalized parse tree for workers dumped sacks into a bin  For rules r, category n, head h, mother m P(T) = ΠnєT p(r(n)| n, h(n)) e.g., P(VP VBD NP PP |VP, dumped) subcategorization info * p(h(n) | n, h(m(n))) e.g. P(sacks | NP, dumped) dependency info between words Page 19 Evaluating Parser Output  Traditional measures of parser accuracy: - Labeled bracketing precision: # correct constituents in parse/# constituents in parse - Labeled bracketing recall: # correct constituents in parse/# (correct) constituents in treebank parse  There are known problems with these measures, so people are trying to use dependency-based measures instead - How many dependency relations did the parse get correct? Page 20

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ppt