Corpora and Statistical Methods
Lecture 5
Albert Gatt

Application 3: Verb selectional restrictions

Observation
Some verbs place strong restrictions on the semantic category of the NPs they take as arguments. (Assumption: we are focusing on direct objects only.)
- eat selects for FOOD direct objects: eat cake, eat some fresh vegetables
- grow selects for LEGUME direct objects: grow potatoes

Not all verbs are equally constraining
Some verbs seem to place fewer restrictions than others. see does not seem very restrictive: see John, see the potato, see the fresh vegetables, ...

Problem definition
For a given verb and a potential set of arguments (nouns), we want to learn to what extent the verb selects for those arguments.
- Rather than individual nouns, we are better off using noun classes (FOOD, etc.), since these allow us to generalise more.
- We can obtain such classes from a standard resource, e.g. WordNet.

A short detour: Kullback-Leibler divergence
We are often in a position where we estimate a probability distribution from (incomplete) data; this problem is inherent in sampling. We end up with a distribution P which is intended as a model of a distribution Q. How good is P as a model? Kullback-Leibler divergence tells us how well our model matches the actual distribution.

Motivating example
Suppose I am interested in the semantic type or class to which a noun belongs, e.g.:
- cake, meat, cauliflower are types of FOOD (among other things)
- potato, carrot are types of LEGUME (among other things)
How do I infer this? It helps if I know that certain predicates, like grow, select for some types of direct object and not others:
- *grow meat, *grow cake
- grow potatoes, grow carrots

Motivating example (cont'd)
Ingredients:
- C: the class of interest (e.g. LEGUME)
- v: the verb of interest (e.g. grow)
- P(C): the probability of class C, i.e. the prior probability of finding some element of C as the direct object of any verb
- P(C|v): the probability of C given that we know a noun is the direct object of grow; this is the posterior probability
A more precise way of asking the question: does the probability distribution of C change given the information about v?

Ingredients for KL divergence
- some prior distribution P
- some posterior distribution Q
Intuition: KL divergence measures how much information we gain about P, given that we know Q; if it is 0, we gain no information. Given two probability distributions P and Q, with probability mass functions p(x) and q(x), the KL divergence is denoted D(p||q).

Calculating KL divergence
D(p || q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
This is the divergence between the prior and posterior probability distributions. (A short code sketch of this computation is given below, after the overview of Resnik's model.)

More on the interpretation of KL divergence
If the probability distribution P is interpreted as "the truth" and the distribution Q is my approximation, then D(p||q) tells me how much extra information I need to add to Q to get to the actual truth.

Back to our problem: applying KL divergence to selectional restrictions
Resnik's model (Resnik 1996) has two main ingredients:
1. Selectional preference strength S(v): how strongly a verb constrains its direct object (a global estimate).
2. Selectional association A(v,c): how much a verb v is associated with a given noun class c (a specific estimate for a given class).
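A minimal Python sketch of the KL divergence computation above, assuming base-2 logarithms (so divergence is measured in bits); the prior and posterior distributions over noun classes are invented purely for illustration.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) between two discrete
    distributions, given as dicts mapping outcomes to probabilities.
    Base-2 logs, so the result is in bits; outcomes with p(x) = 0
    contribute nothing to the sum."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Invented prior and posterior over noun classes, purely for illustration:
prior = {"FOOD": 0.25, "LEGUME": 0.25, "PEOPLE": 0.25, "ACTION": 0.25}
posterior_grow = {"FOOD": 0.10, "LEGUME": 0.80, "PEOPLE": 0.05, "ACTION": 0.05}

print(kl_divergence(posterior_grow, prior))  # > 0: knowing v = grow changes the distribution
print(kl_divergence(prior, prior))           # 0.0: identical distributions, no information gained
```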
Notation
- v: a verb of interest
- S(v): the selectional preference strength of v
- c: a noun class
- C: the set of all noun classes
- A(v,c): the selectional association between v and class c

Selectional preference strength
S(v) is the KL divergence between:
- the overall prior distribution of all noun classes, and
- the posterior distribution of noun classes in the direct object position of v,
i.e. how much information we gain from knowing the probability that members of a class occur as the direct object of v. It works as a global estimate of how much v constrains its arguments semantically: the more it constrains them, the more information we stand to gain from knowing that an argument occurs as the direct object of v.

[Figure: S(grow): prior vs. posterior distribution over noun classes. Source: Resnik 1996, p. 135]

Calculating S(v)
S(v) = D(P(C|v) || P(C)) = \sum_{c \in C} P(c|v) \log \frac{P(c|v)}{P(c)}
This quantifies the extent to which our prior and posterior probability estimates diverge: how much information do we gain about C by knowing that a noun is the object of v?

Some more examples
class       P(c)    P(c|eat)   P(c|see)   P(c|find)
people      0.25    0.01       0.25       0.33
furniture   0.25    0.01       0.25       0.33
food        0.25    0.97       0.25       0.33
action      0.25    0.01       0.25       0.01
SPS: S(v)           1.76       0.00       0.35
How much information do we gain if we know what verb a noun is the direct object of?
- quite a lot if it is an argument of eat
- not much if it is an argument of find
- none if it is an argument of see
(Source: Manning and Schutze 1999, p. 290. These figures are reproduced by the code sketch at the end of this lecture.)

Selectional association
This is estimated on the basis of selectional preference strength. It tells us how much a verb is associated with a specific class, given the extent to which the verb constrains its arguments: for a class c, A(v,c) tells us how much of S(v) is contributed by c.

Calculating A(v,c)
A(v,c) = \frac{P(c|v) \log \frac{P(c|v)}{P(c)}}{S(v)}
The numerator is one term of the summation for S(v); dividing by S(v) gives the proportion of S(v) which is contributed by class c.

From A(v,c) to A(v,n)
We know how to estimate the association strength of a class with v. Problem: some nouns belong to more than one class. Let classes(n) be the set of classes to which noun n belongs; then
A(v,n) = A(v,c*), where c* = \arg\max_{c \in classes(n)} A(v,c)

Example
Susan interrupted the chair.
- chair is in class FURNITURE
- chair is in class PEOPLE
- A(interrupt, PEOPLE) > A(interrupt, FURNITURE)
- therefore A(interrupt, chair) = A(interrupt, PEOPLE)
Note that this is a kind of word-sense disambiguation!

Some results from Resnik 1996
Verb (v)   Noun (n)   Class (c)       A(v,n)
answer     request    speech act      4.49
answer     tragedy    communication   3.88
hear       story      communication   1.89
hear       issue      communication   1.89
There are some fairly atypical examples; these are due to the disambiguation method. E.g. tragedy can be in the COMMUNICATION class, and so is assigned A(answer, COMMUNICATION) as its A(v,n).

Overall evaluation
Resnik's results were shown to correlate very well with results from a psycholinguistic study. The method is promising:
- it seems to mirror human intuitions
- it may have some psychological validity
Possibly an alternative, data-driven account of the semantic bootstrapping hypothesis of Pinker 1989?
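To make S(v), A(v,c) and A(v,n) concrete, here is a minimal Python sketch using the toy eat/see/find distributions from the table above. Base-2 logarithms are assumed, which reproduces the S(v) figures of 1.76, 0.00 and 0.35; the function names and the final ambiguous-noun example are mine, not Resnik's.

```python
import math

# Prior and posterior distributions over noun classes, taken from the
# toy table above (Manning and Schutze 1999, p. 290).
PRIOR = {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25}
POSTERIOR = {
    "eat":  {"people": 0.01, "furniture": 0.01, "food": 0.97, "action": 0.01},
    "see":  {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25},
    "find": {"people": 0.33, "furniture": 0.33, "food": 0.33, "action": 0.01},
}

def contribution(v, c):
    """One term of the S(v) sum: P(c|v) * log2(P(c|v) / P(c))."""
    p_cv = POSTERIOR[v][c]
    return p_cv * math.log2(p_cv / PRIOR[c]) if p_cv > 0 else 0.0

def S(v):
    """Selectional preference strength: S(v) = D(P(C|v) || P(C))."""
    return sum(contribution(v, c) for c in PRIOR)

def A(v, c):
    """Selectional association A(v,c): the proportion of S(v) contributed by c."""
    return contribution(v, c) / S(v)

def A_noun(v, classes_of_n):
    """A(v,n): the association of the class of n that maximises A(v,c)."""
    return max(A(v, c) for c in classes_of_n)

for v in ("eat", "see", "find"):
    print(v, round(S(v), 2))          # eat 1.76, see 0.0, find 0.35

print(round(A("eat", "food"), 2))     # ~1.08: FOOD accounts for essentially all of S(eat)
# A hypothetical noun ambiguous between PEOPLE and FOOD resolves to FOOD for eat:
print(round(A_noun("eat", ["people", "food"]), 2))
```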