Corpora and Statistical Methods
Lecture 5
Albert Gatt
Application 3: Verb selectional restrictions
Observation
 Some verbs place high restrictions on the semantic category
of the NPs they take as arguments.
 Assumption: we’re focusing attention on Direct Objects only
 e.g. eat selects for FOOD DOs:
 eat cake
 eat some fresh vegetables
 grow selects for LEGUME DOs:
 grow potatoes
Not all verbs are equally constraining
 Some verbs seem to place fewer restrictions than others:
 see doesn’t seem too restrictive:
 see John
 see the potato
 see the fresh vegetables
 …
Problem definition
 For a given verb and a potential set of arguments (nouns), we
want to learn to what extent the verb selects for those
arguments
 rather than individual nouns, we’re better off using noun classes (FOOD etc.), since these allow us to generalise more
 these classes can be obtained from a standard resource, e.g. WordNet (see the sketch below)
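As an illustration of that last point, here is a minimal sketch (not from the lecture) of how one might collect candidate noun classes from WordNet using NLTK; the function name candidate_classes and the use of hypernym paths as "classes" are assumptions for illustration only.

```python
# Minimal sketch: candidate semantic classes for a noun from WordNet (via NLTK).
# Assumes nltk is installed and the WordNet data has been fetched,
# e.g. with nltk.download('wordnet'). The helper name is hypothetical.
from nltk.corpus import wordnet as wn

def candidate_classes(noun):
    """Return the names of all hypernym synsets that `noun` falls under."""
    classes = set()
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():      # every path up to the root
            classes.update(s.name() for s in path)
    return classes

# 'potato' should surface under food- and vegetable-like synsets, among others
print(sorted(candidate_classes("potato")))
```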
A short detour: Kullback-Leibler divergence
Kullback-Leibler divergence
 We are often in a position where we estimate a probability
distribution from (incomplete) data
 This problem is inherent in sampling.
 We end up with an estimated distribution Q, which is intended as a model of
the true distribution P.
 How good is Q as a model?
 Kullback-Leibler divergence tells us how well our model
matches the actual distribution.
Motivating example
 Suppose I’m interested in the semantic type or class to which
a noun belongs, e.g.:
 cake, meat, cauliflower are types of FOOD (among other things)
 potato, carrot are types of LEGUME (among other things)
 How do I infer this?
 It helps if I know that certain predicates, like grow, select for
some types of DO, not others
 *grow meat, *grow cake
 grow potatoes, grow carrots
Motivating example (cont’d)
 Ingredients
 C: the class of interest (e.g. LEGUME)
 v: the verb of interest (e.g. grow)
 P(C) = probability of class C
 prior probability of finding some element of C as DO of any verb
 P(C|v) = probability of C given that we know that a noun is a DO of
grow
 this is my posterior probability
 More precise way of asking the question:
 Does the probability distribution of C change given the info about v?
Ingredients for KL Divergence
 a prior probability distribution (our baseline estimate)
 a posterior probability distribution (our estimate given some extra evidence)
 Intuition: KL-divergence measures how much information we gain by
moving from the prior to the posterior
 if it’s 0, then we gain no info
 Given two probability distributions P and Q, with probability
mass functions p(x) and q(x), KL-Divergence is denoted
D(p||q)
Calculating KL-Divergence
D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

 i.e. the divergence between the prior and posterior probability distributions
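For concreteness, a small sketch of this computation over two discrete distributions represented as dictionaries; the use of base-2 logarithms (so the result is in bits) is an assumption, matching the figures quoted later from Manning and Schütze.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)).
    p and q are dicts mapping outcomes to probabilities; q must be non-zero
    wherever p is non-zero, and terms with p(x) = 0 contribute nothing."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
print(kl_divergence(p, p))   # 0.0: identical distributions do not diverge
print(kl_divergence(p, q))   # > 0: the further apart, the larger the value
```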
More on the interpretation of KL-Divergence
 If probability distribution P is interpreted as “the truth” and
distribution Q is my approximation, then:
 D(p||q) tells me how much extra info I need to add to Q to
get to the actual truth
Back to our problem: Applying KL-divergence to selectional restrictions
Resnik’s model (Resnik 1996)
 2 main ingredients:
 1. Selectional Preference Strength (S): how strongly a verb constrains its direct object (a global estimate)
 2. Selectional Association (A): how much a verb v is associated with a given noun class (a specific estimate for a given class)
Notation
 v = a verb of interest
 S(v) = the selectional preference strength of v
 c = a noun class
 C = the set of all the noun classes
 A(v,c) = the selectional association between v and class c
Selectional Preference Strength
 S(v) is the KL-Divergence between:
 the overall prior distribution of all noun classes
 the posterior distribution of noun classes in the direct object position
of v
 how much info we gain from knowing the probability that
members of a class occur as DO of v
 works as a global estimate of how much v constrains its arguments
semantically
 the more it constrains them, the more info we stand to gain from
knowing that an argument occurs as DO of v
S(grow): prior vs. posterior
(figure not reproduced here; source: Resnik 1996, p. 135)
Calculating S(v)
S(v) = D(P(C \mid v) \,\|\, P(C)) = \sum_{c \in C} P(c \mid v) \log \frac{P(c \mid v)}{P(c)}
This quantifies the extent to which our prior and posterior
probability estimates diverge.
 how much info do we gain about C by knowing it’s the
object of v?
Some more examples
class       P(c)    P(c|eat)   P(c|see)   P(c|find)
people      0.25    0.01       0.25       0.33
furniture   0.25    0.01       0.25       0.33
food        0.25    0.97       0.25       0.33
action      0.25    0.01       0.25       0.01
SPS: S(v)           1.76       0.00       0.35
How much info do we gain if we know what a noun is DO of?
 quite a lot if it’s an argument of eat
 not much if it’s an argument of find
 none if it’s an argument of see
Source: Manning and Schutze 1999, p. 290
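The figures in this table can be reproduced in a few lines; the sketch below assumes base-2 logarithms, as in Manning and Schütze's presentation.

```python
import math

prior = {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25}
posterior = {
    "eat":  {"people": 0.01, "furniture": 0.01, "food": 0.97, "action": 0.01},
    "see":  {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25},
    "find": {"people": 0.33, "furniture": 0.33, "food": 0.33, "action": 0.01},
}

def sps(post, prior):
    """Selectional preference strength S(v) = D(P(C|v) || P(C)), in bits."""
    return sum(p * math.log2(p / prior[c]) for c, p in post.items())

for verb, post in posterior.items():
    print(verb, round(sps(post, prior), 2))   # eat 1.76, see 0.0, find 0.35
```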
Selectional association
 This is estimated based on selectional preference strength
 tells us how much a verb is associated with a specific class, given
the extent to which it constrains its arguments
 given a class c, A(v,c) tells us how much of S(v) is contributed
by c
Calculating A(v,c)
A(v, c) = \frac{P(c \mid v) \log \frac{P(c \mid v)}{P(c)}}{S(v)}

 the numerator is the term contributed by class c to the summation for S(v)
 dividing by S(v) gives the proportion of S(v) which is caused by class c
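A small sketch of this step, reusing the toy prior/posterior figures from the table above; the function names are hypothetical. Note that a class whose posterior probability is lower than its prior contributes a negative term, so individual A(v, c) values need not lie between 0 and 1.

```python
import math

prior = {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25}
p_eat = {"people": 0.01, "furniture": 0.01, "food": 0.97, "action": 0.01}

def sps(post, prior):
    return sum(p * math.log2(p / prior[c]) for c, p in post.items())

def selectional_association(c, post, prior):
    """A(v, c): the share of S(v) contributed by class c alone."""
    contribution = post[c] * math.log2(post[c] / prior[c])
    return contribution / sps(post, prior)

# FOOD accounts for (slightly more than) all of S(eat), since the other
# classes contribute small negative terms.
print(round(selectional_association("food", p_eat, prior), 2))
```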
From A(v,c) to A(v,n)
 We know how to estimate the association strength of a class with v
 Problem:
 some nouns can occur in more than one class
 Let classes(n) be the set of classes to which noun n belongs:

A(v, n) = \max_{c \in \mathrm{classes}(n)} A(v, c)
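A sketch of this disambiguating maximum, using invented association values for the chair example on the next slide (the numbers are hypothetical, not taken from Resnik):

```python
def noun_association(class_assoc, noun_classes):
    """A(v, n) = max over c in classes(n) of A(v, c).
    Also returns the class supplying the maximum, which acts as a crude
    word-sense disambiguator."""
    best = max(noun_classes, key=lambda c: class_assoc[c])
    return class_assoc[best], best

# Hypothetical A(interrupt, c) values, invented for illustration:
a_interrupt = {"PEOPLE": 2.1, "FURNITURE": 0.2}
print(noun_association(a_interrupt, {"PEOPLE", "FURNITURE"}))
# -> (2.1, 'PEOPLE'): 'chair' is read as a person rather than furniture
```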
Example
 Susan interrupted the chair.
 chair is in class FURNITURE
 chair is in class PEOPLE
 A(interrupt,PEOPLE) > A(interrupt,FURNITURE)
 A(interrupt,chair) = A(interrupt,PEOPLE)
 Note that this is a kind of word-sense disambiguation!
Some results from Resnik 1996
Verb (v)   Noun (n)   Class (c)       A(v,n)
answer     request    speech act      4.49
answer     tragedy    communication   3.88
hear       story      communication   1.89
hear       issue      communication   1.89
There are some fairly atypical examples:
 these are due to the disambiguation method
 e.g. tragedy can be in the COMMUNICATION class, and so is assigned
A(answer, COMMUNICATION) as its A(v,n)
Overall evaluation
 Resnik’s results were shown to correlate very well with
results from a psycholinguistic study
 The method is promising:
 seems to mirror human intuitions
 may have some psychological validity
 Possibly an alternative, data-driven account of the semantic
bootstrapping hypothesis of Pinker 1989?