Concrete Compositional Sentence Spaces
S. Clark (Cambridge), B. Coecke, E. Grefenstette, S. Pulman, M. Sadrzadeh (Oxford)
(This abstract is a sequel to “Diagrammatic Reasoning about Meaning of Sentences”, to be presented in the same workshop. The list of authors is alphabetical; emails: firstname.lastname@{comlab.ox, cl.cam}.ac.uk.)
Background
In [1], a mathematical framework for a compositional distributional model of meaning is developed,
based on the intuition that syntactic analysis guides the semantic composition. The setting consists of
two parts: a formalism for a type-logical syntax and another for vector space semantics. To each word one
assigns a grammatical type and a meaning vector in the space corresponding to its type. The meaning of a
sentence is obtained by applying the function corresponding to the grammatical structure of the sentence
to the tensor product of the meanings of the words therein. Based on the type-logic used, some words
will have atomic types and some compound function types. The compound types live in a tensor space
where the vectors are weighted sums (i.e. superpositions) of the bases of each space. Compound types
are “applied” to their arguments by taking inner products, in a similar manner to Montague semantics.
For the type-logic we use Lambek’s Pregroup grammars. The use of pregroups is not essential, but it
leads to a more elegant formalism, given its proximity to the categorical structure of vector spaces (see
[1]). A Pregroup is a partially ordered monoid where each element has a right and a left cancelling element,
referred to as its adjoint. It can be seen as the algebraic counterpart of the cancellation calculus of Z. Harris.
The operational difference between a Pregroup and Lambek’s Syntactic Calculus is that in the latter it is
the monoid multiplication of the algebra (used to model juxtaposition of the types of the words), rather
than the elements themselves, that has a right and a left adjoint. The adjoint types are used to denote
functions, e.g. that of a transitive verb inputting a subject and an object and outputting a sentence. In the
Pregroup setting, these function types are still denoted by adjoints, but this time the adjoints of the elements.
As an example, consider the sentence “dogs chase cats”. We assign the type n to “dog” and “cat”, and
n^r s n^l to “chase”, where n^r and n^l are the right and left adjoints of n, and s is the type of a grammatical
(declarative) sentence. The type n^r s n^l expresses the fact that the verb is a function that inputs two
arguments of type n, on its right and on its left, and outputs the type s of a sentence. The parsing of the
sentence is the reduction n (n^r s n^l) n ≤ 1 s 1 = s. This is based on the fact that n and n^r, and similarly
n^l and n, cancel out, i.e. n n^r ≤ 1 and n^l n ≤ 1, for 1 the unit of juxtaposition. The reduction expresses
that the juxtaposition of the types of the words reduces to the type of a sentence.
On the semantic side, we assign the vector space N to the type n, and the tensor space N ⊗ S ⊗ N to
the type n^r s n^l. Very briefly, for those unfamiliar with the concept of a tensor space, note that the tensor
space A ⊗ B has as basis the cartesian product of the basis of A with the basis of B. Recall also that any
vector can be expressed as a weighted sum of its basis vectors; e.g. if (b_1, ..., b_n) is the basis of A, then
any vector a ∈ A can be written as a = Σ_i C_i b_i, where each C_i ∈ ℝ is a weighting factor. Therefore,
if a = Σ_i C_i b_i and b = Σ_j C'_j b'_j, then there is a ⊗ b ∈ A ⊗ B such that a ⊗ b = Σ_ij C_ij b_i ⊗ b'_j,
where the weighting factor C_ij = C_i C'_j. In terms of linear algebra, this simply corresponds to the product
of the column matrix of a, which we write as |a⟩, with the row matrix of b, which we write as ⟨b|: the
tensor product a ⊗ b is thus simply the outer product |a⟩⟨b|. The dual operation, the inner product
⟨a| · |b⟩ (abbreviated as ⟨a|b⟩), is the more familiar dot product operation for vectors, a · b.
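As a quick linear-algebraic illustration of the above (our sketch, not part of the original abstract; the vectors are arbitrary toy values), the tensor product is just an outer product and the inner product the usual dot product:

import numpy as np

# Toy vectors a in A and b in B, given by their weighting factors C_i and C'_j.
a = np.array([1.0, 2.0, 0.0])
b = np.array([0.5, 3.0])

# Tensor product a (x) b: the outer product |a><b|, whose (i, j) entry is C_i * C'_j.
tensor_ab = np.outer(a, b)
print(tensor_ab.shape)        # (3, 2): basis of A x basis of B

# Inner product <a|a'>: the familiar dot product (defined within a single space).
a_prime = np.array([2.0, 1.0, 4.0])
print(np.dot(a, a_prime))     # 4.0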
Back to the example: for the meaning of the nouns we have dogs, cats ∈ N, and for the meaning of the
verb we have chase ∈ N ⊗ S ⊗ N, i.e. the superposition Σ_ijk C_ijk (n_i ⊗ s_j ⊗ n_k), for n_i and n_k
basis vectors of N and s_j basis vectors of S. From the categorical translation method presented in [1]
and the grammatical reduction n (n^r s n^l) n ≤ s we obtain the categorification of the parse: the linear map
ε_N ⊗ 1_S ⊗ ε_N : N ⊗ (N ⊗ S ⊗ N) ⊗ N → S. Using this map, the meaning of the sentence is computed
as follows:
    dogs chase cats = (ε_N ⊗ 1_S ⊗ ε_N)(dogs ⊗ chase ⊗ cats)
                    = (ε_N ⊗ 1_S ⊗ ε_N)(dogs ⊗ (Σ_ijk C_ijk n_i ⊗ s_j ⊗ n_k) ⊗ cats)
                    = Σ_ijk C_ijk ⟨dogs | n_i⟩ s_j ⟨n_k | cats⟩
The key features of this operation are: first, that the inner products reduce dimensionality by ‘consuming’
tensored vectors (by virtue of the component map ε_N : N ⊗ N → ℝ :: a ⊗ b ↦ ⟨a|b⟩), thus
mapping the tensored word vectors dogs ⊗ chase ⊗ cats into a sentence space S which is common to all
sentence representations regardless of grammatical structure or complexity. Second, that computationally
speaking, the tensor product dogs ⊗ chase ⊗ cats never needs to be calculated, as all that is required for
the simplified form shown on the last line is the noun vectors and the C_ijk weights for the verb. Hence
the computation of the sentence representation is simply a matter of building a vector using inner products,
a computationally simple operation. This formalism therefore avoids the two major problems faced by
approaches in the vein of [7, 2], which use the tensor product as the composition operation: there, the
dimensionality of a representation grows with the grammatical structure of the sentence, and grammatically
different sentences consequently live in different spaces and cannot be compared directly using inner products.
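To make the computational point concrete, here is a minimal sketch (in Python with numpy; ours, not part of the original abstract) of the final expression above: the sentence vector is built directly from the noun vectors and the verb's C_ijk weights, without ever materialising the tensor product of the word vectors. All dimensions and values are illustrative.

import numpy as np

dim_n = 4                                  # toy dimension of the noun space N
dogs = np.random.rand(dim_n)               # components <dogs | n_i>
cats = np.random.rand(dim_n)               # components <n_k | cats>
C = np.random.rand(dim_n, dim_n, dim_n)    # verb weights C_ijk over N, S, N

# Sentence meaning in S: component j is sum_{i,k} C_ijk <dogs|n_i> <n_k|cats>.
# einsum contracts i and k directly, never forming dogs (x) chase (x) cats.
sentence = np.einsum('i,ijk,k->j', dogs, C, cats)
print(sentence.shape)                      # (dim_n,): a single vector in S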
From Truth-Theoretic to Corpus-based Meaning
The model presented above is compositional and distributional, but still abstract. To make it concrete,
one has to construct N and S and provide a method for determining the C_ijk weightings. To obtain a
truth-theoretic meaning, we assume that N is spanned by all animals and S is the two-dimensional space
spanned by true and false. We use the degree of freedom provided to us by the superposition type of the
verb to define a model-theoretic truth, “à la Montague”, as follows:

    C_ijk s_j = { true    if chase(n_i, n_k) = true,
                { false   otherwise.
The definition of our meaning map ensures that this value propagates to the meaning of the whole
sentence. So “dogs chase cats” becomes true whenever chase(dogs, cats) is true, and false otherwise.
This is exactly how meaning is computed in the model-theoretic view on semantics. One way to generalize
this truth-theoretic meaning is to assume that chase(n_i, n_k) has degrees of truth, for instance by defining
chase as a combination of run and catch, e.g. chase = 2/3 run + 1/2 catch. Again, the meaning map ensures
that these degrees propagate to the meaning of the whole sentence. For a worked out example see [1]. But
neither of these examples provides a distributional meaning.
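For illustration only (this is our sketch of the truth-theoretic instantiation, with a made-up set of individuals and a made-up chasing relation), the verb weights can be read off a model in exactly this Montague-style fashion:

import numpy as np

individuals = ["dog1", "dog2", "cat1", "cat2"]            # hypothetical basis of N
TRUE, FALSE = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # basis of S

chases = {("dog1", "cat1"), ("dog2", "cat2")}             # made-up model of 'chase'

# C[i, :, k] is the true vector if chase(n_i, n_k) holds, and false otherwise.
C = np.zeros((len(individuals), 2, len(individuals)))
for i, x in enumerate(individuals):
    for k, y in enumerate(individuals):
        C[i, :, k] = TRUE if (x, y) in chases else FALSE

# Meaning of "dog1 chases cat1": contract subject and object against C.
dog1 = np.eye(len(individuals))[0]
cat1 = np.eye(len(individuals))[2]
print(np.einsum('i,ijk,k->j', dog1, C, cat1))             # [1. 0.], i.e. true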
Here we take a first step towards a corpus-based model, by attempting to retrieve a meaning for
the sentence from a corpus based on the meanings of the words within the sentence. But this meaning
goes beyond just composing the meanings of words using a vector combinator such as the tensor product,
summation, or multiplication. Our computation of meaning primarily uses the syntactic structure of
the sentence: it treats some words as functions and others as their arguments, applies the functions to the
arguments, and only then, following this prescription, builds a vector for the meaning of the sentence in
terms of vector combinations of the words. The à la Firth intuition behind this approach is that you shall
know the meaning of a sentence by its grammatical structure and the meanings of the words in it. One can
say that this approach, like all distributional models, generalizes the truth-theoretic notion of meaning to
one of ‘common knowledge’ and ‘usage’ rather than of logical truth. We merely suggest this as an
illustration of how special concrete instantiations of this formalism might also be used to model
truth-theoretic semantics, but by no means are we limited to such an approach.
The contribution of this abstract is to introduce concrete constructions for a corpus-based model of
compositional meaning. This construction demonstrates how the mathematical model of [1] can be
implemented in a concrete setting, introducing a richer, not-necessarily-truth-theoretic notion of natural
language semantics which is closer to the notion of meaning underlying classical distributional modelling
of word meaning. We leave evaluation to future work, i.e. performing experiments to determine whether
the following method provides better results for particular language processing tasks, e.g. computing
sentence similarity, evaluating paraphrases, text classification, etc.
A simplistic construction would be to build N in a standard way, i.e. take the basis vectors to be words
from a dictionary and build vectors for words by counting co-occurrence. Then take S to be N ⊗ N, so its
basis vectors are of the form s_j = (n_i, n_k). The corresponding weights C_ijk can be set to the number of
times the two bases have co-occurred with the verb (or some weighted/normalized version of the count).
This model suffers from a standard problem: the meanings of “cats chase dogs” and “dogs chase cats”
will come out similar. To overcome that, one can strengthen the co-occurrence requirement and count how
many times n_i has been the subject of the verb and n_k has been its object. This model solves the first
problem, but suffers from another, namely ignoring the meanings of the subject and object words. As a
result “bankers sell stock” and “dogs chase cats” will have similar meanings, since “dog” and “cat” have
co-occurred with “chase” as much as “banker” and “stock” have with “sell”.
To overcome this problem, we count how many times words that have similar meanings to the subject
and object have co-occurred with the verb in the same grammatical structure. We implement this idea by
taking N to be a structured vector space, as in [4, 5]. The basis vectors of N are now annotated by ‘properties’
obtained by combining dependency relations with nouns and verbs. For example, basis vectors might
be associated with properties such as “arg-fluffy”, denoting the argument of the adjective fluffy; “subj-chase”,
denoting the subject of the verb chase; and “obj-chase”, denoting the object of the verb chase. We construct
the vector for a noun by counting how many times in the corpus a word has been the argument of ‘fluffy’,
the subject of ‘chase’, the object of ‘chase’, and so on. S is still N ⊗ N, and again the weights C_ijk for
s_j = (n_i, n_k) are set by counting how many times something that is n_i (e.g. is the argument of ‘fluffy’)
has been the subject of the verb and something that is n_k (e.g. is the object of ‘buys’) has been its object.
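A minimal sketch of how such structured noun vectors might be built (our illustration, not the authors' code; it assumes the corpus has already been dependency-parsed into (relation, head, dependent) triples):

from collections import Counter, defaultdict

# Hypothetical parsed corpus, as (relation, head, dependent) triples.
parsed_corpus = [
    ("arg", "fluffy", "cat"),
    ("subj", "chase", "dog"),
    ("obj", "chase", "cat"),
    ("obj", "buy", "stock"),
]

# A noun vector counts how often the noun fills each property slot, where a
# property pairs a dependency relation with a head word (e.g. "arg-fluffy").
noun_vectors = defaultdict(Counter)
for relation, head, dependent in parsed_corpus:
    noun_vectors[dependent][f"{relation}-{head}"] += 1

print(noun_vectors["cat"])   # Counter({'arg-fluffy': 1, 'obj-chase': 1})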
Adjective phrases are dealt with in a similar way. We give them the syntactic type n n^l and build
their vectors in N ⊗ N. The syntactic reduction n n^l n ≤ n associated with applying an adjective to a
noun gives us the map 1_N ⊗ ε_N, by which we semantically compose an adjective with a noun, as follows:

    red fox = (1_N ⊗ ε_N)(red ⊗ fox) = Σ_ij C_ij n_i ⟨n_j | fox⟩
The C_ij counts for an adjective a are obtained in a similar manner to those for transitive or intransitive verbs:

    C_ij = { Σ_{a_c ∈ C} m_c ⟨arg-of(a_c) | n_i⟩   if n_i = n_j,
           { 0                                     otherwise,

where m_c = 1 if a_c = a and 0 otherwise, and arg-of(a_c) = noun_c if a_c is an adjective with argument
noun_c, and ε_n otherwise. We can view these counts as determining what sort of properties the arguments
of an adjective typically have (e.g. arg-red, arg-colourful for the adjective “red”).
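To illustrate (again our sketch, with invented property names and counts), composing an adjective with a noun in this construction then amounts to a matrix-vector product, since C_ij is non-zero only on the diagonal:

import numpy as np

properties = ["arg-red", "arg-colourful", "arg-fluffy"]   # toy basis of N

# Diagonal C_ij for the adjective "red": how often its arguments carry each
# property in the (hypothetical) corpus.
C_red = np.diag([9.0, 4.0, 1.0])

fox = np.array([2.0, 3.0, 0.0])   # toy structured vector for "fox"

# red fox = (1_N (x) eps_N)(red (x) fox) = sum_ij C_ij n_i <n_j | fox>
red_fox = C_red @ fox
print(red_fox)                    # [18. 12.  0.]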
Example
We briefly elaborate on an example. Let C be the multiset of tokens in our corpus, let v_c ∈ C be such a
token, and let the function subj-of (and similarly obj-of) be defined as follows:

    subj-of(v_c) = { noun_c ∈ C   if v_c is a verb with subject noun_c,
                   { ε_n          otherwise,

where ε_n is the empty string. We express C_ijk for a verb v as follows:
    C_ijk = { Σ_{v_c ∈ C} m_c ⟨subj-of(v_c) | n_i⟩ ⟨obj-of(v_c) | n_k⟩   if s_j = (n_i, n_k),
            { 0                                                         otherwise,
where m_c = 1 if v_c = v and 0 otherwise. Thus we construct C_ijk for the verb v only for the cases where
the subject property n_i and the object property n_k are paired in the basis vector s_j. This is done by
counting the number of times the subject of v has property n_i and the number of times the object of v has
property n_k, and then multiplying them, as prescribed by the inner products.
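Sketched in the same style (ours; the verb tokens and their subject/object property counts are hypothetical), the C_ik weights for a verb are accumulated token by token as outer products of subject and object vectors:

import numpy as np

properties = ["arg-fluffy", "arg-ferocious", "obj-buys"]  # toy basis of N
index = {p: i for i, p in enumerate(properties)}

def as_vector(counts):
    """Turn a property -> count mapping into a vector over the toy basis."""
    v = np.zeros(len(properties))
    for prop, count in counts.items():
        v[index[prop]] = count
    return v

# Hypothetical verb tokens: (verb, subject property counts, object property counts).
tokens = [
    ("chase", {"arg-ferocious": 2}, {"arg-fluffy": 3}),
    ("chase", {"arg-ferocious": 1}, {"arg-fluffy": 1, "obj-buys": 1}),
    ("sell",  {"arg-ferocious": 1}, {"obj-buys": 2}),
]

# C_ik for "chase": sum over its tokens of <subj|n_i><obj|n_k>, i.e. outer products;
# m_c = 1 picks out exactly the tokens of the verb at hand.
C_chase = np.zeros((len(properties), len(properties)))
for verb, subj, obj in tokens:
    if verb == "chase":
        C_chase += np.outer(as_vector(subj), as_vector(obj))

print(C_chase)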
Here is a worked out example. We first manually define the distributions for nouns, which would
normally be learned from a corpus:
              (1) arg-fluffy   (2) arg-ferocious   (3) obj-buys   (4) arg-shrewd   (5) arg-valuable
    bankers         0                 4                  0               6                 0
    cats            7                 1                  4               3                 1
    dogs            3                 6                  2               1                 2
    stock           0                 0                  7               0                 8
    kittens         2                 0                  0               1                 0

We aim to make these match our intuitions, in that bankers are shrewd and a little ferocious but not furry,
cats are furry but not often valuable, and so on.
Likewise we define the distributions for the transitive verbs ‘chase’, ‘pursue’ and ‘sell’. These are
also manually specified, according to intuitions about how these verbs are used. Since in the formalism
proposed above C_ijk = 0 if s_j ≠ (n_i, n_k), we can simplify the weight matrices for transitive verbs to
two-dimensional C_ik matrices, as shown below, where C_ik corresponds to the number of times the verb
has a subject with attribute n_i and an object with attribute n_k. For example, we read that something
ferocious (i = 2) chases something fluffy (k = 1) seven times in the hypothetical corpus from which we
might have obtained these distributions.

    C^sell =  [ 0 0 0 0 0 ]    C^pursue = [ 0 0 0 0 0 ]    C^chase = [ 1 0 0 0 0 ]
              [ 0 0 3 0 4 ]               [ 4 2 2 2 4 ]              [ 7 1 2 3 1 ]
              [ 0 0 0 0 0 ]               [ 0 0 0 0 0 ]              [ 0 0 0 0 0 ]
              [ 0 0 5 0 8 ]               [ 3 0 2 0 1 ]              [ 2 0 1 0 1 ]
              [ 0 0 1 0 1 ]               [ 0 0 0 0 0 ]              [ 1 0 0 0 0 ]
Using these, we can perform sentence comparisons by simple calculation (best done with a script):
    ⟨dogs chase cats | dogs pursue kittens⟩
      = ⟨ Σ_ijk C^chase_ijk ⟨dogs | n_i⟩ s_j ⟨n_k | cats⟩  |  Σ_ijk C^pursue_ijk ⟨dogs | n_i⟩ s_j ⟨n_k | kittens⟩ ⟩
      = Σ_ijk C^chase_ijk C^pursue_ijk ⟨dogs | n_i⟩ ⟨dogs | n_i⟩ ⟨n_k | cats⟩ ⟨n_k | kittens⟩
The raw number obtained from the above calculation is 14844. Normalising it by the product of the
lengths of the two sentence vectors gives us the cosine measure value of 0.979. Let us now contrast this
with ⟨dogs chase cats | cats chase dogs⟩, which is, lexically speaking, a fairly similar pair of sentences
that nonetheless diverge in meaning. The raw number calculated from this inner product is 7341; its
normalised cosine measure is 0.656, which demonstrates the sharp drop in similarity obtained from
changing sentence structure. We expect some similarity since there is some non-trivial overlap between
the properties identifying cats and those identifying dogs (namely those salient to the act of chasing).
The final example for transitive sentences is ⟨dogs chase cats | bankers sell stock⟩, as two sentences that
diverge in meaning completely. The raw number is 6024; its cosine measure is 0.042, demonstrating very
low semantic similarity between these two sentences. Indeed, while dogs often chase cats, and bankers
typically sell stock, the properties related by these verbs and present in these individuals have very low
overlap, hence the formalism can establish the lack of semantic links between the two simple transitive
sentences. To summarize, our example vectors provide us with the following similarity measures:
    Sentence 1         Sentence 2            Degree of similarity
    dogs chase cats    dogs pursue kittens   0.979
    dogs chase cats    cats chase dogs       0.656
    dogs chase cats    bankers sell stock    0.042
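For reference, the following short script (ours, not part of the original abstract) reproduces these figures from the noun vectors and verb matrices given above; the cosine for the last pair comes out at roughly 0.04, i.e. the reported value up to rounding.

import numpy as np

# Noun vectors over (arg-fluffy, arg-ferocious, obj-buys, arg-shrewd, arg-valuable),
# copied from the table above.
nouns = {
    "bankers": np.array([0, 4, 0, 6, 0]),
    "cats":    np.array([7, 1, 4, 3, 1]),
    "dogs":    np.array([3, 6, 2, 1, 2]),
    "stock":   np.array([0, 0, 7, 0, 8]),
    "kittens": np.array([2, 0, 0, 1, 0]),
}

# Verb weight matrices C_ik (row: subject property i, column: object property k).
verbs = {
    "chase":  np.array([[1,0,0,0,0],[7,1,2,3,1],[0,0,0,0,0],[2,0,1,0,1],[1,0,0,0,0]]),
    "pursue": np.array([[0,0,0,0,0],[4,2,2,2,4],[0,0,0,0,0],[3,0,2,0,1],[0,0,0,0,0]]),
    "sell":   np.array([[0,0,0,0,0],[0,0,3,0,4],[0,0,0,0,0],[0,0,5,0,8],[0,0,1,0,1]]),
}

def sentence_vector(subj, verb, obj):
    """Meaning of 'subj verb obj' in S = N (x) N: entries C_ik <subj|n_i> <n_k|obj>."""
    return verbs[verb] * np.outer(nouns[subj], nouns[obj])

def compare(s1, s2):
    v1, v2 = sentence_vector(*s1), sentence_vector(*s2)
    raw = np.sum(v1 * v2)
    return raw, raw / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(compare(("dogs", "chase", "cats"), ("dogs", "pursue", "kittens")))  # (14844, ~0.979)
print(compare(("dogs", "chase", "cats"), ("cats", "chase", "dogs")))      # (7341, ~0.656)
print(compare(("dogs", "chase", "cats"), ("bankers", "sell", "stock")))   # (6024, ~0.04)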
Comparing meanings of grammatically different sentences
For reasons of space, we have only presented the treatment of sentences with transitive verbs here. For
sentences with intransitive verbs, it suffices for the sentence space to be just N. To compare the meaning of a
transitive sentence with an intransitive one, we embed the meaning of the latter from N into the former,
N ⊗ N, by taking ε_n to be Σ_i n_i, i.e. the superposition of all basis vectors of N. Now since ⟨ε_n | n_i⟩ = 1
for any basis vector n_i, we obtain that for s_j = (n_i, n_k) we have set C_ijk for any intransitive verb to
Σ_{v_c ∈ C} m_c ⟨subj-of(v_c) | n_i⟩, computed as follows:

    Σ_{v_c ∈ C} m_c ⟨subj-of(v_c) | n_i⟩ ⟨obj-of(v_c) | n_k⟩ = Σ_{v_c ∈ C} m_c ⟨subj-of(v_c) | n_i⟩ ⟨ε_n | n_k⟩
We can now compare the meanings of these two types of sentences, simply by taking the inner product of
their meanings (despite their different ‘arities’) and then normalising it by the vector lengths to obtain the
cosine measure. For example:

    ⟨dogs chase cats | dogs chase⟩
      = ⟨ Σ_ijk C_ijk ⟨dogs | n_i⟩ s_j ⟨n_k | cats⟩  |  Σ_ijk C'_ijk ⟨dogs | n_i⟩ s_j ⟩
      = Σ_ijk C_ijk C'_ijk ⟨dogs | n_i⟩ ⟨dogs | n_i⟩ ⟨n_k | cats⟩
The raw number is 14092 and its normalised cosine measure is 0.961, indicating high similarity (but
some difference) between a sentence with a transitive verb and one where the subject remains the same,
but the verb is used intransitively.
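On our reading of this example (an assumption on our part: the intransitive use of ‘chase’ keeps the same C_ik counts, with the missing object slot filled by the superposition ε_n, so that ⟨n_k | ε_n⟩ = 1 for every k), the reported figures can be reproduced as follows:

import numpy as np

dogs = np.array([3, 6, 2, 1, 2])
cats = np.array([7, 1, 4, 3, 1])
chase = np.array([[1,0,0,0,0],[7,1,2,3,1],[0,0,0,0,0],[2,0,1,0,1],[1,0,0,0,0]])

# "dogs chase cats" lives in N (x) N; for the intransitive "dogs chase" the object
# slot is filled with eps_n, the all-ones superposition of basis vectors.
v_transitive   = chase * np.outer(dogs, cats)
v_intransitive = chase * np.outer(dogs, np.ones(5))

raw = np.sum(v_transitive * v_intransitive)
cos = raw / (np.linalg.norm(v_transitive) * np.linalg.norm(v_intransitive))
print(raw, round(cos, 3))   # 14092 0.961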
The simplest possible approach to generalizing the above would be based on the pregroup typing of
the grammatical structure. According to this, one builds the sentence space of sentences with indirect
objects by adding more tensor products of N. For instance, to compute the meaning of the sentence
“Dogs chase cats in the garden.” we need a triple of basis vectors, hence S has to be N ⊗ N ⊗ N; similarly,
for “Dogs chase cats in the garden at night.” we need a quadruple of basis vectors, hence S = N ⊗ N ⊗ N ⊗ N.
If there is an adverb present, one needs to add a tensor factor for that too; more than one adverb is roughly
treated like the adjective case and does not need more tensor factors. To be able to compare meanings of all
these different types of sentences, we might want to take S to be an infinite space and embed the smaller
spaces into it via N ↪ N ⊗ N, or generalizations thereof. The good news is that since this possibly large
space is only used to form inner products to compare meanings of sentences, we do not actually need to
build the whole space, hence avoiding complexity blow-ups. We aim to investigate the use of other
type-logics, such as CCG, to see whether they would provide us with a different generalization.
Related Work
In [6] a multiplicative model for vector composition is introduced and evaluated. The abstract model
of [1], on which this work is based, is a general framework which can model any kind of vector combination. The particular concrete construction of this paper differs from that of [6] in three ways: (1) we can
compute truth-theoretic as well as corpus-based meaning, (2) our meaning construction relies on and is
guided by the grammatical structure of the sentence, (3) we work with structured vector spaces, which
go beyond only looking for co-occurrence of words. The approach of [4] is more in the spirit of ours,
in that extra information about syntax is used to compose meaning. Similar to us, they use a structured
vector space to “integrate lexical information with selectional preferences”. For each word, the extra
information is modelled as a set of vectors, each representing the expectations that the word supports,
for example the subject and object of a verb. Indeed, a more detailed comparison/unification shall constitute
future work, but it seems that our approach is still more general. Apart from the truth-theoretic meaning,
in our corpus-based model we do not have to pre-specify the expectations and limit each word to a fixed
set of them. We consider all possible such expectations as bases of our vector space and let the corpus
decide which words have a higher weight for which expectations.
We aim to use similar evaluation methods to the above papers to investigate applications of our
setting. In addition, we propose more extensive evaluation of this formalism using a corpus of paraphrase
pairs, and using inter-annotator agreement as the gold standard, as discussed in [3].
References
[1] B. Coecke, M. Sadrzadeh, and S. Clark. Mathematical Foundations for a Compositional Distributional Model of
Meaning. Linguistic Analysis (Lambek Festschrift), volume 36, 2010. http://arxiv.org/abs/1003.4394.
[2] S. Clark and S. Pulman. Combining symbolic and distributional models of meaning. In Proceedings of AAAI
Spring Symposium on Quantum Interaction. AAAI Press, 2007.
[3] T. Cohn, C. Callison-Burch, and M. Lapata. Constructing corpora for development and evaluation of paraphrase systems. Computational Linguistics, 34(4):597–614, 2008.
[4] K. Erk and S. Padó. A structured vector space model for word meaning in context. In EMNLP’08, Conference
on Empirical Methods in Natural Language Processing, pages 897–906. ACL, 2008.
[5] G. Grefenstette. Use of syntactic context to produce term association lists for text retrieval. In Nicholas J.
Belkin, Peter Ingwersen, and Annelise Mark Pejtersen, editors, SIGIR, pages 89–97. ACM, 1992.
[6] J. Mitchell and M. Lapata. Vector-based models of semantic composition. In Proceedings of ACL, pages 236–244, 2008.
[7] P. Smolensky and G. Legendre. The Harmonic Mind: From Neural Computation to Optimality-Theoretic
Grammar Vol. I: Cognitive Architecture Vol. II: Linguistic and Philosophical Implications. MIT Press, 2005.