Concrete Compositional Sentence Spaces*

S. Clark (Cambridge), B. Coecke, E. Grefenstette, S. Pulman, M. Sadrzadeh (Oxford)**

* This abstract is a sequel to "Diagrammatic Reasoning about Meaning of Sentences", to be presented in the same workshop.
** The list of authors is alphabetical; emails: firstname.lastname@{comlab.ox, cl.cam}.ac.uk

Background

In [1], a mathematical framework for a compositional distributional model of meaning is developed, based on the intuition that syntactic analysis guides semantic composition. The setting consists of two parts: a formalism for a type-logical syntax and another for vector space semantics. To each word one assigns a grammatical type and a meaning vector in the space corresponding to its type. The meaning of a sentence is obtained by applying the function corresponding to the grammatical structure of the sentence to the tensor product of the meanings of the words therein. Depending on the type-logic used, some words have atomic types and some have compound function types. The compound types live in a tensor space whose vectors are weighted sums (i.e. superpositions) of the bases of the component spaces. Compound types are "applied" to their arguments by taking inner products, in a manner similar to Montague semantics.

For the type-logic we use Lambek's pregroup grammars. The use of pregroups is not essential, but it leads to a more elegant formalism, given its proximity to the categorical structure of vector spaces (see [1]). A pregroup is a partially ordered monoid in which each element has a right and a left cancelling element, referred to as its adjoints. It can be seen as the algebraic counterpart of the cancellation calculus of Z. Harris. The operational difference between a pregroup and Lambek's Syntactic Calculus is that in the latter it is the monoid multiplication of the algebra (used to model juxtaposition of the types of the words), rather than the elements themselves, that has a right and a left adjoint. The adjoint types are used to denote functions, e.g. that of a transitive verb, which inputs a subject and an object and outputs a sentence. In the pregroup setting, these function types are still denoted by adjoints, but this time the adjoints of the elements.

As an example, consider the sentence "dogs chase cats". We assign the type $n$ to "dogs" and "cats", and $n^r s n^l$ to "chase", where $n^r$ and $n^l$ are the right and left adjoints of $n$ and $s$ is the type of a grammatical (declarative) sentence. The type $n^r s n^l$ expresses the fact that the verb is a function that inputs two arguments of type $n$, one on its right and one on its left, and outputs the type $s$ of a sentence. The parsing of the sentence is the reduction $n (n^r s n^l) n \leq 1 s 1 = s$. This is based on the fact that $n$ and $n^r$, and similarly $n^l$ and $n$, cancel out, i.e. $n n^r \leq 1$ and $n^l n \leq 1$, for $1$ the unit of juxtaposition. The reduction expresses that the juxtaposition of the types of the words reduces to the type of a sentence.

On the semantic side, we assign the vector space $N$ to the type $n$, and the tensor space $N \otimes S \otimes N$ to the type $n^r s n^l$. Very briefly, for those unfamiliar with the concept of a tensor space, note that the tensor space $A \otimes B$ has as basis the cartesian product of the basis of $A$ with the basis of $B$. Recall also that any vector can be expressed as a weighted sum of basis vectors; e.g. if $(\vec{b}_1, \ldots, \vec{b}_n)$ is the basis of $A$, then any vector $\vec{a} \in A$ can be written as $\vec{a} = \sum_i C_i \vec{b}_i$, where each $C_i \in \mathbb{R}$ is a weighting factor. Therefore, if $\vec{a} = \sum_i C_i \vec{b}_i$ and $\vec{b} = \sum_j C'_j \vec{b}'_j$, then there is $\vec{a} \otimes \vec{b} \in A \otimes B$ such that $\vec{a} \otimes \vec{b} = \sum_{ij} C_{ij}\, \vec{b}_i \otimes \vec{b}'_j$, where the weighting factor is $C_{ij} = C_i C'_j$. In terms of linear algebra, this simply corresponds to the product of the column matrix of $\vec{a}$, which we write as $|\vec{a}\,\rangle$, with the row matrix of $\vec{b}$, which we write as $\langle \vec{b}\,|$. The tensor product $\vec{a} \otimes \vec{b}$ is thus simply the outer product $|\vec{a}\,\rangle \cdot \langle \vec{b}\,|$. The dual operation, the inner product $\langle \vec{a}\,| \cdot |\vec{b}\,\rangle$ (abbreviated as $\langle \vec{a} \,|\, \vec{b} \rangle$), is the more familiar dot product operation for vectors, $\vec{a} \cdot \vec{b}$.
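To make the two operations concrete, here is a minimal numpy sketch; the vectors, their dimensions, and the coefficient values are arbitrary choices for illustration only.

```python
import numpy as np

# Coefficients C_i and C'_j of two vectors over fixed bases of A and B.
# Values and dimensions are arbitrary, for illustration only.
a = np.array([1.0, 2.0, 3.0])   # a vector in A (3 basis vectors)
b = np.array([0.5, 4.0])        # a vector in B (2 basis vectors)

# Tensor product a (x) b in A (x) B: its coefficient on the basis pair
# (b_i, b'_j) is C_ij = C_i * C'_j, i.e. the outer product |a><b|.
ab = np.outer(a, b)             # shape (3, 2)

# Inner product <a|a2> of two vectors of the *same* space A:
# the familiar dot product.
a2 = np.array([2.0, 0.0, 1.0])
print(ab)
print(np.dot(a, a2))
```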
Back to the example: for the meanings of the nouns we have $\overrightarrow{dogs}, \overrightarrow{cats} \in N$, and for the meaning of the verb we have $\overrightarrow{chase} \in N \otimes S \otimes N$, i.e. the superposition $\sum_{ijk} C_{ijk}\,(\vec{n}_i \otimes \vec{s}_j \otimes \vec{n}_k)$, for $\vec{n}_i$ and $\vec{n}_k$ basis vectors of $N$ and $\vec{s}_j$ basis vectors of $S$. From the categorical translation method presented in [1] and the grammatical reduction $n (n^r s n^l) n \leq s$ we obtain the categorification of the parse: the linear map $\epsilon_N \otimes 1_S \otimes \epsilon_N : N \otimes (N \otimes S \otimes N) \otimes N \to S$. Using this map, the meaning of the sentence is computed as follows:

  $\overrightarrow{dogs\ chase\ cats} \;=\; (\epsilon_N \otimes 1_S \otimes \epsilon_N)\big(\overrightarrow{dogs} \otimes \overrightarrow{chase} \otimes \overrightarrow{cats}\big)$
  $\;=\; (\epsilon_N \otimes 1_S \otimes \epsilon_N)\Big(\overrightarrow{dogs} \otimes \sum_{ijk} C_{ijk}\,(\vec{n}_i \otimes \vec{s}_j \otimes \vec{n}_k) \otimes \overrightarrow{cats}\Big)$
  $\;=\; \sum_{ijk} C_{ijk}\, \langle \overrightarrow{dogs} \,|\, \vec{n}_i \rangle \; \vec{s}_j \; \langle \vec{n}_k \,|\, \overrightarrow{cats} \rangle$

The key features of this operation are as follows. First, the inner products reduce dimensionality by 'consuming' tensored vectors (by virtue of the component function $\epsilon_N : N \otimes N \to \mathbb{R} :: \vec{a} \otimes \vec{b} \mapsto \langle \vec{a} \,|\, \vec{b} \rangle$), thus mapping the tensored word vectors $\overrightarrow{dogs} \otimes \overrightarrow{chase} \otimes \overrightarrow{cats}$ into a sentence space $S$ which is common to all sentence representations, regardless of grammatical structure or complexity. Second, computationally speaking, the tensor product $\overrightarrow{dogs} \otimes \overrightarrow{chase} \otimes \overrightarrow{cats}$ never needs to be calculated: all that is required for the simplified calculation shown on the last line is the noun vectors and the $C_{ijk}$ weights of the verb. Hence computing the sentence representation is simply a matter of building a vector using inner products, a computationally simple operation. The formalism therefore avoids the two major problems faced by approaches in the vein of [7, 2], which use the tensor product as a composition operation: grammatically different sentences receive representations of different dimensionalities, and such representations cannot be compared directly using inner products.

From Truth-Theoretic to Corpus-based Meaning

The model presented above is compositional and distributional, but still abstract. To make it concrete, one has to construct $N$ and $S$ and provide a method for determining the $C_{ijk}$ weightings. To obtain a truth-theoretic meaning, we assume that $N$ is spanned by all animals and that $S$ is the two-dimensional space spanned by $\overrightarrow{true}$ and $\overrightarrow{false}$. We use the degree of freedom provided to us by the superposition type of the verb to define a model-theoretic truth, "à la Montague", as follows:

  $C_{ijk}\,\vec{s}_j \;=\; \begin{cases} \overrightarrow{true} & \text{if } chase(\vec{n}_i, \vec{n}_k) = \text{true} \\ \overrightarrow{false} & \text{otherwise} \end{cases}$

The definition of our meaning map ensures that this value propagates to the meaning of the whole sentence. So "dogs chase cats" becomes true whenever $chase(\overrightarrow{dogs}, \overrightarrow{cats})$ is true, and false otherwise. This is exactly how meaning is computed in the model-theoretic view of semantics.
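As a minimal illustration of this composition (not of the concrete constructions given later), the following numpy sketch builds the sentence vector directly from the noun vectors and the verb's weights; the dimensions and values are arbitrary placeholders. Note that the tensor product of the three word vectors is never formed.

```python
import numpy as np

# Sketch of the map (eps_N (x) 1_S (x) eps_N) applied to dogs (x) chase (x) cats.
# Dimensions and weights are arbitrary placeholders for illustration.
dim_N, dim_S = 4, 3
rng = np.random.default_rng(0)

dogs = rng.random(dim_N)                 # coefficients <dogs|n_i>
cats = rng.random(dim_N)                 # coefficients <n_k|cats>
C = rng.random((dim_N, dim_S, dim_N))    # verb weights C_ijk over N (x) S (x) N

# Meaning of "dogs chase cats" in S:
#   s_j = sum_{i,k} C_ijk <dogs|n_i> <n_k|cats>.
# The full tensor dogs (x) chase (x) cats, with dim_N^4 * dim_S components,
# is never built.
sentence = np.einsum('i,ijk,k->j', dogs, C, cats)
print(sentence)
```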
One way to generalize this truth-theoretic meaning is to assume that $chase(\vec{n}_i, \vec{n}_k)$ has degrees of truth, for instance by defining chase as a combination of run and catch, e.g. $\overrightarrow{chase} = \frac{2}{3}\,\overrightarrow{run} + \frac{1}{3}\,\overrightarrow{catch}$. Again, the meaning map ensures that these degrees propagate to the meaning of the whole sentence; for a worked-out example see [1]. But neither of these examples provides a distributional meaning.

Here we take a first step towards a corpus-based model, by attempting to retrieve a meaning for a sentence from a corpus based on the meanings of the words within it. This meaning goes beyond simply composing the meanings of words with a vector combinator such as the tensor product, summation, or multiplication. Our computation of meaning is driven primarily by the syntactic structure of the sentence: it treats some words as functions and others as their arguments, applies the functions to their arguments, and only then, by following this prescription, builds a vector for the meaning of the sentence in terms of vector combinations of the words. The intuition behind this approach, à la Firth, is that you shall know the meaning of a sentence by its grammatical structure and the meanings of the words in it. One can say that this approach, like all distributional models, generalizes the truth-theoretic notion of meaning to one of 'common knowledge' and 'usage' rather than of logical truth. We offer the truth-theoretic construction above merely as an illustration of how special concrete instantiations of this formalism might also model truth-theoretic semantics; by no means are we limited to such an approach.

The contribution of this abstract is to introduce concrete constructions for a corpus-based model of compositional meaning. These constructions demonstrate how the mathematical model of [1] can be implemented in a concrete setting with a richer, not-necessarily-truth-theoretic notion of natural language semantics, one closer to the notion of meaning underlying classical distributional models of word meaning. We leave evaluation to future work, i.e. performing experiments to determine whether the proposed method provides better results for particular language processing tasks, e.g. computing sentence similarity, evaluating paraphrases, text classification, etc.

A simplistic construction would be to build $N$ in the standard way, i.e. take the basis vectors to be words from a dictionary and build vectors for words by counting co-occurrence. Then take $S$ to be $N \otimes N$, so that its basis vectors are of the form $\vec{s}_j = (\vec{n}_i, \vec{n}_k)$. The corresponding weights $C_{ijk}$ can be set to the number of times the two basis words have co-occurred with the verb (or some weighted/normalized version of this count). This model suffers from a standard problem: the meanings of "cats chase dogs" and "dogs chase cats" come out similar. To overcome this, one can strengthen the co-occurrence relation and count how many times $\vec{n}_i$ has been the subject of the verb and $\vec{n}_k$ has been its object. This solves the first problem, but suffers from another: it ignores the meanings of the subject and object words themselves. As a result, "bankers sell stock" and "dogs chase cats" will have similar meanings, since "dogs" and "cats" have co-occurred with "chase" as much as "bankers" and "stock" have with "sell". To overcome this problem, we count how many times words that have similar meanings to the subject and object have co-occurred with the verb in the same grammatical structure.
We implement this idea by taking $N$ to be a structured vector space, as in [4, 5]. The basis vectors of $N$ are now annotated by 'properties' obtained by combining dependency relations with nouns and verbs. For example, basis vectors might be associated with properties such as "arg-fluffy", denoting the argument of the adjective fluffy, "subj-chase", denoting the subject of the verb chase, and "obj-chase", denoting the object of the verb chase. We construct the vector for a noun by counting how many times in the corpus that word has been the argument of 'fluffy', the subject of 'chase', the object of 'chase', and so on. $S$ is still $N \otimes N$, and again the weights $C_{ijk}$ for the basis vectors $(\vec{n}_i, \vec{n}_k)$ are set by counting how many times something that is $n_i$ (e.g. is an argument of fluffy) has been the subject of the verb and something that is $n_k$ (e.g. is an object of buys) has been its object.

Adjective phrases are dealt with in a similar way. We give them the syntactic type $n n^l$ and build their vectors in $N \otimes N$. The syntactic reduction $n n^l n \to n$ associated with applying an adjective to a noun gives us the map $1_N \otimes \epsilon_N$, by which we semantically compose an adjective with a noun as follows:

  $\overrightarrow{red\ fox} \;=\; (1_N \otimes \epsilon_N)(\overrightarrow{red} \otimes \overrightarrow{fox}) \;=\; \sum_{ij} C_{ij}\, \vec{n}_i \, \langle \vec{n}_j \,|\, \overrightarrow{fox} \rangle$

The $C_{ij}$ counts for an adjective $a$ are obtained in a manner similar to those of transitive or intransitive verbs:

  $C_{ij} \;=\; \begin{cases} \sum_{a_c \in C} m_c \, \langle \overrightarrow{arg\text{-}of(a_c)} \,|\, \vec{n}_i \rangle & \text{if } \vec{n}_i = \vec{n}_j \\ 0 & \text{otherwise} \end{cases}$

where $m_c = 1$ if $a_c = a$ and $0$ otherwise, and $arg\text{-}of(a_c) = noun_c$ if $a_c$ is an adjective with argument $noun_c$, and $\varepsilon_n$ otherwise. We can view these counts as determining what sort of properties the arguments of an adjective typically have (e.g. arg-red, arg-colourful for the adjective "red").

Example

We briefly elaborate on an example. Let $C$ be the multiset of tokens in our corpus, let $v_c \in C$ be such a token, and let the function $subj\text{-}of$ (and similarly $obj\text{-}of$) be defined as follows:

  $subj\text{-}of(v_c) \;=\; \begin{cases} noun_c \in C & \text{if } v_c \text{ is a verb with subject } noun_c \\ \varepsilon_n & \text{otherwise} \end{cases}$

where $\varepsilon_n$ is the empty string. We express $C_{ijk}$ for a verb $v$ as follows:

  $C_{ijk} \;=\; \begin{cases} \sum_{v_c \in C} m_c \, \langle \overrightarrow{subj\text{-}of(v_c)} \,|\, \vec{n}_i \rangle \, \langle \overrightarrow{obj\text{-}of(v_c)} \,|\, \vec{n}_k \rangle & \text{if } \vec{s}_j = (\vec{n}_i, \vec{n}_k) \\ 0 & \text{otherwise} \end{cases}$

where $m_c = 1$ if $v_c = v$ and $0$ otherwise. Thus we construct $C_{ijk}$ for the verb $v$ only for the cases where the subject property $n_i$ and the object property $n_k$ are paired in the basis vector $\vec{s}_j$. This is done by counting the number of times the subject of $v$ has property $n_i$ and the number of times the object of $v$ has property $n_k$, and multiplying them, as prescribed by the inner products.

Here is a worked-out example. We first manually define the distributions for the nouns, which would normally be learned from a corpus:

              (1) arg-fluffy   (2) arg-ferocious   (3) obj-buys   (4) arg-shrewd   (5) arg-valuable
  bankers           0                 4                  0               6                 0
  cats              7                 1                  4               3                 1
  dogs              3                 6                  2               1                 2
  stock             0                 0                  7               0                 8
  kittens           2                 0                  0               1                 0

We aim to make these match our intuitions, in that bankers are shrewd and a little ferocious but not fluffy, cats are fluffy but not often valuable, and so on.
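In a full implementation, both the noun vectors and the verb weights would of course be harvested from a parsed corpus rather than fixed by hand. The following is a rough sketch of how that could be done; the token lists standing in for the parsed corpus are invented purely for illustration.

```python
import numpy as np
from collections import defaultdict

# Hypothetical parsed-corpus tokens, invented for illustration only.
verb_tokens = [('chase', 'dogs', 'cats'), ('chase', 'dogs', 'kittens'),
               ('sell', 'bankers', 'stock')]           # (verb, subject, object)
adj_tokens = [('fluffy', 'cats'), ('fluffy', 'kittens'),
              ('shrewd', 'bankers')]                   # (adjective, argument)

# Basis of N: one 'property' per grammatical role, e.g. subj-chase, obj-chase, arg-fluffy.
properties = sorted({f'subj-{v}' for v, _, _ in verb_tokens} |
                    {f'obj-{v}' for v, _, _ in verb_tokens} |
                    {f'arg-{a}' for a, _ in adj_tokens})
idx = {p: i for i, p in enumerate(properties)}

# Noun vectors: count how often each noun fills each property slot.
noun = defaultdict(lambda: np.zeros(len(properties)))
for v, s, o in verb_tokens:
    noun[s][idx[f'subj-{v}']] += 1
    noun[o][idx[f'obj-{v}']] += 1
for a, arg in adj_tokens:
    noun[arg][idx[f'arg-{a}']] += 1

# Verb weights C_ik: for every token of the verb, add <subj|n_i><obj|n_k>,
# i.e. the outer product of its subject's and object's noun vectors.
C = defaultdict(lambda: np.zeros((len(properties), len(properties))))
for v, s, o in verb_tokens:
    C[v] += np.outer(noun[s], noun[o])

print(properties)
print(C['chase'])
```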
Likewise, we define the distributions for the transitive verbs 'chase', 'pursue' and 'sell'. These are also manually specified, according to intuitions about how these verbs are used. Since in the formalism proposed above $C_{ijk} = 0$ if $\vec{s}_j \neq (\vec{n}_i, \vec{n}_k)$, we can simplify the weights of a transitive verb to a two-dimensional matrix $C_{ik}$, where $C_{ik}$ corresponds to the number of times the verb has a subject with attribute $n_i$ and an object with attribute $n_k$. For example, we read off that something ferocious ($i = 2$) chases something fluffy ($k = 1$) seven times in the hypothetical corpus from which we might have obtained these distributions.

  C^chase =          C^pursue =         C^sell =
  | 1 0 0 0 0 |      | 0 0 0 0 0 |      | 0 0 0 0 0 |
  | 7 1 2 3 1 |      | 4 2 2 2 4 |      | 0 0 3 0 4 |
  | 0 0 0 0 0 |      | 0 0 0 0 0 |      | 0 0 0 0 0 |
  | 2 0 1 0 1 |      | 3 0 2 0 1 |      | 0 0 5 0 8 |
  | 1 0 0 0 0 |      | 0 0 0 0 0 |      | 0 0 1 0 1 |

Using these, we can perform sentence comparisons by simple calculation (best done with a script):

  $\langle \overrightarrow{dogs\ chase\ cats} \mid \overrightarrow{dogs\ pursue\ kittens} \rangle$
  $\;=\; \Big\langle \sum_{ijk} C^{chase}_{ijk} \langle \overrightarrow{dogs} \,|\, \vec{n}_i \rangle \, \vec{s}_j \, \langle \vec{n}_k \,|\, \overrightarrow{cats} \rangle \;\Big|\; \sum_{ijk} C^{pursue}_{ijk} \langle \overrightarrow{dogs} \,|\, \vec{n}_i \rangle \, \vec{s}_j \, \langle \vec{n}_k \,|\, \overrightarrow{kittens} \rangle \Big\rangle$
  $\;=\; \sum_{ijk} C^{chase}_{ijk} C^{pursue}_{ijk} \langle \overrightarrow{dogs} \,|\, \vec{n}_i \rangle \langle \overrightarrow{dogs} \,|\, \vec{n}_i \rangle \langle \vec{n}_k \,|\, \overrightarrow{cats} \rangle \langle \vec{n}_k \,|\, \overrightarrow{kittens} \rangle$

The raw number obtained from the above calculation is 14844. Normalising it by the product of the lengths of both sentence vectors gives the cosine measure value 0.979. Let us now contrast this with $\langle \overrightarrow{dogs\ chase\ cats} \mid \overrightarrow{cats\ chase\ dogs} \rangle$, a lexically similar pair of sentences that nonetheless diverge in meaning. The raw number calculated from this inner product is 7341, and its normalised cosine measure is 0.656, which demonstrates the sharp drop in similarity obtained from changing the sentence structure. We expect some residual similarity, since there is non-trivial overlap between the properties identifying cats and those identifying dogs (namely those salient to the act of chasing). The final example for transitive sentences is $\langle \overrightarrow{dogs\ chase\ cats} \mid \overrightarrow{bankers\ sell\ stock} \rangle$, two sentences that diverge in meaning completely. The raw number is 6024 and its cosine measure is 0.042, demonstrating very low semantic similarity between these two sentences. Indeed, while dogs often chase cats and bankers typically sell stock, the properties related by these verbs and present in these individuals have very low overlap, hence the formalism establishes the lack of a semantic link between the two simple transitive sentences. To summarize, our example vectors provide us with the following similarity measures:

  Sentence 1         Sentence 2             Degree of similarity
  dogs chase cats    dogs pursue kittens    0.979
  dogs chase cats    cats chase dogs        0.656
  dogs chase cats    bankers sell stock     0.042
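As noted above, these comparisons are best done with a script. A short numpy sketch that recomputes the cosine measures in the table from the noun vectors and verb matrices given above (function and variable names are ours):

```python
import numpy as np

# Noun vectors and verb weight matrices from the tables above
# (rows index the subject property i, columns the object property k).
nouns = {
    'bankers': np.array([0, 4, 0, 6, 0], dtype=float),
    'cats':    np.array([7, 1, 4, 3, 1], dtype=float),
    'dogs':    np.array([3, 6, 2, 1, 2], dtype=float),
    'stock':   np.array([0, 0, 7, 0, 8], dtype=float),
    'kittens': np.array([2, 0, 0, 1, 0], dtype=float),
}
verbs = {
    'chase':  np.array([[1,0,0,0,0],[7,1,2,3,1],[0,0,0,0,0],[2,0,1,0,1],[1,0,0,0,0]], dtype=float),
    'pursue': np.array([[0,0,0,0,0],[4,2,2,2,4],[0,0,0,0,0],[3,0,2,0,1],[0,0,0,0,0]], dtype=float),
    'sell':   np.array([[0,0,0,0,0],[0,0,3,0,4],[0,0,0,0,0],[0,0,5,0,8],[0,0,1,0,1]], dtype=float),
}

def meaning(subj, verb, obj):
    # Sentence vector in S = N (x) N with components C_ik <subj|n_i><n_k|obj>.
    return verbs[verb] * np.outer(nouns[subj], nouns[obj])

def similarity(s1, s2):
    # Cosine measure between two sentence vectors.
    return np.sum(s1 * s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))

dcc = meaning('dogs', 'chase', 'cats')
print(similarity(dcc, meaning('dogs', 'pursue', 'kittens')))   # ~0.979
print(similarity(dcc, meaning('cats', 'chase', 'dogs')))       # ~0.656
print(similarity(dcc, meaning('bankers', 'sell', 'stock')))    # ~0.042
```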
Comparing meanings of grammatically different sentences

For reasons of space, we have only presented the treatment of sentences with transitive verbs here. For sentences with intransitive verbs, the sentence space can simply be $N$. To compare the meaning of a transitive sentence with an intransitive one, we embed the meaning of the latter from $N$ into $N \otimes N$, by taking $\vec{\varepsilon}_n$ to be $\sum_i \vec{n}_i$, i.e. the superposition of all basis vectors of $N$. Now, since $\langle \vec{\varepsilon}_n \,|\, \vec{n}_i \rangle = 1$ for any basis vector $\vec{n}_i$, we obtain that for $\vec{s}_j = (\vec{n}_i, \vec{n}_k)$ the weight $C_{ijk}$ of any intransitive verb is $\sum_{v_c \in C} m_c \langle \overrightarrow{subj\text{-}of(v_c)} \,|\, \vec{n}_i \rangle$, computed as follows:

  $\sum_{v_c \in C} m_c \, \langle \overrightarrow{subj\text{-}of(v_c)} \,|\, \vec{n}_i \rangle \, \langle \overrightarrow{obj\text{-}of(v_c)} \,|\, \vec{n}_k \rangle \;=\; \sum_{v_c \in C} m_c \, \langle \overrightarrow{subj\text{-}of(v_c)} \,|\, \vec{n}_i \rangle \, \langle \vec{\varepsilon}_n \,|\, \vec{n}_k \rangle$

We can now compare the meanings of these two types of sentences simply by taking the inner product of their meanings (despite their different 'arities') and normalising it by the vector lengths to obtain the cosine measure. For example:

  $\langle \overrightarrow{dogs\ chase\ cats} \mid \overrightarrow{dogs\ chase} \rangle \;=\; \Big\langle \sum_{ijk} C_{ijk} \langle \overrightarrow{dogs} \,|\, \vec{n}_i \rangle \, \vec{s}_j \, \langle \vec{n}_k \,|\, \overrightarrow{cats} \rangle \;\Big|\; \sum_{ijk} C'_{ijk} \langle \overrightarrow{dogs} \,|\, \vec{n}_i \rangle \, \vec{s}_j \Big\rangle$
  $\;=\; \sum_{ijk} C_{ijk} C'_{ijk} \langle \overrightarrow{dogs} \,|\, \vec{n}_i \rangle \langle \overrightarrow{dogs} \,|\, \vec{n}_i \rangle \langle \vec{n}_k \,|\, \overrightarrow{cats} \rangle$

The raw number is 14092 and its normalised cosine measure is 0.961, indicating high similarity (but some difference) between a sentence with a transitive verb and one where the subject remains the same but the verb is used intransitively.

The simplest way to generalize the above is to follow the pregroup typing of the grammatical structure. According to this, one builds the sentence space of sentences with indirect objects by adding more tensor factors of $N$. For instance, to compute the meaning of the sentence "Dogs chase cats in the garden." we need a triple of basis vectors, hence $S$ has to be $N \otimes N \otimes N$; similarly, for "Dogs chase cats in the garden at night." we need a quadruple of basis vectors, hence $S = N \otimes N \otimes N \otimes N$. If an adverb is present, one needs to add a tensor factor for that too; more than one adverb is treated roughly like the adjective case and does not require additional tensor factors. To be able to compare the meanings of all these different types of sentences, we may take $S$ to be an infinite space and embed the smaller spaces into it via $N \hookrightarrow N \otimes N$, or generalizations thereof. The good news is that since this possibly large space is only used to form inner products for comparing meanings of sentences, we do not actually need to build the whole space, hence avoiding a blow-up in complexity. We also aim to investigate the use of other type-logics, such as CCG, to see whether they provide a different generalization.

Related Work

In [6] a multiplicative model for vector composition is introduced and evaluated. The abstract model of [1], on which this work is based, is a general framework that can accommodate any kind of vector combination. The particular concrete construction of this paper differs from that of [6] in three ways: (1) we can compute truth-theoretic as well as corpus-based meaning; (2) our meaning construction relies on and is guided by the grammatical structure of the sentence; (3) we work with structured vector spaces, which go beyond looking only at co-occurrence of words. The approach of [4] is closer in spirit to ours, in that extra information about syntax is used to compose meaning. As in our approach, they use a structured vector space to "integrate lexical information with selectional preferences". For each word, the extra information is modelled as a set of vectors, each representing the expectations that the word supports, for example the subject and object of a verb. A more detailed comparison (and possible unification) is left to future work, but our approach appears to be more general: apart from offering a truth-theoretic meaning, in our corpus-based model we do not have to pre-specify the expectations and limit each word to a fixed set of them; we consider all possible such expectations as basis vectors of our vector space and let the corpus decide which words have a higher weight for which expectations. We aim to use evaluation methods similar to those of the above papers to investigate applications of our setting. In addition, we propose a more extensive evaluation of this formalism using a corpus of paraphrase pairs, with inter-annotator agreement as the gold standard, as discussed in [3].
References

[1] S. Clark, B. Coecke, and M. Sadrzadeh. Mathematical Foundations for a Compositional Distributional Model of Meaning. Linguistic Analysis (Lambek Festschrift), volume 36, 2010. http://arxiv.org/abs/1003.4394.

[2] S. Clark and S. Pulman. Combining symbolic and distributional models of meaning. In Proceedings of the AAAI Spring Symposium on Quantum Interaction. AAAI Press, 2007.

[3] T. Cohn, C. Callison-Burch, and M. Lapata. Constructing corpora for development and evaluation of paraphrase systems. Computational Linguistics, 34(4):597-614, 2008.

[4] K. Erk and S. Padó. A structured vector space model for word meaning in context. In EMNLP'08, Conference on Empirical Methods in Natural Language Processing, pages 897-906. ACL, 2008.

[5] G. Grefenstette. Use of syntactic context to produce term association lists for text retrieval. In Nicholas J. Belkin, Peter Ingwersen, and Annelise Mark Pejtersen, editors, SIGIR, pages 89-97. ACM, 1992.

[6] J. Mitchell and M. Lapata. Vector-based models of semantic composition. In Proceedings of ACL, pages 236-244, 2008.

[7] P. Smolensky and G. Legendre. The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar. Vol. I: Cognitive Architecture; Vol. II: Linguistic and Philosophical Implications. MIT Press, 2005.