Semantic memory for syntactic disambiguation

Deryle Lonsdale ([email protected]), Department of Linguistics & English Language, 4039 JFSB, Provo, UT 84602 USA
Jeremiah McGhee ([email protected]), Department of Linguistics & English Language, 4039 JFSB, Provo, UT 84602 USA
Nathan Glenn ([email protected]), Department of Linguistics & English Language, 4039 JFSB, Provo, UT 84602 USA
Seth Wood ([email protected]), Department of Linguistics & English Language, 4039 JFSB, Provo, UT 84602 USA
Tory Anderson ([email protected]), Department of Linguistics & English Language, 4039 JFSB, Provo, UT 84602 USA

Abstract

In this paper we address a type of verb-phrase ambiguity that has proven problematic for syntactic parsers. We discuss the ambiguity from a linguistic theory perspective and explain how we carried out a corpus analysis task using widely consulted databases to find examples and quantify the extent of the problem. We then sketch how we were able to encode the associated knowledge in the agent’s semantic memory to provide a resolution when examples like those from the corpus were encountered. When such prior experience cannot be brought to bear, we mention how the system uses a semantic similarity measure to resolve the ambiguity, and then stores the result in semantic memory for future access. We conclude with some results, remarks about performance, and future related work.

Keywords: syntactic ambiguity, corpus databases, semantic memory, incremental parsing

Introduction

Ambiguity arises when more than one interpretation is possible given some information. Human languages have the potential for ambiguity in copious amounts and at different levels. For example, in English the word “might” exhibits part-of-speech ambiguity since it may function either as a noun or a modal verb. The word “axes” is morphologically ambiguous since it may be reduced to its base form in various ways: it might be the plural of the word “axis”, it may be the plural of the noun “axe” or “ax”, or it may be the present tense conjugation of the verb “ax” or “axe”. The noun “pike” exhibits semantic (or lexical) ambiguity since it has two unrelated senses: a type of fish, or a type of medieval weapon.

In this paper we will be principally concerned with syntactic ambiguity, where a phrase or other syntactic structure’s interaction with other structures in the sentence may lead to meaning differences. In particular, we discuss the structure of sentences like those shown in Figure 1. In the first sentence, the word “was” serves as an auxiliary verb, signaling that the main verb “singing” has progressive aspect. In the second sentence, “was” serves as a form of the main verb “be”, often called a copular or linking verb. In these cases there is little trouble distinguishing between the function of these verbs with an appeal to semantics, but more difficult cases exist.

(a) My favorite cousin was singing.
(b) My favorite activity was singing.
Figure 1: Two sentences illustrating the progressive (a) vs. copula+gerund (b) ambiguity.

Some English syntactic ambiguity problems have been well studied in the literature. For example, the issue of prepositional phrase (PP) attachment ambiguity is pervasive in English and has been treated from several perspectives; see (Hindle & Rooth, 1993; Merlo & Ferrer, 2006) for comprehensive discussions of processing techniques and strategies. In earlier work (Lonsdale & Manookin, 2004) we showed how our cognitive modeling agent resolved PP attachment decisions by drawing on another language modeling approach, analogical modeling, hereafter AM (Skousen, 1989). When a decision about the proper attachment site was required, the system queried the separate AM system, which in turn consulted a pre-annotated corpus of PP attachment decisions and returned the result. While the system performed very well, it was somewhat unsatisfying that we had to extend processing outside the cognitive agent and invoke an external knowledge source to assure this functionality. We have recently shown elsewhere (Lonsdale, McGhee, Anderson, & Glenn, 2011) a more elegant solution to the PP attachment problem based on enhancements to the system framework, which will also be mentioned below. In this paper we discuss how we resolve the progressive versus copula+gerund ambiguity in a similar fashion.

S<(VP</^VB/<(VP<VBG))
S<(VP<(/^VB/ . (NP-PRD<NP<<,VBG)))
Figure 4: Queries to retrieve the two sentence types: progressive (top) and copula+gerund (bottom).

Background

The work described in this paper extends the framework of a natural language modeling system built upon the Soar cognitive architecture (Laird, 1984; Newell, 1990). The original version of the system was designed to syntactically parse sentences in order to model human parsing difficulties based on structural constraints (Lewis, 1993). The system was subsequently extended to handle semantic processing (Rytting & Lonsdale, 2005), natural language generation (Rubinoff & Lehman, 1994), and discourse processing (Green & Lehman, 2002) among several other task-related capabilities. In all of this work a theory of syntax was assumed that reflected thinking around the late 1980s and early 1990s, and was built upon the basic cognitive architecture available at the time.

Sentences are input into the agent one word at a time. A lexical access operator is initiated for each word in turn; it retrieves from several knowledge sources various kinds of lexical, orthographic, morphological, syntactic, and semantic information for that word. Each lexical item is associated with a zero-level node which is then projected to bar-level and phrasal-level nodes. According to standard X-bar theory, when features license combining words into phrases, corresponding syntactic constituents are built.
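The projection step described above (a zero-level node projected to bar-level and phrasal-level nodes) can be pictured with a small sketch. This is only an illustration in Python, not the agent's Soar implementation; the `Node` class, its fields, and the `project` function are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node in an X-bar projection (hypothetical structure)."""
    category: str                 # e.g. "N", "V", "Prog"
    level: int                    # 0 = head (X0), 1 = bar level (X'), 2 = phrase (XP)
    word: Optional[str] = None    # only the zero-level node carries the word
    children: List["Node"] = field(default_factory=list)

    def label(self) -> str:
        # Conventional labels: N, N', NP
        return self.category + ("", "'", "P")[self.level]


def project(word: str, category: str) -> Node:
    """Build the zero-level node for a word and project it to X' and XP."""
    head = Node(category, 0, word)
    bar = Node(category, 1, children=[head])
    return Node(category, 2, children=[bar])


phrase = project("singing", "N")
```

Here `phrase` is an NP whose only daughter is N', which in turn dominates the head N carrying the word. Combining projections into larger constituents, which the agent does when features license it, is left out of the sketch.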
This paper reports on empirical work done on a newer version of the system that we are in the process of developing. The system improvements enabling this work on our natural language agent stem from recent developments in two areas:

• The syntax has been upgraded to reflect more recent assumptions on constituency. Whereas in the past we assumed a Principles and Parameters (or Government and Binding) model for syntax, we now use a model based on the Minimalist Program (Chomsky, 1995). We also use WordNet (Fellbaum, 1998) as a resource for lexical knowledge; the current system now supports the most recent version (WordNet 3.0).

• The cognitive modeling system has also evolved in substantial ways (Laird, 2012), some tangential to this discussion. What is relevant is that the basic architecture now supports semantic memory (Derbinsky & Laird, 2010), which represents learned knowledge facts distinct from any episodic context.

The use of our new syntactic model has implications for the progressive/gerund ambiguity discussed earlier. We assume an explicit phrasal structure of the various auxiliary verbs for English in the form of separate X-bar projections for progressive, perfective, and passive items (Adger, 2003; Lonsdale, 2006). Figures 2 and 3 show the respective parses for the progressive construction (which uses the progressive projection) and the gerund construction (which uses a main verb’s projection). Note that in either case the verb in the Prog(ressive) node or the V(erb) node is some form of the verb “be” (i.e. “is”, “were”, “are”, “be”, etc.). In this paper we ignore potential ambiguity with passive constructions, which also use this auxiliary but whose main verb is encoded as a past participle.

Figure 2: Parse tree for the sentence “My favorite activity was singing.”
Figure 3: Parse tree for the sentence “My favorite cousin was singing.”

The corpus

In order to ground this work in actual language use, we began by collecting data from available corpus and treebank database resources. We resolved to first consult treebanks, because of their extensive syntactic annotation. We thus used the tgrep2 tool to retrieve relevant sentences from the Penn Treebank (PTB) (Marcus, Santorini, Marcinkiewicz, & Taylor, 1999). It was readily apparent that examples of the progressive construction were numerous (over 4500 sentences); see Figure 4 for the relevant tgrep2 queries. On the other hand, the PTB’s yield of copula+gerund constructions was surprisingly meager, resulting in only a handful of instances such as those in Figure 5 [1]. In fact, each of these is arguably not even of the copula+gerund type since the “-ing” word is in fact a present participle filling the role of an attributive adjective to another noun.

One reason is mounting competition from new Japanese car plants.
The main reason for the production decline is shrinking output of light crude.
Mr. Leinberger is managing partner of a real-estate advisory firm.
Conference committees are breeding grounds for mischief.
Figure 5: Sample non-progressive sentences collected from the Penn Treebank.

[1] These sentences have had some unnecessary material removed for clarity.

Since copula+gerund examples were in such short supply in PTB, we resolved to consult three other corpora to find and extract sentences with this type of construction:

• The British National Corpus (Davies, 2004) was originally created in the late 1980s by Oxford University Press and contains 100 million words of English. The BYU interface to the British National Corpus allows for many types of lexical investigation including word and phrase lookup and limited wildcard matching.

• The Corpus of Contemporary American English (COCA) (Davies, 2009) is the largest freely-available online database of English, and the only large and balanced corpus of American English. It contains more than 410 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts, including 20 million words each year from 1990-2010.

• The Corpus of Historical American English (COHA) (Davies, 2010) is the largest structured online database of historical English. This corpus consists of 400 million words in texts from 1810 to 2009.

Though these databases do not have full treebank-style syntactic parses, their words are tagged for part of speech (POS) and have interfaces that support querying for limited sequences of tags, words, and other similar information. Finding sentences with progressive tenses was straightforward; there were tens of thousands of such instances. Again, the copula+gerund examples were not as prevalent. Part of this is due to POS tagging errors, which for progressives and gerunds are particularly common. For example, in the COCA corpus a sequence of noun plus copula plus gerund is coded as:

[n*] [vb*] *ing.[n*]

but it erroneously returns sequences like “heart is pounding”, “storm is brewing”, “people are spending”, and so forth. This demonstrates the difficulty of processing this type of construction and the limited state of the art in addressing this syntactic ambiguity type.

In this effort we excluded sentences where other factors complicated the syntactic analysis and disambiguation strategy; in particular, we leave for future work sentences where:

• potential ambiguity exists between contracted auxiliaries and the possessive:
The court’s reasoning that the two jurisdictions are sovereign...

• multi-word expressions and fixed phrases mask syntactic structure:
Chances are homebuilding will require...

• the post-copula gerund forms the initial part of a nominal compound:
On the floor were sleeping bags, knapsacks, ...

• the subject is a pronoun and hence determining reference is difficult:
...some were counseling caution...
...others are shuffling...

• the mood is interrogative:
In which country was fencing originated?

Figure 6 shows some copula+gerund sentences which have been matched and extracted from the corpora mentioned earlier. In total we gathered about 80 copula+gerund sentences and, naturally, thousands of progressives. In most cases resolution of the ambiguity is fairly straightforward for humans: most are clearly progressive or clearly copula+gerund. However, some sentences illustrate that indeterminism does exist, especially at the sentence level. For example, the sentence:

His business is advertising.

could be construed as involving either construction, depending on wider context.

Rosemary’s hobbies are gardening, walking, ...
The charges are kidnapping, torture, ...
Examples of such activities are mountaineering, rock climbing, ...
In tundra and taiga the natural resources are fishing and hunting.
Included in our amusements are gardening, riding, shooting, ...
Common treatments have been counseling and psychotherapy.
Part of the issue is timing.
Bowling is bowling, and I know what it takes.
The central fact is that cloning is cloning is cloning.
His passion is fishing.
The bottleneck is training and finance.
Betting is gambling.
My talent is gambling.
Our theme was gardening ...
My passion was teaching.
His one good distraction was gardening.
His main business was ranching.
His favorite recreations were gardening and fishing.
Figure 6: Example sentences of the copula+gerund type mined from several corpora and annotated.

Executing disambiguation decisions

We next discuss how we used the corpus data along with the Soar system enhancements to allow the natural language agent to better handle the ambiguity type in question.

As mentioned above, the corpus collection process revealed a large discrepancy in the ratio of progressive sentences to copula+gerund ones. Hence from a global perspective a rational choice, and most productive baseline, would be to assume that all such sentences would be of the progressive type. However, since the parser is incremental, at the time it receives the word “was” it doesn’t know which construction will emerge as other words are added. Instead, the default processing step is to assume that “was” is a main verb, since sentences with “was” as a main verb usually outnumber sentences with progressive forms.

Now sometimes the next word, the -ing word, is unambiguous. If it’s an unambiguous noun, the system will link in the noun as the object of the main verb and the sentence terminates with a successful parse. This case is uninteresting for the work reported here and will not be further mentioned. Similarly, if the -ing word is unambiguously a verb, the system recognizes the inconsistency with its temporary assumption and “snips” the main verb node from the tree, repairing the construction by replacing it with the progressive auxiliary node. Then the -ing form can be incorporated into the tree as its complement, and the sentence terminates successfully. Again, since there is no ambiguity in the -ing word, the case is not directly interesting to us for this paper.

What is relevant for us in this paper is when the -ing word is ambiguous between a noun and a verb, for example the words “fishing” or “golfing”. When the system parses sentences like “His hobby was fishing.”, it will parse correctly using the default strategy since the assumption that “was” is a main verb was appropriate. However, the system would parse a sentence like “The president was golfing.” in exactly the same way; the result would be an incorrect copula construction, whereas the correct construal would be a progressive construction. What was needed was a way for the agent to encapsulate and access information that would help identify, when confronted with an incoming -ing word, when to proceed with one parse versus the other.
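The default-and-repair behaviour just described can be summarized in a short sketch. It is a simplification written in Python for illustration only: the `Parse` record, the function names, and the role strings are hypothetical, and the real agent operates on X-bar trees inside Soar rather than on a flat record.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Parse:
    """Flat stand-in for the partial tree (hypothetical structure)."""
    be_node: str = "V"               # "V" = main verb, "Prog" = progressive auxiliary
    ing_role: Optional[str] = None   # how the -ing word was attached
    repaired: bool = False           # True if the main-verb node was "snipped"


def start_parse() -> Parse:
    # Default step on reading a form of "be": assume it is a main verb.
    return Parse(be_node="V")


def attach_ing(parse: Parse, categories: Set[str]) -> Parse:
    """Attach the -ing word given the lexical categories it can have."""
    if categories == {"verb"}:
        # Unambiguous verb: snip the main-verb node, substitute the
        # progressive auxiliary, and attach the -ing word as its complement.
        parse.be_node = "Prog"
        parse.ing_role = "complement of Prog"
        parse.repaired = True
    else:
        # Unambiguous noun, or noun/verb ambiguous: the default assumption
        # stands and the -ing word is linked to the main verb.
        parse.ing_role = "linked to main verb V"
    return parse


unambiguous = attach_ing(start_parse(), {"verb"})
ambiguous = attach_ing(start_parse(), {"noun", "verb"})
```

With only this logic, an ambiguous word such as “golfing” keeps the main-verb reading, which is exactly the failure case that motivates consulting further knowledge.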
In order to process copula+gerund constructions, though, we need to rely on further information. First, we converted all of the copula+gerund corpus examples into data that could be loaded into semantic memory. This allows the agent to query semantic memory when deciding which alternative to pursue. In particular, given word1 (the subject) and word2 (the gerund), if a match is found, the copula+gerund construal is preferred over the progressive one. If no match is found, the agent tries for a partial match by abstracting away from the two lexical items themselves and querying instead based on their WordNet semantic classes. Thus if semantic memory contains a match for “activity” and “reading” (such as from the sentence “My favorite activity was reading.”), but the agent is processing a sentence like “My favorite pastime was reading.”, a partial match will result since “activity” and “pastime” are close semantically.

Finally, if no exact or partial matches are found in semantic memory based on lexical or WordNet semantic class features, the agent executes one more strategy. A similarity metric is generated for the subject/gerund word pair using the Java WordNet::Similarity implementation [2]. The program provides similarity measurements given by several algorithms, so based on our corpus collection, and with the aid of the Eureqa software tool [3], we determined the best metric for our disambiguation procedure. The Resnik measurement (Resnik, 1995) correlated most highly with the data. In particular, to assess its usefulness we used a training corpus of 62 copula+gerund sentences and 86 progressive sentences. The best threshold value was 1.73; if the Resnik similarity value was greater than that, we mark the sentence as copula+gerund, else as a progressive. We tested the result on a corpus of 54 progressives and 38 copula+gerunds, and it had an error of 0.197.
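The three-stage decision described above (exact match, then class-level partial match, then a Resnik threshold) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' Soar or Java code: the `SemanticMemory` class, the class labels, and the toy similarity scores are all hypothetical, and only the 1.73 threshold comes from the text. Storing only copula+gerund outcomes back into memory is likewise an assumption of the sketch.

```python
from typing import Callable, Dict, Set, Tuple

RESNIK_THRESHOLD = 1.73  # threshold value reported in the text


class SemanticMemory:
    """Toy stand-in for the agent's semantic memory (hypothetical)."""

    def __init__(self) -> None:
        self.word_pairs: Set[Tuple[str, str]] = set()
        self.class_pairs: Set[Tuple[str, str]] = set()

    def store(self, subject: str, gerund: str, classes: Dict[str, str]) -> None:
        # Record the lexical pair and its semantic-class abstraction.
        self.word_pairs.add((subject, gerund))
        self.class_pairs.add((classes.get(subject, subject),
                              classes.get(gerund, gerund)))


def disambiguate(subject: str, gerund: str, smem: SemanticMemory,
                 classes: Dict[str, str],
                 similarity: Callable[[str, str], float]) -> Tuple[str, str]:
    """Return (construal, evidence) for a subject / -ing word pair."""
    # 1. Exact lexical match in semantic memory.
    if (subject, gerund) in smem.word_pairs:
        return "copula+gerund", "exact"
    # 2. Partial match on semantic classes.
    key = (classes.get(subject, subject), classes.get(gerund, gerund))
    if key in smem.class_pairs:
        return "copula+gerund", "class"
    # 3. Fall back on the similarity threshold.
    if similarity(subject, gerund) > RESNIK_THRESHOLD:
        # Assumption: remember the outcome so that later queries match exactly.
        smem.store(subject, gerund, classes)
        return "copula+gerund", "similarity"
    return "progressive", "similarity"


# Hypothetical class labels and made-up similarity scores, for illustration only.
toy_classes = {"activity": "act", "pastime": "act", "reading": "communication"}
toy_scores = {("hobby", "fishing"): 2.5, ("president", "golfing"): 0.4}


def toy_similarity(a: str, b: str) -> float:
    return toy_scores.get((a, b), 0.0)


memory = SemanticMemory()
memory.store("activity", "reading", toy_classes)
```

With this setup, “pastime”/“reading” is resolved by the class-level match, while a pair absent from memory is decided by whichever similarity function is supplied. A real system would supply an actual Resnik implementation in place of `toy_similarity`.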
This threshold function was then implemented in Java to provide the agent a means of identifying copula+gerund instances not yet present in semantic memory. Then the result is stored in semantic memory for future access when pertinent.

[2] Downloaded from http://www.cogs.susx.ac.uk/users/drh21/.
[3] Available for download at: http://ccsl.mae.cornell.edu/eureqa.

Evaluation and results

For our first evaluation of the system we created a training set of some 430 unique sentences from the corpora listed above; 372 were progressive forms, and 61 were copula forms. We ran the system on a simplified version of these sentences, removing adjuncts and other material not crucial to the determination of the attachment site in incremental parsing for the ambiguity we address in this paper. The system performed very well, as expected, for this set; most parsing failures were due to words missing in WordNet and hence not available to the system for subsequent processing: “telecommuting” or “face-painting” as verbs, for example.

We also created a test corpus of 92 sentences, 52 with progressive forms and 40 with copula forms. These were extracted from WordNet sample sentences from the verb and noun databases. Table 1 shows the results from this test. The forcenoun strategy forces the copula+gerund interpretation; forceverb forces the progressive construal. Performance of the system dropped slightly on this unseen data: missing vocabulary and ambiguity elsewhere in the sentence contributed to failed parses.

Strategy   Forcenoun  Forceverb  Resnik  Random  Smem
Correct    39         52         52      51      53
Wrong      49         36         36      37      35
Failed     4          4          4       4       4
Table 1: System results on 92-sentence test set.

Using the Resnik metric for attachment decisions improved performance in some cases: for example, there were several tautological sentences like “Bowling is bowling.”, “Cloning is cloning.”, and “Beekeeping is beekeeping.” where semantic similarity (namely identity) was detected by the measure, resulting in correctly licensing the copula. Querying directly against semantic memory did not noticeably improve results since coverage was sparse and there was little overlap between the training data and testing data. With more extensive training data we expect this to improve. This was also the case for the hypernyms in semantic memory.

Finally, several of the test sentences had pronominal subjects, which contribute little semantic information and hence are not conducive to the semantic relatedness tests we have so far developed. This matter will require further study.

Conclusion and future work

In this paper we have sketched an approach for disambiguating a problematic syntactic construction. We first extracted examples from sizable corpora that we have subsequently annotated. We then converted the instances to knowledge that was uploaded into our agent’s semantic memory. Results were modest, but further work will involve developing more extensive knowledge sources via other corpus resources.

Though we have concentrated only on the progressive vs. copula+gerund ambiguity, related difficulties abound, especially when semantics is also implicated, as it is in our agent. We have mentioned earlier several phenomena that we have left beyond the scope of this paper which nevertheless must be addressed to assure robust treatment. There are still other closely related constructions. Consider, for example, the following sentence:

Twenty minutes after they’ve arrived, chicken is frying, lamb ragout is simmering, ...

In this sentence “chicken” is syntactically the subject of a progressive construction. Semantically, however, it is the object of the verb; this is often called an unaccusative or middle verb construction. A syntax/semantics mismatch like this must be detected if the sentence is to be understood correctly. We intend to use the techniques sketched in this paper, namely semantic memory and semantic distance, to tease apart unaccusatives from more standard (i.e. unergative) predicates.

Acknowledgments

We would like to thank Erika Hunt and Jason Housley for linguistic and corpus support.

References

Adger, D. (2003). Core Syntax: A Minimalist Approach. Oxford: Oxford University Press.
Chomsky, N. (1995). The Minimalist Program. Cambridge, MA: MIT Press.
Davies, M. (2004). BYU-BNC: The British National Corpus. (Available online at http://corpus.byu.edu/bnc)
Davies, M. (2009). The 385+ Million Word Corpus of Contemporary American English (1990-2008+): Design, Architecture, and Linguistic Insights. International Journal of Corpus Linguistics, 14, 159-90.
Davies, M. (2010). The Corpus of Historical American English (COHA): 400+ million words, 1810-2009. (Available online at http://corpus.byu.edu/coha)
Derbinsky, N., & Laird, J. E. (2010). Extending Soar with Dissociated Symbolic Memories. (Symposium on Human Memory for Artificial Agents, AISB (2010))
Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Green, N., & Lehman, J. F. (2002). An integrated discourse recipe-based model for task-oriented dialogue. Discourse Processes, 33(2), 133-158.
Hindle, D., & Rooth, M. (1993). Structural ambiguity and lexical relations. Computational Linguistics, 19(1), 103-120.
Laird, J. E. (1984). Universal subgoaling. Unpublished doctoral dissertation, Carnegie Mellon University. (Available as Carnegie Mellon University Computer Science Technical Report 83-138.)
Laird, J. E. (2012). The Soar Cognitive Architecture. MIT Press.
Lewis, R. (1993). An architecturally-based theory of human sentence comprehension. Unpublished doctoral dissertation, Carnegie Mellon.
Lonsdale, D. (2006). Learning in minimalism-based language modeling. In Proceedings of the 28th Annual Meeting of the Cognitive Science Society (p. 2651). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Lonsdale, D., & Manookin, M. (2004). Combining learning approaches for incremental on-line parsing. In Proceedings of the Sixth International Conference on Cognitive Modeling (ICCM 2004) (p. 160-165). Lawrence Erlbaum Associates.
Lonsdale, D., McGhee, J., Anderson, T., & Glenn, N. (2011). Resolving a syntactic ambiguity type with semantic memory. In Proceedings of the 20th Behavior Representation in Modeling & Simulation (BRIMS) Conference (p. 288-289).
Marcus, M., Santorini, B., Marcinkiewicz, M., & Taylor, A. (1999). Treebank-3.
Merlo, P., & Ferrer, E. E. (2006). The notion of argument in PP attachment. Computational Linguistics, 32(3), 1–35.
Newell, A. (1990). Unified theories of cognition. Cambridge, MA: Harvard University Press.
Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (p. 448-453).
Rubinoff, R., & Lehman, J. F. (1994). Real-time natural language generation in NL-Soar. In Proceedings of the Seventh International Workshop on Natural Language Generation.
Rytting, C. A., & Lonsdale, D. (2005). An operator-based account of semantic processing. In A. Lenci, S. Montemagni, & V. Pirrelli (Eds.), Acquisition and representation of word meaning: theoretical and computational perspectives (Vol. XXII-XXIII, p. 117-137). Pisa/Rome: Istituti editoriali e poligrafici internazionali.
Skousen, R. (1989). Analogical modeling of language. Dordrecht: Kluwer.
