Semantic memory for syntactic disambiguation
Deryle Lonsdale ([email protected])
Department of Linguistics & English Language, 4039 JFSB
Provo, UT 84602 USA
Jeremiah McGhee ([email protected])
Department of Linguistics & English Language, 4039 JFSB
Provo, UT 84602 USA
Nathan Glenn ([email protected])
Department of Linguistics & English Language, 4039 JFSB
Provo, UT 84602 USA
Seth Wood ([email protected])
Department of Linguistics & English Language, 4039 JFSB
Provo, UT 84602 USA
Tory Anderson ([email protected])
Department of Linguistics & English Language, 4039 JFSB
Provo, UT 84602 USA
Abstract
In this paper we address a type of verb-phrase ambiguity that has proven problematic for syntactic parsers. We discuss the ambiguity from a linguistic theory perspective and explain how we carried out a corpus analysis task using widely consulted databases to find examples and quantify the extent of the problem. We then sketch how we were able to encode the associated knowledge in the agent’s semantic memory to provide a resolution when examples like those from the corpus were encountered. When such prior experience cannot be brought to bear, we mention how the system uses a semantic similarity measure to resolve the ambiguity, and then stores the result in semantic memory for future access. We conclude with some results, remarks about performance, and future related work.
Keywords: syntactic ambiguity, corpus databases, semantic memory, incremental parsing
Introduction
Ambiguity arises when more than one interpretation is possible given some information. Human languages have the potential for ambiguity in copious amounts and at different levels. For example, in English the word “might” exhibits part-of-speech ambiguity since it may function either as a noun or a modal verb. The word “axes” is morphologically ambiguous since it may be reduced to its base form in various ways: it might be the plural of the word “axis”, it may be the plural of the noun “axe” or “ax”, or it may be the present tense conjugation of the verb “ax” or “axe”. The noun “pike” exhibits semantic (or lexical) ambiguity since it has two unrelated senses: a type of fish, or a type of medieval weapon.
In this paper we will be principally concerned with syntactic ambiguity, where a phrase or other syntactic structure’s interaction with other structures in the sentence may lead to meaning differences. In particular, we discuss the structure of sentences like those shown in Figure 1. In the first sentence, the word “was” serves as an auxiliary verb, signaling that the main verb “singing” has progressive aspect. In the second sentence, “was” serves as a form of the main verb “be”, often called a copular or linking verb. In these cases there is little trouble distinguishing between the function of these verbs with an appeal to semantics, but more difficult cases exist.
(a) My favorite cousin was singing.
(b) My favorite activity was singing.
Figure 1: Two sentences illustrating the progressive (a) vs. copula+gerund (b) ambiguity.
Some English syntactic ambiguity problems have been well studied in the literature. For example, the issue of prepositional phrase (PP) attachment ambiguity is pervasive in English and has been treated from several perspectives; see (Hindle & Rooth, 1993; Merlo & Ferrer, 2006) for comprehensive discussions of processing techniques and strategies.
In earlier work (Lonsdale & Manookin, 2004) we showed how our cognitive modeling agent resolved PP attachment decisions by drawing on another language modeling approach, analogical modeling, hereafter AM (Skousen, 1989). When a decision about the proper attachment site was required, the system queried the separate AM system, which in turn consulted a pre-annotated corpus of PP attachment decisions and returned the result. While the system performed very well, it was somewhat unsatisfying that we had to extend processing outside the cognitive agent to invoke an external knowledge source to assure this functionality. We have recently shown elsewhere (Lonsdale, McGhee, Anderson, & Glenn, 2011) a more elegant solution to the PP attachment problem based on enhancements to the system framework, which will also be mentioned below. In this paper we discuss how we resolve the progressive versus copula+gerund ambiguity in a similar fashion.
S<(VP</^VB/<(VP<VBG))
S<(VP<(/^VB/ . (NP-PRD<NP<<,VBG)))
Figure 4: Queries to retrieve the two sentence types: progressive (top) and copula+gerund (bottom).
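To make the intent of the top query in Figure 4 concrete, the following is a minimal sketch in plain Python of the structural test it expresses: an S node with a VP child that has both a child whose tag begins with VB and a VP child containing a VBG. This is not tgrep2 itself. The helper names (parse, children_with, is_progressive) and the two bracketed trees are illustrative assumptions, not data from the treebank.

```python
def parse(bracketing):
    """Turn a Penn-style bracketing into nested (label, children) tuples.
    Words (leaves) are kept as plain strings."""
    tokens = bracketing.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def children_with(node, pred):
    """Immediate non-leaf children whose label satisfies pred."""
    return [c for c in node[1] if isinstance(c, tuple) and pred(c[0])]


def is_progressive(node):
    """Simplified reading of S < (VP < /^VB/ < (VP < VBG)), tested at every S."""
    label, children = node
    if label == "S":
        for vp in children_with(node, lambda l: l == "VP"):
            has_vb = bool(children_with(vp, lambda l: l.startswith("VB")))
            has_ing_vp = any(children_with(inner, lambda l: l == "VBG")
                             for inner in children_with(vp, lambda l: l == "VP"))
            if has_vb and has_ing_vp:
                return True
    return any(is_progressive(c) for c in children if isinstance(c, tuple))


# Hypothetical hand-written trees for the two sentences of Figure 1.
prog = parse("(S (NP (PRP$ My) (JJ favorite) (NN cousin)) "
             "(VP (VBD was) (VP (VBG singing))))")
cop = parse("(S (NP (PRP$ My) (JJ favorite) (NN activity)) "
            "(VP (VBD was) (NP-PRD (NP (VBG singing)))))")
```

Under these assumptions the first tree satisfies the pattern and the second does not. Like the query it imitates, the check accepts any tag beginning with VB and does not verify that the verb is a form of “be”.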
Background
The work described in this paper extends the framework of a natural language modeling system built upon the Soar cognitive architecture (Laird, 1984; Newell, 1990). The original version of the system was designed to syntactically parse sentences in order to model human parsing difficulties based on structural constraints (Lewis, 1993). The system was subsequently extended to handle semantic processing (Rytting & Lonsdale, 2005), natural language generation (Rubinoff & Lehman, 1994), and discourse processing (Green & Lehman, 2002) among several other task-related capabilities. In all of this work a theory of syntax was assumed that reflected thinking around the late 1980s and early 1990s, and was built upon the basic cognitive architecture available at the time.
Sentences are input into the agent one word at a time. A lexical access operator is initiated for each word in turn; it retrieves from several knowledge sources various kinds of lexical, orthographic, morphological, syntactic, and semantic information for that word. Each lexical item is associated with a zero-level node which is then projected to bar-level and phrasal-level nodes. According to standard X-bar theory, when features license combining words into phrases, corresponding syntactic constituents are built.
This paper reports on empirical work done on a newer version of the system that we are in the process of developing. The system improvements enabling this work on our natural language agent stem from recent developments in two areas:
• The syntax has been upgraded to reflect more recent assumptions on constituency. Whereas in the past we assumed a Principles and Parameters (or Government and Binding) model for syntax, we now use a model based on the Minimalist Program (Chomsky, 1995). We also use WordNet (Fellbaum, 1998) as a resource for lexical knowledge; the current system now supports the most recent version (WordNet 3.0).
• The cognitive modeling system has also evolved in substantial ways (Laird, 2012), some tangential to this discussion. What is relevant is that the basic architecture now supports semantic memory (Derbinsky & Laird, 2010), which represents learned knowledge facts distinct from any episodic context.
The use of our new syntactic model has implications for the progressive/gerund ambiguity discussed earlier. We assume an explicit phrasal structure of the various auxiliary verbs for English in the form of separate X-bar projections for progressive, perfective, and passive items (Adger, 2003; Lonsdale, 2006). Figures 2 and 3 show the respective parses for the progressive construction (which uses the progressive projection) and the gerund construction (which uses a main verb’s projection). Note that in either case the verb in the Prog(ressive) node or the V(erb) node is some form of the verb “be” (e.g. “is”, “were”, “are”, “be”). In this paper we ignore potential ambiguity with passive constructions, which also use this auxiliary but whose main verb is encoded as a past participle.
Figure 2: Parse tree for the sentence “My favorite activity was singing.”
Figure 3: Parse tree for the sentence “My favorite cousin was singing.”
The corpus
In order to ground this work in actual language use, we began by collecting data from available corpus and treebank database resources. We resolved to first consult treebanks, because of their extensive syntactic annotation. We thus used the tgrep2 tool to retrieve relevant sentences from the Penn Treebank (PTB) (Marcus, Santorini, Marcinkiewicz, & Taylor, 1999). It was readily apparent that examples of the progressive construction were numerous (over 4500 sentences); see Figure 4 for the relevant tgrep2 queries.
On the other hand, the PTB’s yield of copula+gerund constructions was surprisingly meager, resulting in only a handful of instances such as those in Figure 5¹. In fact, each of these is arguably not even of the copula+gerund type, since the “-ing” word is a present participle filling the role of an attributive adjective to another noun.
One reason is mounting competition from new Japanese car plants .
The main reason for the production decline is shrinking output of light crude .
Mr. Leinberger is managing partner of a real-estate advisory firm .
Conference committees are breeding grounds for mischief .
Figure 5: Sample non-progressive sentences collected from the Penn Treebank.
¹ These sentences have had some unnecessary material removed for clarity.
Since copula+gerund examples were in such short supply in PTB, we resolved to consult three other corpora to find and extract sentences with this type of construction:
• The British National Corpus (Davies, 2004) was originally created in the late 1980s by Oxford University Press and contains 100 million words of English. The BYU interface to the British National Corpus allows for many types of lexical investigation including word and phrase lookup and limited wildcard matching.
• The Corpus of Contemporary American English (COCA) (Davies, 2009) is the largest freely-available online database of English, and the only large and balanced corpus of American English. It contains more than 410 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts, including 20 million words each year from 1990-2010.
• The Corpus of Historical American English (COHA) (Davies, 2010) is the largest structured online database of historical English. This corpus consists of 400 million words in texts from 1810 to 2009.
Though these databases do not have full treebank-style syntactic parses, their words are tagged for part of speech (POS) and have interfaces that support querying for limited sequences of tags, words, and other similar information. Finding sentences with progressive tenses was straightforward; there were tens of thousands of such instances. Again, the copula+gerund examples were not as prevalent.
Part of this is due to POS tagging errors, which for progressives and gerunds are particularly common. For example, in the COCA corpus a sequence of noun plus copula plus gerund is coded as:
[n*] [vb*] *ing.[n*]
but it erroneously returns sequences like “heart is pounding”, “storm is brewing”, “people are spending”, and so forth. This demonstrates the difficulty of processing this type of construction and the limited state of the art in addressing this syntactic ambiguity type.
In this effort we excluded sentences where other factors complicated the syntactic analysis and disambiguation strategy; in particular, we leave for future work sentences where:
• potential ambiguity exists between contracted auxiliaries and the possessive:
The court’s reasoning that the two jurisdictions are sovereign...
• multi-word expressions and fixed phrases mask syntactic structure:
Chances are homebuilding will require...
• the post-copula gerund forms the initial part of a nominal compound:
On the floor were sleeping bags, knapsacks, ...
• the subject is a pronoun and hence determining reference is difficult:
...some were counseling caution...
...others are shuffling...
• the mood is interrogative:
In which country was fencing originated?
Figure 6 shows some copula+gerund sentences which have been matched and extracted from the corpora mentioned earlier. In total we gathered about 80 copula+gerund sentences and, naturally, thousands of progressives.
Rosemary’s hobbies are gardening, walking, ...
The charges are kidnapping, torture, ...
Examples of such activities are mountaineering, rock climbing, ...
In tundra and taiga the natural resources are fishing and hunting.
Included in our amusements are gardening, riding, shooting, ...
Common treatments have been counseling and psychotherapy.
Part of the issue is timing.
Bowling is bowling, and I know what it takes.
The central fact is that cloning is cloning is cloning.
His passion is fishing.
The bottleneck is training and finance.
Betting is gambling.
My talent is gambling.
Our theme was gardening ...
My passion was teaching.
His one good distraction was gardening.
His main business was ranching.
His favorite recreations were gardening and fishing.
Figure 6: Example sentences of the copula+gerund type mined from several corpora and annotated.
In most cases resolution of the ambiguity is fairly straightforward for humans: most are clearly progressive or clearly copula+gerund. However, some sentences illustrate that indeterminism does exist, especially at the sentence level. For example, the sentence:
His business is advertising.
could be construed as involving either construction, depending on wider context.
Executing disambiguation decisions
We next discuss how we used the corpus data along with the Soar system enhancements to allow the natural language agent to better handle the ambiguity type in question.
As mentioned above, the corpus collection process revealed a large discrepancy in the ratio of progressive sentences to copula+gerund ones. Hence from a global perspective a rational choice, and the most productive baseline, would be to assume that all such sentences are of the progressive type. However, since the parser is incremental, at the time it receives the word “was” it doesn’t know which construction will emerge as other words are added. Instead, the default processing step is to assume that “was” is a main verb, since sentences with “was” as a main verb usually outnumber sentences with progressive forms.
Now sometimes the next word, the -ing word, is unambiguous. If it’s an unambiguous noun, the system will link in the noun as the object of the main verb and the sentence terminates with a successful parse. This case is uninteresting for the work reported here and will not be further mentioned. Similarly, if the -ing word is unambiguously a verb, the system recognizes the inconsistency with its temporary assumption and “snips” the main verb node from the tree, repairing the construction by replacing it with the progressive auxiliary node. Then the -ing form can be incorporated into the tree as its complement, and the sentence terminates successfully. Again, since there is no ambiguity in the -ing word, the case is not directly interesting to us for this paper.
What is relevant for us in this paper is when the -ing word is ambiguous between a noun and a verb, for example the words “fishing” or “golfing”. When the system parses sentences like “His hobby was fishing.”, it will parse correctly using the default strategy since the assumption that “was” is a main verb was appropriate. However, the system would parse a sentence like “The president was golfing.” in exactly the same way; the result would be an incorrect copula construction, whereas the correct construal would be a progressive construction.
What was needed was a way for the agent to encapsulate
and access information that would help identify, when confronted with an incoming -ing word, when to proceed with
one parse versus the other.
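The default-and-repair behavior described above can be sketched as follows. This is only an illustration of the control flow, not the agent's Soar operators: the function name, the dictionary-based tree, and the tag sets are all hypothetical.

```python
def parse_be_ing(be_word, ing_word, ing_tags):
    """Illustrative control flow for '<form of be> <-ing word>'.

    ing_tags is the set of parts of speech the lexicon allows for the
    -ing word, e.g. {"noun"}, {"verb"}, or {"noun", "verb"}.
    Returns (tree, repaired)."""
    # Default step: on reading the form of "be", assume it is a main verb.
    tree = {"head": ("V", be_word), "complement": None}

    if ing_tags == {"noun"}:
        # Unambiguous noun: attach it as the complement of the main verb.
        tree["complement"] = ("N", ing_word)
        return tree, False

    if ing_tags == {"verb"}:
        # Unambiguous verb: "snip" the main-verb node and rebuild the
        # structure around a progressive auxiliary node instead.
        tree = {"head": ("Prog", be_word), "complement": ("V", ing_word)}
        return tree, True

    # Ambiguous -ing word: with no further knowledge the default stands,
    # which is right for "hobby ... fishing" but wrong for "president ... golfing".
    tree["complement"] = ("N", ing_word)
    return tree, False


hobby_tree, _ = parse_be_ing("was", "fishing", {"noun", "verb"})
golf_tree, _ = parse_be_ing("was", "golfing", {"noun", "verb"})
sing_tree, sing_repaired = parse_be_ing("was", "humming", {"verb"})
```

In this sketch the two ambiguous inputs receive the same copula structure, which is exactly the gap that additional knowledge has to fill.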
In order to process copula+gerund constructions, though,
we need to rely on further information. First, we converted
all of the copula+gerund corpus examples into data that could
be loaded into semantic memory. This allows the agent to
query semantic memory when deciding which alternative to
pursue. In particular, given word1 (the subject) and word2
(the gerund), if a match is found, the copula+gerund construal
is preferred over the progressive one.
If no match is found, the agent tries for a partial match by
abstracting away from the two lexical items themselves and
querying instead based on their WordNet semantic classes.
Thus if semantic memory contains a match for “activity”
and “reading” (such as from the sentence “My favorite
activity was reading.”), but the agent is processing a sentence
like “My favorite pastime was reading.”, a partial match will
result since “activity” and “pastime” are close semantically.
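The exact and partial lookups just described might look like the following minimal sketch. A Python set stands in for the agent's semantic memory and a small hand-written dictionary stands in for WordNet semantic classes; both, along with every name below, are assumptions made for illustration only.

```python
# Hypothetical store of (subject, gerund) pairs taken from copula+gerund examples.
SMEM = {("activity", "reading"), ("hobby", "fishing")}

# Hypothetical coarse semantic classes standing in for WordNet classes.
SEM_CLASS = {
    "activity": "act", "pastime": "act", "hobby": "act",
    "reading": "act", "fishing": "act", "golfing": "act",
    "president": "person",
}


def lookup(subject, gerund):
    """Return "exact", "partial", or None for a subject/gerund pair."""
    # 1. Exact match on the two lexical items.
    if (subject, gerund) in SMEM:
        return "exact"
    # 2. Partial match: abstract both words to their semantic classes
    #    and compare against the classes of the stored pairs.
    target = (SEM_CLASS.get(subject), SEM_CLASS.get(gerund))
    if None in target:
        return None
    for stored_subject, stored_gerund in SMEM:
        stored = (SEM_CLASS.get(stored_subject), SEM_CLASS.get(stored_gerund))
        if stored == target:
            return "partial"
    return None
```

With this toy data, the pair ("pastime", "reading") finds no exact entry but matches the stored ("activity", "reading") at the class level, while ("president", "golfing") matches nothing and falls through to whatever strategy comes next.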
Finally, if no exact or partial matches are found in semantic memory based on lexical or WordNet semantic class features, the agent executes one more strategy. A similarity metric is generated for the subject/gerund word pair using the Java WordNet::Similarity implementation². The program provides similarity measurements given by several algorithms, so based on our corpus collection, and with the aid of the Eureqa software tool³, we determined the best metric for our disambiguation procedure. The Resnik measurement (Resnik, 1995) correlated most highly with the data. In particular, to assess its usefulness we used a training corpus of 62 copula+gerund sentences and 86 progressive sentences. The best threshold value was 1.73; if the Resnik similarity value is greater than that, we mark the sentence as copula+gerund, otherwise as progressive. We tested the result on a corpus of 54 progressives and 38 copula+gerunds, and it had an error of 0.197. This function was then implemented in Java to provide the agent a means of identifying copula+gerund instances not yet present in semantic memory. The result is then stored in semantic memory for future access when pertinent.
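As a rough illustration of this last step, the sketch below computes a Resnik-style score, the information content of the most informative concept subsuming both words, over a tiny invented taxonomy and applies the 1.73 threshold reported above. The taxonomy, the concept probabilities, the use of the natural logarithm, and all names are assumptions; actual scores depend on WordNet and on the corpus used to estimate information content, so the toy numbers are not comparable to the paper's.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and concept probabilities.
PARENT = {"entity": None, "act": "entity", "person": "entity",
          "recreation": "act", "hobby": "recreation",
          "fishing": "recreation", "golfing": "recreation",
          "president": "person"}
PROB = {"entity": 1.0, "act": 0.4, "person": 0.3, "recreation": 0.05,
        "hobby": 0.01, "fishing": 0.01, "golfing": 0.01, "president": 0.02}

THRESHOLD = 1.73   # threshold value reported in the text
CACHE = {}         # stands in for storing results in semantic memory


def ancestors(concept):
    """The concept itself plus every concept above it in the taxonomy."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def resnik(a, b):
    """Information content, -log p(c), of the most informative common subsumer."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(-math.log(PROB[c]) for c in common)


def decide(subject, gerund):
    """Label a pair and remember the decision for later lookups."""
    key = (subject, gerund)
    if key not in CACHE:
        above = resnik(subject, gerund) > THRESHOLD
        CACHE[key] = "copula+gerund" if above else "progressive"
    return CACHE[key]
```

In this toy setting "hobby" and "fishing" share the fairly specific concept "recreation" and score above the threshold, whereas "president" and "golfing" share only the root and score zero, so the pair is labeled progressive.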
Evaluation and results
For our first evaluation of the system we created a training set of some 430 unique sentences from the corpora listed above; 372 were progressive forms, and 61 were copula forms. We ran the system on a simplified version of these sentences, removing adjuncts and other material not crucial to the determination of the attachment site in incremental parsing for the ambiguity we address in this paper. The system performed very well, as expected, for this set; most parsing failures were due to words missing in WordNet and hence not available to the system for subsequent processing: “telecommuting” or “face-painting” as verbs, for example.
We also created a test corpus of 92 sentences, 52 with progressive forms and 40 with copula forms. These were extracted from WordNet sample sentences from the verb and noun databases. Table 1 shows the results from this test. The forcenoun strategy forces the copula+gerund interpretation; forceverb forces the progressive construal.
Strategy    Correct  Wrong  Failed
Forcenoun   39       49     4
Forceverb   52       36     4
Resnik      52       36     4
Random      51       37     4
Smem        53       35     4
Table 1: System results on 92-sentence test set.
Performance of the system dropped slightly on this unseen data: missing vocabulary and ambiguity elsewhere in the sentence contributed to failed parses. Using the Resnik metric for attachment decisions improved performance in some cases: for example, there were several tautological sentences like “Bowling is bowling”, “Cloning is cloning”, and “Beekeeping is beekeeping” where semantic similarity (namely identity) was detected by the measure, resulting in correctly licensing the copula. Querying directly against semantic memory did not noticeably improve results since coverage was sparse and there was little overlap between the training data and testing data. With more extensive training data we expect this to improve. This was also the case for the hypernyms in semantic memory.
Finally, several of the test sentences had pronominal subjects, which contribute little semantic information and hence are not conducive to the semantic relatedness tests we have so far developed. This matter will require further study.
² Downloaded from http://www.cogs.susx.ac.uk/users/drh21/.
³ Available for download at: http://ccsl.mae.cornell.edu/eureqa.
Conclusion and future work
In this paper we have sketched an approach for disambiguating a problematic syntactic construction. We first extracted examples from sizable corpora that we have subsequently annotated. We then converted the instances to knowledge that was uploaded into our agent’s semantic memory. Results were modest, but further work will involve developing more extensive knowledge sources via other corpus resources.
Though we have concentrated only on the progressive vs. copula+gerund ambiguity, related difficulties abound, especially when semantics is also implicated, as it is in our agent. We have mentioned earlier several phenomena that we have left beyond the scope of this paper which nevertheless must be addressed to assure robust treatment.
There are still other closely related constructions. Consider, for example, the following sentence:
Twenty minutes after they’ve arrived, chicken is frying, lamb ragout is simmering, ...
In this sentence “chicken” is syntactically the subject of a progressive construction. However, it is the object of the verb; this is often called an unaccusative or middle verb construction. A syntax/semantics mismatch like this must be detected if the sentence is to be understood correctly. We intend to use the techniques sketched in this paper, namely semantic memory and semantic distance, to tease apart unaccusatives from more standard (i.e. unergative) predicates.
Acknowledgments
We would like to thank Erika Hunt and Jason Housley for linguistic and corpus support.
References
Adger, D. (2003). Core Syntax: A Minimalist Approach. Oxford: Oxford University Press.
Chomsky, N. (1995). The Minimalist Program. Cambridge, MA: MIT Press.
Davies, M. (2004). BYU-BNC: The British National Corpus. (Available online at http://corpus.byu.edu/bnc)
Davies, M. (2009). The 385+ Million Word Corpus of Contemporary American English (1990-2008+): Design, Architecture, and Linguistic Insights. International Journal of Corpus Linguistics, 14, 159-90.
Davies, M. (2010). The Corpus of Historical American English (COHA): 400+ million words, 1810-2009. (Available online at http://corpus.byu.edu/coha)
Derbinsky, N., & Laird, J. E. (2010). Extending Soar with Dissociated Symbolic Memories. (Symposium on Human Memory for Artificial Agents, AISB (2010))
Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Green, N., & Lehman, J. F. (2002). An integrated discourse recipe-based model for task-oriented dialogue. Discourse Processes, 33(2), 133-158.
Hindle, D., & Rooth, M. (1993). Structural ambiguity and lexical relations. Computational Linguistics, 19(1), 103-120.
Laird, J. E. (1984). Universal subgoaling. Unpublished doctoral dissertation, Carnegie Mellon University. (Available as Carnegie Mellon University Computer Science Technical Report 83-138.)
Laird, J. E. (2012). The Soar Cognitive Architecture. MIT Press.
Lewis, R. (1993). An architecturally-based theory of human sentence comprehension. Unpublished doctoral dissertation, Carnegie Mellon.
Lonsdale, D. (2006). Learning in minimalism-based language modeling. In Proceedings of the 28th Annual Meeting of the Cognitive Science Society (p. 2651). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Lonsdale, D., & Manookin, M. (2004). Combining learning approaches for incremental on-line parsing. In Proceedings of the Sixth International Conference on Cognitive Modeling (ICCM 2004) (p. 160-165). Lawrence Erlbaum Associates.
Lonsdale, D., McGhee, J., Anderson, T., & Glenn, N. (2011). Resolving a syntactic ambiguity type with semantic memory. In Proceedings of the 20th Behavior Representation in Modeling & Simulation (BRIMS) Conference (p. 288-289).
Marcus, M., Santorini, B., Marcinkiewicz, M., & Taylor, A. (1999). Treebank-3.
Merlo, P., & Ferrer, E. E. (2006). The notion of argument in PP attachment. Computational Linguistics, 32(3), 1–35.
Newell, A. (1990). Unified theories of cognition. Cambridge, MA: Harvard University Press.
Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (p. 448-453).
Rubinoff, R., & Lehman, J. F. (1994). Real-time natural language generation in NL-Soar. In Proceedings of the Seventh International Workshop on Natural Language Generation.
Rytting, C. A., & Lonsdale, D. (2005). An operator-based account of semantic processing. In V. P. Alessandro Lenci Simonetta Montemagni (Ed.), Acquisition and representation of word meaning: theoretical and computational perspectives (Vol. XXII-XXIII, p. 117-137). Pisa/Rome: Istituti editoriali e poligrafici internazionali.
Skousen, R. (1989). Analogical modeling of language. Dordrecht: Kluwer.