Anaphora Resolution for Question Answering

by

Luciano Castagnola

Submitted to the Department of Electrical Engineering and Computer Science on May 24, 2002, in partial fulfillment of the requirements for the degrees of Bachelor of Science in Computer Science and Engineering and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, June 2002.

(c) Massachusetts Institute of Technology 2002. All rights reserved.

Certified by: Boris Katz, Principal Research Scientist, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Abstract

Anaphora is a major phenomenon of natural language, and anaphora resolution is one of the important problems in Natural Language Understanding. In order to analyze text for content, it is important to understand what pronouns (and other referring expressions) refer to. This is especially important in the context of Question Answering, where questions and information sources are analyzed for content in order to provide precise answers, unlike keyword searches. This thesis describes BRANQA, an anaphora resolution tool built to improve the performance of Question Answering systems. It resolves pronoun references using syntactic analysis and high-precision heuristic rules.
BRANQA serves as an infrastructure for experimentation with different resolution strategies and will enable evaluation of the benefits of anaphora resolution for Question Answering. We evaluated BRANQA's performance and found it to be comparable to that of other systems in the literature.

Acknowledgments

I am grateful to Sue Felshin and Greg Marton for valuable comments in the preparation of this document. I thank Ali Ibrahim and Greg for their help during the development of the system. I thank Boris Katz for his support and patience throughout these years. I thank Patrick Winston for his constant encouragement and extraordinary generosity.

Contents

1 Introduction
  1.1 What is anaphora?
  1.2 Question Answering
  1.3 Resolving Pronouns for Question Answering
  1.4 Outline
2 Anaphora Resolution
  2.1 Overview of Pronominal Anaphora Resolution
    2.1.1 Two Stages
    2.1.2 Constraints
    2.1.3 Preferences
    2.1.4 Computational Strategies
  2.2 Government and Binding Theory
  2.3 Prior Work
    2.3.1 RAP
    2.3.2 CogNIAC
3 System Architecture
  3.1 Link Parser
    3.1.1 Link Grammar
    3.1.2 Constituent Structure
  3.2 Noun Phrase Categorization
    3.2.1 Heads and Subject
    3.2.2 Valid References
  3.3 Named Entity Recognition
  3.4 Coreference Module
    3.4.1 Pleonastic pronoun detector
    3.4.2 Syntactic filter
    3.4.3 Resolution procedure
4 Evaluation
  4.1 MUC-7 Coreference Task Corpus
  4.2 Test Procedure
  4.3 Results
  4.4 Effect on Question Answering
5 Future Work
  5.1 Improvements
    5.1.1 Quoted Speech
    5.1.2 Named Entity Module
    5.1.3 Syntactic Filter
  5.2 Future research projects
    5.2.1 Statistics as a proxy for world knowledge
    5.2.2 Alternative resolution procedures
    5.2.3 Integration with other systems
6 Contributions

List of Figures

2-1 Binding Theory Examples
2-2 Examples of disjoint reference
2-3 RAP's pleonastic pronoun detector
3-1 Overall Architecture
3-2 Link Parser Output Example
3-3 Problems assigning constituent structure to conjunctions
3-4 Noun Phrase Table
3-5 Resolution rules in BRANQA

List of Tables

4.1 Test Results by Rule
4.2 Test Results by Pronoun

Chapter 1
Introduction

In this chapter I present the motivation behind the development of BRANQA (BRANQA Resolves ANaphors for Question Answering), an anaphora resolution tool.

1.1 What is anaphora?

Anaphora is reference to entities mentioned previously in the discourse. The referring expression is called an anaphor, and the entity to which it refers, or binds, is its referent or antecedent. Anaphora resolution is the process of finding an anaphor's antecedent. For example: "The car is falling apart, but it still works." Here "it" is the anaphor and "The car" is the antecedent. This is an example of pronominal anaphora, anaphora in which the anaphor is a pronoun. It is the most common type of anaphora and will be the focus of this thesis. Other kinds of anaphora are definite noun phrase anaphora and one-anaphora:

President George Bush signed (...) The president...
If you don't like the coat, you can choose another one.

In the first sentence "The president" is the anaphor and "President George Bush" is the antecedent. In the second, "one" is the anaphor and "the coat" the antecedent. When the anaphor is in the same sentence as the antecedent, it is called an intrasentential anaphor; otherwise it is an intersentential anaphor.

1.2 Question Answering

The InfoLab Group at MIT's AI Lab has developed systems that attempt to solve the problem of information access.
The belief that natural language is the easiest way for humans to request information has led the group to work on question answering systems. The START (SynTactic Analysis using Reversible Transformations) system [17, 18] provides multimedia access using natural language. It has been available to answer questions on the World Wide Web (http://www.ai.mit.edu/projects/infolab) since December 1993. Since it came online, it has answered millions of questions for hundreds of thousands of people all over the world, providing users with knowledge about geography, presidents of the U.S., movies, and many other areas.

The START system strives to deliver "just the right information" in response to a query. Unlike Web search engines, START does not reply with long lists of documents that might contain an answer to our question; it provides the actual answer we are looking for, in the form of a short information segment (e.g., an English sentence, a graph, a picture) rather than an entire document.

START has been very successful in its interaction with users, but its domain of knowledge is fairly limited, and expanding its knowledge base requires human effort. It works extremely well within the domains it handles, but any question outside its knowledge base will get a reply from START saying it does not know how to answer it. In response to this problem, the InfoLab Group began to work on systems with less stringent requirements with respect to both returning correct answers and delivering "just the right information". These systems lie somewhere along the spectrum between information retrieval engines like Altavista or Google (http://www.altavista.com, http://www.google.com) at one end and natural language systems like START at the other. They are linguistically informed search engines, which use natural language tools to return less irrelevant information than traditional search engines do.
One of these systems, Sapere [22], indexes relations between words to allow it to search for information in a smart way. By storing relations like subject-verb-object, it can distinguish between cases that the simple "bag of words" approach would confuse. For example, in response to the question "When did France attack England?" Sapere will not return the sentences "England and France attacked China in 1857", "England attacked France", or "France was attacked by England", since the crucial relation France-attack-England is missing in all of them. The "bag of words" approach treats documents as sets of keyword counts, and would thus consider "England attacked France" to be equivalent to "France attacked England".

1.3 Resolving Pronouns for Question Answering

Underlying the motivation for this project is a desire to improve the performance of the InfoLab Group's question answering systems. START, Sapere, and future group projects can all benefit from a pronominal anaphora resolution tool.

As mentioned above, Sapere indexes relations as part of its linguistically informed information retrieval approach. The analysis, indexing and retrieval are all done at the sentence level, and this makes the resolution of anaphors very important. Without resolving what a pronoun refers to, relations involving that pronoun are not useful for retrieval. After reading "The first seven attempts to climb Mount Everest were unsuccessful. Edmund Hillary climbed it in 1953..." we cannot answer "Who climbed Mount Everest for the first time?" unless we find that "Mount Everest" is the antecedent of "it". Adding a pronominal anaphora resolution module to Sapere should increase the number of questions it can answer about a given corpus; resolving pronouns should increase its recall by raising the number of useful relations that are indexed. (Recall is the ratio of correct answers found to correct answers in the corpus; precision is the ratio of correct answers to answers given, correct and incorrect.)
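The benefit of relation indexing over keyword counting can be illustrated with a small sketch. This is not Sapere's actual code (Sapere obtains its relations from a parser); the class and data below are hypothetical, shown only to make the France/England distinction concrete.

```python
# Sketch of relation-based indexing: storing subject-verb-object triples lets
# retrieval distinguish "France attacked England" from "England attacked
# France", which a bag-of-words index treats as identical.
from collections import defaultdict

class RelationIndex:
    def __init__(self):
        # (subject, verb, object) -> list of sentences containing that relation
        self.index = defaultdict(list)

    def add(self, sentence, subj, verb, obj):
        # In a real system the triple would be extracted by a parser.
        self.index[(subj, verb, obj)].append(sentence)

    def query(self, subj, verb, obj):
        return self.index.get((subj, verb, obj), [])

idx = RelationIndex()
idx.add("England attacked France", "England", "attack", "France")
idx.add("France attacked England in 1803", "France", "attack", "England")

# Only the sentence with the France-attack-England relation is returned.
print(idx.query("France", "attack", "England"))
```

A bag-of-words index would return both sentences for either query; the triple key makes the argument order part of the lookup.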
START also stands to benefit from the availability of an anaphora resolution module. One feature of START that is not currently being used is the ability to keep track of threads of conversation with different users. Enabling this feature could allow more interesting interaction with users, turning sessions into dialogues rather than series of disconnected question/answer pairs. In this mode of operation, pronominal anaphora resolution would become very important, since it would let users refer to entities introduced in previous sentences much more naturally. Currently START handles only the simplest cases of pronominal anaphora: if there is exactly one possible antecedent for a pronoun that passes the gender and number agreement test, START will resolve it; otherwise it will ask the user to clarify. This is the most conservative approach to anaphora resolution (as long as the gender and number of entities are identified correctly, no mistakes will be made), but it leads to unnatural conversation in many cases where the pronoun could be resolved with high confidence. We believe that a good pronominal anaphora resolution tool would lead to improved user interaction, an important goal of the START system.

More traditional approaches to information retrieval also stand to gain from anaphora resolution [29]. It can even help systems based on the "bag of words" scheme, where pronouns should raise the counts of their antecedents. Thus, future group projects in this direction could also profit from this technology.

This thesis presents the design and evaluation of BRANQA, a system motivated by the benefits that question answering could reap from anaphora resolution.
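START's conservative policy, as described above, can be sketched in a few lines. This is an illustration of the described behavior, not START's implementation; the feature representation is hypothetical.

```python
# Sketch of a maximally conservative resolution policy: resolve a pronoun only
# when exactly one candidate antecedent passes the gender/number agreement
# test; otherwise give up and ask the user to clarify.
def conservative_resolve(pronoun_features, candidates):
    """candidates: list of (phrase, features) pairs; features = (gender, number)."""
    agreeing = [phrase for phrase, feats in candidates if feats == pronoun_features]
    if len(agreeing) == 1:
        return agreeing[0]
    return None  # ambiguous or no candidate: request clarification instead

candidates = [("Mary", ("fem", "sing")), ("John", ("masc", "sing"))]
print(conservative_resolve(("fem", "sing"), candidates))  # unique match: Mary

candidates.append(("Sue", ("fem", "sing")))
print(conservative_resolve(("fem", "sing"), candidates))  # two matches: None
```

The policy never guesses wrong (given correct features), which is exactly why it produces the unnatural clarification requests the text mentions.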
1.4 Outline

The rest of the document is organized as follows:

* Chapter 2 introduces the background work on which the system is based.
* Chapter 3 describes the system's architecture and how it works.
* Chapter 4 presents an evaluation of the system.
* Chapter 5 lists improvements to be made in the near future, together with research projects suggested by this work.
* Chapter 6 summarizes the contributions made in developing this thesis.

Chapter 2
Anaphora Resolution

In this chapter I present the ground on which this thesis rests.

2.1 Overview of Pronominal Anaphora Resolution

When encountering a pronoun in a text, how can one tell what it refers to? The literature shows a wide variety of approaches to this problem. Mitkov [27] provides an excellent overview of the state of the art in anaphora resolution, parts of which I summarize briefly in this section.

2.1.1 Two Stages

The process of finding the expression to which a pronoun refers can be split into two tasks: finding a set of plausible referents, and picking a "best" element from the set.

The first task is complicated by the many different types of reference that pronouns take part in. Pronouns can refer to noun phrases, verb phrases, clauses, sentences or even whole paragraphs. For example, in "Mary ran ten miles yesterday. She liked it very much", "she" refers to the noun phrase "Mary", and "it" refers to the verb phrase "ran ten miles yesterday". Pronouns can also lack referents. This is the case with pleonastic (alternatively non-referential or semantically empty) pronouns, as in "It is raining" or "It seems John is unhappy."

Additionally, the referent can be mentioned before or after the pronoun. If the referent is mentioned first, the usual situation, the referent is the antecedent of an anaphor. If the pronoun comes first, the reference is called cataphora and the pronoun is a cataphor. An example of a cataphoric relation is "When he woke up, John was drenched in sweat."
An ideal system for the resolution of pronouns would have to handle all of these kinds of reference, which would involve searching over all possible referents (noun phrases, verb phrases, etc.) both before and after the pronoun. In practice, the scope of systems is usually reduced to the detection of pleonastic pronouns and the resolution of anaphors with noun phrase antecedents, both because of the complexity of handling the general case and because these are the most common uses of pronouns. In a system with this focus, the first step in resolving a pronoun is to determine whether it is pleonastic, and if not, to identify all noun phrases occurring before it as possible antecedents.

Once the set of potential antecedents is determined, a number of "resolution factors" are used to track down the correct antecedent. Factors frequently used in the resolution process include gender and number agreement, syntactic binding restrictions, semantic consistency, syntactic parallelism, semantic parallelism, salience, and proximity. These factors can be divided into constraints (or eliminating factors), which must hold, and preferences, which are used to rank candidates.

2.1.2 Constraints

Constraints control what an anaphor can refer to. They are conditions that always need to hold for reference to be valid, and can thus be used to remove implausible candidates from the list of possible antecedents. Examples of constraints are gender and number agreement: anaphors and their antecedents must always agree in number and gender. (Note: collective nouns like "government" and "team" can be referred to by "they", and plural nouns like "data" can be referred to by "it"; the definition of number is complicated in these cases.) Some constraints are given by syntactic theories like Government and Binding Theory, which specifies binding restrictions (see Section 2.2). Other constraints are given by semantics.
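Constraints act as composable eliminating filters over the candidate list. The sketch below illustrates this with agreement constraints only; the feature values and toy data are illustrative, not BRANQA's representation, and (per the collective-noun caveat above) assigning features correctly is itself a hard subproblem.

```python
# Sketch of constraints as eliminating filters: each constraint removes
# incompatible candidates, and the filters compose in sequence.
def number_agreement(pronoun, candidate):
    return pronoun["number"] == candidate["number"]

def gender_agreement(pronoun, candidate):
    # None means the pronoun does not constrain gender (e.g. "they").
    return pronoun["gender"] in (None, candidate["gender"])

CONSTRAINTS = [number_agreement, gender_agreement]

def apply_constraints(pronoun, candidates):
    return [c for c in candidates
            if all(check(pronoun, c) for check in CONSTRAINTS)]

she = {"gender": "fem", "number": "sing"}
candidates = [
    {"text": "Mary",     "gender": "fem",  "number": "sing"},
    {"text": "the boys", "gender": "masc", "number": "plur"},
    {"text": "the car",  "gender": "neut", "number": "sing"},
]
print([c["text"] for c in apply_constraints(she, candidates)])  # ['Mary']
```

Further constraints (e.g. a syntactic binding filter) would simply be appended to the `CONSTRAINTS` list; preferences, by contrast, would rank rather than remove the survivors.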
Although it is beyond current natural language technology to understand open domain texts, statistics can be used as a proxy for semantic knowledge. In the two examples below, the frequency of co-occurrence of words could be used to disambiguate the anaphors:

Joe removed the diskette from the computer and disconnected it_i. ["it" = the computer]
Joe removed the diskette from the computer and copied it_i. ["it" = the diskette]

Ge, Hale and Charniak [11] present a successful statistical approach to the resolution of pronouns, consisting of a probabilistic model trained on a small subset of the Penn Treebank corpus.

2.1.3 Preferences

Preferences, as opposed to constraints, are not obligatory conditions and therefore do not always hold. They are criteria that can be used to rank the possible antecedents. Among preferences, Mitkov lists syntactic parallelism, semantic parallelism and centering.

Syntactic parallelism gives preference to noun phrases with the same syntactic function as the anaphor. For example:

The programmer successfully combined Prolog_i with C, but he had combined it_i with Pascal last time.
The programmer successfully combined Prolog with C_j, but he had combined Pascal with it_j last time.

Similarly, semantic parallelism says that noun phrases which have the same semantic role as the anaphor are favoured:

Vincent gave the diskette to Sody_i. Kim also gave him_i a letter.
Vincent_i gave the diskette to Sody. Kim got a letter from him_i too.

Syntactic and semantic criteria are not always sufficient to choose among a set of candidates. These criteria are usually used as filters to eliminate unsuitable candidates; after that, the most salient element among the remaining noun phrases is selected. This most salient element is referred to as the focus [31] or center [12]. Mitkov uses the following example to illustrate this concept:

Jenny put the cup on the plate and broke it.

Here the meaning of "it" is ambiguous; its antecedent could be "the cup" or "the plate".
However, context can help disambiguate the reference:

Jenny went window shopping yesterday and spotted a nice cup. She wanted to buy it, but she had no money with her. The following day, she went to the shop and bought the coveted cup. However, once back home and in her kitchen, she put the cup on the plate and broke it.

Now "the cup" is the most salient entity, the center of attention throughout the paragraph, and it is preferred over "the plate" as an antecedent for "it". This example illustrates the important role of tracking the center/focus in anaphora resolution. After "filtering" unsuitable candidates, the final choice is made by determining which of the candidates seems to be the center. Various methods have been proposed for center/focus tracking [5, 9, 25, 32, 36].

2.1.4 Computational Strategies

The traditional approach to anaphora resolution is to eliminate unlikely candidates until a minimal set of plausible candidates is obtained, and then make use of preferences to choose among them. Other approaches compute the most likely candidate on the basis of statistical or "AI" techniques (Mitkov mentions uncertainty-reasoning methods as an example of these techniques). In these "alternative" systems the concept of constraint might disappear, and all resolution factors might be considered preferences whose weights get updated through "AI" techniques. Mitkov [26] compares a traditional and an "alternative" approach using the same set of anaphora resolution factors.

2.2 Government and Binding Theory

Government and Binding Theory is a version of Chomsky's theory of universal grammar, named after his Lectures on Government and Binding [7]. One of its components, Binding Theory, explains the behavior of intrasentential anaphora. The theory explains when an anaphor can bind to a noun phrase based on their relative positions in syntactic structure. The details of the theory are complicated, but the important point for this thesis is that syntax alone can place hard constraints on anaphora, and this can be used to help pick antecedents for anaphors by eliminating syntactically disallowed candidates. For an introductory treatment of Binding Theory see Haegeman [13]. Figure 2-1 shows examples of reference determined valid or invalid by Binding Theory on the basis of syntactic structure.

John_i hit him_j/*i.
John_i hit himself_i/*j.
Lucie_i said [CP that [IP Lili_j hurt herself_j/*i]].
Lucie_i said [CP that [IP Lili_j hurt her_i/k/*j]].
Poirot_i believes [NP John_j's description of himself_j/*i].
Poirot_i believes [NP any description of himself_i/*j].
('*' denotes ungrammatical co-indexings)

Figure 2-1: Binding Theory Examples

2.3 Prior Work

BRANQA is largely based on two prior systems: Lappin and Leass's RAP [19] and Baldwin's CogNIAC [4]. This section presents some of the ideas taken from them.

2.3.1 RAP

RAP (Resolution of Anaphora Procedure) is an algorithm for identifying the noun phrase antecedents of third person pronouns and lexical anaphors (reflexive and reciprocal pronouns). The algorithm applies to the syntactic representations generated by McCord's Slot Grammar parser [23], and relies on salience measures derived from syntactic structure and a simple dynamic model of attentional state. In a blind test on computer manual text containing 360 pronoun occurrences, the system identified the correct antecedent for 86% of them.

RAP contains the following main components:

* An intrasentential syntactic filter for ruling out anaphoric dependence of a pronoun on a noun phrase based on syntactic binding constraints.
* A morphological filter for ruling out anaphoric dependence of a pronoun on a noun phrase due to non-agreement of person, number or gender features.
* A procedure for identifying pleonastic (semantically empty) pronouns.
* An anaphor binding algorithm for identifying the possible antecedent of a lexical anaphor within the same sentence.
* A procedure for assigning values to several salience parameters (grammatical role, parallelism of grammatical roles, frequency of mention, proximity, and sentence recency) for a noun phrase. This procedure employs a grammatical role hierarchy according to which the evaluation rules assign higher salience weights to (i) subject over non-subject noun phrases, (ii) direct objects over indirect objects, (iii) arguments of a verb over adjuncts and objects of prepositional phrase adjuncts of the verb, and (iv) head nouns over complements of head nouns.
* A procedure for identifying anaphorically linked noun phrases as an equivalence class for which a global salience value is computed as the sum of the salience values of its elements.
* A decision procedure for selecting the preferred element of a set of antecedent candidates for a pronoun.

BRANQA's syntactic filter and pleonastic pronoun detector were modeled after the ones in RAP, which I describe below.

Intrasentential Syntactic Filter

RAP's syntactic filter was developed for English Slot Grammar, a kind of dependency-based grammar [24]. Dependency syntax avoids the use of phrase structure or categories; instead it marks syntactic dependencies between the words of a sentence. These are represented by arcs with arrows: X -> Y. We say that Y depends on X, or that X governs Y; X is called the (syntactic) governor of Y and Y the (syntactic) dependent of X. The head of a phrase P is a component of P which governs all other components of P. An argument of X is a necessary dependent of X (e.g., the direct object of a transitive verb) and an adjunct of X is an optional dependent of X (e.g., an adjective modifying a noun).

The filter consists of conditions for non-coreference of a noun phrase and a pronoun within the same sentence. The following terminology is used to state these conditions:

* A phrase P is in the argument domain of a phrase N iff P and N are both arguments of the same head.
* P is in the adjunct domain of N iff N is an argument of a head H, P is the object of a preposition PREP, and PREP is an adjunct of H.
* P is in the NP domain of N iff N is the determiner of a noun Q, and (i) P is an argument or adjunct of Q, or (ii) P is the object of a preposition PREP and PREP is an adjunct of Q.
* A phrase P is contained in a phrase Q iff (i) P is either an argument or an adjunct of Q, i.e., P is immediately contained in Q, or (ii) P is immediately contained in some phrase R, and R is contained in Q.

Given these definitions, the syntactic filter says that a pronoun P is non-coreferential with a (non-reflexive, non-reciprocal) noun phrase N if any of the following hold:

1. P is in the argument domain of N.
2. P is in the adjunct domain of N.
3. P is an argument of a head H, N is not a pronoun, and N is contained in H.
4. P is in the NP domain of N.
5. P is a determiner of a noun Q, and N is contained in Q.

1. She_i likes her_j. John_i seems to want to see him_j.
2. She_i sat near her_j.
3. He_i believes that the man_j is amusing.
4. John_i's portrait of him_j is interesting.
5. His_i portrait of John_j is interesting. His_i description of the portrait by John_j is interesting.

Figure 2-2: Examples of disjoint reference

Pleonastic pronoun detector

RAP attempts to identify non-referential uses of "it" to improve resolution performance. It defines a class of modal adjectives (ModalAdj) containing words like "necessary", "easy" and "advisable", together with their morphological negations, as well as comparative and superlative forms. It also defines a class of cognitive verbs (CogV) like "recommend", "think" and "believe". When "it" is present in one of the constructions in Figure 2-3 it is considered pleonastic. Syntactic variants of these constructions (It is not/may be ModalAdj..., Wouldn't it be ModalAdj..., etc.) are also recognized.
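As a rough illustration of this style of detector, a few of the constructions can be approximated with surface pattern matching. The word lists below are abbreviated and partly hypothetical, and RAP itself operates over parse structure rather than surface strings; this is only a sketch of the idea.

```python
import re

# Rough regex approximation of a pleonastic-"it" detector in the style of
# RAP's construction patterns (word lists abbreviated and hypothetical).
MODAL_ADJ = r"(?:necessary|easy|advisable|possible|important)"
COG_V = r"(?:recommended|thought|believed|expected)"

PLEONASTIC_PATTERNS = [
    rf"\bit is (?:not )?{MODAL_ADJ}(?: that| (?:for \w+ )?to)\b",  # It is ModalAdj that S / to VP
    rf"\bit (?:is|was) {COG_V} that\b",                            # It is CogV-ed that S
    r"\bit (?:seems|appears|follows|means)\b",                     # It seems/appears/... (that) S
    r"\bit is time to\b",                                          # It is time to VP
]

def is_pleonastic(clause):
    clause = clause.lower()
    return any(re.search(p, clause) for p in PLEONASTIC_PATTERNS)

print(is_pleonastic("It seems that John is unhappy"))  # True
print(is_pleonastic("It is necessary to leave now"))   # True
print(is_pleonastic("It still works"))                 # False
```

A surface matcher like this misses variants that a structural test catches (e.g. intervening adverbs), which is one reason RAP states its patterns over parses.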
It is ModalAdj that S
It is ModalAdj (for NP) to VP
It is CogV-past-tense that S
It seems/appears/means/follows (that) S
NP makes/finds it ModalAdj (for NP) to VP
It is time to VP
It is thanks to NP that S

Figure 2-3: RAP's pleonastic pronoun detector

2.3.2 CogNIAC

CogNIAC is a pronoun resolution system that gives more importance to precision than to recall. It resolves the subset of anaphors that do not require general world knowledge or sophisticated linguistic processing for successful resolution. CogNIAC does this by being very sensitive to ambiguity, resolving pronouns only when very high confidence rules have been satisfied.

CogNIAC, like RAP, first eliminates candidate phrases that are incompatible with the anaphor's gender and number or that are ruled out on syntactic grounds (the syntactic constraints used are not mentioned in the paper). The system then evaluates a set of heuristic rules to choose an antecedent or, if no rule is triggered, leaves the pronoun unresolved. The six core rules of CogNIAC are (in order of application):

1. Unique in Discourse: If there is a single possible antecedent i in the preceding portion of the entire discourse, then pick i as the antecedent.
2. Reflexive: Pick the nearest possible antecedent in the preceding portion of the current sentence if the anaphor is a reflexive pronoun.
3. Unique in Current + Prior: If there is a single possible antecedent i in the prior sentence and the preceding portion of the current sentence, then pick i as the antecedent.
4. Possessive Pro: If the anaphor is a possessive pronoun and there is a single exact string match i of the possessive in the prior sentence, then pick i as the antecedent.
5. Unique Current Sentence: If there is a single possible antecedent i in the preceding portion of the current sentence, then pick i as the antecedent.
6.
Unique Subject/ Subject Pronoun: If the subject of the prior sentence contains a single possible antecedent i, and the anaphor is the subject of the current sentence, then pick i as the antecedent. 20 In the first experiment reported, CogNIAC was tested on 298 third person singular pronouns in narrative texts about two same-gender people (chosen to maximize ambiguity). It achieved a precision of 92% and recall of 64%. In a second experiment, CogNIAC was tested on the articles used in the MUC-6 coreference task [30]. The system underwent some changes in preparation for MUC-6, both because CogNIAC was now being used as part of a larger system, and because the domain of the MUC-6 documents was different from the narrative. Rule 4 was eliminated because it did not seem appropriate for the domain. Additions were made to process quoted speech in a limited fashion (the specific additions were not presented in the paper). A rule was added to search back for a unique antecedent through the text looking backwards at progressively larger portions of the text. A new pattern was added which selected the subject of the immediately surrounding clause. A pleonastic it detector was also implemented. After these changes, CogNIAC achieved 73% precision and 75% recall on fifteen MUC-6 documents containing 114 pronoun occurrences. 21 Chapter 3 System Architecture As mentioned in Chapter 2, the resolution of anaphoric expressions proceeds in two stages: the identification of a set of plausible antecedents, followed by the selection of the most likely candidate. The overall architecture of the system has two main components, each one dealing with one part of the problem. Link Parser Link Parser Interface Named Entity Module Noun Phrase Coreference Module Pleonastic Pron. 
Figure 3-1: Overall Architecture (the Link Parser and its interface, the Named Entity module, the noun phrase categorization module, and the coreference module with its pleonastic pronoun detector, syntactic filter, and resolution procedure, all operating on the Noun Phrase Table)

The noun phrase categorization tool identifies noun phrases in the input and determines their relevant properties (e.g., gender and number). The coreference module then uses these properties to select an antecedent for each anaphor. The two components interact with the external Link Parser through a wrapper that communicates with the parser and attempts to correct some of its deficiencies. One of the goals in designing the system was to provide a testbed for research in anaphora resolution and its application to question answering. Thus, although the system concentrates on the resolution of pronominal anaphors, the infrastructure is in place for experimentation with coreference in general. The resolution procedure does not depend on the parser's output representation, nor does it depend directly on the linguistic resources used for noun phrase categorization. This modularity allows for easy experimentation with individual parts of the system, for example, evaluating different resolution strategies or different methods for noun phrase categorization. The system was written in Java, except for parts of the wrapper for the Link Parser, which were written in C. The Link Parser is the only external dependency, but the design philosophy allows for easy connection to other systems (e.g., a better Named Entity module). The following sections describe the system's components in more detail.

3.1 Link Parser

After a sentence is submitted to the system, the system parses it. For this task it uses the Link Parser developed at Carnegie Mellon University [1]. The Link Parser is written in generic C code, and runs on any platform with a C compiler. An application program interface (API) makes it easy to incorporate the parser into other applications. The parser has a dictionary of about 60,000 word forms.
[1] http://www.link.cs.cmu.edu

It has coverage of a wide variety of syntactic constructions, including many rare and idiomatic ones. The parser is robust; it is able to skip over portions of the sentence that it cannot understand and assign some structure to the rest of the sentence. It is able to handle unknown vocabulary, and to make intelligent guesses from context and spelling about the syntactic categories of unknown words. It has knowledge of capitalization, numerical expressions, and a variety of punctuation symbols. When several interpretations of a sentence are possible, the parser allows access to all of them, sorted by a measure of how good the parse is (for example, when the parse is not complete, this measure includes the number of words that had to be skipped).

3.1.1 Link Grammar

The parser is based on a formal grammatical system called a link grammar [33, 34]. A link grammar has no concept of constituents or categories (e.g., noun phrase, verb phrase). It contains a set of words (the terminal symbols of the grammar), each of which has a linking requirement. The parser connects the words with links so as to satisfy their linking requirements and the requirement of planarity (that links do not cross each other), a property that holds for most sentences of most natural languages [24]. The linking requirements for the words are specified in a dictionary. They determine what types of links can be used to connect a word to others. The link grammar for English contains more than 100 types of links, each of which specifies a different kind of relation between words. Some of these link types are very useful for the task of pronominal anaphora resolution. For example, an SF link connecting "it" to a verb indicates that this is a non-referential use of "it". Figure 3-2 shows a sample linkage:

    +-SFsi+---Paf--+--THi--+-Cet+-Ss-+---I---+--Os-+
    |     |        |       |    |    |       |     |
    it seemed.v likely.a that.c he would.v kiss.v Mary

Figure 3-2: Link Parser Output Example
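The planarity requirement can be checked mechanically: two links cross exactly when one endpoint of a link lies strictly between the endpoints of another link while its other endpoint lies strictly outside them. The following sketch illustrates this (the representation of links as word-index pairs is an illustration, not the parser's actual data structure):

```java
/** Sketch of the link-grammar planarity requirement: no two links may cross. */
public class Planarity {

    /** Each link is a pair {left, right} of word indices, with left < right. */
    public static boolean isPlanar(int[][] links) {
        for (int a = 0; a < links.length; a++) {
            for (int b = a + 1; b < links.length; b++) {
                int[] x = links[a], y = links[b];
                // Links cross when one endpoint of y lies strictly between
                // x's endpoints and the other lies strictly outside them.
                if ((x[0] < y[0] && y[0] < x[1] && x[1] < y[1])
                        || (y[0] < x[0] && x[0] < y[1] && y[1] < x[1])) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Nested links (a link spanning 1..2 under a link spanning 0..3) are planar.
        int[][] nested = {{0, 3}, {1, 2}};
        // Interleaved links 0-2 and 1-3 cross, violating planarity.
        int[][] crossing = {{0, 2}, {1, 3}};
        System.out.println(isPlanar(nested));   // true
        System.out.println(isPlanar(crossing)); // false
    }
}
```

Links that merely share an endpoint, as adjacent links in Figure 3-2 do, are not counted as crossing, which is why the comparisons above are strict.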
Its focus on dependency relations among words makes the Link Parser great for the task of extracting relations, and the ongoing JLink project at the InfoLab Group is working on that problem. The extracted relations can then be used in our question answering systems (such as Sapere) as well as in new versions of BRANQA by building on the work of Dagan and Itai [8] (see Chapter 5). Thus, work using the Link Parser has the possibility of helping the InfoLab Group beyond the direct results of this thesis, especially through the identification of problems and possible solutions.

Figure 3-3: Problems assigning constituent structure to conjunctions. Panel (a) shows the top-ranked parse of "Former guests include John, Paul, George, Ringo and Steve", in which the conjunct NPs are nested inside one another; panel (b) shows the correct parse (the worst ranked), in which the five conjunct NPs form a single flat conjunction.

3.1.2 Constituent Structure

Although link grammars have no concept of constituents, the Link Parser has (since version 4.0) a phrase-parser: a system which takes a linkage (the usual link grammar representation of a sentence, showing links connecting pairs of words) and derives a constituent or phrase-structure representation, showing conventional phrase categories such as noun phrase (NP), verb phrase (VP), prepositional phrase (PP), clause (S), and so on. This allows us to identify the noun phrases in the text, the first step towards resolution of pronominal anaphors. The interface to the Link Parser takes the NPs identified by the parser and attempts to fix some common problems in the parser's handling of conjunctions. Conjunctions with many disjuncts are almost always parsed incorrectly in the top-ranking
linkage returned by the parser. This happens because one of the components in the cost vector used to sort the linkages is the sum of link lengths, and a flatter structure will have longer links than one with more embedding of phrases. Thus, a sentence like "Former guests include John, Paul, George, Ringo and Steve" has the constituent structure in Figure 3-3(a) assigned to the top-ranking parse, whereas the correct constituent structure is that of the worst ranked linkage. The interface to the Link Parser identifies instances of lists of items, like the one in the example, and corrects the constituent structure within the conjunction. By using better knowledge of named entities, BRANQA is also able to correct some parsing errors when identifying noun phrases. The Link Parser fails to parse the second sentence in the example below, since "Son" is not recognized as a name.

Mr. Son sang a song. (...) Son was happy.

Our Named Entity recognition module identifies "Son" as a name after having seen "Mr. Son", enabling us to mark the noun phrase.

3.2 Noun Phrase Categorization

The next step after identifying the noun phrases is to determine the values of the features used by the resolution engine. These properties include the principal noun or head (in the case of conjunctions, the heads of all disjuncts are listed), whether the phrase is a subject, and whether each of "he", "she", "it" and "they" can refer to it. The noun phrase categorization module builds up a table of noun phrases that have been observed in the document. Noun phrases in the table have a reference field used to mark anaphoric reference to a previously seen noun phrase.
The coreference module is in charge of filling that column of the table. An example can be seen in Figure 3-4.

Jimi bought a new guitar. He broke it on stage.

NP        1      2             3      4      5
sent.     1      1             2      2      2
text      Jimi   a new guitar  He     it     stage
head(s)   Jimi   guitar        He     it     stage
subject   true   false         true   false  false
he        true   false         true   false  false
she       false  false         false  false  false
it        false  true          false  true   true
they      false  false         false  false  false
ref.      1      2             1      2      5

Figure 3-4: Noun Phrase Table

3.2.1 Heads and Subject

Finding the head of a noun phrase is accomplished through examination of the link grammar representation of the sentence. The main noun of a simple noun phrase is the only word with a link crossing the boundaries of the phrase. In the case of conjunctions, several words can link outside of the phrase; the heads of all noun phrases that comprise the conjunction are then listed. Occasionally this happens in noun phrases that are not conjunctions, but a small list of rules helps select the correct word in most of these cases. The value for the subject column is obtained directly from the link grammar parse. If the phrase is a subject, its head will be linked to a verb using one of the subject link types (S, SF, SX, SI, SFI or SXI). In passive constructions the surface subject will also be linked to the verb through one of these link types, so care must be taken not to mark it as a subject. This is done by checking for a Pv link, which marks the use of a passive verb.

3.2.2 Valid References

The most important task performed by the noun phrase categorization module is determining which pronouns can refer to each noun phrase. The main resources used for this task are the Named Entity module, Wordnet [10], and a list of male and female common nouns. The nouns in Wordnet were split into three lists according to whether they always, sometimes, or never indicate a person. This was done by checking whether the word was a hyponym of the synset "person, individual, someone, somebody..." in all, some, or none of its senses. (A synset is a set of synonyms; it is the basic element in the Wordnet hierarchy of meanings. Synset A is a hyponym of synset B if A "IS-A" or "IS-A-KIND-OF" B.) Many of the
words that were in the sometimes-a-person list were then moved to either the never-a-person or always-a-person list if they had been assigned to the sometimes-a-person list because of senses that are very infrequent uses of the word. A similar procedure was used to generate a list of collective nouns like "team" by looking for hyponyms of "group, grouping". Proper names are handled by the Named Entity module. For the rest of the noun phrases, the module looks at the head of the noun phrase and checks for number and gender using the aforementioned lists, Wordnet's list of irregular plurals, and some pluralization rules.

3.3 Named Entity Recognition

A very simple Named Entity recognition module was built to assist in the identification of noun phrases and the determination of valid references. The module knows about male and female first names, countries, and US states. It also uses heuristics to recognize unknown names. For example, it identifies as company names those sequences of capitalized words ending in an element of a set containing "Company", "Co", "Inc" and other words that indicate the entity is a company. Once it has seen the full name ending in one of these words, it will recognize subsequences of the words in the name as coreferent with the full name (e.g., after seeing "Lockheed Martin Corp." it will identify "Lockheed", a word it doesn't know, as a company). A similar treatment is given to names of people, which are identified if they contain known first names or personal titles like "Dr.", "Ms." or "Capt.". When a capitalized sequence of words is a subsequence of more than one previously identified named entity, the module will not mark it as coreferent with any of them, and will set its valid references to be the union of the valid references for the matching named entities. For example, after seeing "Janis Joplin", "Joplin" would be resolved to "Janis Joplin" and identified as a female name.
If "Peter Joplin" is then mentioned in the same article, future mentions of "Joplin" will be left unresolved, but they will be identified as persons. If the module had not seen either of the full names, "Joplin" would not be marked as a person, as it could be referring to a company or a place. Since articles tend to use full names when a company or person is first mentioned, this strategy gives good performance without requiring a large list of company names or last names. However, a good list of companies would certainly help, especially in the case of household names, since these often show up without a "Co.", "Inc." or any other indication that the entity is a company. The scope of person names is limited to the document where they are found, i.e., the module forgets the names of people when the system starts working on a new document. Company names, on the other hand, are not forgotten; after seeing "Lockheed Martin Corp." in one article, "Lockheed" will be identified as a company in all subsequent articles.

3.4 Coreference Module

Once noun phrases have been identified, the coreference module determines what they refer to. Noun phrases are resolved left to right, filling the reference column of the Noun Phrase Table. Possible values for this column are null, unresolved, or a reference to a noun phrase in the table. When two noun phrases are identified as coreferent, each gets its set of valid references reduced to the intersection of the two sets. For example, in "Kublai Khan, first Emperor of the Yuan Dynasty", the noun phrases before and after the comma will be marked coreferent. Initially the system does not know that "Kublai Khan" refers to a man, or even to a person, so it will allow reference by "he", "she" and "it". But when coreference is found with "first Emperor of the Yuan Dynasty", "Kublai Khan" will get its set of valid references reduced to only "he", since BRANQA knows that "Emperor" refers to a male person.
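The narrowing of valid references when coreference is found amounts to a set intersection. The sketch below illustrates the idea on the Kublai Khan example; the NounPhrase class and the pronoun set are illustrative, not BRANQA's actual code:

```java
import java.util.EnumSet;

/** Sketch: narrowing valid pronoun references when coreference is found. */
public class ValidReferences {
    enum Pronoun { HE, SHE, IT, THEY }

    static class NounPhrase {
        final String text;
        final EnumSet<Pronoun> validRefs;
        NounPhrase(String text, EnumSet<Pronoun> validRefs) {
            this.text = text;
            this.validRefs = validRefs;
        }
    }

    /** Marking two phrases coreferent reduces both sets to their intersection. */
    static void markCoreferent(NounPhrase a, NounPhrase b) {
        a.validRefs.retainAll(b.validRefs);
        b.validRefs.retainAll(a.validRefs);
    }

    public static void main(String[] args) {
        // Initially nothing is known about "Kublai Khan", so any singular
        // pronoun could refer to it.
        NounPhrase khan = new NounPhrase("Kublai Khan",
                EnumSet.of(Pronoun.HE, Pronoun.SHE, Pronoun.IT));
        // "Emperor" is known to denote a male person.
        NounPhrase emperor = new NounPhrase("first Emperor of the Yuan Dynasty",
                EnumSet.of(Pronoun.HE));
        markCoreferent(khan, emperor);
        System.out.println(khan.validRefs); // [HE]
    }
}
```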
The coreference module comprises three components: a pleonastic pronoun detector, a syntactic filter, and the resolution procedure. The pleonastic pronoun detector identifies non-referential instances of it, and the syntactic filter uses binding constraints to eliminate syntactically disallowed candidates. The resolution procedure is the component that determines the coreference relations. The following subsections describe these components in more detail.

3.4.1 Pleonastic pronoun detector

It is important to detect pleonastic instances of it (such as the one starting this sentence) in order to avoid assigning referents to pronouns that are non-referential. The Link Parser detects some of these instances and uses the link type SF to signal them. However, there are several cases which the Link Parser does not notice. The pleonastic pronoun detector supplements the parser with a set of rules for detection. These are based on the rules from [19] presented in Chapter 2, together with some rules added to handle uncovered cases (e.g., "It's been a long time ...").

3.4.2 Syntactic filter

The syntactic filter is used to rule out reference to noun phrases on the basis of intrasentential binding constraints. Chapter 2 mentioned Government and Binding Theory, and the fact that it can be used to constrain the search for antecedents. However, it is not easy to obtain from a link grammar the syntactic structure needed to make direct use of binding constraints. In theory, we should be able to construct the necessary categories from the link grammar representation (and the Link Parser already helps by providing some constituent structure), but in practice the mapping from a non-categorial grammar to one based on phrase structure is not so easy. A better match to our system is the set of binding constraints for English Slot Grammar [23] presented in [19, 20] (explained here in Chapter 2).
Slot Grammar belongs to the family of dependency grammars, and these are very similar to link grammars. Quoting Sleator [34]: "In a dependency grammar, a grammatical sentence is endowed with a dependency structure, which is very similar to a linkage. This structure, as defined by Mel'čuk [24], consists of a set of planar directed arcs among the words that form a tree. Each word (except the root word) has an arc to exactly one other word, and no arc may pass over the root word. In a linkage (as opposed to a dependency structure) the links are labeled, undirected, and may form cycles, and there is no notion of a root word." While we cannot directly apply the algorithms in [19, 20] to the linkages obtained from the Link Parser, it is possible to extract some of the information that would be present in English Slot Grammar by adding direction to the links. The current implementation does this for a few link types covering many important cases. Future work includes improving the syntactic filter to detect more cases of syntactically invalid reference.

3.4.3 Resolution procedure

The resolution procedure is the core of the coreference module. It uses the Named Entity module, the Noun Phrase Table, and the other two components of the coreference module to make decisions regarding coreference relations. It is not directly dependent on the representation of the parse trees, and this modularity allows experiments with the resolution strategies and the linguistic resources to be carried out independently of each other. The current resolution procedure concentrates on pronominal anaphora, but it also resolves some simple cases of coreference between noun phrases, namely coreference between named entities, coreference with appositional phrases, and coreference with modifiers of a named entity.

Named Entities

The Named Entity module is used to mark coreference between named entities.
When it identifies a noun phrase as matching one of the previously seen named entities, the coreference module marks the two expressions as coreferent in the Noun Phrase Table.

Appositional Phrases

Appositional phrases are typically used to provide an alternative description or name for an entity. The module recognizes appositions by checking for noun phrases of the form (NP <token>+ , (NP <token>+) ,). For example:

(NP Luca Prodan, (NP the great singer, ...))
(NP the great singer, (NP Luca Prodan, ...))

Here the appositional phrase is marked coreferent with the first noun phrase. A common use of appositions which does not indicate coreference is in the names of places (e.g., "Cambridge, Massachusetts"). The system checks for this possibility using a list of countries and U.S. states, which manages to cover the most common cases in U.S. newspaper articles. In the near future I plan to improve coverage by using a larger list of names, including well known U.S. and foreign cities.

Modifiers of Named Entities

The case of modifiers of named entities is similar to that of appositions. For phrases of the form (NP (NP <token>+) <named-entity>), the embedded modifier is marked as coreferent with the whole phrase. An example of this kind of construction is:

(NP (NP famous singer) Luca Prodan) ...

Pronominal Anaphors

We finally arrive at the reason behind all of the previously described components: resolving pronominal anaphors. The other three cases of coreference are marked in order to allow the resolution procedure to work correctly when resolving pronouns. The resolution strategy used belongs to the traditional approach to anaphora resolution, i.e., discounting unlikely candidates and then using heuristics to pick a referent from the remaining set of plausible candidates. The system eliminates from consideration all noun phrases that do not pass the morphological and syntactic filters.
The morphological filter eliminates all phrases that do not agree in gender and number with the pronoun. This is done by checking the valid reference columns in the Noun Phrase Table (e.g., themselves cannot refer to a noun phrase whose value in the they column is false). The syntactic filter further removes from consideration all those noun phrases that are ruled out by binding constraints. Here the system also makes use of coreference links marked between noun phrases when applying the constraints. For example:

Peter made fun of John Smith. John beat him up.

In the second sentence, binding constraints forbid him from referring to John. Since John will be marked coreferent with John Smith, this also eliminates John Smith from consideration, leaving a single possible antecedent, Peter. The system then uses heuristics to pick an antecedent from the remaining noun phrases. The heuristics used are taken from the CogNIAC system, described in Section 2.3.2. The core rules of that system are used, together with a rule that searches back for a unique antecedent when no possible antecedents are found (Search Back) and a rule that looks for a unique antecedent in the subject of the current sentence (Unique Current Subject). Rule 4, Possessive Pro, was excluded since it was eliminated when preparing CogNIAC for MUC-6. The rules used by BRANQA are listed in order of evaluation in Figure 3-5.

1. Unique in Discourse
2. Reflexive
3. Unique in Current + Prior
4. Unique in Current
5. Search Back (until > 1 candidates)
6. Unique Current Subject / Subject Pron
7. Unique Prior Subject / Subject Pron

Figure 3-5: Resolution rules in BRANQA

In evaluating the rules, when checking for a "single possible antecedent" we count possible entities, not possible expressions; that is, if there is more than one possible antecedent but all possible antecedents refer to the same entity, the rule is allowed to trigger.
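Counting entities rather than expressions amounts to following the reference column of the Noun Phrase Table back to a first mention before counting distinct candidates, as sketched below (the array representation of the table is illustrative, not BRANQA's actual code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: a "single possible antecedent" rule fires only when the
 *  candidate noun phrases denote a single underlying entity. */
public class EntityCount {

    /** ref[i] is the index of the phrase that noun phrase i refers to,
     *  or i itself when the phrase is not anaphoric. */
    static int canonical(int[] ref, int i) {
        while (ref[i] != i) {
            i = ref[i];  // follow coreference links back to the first mention
        }
        return i;
    }

    /** Number of distinct entities among the candidate noun phrases. */
    static int distinctEntities(int[] ref, List<Integer> candidates) {
        Set<Integer> entities = new HashSet<>();
        for (int c : candidates) {
            entities.add(canonical(ref, c));
        }
        return entities.size();
    }

    public static void main(String[] args) {
        // NP 0 = Peter, NP 1 = John Smith, NP 2 = John (coreferent with 1).
        int[] ref = {0, 1, 1};
        // Two candidate expressions, but only one underlying entity,
        // so a "single possible antecedent" rule may still trigger.
        System.out.println(distinctEntities(ref, List.of(1, 2))); // 1
    }
}
```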
This is one of the reasons why resolving coreference between non-pronominal noun phrases helps with pronoun resolution. Before trying to apply any of the rules, the pleonastic pronoun detector is used to check whether this is an instance of non-referential it, in which case it is assigned null reference. The rules are then evaluated in order, and if none of them triggers, the pronoun is marked unresolved in the Noun Phrase Table.

Chapter 4

Evaluation

In this chapter I evaluate the performance of BRANQA on a test corpus. I explain the experimental procedure and present the results.

4.1 MUC-7 Coreference Task Corpus

Evaluation was performed on the newspaper articles used in the MUC-7 Coreference Task [15]. These articles have been annotated for coreference using SGML tags, allowing one to procedurally check the correctness of BRANQA's decisions. Coreference relations are tagged between markables: nouns, noun phrases and pronouns. Pronouns include both personal and demonstrative pronouns, and, with respect to personal pronouns, all grammatical cases, including the possessive. Dates ("January 23"), currency expressions ("$1.2 billion"), and percentages ("17%") are considered noun phrases. Coreference relations are marked only between pairs of elements both of which are markables. This means that in those cases where the antecedent is a clause rather than a markable, the relation will not be annotated. Referring expressions and their antecedents are marked as follows:

<COREF ID="100">Lawson Mardon Group Ltd.</COREF> said <COREF ID="101" TYPE="IDENT" REF="100">it</COREF> ...

All markables have a unique ID within the document, which is used to refer to them through the REF attribute of COREF tags. Two sets of articles were available for testing, one labeled "dry-run" and the other "formal", used in different stages of the MUC-7 evaluation. The "dry-run" set was used for this evaluation, saving the other set for future experiments.
This set consists of thirty New York Times articles, most of them regarding airplane crashes.

4.2 Test Procedure

The calls to BRANQA's pronoun resolution procedure were instrumented so that it would send its answers through an evaluation module, which checked them against the key. In evaluating the system, errors were not chained; that is, answers were corrected, if possible, before proceeding to resolve the next pronoun. After resolving a pronoun, the evaluation procedure recorded the answer and checked it against the key. If it was wrong, it attempted to find a noun phrase in the Noun Phrase Table that would match the one in the key. This was not always possible, for two reasons: sometimes the pronoun was not marked in the key because it had no markable antecedent, and sometimes parser errors caused BRANQA not to identify the marked noun phrase. In both cases the pronoun was marked unresolved in the Noun Phrase Table before going on to the next pronoun. If the pronoun was not marked for lack of a markable antecedent, the evaluation module considered an answer of unresolved as correct.

4.3 Results

Table 4.1 shows BRANQA's precision and recall characteristics on 336 third person pronouns in the test corpus, broken down by rule. The resolution of a pronoun to null (for the pleonastic case) was considered correct if the pronoun had no antecedent in the key (which could happen either if the pronoun was actually pleonastic, or if it had an antecedent that was not markable). I checked the eight cases marked pleonastic in the test and they were all in fact pleonastic.
Rule                   Contribution to Recall   Precision
Pleonastic             2% (8/336)               100% (8/8)
Unique in Disc.        6% (20/336)              100% (20/20)
Reflexive              1% (4/336)               80% (4/5)
Unique Cur+Prior       14% (47/336)             85% (47/55)
Unique Cur             17% (58/336)             95% (58/61)
Search Back            0% (0/336)               never used
Subject Cur            3% (10/336)              83% (10/12)
Subject Prev           6% (21/336)              84% (21/25)
Unresolved (correct)   3% (10/336)              100% (10/10)
Total                  53% (177/336)            91% (177/195)

Table 4.1: Test Results by Rule

The "Unresolved (correct)" line of the table shows the number of pronouns that were left unresolved but had no antecedent in the key, and were thus considered correct for the purpose of computing precision and recall statistics. Table 4.2 shows the results broken down by pronoun. The precision/recall characteristics of BRANQA are comparable to those of CogNIAC. In the first experiment on narrative texts, CogNIAC achieved 92% precision for 64% recall, and in the second test, on MUC-6 documents, it yielded 73% precision for a recall of 75%. Especially in the second case, CogNIAC's recall is considerably higher than BRANQA's, but this came at a significant cost in precision. Of the 18 incorrect resolutions, six happened in cases where there was no antecedent marked in the key (this included pleonastic pronouns, but also cases where the antecedent was not markable, e.g., "they" referring to two people mentioned in separate sentences). Three incorrect resolutions can be attributed to misclassification of a word according to gender and number. Another three were due to parser errors, and the remaining six can be attributed to failures of the resolution rules. Of the unresolved cases, several could have been resolved with the existing rules if not for misclassification of words, failure to eliminate candidates in the syntactic filter, and parser errors leading to faulty identification of noun phrases. A detailed case by case analysis of the 140 unresolved pronouns was not carried out.
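The summary figures above follow the usual definitions: precision is the number of correct resolutions over the number of attempted resolutions, and recall is the number of correct resolutions over the total number of pronouns in the test set. The following sketch recomputes the overall figures from the totals reported in Table 4.1:

```java
/** Sketch: recomputing the summary precision/recall figures of Table 4.1. */
public class Scores {

    /** Fraction of attempted resolutions that were correct. */
    static double precision(int correct, int attempted) {
        return (double) correct / attempted;
    }

    /** Fraction of all pronouns in the test set that were resolved correctly. */
    static double recall(int correct, int totalPronouns) {
        return (double) correct / totalPronouns;
    }

    public static void main(String[] args) {
        // Totals reported in Table 4.1: 177 correct out of 195 attempts,
        // over 336 third person pronouns.
        System.out.printf("precision = %.0f%%%n", 100 * precision(177, 195)); // 91%
        System.out.printf("recall    = %.0f%%%n", 100 * recall(177, 336));    // 53%
    }
}
```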
Pronoun      Correct   Wrong   Unresolved   Precision   Recall
he           37        0       24           100%        61%
she          11        1       7            92%         58%
it           40        7       30           85%         52%
they         14        5       13           74%         44%
him          2         0       4            100%        33%
his          36        0       13           100%        73%
himself      0         1       0            0%          0%
her          11        1       4            92%         69%
hers         0         0       0            -           -
herself      0         0       0            -           -
its          12        1       23           92%         33%
itself       3         0       0            100%        100%
them         1         0       11           100%        8%
their        9         2       11           82%         41%
theirs       0         0       0            -           -
themselves   2         0       0            100%        100%
Total        178       18      140          91%         53%

Table 4.2: Test Results by Pronoun

4.4 Effect on Question Answering

I chose to use the rules from Baldwin's system because it was developed with a focus on high precision coreference, and I believe that high precision is more important than high recall when doing question answering: it is better not to give an answer than to give a wrong answer. On the other hand, the surest way not to make mistakes is to never attempt to resolve pronouns. The purpose of a resolution tool is to raise the recall characteristics of the systems using it, and thus a balance must be struck between precision and recall. More important than precision/recall values for resolution on a test corpus are the effects that the tool has on the precision/recall characteristics of the systems using it. Lacking a test set for Sapere, I could not evaluate BRANQA's effect on it. However, I did look at some articles from the WorldBook Encyclopedia that Sapere indexed, in order to get an idea of the potential benefits for question answering. I ran BRANQA on the articles and then evaluated its results manually, since no coreference annotations had been made. Taking the article on Afghanistan as an example, we find that out of 42 third-person pronoun occurrences, the system resolved 27 correctly, resolved 3 incorrectly, and left 12 occurrences unresolved (all 12 had a markable antecedent). An example of useful resolution for question answering is that of "he" and "his" to "Abdur Rahman" in "After he died in 1901, his policies were continued by his son, Habibullah Khan."
Since this is the only article in the Encyclopedia mentioning Abdur Rahman, the resolution of "he" adds information that we did not have before. As for the three incorrect resolutions, it does not seem that they would cause Sapere to return wrong answers to questions people would ask. The following lists the mistakes and the reasons behind them:

* "They" resolved to "their communities" instead of "Mullahs" in "They interpret Islamic law and educate the young" (failure to recognize "Mullahs" as plural).

* "It" resolved to "the game" in "In the game, dozens of horsemen try to grab a headless calf and carry it across a goal" (a bug in the syntactic filter eliminated "a headless calf" from the set of possible antecedents).

* "His" resolved to "The British" instead of "Abdur Rahman Khan" in "The British agreed to recognize his authority over the country's internal affairs" (misclassification of "The British" as a person's name).

More formal testing is necessary, but from what I have seen so far I am led to believe that BRANQA would improve Sapere's performance if used before indexing relations.

Chapter 5

Future Work

In this chapter I present a number of ways in which the system will be improved in the near future, as well as possible research projects suggested by this thesis.

5.1 Improvements

5.1.1 Quoted Speech

Several of the pronouns left unresolved in the evaluation could have been assigned an antecedent if better machinery had been added to handle quoted speech. Quotations are very common in newspaper articles, and it seems plausible to construct a module that accurately keeps track of who is the speaker being quoted. This could then be used to add binding constraints for the pronouns in quotations. I expect this would improve the performance of the system, at least for the domain of newspaper articles.

5.1.2 Named Entity Module

The named entity module developed for this system is very simple and leaves ample room for improvement.
A better named entity tagger is currently being developed by the InfoLab Group, and I plan to integrate it into BRANQA when it becomes available.

5.1.3 Syntactic Filter

It was previously mentioned that the current implementation of the syntactic filter does not cover all the cases that are ruled out by the binding constraints in [19, 20]. Only a few link types of English Link Grammar are being used within our filter. Several failures to resolve a pronoun in the test corpus were due to the syntactic filter failing to establish disjoint reference. I plan to extend the coverage of the filter and to correct some of the mistakes it currently makes.

5.2 Future research projects

5.2.1 Statistics as a proxy for world knowledge

Baldwin suggests that CogNIAC achieves good performance with a simple set of rules because it works on those pronoun occurrences which do not need world knowledge to be resolved. However, there are many cases which do need world knowledge for adequate resolution. BRANQA's current world knowledge is limited to the classification of words according to whether they denote people or groups, and to small lists of people's names, personal titles, country names and U.S. states. It does not know, for example, that days do not own vessels, and this led to one of the incorrect resolutions in our test corpus, where "their" in "their vessel" was resolved to "the past two days" because this phrase passed the morphological and syntactic filters and triggered Rule 6, Unique Subject / Subject Pron. While it is extremely difficult to add large amounts of knowledge to rule out invalid antecedents on semantic grounds, it is possible to extract statistics from a large corpus to help disambiguate references.
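A minimal sketch of this idea follows; the counts, the possessor classes, and the relation representation are all hypothetical, standing in for statistics that would actually be extracted from a large corpus:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: prefer the antecedent candidate that co-occurs with the governed
 *  noun in the relevant relation most often in a large corpus. */
public class CooccurrenceScorer {
    // Hypothetical corpus counts for (possessor-class, possessed-noun) pairs.
    private final Map<String, Integer> counts = new HashMap<>();

    void observe(String possessorClass, String possessedNoun, int n) {
        counts.put(possessorClass + "|" + possessedNoun, n);
    }

    int score(String possessorClass, String possessedNoun) {
        return counts.getOrDefault(possessorClass + "|" + possessedNoun, 0);
    }

    /** Returns the highest scoring candidate, or null if all scores are zero. */
    String best(List<String> candidateClasses, String possessedNoun) {
        String best = null;
        int bestScore = 0;
        for (String c : candidateClasses) {
            int s = score(c, possessedNoun);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        CooccurrenceScorer scorer = new CooccurrenceScorer();
        // Invented counts: people possess vessels, days do not.
        scorer.observe("person", "vessel", 120);
        scorer.observe("day", "vessel", 0);
        // "their vessel": choose between a people-denoting NP and "the past two days".
        System.out.println(scorer.best(List.of("person", "day"), "vessel")); // person
    }
}
```

Such a scorer would only be consulted after the morphological and syntactic filters, to break ties that the heuristic rules cannot.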
In the example above, we could have used the fact that "Coonan and 10 other crew members" is a noun phrase referring to people (something the system can already determine) together with statistics showing that people co-occur with "vessel" in a possessive relation more often than "days" do (infinitely more often, in this case).

Dagan and Itai [8] follow this approach, using the frequency of co-occurrence in subject-verb-object relations to help resolve the pronoun "it". They show that their system correctly handles many anaphors that BRANQA's rules would leave unresolved. For example:

They knew full well that the companies held tax money_i aside for collection later on the basis that the government said it_j would collect it_i.

By using the JLink relation extraction system currently in development by the InfoLab Group, we can extend their approach to other relations such as modification and possession, hopefully improving performance.

5.2.2 Alternative Resolution Procedures

The literature shows a wide variety of approaches and methods developed for the resolution of anaphora. Having laid down the infrastructure that identifies noun phrases and their relevant properties, it would be interesting to experiment with different resolution procedures. This could lead to a set of resolution systems with different precision and recall characteristics, from which developers could choose according to their needs. BRANQA was designed with a bias towards high precision, but for some applications a higher-recall tool might be desirable.

5.2.3 Integration with Other Systems

The most important future project will be the integration of BRANQA with the rest of the group's systems. Our goal in this thesis was not to extend the state of the art in anaphora resolution, but to build a useful tool that improves our question answering performance.
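The statistical disambiguation idea of Section 5.2.1 can be made concrete with a toy sketch. Everything below is hypothetical: the co-occurrence table and the function `rank_antecedents` are invented for illustration, and a real system would extract such counts from a parsed corpus (for instance with a relation extractor like JLink) rather than hard-code them.

```python
# A minimal sketch of ranking candidate antecedents by corpus co-occurrence
# in a possessive relation. The counts below are invented for illustration;
# a real system would derive them from parsed text.

# Hypothetical counts of (possessor head noun, possessed noun) pairs.
POSSESSIVE_COUNTS = {
    ("member", "vessel"): 17,   # e.g. "the crew members' vessel"
    ("person", "vessel"): 9,
    ("day", "vessel"): 0,       # days do not own vessels
}

def rank_antecedents(candidates, possessed_noun):
    """Order candidate antecedent head nouns by how often the corpus
    shows them as possessors of the possessed noun."""
    def score(head):
        return POSSESSIVE_COUNTS.get((head, possessed_noun), 0)
    return sorted(candidates, key=score, reverse=True)

# For "their vessel": prefer "member" (from "Coonan and 10 other crew
# members") over "day" (from "the past two days").
ranked = rank_antecedents(["day", "member"], "vessel")
print(ranked[0])  # member
```

A rule like Rule 6 could consult such a ranking as a final tie-breaker, rejecting any candidate whose co-occurrence count falls below a threshold.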
I evaluated the performance of the system on a test corpus, but the crucial evaluation that remains to be done is testing how much it helps the other systems developed by the group.

Integration with systems like Sapere that work by first indexing a corpus should be straightforward. No changes need to be made to the original system if we simply replace occurrences of pronouns in the corpora with the referents found by BRANQA. This will allow initial experiments to be performed without much work once we have adequate test sets for our question answering systems. If the systems wanted to take into account the fact that our resolution is not perfect, we could instead mark noun phrases with SGML coreference tags, as was done in the MUC-7 evaluation. This would give the users of our system access to both the original pronoun and our resolution, leaving it up to them to decide what to do with our output.

Integration with START will be more involved, as will be the evaluation of the resulting improvements in performance. We envision START carrying out dialogues with its users, using anaphora resolution to allow for more natural conversations. This will require changes on the START side to support the shift from disconnected series of questions and answers to actual dialogues.

Chapter 6

Contributions

This thesis contributes to research at the InfoLab Group by:

* Providing an overview of the relevant literature in anaphora resolution.
* Showing an independent replication of the resolution strategy used in the CogNIAC system, achieving comparable results.
* Presenting the design and implementation of an anaphora resolution tool that, in informal evaluation, appears to be helpful for improving the performance of question answering systems.
* Building useful infrastructure that can be reused in future research projects: an interface to the Link Parser that attempts to fix some of its deficiencies, a noun phrase categorization tool, a simple named entity module, and an architecture that allows for easy experimentation with anaphora resolution methods.

I hope this work will bear fruit by improving the performance of our systems and by motivating new projects that make use of anaphora resolution and coreference in general. I believe it should at least provide a starting point for the development of better systems that tackle question answering using coreference.

Bibliography

[1] James Allen. Natural Language Understanding. The Benjamin/Cummings Publishing Company Inc., Redwood City, California, second edition, 1995.

[2] B. Amit. Evaluation of coreferences and coreference resolution systems. In Proceedings of the First Language Resource and Evaluation Conference, May 1998.

[3] Carl Lee Baker. English Syntax. MIT Press, Cambridge, Massachusetts, second edition, 1995.

[4] Breck Baldwin. CogNIAC: High precision coreference with limited knowledge and linguistic resources. In Proceedings of the ACL Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts, pages 38-45, 1997.

[5] Susan E. Brennan, Marilyn W. Friedman, and Carl Pollard. A centering approach to pronouns. In Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics, pages 155-162, 1987.

[6] Donna K. Byron and Joel R. Tetreault. A flexible architecture for reference resolution. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, 1999.

[7] Noam Chomsky. Lectures on Government and Binding. Foris Publications, 1981.

[8] Ido Dagan and Alon Itai. A statistical filter for resolving pronoun references. In Y.A. Feldman and A. Bruckstein, editors, Artificial Intelligence and Computer Vision, pages 125-135. Elsevier, 1991.

[9] Deborah A. Dahl and Catherine N. Ball.
Reference resolution in PUNDIT. Technical Report CAIT-SLS-9004, Paoli: Center for Advanced Information Technology, March 1990.

[10] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts, 1998.

[11] N. Ge, J. Hale, and E. Charniak. A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 161-171, 1998.

[12] Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. Providing a unified account of definite noun phrases in discourse. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pages 44-50, 1983.

[13] Liliane Haegeman. Introduction to Government and Binding Theory. Blackwell, 1991.

[14] Irene Roswitha Heim. The Semantics of Definite and Indefinite Noun Phrases. PhD thesis, University of Massachusetts, 1982.

[15] Lynette Hirschman and Nancy Chinchor. Coreference task definition v3.0. In Proceedings of the Seventh Message Understanding Conference, July 1997.

[16] Jerry R. Hobbs. Pronoun resolution. Technical Report 76-1, Department of Computer Science, City College, City University of New York, 1976.

[17] Boris Katz. Using English for indexing and retrieving. In P.H. Winston and S.A. Shellard, editors, Artificial Intelligence at MIT: Expanding Frontiers, volume 1. MIT Press, 1990.

[18] Boris Katz. Annotating the World Wide Web using natural language. In Proceedings of the 5th RIAO Conference on Computer Assisted Information Searching on the Internet, 1997.

[19] Shalom Lappin and Herbert J. Leass. An algorithm for pronominal anaphora resolution. Computational Linguistics, 20(4):535-561, 1994.

[20] Shalom Lappin and Michael McCord. A syntactic filter on pronominal anaphora in slot grammar. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, pages 135-142, 1990.

[21] Geoffrey Leech and Roger Garside.
Running a grammar factory: the production of syntactically analysed corpora or 'treebanks'. In Stig Johansson and Anna-Brita Stenstrom, editors, English Computer Corpora: Selected Papers and Bibliography. Mouton de Gruyter, 1991.

[22] Jimmy J. Lin. Indexing and Retrieving Natural Language Using Ternary Expressions. Master of Engineering thesis, Massachusetts Institute of Technology, 2001.

[23] Michael McCord, Arendse Bernth, Shalom Lappin, and Wlodek Zadrozny. Natural language processing within a slot grammar framework. International Journal of Artificial Intelligence Tools, 1(2):229-277, 1992.

[24] Igor A. Mel'čuk. Dependency Syntax: Theory and Practice. State University of New York Press, 1988.

[25] Ruslan Mitkov. A new approach for tracking center. In Proceedings of the International Conference "New Methods in Language Processing", 1994.

[26] Ruslan Mitkov. Factors in anaphora resolution: they are not the only things that matter. In Proceedings of the ACL Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts, pages 14-21, 1997.

[27] Ruslan Mitkov. Anaphora resolution: the state of the art. Working paper (based on the COLING'98/ACL'98 tutorial on anaphora resolution), 1999.

[28] Ruslan Mitkov, Richard Evans, Constantin Orasan, Catalina Barbu, Lisa Jones, and Violeta Sotirova. Coreference and anaphora: developing annotating tools, annotated resources and annotation strategies. In Proceedings of the Discourse Anaphora and Anaphora Resolution Colloquium (DAARC'2000), pages 49-58, 2000.

[29] Thomas S. Morton. Using coreference in question answering. In Proceedings of the 8th Text REtrieval Conference (TREC-8), 1999.

[30] MUC-6 Program Committee. Coreference task definition v2.3. In Proceedings of the Sixth Message Understanding Conference, November 1995.

[31] Candace Lee Sidner. Towards a computational theory of definite anaphora comprehension in English discourse. Technical Report AITR-537, MIT AI Lab, 1979.

[32] Candace Lee Sidner.
Focusing in the comprehension of definite anaphora. In Barbara Grosz, Karen Sparck Jones, and Bonnie Lynn Webber, editors, Readings in Natural Language Processing. Morgan Kaufmann, 1986.

[33] Daniel Sleator and Davy Temperley. Parsing English with a link grammar. Technical Report CMU-CS-91-196, Carnegie Mellon University, October 1991.

[34] Daniel Sleator and Davy Temperley. Parsing English with a link grammar. In Third International Workshop on Parsing Technologies, August 1993.

[35] Joel R. Tetreault. Analysis of syntax-based pronoun resolution methods. In Proceedings of the Association for Computational Linguistics, 1999.

[36] Marilyn A. Walker, Masayo Iida, and Sharon Cote. Japanese discourse and the process of centering. Technical Report IRCS 92-14, Institute for Research in Cognitive Science, University of Pennsylvania, 1992.

[37] Patrick H. Winston. Artificial Intelligence. Addison-Wesley, Reading, Massachusetts, third edition, 1992.