Flexible Configuration of Information Extraction to Support Richer Propositional Constructs

The ultimate goal of our project is to allow members of the coalition to access information created by other people and teams, often from other divisions, other branches of the military, or even other countries. One aspect of this is creating a common query language that can bridge different conceptualizations of the domain, whether these arise from different cultures, different dialects, or different areas of expertise. A query language is not enough, however; the information must first be entered into the system. Much of it already exists as free-text memos, reports, and other documents, and it must be re-coded into a form that is broadly accessible.

This is where information extraction comes in. Several kinds of information must be extracted from those free-text documents: entities of various types, together with their attributes and their relations to each other, and events of various types, together with their participants, location, time, and how certain we can be of the report. Although a certain amount of information extraction can be done with a fairly general system, much of it requires knowledge that only those familiar with the domain can provide, including knowledge of the relevant types of people, objects, locations, events, and situations, and of the language used to describe them. A fundamental goal of this project is therefore to provide a system and a language in which these domain experts can express this knowledge. In pursuit of this goal, we are developing a system for performing ontology-based information extraction using a controlled natural language.
Having an ontology-based system will simplify the extraction by allowing us (or rather the domain experts) to describe constraints on the kinds of information that are expected to be found (or at least to be useful), thus aiding the process of interpreting the text. The controlled natural language, a subset of ordinary language (English in our case), is ITA-CE (hereafter just CE). Because it is ordinary English, it will be relatively easy for these domain experts to use, yet it is sufficiently precise and unambiguous that a computer can interpret the input of the domain experts and use it to automatically extract the necessary knowledge from free text.

Users of CE

The ultimate end-users are commanders at various levels, war fighters, and peacekeeping teams, who need to query the system and access needed information. There are several other types of users, however, and CE will have to serve all of them. In addition to the commanders, there are analysts who will also need to query the knowledge. These two sets of users will require a query language, though possibly at different levels of abstraction. Then there are the domain experts who will have to put knowledge into the system (likely the analysts themselves). It must be kept in mind that both of these groups of users will have little or no expertise in linguistic analysis or knowledge representation; nevertheless, they need to use their domain expertise to modify and extend an information extraction system so that it extracts ever-increasing amounts of useful information from free text, often in a constantly changing environment. They will have to provide two types of information: domain knowledge (i.e. extensions to the ontology or domain model) and the language used to express it. The system will have to provide a sufficiently intuitive language to allow them to enter this knowledge.
In addition to these two end-user roles, the system will also require knowledge representation experts and natural language processing experts to create the basic system and to provide direction and tools for its expansion and modification by domain experts. CE will have to serve them as well, providing the former with a language for creating and structuring an ontology or domain model and the latter with a means of describing language and its structure, as well as a means of mapping between the two. This report discusses the current state of CE and some of the extensions that will be required in order to provide an easy-to-use language that serves all of the above types of users.

Current State of the System and Language

CE currently has a solid basis for all of these roles. There is a basic high-level ontology expressed in CE, and CE provides a means for introducing new concepts into the ontology, assigning them the appropriate relations to other elements of the ontology (the IS-A or subclass relation), and providing for the attributes that an instance of a particular entity concept may have and for the relations that instances of different entity concepts may have with each other. The meaning of a new concept can be clarified by identifying it with a “synset” in WordNet (where a synset is a set of words that are synonyms, i.e. have, at least approximately, the same meaning). There is also a way to introduce real-world instances of concepts (in the form of entity identifiers) and to connect them to the ontology, i.e. as instances or realizations of those concepts, with the “realizes” relationship. One can also express the identity relationship between two instances, i.e. the fact that they represent the same real-world entity. This allows us to state that two entity identifiers derived from two different language referents (e.g. a pronoun and its antecedent) actually represent the same real-world entity (e.g. a person or organization).
There are also ways of expressing the truth or falsity of a sentence, the degree of certainty the system has in a certain sentence, and whether a sentence is merely an assumption rather than an established fact. CE allows the expression of if-then rules, which can be used to make inferences from a fact base. Since situations are full-fledged entities, reification of sentences is possible, allowing reference to them as part of a description of the rationale behind a proposition based on a chain of rule-based inferences. This ability to describe a rationale in CE, together with the possibility of sentences being ascribed the status of “assumption”, allows what-if scenarios to be explored in the language. Finally, there is a means for querying the fact base for the existence of entities or situations that fit a certain description, or for a count of the entities or situations that fit a certain description.

As for describing language and its structure, CE can currently introduce new words and ascribe to them their basic part-of-speech (e.g. noun, verb, adjective, preposition) and important subcategories of these (e.g. present and past participles of verbs). It can also describe certain phrasal categories (sentence, noun phrase, verb phrase, prepositional phrase) in terms of their head word (e.g. the head noun of a noun phrase or the head verb of a verb phrase) and its dependents (e.g. verb complements like direct objects; verb modifiers like adverbs and (most) prepositional phrases; or noun modifiers like adjectives and (most) prepositional phrases). It can also ascribe certain feature annotations to phrases, allowing, for example, the statement of certain verb subcategorization restrictions, such as what kind of prepositional phrase complement a particular verb takes. It has the beginning of an ability to describe the semantic roles various complements of verbs can have (e.g. agent, patient, instrument), which can play an important role in extracting the roles of different participants in situations despite varying surface representations. For example, the same participant may appear as the subject of a passive sentence or as the direct object or prepositional phrase complement of an active sentence:

1. The villager was shot.
2. They shot the villager.
3. They shot at the villager.

CE also has a means for stating what words may be used to express a particular entity or relation concept (“expresses”), providing for a basic association between the ontology or domain model and language. One can also use CE to state that a certain linguistic expression (e.g. a noun phrase or a sentence) “stands for” a particular entity or situation.

Suggestions for Extending CE

There are some current limitations of CE, if not in general expressiveness then in the ability to express things in a stylistically felicitous and possibly more succinct manner. In this section, this report focuses on adjectives (a later report will discuss prepositional phrases and relative clauses). The syntax of CE currently does not allow for adjectives. Note that CE is perfectly capable of describing non-CE free text that has adjectives in it, but CE itself does not have adjectives. The meaning of adjectives is currently captured by considering them as representing a class in the entity ontology. For example, the adjective “Christian” in “the Christian market” is currently rendered as the entity concept “the christianentity '|c2|'”, which also happens to be a “market” (the CE meta-ontology allows an instance or concept to have multiple parents in the IS-A hierarchy).

4. the christianentity '|c2|' …
5. the christianentity '|c2|' is a market

There are several ways of handling the semantics of “Christian” that stay truer to its adjectival nature, treating it as a property or attribute rather than an entity.
First, the market should be seen as an instance of the concept associated with the head noun, so we would have “the market '|m2|'” rather than “the christianentity '|c2|'”. The adjective could then be represented in any of the following ways. The first stays closest to the existing CE syntax:

6. the market '|m2|' …
   a. the market has christian as religious_affiliation
   [or possibly]
   b. the market has christianity as religious_affiliation

These (especially 6.b) are analogous to the existing CE syntax in sentences like “X has Y as father”. 6.b still reifies the adjective “Christian” into “Christianity”, but in a more felicitous way than “christianentity”. It treats it as a property of the market and does not require adding spurious entity concepts like “christianentity” to the ontology.

Extending the CE syntax somewhat, we could have either the attributive use of adjectives (where they occur before the noun they modify):

7. the christian market '|m2|' …

Or the predicative use of adjectives, where they occur after a copula verb like “be”:

8. a. the market '|m2|' …
   b. the market '|m2|' is christian

Example 8 is less concise, separating the two predicates into two sentences, but it would perhaps be more accessible to queries of the form “what entities are christian?”.

9. the market '|m2|' which is christian

Sentence 9 uses a relative clause containing a predicate adjective to keep the two predicates in a single sentence. It nevertheless has a form similar to the pair of sentences in 8, while also maintaining the primary categorization of the entity '|m2|' as a market.

Sentence 10 illustrates yet another possibility, with similarities to Sentence 6. Other than being slightly more felicitous, it does not offer any distinct advantages over 6.

10. the market '|m2|' …
   a. the market has the religious_affiliation christian
   b. the market has the religious_affiliation christianity

Note that the approach taken in Sentence 6 is the same as having a slot in a frame representation, where the slot has the name or type “religious_affiliation” and, in this case, the value “Christian” or “Christianity”. This could in fact be used to guide information extraction: once we know that we have a large public institution like a market in certain areas of the world (like the Middle East), it may be likely that it has a religious affiliation, with a finite number of possible values (e.g. “Christian”, “Muslim”, “Shiite”, “Sunni” etc.). The information extraction agent could then search for a modifier nearby, or even in another sentence, that supplies the appropriate value for the slot. A possible disadvantage is that this would require us to categorize adjectives so that they can be matched to appropriate slots. This is not difficult with adjectives like “Christian” or with adjectives of color (“red”), shape (“round”), height (“tall”) or weight (“slender”), but it might be difficult with some other adjectives. If this approach is taken (and is the only one), then some kind of generic category like “property” would have to be used to label these attributes.

Another possible problem with adjectives is the fact that some can only be used attributively and some can only be used predicatively. “Former” in the following sentence is an example of an adjective that can only be used attributively.

11. Bath’est website promotes return of the party to its former prominence.

One cannot say “its prominence was former”. In order to capture the meaning of this sentence, either the attributive use of adjectives would have to be allowed or all the implications of “former prominence” would have to be derived directly from the sentence (e.g. the party was prominent at some time prior to the utterance/publication of the sentence and it is no longer prominent at the time of the utterance/publication of the sentence).
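The slot-based reading of “the market has christian as religious_affiliation” can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slot inventory `SLOT_VALUES`, the `Entity` class, and the helper names are hypothetical and are not part of CE or of any existing system. The sketch shows how categorized adjectives are routed to a named slot, and how uncategorized ones (such as “former”) fall back to the generic “property” label mentioned above.

```python
from dataclasses import dataclass, field

# Hypothetical slot inventory (an assumption for illustration): each slot
# lists the adjective values that may fill it.
SLOT_VALUES = {
    "religious_affiliation": {"christian", "muslim", "shiite", "sunni"},
    "color": {"red"},
    "shape": {"round"},
}


@dataclass
class Entity:
    identifier: str                      # e.g. "m2"
    concept: str                         # concept of the head noun, e.g. "market"
    slots: dict = field(default_factory=dict)


def slot_for(adjective: str) -> str:
    """Return the slot an adjective fills, or the generic 'property' slot."""
    adj = adjective.lower()
    for slot, values in SLOT_VALUES.items():
        if adj in values:
            return slot
    return "property"


def fill_slots(entity: Entity, modifiers: list) -> Entity:
    """Attach each modifier to the entity under the slot it belongs to."""
    for word in modifiers:
        entity.slots.setdefault(slot_for(word), []).append(word.lower())
    return entity


market = fill_slots(Entity("m2", "market"), ["Christian"])
print(market.slots)
```

The entity keeps “market” as its primary concept, so no entity concept like “christianentity” is needed; the cost, as noted above, is that every adjective must be categorized or left in the generic slot.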
In general, the choice needs to pay attention to several different criteria. The syntax must:

A. Allow everything that needs to be expressed to be expressible
B. Support information extraction from text
C. Support the expected queries
D. Be at least reasonably felicitous for English speakers

There may not be a single solution that satisfies all of these criteria perfectly. Of course, there is nothing to prevent CE from employing more than one of these approaches. If the simplest and most concise version, the attributive syntax (Sentence 7), is chosen, we would probably also want to allow the predicative use, to support certain queries, with each form (typically) derivable from the other by a general inference rule.

Sources of Ambiguity and Suggestions for Limiting It

Ambiguity can come from several sources. There is lexical ambiguity, deriving from the different meanings of a word (e.g. “bank”, “star”, “tank”). There are also other sources of ambiguity. Consider the following sentences:

12. The boy saw the girl on the hill with the telescope. [Syntactic: Who had the telescope, the boy or the girl?]
13. Flying airplanes can be dangerous. [Syntactic: Are the airplanes dangerous or is the act of flying them?]
14. Five boys loved a girl. [Logical scope: Did they all love the same girl or did they each love a different girl?]
15. John and Agnes married last year. [Pragmatic: Did they marry each other or did they each marry someone else?]

Examples like Sentence 12 are especially common. These are called “prepositional phrase attachment” ambiguities and derive from the uncertainty of whether the prepositional phrase is syntactically attached to (i.e. modifies) the direct object (“the girl” in the above example) or the verb phrase (where it gets an instrumental interpretation: the boy had the telescope and used it to see the girl). These can arise if the syntax 1) allows prepositional phrases; and 2) allows them to attach to both noun phrases and verb phrases (or sentences).
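The effect of these two conditions can be illustrated by counting attachment choices. The following is a minimal sketch, not a parser: the function name `attachments` and the site labels are assumptions made for illustration, and the sketch deliberately ignores the further possibility that a phrase attaches to the noun inside a preceding prepositional phrase (“the hill with the telescope”).

```python
from itertools import product


def attachments(pps, allowed_sites):
    """Enumerate every way of attaching each prepositional phrase
    to one of the allowed attachment sites."""
    return [dict(zip(pps, choice))
            for choice in product(allowed_sites, repeat=len(pps))]


# Prepositional phrases of Sentence 12.
pps = ["on the hill", "with the telescope"]

# Both sites allowed: every phrase has two possible attachments.
free = attachments(pps, ["verb phrase", "noun phrase"])

# Only one site allowed: each phrase has exactly one attachment.
restricted = attachments(pps, ["verb phrase"])

print(len(free), len(restricted))  # 4 1
```

With both sites available the number of readings doubles with every added prepositional phrase; with a single site there is exactly one reading, whatever the number of phrases.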
Since we want CE to be completely unambiguous, we need to somehow eliminate this source of uncertainty. One way is to prohibit prepositional phrases completely, but that would severely limit what could be said in CE. Another way is to allow prepositional phrases with restrictions: only allow prepositional phrases to modify either noun phrases or verb phrases, but not both. For example, we could allow prepositional phrases to modify only verb phrases and not noun phrases. To capture similar meanings on a noun phrase (like “on the hill” above), we would require that it be paraphrased. The easiest way would probably be to allow relative clauses. For example, Sentence 12 above would become one of the following two, depending on which meaning was intended:

16. The boy saw the girl who was on the hill with the telescope. [the boy used the telescope to see the girl]
17. The boy saw the girl who was on the hill and (who) had the telescope. [the girl had the telescope]

Of course, we cannot avoid this type of ambiguity in the text we are extracting information from but, once we have used heuristics to determine which interpretation is most likely, we can represent it in CE unambiguously.

Language for Describing Language: Verbs

In addition to assigning a part-of-speech to words, additional information is needed to interpret sentences and appropriately extract the information they contain. In particular, verbs, as the core of a sentence (or a clause), have complex semantic structures, including temporal structure and various semantic relationships with complements of various types (such as noun phrases, prepositional phrases, and sentential complements, both finite and non-finite). First, while adjectives prototypically refer to states and verbs prototypically refer to processes or activities, this is not always the case. “Know” in “The authorities know about him” is stative; the situation does not change over time.
This is true in general of perceptual and cognitive verbs (e.g. “see”, “believe”) and others (e.g. “inhabit”). In addition, processes can have more complex temporal structure. For example, verbs can also be described in terms of conditions or predications that hold before the event starts, those that hold during the event, and those that hold at the end of the event or after it is over. For example, in the situation described by “John put the book on the table”, prior to the process of “putting” the book is not on the table, during it John is (typically) in contact with the book or somehow manipulating it, and after the event the book is on the table.

Verbs can also be distinguished by the semantic types of the arguments they take. For example, the subject of “put” is typically an agent, i.e. a person, an organization, a self-driven machine (a robot), or a force of nature (the wind). The direct object just has to be something concrete (although there are other senses of “put” that take abstract objects like “fear”), and “put” requires a destination, which must be a location. These restrictions on the complements of a verb are called “selectional restrictions”. They apply to the complements that play certain semantic roles in a sentence (called “thematic” roles), not to syntactic positions. The most obvious example is that in passive sentences like “The book was put on the table (by John)”, the restriction of “agentness” applies not to the subject of the sentence, “book”, but rather to the object of the preposition “by” (if it is expressed at all). In fact, even if the agent is not expressed, it is understood that there is an agent involved in the situation being portrayed, and it may be specified or referred to subsequently:

18. The vase got broken.
19. John’s going to get in trouble.

“John” in sentence 19 likely refers to the unexpressed agent in 18, and this inference derives from the expected agent role associated with the verb “break”.
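The separation between syntactic position and thematic role can be sketched as follows. This is a simplified illustration under stated assumptions: the entry `PUT`, its role names (“agent”, “theme”, “destination”), the condition strings, and both helper functions are hypothetical and do not reproduce any CE construct or existing lexicon. The sketch shows that the active and passive forms of “put” yield the same role assignment, and that an agentless passive leaves the agent role open, to be filled from context.

```python
# Hypothetical, simplified lexical entry for "put" (an assumption for
# illustration): its thematic roles and the conditions that hold before,
# during, and after the event.
PUT = {
    "roles": ("agent", "theme", "destination"),
    "before": "theme is not at destination",
    "during": "agent is manipulating theme",
    "after": "theme is at destination",
}


def assign_roles(voice, subject, direct_object=None, by_phrase=None,
                 destination=None):
    """Map syntactic positions to thematic roles for the given voice."""
    if voice == "active":
        return {"agent": subject, "theme": direct_object,
                "destination": destination}
    if voice == "passive":
        # The syntactic subject is the theme; the agent, if expressed at
        # all, is the object of the preposition "by".
        return {"agent": by_phrase, "theme": subject,
                "destination": destination}
    raise ValueError(f"unknown voice: {voice}")


def unexpressed(entry, roles):
    """List the roles of the entry that the sentence left unfilled."""
    return [role for role in entry["roles"] if roles.get(role) is None]


active = assign_roles("active", subject="John", direct_object="the book",
                      destination="the table")
passive = assign_roles("passive", subject="the book", by_phrase="John",
                       destination="the table")
agentless = assign_roles("passive", subject="the book",
                         destination="the table")

print(active == passive)            # True
print(unexpressed(PUT, agentless))  # ['agent']
```

Because selectional restrictions and the before/during/after conditions are stated over roles rather than positions, they need to be written only once per verb sense.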
In addition to the active/passive pairs, there are many other alternating pairs of structures. Just a few of these are illustrated below:

20. He gave the book to John. / He gave John the book.
21. Jane broke the window. / The window broke.
22. Harry broke the window with a hammer. / A hammer broke the window.
23. He scratched his arm. / He scratched himself on the arm.

In each case, the various selectional restrictions imposed by the verb apply to the complements that have a certain role with respect to the situation described by the verb, regardless of how that role is realized syntactically. Furthermore, verbs that can occur in the same set of patterns tend to have similar semantics [Kipper et al.]. For example, “crack”, “smash”, “rip”, and “shatter” enter into the same set of syntactic patterns with the same associated selectional restrictions as “break”.

This has several implications for the task of information extraction. First, ambiguous verbs can often be disambiguated based on the incompatibility of the semantic type of the subject, object, or other complement with the selectional restrictions associated with certain senses of the verb. The two senses of “drive” in the following sentences can be distinguished by the selectional restrictions on the direct object (or rather the “theme”):

24. The militia members drove the villagers from the village.
25. The soldier drove the tank into the village.

The first sense of “drive” takes an animate theme (“villagers”, here); the second takes a vehicle as a theme (“tank”). Note that different inferences can be made depending on which sense of “drive” is intended. Sentence 24 does not imply that the militia members left the village, while 25 implies that the soldier, along with the tank he was driving, ended up in the village.

Second, entities satisfying the selectional restrictions of unexpressed complements (e.g. the unexpressed agent of a passive sentence) can often be found in neighboring sentences, filling out the meaning of the sentence (see sentences 18 and 19 above). Third, the semantics of the verb are (largely) common across the different constructions, but only if the logical expressions are stated in terms of the entities filling the different semantic roles rather than directly from the syntax of the sentence. Fourth, certain high-level semantics are largely common across the different verbs in a class, as defined above by the verbs which enter into the same set of syntactic constructions (e.g. the “break” class mentioned above). Fifth, thematic roles tend to be mapped to common semantic inferences, allowing generalizations even across verb classes. For example, if a verb has a patient, then the verb portrays a change in it, and if the verb also has an agent, the agent causes that change.

In order to reap the various benefits mentioned in the previous paragraphs, it is necessary to have a relatively complete and consistent set of thematic roles. These need to go beyond “agent”, “patient”, and “instrument”. We need to add, minimally, “recipient”, “beneficiary”, “location”, “destination”, “(locative) source”, and “experiencer”. We probably also need to reserve “patient” for objects that undergo the change referred to by the verb and “agent” for entities which actually cause the change (whether intentionally or not). We would then need to assign other roles to subjects that are not causally associated with the changes referred to by the verb, such as “experiencer” (as in the subject of many psychological and perception verbs, like “fear” and “see”) and “theme” (for subjects of verbs that refer to a simple location or change of location). In addition, there are some minor thematic roles that we may or may not need to use (see Palmer for a list of these).
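The use of selectional restrictions to separate the two senses of “drive” in sentences 24 and 25 can be sketched as follows. The tiny IS-A hierarchy, the sense labels, and the function names are assumptions made for illustration only; they are not drawn from the CE ontology or from any existing verb lexicon.

```python
# Hypothetical fragment of an entity hierarchy (child -> parent).
IS_A = {
    "villager": "person",
    "soldier": "person",
    "person": "animate",
    "animate": "entity",
    "tank": "vehicle",
    "vehicle": "entity",
}

# Hypothetical sense inventory: each sense of a verb is paired with the
# selectional restriction it places on its theme.
SENSES = {
    "drive": [
        ("force_to_move", "animate"),
        ("operate_vehicle", "vehicle"),
    ],
}


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or lies below it in IS_A."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


def senses_for(verb, theme_concept):
    """Return the senses of `verb` whose restriction the theme satisfies."""
    return [sense for sense, restriction in SENSES[verb]
            if is_a(theme_concept, restriction)]


print(senses_for("drive", "villager"))  # ['force_to_move']
print(senses_for("drive", "tank"))      # ['operate_vehicle']
```

Once a single sense survives, the inferences attached to that sense (for example, that the soldier ends up in the village in sentence 25) can be drawn; if several senses survive, other heuristics would still be needed.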
In addition, the entity hierarchy must be sufficiently in line at the top level to support the statement of the selectional restrictions needed on the various thematic roles. Finally, the various verbs that we will encounter in our domain must be assigned to the various verb classes, and the semantic mapping from the thematic roles to logical expressions must be spelled out.

There are a couple of existing resources that might be used for this. One is VerbNet [Palmer; Kipper], which follows the above approach very explicitly but, although it has been expanded a couple of times, is still far from complete. For example, it lacks both senses of “drive” mentioned in sentences 24 and 25. In addition, many of the example sentences strike me as rather odd English, as though they may have been generated by a non-native speaker, limiting the usefulness of the resource somewhat. Another resource with a similar if not identical approach to mapping syntax to semantics with a frame-like representation is FrameNet [Ruppenhofer et al.; FrameNet]. Where VerbNet focuses exclusively on defining “frames” for verbs, FrameNet defines frames that serve as the basis for defining not only verbs, but also adjectives, nouns (based on their role in a frame), and a few prepositions (primarily those related to location and path). FrameNet also seems to be a little deeper in its analysis of the semantic structure of verbs, and its examples are apparently actually found in printed English rather than made up. On the other hand, FrameNet describes the semantics of verbs (and other lexical items) in ordinary English rather than in a logical format like VerbNet's, limiting its usefulness in making inferences automatically. FrameNet also, apparently, covers fewer verbs (although it is difficult to assess this directly). The [Unified Verb Index] is an on-line search tool that allows the user to search for a verb and find its representation in VerbNet, as well as FrameNet, [PropBank], and [OntoNotes].
References

Ruppenhofer, J., Ellsworth, M., Petruck, M. R. L., Johnson, C. R., and Scheffczyk, J. 2010. FrameNet II: Extended Theory and Practice (ebook). https://framenet2.icsi.berkeley.edu/docs/r1.5/book.pdf

FrameNet. https://framenet.icsi.berkeley.edu/fndrupal/

Kipper, K., Korhonen, A., Ryant, N., and Palmer, M. 2008. A Large-Scale Classification of English Verbs. In the Journal of Language Resources and Evaluation. 42(1). 21-40.

OntoNotes. http://www.bbn.com/ontonotes/.

Palmer, M. VerbNet: A Class-Based Verb Lexicon. http://verbs.colorado.edu/~mpalmer/projects/verbnet.html.

PropBank. http://verbs.colorado.edu/~mpalmer/projects/ace.html.

Unified Verb Index. http://verbs.colorado.edu/verb-index/index.php.

Vendler, Z. 1967. Linguistics in Philosophy. Ithaca, NY: Cornell University Press.
