Flexible Configuration of Information Extraction to Support Richer Propositional
Constructs
The ultimate goal of our project is to allow various members of the coalition to access necessary information that
has been created by other people and teams, often from other divisions, other branches of the military, even other
countries. One aspect of this is creating a common query language that can bridge different conceptualizations of the
domain, whether due to different cultures, different dialects, or different areas of expertise. However, a query
language is not enough; the information must be entered into the system to be accessible. Much of this information
already exists in the form of free text memos, reports, and other documents and it must be re-coded in order to be
entered into a system in a form that is broadly accessible. This is where information extraction comes in. Various
aspects of the information in those free text documents must be extracted, including various types of entities and
their attributes and relations to each other, various types of events along with their participants, location, time, and
how certain we can be of the report. However, although a certain amount of information extraction can be done with
a fairly general system, much of it requires knowledge that only those familiar with the domain can provide,
including knowledge of the types of people, objects, locations, events, and situations that are relevant and the
language used to describe them. So a fundamental goal of this project is to provide a system and a language for these
domain experts to express this knowledge in. In pursuit of this goal, we are developing a system for performing
ontology-based information extraction using a controlled natural language. Having an ontology-based system will
simplify the extraction by allowing us (or rather the domain experts) to describe constraints on the kinds of
information that are expected to be found (or at least to be useful), thus aiding the process of interpreting the text. The
controlled natural language, a subset of ordinary language (English in our case), is ITA-CE (hereafter just CE).
Because it is ordinary English it will be relatively easy for these domain experts to use yet sufficiently precise and
unambiguous that a computer can interpret the input of the domain experts and use it to automatically extract the
necessary knowledge from free text.
Users of CE
While the ultimate end-users are commanders at various levels, war fighters, and peace keeping teams, who need to
query the system and access needed information, there are several other types of users of the system, and CE will
have to serve all of these. In addition to the commanders, there are analysts who will also need to query the
knowledge. These two sets of users will require a query language, though possibly at different levels of abstraction.
Then there are the domain experts who will have to put knowledge into the system (likely the analysts themselves).
It needs to be kept in mind that both of these groups of users will have little or no expertise in linguistic analysis or
knowledge representation; nevertheless, they need to use their domain expertise to modify and extend an
information extraction system to extract ever increasing amounts of useful information from free text, often in a
constantly changing environment. There are two types of information they will have to provide: domain knowledge
(i.e. extending the ontology or domain model) and the language used to express it. The system will have to provide a
sufficiently intuitive language to allow them to enter this knowledge.
In addition to these two end-user roles, the system will also require knowledge representation experts and natural
language processing experts to create the basic system and provide direction and tools for its expansion and
modification by domain experts. CE will have to serve them as well, providing for the former a language for
creating and structuring an ontology or domain model and for the latter a means of describing language and its
structure, as well as a means to map between the two.
This report will discuss the current state of CE and some of the extensions that will be required in order to provide
an easy-to-use language that will serve all of the above types of users.
Current State of the System and Language
CE currently has a solid basis for all of these roles. There is a basic high-level ontology expressed in CE and CE
provides a means for introducing new concepts into the ontology, assigning them the appropriate relations to other
elements of the ontology (the IS-A or subclass relation), and providing for various attributes that an instance of a
particular entity concept may have and for relations that instances of different entity concepts may have with
each other. The meaning of a new concept can be clarified by identifying it with a “synset” in WordNet (where a
synset is a set of words that are synonyms, i.e. have, at least approximately, the same meaning). There is also a way
to introduce real-world instances of concepts (in the form of entity identifiers) and their connection to the ontology,
i.e. as instances or realizations of those concepts with the “realizes” relationship. One can also express the identity
relationship between two instances, the fact that they represent the same real-world entity, to allow us to express the
fact that two entity identifiers derived from two different language referents (e.g. pronouns and their antecedents)
actually represent the same real-world entity (e.g. person or organization).
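The relations described in this paragraph can be illustrated with a minimal Python sketch. It is not the CE implementation: the class name `Ontology`, its method names, and the sample concepts and identifiers are all hypothetical, chosen only to show how IS-A links (with multiple parents allowed), the "realizes" relation, and identity between entity identifiers might interact.

```python
from collections import defaultdict


class Ontology:
    """Toy concept hierarchy: IS-A links, instances, and identity links (all names hypothetical)."""

    def __init__(self):
        self.parents = defaultdict(set)    # concept -> direct parent concepts
        self.realizes = defaultdict(set)   # instance id -> concepts it realizes
        self.same_as = {}                  # instance id -> representative id

    def add_concept(self, concept, *parents):
        self.parents[concept].update(parents)

    def add_instance(self, instance, concept):
        self.realizes[instance].add(concept)

    def ancestors(self, concept):
        """All concepts reachable by following IS-A links upward."""
        seen, stack = set(), [concept]
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def find(self, instance):
        """Representative of the identity class containing the instance."""
        while instance in self.same_as:
            instance = self.same_as[instance]
        return instance

    def identify(self, a, b):
        """Record that two identifiers denote the same real-world entity."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.same_as[ra] = rb

    def is_a(self, instance, concept):
        """True if the instance, or any identifier identified with it, realizes the concept."""
        root = self.find(instance)
        for other in list(self.realizes):
            if self.find(other) == root:
                for c in self.realizes[other]:
                    if c == concept or concept in self.ancestors(c):
                        return True
        return False


onto = Ontology()
onto.add_concept("entity")
onto.add_concept("agent", "entity")
onto.add_concept("person", "agent")
onto.add_instance("|p1|", "person")   # e.g. derived from a noun phrase
onto.add_instance("|p2|", "entity")   # e.g. derived from a pronoun
onto.identify("|p2|", "|p1|")         # pronoun and antecedent are the same entity
```

Once the two identifiers are identified, whatever is known about the antecedent also holds of the pronoun's identifier, so `onto.is_a("|p2|", "agent")` succeeds in this sketch.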
There are also ways of expressing the truth or falsity of a sentence, the degree of certainty the system has in a certain
sentence, and whether a sentence is merely an assumption rather than an established fact. CE allows the expression
of if-then rules, which can be used to make inferences from a fact base. Since situations are full-fledged entities,
reification of sentences is possible, allowing reference to them as part of a description of the rationale behind a
proposition based on a chain of rule-based inferences. This ability to describe a rationale in CE, together with the
possibility of sentences being ascribed the status of “assumption”, allows what-if scenarios to be explored in the
language. Finally, there is a means for querying the fact base for the existence of entities or situations that fit a
certain description, or a count of the entities or situations that fit a certain description.
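As an illustration only, rule-based inference with a recorded rationale and a counting query might be sketched as follows. The triple representation, the "?x" variable convention, the rule name, and the sample facts are assumptions made for this sketch and are not CE syntax.

```python
def match(pattern, fact, bindings):
    """Unify one pattern triple with one fact triple; return extended bindings or None."""
    if len(pattern) != len(fact):
        return None
    bindings = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):                  # "?x" marks a variable (assumed convention)
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings


def match_all(premises, facts, bindings):
    """Yield every set of bindings that satisfies all premises against the facts."""
    if not premises:
        yield bindings
        return
    for fact in facts:
        b = match(premises[0], fact, bindings)
        if b is not None:
            yield from match_all(premises[1:], facts, b)


def forward_chain(facts, rules):
    """Apply if-then rules until nothing new is derived; keep a rationale per derived fact."""
    facts = set(facts)
    rationale = {}
    changed = True
    while changed:
        changed = False
        for name, premises, conclusion in rules:
            for b in list(match_all(premises, facts, {})):
                new = tuple(b.get(t, t) for t in conclusion)
                if new not in facts:
                    facts.add(new)
                    support = [tuple(b.get(t, t) for t in p) for p in premises]
                    rationale[new] = (name, support)
                    changed = True
    return facts, rationale


# Hypothetical fact base and rule.
facts = {("|m2|", "is located in", "|v1|"), ("|v1|", "is located in", "|r1|")}
rules = [(
    "located-in-transitive",
    [("?x", "is located in", "?y"), ("?y", "is located in", "?z")],
    ("?x", "is located in", "?z"),
)]
derived, why = forward_chain(facts, rules)

# Counting query: how many entities are located in |r1|?
count = sum(1 for s, r, o in derived if r == "is located in" and o == "|r1|")
```

The `why` dictionary plays the part of the rationale: for each inferred fact it names the rule and the supporting facts, which is the information a reified description of the inference chain would need.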
As for describing language and its structure, CE can currently introduce new words and ascribe to them their
basic part-of-speech (e.g. noun, verb, adjective, preposition) and important subcategories of these (e.g. present and
past participles of verbs). It can also describe certain phrasal categories (sentence, noun phrase, verb phrase,
prepositional phrase), and describe them in terms of their head word (e.g. the head noun of a noun phrase or the head
verb of a verb phrase) and its dependents (e.g. verb complements like direct objects; verb modifiers like adverbs and
(most) prepositional phrases; or noun modifiers like adjectives and (most) prepositional phrases). It can also ascribe
certain feature annotations to phrases, allowing, for example, the statement of certain verb subcategorization
restrictions like what kind of a prepositional phrase complement a particular verb takes. It has the beginning of an
ability to describe the semantic roles various complements of verbs can have (e.g. agent, patient, instrument), which
can play an important role in extracting the roles of different participants in situations despite varying surface
representations. For example, the same participant can surface as the subject of a passive sentence or as the direct
object or prepositional phrase complement of an active sentence:
1. The villager was shot.
2. They shot the villager.
3. They shot at the villager.
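The point of the three sentences above can be sketched in a few lines of Python. The clause dictionaries stand in for the output of a parser, and the `patient_preposition` entry is a hypothetical piece of verb subcategorization information; neither is part of CE as it stands.

```python
# Hypothetical lexical entry: which preposition can introduce the patient of each verb.
VERB_FRAMES = {"shoot": {"patient_preposition": "at"}}


def patient_of(clause, verb_frames=VERB_FRAMES):
    """Return the phrase filling the patient role, whatever its surface position."""
    frame = verb_frames[clause["verb"]]
    if clause["voice"] == "passive":
        return clause.get("subject")           # passive: the subject is the patient
    if "object" in clause:
        return clause["object"]                # active: the direct object is the patient
    prep = frame.get("patient_preposition")
    return clause.get("pp", {}).get(prep)      # active: prepositional phrase complement


clauses = [
    {"verb": "shoot", "voice": "passive", "subject": "the villager"},
    {"verb": "shoot", "voice": "active", "subject": "they", "object": "the villager"},
    {"verb": "shoot", "voice": "active", "subject": "they", "pp": {"at": "the villager"}},
]
patients = [patient_of(c) for c in clauses]
```

All three clauses yield the same role filler, which is what allows a single extraction rule to be stated over the role rather than over each surface pattern.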
CE also has a means for stating what words may be used to express a particular entity or relation concept
(“expresses”), providing for a basic association between the ontology or domain model and language. One can also
use CE to state that a certain linguistic expression (e.g. a noun phrase or a sentence) “stands for” a particular entity
or situation.
Suggestions for Extending CE
There are some current limitations of CE, if not in general expressiveness then in the ability to express things in a
stylistically felicitous and possibly more succinct manner. In this section, this report will focus on adjectives (a later
report will discuss prepositional phrases and relative clauses).
The syntax of CE currently does not allow for adjectives. Note that CE is perfectly capable of describing non-CE
free text that has adjectives in it, but CE itself does not have adjectives. The meaning of adjectives is currently
captured by considering them as representing a class in the entity ontology. For example, the adjective “Christian” in
“the Christian market” is currently rendered as the entity concept “the christianentity |c2|”, which also happens to be
a “market” (the CE meta-ontology allows an instance or concept to have multiple parents in the IS-A hierarchy).
4. the christianentity '|c2|' …
5. the christianentity '|c2|' is a market
There are several ways of handling the semantics of “Christian” that stay truer to its adjectival nature, treating it as a
property or attribute rather than an entity. First, the market should be seen as an instance of the concept associated
with the head noun, so we would have “the market ‘|m2|’” rather than “the christianentity ‘|c2|’”. The adjective then
could be represented in any of the following ways. The first stays closest to the existing CE syntax:
6. the market ‘|m2|’ …
   a. the market has christian as religious_affiliation
   [or possibly]
   b. the market has christianity as religious_affiliation
These (especially 6.b) are analogous to the existing CE syntax in sentences like “X has Y as father”. 6.b still reifies
the adjective “Christian” into “Christianity”, but in a more felicitous way than “christianentity”. It treats it as a
property of the market and doesn’t require adding spurious entity concepts like “christianentity” to the ontology.
Extending the CE syntax somewhat, we could have either the attributive use of adjectives (where they occur before
the noun they modify):
7. the christian market ‘|m2|’ …
Or the predicative use of adjectives, where they occur after a copula verb like “be”:
8. a. the market ‘|m2|’ …
   b. the market ‘|m2|’ is christian
8 is less concise, separating the two predicates into two sentences, but it would perhaps be more accessible to
queries of the form “what entities are christian?”.
9. the market ‘|m2|’ which is christian
Sentence 9 uses a relative clause containing a predicate adjective to keep the two predicates in a single sentence, but
nevertheless has a form more similar to the pair of sentences in 8 while also maintaining the primary categorization
of the entity ‘|m2|’ as a market.
Sentence 10 illustrates yet another possibility, with similarities to Sentence 6. Other than being slightly more
felicitous, it does not offer any distinct advantages over 6.
10. the market ‘|m2|’ …
   a. the market has the religious_affiliation christian
   b. the market has the religious_affiliation christianity
Note that the approach taken in Sentence 6 is the same as having a slot in a frame representation, where the slot has the
name or type “religious_affiliation” and, in this case, the value “Christian” or “Christianity”. This could in fact be
used to guide information extraction: once we know that we have a large public institution like a market in certain
areas of the world (like the Middle East), it may be likely that they have a religious affiliation, with a finite number
of possible values (e.g. “Christian”, “Muslim”, “Shiite”, “Sunni” etc.). The information extraction agent could then
search for a modifier nearby or even in another sentence that supplied the appropriate value for the slot.
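A minimal sketch of this slot-driven search is given below. The slot table, the function name `fill_slots`, and the two sample sentences are invented for illustration; a real extraction agent would work over parsed text rather than raw word lists.

```python
# Hypothetical frame definition: slots a concept may have, each with a finite value set.
SLOTS = {
    "market": {"religious_affiliation": {"christian", "muslim", "shiite", "sunni"}},
}


def fill_slots(concept, sentences):
    """Scan sentences in order for a word that is a legal value of each slot of the concept."""
    filled = {}
    for slot, allowed in SLOTS.get(concept, {}).items():
        for sentence in sentences:
            words = [w.strip(".,;:!?\"'").lower() for w in sentence.split()]
            hit = next((w for w in words if w in allowed), None)
            if hit:
                filled[slot] = hit
                break                      # first matching value wins in this sketch
    return filled


# Invented sample text: the value is supplied by a modifier in a later sentence.
text = [
    "An explosion damaged the market.",
    "The Christian market had reopened last week.",
]
result = fill_slots("market", text)
```

Here the first sentence introduces the market without a value for the slot, and the search picks up "Christian" from the following sentence.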
A possible disadvantage of this is that it would require us to categorize adjectives so that they can match appropriate
slots. This is not difficult with adjectives like “Christian” or color (“red”), shape (“round”), height (“tall”) or weight
(“slender”), but might be difficult with some other adjectives. If this approach is taken (and is the only one), then
some kind of generic category like “property” would have to be used to label these attributes.
Another possible problem with adjectives is the fact that some can only be used attributively and some can only be
used predicatively. “Former” in the following sentence is an example of an adjective that can only be used
attributively.
11. Ba’athist website promotes return of the party to its former prominence.
One cannot say “its prominence was former”. In order to capture the meaning of this sentence, either the attributive
use of adjectives would have to be allowed or all the implications of “former prominence” would have to be made
directly from the sentence (e.g. it was prominent at some time prior to the utterance/publication of the sentence and
it is no longer prominent at the time of the utterance/publication of the sentence).
In general, the choice needs to pay attention to several different criteria. The syntax must:
A. Allow everything that needs to be expressed to be expressible
B. Support information extraction from text
C. Support the expected queries
D. Be at least reasonably felicitous for English speakers
There may not be a single solution that satisfies all of these criteria perfectly. Of course, there is nothing to prevent
CE from employing more than one of these approaches. If the simplest and most concise version, the attributive
syntax, is chosen (Sentence 7), we would probably also want to allow the predicative use, to support certain queries,
with each form (typically) derivable from the other by a general inference rule.
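To make the idea of a general inference rule between the two forms concrete, the following sketch converts between the attributive and predicative renderings as plain strings. The regular expressions, the small adjective lexicon, and the function names are assumptions of this sketch; they cover only the simple sentence shapes shown in Sentences 7 and 8.

```python
import re

# Hypothetical adjective lexicon; a real system would draw on the CE word definitions.
ADJECTIVES = {"christian", "muslim"}

ATTRIBUTIVE = re.compile(r"the (?P<mods>(?:\w+ )*)(?P<noun>\w+) (?P<id>'\|\w+\|')")
PREDICATIVE = re.compile(r"the (?P<noun>\w+) (?P<id>'\|\w+\|') is (?P<adj>\w+)")


def attributive_to_predicative(sentence):
    """Split "the ADJ NOUN 'id'" into the bare categorization plus one predication per adjective."""
    m = ATTRIBUTIVE.fullmatch(sentence)
    if m is None:
        raise ValueError(f"not an attributive noun phrase: {sentence!r}")
    mods = m["mods"].split()
    unknown = [w for w in mods if w not in ADJECTIVES]
    if unknown:
        raise ValueError(f"unknown adjectives: {unknown}")
    base = f"the {m['noun']} {m['id']}"
    return [base] + [f"{base} is {adj}" for adj in mods]


def predicative_to_attributive(sentence):
    """Fold "the NOUN 'id' is ADJ" back into "the ADJ NOUN 'id'"."""
    m = PREDICATIVE.fullmatch(sentence)
    if m is None or m["adj"] not in ADJECTIVES:
        raise ValueError(f"not a predicative adjective sentence: {sentence!r}")
    return f"the {m['adj']} {m['noun']} {m['id']}"


split_form = attributive_to_predicative("the christian market '|m2|'")
joined_form = predicative_to_attributive("the market '|m2|' is christian")
```

An adjective such as "former", which has no predicative use, would have to be excluded from the lexicon used by such a rule, which is one reason the derivation holds only typically.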
Sources of Ambiguity and Suggestions for Limiting It
Ambiguity can come from several sources. There is lexical ambiguity, deriving from the different meanings of a
word (e.g. “bank”, “star”, “tank”). There are also other sources of ambiguity. Consider the following sentences:
12. The boy saw the girl on the hill with the telescope.
[Syntactic: Who had the telescope, the boy or the girl?]
13. Flying airplanes can be dangerous.
[Syntactic: Are the airplanes dangerous or is the act of flying them?]
14. Five boys loved a girl.
[Logical scope: Did they all love the same girl or did they each love a different girl?]
15. John and Agnes married last year.
[Pragmatic: Did they marry each other or did they each marry someone else?]
Examples like the first are especially common. These are called “prepositional phrase attachment” ambiguities and
derive from the uncertainty of whether the prepositional phrase is syntactically attached to (i.e. modifies) the direct
object (“the girl” in the above example) or the verb phrase (where it gets an instrumental interpretation – the boy had
the telescope and used it to see the girl). These can arise if the syntax 1) allows prepositional phrases; and 2) allows
them to attach to both noun phrases and verb phrases (or sentences).
Since we want CE to be completely unambiguous, we need to somehow eliminate this source of uncertainty. One
way is to prohibit prepositional phrases completely, but that would severely limit what could be said in CE. Another
way is to allow prepositional phrases with restrictions: only allow prepositional phrases to modify either noun
phrases or verb phrases, but not both. For example, we could only allow prepositional phrases to modify verb
phrases and not noun phrases. To capture similar meanings on a noun phrase (like “on the hill” above), we would
require that it be paraphrased. The easiest way would probably be to allow relative clauses. For example, sentence 12
above would become one of the following two, depending on which meaning was intended:
16. The boy saw the girl who was on the hill with the telescope.
[the boy used the telescope to see the girl]
17. The boy saw the girl who was on the hill and (who) had the telescope.
[the girl had the telescope]
Of course, we cannot avoid this type of ambiguity in the text we are extracting information from but, once we have
used heuristics to determine which interpretation is most likely, we can represent it in CE unambiguously.
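The effect of restricting attachment sites can be shown by enumerating the readings of a sentence like 12. The sketch below assumes, as is standard but not stated in this report, that attachments may not cross, so each prepositional phrase can only attach to a head on the "right frontier" of the structure built so far. The function name and the data layout are hypothetical.

```python
def attachments(verb, obj, pps, allowed=("verb", "noun")):
    """Enumerate non-crossing attachment readings for a verb, its object, and trailing PPs.

    pps is a list of (preposition, noun phrase) pairs; allowed lists the kinds of
    head ("verb", "noun") that a prepositional phrase is permitted to modify.
    """
    results = []

    def extend(frontier, remaining, chosen):
        if not remaining:
            results.append(chosen)
            return
        (prep, noun), rest = remaining[0], remaining[1:]
        for i, (head, kind) in enumerate(frontier):
            if kind in allowed:
                # Attaching to frontier[i] closes off every head to its right.
                new_frontier = frontier[: i + 1] + [(noun, "noun")]
                extend(new_frontier, rest, chosen + [(f"{prep} {noun}", head)])

    extend([(verb, "verb"), (obj, "noun")], list(pps), [])
    return results


pps = [("on", "the hill"), ("with", "the telescope")]
unrestricted = attachments("saw", "the girl", pps)
verb_only = attachments("saw", "the girl", pps, allowed=("verb",))
```

With both attachment sites allowed the sketch produces five readings; with attachment restricted to the verb phrase only one remains, which is the property CE needs.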
Language for Describing Language: Verbs
In addition to assigning a part-of-speech to words, additional information is needed to interpret sentences and
appropriately extract the information they contain. In particular, verbs, as the core of a sentence (or a clause), have
complex semantic structures, including temporal structure and various semantic relationships with complements of
various types (such as noun phrases, prepositional phrases, and sentential complements, both finite and non-finite).
First, while adjectives prototypically refer to states and verbs prototypically refer to processes or activities, this is
not always the case. “Know” in “The authorities know about him” is stative; the situation does not change over time.
This is true in general of perceptual and cognitive verbs (e.g. “see”, “believe”) and others (e.g. “inhabit”). In
addition, processes can have more complex temporal structure. For example, verbs can also be described in terms of
conditions or predications that hold before the event starts, those that hold during an event, and those that hold at the
end or after an event is over. For example, in the situation described by “John put the book on the table”, prior to the
process of “putting”, the book is not on the table, during it John is (typically) in contact with the book or somehow
manipulating it, and after the event the book is on the table. Verbs can also be distinguished by the semantic types of
the arguments they take. For example, the subject of “put” is typically an agent, i.e. a person, an organization, a self-driven machine (a robot), or a force of nature (the wind). The direct object just has to be something concrete
(although there are other senses of “put” that take abstract objects like “fear”) and “put” requires a destination which
must be a location. These restrictions on the complements of a verb are called “selectional restrictions”. They apply
to the complements that play certain semantic roles in a sentence (called “thematic” roles), not to syntactic positions.
The most obvious example is that in passive sentences like “The book was put on the table (by John)”, the restriction
of “agentness” applies not to the subject of the sentence, “book”, but rather to the object of the preposition “by” (if it
is expressed at all). In fact, even if the agent is left unexpressed, it is understood that there is an agent involved in the
situation being portrayed, and it may be specified or referred to subsequently:
18. The vase got broken.
19. John’s going to get in trouble.
“John” in sentence 19 likely refers to the unexpressed agent in 18, and this inference derives from the expected
agent role associated with the verb “break”.
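One possible way to record the temporal structure and selectional restrictions of a verb like “put” is sketched below. The class `VerbFrame`, the role names "theme" and "destination", and the predicate strings are assumptions of this sketch, not an existing CE or VerbNet format. An unexpressed role, such as the agent of a passive, is kept as an explicit placeholder so that a later sentence can supply it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbFrame:
    """Hypothetical lexical entry: role restrictions plus before/during/after predications."""
    verb: str
    restrictions: dict   # thematic role -> required semantic type
    before: tuple        # templates that hold before the event starts
    during: tuple        # templates that hold while the event is under way
    after: tuple         # templates that hold once the event is over

    def instantiate(self, fillers):
        """Substitute role fillers into each phase; missing roles become placeholders."""
        filled = {
            role: fillers.get(role, f"<unexpressed {role}>")
            for role in self.restrictions
        }

        def fill(templates):
            return [t.format(**filled) for t in templates]

        return {
            "before": fill(self.before),
            "during": fill(self.during),
            "after": fill(self.after),
        }


PUT = VerbFrame(
    verb="put",
    restrictions={"agent": "agent", "theme": "concrete", "destination": "location"},
    before=("not on({theme}, {destination})",),
    during=("in_contact({agent}, {theme})",),
    after=("on({theme}, {destination})",),
)

# "John put the book on the table."
active = PUT.instantiate({"agent": "John", "theme": "the book", "destination": "the table"})
# "The book was put on the table." (agent not expressed)
passive = PUT.instantiate({"theme": "the book", "destination": "the table"})
```

The placeholder in the passive reading marks exactly the gap that a neighboring sentence, as in the pair 18 and 19, might fill.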
In addition to the active/passive pairs, there are many other alternating pairs of structures. Just a few of these are
illustrated below:
20. He gave the book to John. / He gave John the book.
21. Jane broke the window. / The window broke.
22. Harry broke the window with a hammer. / A hammer broke the window.
23. He scratched his arm. / He scratched himself on the arm.
In each case, the various selectional restrictions imposed by the verb apply to the complements that have a certain
role with respect to the situation described by the verb, regardless of how it is realized syntactically. Furthermore,
verbs that can occur in the same set of patterns tend to have similar semantics [Kipper et al.]. For example, “crack”,
“smash”, “rip”, and “shatter” enter into the same set of syntactic patterns with the same associated selectional
restrictions as “break”.
This has several implications for the task of information extraction. First, ambiguous verbs can often be
disambiguated based on the incompatibility of the semantic type of the subject, object, or other complement with the
selectional restrictions associated with certain senses of the verb. The two senses of “drive” in the following
sentences can be distinguished by the selectional restrictions of the direct object (or rather the “theme”):
24. The militia members drove the villagers from the village.
25. The soldier drove the tank into the village.
The first sense of “drive” takes an animate theme (“villagers”, here); the second takes a vehicle as a theme (“tank”).
Note that different inferences can be made depending on which sense of “drive” is intended. 24 does not imply that
the militia members left the village while 25 implies that the soldier, along with the tank he was driving, ended up in
the village.
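The disambiguation of “drive” in sentences 24 and 25 can be sketched as a lookup of the theme's semantic type against each sense's selectional restriction. The tiny type hierarchy, the sense labels, and the `agent_ends_at_destination` flag are hypothetical and serve only to illustrate the mechanism.

```python
# Hypothetical fragment of an entity hierarchy (child -> parent).
IS_A = {
    "villager": "person",
    "soldier": "person",
    "person": "animate",
    "animate": "entity",
    "tank": "vehicle",
    "vehicle": "entity",
}

# Hypothetical sense inventory with a selectional restriction on the theme.
SENSES = {
    "drive": [
        {"sense": "force-to-move", "theme": "animate", "agent_ends_at_destination": False},
        {"sense": "operate-vehicle", "theme": "vehicle", "agent_ends_at_destination": True},
    ],
}


def has_type(concept, required):
    """Walk up the IS-A chain to test whether a concept satisfies a required type."""
    while concept is not None:
        if concept == required:
            return True
        concept = IS_A.get(concept)
    return False


def disambiguate(verb, theme_concept):
    """Keep only the senses whose theme restriction the actual theme satisfies."""
    return [s for s in SENSES[verb] if has_type(theme_concept, s["theme"])]


sense_24 = disambiguate("drive", "villager")   # "... drove the villagers from the village"
sense_25 = disambiguate("drive", "tank")       # "... drove the tank into the village"
```

Each surviving sense carries its own inferences, so choosing the sense also determines whether the agent is taken to end up at the destination.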
Second, entities satisfying the selectional restrictions of unexpressed complements (e.g. the unexpressed agent of a
passive sentence) can often be found in neighboring sentences, filling out the meaning of the sentence (see sentences
18 and 19 above). Third, the semantics of the verb are (largely) common across the different constructions, but only
if the logical expressions are stated in terms of the entities filling the different semantic roles rather than directly
from the syntax of the sentence. Fourth, certain high-level semantics are largely common across the different verbs
in a class, as defined above by the verbs which enter into the same set of syntactic constructions (e.g. the “break”
class mentioned above). Fifth, thematic roles tend to be mapped to common semantic inferences, allowing
generalizations even across verb classes. For example, if a verb has a patient, then the verb portrays a change in it
and if the verb also has an agent, the agent causes that change.
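The last generalization can be written as a role-driven rule that does not mention any particular verb class. The function name and the tuple format of the inferences are assumptions of this sketch.

```python
def role_inferences(verb, roles):
    """Derive verb-independent inferences from the thematic roles that are filled.

    roles maps role names (e.g. "agent", "patient") to the entities filling them.
    """
    inferences = []
    if "patient" in roles:
        change = ("change", roles["patient"], verb)
        inferences.append(change)                       # the patient undergoes a change
        if "agent" in roles:
            inferences.append(("cause", roles["agent"], change))  # the agent causes it
    return inferences


# "Jane broke the window." / "The window broke."
transitive = role_inferences("break", {"agent": "Jane", "patient": "the window"})
intransitive = role_inferences("break", {"patient": "the window"})
```

Both alternants of sentence 21 yield the same change in the patient; only the transitive form adds the causal link to an agent.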
In order to reap the various benefits mentioned in the previous paragraphs, it is necessary to have a relatively
complete and consistent set of thematic roles. These need to go beyond “agent”, “patient”, and “instrument”. We
need to add, minimally, “recipient”, “beneficiary”, “location”, “destination”, “(locative) source”, and “experiencer”.
We probably also need to reserve “patient” for objects that undergo the change referred to by the verb and “agent”
for entities which actually cause the change (whether intentionally or not). We would then need to assign other roles
to subjects that are not causally associated with the changes referred to by the verb, such as “experiencer” (as in the
subject of many psychological and perception verbs, like “fear” and “see”) and “theme” (for subjects of verbs that
refer to a simple location or change of location). In addition, there are some minor thematic roles that we may or
may not need to use (see Palmer for a list of these).
In addition, the entity hierarchy must be sufficiently in line at the top level to support the statement of the selectional
restrictions needed on the various thematic roles.
Finally, the various verbs that we will encounter in our domain must be assigned to the various verb classes and the
semantic mapping from the thematic roles to logical expressions spelled out. There are a couple of existing
resources that might be used for this. One is VerbNet [Palmer; Kipper], which follows the above approach very
explicitly, but, although it has been expanded a couple of times, is still far from complete. For example, it lacks both
senses of “drive” mentioned in sentences 24 and 25. In addition, many of the example sentences strike me as rather
odd English, as though they may have been generated by a non-native speaker, limiting the usefulness of the
resource somewhat. Another resource with a similar if not identical approach to mapping syntax to semantics with
a frame-like representation is FrameNet [Ruppenhofer et al.; FrameNet]. Where VerbNet focuses exclusively on
defining “frames” for verbs, FrameNet defines frames that serve as the basis for defining not only verbs, but also
adjectives, nouns (based on their role in a frame), and a few prepositions (primarily those related to location and
path). FrameNet also seems to be a little bit deeper in its analysis of the semantic structure of verbs and has
examples that are apparently actually found in printed English rather than made up. On the other hand, FrameNet
describes the semantics of verbs (and other lexical items) in ordinary English rather than a logical format like
VerbNet, limiting its usefulness in making inferences automatically. FrameNet also, apparently, covers fewer verbs
(although it is difficult to directly assess this).
The [Unified Verb Index] is an on-line search tool that allows the user to search for a verb and find its representation
in VerbNet, as well as FrameNet, [PropBank], and [OntoNotes].
References
Ruppenhofer, J., Ellsworth, M., Petruck, M. R. L., Johnson, C. R., and Scheffczyk, J. 2010. FrameNet II: Extended
Theory and Practice (ebook). https://framenet2.icsi.berkeley.edu/docs/r1.5/book.pdf
FrameNet. https://framenet.icsi.berkeley.edu/fndrupal/
Kipper, K., Korhonen, A., Ryant, N., and Palmer, M. 2008. A Large-Scale Classification of English Verbs. In the
Journal of Language Resources and Evaluation. 42(1). 21-40.
OntoNotes. http://www.bbn.com/ontonotes/.
Palmer, M. VerbNet: A Class-Based Verb Lexicon. http://verbs.colorado.edu/~mpalmer/projects/verbnet.html.
PropBank. http://verbs.colorado.edu/~mpalmer/projects/ace.html.
Unified Verb Index. http://verbs.colorado.edu/verb-index/index.php.
Vendler, Z. 1967. Linguistics in Philosophy. Ithaca, NY: Cornell University Press.