Linguistic Knowledge for Search
Relevance Improvement
Gennady OSIPOV, Ivan SMIRNOV, Ilya TIKHOMIROV,
Olga VYBORNOVA and Olga ZAVJALOVA
Institute for System Analysis, Russia
Abstract. This paper presents methods and a software implementation for
semantically relevant search. We focus mainly on the use of linguistic
knowledge to improve semantic relevance in search engines. We argue that
information retrieval systems based on semantic search with refined linguistic
processing tools are more effective, and we discuss the advantages of semantic
search over traditional search engines.
Keywords. Knowledge-based semantic search, linguistic semantics, natural
language processing, meta-search.
Introduction
Most information today is presented as web documents and Internet/Intranet-distributed
databases. However, information has accumulated on the Internet rapidly and
rather chaotically, producing a paradox: the volume of accumulated information
grows much faster than our ability to search within it. Fast and efficient
search for necessary information is a common problem of modern society.
When we use modern search engines, the first documents returned for a query are
often links to documents that are non-relevant in the context of the query,
although search engines rate their relevance at 100 per cent. There are three
reasons for this:
- search engines use algorithms based on various statistical scores of document
texts, rather than on “understanding” the meaning of queries and the contents
of informational resources;
- a natural language query is usually ambiguous, and its meaning depends on the
situational context;
- the user is not always able to formulate his/her query exactly, and search
engines offer no help with this task.
The relevance problem is usually addressed by keyword-based linear search with
the help of statistical and some linguistic methods. Some systems claim to
support semantic search, natural-language query input, and question answering.
However, they use inadequate linguistic and software tools: they return large
arrays of documents, of which only a small portion is relevant.
Search precision can be improved if the following tools are employed [1][2][3][4]:
- tools for natural language processing;
- a linguistic knowledge base;
- tools for applied semantic analysis, comparison and ranking.
So the principal motivations behind our intelligent information search engine are:
- to extend standard keyword search engine mechanisms;
- to provide support for search queries in natural language;
- to improve search precision;
- to preserve the speed of standard search methods.
1. Linguistic processing
Most existing search methods rely on keyword search; they do not consider
semantic search, which incorporates language processing tools and offers an
alternative.
Keyword search often fails to satisfy the user's main requirement, namely that
the found documents be semantically relevant to the query, even when all
keywords of the query are present in those documents. Conversely, a document
can be semantically relevant to the query even if it contains none of the
query's keywords. For example, for the query “speech of the President of the
Russian Federation”, relevant documents might contain variants such as: speech
of the President of the Russian Federation, report of the President of the
Russian Federation, speech of the leader of the Russian Federation, speech of
the President of Russia, report of the leader of Russia, etc.
One more example. Query: "Visit of the president to Brussels". The same idea can
be expressed differently, say, "The president went to Brussels." To process the
query we introduce a semantic relation DEST(X, Y), which denotes that Y is a
destination of X. Then we search for DEST(president, Brussels).
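As an illustrative sketch of this idea: the relation name DEST comes from the example above, but the data structures and the lookup-table "analyzer" below are our own assumptions, not the system's actual implementation. Matching on relation triples makes paraphrases equivalent:

```python
# Sketch: matching semantic relation triples extracted from text.
# extract_triples is a toy stand-in for the semantic analyzer: a lookup
# table mapping known paraphrases to the same relation triple.
def extract_triples(text: str) -> list:
    table = {
        "Visit of the president to Brussels": [("DEST", "president", "Brussels")],
        "The president went to Brussels": [("DEST", "president", "Brussels")],
    }
    return table.get(text, [])

def matches(query: str, document: str) -> bool:
    """True if every relation triple of the query occurs in the document."""
    q = set(extract_triples(query))
    d = set(extract_triples(document))
    return bool(q) and q <= d

print(matches("Visit of the president to Brussels",
              "The president went to Brussels"))  # True
```

Both phrasings map to the same DEST(president, Brussels) triple, so the match succeeds even though the surface wordings differ.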
Naturally, natural language queries yield far better search relevance than
keyword search. That is why queries in the form of natural language discourse
should be allowed, which in turn requires semantic analysis of both the query
text and the retrieved documents.
A disadvantage of traditional keyword search is that a simple set of keywords
cannot convey the semantic direction of the query or “focus” it; in other
words, it cannot express the user's search need. A query is not just a set of
words. It is a sentence or phrase in which words are not merely juxtaposed but
are connected with each other according to certain syntactic rules.
When forming a discourse, syntax deals above all with meaningful units bearing
not only their individual lexical meaning but also a generalized categorical
meaning in constructions of various complexity. These units are called
syntaxemes: minimal indivisible semantic-syntactic structures of language
[5][6]. Syntaxemes are detected taking into account: a) the categorical
semantics of the word; b) its morphological form; c) its function in the
sentence (according to their constructive capabilities within a sentence, there
are three functional types of syntaxemes: free, conventional and bound).
Categorical semantics is a generalized meaning characterizing words that belong
to the same categorical class (for instance, the class of people, things, or
attributes for nouns).
Let us consider examples of syntaxemes that are similar in form but different
in meaning:
1. The mother brings her son to school.
2. Laziness brings researchers to trouble.
In (1) “to school” is a spatial noun with a directive meaning (direction of
movement); it is a free syntaxeme, since its meaning does not depend on its
position in the sentence. In (2) “to trouble” is an attributive noun meaning
logical consequence; it is a conventional syntaxeme, since its meaning is
realized only in a certain complicative model, in the position of a
semi-predicative complicator of the model.
In a particular discourse, in a particular sentence of a query, a word acts as
a syntaxeme: it has a certain syntactic function and a certain grammatical
form, and it realizes only one of the possible meanings the word can take. The
main task of semantic analysis is to reveal the semantic relations between
syntaxemes.
Semantic analysis substantially enhances the precision and recall of the
search and decreases the number of irrelevant documents returned.
Following this general overview, let us now consider the linguistic analysis
process in our system in more detail.
The main idea of semantic search is semantic processing of user query texts
and of returned documents. Semantic processing involves generating semantic
images of documents and matching the obtained images. As a result, additional
types of relevance are calculated that allow the removal of documents that
obviously do not correspond to the semantics of the search query.
Semantic processing of discourse proceeds in three stages: morphological,
syntactic and semantic analysis proper [7], [8], [9]. Each stage is performed
by a separate analyzer with its own input and output data and its own settings.
Interpretation of utterances involves the following milestones:
1. Morphological and syntactic analysis of the sentence/phrase.
2. Situational analysis of the sentence/phrase (building a subcategorization
frame, correlating the sentence/phrase situation with the situation described
in the system's knowledge base, selecting an adequate action).
3. Generation of the system answer.
1.1. Morphological analysis
At the morphological analysis stage, words and separators (if any) are
recognized in the discourse. Based on word morphology, the list of all possible
grammatical forms of each word is computed. Word forms corresponding to the
same normal (dictionary) form of the word, to the same part of speech, and to
the same number (singular or plural, for parts of speech that inflect for
number) are unified into groups, which we will further call lexemes (though
they are not lexemes in the strict linguistic sense).
Obviously, several such lexemes can correspond to the same word. To decrease
the number of resulting sentence variants, the morphological analyzer has a
filter: for every part of speech one can specify whether it is taken into
account in further analysis. By default, the settings ignore interjections and
particles if the word has variants belonging to other parts of speech.
The output is a list of sentences, each of which is a list of words; each word
in turn is a list of lexemes.
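A minimal sketch of this output structure follows; the toy analyses and grouping key are our own illustrative assumptions (the real system works over Russian morphological dictionaries):

```python
# Group possible analyses of each word into "lexemes":
# same lemma + part of speech + number.
from collections import defaultdict

# Toy analyses: surface form -> list of (form, lemma, POS, number).
ANALYSES = {
    "leaves": [("leaves", "leaf", "NOUN", "plural"),
               ("leaves", "leave", "VERB", "singular")],
    "fall":   [("fall", "fall", "VERB", "plural"),
               ("fall", "fall", "NOUN", "singular")],
}

def morph_analyze(sentence: str) -> list:
    """Return a list of words; each word is a list of lexemes,
    where a lexeme is a (lemma, POS, number) group of word forms."""
    result = []
    for word in sentence.split():
        groups = defaultdict(list)
        for form in ANALYSES.get(word.lower(), [(word, word, "UNKNOWN", "-")]):
            key = (form[1], form[2], form[3])  # lemma, POS, number
            groups[key].append(form)
        result.append(list(groups.keys()))
    return result

print(morph_analyze("leaves fall"))
```

Here the ambiguous word "leaves" yields two lexemes (noun "leaf" and verb "leave"), which is exactly the ambiguity the later syntactic stage must resolve.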
1.2. Syntactical analysis
The main task of syntactic analysis is to establish syntactic dependencies
between the lexemes revealed at the previous stage. In particular, syntactic
analysis extracts minimal semantic-syntactic structures (syntaxemes).
Both tasks can be solved within one sentence. Compound sentences are split
into simple clauses, which are then processed as separate sentences. For every
sentence acquired at the output of morphological analysis, a list of variants
is composed such that in each variant every word has exactly one lexeme. Since
the number of sentence variants equals the product of the numbers of lexemes of
the individual words, limiting the number of variants becomes essential. To
this end, heuristics are applied that reject obviously incorrect variants; in
addition, a maximum allowable number of variants can be set in the syntactic
analyzer settings.
After that, an algorithm for discovering subordinating syntactic relations is
applied to each variant. As a result, lexemes are unified into dependency
trees: the lexeme at a parent node governs all child-node lexemes. A minimal
semantic-syntactic structure (syntaxeme) is a tree whose root is a noun, or a
preposition that governs a noun. Note that proper names are also treated as
nouns. A noun phrase is any syntaxeme subtree that includes the root noun.
Besides searching for syntactic dependencies, syntactic analysis detects
homogeneous parts. Thus, two types of relations can be detected between lexemes
at this stage: government and homogeneity. Every time the program detects a
syntactic relation, the weight of the sentence variant is increased. At the end
of the sentence analysis, only the variants with maximal weight are kept.
Sentences with zero weight are deleted by default (this option can be changed
in the syntactic analyzer settings).
Thus, the input of syntactic analysis is a sentence acquired from the
morphological analysis. The output is the sentence in the form of a list of
variants, each of which is a list of dependency trees (a list of syntaxemes).
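The variant-pruning idea can be sketched as follows; the toy grammar rule, the weighting (one point per relation found), and the flat variant representation are our own assumptions, since the paper does not specify them:

```python
# Enumerate lexeme-assignment variants of a sentence and keep only the
# "heaviest" ones, where weight = number of syntactic relations found.
from itertools import product

def variant_weight(variant, relation_rules):
    """Count adjacent lexeme pairs licensed by a toy rule table."""
    return sum(1 for a, b in zip(variant, variant[1:])
               if (a, b) in relation_rules)

def best_variants(word_lexemes, relation_rules, max_variants=1000):
    # Cap the number of variants, as the analyzer settings allow.
    variants = list(product(*word_lexemes))[:max_variants]
    weights = [variant_weight(v, relation_rules) for v in variants]
    top = max(weights)
    # Keep maximal-weight variants; drop zero-weight ones.
    return [v for v, w in zip(variants, weights) if w == top and w > 0]

# Toy example: two words, each ambiguous between NOUN and VERB readings.
word_lexemes = [["NOUN", "VERB"], ["VERB", "NOUN"]]
rules = {("NOUN", "VERB")}  # a subject noun may precede its verb
print(best_variants(word_lexemes, rules))  # [('NOUN', 'VERB')]
```

Of the four possible variants, only NOUN+VERB triggers a relation and gains weight, so it alone survives, mirroring how the analyzer discards variants with no detected relations.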
1.3. Semantic analysis
The main task of semantic analysis, or, more exactly, of the kind of semantic
analysis described below (which can be deepened given a properly composed
knowledge base of the domain), is to reveal the semantic meanings of syntaxemes
and the relations on the set of syntaxemes. In general, a semantic relation is
understood as a relation between concepts in the conceptual system of the
domain. In the lexicon, semantic relations are represented by predicate words,
i.e. lexemes representing predicates. The main place here is occupied by verbs,
which as a rule take the central position in the semantic structure of the
sentence and decisively influence noun phrases and sentences. Detection of
relations is based on the concept of a subcategorization frame, widely used in
structural linguistics. Information about the syntactic compatibility of every
verb is stored in special tables of relations; for each verb, these tables
indicate the types of relations between its roles (arguments).
The semantic search image consists of an ordered set of triples
<relation, arg1, arg2>, where relation stands for the semantic relation type
and the arguments are dependency trees for the corresponding NP or PP.
When we speak about obligatory arguments (roles) of the verb, i.e. when the
verb's distributors belong to its subcategorization frame, we speak of bound
syntaxemes. Tables of roles and semantic relations are compiled for syntaxemes
of this type.
For optional roles, whose semantics and syntactic expression are not typical
of the given verb, we have free syntaxemes. A special heuristic algorithm has
been developed for them, and for those syntaxemes that are interrelated with
each other without a verb. Note that participles and adverbial participles, as
special verb forms, as well as verbal abstract nouns, can act as predicate
words.
To perform semantic analysis, it is first necessary to extract predicate words
(predicators). If the predicator of a sentence is a verb, it can be detected
immediately at the morphological analysis stage. In other cases (when the
predicator is a participle, a verbal noun, etc.) additional rules are applied.
Once the predicate word and the NPs related to it are detected, the valences
in the predicator's role structure are filled. The filling uses special
linguistic dictionaries in which a certain set of roles is associated with each
predicator. Rules that take the context and the domain into account are used to
resolve polysemy.
The dictionary also indicates how NPs are interrelated within role structures:
a set of binary relations on the set of roles is specific to each type of
predicate word and is defined a priori.
The totality of NPs, roles and binary relations is presented as a semantic
graph describing the situation in the neighborhood of one predicator. This
graph is a fragment of the semantic network describing the semantic image of
the whole text.
Every time a syntaxeme fills a predicator role, or two roles correspond to a
semantic relation, the program increases the weight of the sentence variant.
Hence, with simultaneous syntactic and semantic analysis, the variants that are
"heaviest" in terms of both syntactic and semantic relations are kept. That is
why simultaneous analysis is not equivalent to sequential analysis: in the
latter case, variants with the greatest number of syntactic dependencies are
selected first, and only then are those chosen among them in which the role
structures of the predicators are filled best and more semantic relations are
found. If a verb is polysemous (i.e. there are several dictionary entries for
one word), all its variants are considered one by one. The variant(s) finally
selected at the later stages of analysis are those in which syntaxeme meanings
are found for a greater number of NPs of the fragment and in which the
categorical semantics attributes fired most frequently. If several equivalent
variants remain, the variant with the maximum ratio of the number of syntaxemes
found in the sentence to the total number of syntaxemes described in the given
dictionary entry is chosen (i.e. the variant with the most complete filling of
the verb's role structure).
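The tie-breaking ratio can be sketched as follows; the entry names and role sets are illustrative assumptions, not the system's actual dictionary:

```python
# Tie-breaking between dictionary entries of a polysemous verb:
# prefer the entry with the highest ratio of syntaxemes found in the
# sentence to syntaxemes described in the entry.
def completeness(found: set, described: set) -> float:
    """Fraction of the entry's described syntaxemes found in the sentence."""
    return len(found & described) / len(described)

def select_entry(entries: dict, found: set):
    """entries: entry name -> set of syntaxemes its frame describes."""
    return max(entries, key=lambda name: completeness(found, entries[name]))

# Hypothetical two entries for a polysemous verb "love".
entries = {
    "love/1": {"subject", "object"},
    "love/2": {"subject", "object", "causative"},
}
found = {"subject", "object"}
print(select_entry(entries, found))  # love/1: ratio 2/2 beats 2/3
```

Both entries have the same number of filled roles, but "love/1" has its frame filled completely, so it wins the tie-break.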
Participial phrases and adverbial participial phrases are processed after the
corresponding main clauses. The subject of the main clause becomes the subject
of an adverbial participial phrase. Candidates for the subject/object of a
participial phrase are the nearest NPs whose roots agree with the participle in
gender, number and case.
Syntaxemes are also searched for interrogative words such as who, what, where,
why, when, how, how much, what for, and at what time. A special attribute is
attached to the meanings found (the roles in pairs <role, NP> representing the
semantics of the query and the document) and then to the relations in triples
<relation, NP1, NP2>. When the query is compared with a document (during
relevance calculation), this attribute allows the NPs of the query to be
matched against any NPs of the document whose categorical semantics corresponds
to the given interrogative word.
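A minimal sketch of the interrogative-word matching idea; the mapping table and class names are our own illustrative assumptions:

```python
# Match a query slot marked by an interrogative word against document NPs
# via categorical semantics classes.
WH_CLASS = {          # interrogative word -> categorical semantics class
    "who": "PERSON",
    "where": "PLACE",
    "when": "TIME",
}

def wh_match(wh_word: str, doc_np_class: str) -> bool:
    """A wh-marked query slot matches any document NP whose
    categorical class corresponds to the interrogative word."""
    return WH_CLASS.get(wh_word) == doc_np_class

print(wh_match("where", "PLACE"), wh_match("who", "PLACE"))  # True False
```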
At present the system works with the Russian language, and an English version
is under development. One further direction of work is to extend the system to
handle queries in other languages. Semantic roles and relations are universal,
so the main concepts of the system can be preserved.
1.4. Linguistic knowledge base
The linguistic knowledge base in our system consists of two components:
- a thesaurus containing a set of words and NPs associated with each other by
relations such as synonymy, partonymy, semantic likelihood, etc.;
- a predicate dictionary, which describes the ways in which different types of
semantic relations between concepts are realized in syntactic context. The list
of predicators for, say, the Russian language contains verbs and various
verb-derived forms. The predicate dictionary also contains the table of
relations between the set of syntaxemes and the set of predicator roles.
The thesaurus is used for query pre-processing; the predicate dictionary is
used for semantic analysis.
Let us consider the structure of the linguistic knowledge base with an example:
Verb= love
Role=subject – syntaxeme meaning
Facultative=No
Soul=Undefined
S=+case – syntaxeme expression form
Role=object
Facultative=No
Soul=Undefined
S=+ case – syntaxeme expression form
Role=causative
Facultative=No
Soul=Undefined
S= case – syntaxeme expression form
Relation=subject object PTN – semantic relation
Relation=subject causative CAUS
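The key=value layout above comes from the paper's example; the parsing code below, and the reduced entry it parses, are our own sketch of how such an entry could be loaded into a verb frame:

```python
# Parse a predicate-dictionary entry of the key=value form shown above
# into a verb frame with roles and semantic relations.
ENTRY = """\
Verb=love
Role=subject
Facultative=No
Role=object
Facultative=No
Relation=subject object PTN
"""

def parse_entry(text: str) -> dict:
    frame = {"verb": None, "roles": [], "relations": []}
    for line in text.splitlines():
        key, _, value = line.partition("=")
        if key == "Verb":
            frame["verb"] = value
        elif key == "Role":
            frame["roles"].append({"role": value})
        elif key == "Facultative":
            frame["roles"][-1]["facultative"] = (value == "Yes")
        elif key == "Relation":
            arg1, arg2, rel = value.split()
            frame["relations"].append((rel, arg1, arg2))
    return frame

frame = parse_entry(ENTRY)
print(frame["verb"], [r["role"] for r in frame["roles"]], frame["relations"])
# love ['subject', 'object'] [('PTN', 'subject', 'object')]
```

The resulting relations list has exactly the <relation, arg1, arg2> shape of the semantic search image described in Section 1.3.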
2. Meta-search engine software architecture
The described methods are implemented in a meta-search engine. The component
model of the engine's architecture is shown in Figure 1.
Figure 1. Meta-search engine component model diagram.
The meta-search engine consists of the following components:
1. Multi-agent environment: provides distributed processing of search queries
by redirecting functionally independent tasks to agent modules located on
different computers of the local network [7].
2. Database: stores indexed documents and multi-engine environment support
data.
3. Text processing module: performs part-of-speech tagging and semantic
analysis.
4. Agent modules: carry out all search tasks (search query processing,
document downloading and indexing, relevance calculation).
5. Thesaurus: used for search query preprocessing.
6. Predicate dictionary: stores linguistic knowledge.
7. Search engine modules: support the web interface, user management and so on.
3. Semantic filtering
Semantic relevance can be further improved by additional semantic filtering
[2], [3], [4]. The dataflow diagram explains the main idea of additional
document processing using linguistic analysis.
Figure 2. Semantic search process dataflow diagram.
The typical scenario of user search query processing includes the following steps:
1. Linguistic analysis of the query text, generating its semantic search image.
2. Extraction of keywords and creation of an alternative (keyword) search
image for external search engines.
3. Redirection of the keyword search image to the currently plugged-in
external search engines.
4. Retrieval and merging of the result lists from each search engine.
5. For each document in the joint result list:
   5.1. Extraction of excerpts of its text with matching keywords.
   5.2. Linguistic analysis of the excerpts and generation of a semantic
   search image.
   5.3. Matching the semantic search image of the document against that of the
   query (calculating semantic relevance).
6. Rearrangement of the result list according to semantic ranking, removing
false drops, i.e. documents ranked below some pre-specified threshold.
The overall process can thus be generally referred to as "semantic filtering" of the
results list.
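The steps above can be sketched as a pipeline; note that the stub analyzer, the overlap score, and the threshold value are our own assumptions, since the real system's semantic images are dependency trees and role structures, not word sets:

```python
# Sketch of the semantic filtering pipeline: merge external results,
# score each document against the query's semantic image, drop low scores.
def semantic_image(text: str) -> set:
    """Stub analyzer: a bag of lowercased words stands in for triples."""
    return set(text.lower().split())

def semantic_relevance(query_image: set, doc_image: set) -> float:
    return len(query_image & doc_image) / max(len(query_image), 1)

def semantic_filter(query: str, result_lists, threshold=0.5):
    docs = [d for lst in result_lists for d in lst]        # step 4: merge
    q_img = semantic_image(query)                          # step 1
    scored = [(semantic_relevance(q_img, semantic_image(d)), d)
              for d in docs]                               # step 5
    scored.sort(reverse=True)                              # step 6: rerank
    return [d for s, d in scored if s >= threshold]        # drop false drops

results = semantic_filter(
    "president visit Brussels",
    [["the president went to Brussels on a visit"], ["stock market news"]])
print(results)  # only the Brussels document survives
```

The irrelevant document scores below the threshold and is removed as a false drop, while the relevant one is kept and ranked first.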
4. Conclusion and future work
We have presented methods for knowledge-based semantic search, elaborated
within an intelligent system for semantic search and analysis of information
across heterogeneous informational resources and services (Internet/Intranet
networks, local and distributed databases). The solution of the problem of
semantically relevant search is the central concept of the system.
We are developing new algorithms for the recognition and analysis of various
sentence models, parametric algorithms for relevance calculation, and machine
learning algorithms adapted to textual information for enriching the system's
predicate dictionary.
Another subject of thorough further research is methods for revealing implicit
relations, as well as anaphora resolution in cases of multiple variants
requiring semantic filtering of syntaxemes. This will be necessary for
selecting the best variant of the resulting interpretation using preference
criteria (individual criteria are needed for every type of phenomenon). The
preference criteria, which carry weight coefficients, may include: categorical
semantics and morphological forms of candidates, results of the syntaxeme
search, word order, and other factors.
At present we are also developing the English version of the system and
considering the possibility of multilingual search and translation within the system.
The Russian prototype of the semantic meta-search engine is available at
www.exactus.ru.
References
[1] A. Klimovski, G. Osipov, I. Smirnov, I. Tikhomirov, O. Vybornova, O. Zavjalova. Usability evaluation
methods for search engines. // to appear in Proceedings of international conference IAI’2006, Kiev,
2006. (in Russian)
[2] A. Klimovski, I. Kuznetsov, G. Osipov, I. Smirnov, I. Tikhomirov, O. Zavjalova. Problems of providing
search precision and recall: Solutions in the intelligent meta-search engine Sirius // Proceedings of
international conference Dialog’2005, Moscow, Nauka, 2005. (in Russian)
[3] G. Osipov, I. Smirnov, I. Tikhomirov, O. Zavjalova. Intelligent semantic search employing meta-search
tools. // Proceedings of international conference IAI’2005, Kiev, 2005. (in Russian)
[4] I. Kuznetsov, G. Osipov, I. Tikhomirov, I. Smirnov, O. Zavjalova. Sirius - engine for intelligent semantic
search in local and global area networks and databases. // 9th National conference on artificial
intelligence CAI-2004, vol. 3. Moscow: Fizmatlit, 2004. (in Russian)
[5] O. Zavjalova. About principles of verb dictionary generation for the tasks of automatic text processing. //
Proceedings of international conference Dialog’2004, Moscow, Nauka, 2004. (in Russian)
[6] G. Zolotova, N. Onipenko, M. Sidorova. Communicative grammar of Russian language. Moscow, 2004.
(in Russian)
[7] G. Osipov. Interactive Synthesis of Knowledge-Based Configuration. // Proc. of the Second Joint
Conference on Knowledge-Based Software Engineering. Sozopol, Bulgaria, 1996.
[8] G. Osipov. Semantic Types of Natural Language Statements. A Method of Representation. //10th IEEE
International Symposium on Intelligent Control 1995, Monterey, California, USA, Aug. 1995.
[9] G. Osipov. Methods for Extracting Semantic Types of Natural Language Statements from Texts. //10th
IEEE International Symposium on Intelligent Control 1995, Monterey, California, USA, Aug. 1995.