Linguistic Knowledge for Search Relevance Improvement

Gennady OSIPOV, Ivan SMIRNOV, Ilya TIKHOMIROV, Olga VYBORNOVA and Olga ZAVJALOVA
Institute for System Analysis, Russia

Abstract. The paper presents methods and a software implementation for semantically relevant search. We focus mainly on the use of linguistic knowledge to improve semantic relevance in search engines. We argue for the effectiveness of information retrieval systems based on semantic search involving refined linguistic processing tools, and discuss the advantages of semantic search over traditional search engines.

Keywords. Knowledge-based semantic search, linguistic semantics, natural language processing, meta-search.

Introduction

Most information today is presented as web documents and Internet/Intranet-distributed databases. However, the accumulation of information on the Internet has been rather chaotic, which has produced a paradox: the volume of accumulated information grows much faster than our ability to search within it. Fast and efficient search for the necessary information is a common problem of modern society.

When we use modern search engines, the first documents returned for a query are often links to documents that are non-relevant in the context of the query, although the search engine rates their relevance at 100 per cent. There are three reasons for this: search engines use algorithms based on various statistical scores over document texts rather than on "understanding" the meaning of queries and the contents of information resources; a natural language query is usually ambiguous and depends on the situational context; and the user is not always able to formulate his or her query exactly, and search engines offer no help with this task. The problem of relevance is usually attacked by means of keyword-based linear search with the help of statistical and some kinds of linguistic methods.
Some systems proclaim the possibility of semantic search, of natural-language query input, and of question answering. However, they use inadequate linguistic and software tools: the output of such systems is a large array of documents, of which only a small portion is relevant. Search precision can be improved if the following tools are employed: tools for natural language processing; a linguistic knowledge base; and tools for applied semantic analysis, comparison and ranking [1][2][3][4]. The principal motivations behind our intelligent information search engine are therefore: to extend standard keyword search mechanisms; to provide support for search queries in natural language; to improve search precision; and to preserve the speed of standard search methods.

1. Linguistic processing

Most existing search methods employ keyword search; they do not take into account semantic search, which involves language processing tools, as an alternative type of search. Keyword search often fails to satisfy the main requirement of the user, namely semantic relevance of the found documents to the query, even when all keywords of the query are present in the found documents. Conversely, a document can be semantically relevant to the query even if it contains none of the query's keywords. For example, for the query "speech of the President of the Russian Federation", relevant documents might contain the following variants: speech of the President of the Russian Federation, report of the President of the Russian Federation, speech of the leader of the Russian Federation, speech of the President of Russia, report of the leader of Russia, etc. One more example. Query: "Visit of the president to Brussels". The same idea can be expressed differently, say "The president went to Brussels."
To process the query we introduce a semantic relation DEST(X, Y), which denotes the fact that Y is a destination of X. Then we search for DEST(president, Brussels). Natural language copes with the task of improving search relevance far better than keywords, so queries in the form of natural language discourse should be allowed. From this follows the necessity of semantic analysis of the query text itself and of the required documents.

A disadvantage of traditional keyword search is that a simple set of keywords cannot convey the semantic direction of the query; it is impossible to "focus" it, that is, to formulate the user's search need. A query is not just a set of words. It is a sentence or phrase in which words are not simply stuck together but are connected with each other according to certain syntactic rules. When forming a discourse, syntax deals above all with meaningful units bearing not only their individual lexical meaning but also a generalized categorical meaning in constructions of various complexity. These units are called syntaxemes (minimal indivisible semantic-syntactic structures of language) [5][6]. Syntaxemes are detected taking into account: a) the categorical semantics of the word; b) its morphological form; c) its function in the sentence (according to their constructive capabilities within a sentence, there are three functional types of syntaxemes: free, conventional and bound). Categorical semantics is a generalized meaning characterizing words that belong to the same categorical class (for instance, the class of people, things, or attributes for nouns). Let us consider two examples of syntaxemes that are similar in form but different in meaning:

1. The mother brings her son to school.
2. Laziness brings researchers to trouble.
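The paraphrase matching sketched above can be illustrated with a toy data structure. This is a minimal sketch under our own assumptions, not the authors' implementation: the `Triple` type and the literal relation names are invented for illustration.

```python
# Hypothetical sketch of matching a query's semantic relation against
# the relations extracted from documents. Names (Triple, DEST, AGENT)
# are illustrative, not the authors' actual data structures.
from typing import NamedTuple, Set

class Triple(NamedTuple):
    relation: str   # e.g. "DEST": arg2 is a destination of arg1
    arg1: str
    arg2: str

# "Visit of the president to Brussels" normalized to one relation triple:
query_image = {Triple("DEST", "president", "Brussels")}

doc1 = {Triple("DEST", "president", "Brussels")}   # "The president went to Brussels."
doc2 = {Triple("AGENT", "president", "speech")}    # an unrelated document

def matches(query: Set[Triple], doc: Set[Triple]) -> bool:
    """A document matches if it contains every relation triple of the query."""
    return query <= doc

print(matches(query_image, doc1))  # True: paraphrase yields the same triple
print(matches(query_image, doc2))  # False
```

Because both surface forms normalize to the same DEST triple, the paraphrase matches even though it shares no content keywords beyond the proper nouns.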
In (1), "brings to school" contains a spatial noun with a directive meaning (direction of movement); it is a free syntaxeme, since its meaning does not depend on its position in the sentence. In (2), "brings to trouble" contains an attributive noun meaning a logical consequence; it is a conventional syntaxeme, since its meaning is realized only in a certain complicative model, in the position of a semi-predicative complicator of the model. In a particular discourse, in a particular sentence of a query, a word performs as a syntaxeme: it has a certain syntactical function in a certain grammatical form, and it realizes only one of the possible meanings this word can take. The main task of semantic analysis is to reveal the semantic relations between syntaxemes. Semantic analysis essentially enhances the precision and recall of the search and decreases the number of irrelevant documents returned.

Following this general overview, let us consider the linguistic analysis process in our system in more detail. The main idea of semantic search is semantic processing of user query texts and of the returned documents. Semantic processing involves the generation of semantic images of documents and the matching of the obtained images. As a result, additional types of relevance are calculated that allow the removal of documents obviously not corresponding to the semantics of the search query. Semantic processing of the discourse proceeds in three stages: morphological, syntactical and semantic analysis proper [7], [8], [9]. Each stage is performed by a separate analyzer with its own input and output data and its own settings. Interpretation of utterances involves the following milestones:

1. Morphological and syntactical analysis of the sentence/phrase.
2. Situational analysis of the sentence/phrase (building a subcategorization frame, correlating the situation of the sentence/phrase with the situation described in the system's knowledge base, and selecting an adequate action).
3. Generation of the system's answer.

1.1. Morphological analysis

At the morphological analysis stage, words and separators (if any) are recognized in the discourse. Based on word morphology, the list of all possible grammatical forms for each word is built. Word forms corresponding to the same normal dictionary form, the same part of speech and the same number (singular or plural, for parts of speech that inflect for number) are unified into groups, which we will further call lexemes (though they are not lexemes in the strict linguistic sense). Obviously, several such lexemes can correspond to the same word. To decrease the number of resulting sentence variants, the morphological analyzer has a filter: for every part of speech it can be specified whether it is taken into account in further analysis. By default, the settings allow interjections and particles to be ignored if the word has variants belonging to other parts of speech. At the output we get a list of sentences, each of which is a list of words; each word, in turn, is a list of lexemes.

1.2. Syntactical analysis

The main task of syntactical analysis is to establish syntactical dependencies between the lexemes revealed at the previous stage; in particular, syntactical analysis extracts minimal semantic-syntactic structures (syntaxemes). These tasks can be solved within one sentence: compound sentences are split into simple clauses, which are then processed as separate sentences. For every sentence acquired at the output of morphological analysis, a list of variants is composed such that in each sentence variant each word has exactly one lexeme.
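The morphological filtering described above can be sketched as follows. This is an illustrative sketch under our own assumptions, not the authors' code: the sample analyses and POS tag names are invented.

```python
# Sketch of the morphological stage: each word maps to its candidate
# "lexemes" (normal form + part of speech), and a POS filter drops
# interjections/particles only when other variants exist. The sample
# analyses are invented for illustration.
from typing import List, Tuple

Lexeme = Tuple[str, str]  # (normal dictionary form, part of speech)

# Hypothetical ambiguous analyses: word -> candidate lexemes.
analyses = {
    "flies": [("fly", "NOUN"), ("fly", "VERB")],
    "oh":    [("oh", "INTERJECTION")],
}

IGNORED_POS = {"INTERJECTION", "PARTICLE"}

def filter_lexemes(candidates: List[Lexeme]) -> List[Lexeme]:
    """Drop ignored parts of speech, but only if other variants remain."""
    kept = [lx for lx in candidates if lx[1] not in IGNORED_POS]
    return kept or candidates  # keep the original list if all would be dropped

print(filter_lexemes(analyses["flies"]))  # both NOUN and VERB survive
print(filter_lexemes(analyses["oh"]))     # interjection kept: no alternative
```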
Since the number of sentence variants equals the product of the number of lexemes for each word, limiting the number of variants becomes necessary. To this end, heuristics that reject obviously incorrect variants are applied; in addition, a maximum allowable number of variants can be prescribed in the syntactical analyzer settings. After that, an algorithm for discovering subordinating syntactical relations is applied to each variant. As a result, lexemes are unified into dependency trees: the lexeme at the parent node governs all child-node lexemes. A minimal semantic-syntactic structure (syntaxeme) is a tree whose root is a noun or a preposition governing a noun; note that proper names also count as nouns. A noun phrase (NP) is any syntaxeme subtree that includes the root noun. Besides searching for syntactical dependencies, syntactical analysis detects homogeneous parts. Thus at this stage two types of relations can be detected between lexemes: government and homogeneity. Every time the program detects a syntactical relation, the weight of the sentence variant is increased. At the end of sentence analysis, only variants with the maximal weight are kept; sentences with zero weight are deleted by default (this option can be modified in the syntactical analyzer settings). Thus the input of syntactical analysis is a sentence acquired at the output of morphological analysis; the output is the sentence in the form of a list of variants, each of which is a list of dependency trees (a list of syntaxemes).

1.3. Semantic analysis

The main task of the kind of semantic analysis described below (which can become deeper given a properly composed knowledge base of the domain) is to reveal the semantic meanings of syntaxemes and the relations on the set of syntaxemes. In general, a semantic relation is understood as a relation between concepts in the conceptual system of the domain.
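The variant-weighting scheme above can be sketched in a few lines. This is a toy model under our own assumptions: the candidate lexemes and the scoring rule are invented, and real heuristics would score many relation types, not one.

```python
# Sketch of variant pruning: sentence variants are the Cartesian product
# of per-word lexeme lists; each variant accumulates one point per
# detected syntactical relation, and only maximal-weight, non-zero
# variants survive. Lexemes and the scorer are invented for illustration.
from itertools import product

# Each word has a list of candidate lexemes (form/POS tags are toy labels).
words = [["the"], ["fly/N", "fly/V"], ["flies/N", "flies/V"]]
variants = list(product(*words))
assert len(variants) == 1 * 2 * 2  # product of lexeme counts per word

def weight(variant):
    """Toy scorer: one point per plausible government relation found."""
    w = 0
    if "fly/N" in variant and "flies/V" in variant:
        w += 1  # noun subject governed by a verb
    return w

best = max(weight(v) for v in variants)
kept = [v for v in variants if weight(v) == best and best > 0]
print(kept)  # only the noun-subject / verb reading survives
```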
Semantic relations are represented in the lexicon by predicate words, i.e. lexemes representing predicates. The central place here is occupied by verbs, which, as a rule, hold the central position in the semantic structure of the sentence and decisively influence noun phrases and sentences. Detection of relations is based on the concept of a subcategorization frame, widely used in structural linguistics. Information about the syntactical compatibility of every verb is stored in special tables of relations; the table for each verb indicates the types of relations between its roles (arguments). A semantic search image consists of an ordered map of triples <relation, arg1, arg2>, where <relation> stands for the semantic relation type and the arguments are dependency trees for the corresponding NP or PP. When we deal with obligatory arguments (roles) of the verb, i.e. when the verb's distributors are in its subcategorization frame, we speak of bound syntaxemes; tables of roles and semantic relations are compiled for this type of syntaxeme. In the case of optional roles, whose semantics and syntactical expression are not typical of the given verb, we have free syntaxemes. A special heuristic algorithm has been developed for them, and for syntaxemes interrelated with each other without the mediation of a verb. It should be noted that participles and adverbial participles, as special verb forms, as well as verbal abstract nouns, can act as predicate words. To perform semantic analysis, it is first necessary to extract predicate words (predicators). If a verb is the predicator of the sentence, it can be detected immediately at the morphological analysis stage; in other cases (when the predicator is a participle, a verbal noun, etc.) additional rules are applied. Once the predicate word and the NPs related to it are detected, the valences in the predicator's role structure must be filled.
The filling is done using special linguistic dictionaries in which a certain set of roles is associated with each predicator. Rules taking context and domain into consideration are used to resolve polysemy. The dictionary also indicates how NPs are interrelated within role structures. A set of binary relations on the set of roles is likewise specific to each type of predicate word and is defined a priori. The totality of NPs, roles and binary relations is presented in the form of a semantic graph describing the situation in the neighborhood of one predicator. This graph is a fragment of a semantic network describing the semantic image of the whole text. Every time a syntaxeme fills a predicator role, or two roles correspond to a semantic relation, the program increases the weight of the sentence variant. Hence, with simultaneous syntactical and semantic analysis, the "heaviest" variants with respect to both syntactical and semantic relations are kept. That is why simultaneous analysis is not equivalent to sequential analysis: in the latter case, variants with the greatest number of syntactical dependencies are selected first, and only then are those chosen among them in which the role structure of predicators is best filled and more semantic relations are found. If a verb is polysemantic (i.e. there are several dictionary entries for one word), all variants are considered one by one. Those variants are finally selected at further stages of analysis in which syntaxeme meanings are found for a greater number of NPs of the fragment, and in which the categorical semantics attributes fired most frequently. If several equivalent variants remain, the variant with the maximum ratio of the number of syntaxemes found in the sentence to the total number of syntaxemes described in the given dictionary entry is chosen (i.e. the variant with the most complete filling of the verb's role structure).
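The tie-breaking rule at the end of the paragraph above can be sketched directly. The sense labels and counts below are invented for illustration; only the ratio criterion itself comes from the text.

```python
# Sketch of the tie-breaking rule for polysemous verbs: among otherwise
# equivalent dictionary entries, pick the one with the highest ratio of
# syntaxemes found in the sentence to syntaxemes described in the entry.
# The entries themselves are hypothetical.
def fill_ratio(found: int, described: int) -> float:
    """Completeness of the verb's role-structure filling."""
    return found / described

# (sense label, syntaxemes found in this sentence, syntaxemes in the entry)
entries = [
    ("bring.to-place", 2, 3),   # 2 of 3 roles filled -> ratio 0.67
    ("bring.to-state", 2, 2),   # 2 of 2 roles filled -> ratio 1.0, best
]

best = max(entries, key=lambda e: fill_ratio(e[1], e[2]))
print(best[0])  # "bring.to-state"
```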
Participial phrases and adverbial participial phrases are processed after the corresponding main clauses. The subject of the main clause becomes the subject of an adverbial participial phrase. Candidates for the subject/object of a participial phrase are the nearest NPs whose roots agree with the participle in gender, number and case. A search is also made for syntaxemes corresponding to interrogative words such as who, what, where, why, when, how, how much, what for, at what time. A special attribute is attached to the found meanings (roles in the pairs <role, NP> representing the semantics of the query and the document) and then to relations in triples <relation, NP1, NP2>. When comparing the query with a document (during relevance calculation), this attribute allows NP, NP1 and NP2 of the query to be matched against any NPs of the document that correspond to the given interrogative words in categorical semantics. At present the system works with the Russian language, and an English version is under development. One further direction of work is to extend the system to handle queries in other languages; semantic roles and relations are universal, so the main concepts of the system can be preserved.

1.4. Linguistic knowledge base

The linguistic knowledge base in our system consists of two components: a thesaurus containing a set of words and NPs associated with each other by relations such as synonymy, partonymy and semantic likelihood; and a predicate dictionary, which describes how different types of semantic relations between concepts are realized in syntactic context. The list of predicators, say for the Russian language, contains verbs and various verb-derived forms. The predicate dictionary also contains the table of relations between the set of syntaxemes and the set of predicator roles. The thesaurus is used for query pre-processing, and the predicate dictionary is used for semantic analysis. Let us consider the structure of the linguistic knowledge base with an example.
Verb = love
  Role = subject                   – syntaxeme meaning
  Facultative = No
  Soul = Undefined
  S = + case                       – syntaxeme expression form
  Role = object
  Facultative = No
  Soul = Undefined
  S = + case                       – syntaxeme expression form
  Role = causative
  Facultative = No
  Soul = Undefined
  S = case                         – syntaxeme expression form
  Relation = subject object PTN    – semantic relation
  Relation = subject causative CAUS

2. Meta-search engine software architecture

The described methods are implemented in a meta-search engine. The component model architecture of the engine is shown in Figure 1.

Figure 1. Meta-search engine component model diagram.

The meta-search engine consists of the following components:
1. A multi-agent environment provides distributed processing of search queries by redirecting functionally independent tasks to agent modules located on different computers of the local network [7].
2. A database stores indexed documents and multi-engine environment support data.
3. A text processing module performs part-of-speech tagging and semantic analysis.
4. Agent modules solve all search tasks (search query processing, document downloading and indexing, relevance calculation).
5. A thesaurus supports search query preprocessing.
6. A predicate dictionary stores linguistic knowledge.
7. Search engine modules support the web interface, user management and so on.

3. Semantic filtering

Semantic relevance improvements can be achieved using additional semantic filtering [2], [3], [4]. The dataflow diagram explains the main idea of additional document processing using linguistic analysis.

Figure 2. Semantic search process dataflow diagram.

The typical scenario of user search query processing includes the following steps:
1. Linguistic analysis of the query text, generating its semantic search image.
2. Extraction of keywords and creation of an alternative search image for external search engines.
3. Redirection of the keyword search image to the currently added-on external search engines.
4. Getting and merging the results lists from each search engine.
5. For each document in the joint results list:
   5.1. Extraction of excerpts of its text containing matching keywords.
   5.2. Linguistic analysis of the excerpts and generation of a semantic search image.
   5.3. Matching of the semantic search image of the document against that of the query (calculating semantic relevance).
6. Rearrangement of the results list according to semantic ranking, removing false drops, i.e. documents ranked below a pre-specified threshold.

The overall process can thus be referred to as "semantic filtering" of the results list.

4. Conclusion and future work

We have presented methods for knowledge-based semantic search elaborated within an intelligent system for semantic search and analysis of information in heterogeneous information resources and services (Internet/Intranet networks, local and distributed databases), with the solution of the problem of semantically relevant search as the central concept of the system. We are developing new algorithms for the recognition and analysis of various sentence models, parametric algorithms for relevance calculation, and machine learning algorithms adapted for work with textual information and for enriching the system's predicate dictionary. A further subject of thorough research is methods for revealing implicit relations, as well as anaphora resolution in cases with multiple variants requiring semantic filtering of syntaxemes. This will be necessary for selecting the best variant of the resulting interpretation using preference criteria (individual criteria are needed for every type of phenomenon). The preference criteria, which have weight coefficients, may include categorical semantics and morphological forms of candidates, results of the syntaxeme search, word order and other factors. At present we are also developing the English version of the system and considering the possibility of multilingual search and translation within the system.
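As a compact summary, the semantic filtering scenario of Section 3 (re-rank by semantic relevance, remove false drops below a threshold) reduces to a few lines. The document identifiers, scores and threshold below are invented for illustration; only the filter-and-rerank logic comes from the text.

```python
# Sketch (invented data, not the production implementation) of the
# semantic filtering step: merged results are re-ranked by semantic
# relevance and "false drops" below a pre-specified threshold are removed.
results = [
    ("doc_a", 0.92),  # (document id, semantic relevance to the query)
    ("doc_b", 0.15),
    ("doc_c", 0.58),
]

THRESHOLD = 0.5  # hypothetical cut-off for false drops

def semantic_filter(docs, threshold):
    """Keep documents at or above the threshold, ranked by relevance."""
    kept = [d for d in docs if d[1] >= threshold]
    return sorted(kept, key=lambda d: d[1], reverse=True)

print(semantic_filter(results, THRESHOLD))  # [('doc_a', 0.92), ('doc_c', 0.58)]
```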
The Russian prototype of the semantic meta-search engine is available at www.exactus.ru.

References

[1] A. Klimovski, G. Osipov, I. Smirnov, I. Tikhomirov, O. Vybornova, O. Zavjalova. Usability evaluation methods for search engines. To appear in Proceedings of the international conference IAI'2006, Kiev, 2006. (In Russian)
[2] A. Klimovski, I. Kuznetsov, G. Osipov, I. Smirnov, I. Tikhomirov, O. Zavjalova. Problems of providing search precision and recall: Solutions in the intelligent meta-search engine Sirius. Proceedings of the international conference Dialog'2005, Moscow, Nauka, 2005. (In Russian)
[3] G. Osipov, I. Smirnov, I. Tikhomirov, O. Zavjalova. Intelligent semantic search employing meta-search tools. Proceedings of the international conference IAI'2005, Kiev, 2005. (In Russian)
[4] I. Kuznetsov, G. Osipov, I. Tikhomirov, I. Smirnov, O. Zavjalova. Sirius: an engine for intelligent semantic search in local and global area networks and databases. 9th National Conference on Artificial Intelligence CAI-2004, vol. 3. Moscow: Fizmatlit, 2004. (In Russian)
[5] O. Zavjalova. On principles of verb dictionary generation for the tasks of automatic text processing. Proceedings of the international conference Dialog'2004, Moscow, Nauka, 2004. (In Russian)
[6] G. Zolotova, N. Onipenko, M. Sidorova. Communicative Grammar of the Russian Language. Moscow, 2004. (In Russian)
[7] G. Osipov. Interactive Synthesis of Knowledge-Based Configurations. Proceedings of the Second Joint Conference on Knowledge-Based Software Engineering. Sozopol, Bulgaria, 1996.
[8] G. Osipov. Semantic Types of Natural Language Statements: A Method of Representation. 10th IEEE International Symposium on Intelligent Control, Monterey, California, USA, Aug. 1995.
[9] G. Osipov. Methods for Extracting Semantic Types of Natural Language Statements from Texts. 10th IEEE International Symposium on Intelligent Control, Monterey, California, USA, Aug. 1995.