Ontology Learning from Text: Methods & Tools

Definition of Ontology
‘A formal, explicit specification of a shared conceptualization’
• Formal: must be machine understandable
• Explicit: types of concepts and constraints must be clearly defined
• Shared: not private to some individual, but accepted by a group
• Conceptualization: an abstract model of some phenomenon in the world, formed by identifying the relevant concepts of that phenomenon

Main elements of an ontology
• Hierarchy of concepts (is-a relations)
• Object property (relation) with a domain and a range, e.g. wasWrittenBy
• Datatype property (attribute) with a domain and a datatype range, e.g. hasTitle with range xsd:string

Definition of Ontology Learning
• The application of a set of methods and techniques used for building an ontology from scratch
• Uses distributed and heterogeneous knowledge and information sources
• Reduces the time and effort needed in the ontology development process

Ontology Learning methods from…
• Unstructured sources: involve NLP techniques, morphological and syntactic analysis, etc.
• Semi-structured sources: elicit an ontology from sources that have some predefined structure, such as XML Schema
• Structured data: extract concepts and relations from knowledge contained in structured data, such as databases

Ontology Learning ‘Layer Cake’ (top to bottom)
• Axioms & Rules: ∀x, y (sufferFrom(x, y) → ill(x))
• Relations: cure (domain: Doctor, range: Disease)
• Taxonomy (concept hierarchies): is_a (Doctor, Person)
• Concepts: Disease := <I, E, L>
• Synonyms: {disease, illness}
• Terms: disease, illness, hospital

Part 1: Terms Extraction
Layer: Terms (disease, illness, hospital)

Terms
• Linguistic realizations of domain-specific concepts
• Are the basis of the ontology learning process
Term extraction implies:
• Linguistic processing: part-of-speech tagging, morphological analysis, etc.
• Statistical processing: compares the distribution of terms between corpora

Terms Extraction: Process
• Run a part-of-speech (POS) tagger over the domain corpus
• Identify possible terms by constructing patterns, such as Adj-Noun, Noun-Noun, Adj-Noun-Noun, …
• Ignore names
• Keep only the terms that are relevant to the text by applying statistical metrics

Linguistic Analysis: an example (levels of analysis, from shallow to deep)
• Tokenization (incl. named-entity recognition): [table] [2005-06-01] [John Smith]
• Part-of-speech & semantic tagging: [table N:ARTIFACT] [table N:furniture]
• Morphological analysis (stemming): [work~ing V]
• Phrase recognition: [[the] [large] [table] NP] [[in] [the] [corner] PP]
• Dependency structure (phrases): [[the SPEC] [large MOD] [table HEAD] NP]
• Dependency structure (S): [[He SUBJ] [booked PRED] [[this] [table HEAD] NP:DOBJ] S]
• Discourse analysis: [[He SUBJ] [booked PRED] [[this] [table HEAD] NP:DOBJ:X1] …] … [[It SUBJ:X1] [was PRED] still available …]

Statistical Analysis
Statistical metrics used in terms extraction:
• Term weighting (TFIDF): $\mathrm{tfidf}(w) = \mathrm{tf}(w) \cdot \log\left(\frac{N}{\mathrm{df}(w)}\right)$
• Chi-square: $\chi^2 = \sum \frac{(obs - exp)^2}{exp}$
• Mutual information: $\mathrm{mi}(x, y) = \frac{P(x, y)}{P(x)\,P(y)}$ (the logarithm of this ratio is the usual pointwise mutual information)

TFIDF
• Most popular weighting schema: $\mathrm{tfidf}(w) = \mathrm{tf}(w) \cdot \log\left(\frac{N}{\mathrm{df}(w)}\right)$
• The word is more popular when it appears several times in a document
• The word is more important if it appears in fewer documents
• tf(w): term frequency (number of occurrences of the word in a document)
• df(w): document frequency (number of documents containing the word)
• N: number of all documents
• tfidf(w): relative importance of the word in the document

Part 2: Synonyms
Layer: Synonyms ({disease, illness})

Synonyms
• Identification of terms that share semantics, i.e., potentially refer to the same concept
• Methods for extracting synonyms:
  - Based on WordNet
  - Latent Semantic Indexing (LSI)

WordNet
• A lexical database for the English language
• Nouns, verbs, adjectives & adverbs are grouped into sets of synonyms (synsets)
• Synsets are interlinked by means of
conceptual-semantic and lexical relations

Adapting WordNet to a specific domain
• Partition the set of synonymy relations defined in WordNet into three classes:
  - Relations that are irrelevant in the specific domain
  - Relations that are relevant but incorrect in the specific domain
  - Relations that are relevant and correct in the specific domain
• Remove relations from the first two classes and include relations from the third class
• Rank the remaining sets according to their frequency in the corpus

Latent Semantic Indexing (LSI)
• LSI is an NLP technique for analyzing relationships between a set of documents and the terms they contain
• Uses a term-document matrix which describes the occurrences of terms in documents (Vector Space Model)
• Example: a small term-document matrix with the terms database, computer and access as rows and doc1, doc2 as columns, where an X marks that the term occurs in the document

Part 3: Concepts
Layer: Concepts (Disease := <I, E, L>)

Concepts: Intension, Extension, Lexicon
A term may indicate a concept if we can define its:
• Intension: (in)formal definition of the set of objects that this concept describes
  Example: a disease is an impairment of health or a condition of abnormal functioning
• Extension: a set of objects that the definition of this concept describes
  Example: influenza, cancer, heart disease
• Lexical realizations: the term itself and its multilingual synonyms
  Example: disease, illness, maladie

Part 4: Taxonomy Induction
Layer: Taxonomy (is_a (Doctor, Person))

Concept Hierarchy Extraction
Basic methods used for taxonomy extraction:
• With the use of WordNet
• Lexico-syntactic patterns
• Machine Readable Dictionaries
• Co-occurrence analysis
• Linguistic approaches

Taxonomy Extraction with WordNet
• Given two terms t1 and t2, check if they stand in a hypernym relation with regard to WordNet
• Normalize the number of hypernym paths by dividing by the number of senses of t1:
  $\mathrm{isa}(t_1, t_2) = \min\left(\frac{|\mathrm{paths}(\mathrm{senses}(t_1), \mathrm{senses}(t_2))|}{|\mathrm{senses}(t_1)|},\ 1\right)$
• path: a sequence of edges
connecting the two synsets
• Example:
  - 4 different hypernym paths between the synsets ‘country’ and ‘region’
  - ‘country’ has 5 senses
  - value of isa (country, region) = min(4/5, 1) = 0.8

Lexico-syntactic patterns - Hearst
• Aim: the acquisition of hyponym lexical relations from text
• Uses a set of predefined lexico-syntactic patterns which
  - occur frequently and in many text genres
  - indicate the relation of interest
  - can be recognized with little or no pre-encoded knowledge
• Principal idea: match these patterns in texts to retrieve is_a relations
• Precision with respect to WordNet: 55.45%

Lexico-syntactic patterns - Hearst (the patterns)
• NP0 such as {NP1, NP2, …, (and | or)} NPn
  Example: ‘Vehicles such as cars, trucks and bikes…’
  Result: is_a (car, vehicle), is_a (truck, vehicle), is_a (bike, vehicle)
• such NP as {NP,}* {(or | and)} NP
  Example: ‘Such fruits as oranges, nectarines or apples…’
  Result: is_a (orange, fruit), is_a (nectarine, fruit), is_a (apple, fruit)
• NP {, NP}* {,} {or | and} other NP
  Example: ‘Swimming, running, or/and other activities…’
  Result: is_a (swimming, activity), is_a (running, activity)
• NP {,} including {NP,}* {or | and} NP
  Example: ‘Injuries, including broken bones, wounds and bruises…’
  Result: is_a (broken bone, injury), is_a (wound, injury), is_a (bruise, injury)
• NP {,} especially {NP,}* {or | and} NP
  Example: ‘Publications, especially papers and books…’
  Result: is_a (paper, publication), is_a (book, publication)

Machine Readable Dictionaries
• A method for extracting taxonomies which goes back to the 1980s
• Main idea: exploit the regularity of dictionary entries to find a suitable hypernym for the defined word
• Example: spring: “the season between winter and summer and in which leaves and flowers appear”
  Result: is_a (spring, season)

MRDs: Exceptions
• The hypernym can be preceded by an expression such as ‘a kind of’, ‘a sort of’, or ‘a type of’
  - The problem is solved by keeping an exception list with words such as ‘kind’, ‘sort’, ‘type’ and taking the head of the NP following the preposition ‘of’
  - Example: hornbeam: “a type of tree with a hard wood, sometimes used in hedges”
    Result: is_a (hornbeam, tree)
• The word can be defined in terms
of a part-of or membership relation
  - Example: republican: “a member of a political party advocating republicanism”
    Here the naive reading is_a (republican, political party) is wrong; the intended relation is part_of (republican, political party)

Co-occurrence analysis
• A term t1 is more specific than a term t2 if t2 also appears in all the documents in which t1 appears
• Document-based subsumption: term x subsumes term y iff $P(x \mid y) = 1$, where
  $P(x \mid y) = \frac{n(x, y)}{n(y)}$
  - n(x, y): the number of documents in which x and y co-occur
  - n(y): the number of documents that contain y

Linguistic Approaches
• Modifiers typically restrict or narrow down the meaning of the modified noun
• Example: is_a (international credit card, credit card)

Part 5: Relations (non-taxonomic)
Layer: Relations (cure (domain: Doctor, range: Disease))

Extracting relations & attributes
• Specific relations
  - Part-of
  - Qualia (Formal, Constitutive, Telic, Agentive)
• General relations
  - Exploiting linguistic structure
• Attributes

Learning attributes: Introduction
• Attributes: relations with a datatype as range
• Typically expressed in texts using the preposition ‘of’, the verb ‘have’ or genitive constructs, e.g.
‘the color of the car’, ‘the car’s color’, ‘every car has a color’
• Values of attributes are expressed using copula constructs, adjectives or expressions specific to the attribute in question, e.g.,
  - ‘the car is red’ (copula + value)
  - ‘the red car’ (adjective)
  - ‘the baby weighs 3 kg’ (specific expressions)

Classification of attributes
• To systematize the learning process, attributes are classified according to their range

RelExt
A tool for relation extraction that
• identifies relevant triples (pairs of concepts connected by a relation) over concepts from an existing ontology
• is based on the fact that verbs express a relation between two classes that specify the domain and range
• extracts relevant verbs and their grammatical arguments and computes the corresponding relations through statistical and linguistic processing
• was developed in the context of the SmartWeb project to provide intelligent information services for the FIFA World Cup 2006

RelExt: Linguistic processing
Pipeline: Corpus → Linguistic annotation → NER & concept tagging → Annotated corpus
• Linguistic annotation
  - the SCHUG system was used
  - provides a multi-layer XML format for a given text: dependency structure, lemmatization, POS
• NER (Named Entity Recognition): performed to map instances of football players to existing ontology classes
• Concept tagging: maps synonyms for given terms to the corresponding ontology concepts

RelExt: Statistical Processing
• Relevance measure: χ² test used to compute the relevance ranking
• Co-occurrence measure
(Figure: frequencies in BNC and NZZ → relevance measure → relevance scores for heads and predicates; co-occurrence measure → co-occurrence scores between heads and predicates; both feed relation extraction.)

Part 6: Axioms & Rules
Layer: Axioms & Rules (∀x, y (sufferFrom(x, y) → ill(x)))

DIRT: Discovery of Inference Rules from Text
• An unsupervised method for discovering inference rules from text, such as
  - X is author of Y ≈ X wrote Y
  - X caused Y ≈ Y is blamed on X
  - X manufactures Y ≈ X’s Y factory
• Is based on the assumption
that words occurring in the same contexts tend to be similar (the Distributional Hypothesis)

DIRT: Distributional Hypothesis
• The Distributional Hypothesis is applied to dependency trees
• If two paths tend to link the same sets of words, their meanings are hypothesized to be similar

DIRT: Dependency trees
Example sentence: “John found a solution to the problem”
Dependency links (head → modifier, with label):
• found → John (subj)
• found → solution (obj)
• solution → a (det)
• solution → to (mod)
• to → problem (pcomp)
• problem → the (det)
Notes:
• Links represent dependency relationships
• Direction: from the head to the modifier
• Labels represent types of dependency relations
• Each link between two words represents a direct semantic relationship
• Path between “John” and “problem”: N:subj:V ← find → V:obj:N → solution → N:to:N, meaning “X finds solution to Y”

DIRT: Paths in Dependency Trees
• Transformation rule: connect the prepositional complement directly to the words modified by the preposition
• Each link between two words represents a direct semantic relationship
• A path represents indirect semantic relationships between two content words

Ontology Learning Tools
• Text2Onto
  - Open source (Java)
  - http://ontoware.org/projects/text2onto
• OntoLT
  - Open source (Protégé plug-in, Java)
  - http://olp.dfki.de/OntoLT/OntoLT.htm
• OntoGen
  - Open source (C++, .NET)
  - http://www.textmining.net

Conclusions
• A detailed methodology that guides the ontology learning process does not exist; only general guidelines are provided
• There is no complete correspondence between the methods and the tools
• Methods are based mainly on NLP techniques complemented with statistical measures
• Tools only support some of the steps proposed in the different approaches (except Text2Onto)

Some References…
Cimiano, P. Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer, 2006.
Hearst, M. A. Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of the 14th International Conference on Computational Linguistics, pp.
539-545, 1992.
Gómez-Pérez, A., & Manzano-Macho, D. An overview of methods and tools for ontology learning from text. The Knowledge Engineering Review, Vol. 19:3, pp. 187-212, 2005.
Cimiano, P., & Wenderoth, J. Automatically Learning Qualia Structures from the Web. In: Proceedings of the ACL Workshop on Deep Lexical Acquisition, pp. 28-37, 2005.
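
Worked sketch: TFIDF term weighting
The TFIDF slide in Part 1 defines tfidf(w) = tf(w) · log(N / df(w)). The following is a minimal sketch of that formula, assuming documents are already tokenized into lists of lowercase terms. The function name `tfidf` and the toy corpus are illustrative assumptions, not part of any tool named in this deck.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight follows the slide formula: tf(w) * log(N / df(w)),
    where tf is counted per document and df over the whole corpus.
    """
    n_docs = len(docs)
    # df(w): number of documents that contain w at least once
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf(w): occurrences of w in this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


# Hypothetical toy corpus (pre-tokenized), for illustration only
docs = [
    ["disease", "hospital", "disease"],
    ["hospital", "doctor"],
    ["doctor", "illness", "hospital"],
]
scores = tfidf(docs)
# "hospital" occurs in every document, so log(N / df) = log(1) = 0
# and its weight is 0 everywhere; "disease" is frequent in doc 0 and
# rare in the corpus, so it gets the highest weight there.
```

A term that appears in all documents gets weight zero, which is why TFIDF filters out words that are frequent but not domain-specific.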
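
Worked sketch: matching a Hearst pattern
The first Hearst pattern in Part 4 (NP0 such as NP1, NP2, … (and | or) NPn) can be approximated with a regular expression. This is only a sketch under strong assumptions: every noun phrase is a single word, no POS tagging or lemmatization is applied (so plurals are kept as found), and the names `SUCH_AS` and `hearst_such_as` are hypothetical. A real system would match over chunked noun phrases.

```python
import re

# NP0 such as NP1, NP2, ... (and|or) NPn, with single-word NPs (assumption).
# The negative lookahead keeps "and"/"or" from being read as a list item.
SUCH_AS = re.compile(
    r"(\w+) such as (\w+(?:, (?!(?:and|or)\b)\w+)*(?:,? (?:and|or) \w+)?)"
)

# Splits the hyponym list on commas and on the final "and"/"or".
LIST_SEP = re.compile(r"(?:,\s*|\s+)(?:and|or)\s+|,\s*")


def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs, mirroring is_a(hyponym, hypernym)."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1).lower()
        for hyponym in LIST_SEP.split(match.group(2)):
            if hyponym:
                pairs.append((hyponym.lower(), hypernym))
    return pairs


pairs = hearst_such_as("Vehicles such as cars, trucks and bikes are common.")
```

For the slide's example sentence this yields one pair per listed item, each pointing at the hypernym before "such as"; mapping "cars" to "car" would need the morphological analysis step from Part 1.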
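
Worked sketch: document-based subsumption
The co-occurrence slide in Part 4 states that term x subsumes term y when P(x | y) = n(x, y) / n(y) reaches 1. The sketch below computes that ratio over documents represented as sets of terms. The `threshold` parameter is an assumption added for illustration (the slide only states the value 1), and the toy documents are hypothetical.

```python
def cond_prob(x, y, docs):
    """P(x | y) = n(x, y) / n(y), counted over documents (sets of terms)."""
    n_y = sum(1 for doc in docs if y in doc)
    n_xy = sum(1 for doc in docs if x in doc and y in doc)
    # If y never occurs, the ratio is undefined; report 0.0 by convention.
    return n_xy / n_y if n_y else 0.0


def subsumes(x, y, docs, threshold=1.0):
    """True if x appears in (at least a `threshold` share of) the documents containing y."""
    return cond_prob(x, y, docs) >= threshold


# Hypothetical toy corpus: each document is the set of terms it contains
docs = [
    {"disease", "influenza"},
    {"disease", "cancer"},
    {"disease", "influenza", "hospital"},
    {"hospital"},
]

disease_over_influenza = subsumes("disease", "influenza", docs)
influenza_over_disease = subsumes("influenza", "disease", docs)
```

Here "disease" occurs in every document that mentions "influenza", but not the other way round, so only the first check holds and "influenza" would be placed below "disease" in the taxonomy.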