Unsupervised Ontology Acquisition from plain texts: The OntoGain method
Efthymios Drymonas
Kalliopi Zervanou
Euripides G.M. Petrakis
Intelligent Systems Laboratory
http://www.intelligence.tuc.gr
Technical University of Crete (TUC), Chania, Greece
OntoGain
- A platform for unsupervised ontology acquisition from text
- Application independent
- Ontology of multi-word term concepts
- Adjusts existing methods for taxonomy & relation acquisition to handle multi-word concepts
- Outputs ontology in OWL
- Good results on medical and computer science corpora
Why multi-word term concepts?
- Majority of terminological expressions
- Convey classificatory information, expressed as modifiers
  - e.g. "carotid artery disease" denotes a type of "artery disease", which is in turn a type of "disease" (see the sketch below)
- Leads to a more expressive and compact ontology lexicon
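A minimal sketch of the classificatory information carried by modifiers, assuming Python and an invented helper name: stripping the leading modifier one word at a time yields the chain of increasingly general terms mentioned above. This is only an illustration, not OntoGain's taxonomy-induction algorithm.

```python
def hypernym_chain(term):
    """Drop the leading modifier of a multi-word term one word at a time.

    Illustrates the slide's example only; this is not the OntoGain
    taxonomy-induction algorithm itself.
    """
    words = term.split()
    return [" ".join(words[i:]) for i in range(len(words))]

print(hypernym_chain("carotid artery disease"))
# ['carotid artery disease', 'artery disease', 'disease']
```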
Ontology Learning Steps
- Concept Extraction
  - C/NC-value
- Taxonomy Induction
  - Clustering, Formal Concept Analysis
- Non-taxonomic Relations
  - Association Rules, Probabilistic algorithm
The C/NC-Value method [Frantzi et al., 2000]
- Identifies multi-word term phrases denoting domain concepts
- Noun phrases are extracted first, using the linguistic filter (see the sketch below):
  ((Adj | Noun)+ | ((Adj | Noun)* (Noun Prep)?) (Adj | Noun)*) Noun
- C-Value: a term validity criterion, relying on the hypothesis that multi-word terms tend to consist of other terms
- NC-Value: uses context information (valid terms tend to appear in specific contexts and to co-occur with other terms)
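A minimal sketch of the linguistic filter above as a regular expression over coarse part-of-speech tags, assuming Python; the one-character tag encoding and the function name are assumptions made for illustration, not part of the original implementation.

```python
import re

# Filter from the slide: ((Adj | Noun)+ | ((Adj | Noun)* (Noun Prep)?) (Adj | Noun)*) Noun
# Encoded over one-character POS tags (an assumption for this sketch).
ADJ, NOUN, PREP = "A", "N", "P"
FILTER = re.compile(r"(?:[AN]+|(?:[AN]*(?:NP)?)[AN]*)N")

def is_candidate(pos_tags):
    """True if the whole tag sequence matches the noun-phrase filter."""
    return FILTER.fullmatch("".join(pos_tags)) is not None

print(is_candidate([ADJ, NOUN, NOUN]))   # "carotid artery disease" -> True
print(is_candidate([PREP, NOUN]))        # "of disease" -> False
```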
C-Value: Statistical Part
For a candidate term a:
- f(a): total frequency of occurrence of a in the corpus
- Ta: the set of longer candidate terms that contain a, and P(Ta) their number
- f(b): frequency of a longer term b ∈ Ta
- |a|: the length (in words) of the candidate string

$$
C\text{-value}(a) =
\begin{cases}
\log_2|a| \cdot f(a), & a \text{ is not nested} \\
\log_2|a| \cdot \left( f(a) - \dfrac{1}{P(T_a)} \displaystyle\sum_{b \in T_a} f(b) \right), & \text{otherwise}
\end{cases}
$$
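A minimal sketch of the C-value computation defined above, assuming Python; the candidate terms and counts are toy data, not figures from the OntoGain corpora.

```python
import math
from collections import defaultdict

def c_value(candidates):
    """Score candidate terms with the C-value formula above.

    `candidates` maps a term (tuple of words) to its total corpus
    frequency f(a); the counts used below are invented toy data.
    """
    # For each term a, collect the frequencies of longer candidates containing a.
    nested_in = defaultdict(list)
    for a in candidates:
        for b, f_b in candidates.items():
            if len(b) > len(a) and any(
                b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1)
            ):
                nested_in[a].append(f_b)

    scores = {}
    for a, f_a in candidates.items():
        weight = math.log2(len(a))
        if not nested_in[a]:                       # a is not nested in a longer term
            scores[a] = weight * f_a
        else:                                      # discount occurrences inside longer terms
            scores[a] = weight * (f_a - sum(nested_in[a]) / len(nested_in[a]))
    return scores

print(c_value({("artery", "disease"): 30, ("carotid", "artery", "disease"): 12}))
```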
Concept Extraction: C/NC-Value sample results

output term                    c/nc value
web page                       1740.11
information retrieval          1274.14
search engine                  1103.99
machine learning                727.70
computer science                723.82
experimental result             655.125
text mining                     645.57
natural language processing     582.83
world wide web                  557.33
large number                    530.67
artificial intelligence         515.73
relevant document               468.22
similarity measure              464.64
information extraction          443.29
knowledge discovery             435.79
Ontology Learning Steps
- Preprocessing
- Concept Extraction
- Taxonomy Induction
- Non-taxonomic Relations
Taxonomy Induction
Aims at organizing concepts into a hierarchical structure where each concept is related to its respective broader and narrower terms.
Two methods in OntoGain:
- Agglomerative clustering
- Formal Concept Analysis (FCA)
Agglomerative Clustering
- Proceeds bottom-up: at each step, the most similar clusters are merged (see the sketch below)
- Initially, each term is considered a cluster
- Similarity between all pairs of clusters is computed
- The most similar clusters are merged, as long as they share terms with common heads
- Group-average similarity for clusters; a Dice-like formula for terms
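A minimal sketch of this bottom-up merging loop, assuming Python, a Dice similarity over the words of two terms, group-average similarity between clusters, and an invented merge threshold; the slide does not give OntoGain's exact formulas, so treat this purely as an illustration.

```python
def dice(term_a, term_b):
    """Dice-like similarity between the word sets of two multi-word terms."""
    a, b = set(term_a.split()), set(term_b.split())
    return 2 * len(a & b) / (len(a) + len(b))

def group_average(c1, c2):
    """Group-average similarity between two clusters of terms."""
    return sum(dice(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def share_head(c1, c2):
    """Only clusters containing terms with a common head (last word) may merge."""
    return bool({t.split()[-1] for t in c1} & {t.split()[-1] for t in c2})

def agglomerate(terms, threshold=0.3):          # threshold value is an assumption
    clusters = [[t] for t in terms]             # initially every term is its own cluster
    while True:
        pairs = [
            (group_average(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j and share_head(c1, c2)
        ]
        if not pairs or max(pairs)[0] < threshold:
            return clusters                     # stop: no sufficiently similar pair left
        _, i, j = max(pairs)
        clusters[i] += clusters[j]
        del clusters[j]

print(agglomerate(["hierarchical method", "agglomerative method", "web page"]))
```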
Formal Concept Analysis (FCA) [Ganter et al., 1999]
- FCA relies on the idea that objects (terms) are associated with their attributes (verbs)
- Finds common attributes (verbs) between objects and forms object clusters that share common attributes
- Formal concepts are connected by the sub-concept relationship:

$(O_1, A_1) \le (O_2, A_2) \iff O_1 \subseteq O_2 \;(A_1 \supseteq A_2)$
FCA Example
Takes as input a matrix showing associations between terms (concepts) and attributes (verbs):

[Term-verb association matrix (slide graphic not fully recoverable): rows are the terms html form, hierarchical clustering, text retrieval, root node, single cluster, web page; columns are the verbs submit, test, describe, print, compute, search; an asterisk marks an observed term-verb association.]
FCA Taxonomy
- Formal concepts:
  - ({hierarchical clustering, root node, single cluster}, {compute, search})
  - ({html form, web page}, {print, search})
- Not all dependencies (c, v) are interesting; only those whose conditional probability exceeds a threshold t are kept (see the sketch below):

$P(c \mid v) = \dfrac{f(c, v)}{f(v)} > t$
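A minimal sketch of how the two formal concepts above can be computed from a term-verb context by naive enumeration, assuming Python; the context below encodes only the associations needed to reproduce those concepts (the rest of the example matrix is not recoverable), and it is not OntoGain's actual FCA implementation.

```python
from itertools import combinations

# Toy term -> verb context; only the associations behind the two formal
# concepts above are encoded here (assumed data, the full slide matrix
# is not recoverable).
context = {
    "hierarchical clustering": {"compute", "search"},
    "root node":               {"compute", "search"},
    "single cluster":          {"compute", "search"},
    "html form":               {"print", "search"},
    "web page":                {"print", "search"},
}

def common_attributes(objects):
    """Intersection of the attribute sets of a group of objects."""
    return set.intersection(*(context[o] for o in objects))

def extent(attributes):
    """All objects that carry every attribute in the given set."""
    return {o for o, attrs in context.items() if attributes <= attrs}

# Naive enumeration: (objects, attributes) is a formal concept when each
# side is exactly the closure of the other.
concepts = set()
for r in range(1, len(context) + 1):
    for objs in combinations(context, r):
        attrs = common_attributes(objs)
        if extent(attrs) == set(objs):
            concepts.add((frozenset(objs), frozenset(attrs)))

for objs, attrs in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(objs), "<->", sorted(attrs))

# Sub-concept ordering from the earlier slide: (O1, A1) <= (O2, A2) iff O1 is a subset of O2.
def subconcept(c1, c2):
    return c1[0] <= c2[0]
```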
Non-Taxonomic Relations extraction phase
- Concept Extraction
- Taxonomy Induction
- Non-Taxonomic Relations
Non-Taxonomic Relations
- Concepts are also characterized by attributes and relations to other concepts in the hierarchy
- Typically expressed by a verb relating a pair of concepts
- Two approaches:
  - Association rules
  - Probabilistic
Association Rules [Agrawal et al., 1993]
- Introduced to predict the purchase behavior of customers
- Extract terms connected by some relation (subject-verb-object)
- Enhance with general terms from the taxonomy
- Eliminate redundant relations: predictive accuracy < t (see the sketch below)
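A minimal sketch in the spirit of the association-rule filtering described above, assuming Python: "predictive accuracy" is treated here as rule confidence over subject-verb-object triples, and the triples and threshold are invented, so this is an illustration rather than OntoGain's actual procedure.

```python
from collections import Counter

# Toy subject-verb-object triples from shallow-parsed sentences (invented
# data, not the OhsuMed output shown on the next slide).
triples = [
    ("lipid peroxidation", "lead to", "cardiopulmonary bypass"),
    ("lipid peroxidation", "lead to", "cardiopulmonary bypass"),
    ("lipid peroxidation", "lead to", "oxidative stress"),
    ("accurate diagnosis", "depend", "clinical suspicion"),
]

def confident_relations(triples, t=0.5):
    """Keep (domain, verb, range) rules whose confidence exceeds threshold t.

    Confidence of (s, v) -> o is count(s, v, o) / count(s, v); it stands in
    here for the slide's 'predictive accuracy', which is an assumption.
    """
    pair_counts = Counter((s, v) for s, v, _ in triples)
    triple_counts = Counter(triples)
    return [
        (s, v, o, n / pair_counts[(s, v)])
        for (s, v, o), n in triple_counts.items()
        if n / pair_counts[(s, v)] > t
    ]

for rule in confident_relations(triples):
    print(rule)
```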
Association Rules: Example
Domain                          Range                           Label
chiasmal syndrome               pituitary disproportion         cause by
medial collateral ligament      surgical treatment              need
blood transfusion               antibiotic prophylaxis          result
lipid peroxidation              cardiopulmonary bypass          lead to
prostate specific antigen       prostatectomy                   follow
chronic fatigue syndrome        cardiac function                yield
right ventricular infarction    radionuclide ventriculography   analyze by
creatinine clearance            arteriovenous hemofiltration    achieve
cardioplegic solution           superoxide dismutase            give
bacterial translocation         antibiotic prophylaxis          decrease
accurate diagnosis              clinical suspicion              depend
ultrasound examination          clinical suspicion              give
total body oxygen consumption   epidural analgesia              attenuate by
coronary arteriography          physician                       perform by
Probabilistic approach [Cimiano et al., 2006]
- Collect verbal relations from the corpus
- Find the most general relation with respect to the verb, using frequency of occurrence, e.g.:
  - suffer_from(man, head_ache)
  - suffer_from(woman, stomach_ache)
  - suffer_from(patient, ache)
- Select relationships satisfying a conditional probability measure: associations scoring above a threshold t are accepted (see the sketch below)
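A minimal sketch of the probabilistic selection described above, assuming Python: relation instances are generalized along a toy taxonomy, and a candidate (verb, domain, range) is accepted when its conditional probability given the verb exceeds a threshold t. The taxonomy, instances and threshold are invented for illustration.

```python
from collections import Counter

# Toy taxonomy: child concept -> parent concept (assumed, not from the paper).
parent = {"man": "patient", "woman": "patient",
          "head_ache": "ache", "stomach_ache": "ache"}

def ancestors(concept):
    """The concept itself plus all of its more general concepts."""
    chain = [concept]
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain

# Observed verbal relations (subject, verb, object), as in the slide's example.
instances = [
    ("man", "suffer_from", "head_ache"),
    ("woman", "suffer_from", "stomach_ache"),
    ("patient", "suffer_from", "ache"),
]

def accepted_relations(instances, t=0.6):
    """Accept (verb, domain, range) whose probability given the verb exceeds t."""
    verb_counts = Counter(v for _, v, _ in instances)
    candidate_counts = Counter()
    for s, v, o in instances:
        # Every generalization of an instance supports the corresponding candidate.
        for d in ancestors(s):
            for r in ancestors(o):
                candidate_counts[(v, d, r)] += 1
    return [
        (v, d, r, n / verb_counts[v])
        for (v, d, r), n in candidate_counts.items()
        if n / verb_counts[v] > t
    ]

print(accepted_relations(instances))   # selects the most general suffer_from(patient, ache)
```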
Evaluation
- Relevance judgments are provided by humans
- Precision and recall (see the sketch below)
- We examined the 200 top-ranked concepts and their respective relations in 500 lines
- Results from the OhsuMed and Computer Science corpora
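For reference, a minimal sketch of how precision and recall are computed from such relevance judgments, assuming Python; the example sets are placeholders, not the actual judgments used in the evaluation.

```python
def precision_recall(extracted, relevant):
    """Precision and recall of a set of extracted items against a gold set.

    The example values below are placeholders, not the paper's judgments.
    """
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

extracted = {"web page", "information retrieval", "large number"}
relevant = {"web page", "information retrieval", "search engine"}
print(precision_recall(extracted, relevant))   # (0.666..., 0.666...)
```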
Results
Processing Layer          Method                    OhsuMed P   OhsuMed R   Comp. Sci. P   Comp. Sci. R
Concept Extraction        C/NC-Value                89.7%       91.4%       86.7%          89.6%
Taxonomic Relations       Formal Concept Analysis   47.1%       41.6%       44.2%          48.6%
Taxonomic Relations       Hierarchical Clustering   71.2%       67.3%       71.3%          62.7%
Non-Taxonomic Relations   Association Rules         71.8%       67.7%       72.8%          61.7%
Non-Taxonomic Relations   Probabilistic             62.7%       55.9%       61.6%          49.4%

(P = precision, R = recall)
Comparison with Text2Onto [Cimiano & Volker, 2005]
- Produces huge lists of plain single-word terms, and relations lacking semantic meaning
- Text2Onto cannot handle large texts
- Cannot export results in OWL
Conclusions
- OntoGain
  - Multi-word term concepts
  - Exports the ontology in OWL
  - Domain independent
- Results
  - C/NC-Value yields good results
  - Clustering outperforms FCA
  - Association Rules perform better than the probabilistic (verbal expressions) approach
Future Work
- Explore more methods / combinations, e.g. clustering and FCA
- Hearst patterns for discovering additional relation types (part-of)
- Discover attributes and cardinality constraints
- Incorporate term similarity information from WordNet, MeSH
- Resolve term ambiguities
Thank you!
Questions ?
Preprocessing
- Tokenization, POS tagging, shallow parsing (OpenNLP suite)
- Lemmatization (WordNet Java Library)
- Applied in all steps of OntoGain (see the sketch below)
- Shallow parsing is used in relation acquisition for the detection of verbal dependencies
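A minimal sketch of an equivalent preprocessing pipeline, using NLTK in Python as a stand-in for the OpenNLP suite and WordNet Java Library actually used by OntoGain; only tokenization, POS tagging and lemmatization are shown, and shallow parsing is omitted.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Stand-in for OntoGain's OpenNLP + WordNet Java Library pipeline.
# Requires the 'punkt', 'averaged_perceptron_tagger' and 'wordnet' NLTK data.
lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)                      # Penn Treebank tags
    lemmas = [
        lemmatizer.lemmatize(word.lower(), pos="v" if tag.startswith("VB") else "n")
        for word, tag in tagged
    ]
    return tagged, lemmas

tagged, lemmas = preprocess("Patients suffered from chronic head aches.")
print(tagged)
print(lemmas)
```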
- Terms sharing a head tend to be similar (see the sketch below)
  - e.g. "hierarchical method" and "agglomerative method" are both methods
- Nested terms are related to each other
  - e.g. "agglomerative clustering method" and "clustering method" should be associated
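A minimal sketch of these two heuristics as simple predicates over multi-word terms, assuming Python; the function names are invented for illustration.

```python
def share_head(term_a, term_b):
    """True when two multi-word terms have the same head (last word)."""
    return term_a.split()[-1] == term_b.split()[-1]

def is_nested(shorter, longer):
    """True when one term occurs as a contiguous sub-sequence of the other."""
    s, l = shorter.split(), longer.split()
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

print(share_head("hierarchical method", "agglomerative method"))          # True
print(is_nested("clustering method", "agglomerative clustering method"))  # True
```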