Adopting Ontologies for Multisource Identity Resolution
Milena Yankova, Horacio Saggion, Hamish Cunningham
Department of Computer Science, The University of Sheffield
Overview
• Introduction
• Knowledge representation
• Usage of ontologies in identity resolution
• Case-study & Evaluation
• Conclusion and Further Work
Introduction
• Identity resolution aims to identify newly presented facts and link them to their previous mentions.
• Our main hypothesis is that:
– variations of one and the same fact can be recognised,
– duplications can be removed, and
– their aggregation actually increases the correctness of fact extraction.
• We use an ontology as the internal and resulting knowledge representation formalism.
• The ontology contains not only a representation of the domain, but also known entities and their properties.
Knowledge Representation via Ontologies
• Ontologies have been chosen because of their detailed entity descriptions, complemented with semantic information.
• The expected benefit of a semantic representation is the ability to recognise not only the type/class of objects, but also the individual instances they refer to.
– For example, different appearances of “M&S" on different sources (e.g. web pages) are extracted and collected as a single instance to which all mentions point, as sketched below.
• The semantic linkup of the identified objects guarantees a more detailed description than a simple syntactic representation.
• These details serve as evidence that can improve the accuracy of object comparison.
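A minimal sketch of this linkup (the namespace URI and the property names hasText and refersTo are assumptions, not from the slides), using the Python rdflib library: two different surface forms from different pages end up pointing to one resolved instance.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

MUSING = Namespace("http://example.org/musing#")  # placeholder namespace

g = Graph()
company = MUSING["Company_MarksAndSpencer"]       # the single resolved instance
g.add((company, RDF.type, MUSING.Company))
g.add((company, RDFS.label, Literal("MARKS & SPENCER")))

# Different appearances on different web pages all point to the same instance.
for mention_id, surface_form in [("m1", "M&S"), ("m2", "Marks and Spencer plc")]:
    mention = MUSING[mention_id]
    g.add((mention, RDF.type, MUSING.Mention))
    g.add((mention, MUSING.hasText, Literal(surface_form)))
    g.add((mention, MUSING.refersTo, company))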
Sources of Information
• In this application we have two sources of information (company profiles):
– A database of manually collected company details
– Profiles extracted from web pages
Mapping Databases to Ontologies
• The database schema is the data description that holds the meaning of the data.
• Binding databases to another knowledge representation formalism, e.g. an ontology, requires deep understanding and domain expertise.
• It is usually done manually, producing a mapping between the particular database schema and a given ontology.
• We use company profiles stored in a MySQL Relational Database Management System, which has been manually mapped to the Musing ontology using scripts, as sketched below.
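A minimal sketch of such a mapping script (the table and column names, the property names, and the namespace are hypothetical; the actual Musing mapping scripts are not shown in the slides):

import MySQLdb
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

MUSING = Namespace("http://example.org/musing#")  # placeholder namespace

def map_company_table(host, user, passwd, db):
    """Map each row of an assumed 'company' table to a musing:Company instance."""
    conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db)
    cursor = conn.cursor()
    cursor.execute("SELECT id, name, postcode FROM company")  # assumed schema
    g = Graph()
    for row_id, name, postcode in cursor.fetchall():
        instance = MUSING["Company_%d" % row_id]
        g.add((instance, RDF.type, MUSING.Company))
        g.add((instance, MUSING.hasName, Literal(name)))
        g.add((instance, MUSING.hasPostcode, Literal(postcode)))
    conn.close()
    return g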
Information Extraction
Ontology-based Information Extraction
• Ontology-based information extraction aims at identifying in text the concepts and instances from an underlying domain model specified in an ontology.
• The extraction prototype uses some default linguistic processors from GATE.
• Custom application rules for concept identification are specified as regular grammars implemented in the JAPE language, e.g. the sketch below.
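A minimal JAPE sketch of such a rule (the gazetteer feature values and the Mention annotation are assumptions; the actual Musing grammars are not shown): it marks a gazetteer-recognised organisation in the text as a mention of the ontology class musing:Company.

Phase:   CompanyConcepts
Input:   Lookup Token
Options: control = appelt

Rule: CompanyMention
(
  {Lookup.majorType == "organization"}
):org
-->
:org.Mention = {class = "musing:Company", rule = "CompanyMention"}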
Ontologies in IDRF
• Our approach to the identity problem has been implemented as the Identity Resolution Framework (IDRF).
• It uses an ontology as the internal and resulting knowledge representation formalism.
• It is based on the PROTON ontology, which can be extended, e.g. for our particular domain of company profiling.
Identity Class Models
• Execution of the IDRF is based on what we call Class Models, which handle the differences between entity types represented as ontology classes.
• Each class model is expressed by a single formula based on first-order probabilistic logic.
• Each formula is manually composed by combining predicates with the usual logical connectives “∧”, “∨”, “¬” and “⇒”.
• Class models are used in two stages of the framework pipeline:
– during the retrieval of potential matching candidates from the ontology, applying a strict criterion;
– during the actual comparison of potential matching pairs, applying a soft criterion.
• They are also evaluated differently depending on which component uses them.
Example of Class Model definition
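The definition itself did not survive the transcript; as a hedged illustration (the predicate names are ours, not from the slides), a class model for musing:Company could require a name match strengthened by address evidence:

M_Company(x, y) = name(x, y) ∧ (postcode(x, y) ∨ city(x, y))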
Pre-filtering
• It restricts the full set of ontology instances to a reasonable number of candidates, to which the source entity will then be compared.
• In this stage the engine does not formally evaluate the class model/formula, but composes a SeRQL or SQL query from it.
• The query embodies the model's strict equivalence criterion.
Example of a Pre-filtering Query
• The query for “MARKS & SPENCER”, composed according to the class model for "musing:Company", is sketched below.
Evidence Collection (1)
• This component calculates the similarity between two objects based on their class model.
• The similarity is expressed by a probabilistic logic formula resulting in a real number from 0 to 1:
– “0” means that the given entities are totally different;
– “1” means that they are absolutely equivalent;
– any value in between is the probability that these entities are equivalent.
Evidence Collection (2)
• The value of each predicate in the formula is calculated by the comparison algorithm it represents.
– Predicate values are combined according to the logical connectives in the formula.
– In this setting the usual logical connectives are expressed as arithmetic expressions, e.g. a ∨ b = a + b − a·b, as in the sketch below.
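A minimal sketch of this arithmetic in Python (the product reading of conjunction and the example predicate values are assumptions; the slides only give the disjunction):

def p_or(a, b):
    # a ∨ b = a + b - a*b (from the slides)
    return a + b - a * b

def p_and(a, b):
    # a ∧ b = a*b (one common probabilistic choice, assumed here)
    return a * b

def p_not(a):
    # ¬a = 1 - a (assumed)
    return 1.0 - a

# Evaluating the illustrative musing:Company model from the earlier slide:
# name(x, y) ∧ (postcode(x, y) ∨ city(x, y))
name_sim, postcode_sim, city_sim = 0.9, 0.8, 0.0
similarity = p_and(name_sim, p_or(postcode_sim, city_sim))  # 0.9 * 0.8 = 0.72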
Data Integration
• This is the third stage of the identification process.
• It encodes the strength of the presented evidence for choosing the candidate favoured by the Class Model.
• The successful candidate must pass a threshold which balances the precision and recall of the application.
Decision Threshold
• A pre-set threshold determines whether a match is registered as successful.
• We used ROC curve analysis to set the threshold to 0.4, which gives the best performance in our application (see the sketch below).
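A minimal sketch of the integration decision in Python (the function and data shapes are hypothetical; only the 0.4 threshold comes from the slides):

THRESHOLD = 0.4  # set via ROC curve analysis

def integrate(scored_candidates):
    """scored_candidates: (candidate, similarity) pairs from evidence collection."""
    if not scored_candidates:
        return None  # no candidate passed pre-filtering: the entity is new
    best, score = max(scored_candidates, key=lambda pair: pair[1])
    # Register the match only if the evidence passes the threshold.
    return best if score >= THRESHOLD else None

# integrate([("Company_42", 0.72), ("Company_7", 0.31)]) -> "Company_42"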
Case-study
• Our case-study is focused on company profiling.
• We have automatically extracted hundreds of company profiles from different web sites, e.g. http://uk.finance.yahoo.com
• Our database is populated with about 1.8M manually collected company profiles provided by http://www.marketlocation.com
• The evaluation targeted a set of 310 extracted UK companies compared against the database.
Evaluation of the IDRF
• The accuracy of identity resolution is very promising (89% F-measure).
• Another experiment, on automatically extracted vacancies, shows similar results.
Evaluation of the IE
• The recall of automatically extracted company attributes improves from 92% to 97% after integration.
• The precision rises slightly, from 70% to 73%.
Conclusion and future work
• IDRF is a general framework for identity resolution based on ontologies and adapted to ontology-based information extraction applications.
• Future work: how the uniqueness of the details and their number influence the identification process.
• Thank you Adam!
• Please don’t hesitate to send your questions
to [email protected]