Adopting Ontologies for Multisource Identity Resolution
Milena Yankova, Horacio Saggion, Hamish Cunningham
Department of Computer Science, The University of Sheffield
Overview
• Introduction
• Knowledge representation
• Usage of ontologies in identity resolution
• Case-study & Evaluation
• Conclusion and Further Work
Introduction
• Identity resolution aims to identify newly presented facts and link them to their previous mentions.
• Our main hypothesis is that:
– variations of one and the same fact can be recognised,
– duplications can be removed, and
– their aggregation actually increases the correctness of fact extraction.
• We use an ontology as the internal and resulting knowledge representation formalism.
• The ontology contains not only a representation of the domain, but also known entities and their properties.
Knowledge Representation via Ontologies
• Ontologies have been chosen because of their detailed entity descriptions, complemented with semantic information.
• The expected benefit of a semantic representation is the ability to recognise not only the type/class of objects, but also the individual instances they refer to.
– For example, different appearances of “M&S" on different sources (e.g. web pages) are extracted and collected as a single instance to which all mentions point, as sketched below.
• The semantic linkup of the identified objects guarantees a more detailed description than a simple syntactic representation.
• These details serve as evidence that can improve the accuracy of object comparison.
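A minimal sketch of this linkup (the namespace URI and the property names hasText and refersTo are assumptions, not from the slides), using the Python rdflib library: two different surface forms from different pages end up pointing to one resolved instance.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

MUSING = Namespace("http://example.org/musing#")  # placeholder namespace

g = Graph()
company = MUSING["Company_MarksAndSpencer"]       # the single resolved instance
g.add((company, RDF.type, MUSING.Company))
g.add((company, RDFS.label, Literal("MARKS & SPENCER")))

# Different appearances on different web pages all point to the same instance.
for mention_id, surface_form in [("m1", "M&S"), ("m2", "Marks and Spencer plc")]:
    mention = MUSING[mention_id]
    g.add((mention, RDF.type, MUSING.Mention))
    g.add((mention, MUSING.hasText, Literal(surface_form)))
    g.add((mention, MUSING.refersTo, company))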
Sources of Information
• In this application we have two sources of information (company profiles):
– A database of manually collected company details
– Profiles extracted from web pages
Mapping Databases to Ontologies
• The database schema is the data description that holds the meaning of the data.
• Binding databases to another knowledge representation formalism, e.g. an ontology, requires deep understanding and domain expertise.
• It is usually done manually, producing a mapping between the particular database schema and a given ontology.
• We use company profiles stored in a MySQL Relational Database Management System, which has been manually mapped to the Musing ontology using scripts, as sketched below.
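A minimal sketch of such a mapping script (the table and column names, the property names, and the namespace are hypothetical; the actual Musing mapping scripts are not shown in the slides):

import MySQLdb
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

MUSING = Namespace("http://example.org/musing#")  # placeholder namespace

def map_company_table(host, user, passwd, db):
    """Map each row of an assumed 'company' table to a musing:Company instance."""
    conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db)
    cursor = conn.cursor()
    cursor.execute("SELECT id, name, postcode FROM company")  # assumed schema
    g = Graph()
    for row_id, name, postcode in cursor.fetchall():
        instance = MUSING["Company_%d" % row_id]
        g.add((instance, RDF.type, MUSING.Company))
        g.add((instance, MUSING.hasName, Literal(name)))
        g.add((instance, MUSING.hasPostcode, Literal(postcode)))
    conn.close()
    return g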
Information Extraction
Ontology-based Information Extraction
• Ontology-based information extraction aims at identifying in text the concepts and instances from an underlying domain model specified in an ontology.
• The extraction prototype uses some default linguistic processors from GATE.
• Custom application rules for concept identification are specified as regular grammars implemented in the JAPE language, e.g. the sketch below.
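A minimal JAPE sketch of such a rule (the gazetteer feature values and the Mention annotation are assumptions; the actual Musing grammars are not shown): it marks a gazetteer-recognised organisation in the text as a mention of the ontology class musing:Company.

Phase:   CompanyConcepts
Input:   Lookup Token
Options: control = appelt

Rule: CompanyMention
(
  {Lookup.majorType == "organization"}
):org
-->
:org.Mention = {class = "musing:Company", rule = "CompanyMention"}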
Ontologies in IDRF
• Our approach to the identity problem has been implemented as the Identity Resolution Framework (IDRF).
• It uses an ontology as the internal and resulting knowledge representation formalism.
• It is based on the PROTON ontology, which can be extended, e.g. for our particular domain of company profiling.
Identity Class Models
• Execution of the IDRF is based on what we call Class Models, which handle the differences between entity types represented as ontology classes.
• Each class model is expressed by a single formula based on first-order probabilistic logic.
• Each formula is manually composed by combining predicates with the usual logical connectives “∧”, “∨”, “¬” and “⇒”.
• Class models are used in two stages of the framework pipeline:
– during the retrieval of potential matching candidates from the ontology, applying a strict criterion;
– during the actual comparison of potential matching pairs, applying a soft criterion.
• They are also evaluated differently depending on which component uses them.
Example of Class Model definition
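The definition itself did not survive the transcript; as a hedged illustration (the predicate names are ours, not from the slides), a class model for musing:Company could require a name match strengthened by address evidence:

M_Company(x, y) = name(x, y) ∧ (postcode(x, y) ∨ city(x, y))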
Pre-filtering
• It restricts the full set of ontology instances to a reasonable number of candidates, to which the source entity will then be compared.
• In this stage the engine does not formally evaluate the class model/formula, but composes a SeRQL or SQL query from it.
• The query embodies the model's strict equivalence criterion.
Example of a Pre-filtering Query
• The query for “MARKS & SPENCER”, composed according to the class model for "musing:Company", is sketched below.
Evidence Collection (1)
• This component calculates the similarity between two objects based on their class model.
• The similarity is expressed by a probabilistic logic formula resulting in a real number from 0 to 1:
– “0” means that the given entities are totally different;
– “1” means that they are absolutely equivalent;
– any value in between is the probability that these entities are equivalent.
Evidence Collection (2)
• The value of each predicate in the formula is calculated by the comparison algorithm it represents.
– Predicate values are combined according to the logical connectives in the formula.
– In this setting the usual logical connectives are expressed as arithmetic expressions, e.g. a ∨ b = a + b − a·b, as in the sketch below.
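A minimal sketch of this arithmetic in Python (the product reading of conjunction and the example predicate values are assumptions; the slides only give the disjunction):

def p_or(a, b):
    # a ∨ b = a + b - a*b (from the slides)
    return a + b - a * b

def p_and(a, b):
    # a ∧ b = a*b (one common probabilistic choice, assumed here)
    return a * b

def p_not(a):
    # ¬a = 1 - a (assumed)
    return 1.0 - a

# Evaluating the illustrative musing:Company model from the earlier slide:
# name(x, y) ∧ (postcode(x, y) ∨ city(x, y))
name_sim, postcode_sim, city_sim = 0.9, 0.8, 0.0
similarity = p_and(name_sim, p_or(postcode_sim, city_sim))  # 0.9 * 0.8 = 0.72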
Data Integration
• This is the third stage of the identification process.
• It encodes the strength of the presented evidence for choosing the candidate favoured by the Class Model.
• The successful candidate must pass a threshold which balances the precision and recall of the application.
Decision Threshold
• A pre-set threshold determines whether a match is registered as successful.
• We used ROC curve analysis to set the threshold to 0.4, which gives the best performance in our application (see the sketch below).
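A minimal sketch of the integration decision in Python (the function and data shapes are hypothetical; only the 0.4 threshold comes from the slides):

THRESHOLD = 0.4  # set via ROC curve analysis

def integrate(scored_candidates):
    """scored_candidates: (candidate, similarity) pairs from evidence collection."""
    if not scored_candidates:
        return None  # no candidate passed pre-filtering: the entity is new
    best, score = max(scored_candidates, key=lambda pair: pair[1])
    # Register the match only if the evidence passes the threshold.
    return best if score >= THRESHOLD else None

# integrate([("Company_42", 0.72), ("Company_7", 0.31)]) -> "Company_42"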
Case-study
• Our case-study is focused on company profiling.
• We have automatically extracted hundreds of company profiles from different web sites, e.g. http://uk.finance.yahoo.com
• Our database is populated with about 1.8M manually collected company profiles provided by http://www.marketlocation.com
• The evaluation targeted a set of 310 extracted UK companies compared against the database.
Evaluation of the IDRF
• The accuracy of identity resolution is very promising (89% F-measure).
• Another experiment, on automatically extracted vacancies, shows similar results.
Evaluation of the IE
• The recall of automatically extracted company attributes improves from 92% to 97% after integration.
• The precision rises slightly, from 70% to 73%.
Conclusion and future work
• IDRF is a general framework for identity resolution based on ontologies and adapted to ontology-based information extraction applications.
• Future work: how the uniqueness of the details and their number influence the identification process.
• Thank you Adam!
• Please don’t hesitate to send your questions
to [email protected]