ProSWIP: Property-based Data Access for Semantic Web Interactive Programming
Silviu Homoceanu, Philipp Wille and Wolf-Tilo Balke
Institute for Information Systems, Braunschweig University of Technology, Germany

Content: Introduction, System Description, Evaluation, Conclusion

Introduction

Goal of programming with the WOD:
• Easily access the entities that correspond to a concept the developer has in mind.
Traditional ways:
• ORM (Object-Relational Mapping): maps RDF/OWL to programming structures, e.g. Jena and RDFReactor. The result is unintelligible and hard to customize and maintain.
• Compile-time meta-programming: there is no global ontology to drive the cleansing. Automatic ontology alignment offers only average quality and requires manual work.
Property-based data access model:
• A type is defined by a set of required properties.
• Every entity with at least those properties is part of that type.
• Concepts are bound to the properties that are required for the program logic.
• e.g. Movie = {Title, Genre, Director}

ProSWIP:
• Experiments show that the quality of property-based data access depends on the chosen properties.
• Task: provide property-based access while keeping the quality under control.
• Workflow: developers supply the required properties; ProSWIP estimates the quality of the selected data, identifies additional properties that have a positive impact on quality, combines them with user feedback, and returns the selected entities.
Contributions:
• An extensive inspection of the feasibility of the property-based paradigm for accessing data from the LOD cloud.
• The presentation and evaluation of a quality metric that is transparent for the paradigm.
• The presentation and evaluation of a property selection method for better data quality.

System Description

• Use case:
– Develop an application related to movies.
– Developers rely on variables that represent movie properties.
– These properties are used as filters to select entities in the LOD cloud.
– Data source: BTC 2012 dataset.
– Use precision and recall to evaluate the quality of the retrieved information.
– For the type Movie:
• precision: selected entities that represent movies / all selected entities
• recall: selected movies / all movies in the dataset
– Our approach focuses on improving precision.

7/8/2017

ProSWIP:
• Estimates the quality of the selected data.
• Automatically recommends suitable properties that can improve the quality of the selected data.
• User feedback: the user decides whether to add the recommended property to the type.
• The updated property-based type is used to select entities again, until the quality of the selected entities is above a threshold.
How to:
• Identify and select those entities that fulfill the property-based type definition.
• Compute the quality of a collection of entities.
• Find properties that, if added to the set of properties defining the type, significantly improve the quality.

Identify and select entities of a property-based type:
• c is a concept, defined extensionally through its set of entities $E_c$.
• $T_c$ is the type of the concept, given as the set of properties $T_c = P_c^+ \cup P_c^-$, where $P_c^+$ is the set of positive properties and $P_c^-$ is the set of negative properties.

Identify and select entities of a property-based type (Dictionary):
• p is the natural language label of a property. The function Dictionary(p) returns a set of property URIs for p.
• Synonym expansion: p is extended via WordNet to the set $P_{syn}$ of synonym property URIs, including those of p itself. For each WordNet synonym Syn of p, every $p'$ with $(p', \text{rdfs:label}, Syn) \in LOD$ is in $P_{syn}$.
• sameAs expansion: each property URI $p_i$ in $P_{syn}$ is extended to the set $P_{SA_{p_i}}$ of all owl:sameAs property URIs, including $p_i$.
• $Dictionary(p) := \bigcup_{i=1}^{|P_{syn}|} P_{SA_{p_i}}$
• An iterative process computes the field of the transitive closure of the binary relation owl:sameAs.
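The two expansion steps of Dictionary(p) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ProSWIP implementation (which the slides say is written in Scala on top of Lucene): the WordNet lookup is replaced by a plain synonym table, the LOD cloud by two small in-memory lists, and all names (`same_as_closure`, `dictionary`, the `ex:`, `dbo:` and `fb:` URIs) are hypothetical.

```python
from collections import defaultdict, deque


def same_as_closure(uri, same_as_pairs):
    """Return all URIs reachable from `uri` via owl:sameAs links, including `uri`."""
    neighbours = defaultdict(set)
    for a, b in same_as_pairs:
        # Assumption: sameAs links are followed in both directions.
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = {uri}
    queue = deque([uri])
    while queue:  # iterate until no new URI is discovered
        current = queue.popleft()
        for nxt in neighbours[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen


def dictionary(label, synonyms, labels, same_as_pairs):
    """Sketch of Dictionary(p): synonym expansion followed by sameAs expansion."""
    terms = {label} | set(synonyms.get(label, ()))
    # P_syn: every property URI whose rdfs:label is p or one of its synonyms.
    p_syn = {uri for uri, text in labels if text in terms}
    result = set()
    for uri in p_syn:
        result |= same_as_closure(uri, same_as_pairs)
    return result


# Hypothetical toy data, not taken from the BTC corpus.
SYNONYMS = {"director": ["film director"]}
LABELS = [
    ("ex:director", "director"),
    ("ex:filmDirector", "film director"),
    ("ex:title", "title"),
]
SAME_AS = [("ex:director", "dbo:director"), ("dbo:director", "fb:directed_by")]

print(sorted(dictionary("director", SYNONYMS, LABELS, SAME_AS)))
```

The breadth-first loop plays the role of the iterative process: it keeps adding URIs until a pass discovers nothing new, so `fb:directed_by` is found even though it is only linked to `ex:director` indirectly.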
Identify and select entities of a property-based type (selection):
How to select the entities of $E_c$?
• hit function: e is an entity represented by its URI; p is the natural language label of a property.
• $hit(e, p) = 1$ iff $\exists p' \in Dictionary(p) : (e, p', *) \in LOD$; otherwise $hit(e, p) = 0$.
• For a type $T_c = P_c^+ \cup P_c^-$ and an entity e: $e \in E_c$ iff $hit(e, p) = 1$ for every $p \in P_c^+$ and $hit(e, p) = 0$ for every $p \in P_c^-$.

Quality of the selected entities:
• Entities from the same data source and of the same type share similar properties.
• All entities that have exactly the same properties are reduced to a single witness: $E_c$ is reduced to $W_c$.
• Quality of the selected entities: the average similarity over all pairs of witnesses in $W_c$.
• Similarity of two witnesses (Jaccard similarity index): $sim(w_i, w_j) = \frac{|P_{w_i} \cap P_{w_j}|}{|P_{w_i} \cup P_{w_j}|}$, where $P_{w_i}$ and $P_{w_j}$ are the sets of properties of $w_i$ and $w_j$.

Property Selection:
• Finding the properties that best distinguish different types is similar to the problem of inducing an optimal decision tree in data classification.
• Information gain is the standard measure for deciding the relevance of a property; computing it requires a uniform type label for each entity.
• The type information is strongly correlated with the entity properties and is latent in the properties.
• The problem of reducing similar witnesses to a dominant type resembles the problem of dimension reduction.

Principal Component Analysis (PCA): a basis transformation that seeks to reduce the dimensionality of the data by finding a few orthogonal linear combinations (called principal components) of the original variables that capture the largest variance. X is an n × p matrix (n: number of witnesses; p: number of properties of all witnesses), decomposed as $X = U D V^T$.
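As a rough illustration of the selection rule and the witness-based quality score described in this section, the following Python sketch works on in-memory sets. It is an assumption-laden toy, not the authors' code: entities are plain dictionaries from a hypothetical URI to its set of property URIs, each positive or negative property is passed as its already expanded Dictionary(p) set, and a selection with fewer than two witnesses is assigned a quality of 1.0, which the slides do not specify.

```python
from itertools import combinations


def hit(entity_props, dictionary_uris):
    """1 if the entity uses at least one property URI from Dictionary(p), else 0."""
    return 1 if entity_props & dictionary_uris else 0


def select(entities, positive, negative):
    """Keep entities that hit every positive property and no negative property.

    entities: {entity URI: set of property URIs}
    positive / negative: lists of URI sets, one Dictionary(p) per property
    """
    return {
        e: props
        for e, props in entities.items()
        if all(hit(props, d) for d in positive)
        and not any(hit(props, d) for d in negative)
    }


def jaccard(a, b):
    """Jaccard similarity index of two property sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def quality(selected):
    """Average pairwise Jaccard similarity over the witnesses of the selection."""
    # Entities with exactly the same properties collapse into one witness.
    witnesses = {frozenset(props) for props in selected.values()}
    pairs = list(combinations(witnesses, 2))
    if not pairs:
        return 1.0  # assumption: fewer than two witnesses counts as homogeneous
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy entities: two movies, one book, one sparse entity.
ENTITIES = {
    "m1": {"title", "director", "genre"},
    "m2": {"title", "director", "genre"},
    "b1": {"title", "director", "isbn"},
    "x1": {"title"},
}

loose = select(ENTITIES, positive=[{"title"}, {"director"}], negative=[])
strict = select(ENTITIES, positive=[{"title"}, {"director"}], negative=[{"isbn"}])
print(sorted(loose), quality(loose))
print(sorted(strict), quality(strict))
```

In the toy data the book shares title and director with the movies, so the loose type selects it as well and the two witnesses overlap in only two of four properties. Adding isbn as a negative property removes the book and leaves a single witness.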
$Y = UD$ are the principal components. Usually the first PCs (capturing the highest data variance) are chosen to represent the dominant dimensions.

Property Selection:
• For each PC that shows variance above the threshold (PC1, ..., PCt), perform one-dimensional clustering on the coefficients (agglomerative hierarchical clustering with average inter-cluster similarity).
• Each PC yields property clusters (Property Cluster 1, ..., Property Cluster n); this step depends on the ISODATA algorithm.
• Property clusters are grouped together into latent types, each of which has multiple properties.
• Each entity in $E_c$ is labeled with those latent types according to the property-based model; the labels are used to compute the information gain.

Property Selection:
• Based on the labeled entities, we can compute the information gain of a property.
• The property with the maximum information gain is recommended to the user, who decides whether to add it to the type for further entity selection.

One iteration of the interactive process (example):
• Type Movie: $P_c^+$ = {actor, director}; this type selects $E_c$.
• The automatic property selection component identifies the property genre.
• User feedback?
– positive: the property genre is extended via Dictionary(genre). Updated type Movie: $P_c^+$ = {actor, director, genre}. Entities e with hit(e, genre) = 1 will be selected.
– negative: the property genre is extended via Dictionary(genre). Updated type Movie: $P_c^+$ = {actor, director}, $P_c^-$ = {genre}. Entities e with hit(e, genre) = 0 will be selected.

Evaluation

Initial type definition:
• Starting from different concepts presented in structured form with schemata on Schema.org.
• Build an initial type definition for each concept. Each type comprises the four properties that have been most frequently annotated in ClueWeb12 for the corresponding schema.org schema.
• The set of entities is selected from the BTC data corpus.
• The quality score threshold is 0.65.
• We simulate the user feedback by relying on information from schema.org. If the property with the highest information gain is part of the schema that describes the concept in schema.org, the user feedback is positive. Otherwise, the user feedback is negative.
Drawbacks:
• schema.org does not ensure the completeness of properties.
• Some feedback is not suitable from a human's perspective.

• Build a gold standard (for the supposed entity type) to compute precision and recall.
• The gold standard is built by bootstrapping from a set of seed entities that we know are instances of the given concept type. All concept-related types annotated by rdf:type are retrieved and the others are pruned. Then a subset of all initially selected entities is chosen at random. All entities go through a process of manual inspection.
• We build a system RAND, which randomly selects properties in each iteration, as a baseline.
• We evaluate ProSWIP on multiple concepts from various fields with different characteristics.

Quality metric:
• Pearson's linear correlation coefficient between the quality metric and precision is 0.94, which shows the quality metric's expressiveness for the quality of the data selection.
• Precision rapidly increases towards values above 90%, showing the success of the whole approach.

• Implemented in Scala, using Lucene to index and query (Neo4j's performance was not good).
• Hardware: Intel i5-3550 quad-core CPU at 3.3 GHz, 32 GB RAM and a hard drive with 8.5 ms access time.
• Index creation time for the complete BTC data set: 39 h.
• The resulting index and data: 1 TB.
• One simple entity search: 16 s.
• The complete process of property-based data access may take up to hours, as multiple queries and entity and property retrievals are performed.
• Computing the quality, principal components, latent types and information gain for all properties on large data samples: < 2 s.
Further work: a cache mechanism and a Lucene-based distributed index leveraging Hadoop are needed.

Conclusion

• Property-based data access represents a cornerstone in programming with the WOD.
• Our experiments show that such an approach suffers from quality problems.
• We use an entity homogeneity-based quality metric and iterative user feedback on the chosen properties to control the quality of the selected entities.
• The quality metric is highly correlated with precision and provides transparency.
• With our approach, precision easily reaches values above 0.9.
• The sparse nature of the data in the LOD cloud severely affects recall. The recall problem is to be tackled in future work.

Thank you. Any questions?