ProSWIP: Property-based Data Access
for Semantic Web Interactive Programming
Silviu Homoceanu, Philipp Wille and Wolf-Tilo Balke
Institute for Information Systems,
Braunschweig University of Technology, Germany
Content
Introduction
System Description
Evaluation
Conclusion
Introduction
Goal of programming with the Web of Data (WOD):
•Easily access the entities that correspond to a concept the developer has in mind.
Traditional approaches:
•ORM (Object-Relational Mapping): maps RDF/OWL to programming structures, e.g. Jena and RDFReactor. The result is unintelligible and hard to customize and maintain.
•Compile-time meta-programming: there is no global ontology to drive the cleansing, and automatic ontology alignment offers only average quality, so manual work is required.
Property-based data access model:
•A type is defined by a set of required properties.
•Every entity with at least those properties is part of that type.
•Concepts are bound to the properties that are required for the program logic.
•e.g. Movie = {Title, Genre, Director}
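The property-based model above can be illustrated with a minimal sketch. The entity URIs and property labels are hypothetical toy data, and `select_entities` is an illustrative helper, not part of ProSWIP.

```python
# Toy data: entity URI -> set of property labels (all names are hypothetical).
ENTITIES = {
    "ex:Inception": {"title", "genre", "director", "runtime"},
    "ex:Vertigo": {"title", "genre", "director"},
    "ex:SomeBook": {"title", "genre", "author"},
}

def select_entities(entities, required):
    """Return the URIs of all entities that carry at least the required properties."""
    required = set(required)
    return sorted(uri for uri, props in entities.items() if required <= props)

# The type Movie is nothing more than a set of required properties.
MOVIE = {"title", "genre", "director"}
movies = select_entities(ENTITIES, MOVIE)
print(movies)
```

The book entity is excluded only because it lacks `director`; no class hierarchy or ontology is consulted.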
ProSWIP:
•Experiments show that the quality of property-based data access depends on the chosen properties.
•Task: property-based access while keeping the quality under control.
•Developers supply the required properties; ProSWIP returns the selected entities.
•ProSWIP estimates the quality of the selected data and identifies additional properties that have a positive impact on quality, combined with user feedback.
•An extensive inspection of the feasibility of the property-based paradigm for accessing data from the LOD cloud.
•The presentation and evaluation of a quality metric that is transparent for the paradigm.
•The presentation and evaluation of a property selection method for better data quality.
System Description
• Use case:
– Develop an application related to movies.
– Developers rely on variables that represent movie properties.
– These properties are used as filters to select entities in the LOD cloud.
– Data source: the BTC 2012 dataset.
– Precision and recall are used to evaluate the quality of the retrieved information.
– For the type Movie:
• precision: selected entities that represent movies / all selected entities
• recall: selected movies / all movies in the dataset
– Our approach focuses on improving precision.
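As a small worked example of the two measures defined above, the following sketch computes precision and recall for a selection. The entity identifiers are invented for illustration and are not taken from the BTC dataset.

```python
def precision_recall(selected, relevant):
    """Precision and recall of a selection against the set of truly relevant entities."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 entities selected, 3 of them are movies,
# and the dataset contains 6 movies in total.
selected = {"m1", "m2", "m3", "b1"}
all_movies = {"m1", "m2", "m3", "m4", "m5", "m6"}
precision, recall = precision_recall(selected, all_movies)
print(precision, recall)
```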
7/8/2017
ProSWIP:
•Estimate the quality of the selected data.
•Automatically recommend suitable properties that can improve the quality of the selected data.
•User feedback: the user decides whether to add the property to the type.
•Update the property-based type and use it to select entities.
•Repeat until the quality of the selected entities is above a threshold.
How to:
•Identify and select those entities that fulfill the property-based type definition.
•Compute the quality of a collection of entities.
•Find properties that, if added to the set of properties defining the type, significantly improve the quality.
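The interactive loop described above could be sketched as follows. This is only an outline under stated assumptions: `toy_quality` and `toy_recommend` are deliberately simple stand-ins for the quality metric and the property recommender, the data is invented, and `max_rounds` is an added safety limit. The default threshold of 0.65 is the value used in the evaluation.

```python
# All names and data below are hypothetical and only illustrate the loop.
DATA = {
    "ex:film1": {"actor", "director", "genre"},
    "ex:film2": {"actor", "director", "genre"},
    "ex:game1": {"actor", "director", "platform"},
    "ex:game2": {"actor", "director", "platform", "engine"},
}

def select(positive, negative):
    """Entities with every positive property and no negative property."""
    return {uri: props for uri, props in DATA.items()
            if positive <= props and not (negative & props)}

def toy_quality(entities):
    """Stand-in metric: share of entities that have the most common property set."""
    if not entities:
        return 0.0
    counts = {}
    for props in entities.values():
        key = frozenset(props)
        counts[key] = counts.get(key, 0) + 1
    return max(counts.values()) / len(entities)

def toy_recommend(entities, used):
    """Stand-in recommender: the unused property that splits the entities most evenly."""
    candidates = sorted(set().union(*entities.values()) - used)
    if not candidates:
        return None
    half = len(entities) / 2
    return min(candidates,
               key=lambda p: abs(sum(p in props for props in entities.values()) - half))

def refine_type(positive, negative, ask_user, threshold=0.65, max_rounds=10):
    """Recommend properties and apply user feedback until quality exceeds the threshold."""
    positive, negative = set(positive), set(negative)
    for _ in range(max_rounds):
        entities = select(positive, negative)
        if toy_quality(entities) > threshold:
            break
        prop = toy_recommend(entities, positive | negative)
        if prop is None:
            break
        # Positive feedback adds the property to P+, negative feedback to P-.
        (positive if ask_user(prop) else negative).add(prop)
    return positive, negative, sorted(select(positive, negative))

positive, negative, selected = refine_type(
    {"actor", "director"}, set(), ask_user=lambda prop: prop == "genre")
print(positive, negative, selected)
```

Here the simulated user accepts only `genre`, so one round is enough to narrow the selection to the two film entities.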
Identify and select entities of property-based type:
•c is a concept, which is extensionally defined through the set of entities $E_c$.
•$T_c$ is the type of concept c, given as the set of properties $T_c = P_c^+ \cup P_c^-$.
•$P_c^+$ is the set of positive properties and $P_c^-$ is the set of negative properties, with $P_c^+ \cap P_c^- = \emptyset$.
Identify and select entities of property-based type:
•p is the natural language label of a property.
•The function Dictionary(p) returns a set of property URIs for each property label p.
•Step 1: extend p with its WordNet synonyms to obtain the synonym property URI set $P_{syn}$ (including p). For each WordNet synonym Syn of p: if $\exists p' : (p', \text{rdfs:label}, Syn) \in LOD$, then $p' \in P_{syn}$.
•Step 2: for each property URI $p_i \in P_{syn}$, extend $p_i$ by owl:sameAs to obtain $P_{SA}^{p_i}$, the set of all sameAs property URIs, including $p_i$.
•$Dictionary(p) := \bigcup_{i=1}^{|P_{syn}|} P_{SA}^{p_i}$
•$P_{SA}^{p_i}$ is obtained by an iterative process that computes the field of the transitive closure of the binary relation owl:sameAs.
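A minimal sketch of the Dictionary(p) construction follows. It assumes small in-memory lookups in place of WordNet and the LOD cloud; `SYNONYMS`, `LABELS`, `SAME_AS` and all URIs are hypothetical. The sketch also treats owl:sameAs as symmetric when computing the closure.

```python
from collections import deque

# Hypothetical toy data standing in for WordNet and the LOD cloud.
SYNONYMS = {"genre": {"genre", "category"}}       # label -> WordNet synonyms (incl. itself)
LABELS = {                                        # property URI -> rdfs:label
    "dbp:genre": "genre",
    "fb:film.genre": "genre",
    "ex:category": "category",
    "ex:runtime": "runtime",
}
SAME_AS = [("dbp:genre", "lmdb:genre"), ("lmdb:genre", "yago:hasGenre")]

def same_as_closure(uri, same_as_pairs):
    """All URIs reachable from `uri` through owl:sameAs links, including `uri` itself."""
    neighbours = {}
    for a, b in same_as_pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {uri}, deque([uri])
    while queue:                                  # iterate until no new URIs are found
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def dictionary(p):
    """Union of the sameAs closures of all property URIs labeled with a synonym of p."""
    synonyms = SYNONYMS.get(p, {p})
    p_syn = {uri for uri, label in LABELS.items() if label in synonyms}
    result = set()
    for uri in p_syn:
        result |= same_as_closure(uri, SAME_AS)
    return result

genre_uris = dictionary("genre")
print(sorted(genre_uris))
```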
Identify and select entities of property-based type:
How to select the entities of $E_c$?
Hit function:
•e is an entity represented by its URI.
•p is the natural language label of a property.
•$hit(e, p) = 1$ iff $\exists p' \in Dictionary(p) : (e, p', *) \in LOD$; otherwise $hit(e, p) = 0$.
For a type $T_c = P_c^+ \cup P_c^-$ and an entity e, e is in $E_c$ iff:
•for each p in $P_c^+$, $hit(e, p) = 1$, and
•for each p in $P_c^-$, $hit(e, p) = 0$.
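The hit function and the resulting membership test can be sketched as below. The triples and the precomputed `DICTIONARY` mapping are hypothetical toy data; a real system would query an index over the LOD cloud instead of scanning a list.

```python
# Hypothetical triples (subject, predicate, object) standing in for the LOD cloud.
TRIPLES = [
    ("ex:film1", "dbp:director", "ex:person1"),
    ("ex:film1", "dbp:starring", "ex:person2"),
    ("ex:game1", "dbp:director", "ex:person3"),
    ("ex:game1", "dbp:starring", "ex:person4"),
    ("ex:game1", "dbp:platform", "ex:console1"),
]
# Hypothetical result of Dictionary(p) for each property label.
DICTIONARY = {
    "director": {"dbp:director"},
    "actor": {"dbp:starring"},
    "platform": {"dbp:platform"},
}

def hit(e, p):
    """1 if entity e has any value for a property URI in Dictionary(p), else 0."""
    uris = DICTIONARY.get(p, set())
    return 1 if any(s == e and pred in uris for s, pred, _ in TRIPLES) else 0

def in_type(e, positive, negative):
    """e belongs to E_c iff it hits every positive and no negative property."""
    return (all(hit(e, p) == 1 for p in positive)
            and all(hit(e, p) == 0 for p in negative))

entities = sorted({s for s, _, _ in TRIPLES})
selected = [e for e in entities if in_type(e, {"actor", "director"}, {"platform"})]
print(selected)
```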
Quality of the Selected Entities:
•Entities from the same data source and of the same type share similar properties.
•All entities that have exactly the same properties are reduced to a single witness; reducing $E_c$ in this way yields the witness set $W_c$.
Quality of the selected entities:
•The average similarity over all pairs of witnesses in $W_c$.
Similarity measure for two witnesses (Jaccard similarity index):
$sim(W_i, W_j) = \frac{|P_{W_i} \cap P_{W_j}|}{|P_{W_i} \cup P_{W_j}|}$
where $P_{W_i}$ is the set of properties of $W_i$ and $P_{W_j}$ is the set of properties of $W_j$.
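The quality metric above can be sketched in a few lines. The entity data is invented, and returning 1.0 when fewer than two witnesses exist is an assumption of this sketch, since the slides do not cover that case.

```python
from itertools import combinations

def witnesses(entity_props):
    """Collapse entities with identical property sets into one witness each."""
    return {frozenset(props) for props in entity_props.values()}

def jaccard(a, b):
    """Jaccard similarity index of two property sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def quality(entity_props):
    """Average Jaccard similarity over all pairs of witnesses."""
    ws = list(witnesses(entity_props))
    if len(ws) < 2:
        return 1.0  # assumption: a single witness counts as perfectly homogeneous
    pairs = list(combinations(ws, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical selection: e1 and e2 collapse into the same witness.
SELECTION = {
    "e1": {"title", "director", "genre"},
    "e2": {"title", "director", "genre"},
    "e3": {"title", "director"},
    "e4": {"title", "author"},
}
q = quality(SELECTION)
print(len(witnesses(SELECTION)), round(q, 3))
```

The three witnesses give pairwise similarities of 2/3, 1/4 and 1/3, so the quality is their mean.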
Property Selection:
•Finding the properties that best distinguish different types is similar to the problem of inducing an optimal decision tree in data classification.
•Information gain is the standard measure for deciding the relevance of a property, but computing it requires a uniform type label for each entity.
•The type information:
• is strongly correlated with the entity properties
• is latent in the properties.
•Reducing similar witnesses to a dominant type is a dimensionality reduction problem.
Principal Component Analysis (PCA) of witnesses:
•A basis transformation that seeks to reduce the dimensionality of the data by finding a few orthogonal linear combinations (called principal components) of the original variables capturing the largest variance.
•X is an n×p matrix, where n is the number of witnesses and p is the number of properties over all witnesses.
•$X = U D V^T$
•$Y = U D$ are the principal components.
•Usually the first PCs (capturing the highest data variance) are chosen to represent the dominant dimensions.
Property Selection:
•For each principal component PC1, ..., PCt whose variance is above the threshold, perform one-dimensional clustering on its coefficients (agglomerative hierarchical clustering with average inter-cluster similarity). Each clustering yields property clusters.
•The property clusters (Property Cluster 1, ..., Property Cluster n) are grouped together into latent types, each of which has multiple properties. This step depends on the ISODATA algorithm.
•Label each entity in $E_c$ with those latent types according to the property-based model (used to compute information gain).
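To make the one-dimensional clustering step more concrete, here is a rough sketch of average-linkage agglomerative clustering over the coefficients of a single principal component. The coefficients are invented, the sketch works with distances rather than similarities, and the `max_distance` stopping cutoff is an assumption, since the slides do not state how the hierarchy is cut.

```python
def cluster_coefficients(coefficients, max_distance):
    """Average-linkage agglomerative clustering of 1-D PC coefficients.

    coefficients: dict mapping property label -> coefficient on one PC.
    max_distance: merging stops once the two closest clusters are farther
    apart than this cutoff (an assumed stopping rule).
    """
    clusters = [[name] for name in sorted(coefficients)]

    def avg_distance(a, b):
        # Mean absolute difference over all cross-cluster pairs.
        return sum(abs(coefficients[x] - coefficients[y])
                   for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: avg_distance(clusters[ij[0]], clusters[ij[1]]))
        if avg_distance(clusters[i], clusters[j]) > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical coefficients of five properties on one principal component.
PC1 = {"actor": 0.55, "director": 0.58, "genre": 0.62,
       "platform": -0.40, "engine": -0.45}
clusters = cluster_coefficients(PC1, max_distance=0.2)
print(clusters)
```

Properties with similar coefficients on the component end up in the same property cluster, which is the raw material for the latent types.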
Property Selection:
Based on the labeled entities, we can compute the information gain of a property.
The property with the maximum information gain is recommended to the user, who decides whether to add it to the type to further select the entities.
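The information gain computation can be illustrated with the standard entropy-based definition. The labeled entities below are invented, and the labels "movie" and "game" merely stand in for latent types.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of type labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(entities, prop):
    """Entropy reduction from splitting the entities on presence of `prop`.

    entities: list of (set_of_properties, latent_type_label) pairs.
    """
    labels = [label for _, label in entities]
    with_p = [label for props, label in entities if prop in props]
    without_p = [label for props, label in entities if prop not in props]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_p, without_p) if part)
    return entropy(labels) - remainder

# Hypothetical labeled entities.
LABELED = [
    ({"title", "genre"}, "movie"),
    ({"title", "genre"}, "movie"),
    ({"title", "platform"}, "game"),
    ({"title", "platform"}, "game"),
]
candidates = ["genre", "title"]
gains = {p: information_gain(LABELED, p) for p in candidates}
best = max(candidates, key=gains.get)
print(best, gains)
```

`genre` separates the two latent types perfectly and therefore has the maximum gain, while `title` occurs everywhere and contributes nothing.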
One iteration of the interactive process (example):
•Type Movie: $P_c^+$ = {actor, director}, with selected entities $E_c$.
•The automatic property selection component identifies a property: genre.
•The property genre is extended to Dictionary(genre).
•User feedback?
– Positive: updated type Movie: $P_c^+$ = {actor, director, genre}. Every entity e with hit(e, genre) = 1 will be selected.
– Negative: updated type Movie: $P_c^+$ = {actor, director}, $P_c^-$ = {genre}. Every entity e with hit(e, genre) = 0 will be selected.
Evaluation
Initial Type Definition:
•We start from different concepts presented in structured form with schemata on Schema.org.
•We build an initial type definition for each concept. Each type comprises the four properties that have been most frequently annotated in ClueWeb12 for the corresponding schema.org schema.
•The corresponding set of entities is selected from the BTC data corpus.
•The quality score threshold is 0.65.
•We simulate the user feedback by relying on information from schema.org. If the property with the highest information gain is part of the schema that describes the concept in schema.org, the user feedback is positive. Otherwise, the user feedback is negative.
Drawbacks:
•schema.org doesn't ensure the completeness of properties.
•Some feedback decisions are not suitable from a human's perspective.
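The simulated feedback rule amounts to a membership test against the schema. The sketch below assumes a small hand-written excerpt of schema properties; `SCHEMA_PROPERTIES` is hypothetical and deliberately incomplete, which also illustrates the first drawback.

```python
# Hypothetical, incomplete excerpt of properties listed for a concept's schema.
SCHEMA_PROPERTIES = {
    "Movie": {"actor", "director", "genre", "duration"},
}

def simulated_feedback(concept, prop):
    """Positive feedback iff the recommended property is part of the concept's schema."""
    return prop in SCHEMA_PROPERTIES.get(concept, set())

print(simulated_feedback("Movie", "genre"))      # property listed in the excerpt
print(simulated_feedback("Movie", "platform"))   # property not listed in the excerpt
```

A property that a human would accept but that is missing from the schema excerpt receives negative feedback under this rule.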
•Build a gold standard (for the supposed entity type) to compute precision and recall.
•The gold standard is built by bootstrapping from a set of seed entities that we know are instances of the given concept type. All concept-related types annotated by rdf:type are retrieved and the others are pruned. A subset of all initially selected entities is then chosen at random. All of these entities go through a process of manual inspection.
•As a baseline, we build a system RAND which randomly selects properties in each iteration.
•We evaluate ProSWIP on multiple concepts from various fields with different characteristics.
Quality Metric
•Pearson's linear correlation coefficient between the quality metric and precision: 0.94. This shows the quality metric's expressiveness for the quality of the data selection.
•Precision rapidly increases towards values above 90%, showing the success of the whole approach.
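For reference, Pearson's linear correlation coefficient between two series can be computed as in the sketch below. The two series are purely illustrative numbers chosen for this example; they are not measurements from the evaluation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's linear correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Illustrative values only: a quality score and a precision value per iteration.
quality_scores = [0.2, 0.4, 0.6, 0.8]
precision_values = [0.3, 0.5, 0.7, 0.9]
r = pearson(quality_scores, precision_values)
print(round(r, 3))
```

A coefficient close to 1 means the quality metric rises and falls together with precision, which is what allows it to act as a proxy when no gold standard is available.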
Implemented in Scala.
Lucene is used to index and query the data (Neo4j's performance was not good).
Hardware: Intel i5-3550 quad-core CPU at 3.3 GHz, 32 GB RAM and a hard drive with 8.5 ms access time.
•Index creation time for the complete BTC data set: 39 h.
•The resulting index and data: 1 TB.
•One simple entity search: 16 s.
•The complete process of property-based data access may take up to hours, as multiple queries and entity and property retrievals are performed.
•Computing the quality, principal components, latent types and information gain for all properties on large data samples: < 2 s.
Further work: a cache mechanism and a Lucene-based distributed index leveraging Hadoop are needed.
Conclusion
•Property-based data access represents a cornerstone in programming with the WOD.
•Our experiments show that such an approach suffers from quality problems.
•We use an entity homogeneity-based quality metric and iterative feedback from the user on chosen properties to control the quality of the selected entities.
•The quality metric is highly correlated with precision and provides transparency.
•With our approach, precision easily reaches values above 0.9.
•The sparse nature of data in the LOD cloud severely affects recall. The recall problem is to be tackled in future work.
Thank you.
Any Questions?