Extracting Schema From Data
• Schemas for semistructured data differ from traditional schemas in that
a given semistructured data instance can have more than one schema.
• Given a semistructured data instance, we want to compute a schema for it
automatically. When several schemas are possible, we want the one that best
describes the structure of that particular data. This is called Schema
Extraction.
• Schema Extraction for schema graphs
• Schema Extraction for Datalog typings
Data Guides
Our goal is to construct a new OEM graph that is a finite
description of the list of paths in the data. This graph is called a Data Guide.
It must fulfill two properties:
• Accurate: every path in the data occurs in the data guide,
and every path in the data guide occurs in the data.
• Concise: every path occurs exactly once in the data guide.
[Figure 7.13: An example of OEM data. The root &r has employee edges to the objects &p1, ..., &p8 and a company edge to &c. Edge labels on the employee objects: name, position, phone, manages, managedby, worksfor.]
We proceed as follows: The Data Guide will have a root node,
call it Root. Next we examine one by one each path in the list
and add new nodes to the data guide, as needed:
•employee
•employee.name
•employee.manages
•employee.manages.managedby
•employee.manages.managedby.manages
•employee.manages.managedby.manages.managedby
•company
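The steps above can be sketched as a subset construction, in which each data guide node stands for the set of data nodes reached by one path. This is a minimal sketch, not the slides' own code: the graph below is a small hypothetical fragment (not the full Figure 7.13), and `build_data_guide` and the dictionary layout are illustrative names.

```python
from collections import deque

# Hypothetical fragment of an OEM graph: node -> list of (label, target).
# Atomic values are left out; only complex objects appear as targets.
data = {
    "&r":  [("employee", "&p1"), ("employee", "&p2"), ("company", "&c")],
    "&p1": [("manages", "&p2"), ("worksfor", "&c")],
    "&p2": [("managedby", "&p1"), ("worksfor", "&c")],
    "&c":  [],
}

def build_data_guide(graph, root):
    """Subset construction: each data guide node is a set of data nodes."""
    start = frozenset([root])
    guide = {}                       # guide node -> {label: guide node}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current in guide:
            continue                 # already expanded
        # Group the targets of all outgoing edges by label.
        by_label = {}
        for node in current:
            for label, target in graph[node]:
                by_label.setdefault(label, set()).add(target)
        guide[current] = {lbl: frozenset(ts) for lbl, ts in by_label.items()}
        queue.extend(guide[current].values())
    return start, guide

start, guide = build_data_guide(data, "&r")
for node, edges in guide.items():
    for label, target in sorted(edges.items()):
        print(sorted(node), "-", label, "->", sorted(target))
```

Because every guide node stores its outgoing edges in a dictionary keyed by label, each label path leads to exactly one guide node, which is the conciseness property.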
[Figure: A Data Guide. Nodes and their extents: Root (&r); Employees (&p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8); Regular (&p2, &p3, &p5, &p7, &p8); Boss (&p1, &p4, &p6); Company (&c). Edge labels: employee, company, manages, managedby, worksfor, name, phone, position.]
[Figure: Schema graph. Nodes: Root, Emp, Comp. Edge labels: employee, company, manages, managedby, worksfor, name, position, phone.]
Simulation between a data graph and a data guide

Node in data graph                         Node in data guide
&r                                         Root
&p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8     Employee
&p1, &p4, &p6                              Boss
&p2, &p3, &p5, &p7, &p8                    Regular
&c                                         Company
Simulation from the data guide to the schema graph

Node in data guide     Node in schema graph
Root                   Root
Employee               Emp
Boss                   Emp
Regular                Emp
Company                Comp
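A simulation like the ones tabulated above can be computed by starting from all node pairs and discarding pairs that violate the edge-matching condition. The following is a rough sketch under assumptions: the two graphs are small hypothetical fragments with atomic values omitted, and `largest_simulation` is an illustrative name.

```python
# Hypothetical data graph and schema graph: node -> list of (label, target).
data = {
    "&r":  [("employee", "&p1"), ("employee", "&p2"), ("company", "&c")],
    "&p1": [("manages", "&p2"), ("worksfor", "&c")],
    "&p2": [("managedby", "&p1"), ("worksfor", "&c")],
    "&c":  [],
}
schema = {
    "Root": [("employee", "Emp"), ("company", "Comp")],
    "Emp":  [("manages", "Emp"), ("managedby", "Emp"), ("worksfor", "Comp")],
    "Comp": [],
}

def largest_simulation(data, schema):
    """Largest relation R such that (d, s) in R implies every edge
    d -label-> d2 is matched by some s -label-> s2 with (d2, s2) in R."""
    rel = {(d, s) for d in data for s in schema}
    changed = True
    while changed:
        changed = False
        for d, s in list(rel):
            ok = all(
                any(l2 == label and (d2, s2) in rel for l2, s2 in schema[s])
                for label, d2 in data[d]
            )
            if not ok:
                rel.discard((d, s))
                changed = True
    return rel

sim = largest_simulation(data, schema)
conforms = ("&r", "Root") in sim   # the data conforms if the roots are related
print(conforms, sorted(sim))
```

Note that a node without outgoing edges, such as &c here, is simulated by every schema node, so the relation is usually larger than the table of intended classes.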
This construction of the data guide resembles the technique
for transforming a nondeterministic finite-state automaton into a
deterministic one.
The data guide is the most specific schema graph for that
data, with the following features:
• The data guide is a deterministic schema graph.
• Any other deterministic schema graph to which our data
conforms subsumes the data guide.
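Determinism here means that no node has two outgoing edges with the same label. A small check can be sketched as follows; the two graphs are hypothetical examples and `is_deterministic` is an illustrative name.

```python
def is_deterministic(graph):
    """True if no node has two outgoing edges carrying the same label."""
    for edges in graph.values():
        labels = [label for label, _ in edges]
        if len(labels) != len(set(labels)):
            return False
    return True

# Hypothetical deterministic schema graph.
deterministic = {
    "Root": [("employee", "Emp"), ("company", "Comp")],
    "Emp":  [("manages", "Emp"), ("managedby", "Emp"), ("worksfor", "Comp")],
    "Comp": [],
}

# Hypothetical nondeterministic variant: Root has two employee edges.
nondeterministic = {
    "Root":    [("employee", "Regular"), ("employee", "Boss"), ("company", "Comp")],
    "Regular": [("managedby", "Boss"), ("worksfor", "Comp")],
    "Boss":    [("manages", "Regular"), ("worksfor", "Comp")],
    "Comp":    [],
}

print(is_deterministic(deterministic), is_deterministic(nondeterministic))
```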
[Figure: A nondeterministic schema. Nodes and their extents: Root (&r); Regular (&p2, &p3, &p5, &p7, &p8); Boss (&p1, &p4, &p6); Comp (&c). Two edges labeled employee leave Root. Other edge labels: company, manages, managedby, worksfor, name, phone.]
Extracting Datalog rules from data
We have a semistructured data instance and want to extract
automatically the most specific typing given by a set of
Datalog rules.
We create one predicate for each complex value object in the
data. We create the following predicates:
pred_r, pred_c, pred_p1, pred_p2, pred_p3, pred_p4, pred_p5,
pred_p6, pred_p7, pred_p8
corresponding to the objects &r, &c, &p1, &p2, &p3, &p4,
&p5, &p6, &p7, &p8 .
Next we write a set of Datalog rules defining each predicate
based exactly on the outgoing edges of its corresponding
object:
pred_r(X)  :- ref(X, company, Y), pred_c(Y),
              ref(X, employee, Z1), pred_p1(Z1),
              ……
              ref(X, employee, Z8), pred_p8(Z8)
pred_c(X)  :- ref(X, name, N), string(N)
pred_p1(X) :- ref(X, worksfor, Y), pred_c(Y),
              ref(X, name, N), string(N),
              ref(X, phone, P), string(P),
              ref(X, manages, Z), pred_p2(Z),
              ref(X, manages, U), pred_p3(U)
pred_p2(X) :- ref(X, worksfor, Y), pred_c(Y),
              ref(X, name, N), string(N),
              ref(X, managedby, Z), pred_p1(Z)
pred_p3(X) :- …..
……
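Writing one rule per object from its outgoing edges is mechanical, so it can be sketched in a few lines. This is an illustrative sketch only: `rule_for` is a hypothetical helper, a target of `None` stands for a string value, and the generated variable names (Y1, Y2, ...) differ from the hand-written ones in the rules above.

```python
def rule_for(obj, edges):
    """Build one Datalog rule from the outgoing edges of `obj`.

    `edges` is a list of (label, target); a target of None marks a string value.
    """
    body = []
    for i, (label, target) in enumerate(edges, start=1):
        var = f"Y{i}"
        kind = "string" if target is None else "pred_" + target.lstrip("&")
        body.append(f"ref(X, {label}, {var}), {kind}({var})")
    head = "pred_" + obj.lstrip("&") + "(X)"
    return head + " :- " + ", ".join(body)

# Outgoing edges of &p2 as they appear in the rule for pred_p2.
rule_p2 = rule_for("&p2", [("worksfor", "&c"), ("name", None), ("managedby", "&p1")])
print(rule_p2)
```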
We have to compute the largest fixpoint of the Datalog
program on the given data.
Extents of predicates after one iteration

Predicate   Object
pred_r      &r
pred_c      &c, &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8
pred_p1     &p1
pred_p2     &p2, &p3, &p5, &p7, &p8
pred_p3     &p3, &p5, &p7, &p8
pred_p4     &p1, &p4, &p6
pred_p5     &p3, &p5, &p7, &p8
pred_p6     &p1, &p4, &p6
pred_p7     &p3, &p5, &p7, &p8
pred_p8     &p3, &p5, &p7, &p8
Extents of predicates after two iterations

Predicate   Object
pred_r      &r
pred_c      &c, &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8
pred_p1     &p1
pred_p2     &p2, &p3
pred_p3     &p3
pred_p4     &p1, &p4, &p6
pred_p5     &p3, &p5, &p7, &p8
pred_p6     &p1, &p4, &p6
pred_p7     &p3, &p5, &p7, &p8
pred_p8     &p3, &p5, &p7, &p8
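The iterations tabulated above follow a simple pattern: every predicate starts with all objects in its extent, and objects are removed until nothing changes. Below is a minimal sketch of that largest-fixpoint computation. It assumes a small hypothetical data instance with five objects (not the full example), where each object's edge list doubles as the body of its rule and a target of `None` marks a string value.

```python
# Hypothetical data: object -> list of (label, target); None marks a string.
data = {
    "&r":  [("company", "&c"), ("employee", "&p1"),
            ("employee", "&p2"), ("employee", "&p3")],
    "&c":  [("name", None)],
    "&p1": [("worksfor", "&c"), ("name", None), ("phone", None),
            ("manages", "&p2"), ("manages", "&p3")],
    "&p2": [("worksfor", "&c"), ("name", None), ("managedby", "&p1")],
    "&p3": [("worksfor", "&c"), ("name", None), ("position", None),
            ("managedby", "&p1")],
}

def satisfies(x, required, data, extent):
    """Does object x have, for each required edge, a matching edge of its own?"""
    for label, target in required:
        found = any(
            l == label and ((y is None) if target is None else (y in extent[target]))
            for l, y in data[x]
        )
        if not found:
            return False
    return True

def largest_fixpoint(data):
    # extent[o] is the extent of the predicate created for object o.
    extent = {o: set(data) for o in data}      # start from "everything"
    changed = True
    while changed:
        changed = False
        for o, required in data.items():
            for x in list(extent[o]):
                if not satisfies(x, required, data, extent):
                    extent[o].discard(x)
                    changed = True
    return extent

extent = largest_fixpoint(data)
for o in data:
    print("pred_" + o.lstrip("&"), sorted(extent[o]))
```

Predicates that end up with the same extent can then be merged into a single class, which is how ten per-object predicates collapse into a handful of named ones.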
We obtain the following Datalog rules:
Root(X)     :- ref(X, company, Y), Company(Y),
               ref(X, employee, Z1), Boss1(Z1),
               ref(X, employee, Z2), Boss2(Z2),
               ref(X, employee, U1), Regular1(U1), ……..,
               ref(X, employee, U3), Regular3(U3)
Company(X)  :- ref(X, name, N), string(N)
Boss1(X)    :- ref(X, worksfor, Y), Company(Y),
               ref(X, name, N), string(N),
               ref(X, phone, P), string(P),
               ref(X, manages, Z), Regular1(Z),
               ref(X, manages, U), Regular2(U)
Boss2(X)    :- ref(X, worksfor, Y), Company(Y),
               ref(X, name, N), string(N),
               ref(X, phone, P), string(P),
               ref(X, manages, Z), Regular3(Z)
Regular1(X) :- ref(X, worksfor, Y), Company(Y),
               ref(X, name, N), string(N),
               ref(X, managedby, Z), Boss1(Z)
Regular2(X) :- ref(X, worksfor, Y), Company(Y),
               ref(X, name, N), string(N),
               ref(X, position, P), string(P),
               ref(X, managedby, Z), Boss1(Z)
Regular3(X) :- ref(X, worksfor, Y), Company(Y), ref(X, name, N),
Inferring Schemas From Queries
Some semistructured data instances are the result of queries.
[Figure: Query, Inferring, Result Schema]
The following query takes a bibliography file and constructs a home page
for every author:
where  bib -> L -> X, X -> “author” -> A,
       X -> “title” -> T, X -> “year” -> Y
create Root(), HomePage(A),
       YearEntry(A,Y), PaperEntry(X)
link   Root() -> “person” -> HomePage(A),
       HomePage(A) -> “year” -> YearEntry(A,Y),
       YearEntry(A,Y) -> “paper” -> PaperEntry(X),
       PaperEntry(X) -> “title” -> T,
       PaperEntry(X) -> “author” -> HomePage(A),
[Figure: Result of the query. Nodes: Root; HomePage(“smith”), HomePage(“Jones”); YearEntry(“smith”,1995), YearEntry(“smith”,1997); PaperEntry(o423), PaperEntry(o552), PaperEntry(o153). Edge labels: person, year, paper, author, title.]
Schema graph inferred from the query
The schema has one class for each function in the create clause
(such as HomePage or YearEntry) and one edge for each line in the
link clause.
[Figure: Schema graph with classes Root, HomePage, YearEntry, PaperEntry. Edge labels: person, year, author, paper, title, year.]
For the following example:
where
create Root( ), F(X), F(Y), G(X), H(Y)
link Root( ) -> “A” -> F(X), F(X) -> “C” -> G(X),
Root( ) -> “B” -> F(Y), F(X) -> “D” -> H(Y)
We reach the following schema:
Root: {A : F, B : F}
F : {C : G, D : H}
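The schema above can be read off the link clause mechanically: drop the arguments of each function term and keep only the function name as the class. Here is a minimal sketch, assuming the link clause has already been split into (source, label, target) triples and that both ends are function terms; `class_of` and `infer_schema` are illustrative names.

```python
def class_of(term):
    """Function term such as 'F(X)' -> class name 'F'."""
    return term.split("(")[0].strip()

def infer_schema(links):
    """One class per function name, one edge per link line."""
    schema = {}
    for source, label, target in links:
        edges = schema.setdefault(class_of(source), {})
        edges.setdefault(label, set()).add(class_of(target))
    return schema

# The four lines of the link clause in the example above.
links = [
    ("Root( )", "A", "F(X)"),
    ("F(X)",    "C", "G(X)"),
    ("Root( )", "B", "F(Y)"),
    ("F(X)",    "D", "H(Y)"),
]

schema = infer_schema(links)
for cls, edges in schema.items():
    print(cls, {label: sorted(targets) for label, targets in edges.items()})
```

F(X) and F(Y) collapse into the single class F because only the function name is kept, which is why both A and B lead to F.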
Path Constraints
In Relational Databases
• In an RDB, the relational declaration tells us more than the types.
• It imposes a key constraint, so that no two tuples have the same key.
Example
Create table Employees
( EmpId: integer, EmpName: char(30),
  DeptId: integer, …
  primary key(EmpId),
  foreign key(DeptId) references Departments )
Create table Departments
( DeptId: integer, DName: char(10),
  ……
  primary key(DeptId) )
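To see these constraints enforced, the declarations can be tried in SQLite through Python's standard `sqlite3` module. This is a sketch under assumptions: the elided columns are simply left out, the column types are adapted to SQLite syntax, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""CREATE TABLE Departments (
    DeptId INTEGER PRIMARY KEY,
    DName  CHAR(10))""")
conn.execute("""CREATE TABLE Employees (
    EmpId   INTEGER PRIMARY KEY,
    EmpName CHAR(30),
    DeptId  INTEGER REFERENCES Departments(DeptId))""")

# Hypothetical sample rows.
conn.execute("INSERT INTO Departments VALUES (1, 'IT')")
conn.execute("INSERT INTO Employees VALUES (10, 'Joe', 1)")

def violates(sql):
    """Run one statement and report whether a constraint rejected it."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Same EmpId twice: rejected by the key constraint.
duplicate_key = violates("INSERT INTO Employees VALUES (10, 'Fred', 1)")
# DeptId 99 does not exist: rejected by the foreign key constraint.
dangling_ref = violates("INSERT INTO Employees VALUES (11, 'Jack', 99)")
print(duplicate_key, dangling_ref)
```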
In Object-Oriented Databases
interface Publication
  extent publication
{ attribute String title;
  attribute Date date;
  relationship set<Author> auth        ---> inclusion constraint
    inverse Author::pub;               ---> inverse relationship
}
interface Author
  extent author
{ attribute String name;
  attribute String address;
  relationship set<Publication> pub    ---> inclusion constraint
    inverse Publication::auth;         ---> inverse relationship
}
Inclusion constraints:
• For any publication p, the set p.auth is a subset of the set
author. Similarly, for any author a, the set a.pub is a subset of
publication.
Inverse relationships:
• For any publication p, and for any author a in p.auth, p is a
member of a.pub.
• For any author a, and for any publication p in a.pub, a is a
member of p.auth.
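Both kinds of constraint are easy to test on a concrete instance. The sketch below models the two extents as Python dictionaries; the object identifiers (p1, a1, ...) and the function names are invented for illustration.

```python
# Hypothetical extents: identifier -> record holding the relationship set.
publication = {"p1": {"auth": {"a1", "a2"}}, "p2": {"auth": {"a1"}}}
author = {"a1": {"pub": {"p1", "p2"}}, "a2": {"pub": {"p1"}}}

def inclusion_holds(pubs, authors):
    """p.auth is a subset of author, and a.pub is a subset of publication."""
    return (all(p["auth"] <= set(authors) for p in pubs.values())
            and all(a["pub"] <= set(pubs) for a in authors.values()))

def inverse_holds(pubs, authors):
    """a in p.auth implies p in a.pub, and p in a.pub implies a in p.auth."""
    forward = all(a in authors and p in authors[a]["pub"]
                  for p, rec in pubs.items() for a in rec["auth"])
    backward = all(p in pubs and a in pubs[p]["auth"]
                   for a, rec in authors.items() for p in rec["pub"])
    return forward and backward

print(inclusion_holds(publication, author), inverse_holds(publication, author))

# Dropping p1 from a2.pub breaks the inverse relationship but not inclusion.
broken_author = {"a1": {"pub": {"p1", "p2"}}, "a2": {"pub": set()}}
print(inclusion_holds(publication, broken_author),
      inverse_holds(publication, broken_author))
```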
[Figure: Illustration of path constraints on semistructured data. The root r has publication and author edges. Edge labels on publications: title, date, auth. Edge labels on authors: name, address, pub.]
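In semistructured data
• The inclusion constraint is expressed as follows:
$\forall p\,(\exists a\,(\mathit{author}(r,a) \land \mathit{pub}(a,p)) \rightarrow \mathit{publication}(r,p))$
The general form of an inclusion constraint is
$\forall x\,(\alpha(r,x) \rightarrow \beta(r,x))$
where $\alpha$ and $\beta$ are paths.
• The inverse relationship is
$\forall p\,(\mathit{publication}(r,p) \rightarrow \forall a\,(\mathit{auth}(p,a) \rightarrow \mathit{pub}(a,p)))$
The general form of this constraint is
$\forall x\,(\alpha(r,x) \rightarrow \forall y\,(\beta(x,y) \rightarrow \gamma(y,x)))$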
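The two general forms can be checked on an edge-labeled graph by evaluating each path as a reachability query. The sketch below reads them as "everything reached from r by path alpha is also reached by path beta" and "if x is reached from r by alpha and y from x by beta, then x is reached from y by gamma". The names alpha, beta and gamma for the paths, the helper names, and the tiny graph are all assumptions made for illustration.

```python
def reach(edges, start, path):
    """Nodes reached from `start` by following the labels in `path` in order."""
    nodes = {start}
    for label in path:
        nodes = {t for (s, l, t) in edges if s in nodes and l == label}
    return nodes

def inclusion(edges, r, alpha, beta):
    # for all x: alpha(r, x) implies beta(r, x)
    return reach(edges, r, alpha) <= reach(edges, r, beta)

def inverse(edges, r, alpha, beta, gamma):
    # for all x: alpha(r, x) implies (for all y: beta(x, y) implies gamma(y, x))
    return all(
        x in reach(edges, y, gamma)
        for x in reach(edges, r, alpha)
        for y in reach(edges, x, beta)
    )

# Hypothetical graph as a set of (source, label, target) triples.
edges = {
    ("r", "publication", "p1"),
    ("r", "author", "a1"),
    ("p1", "auth", "a1"),
    ("a1", "pub", "p1"),
}

incl_ok = inclusion(edges, "r", ["author", "pub"], ["publication"])
inv_ok = inverse(edges, "r", ["publication"], ["auth"], ["pub"])

# An author pointing to a publication that r does not list breaks inclusion.
incl_broken = inclusion(edges | {("a1", "pub", "p9")},
                        "r", ["author", "pub"], ["publication"])
print(incl_ok, inv_ok, incl_broken)
```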
Constraints are also important in query optimization. Here
is an example:
Select row: P2
from   r.publication P1,
       r.publication P2,
       P1.auth A
where  “Database Systems” in P1.title and A in P2.auth

Select row: P’
from   r.publication P,
       P.auth A,
       A.pub P’
where  “Database Systems” in P.title

The query plan implicit in the first query requires two iterations over
publication (with P1 and P2), whereas the second requires only one
(with P). Under the inclusion and inverse constraints above, the two
queries return the same result.
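The difference between the two plans can be made concrete by running both as nested loops and counting how many publication bindings each one visits. The data, the second title and the function names below are hypothetical; the instance is chosen so that the inclusion and inverse constraints hold.

```python
# Hypothetical instance that satisfies the inclusion and inverse constraints.
pubs = {
    "p1": {"title": "Database Systems", "auth": {"a1", "a2"}},
    "p2": {"title": "Query Languages",  "auth": {"a1"}},
}
authors = {"a1": {"pub": {"p1", "p2"}}, "a2": {"pub": {"p1"}}}

def first_query(pubs):
    """Two iterations over publication: P1 outside, P2 inside."""
    result, visited = set(), 0
    for p1 in pubs.values():                  # r.publication P1
        visited += 1
        if "Database Systems" not in p1["title"]:
            continue
        for a in p1["auth"]:                  # P1.auth A
            for k2, p2 in pubs.items():       # r.publication P2
                visited += 1
                if a in p2["auth"]:
                    result.add(k2)
    return result, visited

def second_query(pubs, authors):
    """One iteration over publication, then follow A.pub directly."""
    result, visited = set(), 0
    for p in pubs.values():                   # r.publication P
        visited += 1
        if "Database Systems" not in p["title"]:
            continue
        for a in p["auth"]:                   # P.auth A
            result |= authors[a]["pub"]       # A.pub P'
    return result, visited

r1, v1 = first_query(pubs)
r2, v2 = second_query(pubs, authors)
print(sorted(r1), v1)
print(sorted(r2), v2)
```

On an instance where the constraints fail, for example an author whose pub set misses one of their publications, the two functions can return different results, so the rewrite is only safe when the constraints are known to hold.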