Web Data and the Resurrection of Database Theory
Dan Suciu
University of Washington
“In theory there is no difference between
theory and practice. In practice there is.”
Jan L.A. van de Snepscheut
September 12, 1953 - February 23, 1994
Short History of Database Theory
The legendary beginnings, 1970-1971:
• Relational databases are the brainchild of a
theoretician (Codd)
• Heavily debated at the time (against CODASYL)
• It took several years for the concept to be
validated in practice
Theory driving the industry
Short History of Database Theory
The golden years (end of 70s, early 80s)
• Relational theory
– Functional dependencies
– Query containment
• Transactions
• Access methods
Theory listening to the industry
Short History of Database Theory
Refined decadence (end of 80s, early 90s)
• Descriptive complexity
• Logic databases
• Complex objects
• Constraint databases
Divorce ?
“Database Metatheory:
Asking the Big Queries”
Christos Papadimitriou, in PODS, 1995
• Theory is inevitable:
CS is a science of the artificial, and
its artifact is being changed
by the very act of studying it
• Kuhn’s paradigm principle, for natural sciences:
Immature science → Normal science → Crisis → Revolution
Is DB Theory in a Crisis Today ?
• Industry’s focus:
– one particular data model: relational/SQL
– one particular application (client-server)
• Theory’s focus is on Logic:
– New data models, query languages (query
containment, complex objects, recursion)
– New applications (incomplete information,
query rewriting using views)
One Example of Unused Theory
Containment of conjunctive queries is NP-complete
[Chandra and Merlin’77]
Dozens of extensions:
• With union and difference [Sagiv and Yannakakis’81]
• With order predicates [Klug’88, van der Meyden’92]
• With complex objects [Levy and Suciu’97]
• With regular expressions [Florescu, Levy and Suciu’98]
Query Containment
The query:
Q1 = SELECT DISTINCT x.name, x.phone
     FROM Person x, Person y, Person z
     WHERE x.department = y.department AND
           x.manager = z.manager
Is minimized to:
Q2 = SELECT DISTINCT x.name, x.phone
     FROM Person x
The following can be checked: Q1 ⊆ Q2 and Q2 ⊆ Q1
…hence Q1 = Q2
Minimization is not used by RDBMSs today
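The containment checks on this slide can be sketched with the classical homomorphism test for conjunctive queries: Q1 ⊆ Q2 holds exactly when Q2 can be mapped into Q1 by a variable mapping that preserves the head and sends every atom of Q2 to an atom of Q1. The code below is a minimal illustration, not anything an RDBMS ships. It assumes set semantics (SELECT DISTINCT), no NULLs, no constants, and a hypothetical column order Person(name, phone, department, manager). The query Q3 is an extra example that is not on the slide.

```python
def homomorphism_exists(src, dst):
    """True if some variable mapping sends src's head and atoms into dst."""
    src_head, src_body = src
    dst_head, dst_body = dst
    if len(src_head) != len(dst_head):
        return False
    # Head variables must map position by position.
    start = {}
    for s, d in zip(src_head, dst_head):
        if start.setdefault(s, d) != d:
            return False

    def extend(i, mapping):
        # Try to map the i-th atom of src onto some atom of dst.
        if i == len(src_body):
            return True
        rel, args = src_body[i]
        for rel2, args2 in dst_body:
            if rel2 != rel or len(args2) != len(args):
                continue
            trial = dict(mapping)
            if all(trial.setdefault(a, b) == b for a, b in zip(args, args2)):
                if extend(i + 1, trial):
                    return True
        return False

    return extend(0, start)


def contained_in(q1, q2):
    """Q1 is contained in Q2 iff there is a homomorphism from Q2 to Q1."""
    return homomorphism_exists(q2, q1)


# Assumed schema: Person(name, phone, department, manager).
Q1 = (("n", "p"), [("Person", ("n", "p", "d", "m")),      # x
                   ("Person", ("n2", "p2", "d", "m2")),   # y: same department
                   ("Person", ("n3", "p3", "d3", "m"))])  # z: same manager
Q2 = (("n", "p"), [("Person", ("n", "p", "d", "m"))])
# Q3 (not from the slide): x must also be the manager of someone in x's department.
Q3 = (("n", "p"), [("Person", ("n", "p", "d", "m")),
                   ("Person", ("n2", "p2", "d", "n"))])

print(contained_in(Q1, Q2), contained_in(Q2, Q1))  # both directions hold
print(contained_in(Q2, Q3))                        # Q3 is strictly more selective
```

The backtracking search is exponential in the worst case, which is consistent with the NP-completeness result cited earlier.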
Why Today Things Are Changing
Just one reason: The Web
More precisely:
• A new data model
– Semistructured data
– XML syntax
• New applications
– Transformation
– Integration
Web Data Management
• Who creates the new rules
– W3C working groups
– Sometimes the industry
The new artifacts are not concepts, but standards
• The double role of theory
– Long term: conceptualize/rationalize
• E.g. keys for XML [Buneman, Davidson, Fan, Hara, Tan’01]
– Short term: answer technical questions
Some Questions for Database Theory
• XML publishing
• Typechecking XML transformations
• XML storage
• Data distribution
[Figure: architecture diagram with boxes for applications, an object-relational XML store, warehouses, relational legacy data and the Web (HTTP), annotated with the four topics: XML storage, typechecking (integrate, transform), XML publishing, data distribution]
XML Publishing
Today:
• Legacy data
– fragmented into many flat relations
– 3rd normal form
– proprietary
• XML data
– nested
– un-normalized
– public (450 schemas at www.biztalk.org)
XML Publishing: an Example
Legacy data in E/R:
[E/R diagram: entities Eu-Stores (euSid, name, country), US-Stores (usSid, name, url) and Products (pid, name, priceUSD); relationships Eu-Sales (date) and US-Sales (date, tax) link stores to products]
XML Publishing: an Example
• XML view
<allsales>
  <country> <name> France </name>
    <store> <name> Nicolas </name>
      <product> <name> Blanc de Blanc </name>
        <sold> 10/10/2000 </sold>
        <sold> 12/10/2000 </sold>
        …
      </product>
      <product>…</product>…
    </store>…
  </country> …
</allsales>
• In summary: group by country, then store, then product
Output “schema” (tree; * = zero or more, ? = optional):
allsales
  country *
    name: PCDATA
    store *
      name: PCDATA
      url ?: PCDATA
      product *
        name: PCDATA
        sold *
          date: PCDATA
          tax ?: PCDATA
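The grouping behind this view (country, then store, then product) can be sketched in a few lines. This is only an illustration of the nesting step, not how SilkRoute works: it assumes the legacy relations have already been joined into flat rows, and it groups on names, where a real system would use keys such as euSid and pid. The two rows are taken from the sample XML view above.

```python
import xml.etree.ElementTree as ET

# Flat rows (country, store, product, sale date), as a join of the
# legacy relations might return them.
rows = [
    ("France", "Nicolas", "Blanc de Blanc", "10/10/2000"),
    ("France", "Nicolas", "Blanc de Blanc", "12/10/2000"),
]


def publish(rows):
    """Nest flat rows into allsales/country/store/product/sold."""
    root = ET.Element("allsales")
    index = {}  # grouping key -> element already created for that group

    def group(parent, tag, key, name):
        # Create the element the first time its key is seen, reuse it afterwards.
        if key not in index:
            element = ET.SubElement(parent, tag)
            ET.SubElement(element, "name").text = name
            index[key] = element
        return index[key]

    for country, store, product, date in rows:
        c = group(root, "country", (country,), country)
        s = group(c, "store", (country, store), store)
        p = group(s, "product", (country, store, product), product)
        ET.SubElement(p, "sold").text = date
    return root


xml_text = ET.tostring(publish(rows), encoding="unicode")
print(xml_text)
```

Both sales end up under a single product element, which is the un-normalized, nested shape the XML side expects.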
XML Publishing
In SilkRoute [Fernandez, Suciu, Tan ’00]
{ FROM EuStores $S, EuSales $L, Products $P
  WHERE $S.euSid = $L.euSid AND $L.pid = $P.pid
  CONSTRUCT
    <allsales()>
      <country($S.country)>
        <name> $S.country </name>
        <store($S.euSid)>
          <name> $S.name </name>
          <product($P.pid)>
            <name> $P.name </name>
            <price> $P.priceUSD </price>
          </product>
        </store>
      </country>
    </allsales>
}
/* union */
{ FROM USStores $S, USSales $L, Products $P
  WHERE $S.usSid = $L.usSid AND $L.pid = $P.pid
  CONSTRUCT
    <allsales()>
      <country("USA")>
        <name> USA </name>
        <store($S.usSid)>
          <name> $S.name </name>
          <url> $S.url </url>
          <product($P.pid)>
            <name> $P.name </name>
            <price> $P.priceUSD </price>
            <tax> $L.tax </tax>
          </product>
        </store>
      </country>
    </allsales>
}
Internal Representation
View Tree (* = zero or more, ? = optional):
allsales()
  country(c) *
    name(c)
    store(c,x) *
      name(n)
      url(c,x,u) ?
      product(c,x,y) *
        name(n)
        sold(c,x,y,d) *
          date(c,x,y,d)
          Tax(c,x,y,d,t)
Non-recursive datalog (SELECT DISTINCT … ), one or more rules per node:
allsales() :-
country(c) :- EuStores(x,_,c), EuSales(x,y,_), Products(y,_,_)
country("USA") :-
store(c,x) :- EuStores(x,_,c), EuSales(x,y,_), Products(y,_,_)
store(c,x) :- USStores(x,_,_), USSales(x,y,_), Products(y,_,_), c="USA"
url(c,x,u) :- USStores(x,_,u), USSales(x,y,_), Products(y,_,_)
Large query (x100 lines), large XML answer (x100 MB)
Users Ask Specific XML Queries
• find names, urls of all stores that sold on
1/1/2000 (in XML-QL / XQuery melange):
WHERE <allsales/country/store>
        <product/sold/date> 1/1/2000 </>
        <name> $X </>
        <url> $Y </>
      </>
RETURN $X , $Y
Small query, small answer
Query Composition
[Figure: the view tree (left) next to the XML-QL query pattern (right). The pattern follows allsales/country/store, with children name ($X), url ($Y) and product/sold/date = 1/1/2000; pattern nodes are labelled $n1 to $n5, $X, $Y, $Z]
“Evaluate” the XML pattern(s) on the view tree, combine all datalog rules
Query Composition
Result (in theory…):
( SELECT S.name, S.url
  FROM USStores S, USSales L, Products P
  WHERE S.usSid = L.usSid AND L.pid = P.pid AND L.date = '1/1/2000' )
UNION
( SELECT S2.name, S2.url
  FROM EuStores S1, EuSales L1, Products P1,
       USStores S2, USSales L2, Products P2
  WHERE S1.euSid = L1.euSid AND L1.pid = P1.pid AND L1.date = '1/1/2000'
    AND S2.usSid = L2.usSid AND L2.pid = P2.pid
    AND S1.country = 'USA' AND S1.euSid = S2.usSid )
Complexity of XML Publishing
• But in practice: 5-7 times more joins !
– Need query minimization
• Could this be avoided ?
– We thought hard and couldn’t find a better way
– Asked students to re-implement: same problem
– It is NP-hard !
XML Publishing Is NP-Hard
View Tree:
customer
  order ? (PCDATA)        order() :- Q1
  complaint ? (PCDATA)    complaint() :- Q2
XML query:
WHERE <customer> <order> $x </>
                 <complaint> $y </>
      </>
RETURN ( )
The composed SQL query is: Q1 JOIN Q2
Minimizing it is NP-hard! (can be shown…)
Recent Advancements in Query Containment
Definition: FO^k = First Order Logic with k variables
Fact: If Q2 ∈ FO^k and k “is small”, then Q1 ⊆ Q2
can be checked efficiently
[Kolaitis, Vardi’98], [Vardi’00], [Chekuri, Rajaraman’97]
XML Publishing: Finale
Prediction: techniques based on FO^k and/or
query width will be deployed in XML
publishing in the future
(perhaps under different names)
XML Typechecking
Purpose: ensure that the generated XML conforms to the
desired DTD (or XML Schema)
Two kinds:
• Dynamic typechecking
– Easy: lots of XML validating parsers available
• Static typechecking
– Hard: need complex analysis of the XML generation program
XML Typechecking
XML generation programs:
• Publishing: RDBMS → XML
(e.g. SilkRoute)
• Transformation: XML → XML
(e.g. XSL, XQuery)
• Integration: XML + XML → XML
This talk: XML → XML
The XML Typechecking Problem
Given an XML → XML transformation f:
Type Checking Problem
Given DTDs t1, t2, check ∀D ∈ t1, f(D) ∈ t2
sometimes t1 = any: check ∀D, f(D) ∈ t2
Today’s Systems Try to Do Type Inference
Type Inference Problem
Given DTD t1, find the DTD f(t1) = {f(D) | D ∈ t1}
Today’s systems:
• “Compute” f(t1)
• Check f(t1) ⊆ t2 (which is possible)
sometimes t1 = any:
• compute f(any)
• check f(any) ⊆ t2
Theory’s Role: Send a Warning
This approach fails in general !
But it may work OK in most “practical” cases...
Why XML Type Inference Fails
XQuery f =
RETURN
<a> (FROM Employee $x RETURN <b/>),
    (FROM Employee $x RETURN <c/>),
    (FROM Employee $x RETURN <d/>)
</a>
• “Inferred” (wrong) DTD f(any):
  <!ELEMENT a (b*,c*,d*)>
• “Real” output “DTD”:
  <!ELEMENT a ({b^n,c^n,d^n | n ≥ 0})>
• Fails to typecheck f(any) ⊆ t2 when t2 =
  <!ELEMENT a ((b,b)*,(c,c)*,(d,d)* | (b,b)*,b,(c,c)*,c,(d,d)*,d)>
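The failure can be reproduced by treating the children of <a> as a string over {b, c, d} and the two content models as regular expressions. This is a sketch under that encoding only, not a DTD validator. Every real output b^n c^n d^n satisfies t2, but the inferred type b*,c*,d* also admits strings such as "bbc" that t2 rejects, so the inclusion test f(any) ⊆ t2 reports an error for a correct program.

```python
import re

# Content models of <a>, with each child element written as one letter.
inferred = re.compile(r"b*c*d*")
t2 = re.compile(r"(?:bb)*(?:cc)*(?:dd)*|(?:bb)*b(?:cc)*c(?:dd)*d")


def real_output(n):
    """Children of <a> produced by f when Employee has n tuples."""
    return "b" * n + "c" * n + "d" * n


# Every actual output conforms to t2 (checked here for small n only).
all_real_ok = all(t2.fullmatch(real_output(n)) for n in range(50))

# A string admitted by the inferred type that t2 rejects.
counterexample = "bbc"
in_inferred = inferred.fullmatch(counterexample) is not None
in_t2 = t2.fullmatch(counterexample) is not None

print(all_real_ok, in_inferred, in_t2)
```

The gap exists because {b^n c^n d^n} is not a regular language, so no DTD can describe the output exactly and any inferred DTD has to over-approximate it.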
The Typechecking Problem in Theory and Practice
• In practice, we care about typechecking
• Question for theory: is this possible ?
• Positive result [Milo, Suciu, Vianu, 2000]:
– Decidable for k-pebble tree transducers
– Hence: decidable for:
• Join-free XQuery
• Simple XSLT programs
• Negative result [Alon, Milo, Neven, Suciu, Vianu 2001]:
– Undecidable for transformations with value joins
The Typechecking: Finale
Prediction: systems will continue to use type
inference, but will never be as robust as
type checking in programming languages
Need to understand well their applicability
XML Storage
Problem:
• Given: a (large) XML data instance
• Goal: store/process it in a RDBMS
• Problem: find the relational schema !
• Current approaches:
– Generic schema [Florescu, Kossmann 99]
– Derive schema from DTD [Shanmugasundaram et al 99]
– Derive schema from XML data [Deutsch, Fernandez, Suciu 99]
The Theory of XML Storage
• The simplest case: flat, unique subelements
M =
Oid         E1   E2   E3   E4   …    E5000
&1          1    0    0    1    …    0
&2          0    1    1    0    …    0
&3          0    1    0    1    …    0
&4          0    1    1    1    …    0
&5          1    0    1    0    …    0
&6          1    1    0    0    …    0
…           …    …
&10000000   0    1    0    0    …    0
• How do we cover all 1’s most economically ?
– R1(E2, E3, E4), R2(E1, E5, E9, E12), …
The Theory of XML Storage
• XML storage and matrix rank
M =
Oid         E1   E2   E3   E4   …    E5000
&1          1    0    0    1    …    0
&2          0    1    1    1    …    0
&3          0    1    1    1    …    0
&4          0    1    1    1    …    0
&5          1    1    0    0    …    0
&6          1    1    0    0    …    0
&7          0    0    0    1    …
…           …    …    …    …
&10000000   1    0    0    1    …    0
• Can store XML data in k relations ⇒ rank(M) = k
• Conversely: if rank(M) = k ⇒ what about storage ?
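As a point of comparison, here is a naive baseline that is not the technique from the talk: create one relation per distinct set of subelements that actually occurs, and store each object in the relation matching its row exactly. The rows are the first seven shown in the matrix above, restricted to columns E1 to E4. The function name naive_cover is made up for this sketch.

```python
from collections import defaultdict

columns = ["E1", "E2", "E3", "E4"]
rows = {
    "&1": (1, 0, 0, 1),
    "&2": (0, 1, 1, 1),
    "&3": (0, 1, 1, 1),
    "&4": (0, 1, 1, 1),
    "&5": (1, 1, 0, 0),
    "&6": (1, 1, 0, 0),
    "&7": (0, 0, 0, 1),
}


def naive_cover(rows, columns):
    """Group object ids by the exact set of subelements they contain."""
    relations = defaultdict(list)
    for oid, bits in rows.items():
        schema = tuple(c for c, bit in zip(columns, bits) if bit)
        relations[schema].append(oid)
    return dict(relations)


cover = naive_cover(rows, columns)
for schema, oids in cover.items():
    print("R(" + ", ".join(schema) + "):", oids)
```

On these rows the baseline yields four relations, including R(E2, E3, E4), and no NULLs. With 5000 element types and irregular data the number of distinct patterns can grow very large, which is why a more economical cover is the interesting question.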
XML Storage: Finale
Prediction: we will see several clever XML
storage techniques discovered in the near
future
The Data Distribution
• Many data consumers, many places to cache
• Data can be replicated, transformed
– How to transform it ? The view selection problem
– Where to place it ? The data distribution problem.
NP-complete
Prediction: no predictions here (too early…)
Conclusions:
Resurrection of Database Theory
• Is theory irrelevant ?
– [Papadimitriou, 95]: wrong question to ask
• Respect for practice: only a recent development in human culture
• Applicability pressure in CS: annoying trend of last 10 years or so
• Database theory: are we in a revolution ?
– The past: researchers created artifacts for the industry
– Today: society (Web, W3C) is creating artifacts for
researchers to study, improve
Prediction: there will be no difference between
theory and practice…
at least, in theory !