Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Entity–attribute–value model wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Concurrency control wikipedia , lookup
Relational algebra wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
ContactPoint wikipedia , lookup
Clusterpoint wikipedia , lookup
Versant Object Database wikipedia , lookup
Oblivious Querying of Data with Irregular Structure 1 Based on Several Works • Queries with Incomplete Answers – Yaron Kanza, Werner Nutt, Shuky Sagiv • Flexible Queries – Yaron Kanza, Shuky Sagiv • SQL4X – Sara Cohen, Yaron Kanza, Shuky Sagiv • Computing Full Disjunctions – Yaron Kanza, Shuky Sagiv 2 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration 3 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration 4 The Semistructured Data Model • Data is described as a rooted labeled directed graph • Nodes represent objects • Edges represent relationships between objects • Atomic values are attached to atomic nodes 5 Movie Database 1 Movie Movie 11 Actor Actor 21 Name 30 Mark Hamill 22 Name Title 12 Star 1977 Wars Harrison Ford 25 Title 27 Léon 33 Magnolia A Movie Database Example 6 28 Kyle Title MacLachlan Name Natalie Portman 14 Movie Name T.V. Series 26 32 Actor 13 Actor Title Year 23 24 31 Film 29 Title Year 34 35 36 Twin Peaks Dune 1984 <?xml version=“1.0”?> <MDB> <Movie> <Title>Star Wars</Title> <Year>1977</Year> <Actor> <Name>Mark Hamill</Name> </Actor> <Actor> <Name>Harrison Ford</Name> </Actor> </Movie> … </MDB> XML that Encodes the Semistructured Data 7 What Should be the form of the Query? Movie Database 1 Movie Movie 11 Actor Actor 21 Name 30 Mark Hamill 8 22 Name Title 12 Star 1977 Wars Harrison Ford 25 Title 27 Léon 33 Magnolia 29 28 Kyle Title MacLachlan Name Natalie Portman 14 Movie Name T.V. Series 26 32 Actor 13 Actor Title Year 23 24 31 Film Title Year 34 35 36 Twin Peaks Dune 1984 Consider a Query that Requests Movies, Actors that Acted in the Movies and the Movies’ Year of Release The movie has a year attribute Movie Database The year of the movie is missing 1 Movie Movie 11 Actor Actor 21 Name 30 Mark Hamill 22 Name Title 12 Star 1977 Wars Harrison Ford 25 Title 27 Léon 33 Magnolia Incomplete Data 9 28 Kyle Title MacLachlan Name Natalie Portman 14 Movie Name T.V. Series 26 32 Actor 13 Actor Title Year 23 24 31 Film 29 Title Year 34 35 36 Twin Peaks Dune 1984 Movie Database Actor below movie Movie Movie 11 Actor Actor 21 Name 30 Mark Hamill 22 Name Title Harrison Ford 25 Title 27 Léon 33 Magnolia Variations in Structure 10 28 Kyle Title MacLachlan Name Natalie Portman 14 Movie Name T.V. Series 26 32 Actor 13 Actor Title Year Star 1977 Wars 31 Film 12 23 24 Movie below actor 1 29 Title Year 34 35 36 Twin Peaks Dune 1984 Movie Database 1 A movie label 11 Actor Actor 21 Name 30 Mark Hamill 22 Name Title 25 Movie Name T.V. Series Title 26 27 Léon Natalie Portman 28 Kyle Title MacLachlan Name 32 A film label 13 13 Actor Title Year Star 1977 Wars Harrison Ford Film 12 23 24 31 Actor Movie Movie 34 Magnolia Title Year 33 34 35 Twin Peaks Dune 1984 Dealing with ontology variations is Ontology Variations beyond the scope of this talk 11 29 Irregular Data • Data is incomplete – Missing values of attributes in objects • Data has structural variations – Relationships between objects are represented differently in different parts of the database • Data has ontology variations – Different labels are used to describe objects of the same type 12 Irregular data does not conform to a strict schema Queries over irregular data should not be rigid patterns The schema cannot guide a user in formulating a query 13 Data is contributed by many users in a variety of designs The query should deal with different structures of data The structure of the database is changed frequently Queries should be rewritten frequently The description of the schema is large (e.g., a DTD of XML) It is difficult to use the schema when formulating queries In Which Cases is it Difficult to Formulate 14 Queries over Semistructured Data? Can Regular Expressions Help in Querying Irregular Data? • In many cases, regular expressions can be used to query irregular data • Yet, regular expressions are – Not efficient – it is difficult to evaluate regular expressions – Not intuitive – it is difficult for a naïve user to formulate regular expressions 15 More on Using Regular Expressions • When querying irregular data, the size of the regular expression could be exponential in the number of labels in the database – For n types of objects, there are n! possible hierarchies – For an object with n attributes, there are 2n subsets of missing attributes 16 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration 17 Queries with Incomplete Answers • We have developed queries that deal with incomplete data in a novel way and return incomplete answers • The queries return maximal answers rather than complete answers • Different query semantics admit different levels of incompleteness 18 Queries with complete answers Queries with AND Semantics Queries with Weak Semantics Queries with OR Semantics Queries with Incomplete Answers 19 Increasing level of incompleteness Queries and Matchings • The queries are labeled rooted directed graphs • Query nodes are variables • Matchings are assignments of database objects to the query variables according to – the constraints specified in the query, and – the semantics of the query 20 Constraints On Complete Matchings • Root Constraint: • Satisfied if the query root is mapped to the db root Query Root r 1 Database Root • Edge Constraint: • Satisfied if a query edge with label l is mapped to a database edge with label l x l l y 21 12 25 Movie Database A Complete Matching 1 r Movie Movie Producer Movie 11 Uncredited Director Actor Actor Title Actor Year 21 22 23 24 12 Actor 27 25 x Director Title 26 29 Star 1977 Name Name Hook Name Name Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 22 Movie Producer y Director Uncredited Actor z Date of birth Name u The root constraint and All the nodes are mapped all the edge constraints to non-null values are satisfied v Movie Database 1 r Movie Movie Producer Movie 11 Uncredited Director Actor Actor Title Actor Year 21 22 23 24 12 Actor 27 25 x Director Title 26 29 Star 1977 Name Name Hook Name Name Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 Consider the case where Node 35 is removed from the database 23 Movie Producer y Director Uncredited Actor z Date of birth Name u v No Complete Matching Exists! Movie Database 1 r Movie Movie Producer Movie 11 Uncredited Director Actor Actor Title Actor Year 21 22 23 24 Star 1977 Name Name Wars 30 31 Mark Harrison Hamill Ford 24 Actor 27 25 NULL 12 Director Title 26 29 Name Name Name Hook 32 33 George Dustin Lucas Hoffman 34 Steven Spielberg x Movie NULL Producer y Director Uncredited Actor z NULL Date of birth Name u v NULL This is not a matching, since the sequence of labels Every from the Not database rootPartial to NodeAssignment 31 is different from is an Incomplete Matching any sequence of labels that starts at the query root and ends in variable v The Reachability Constraint on Partial Matchings • A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path Database are satisfied 11 r l 2 l1 x w l4 l3 y 25 Query z l5 l6 vv 5 l1 l2 7 l4 l3 8 9 l5 l6 55 “And” Matchings • A partial matching is an AND matching if – The root constraint is satisfied – The reachability constraint is satisfied by every query node that is mapped to a database node – If a query node is mapped to a database node, all the incoming edge constraints are satisfied r x Producer Actor Director 26 y z Movie Database 1 r Movie Movie Producer Movie 11 Uncredited Director Actor Actor Title Actor Year 21 22 23 24 Star 1977 Name Name Wars 30 31 Mark Harrison Hamill Ford 12 Actor 27 25 x Director Title 29 26 Name Name Name Hook 32 34 33 George Dustin Lucas Hoffman Steven Spielberg An AND Matching 27 Movie Producer y Director Uncredited Actor z Date of birth Name u NULL v Movie Database 1 r Movie Movie Producer Movie 11 Uncredited Director Actor Actor Title Actor Year 21 22 23 24 Star 1977 Name Name Wars 30 31 Mark Harrison Hamill Ford 12 Actor 27 25 x Director Title 26 29 Name Name Name Hook 32 33 George Dustin Lucas Hoffman Suppose that we remove the edges that are labeled with Uncredited Actor 28 34 Movie Producer y Director Uncredited Actor z Date of birth Name u v Steven Spielberg In an AND matching, Node z must be null! Weak Satisfaction of Edge Constraints • Edge Constraint: • Is Weakly Satisfied if it is either • Satisfied (as defined earlier), or • One (or more) of its nodes is mapped to a null value null x 12 l l y 25 x 12 l m y 25 null 29 x l null 12 m y 25 x l y null Weak Matchings • A partial matching is a weak matching if – The root constraint is satisfied – The reachability constraint is satisfied by every query node that is mapped to a database node – Every edge constraint is weakly satisfied 30 Movie Database 1 r Movie Movie Producer Movie 11 12 Actor Title Director Actor Year 21 22 23 24 Star 1977 Name Name Wars 30 27 Mark Harrison Hamill Ford 25 Director Title 26 33 George Dustin Lucas Hoffman A Weak Matching 31 29 Name Name Name Hook 32 31 Actor x 34 Steven Spielberg Movie Producer y Director Uncredited Actor z NULL Date of birth Name u v NULL Edges that are weakly satisfied In a weak matching, all four options are permitted In an AND matching, only the first three options are permitted null x 12 l l y 25 x 12 l m y 25 null 32 null x x l l y m y null 12 25 Movie Database 1 r Movie Movie Producer Movie 11 12 Actor Title Director Actor Year 21 22 23 24 Star 1977 Name Name Wars 30 31 Mark Harrison Hamill Ford 27 Actor 25 Director Title 26 29 Name Name Name Hook 32 33 George Dustin Lucas Hoffman Consider the case where edges labeled with Producer are removed 33 x 34 Movie Producer y Director Uncredited Actor z Date of birth Name u v Steven Spielberg In a weak matching, Node z must be null! “OR” Matchings • A partial matching is an OR matching if – The root constraint is satisfied – The reachability constraint is satisfied by every query node that is mapped to a database node 34 Movie Database 1 r Movie Movie Movie 11 12 Actor Title Director Actor Year 21 22 23 24 Star 1977 Name Name Wars 30 31 Mark Harrison Hamill Ford 27 Actor 25 x Director Title 29 26 Name Name Name Hook 32 34 33 George Dustin Lucas Hoffman Steven Spielberg An OR Matching 35 Movie Producer y Director Uncredited Actor z NULL Date of birth Name u v NULL An edge which is not weakly satisfied Increasing Level of Incompleteness • A complete matching is an AND matching • An AND matching is a weak matching • A weak matching is an OR matching 36 Maximal Matchings • A tuple t1 subsumes a tuple t2 if t1 is the result of replacing some null values in t2 by non-null values: t1=(1, 5, 2, null) t2=(1, null, 2, null) Matchings are represented as tuples of oid’s and null values • A matching is maximal if no other matching subsumes it • A query result consists of maximal matchings only 37 On the Complexity of Computing Queries with Incomplete Answers • The size of the result can be exponential in the size of the input (database and query) – Note that the same is true when joining relations – the size of the result can be exponential in the size of the input (database and query) • Instead of using data complexity (where the runtime depends only on the size of the database), we use input-output complexity 38 Input-Output Complexity In input-output complexity, the time complexity is a function of the size of the query, the size of the database, and the size of the result. 39 The Motivation for Using I/O Complexity • Measuring the time complexity with respect to the size of the input does not separate between the following two cases: – An algorithm that does an exponential amount of work simply because the size of the output is exponential in the size of the input – An algorithm that does an exponential amount of work even when the query result is small • Either the algorithm is naïve (e.g., it unnecessarily computes subsumed matchings) or the problem is hard 40 I/O Complexity of Query Evaluation (lower bounds are for non-emptiness) Query / Semantics Complete 41 Path Query PTIME Tree Query DAG Query Cyclic Query PTIME NPComplete NPComplete AND PTIME PTIME PTIME NPComplete Weak PTIME PTIME PTIME PTIME OR PTIME PTIME PTIME PTIME Recent Results (PODS’03) Filter Constraints • Constraints that filter the results (i.e., the maximal matchings) • There are – Weak filter constraints (the constraint is satisfied if a variable in the constraint is null) – Strong filter constraints (all variables must be non-null for satisfaction) • Existence constraint: !x is true if x is not null 42 I/O Complexity of Query Evaluation with Existence Constraints (lower bounds are for non-emptiness) Query / Semantics Complete AND 43 Path Query PTIME PTIME Tree Query DAG Query Cyclic Query PTIME NPComplete NPComplete PTIME NPComplete NPComplete NPComplete NPComplete Weak PTIME PTIME NPComplete OR PTIME PTIME NPComplete I/O Complexity of Query Evaluation with Weak Equality/Inequality Constraints (lower bounds are for non-emptiness) Path Query Tree Query DAG Query Cyclic Query PTIME NPComplete NPComplete NPComplete PTIME NPComplete NPComplete NPComplete Weak PTIME NPComplete NPComplete NPComplete OR PTIME NPComplete NPComplete NPComplete Query / Semantics Strong AND 44 Query Containment • Query containments for queries with incomplete answers is defined differently from query containment for queries with complete answers • Q1 Q2 if for all database D, every matching of Q1 w.r.t. to D is subsumed by a matchings of Q2 w.r.t. to D • Query containment (query equivalence) is useful for the development of optimization techniques 45 Containment in AND Semantics • Homomorphism between the query graphs is necessary and sufficient for containment Q1 r l1 l2 x Q2 r l1 p l2 l2 y l3 u z l3 l4 v u q Q1 Q2 l4 v homomorphism • Deciding whether one query is contained in another is NP-Complete 46 Containment in OR Semantics • The following is a necessary and sufficient condition for query containment in OR semantics • For every spanning tree T1 of the contained query, there a spanning tree T2 of the containing query, such that there is a homomorphism from T2 to T1 – is in ΠP2 – NP-Complete if the containee is a tree – polynomial if the container is a tree 47 Containment in Weak Semantics • Similar to containment in OR Semantics, with the following difference • Instead of checking homomorphism between spanning trees, we check homomorphism between graph fragments – A graph fragment is a restriction of the query to a subset of the variables that includes the query root such that every node in the fragment is reachable from the root 48 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration 49 Flexible Queries • To deal with structural variations in the data, we have developed flexible queries 50 Rigid Queries Semiflexible Queries Flexible Queries Flexible Queries 51 Increasing level of flexibility Example Movie Database r Actor A query that finds all pairs of actors that acted in the same movie Actor x y Movie Movie z However, if in the database, actors are descendents of movies, the query has to be reformulated Instead, we propose new ways of matching queries to databases 52 Rigid matchings and complete matchings are the same Returning rigid matchings is the usual semantics for queries (e.g., XQuery, Lorel, XML-QL, etc.) 53 Constraints On Rigid Matchings • Root Constraint: • Satisfied if the query root is mapped to the db root Query Root r 1 Database Root • Edge Constraint: • Satisfied if a query edge with label l is mapped to a database edge with label l x l l y 54 12 25 Movie Database 1 Movie Movie 11 Actor Actor 21 Name 30 22 Name Title Movie T.V. Series Star 1977 Wars Name 28 Léon Title Name 31 x Movie y 29 26 34 32 Natalie Portman A Rigid Matching 55 14 Actor Title Year 25 Actor Actor 12 23 24 Mark Harrison Hamill Ford r 27 Title Year 35 36 Twin 1984 Peaks Dune Kyle MacLachlan This is not a Rigid Matching A Semiflexible Matching • The query root is mapped to the db root • A query node with an incoming label l is mapped to a db node with an incoming label l • The image of every query path is embedded in some database path • SCC is mapped to SCC 56 Query Root DB Root 1 r x l y × 9 l 11 A Semiflexible Matching Query • The query root is Root The lasttotwo conditions mapped the db root r cannot be verified locally, • A query node with an i.e., by label considering incoming l is one query mapped to a dbedge node at a time with an incoming label l • The image of every query path is embedded in some database path • SCC is mapped to SCC 57 DB Root 1 x l 9 l y 11 Movie Database 1 Movie Movie 11 Actor Actor 21 Name 30 22 Name Title Star 1977 Wars Mark Harrison Hamill Ford 14 Movie T.V. Series Actor Title Year 25 Name 28 Léon Title Name x Movie y 29 26 32 Natalie Portman 34 27 Title Year 35 36 Twin 1984 Peaks Dune Kyle MacLachlan The Semiflexible Matchings 58 Actor Actor 12 23 24 31 r We get all the actor-movie pairs r r Actor Movie y Actor x x Movie y Under semiflexible semantics, these two queries are equivalent The user does not have to know if movies are above or below actors in the database 59 Movie Database 1 Movie Movie 11 Actor Actor 21 Name 30 22 Name Title Star 1977 Wars Mark Harrison Hamill Ford Actor x 14 Movie Movie T.V. Series Actor Title Year 25 Actor Actor 12 23 24 31 r Name 28 Léon Title Name y Movie z 29 26 32 Natalie Portman 34 27 Title Year 35 36 Twin 1984 Peaks Dune Kyle MacLachlan Impossible to get this pair by means We get of apairs rigid of Another since Example a is a dag matching, the of query actors andthat theacted db in 60 Matching is Semiflexible a tree the same movie A Flexible Matching • The query root is mapped to the db root • A query node with an incoming label l is mapped to a db node with an incoming label l • An edge is mapped to two nodes on one path • Notice that a path in the query is not necessarily mapped to a path in the db 61 Query Root r DB Root 1 x l 9 l y 11 An Example of a Flexible Query r Director A director The director name x Name Movie A movie of the director y z Title Actor An actor in the movie v u Name The name of the actor 62 w The title of the movie Movie Database 1 r Movie Director Producer Movie 11 Actor Title Director Actor Year 21 22 23 24 x 12 27 Actor 25 Director Title 26 Name Movie z 29 Star 1977 Name Name Hook Name Name Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 y Title Actor flexible neither a rigid AThis query edge matching is mappedisto matching nor a semiflexible matching two db nodes on one path 63 v u Name w Movie Database 1 r Movie Producer Producer Movie 11 Actor Title Director Actor Year 21 22 23 24 x 12 27 Actor 25 Director Title 26 29 Movie y Star 1977 Name Name Hook Name Name Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 In thisWhy flexible matching, a producer are semiflexible matchingsis given with that he directed but did not produce preferred sometimes to flexible matchings? 64 a movie Movie Database 1 r Movie Producer Producer Movie 11 Producer Actor 99 Title Director Actor Year 21 22 23 24 27 x 12 Actor 25 Director Title 26 29 Movie y Star 1977 Name Name Hook Name Name Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 In semiflexible semantics, the problem is solved since the image of a query path is embedded in 65 a database path Differences Between the Semiflexible and Flexible Semantics • On a technical level, in flexible matchings – Query paths are not necessarily embedded in database paths – SCC’s are not necessarily mapped to SCC’s • On a conceptual level, in the semiflexible semantics, nodes are “semantically related” if they are on the same path, and hence – Query paths are embedded in database paths • In the flexible semantics, this condition is relaxed: – Query edges are embedded in database paths 66 Increasing Level of Flexibility • A rigid matching is a semiflexible matching • A semiflexible matching is a flexible matching 67 Verifying that Mappings are Semiflexible Matchings • Is a given mapping of query nodes to database nodes a semiflexible matching? – Not as simple as for rigid matchings (no local test, i.e., need to consider paths rather than edges) • In a dag query, the number of paths may be exponential – Yet, verifying is in polynomial time • In a cyclic query, the number of paths may be infinite – Yet, verifying is in exponential time 68 Verifying that a Mapping is a Semiflexible Matching Query / Database Path Query Tree Query DAG Query Cyclic Query No Path Database PTIME PTIME PTIME Tree Database PTIME PTIME PTIME DAG Database PTIME PTIME PTIME matchings Cyclic Database PTIME PTIME coNP coNP 69 matchings No matchings No Input-Output Complexity of Query Evaluation for the Semiflexible Semantics • Next slide summarizes results about the input-output complexity – Polynomial for a dag query and a tree database (or simpler cases) • Rather difficult to prove, even when the query is a tree, since there is no local test for verifying that mappings are semiflexible matchings – Exponential lower bounds for other cases 70 I/O Complexity for SF Semantics (lower bounds are for non-emptiness) Query / Database Path Database Path Query PTIME Tree Query PTIME DAG Query Cyclic Query PTIME Result is empty Tree Database PTIME PTIME PTIME Result is empty DAG Database NPComplete NPComplete NPComplete Result is empty Cyclic Database NPComplete NPComplete NP-Hard NP-Hard (in P2) (in P2) 71 Data Complexity is Polynomial in all Cases Query Evaluation for the Flexible Semantics • The database is replaced with a relationship graph which is a graph, such that – The nodes are the nodes of the database – Two nodes are connected by an edge if there is a path between them in the database (the direction of the path is unimportant) • The query is evaluated under rigid semantics w.r.t. the relationship graph 72 I/O Complexity of Query Evaluation for the Flexible Semantics • Results follow from a reduction to query evaluation under the rigid semantics • Tree query – Input-Output complexity is polynomial • DAG query – Testing for non-emptiness is NP-Complete 73 Query Containment • Q1 Q2 if for all database D, the set of matchings of Q1 w.r.t. to D is contained in the set of matchings of Q2 w.r.t. to D • We assume that – Both queries have the same set of variables 74 Complexity of Query Containment • Under the semiflexible semantics, Q1 Q2 iff the identity mapping from the variables of Q2 to the variables of Q1 is a semiflexible matching of Q2 w.r.t. Q1 • Thus, containment is – in coNP when Q1 is a cyclic graph and Q2 is either a dag or a cyclic graph – in polynomial time in all other cases • Under the flexible semantics, query containment is always in polynomial time 75 Database Equivalence • D1 and D2 are equivalent if for all queries Q, the set of matchings of Q w.r.t. to D1 is equal to the set of matchings of Q w.r.t. to D2 • Both databases must have the same set of objects and the same root 76 Complexity of Database Equivalence • For the semiflexible semantics, deciding equivalence of databases is – in polynomial time if both databases are dags – in coNP if one of the databases has cycles • For the flexible semantics, deciding equivalence of databases is polynomial in all cases 77 Database Transformation MDB MDB 1 Actor 1 Actor Actor 2 3 4 Dustin Harrison Movie Hoffman Ford Movie Movie 6 Hook 8 Star Wars Movie Hook Mark Hamill Actor 2 Dustin Hoffman Movie 6 8 Star Wars Actor 3 Harrison Ford Actor 4 Mark Hamill The databases are equivalent under both A DAG has become a TREE! the flexible and semiflexible semantics 78 Transforming a Database into a Tree • Reasons for transforming a database into an equivalent tree database: – Evaluation of queries over a tree database is more efficient – In a graphical user interface, it is easier to represent trees than DAGs or cyclic graphs – Storing the data in a serial form (e.g., XML) requires no references 79 Transformation into a Tree • There are algorithms for – Testing if a database can be transformed into an equivalent tree database, and – Performing the transformation • For the semiflexible semantics – The algorithms are polynomial • For the flexible semantics – The algorithms are exponential 80 Implementing Flexible Queries • Flexible queries were implemented in SQL4X • In an SQL4X query, relations and XML documents are queried simultaneously • A query result can be either a relation or an XML document 81 A query under the Flexible Semantics r Director x Name Movie y z Title v An SQL4X Query QUERY AS RELATION SELECT text(y) as director, text(v) as title FROM x Director of ‘MDB.xml’, y Name of x, z Movie of x, v Title of z 82 A query under the Flexible Semantics r Director x Name Movie y z Title v An SQL4X Query Constraints can be added QUERY AS RELATION SELECT text(y) as director, text(v) as title FROM x Director of ‘MDB.xml’, y Name of x, z Movie of x, v Title of x WHERE text(v) = ‘Star Wars’ 83 A query under the Flexible Semantics FilmBudgets r Director Title x … … … … Name Movie y z Title v Budget A relation with data about film budgets An SQL4X Query QUERY AS RELATION SELECT text(x) as director, text(v) as title, Budget FROM x Director of ‘MDB.xml’, y Name of x, z Movie of x, v Title of x, FilmBudgets WHERE text(v) = FilmBudgets.Title 84 Relations and XML Documents can be queried simultaneously Agenda Why is is difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration 85 Combining the Paradigms • In oblivious querying: – The user does not have to know where data is incomplete – The user does not have to know the exact structure of the data • The paradigm of flexible queries and the paradigm of queries with incomplete answers should be combined 86 Flexible Queries with Incomplete Answers • A flexible query w.r.t. a database is actually a rigid query w.r.t. the relationship graph • Evaluating a query in AND-semantics (weak semantics, OR-Semantics) w.r.t. the relationship graph produces a flexible query that returns maximal answers rather than complete answers 87 Movie Database 1 r Movie Director Producer Movie 11 Actor Title Director Actor Year 21 22 23 24 x 12 27 Actor 25 Director Title 26 29 Star 1977 Name Name Hook Name Name Name Wars Date of 34 30 32 31 33 birth George Dustin Steven Mark Harrison Lucas Hoffman Spielberg 35 Hamill Ford 14 May 1944 88 Consider the case where Node 25 and Node 33 are removed Name Movie y z Title Actor v u Name w Movie Database 1 r Movie Director Producer Movie 11 Actor Title Director Actor Year 21 22 23 24 x 12 27 Star 1977 Name Name Name Wars Date of 30 32 31 birth George Mark Harrison Lucas 35 Hamill Ford 14 May 1944 Director Title 26 Name Movie z 29 Title Actor Hook Name 34 Steven Spielberg y v u Name NULL w NULL A Flexible matching which is also an incomplete (maximal) matching 89 Agenda Why is is difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration 90 Full Disjunction • Intuitively, the full disjunction of a given set of relations is the join of these relations that does not discard dangling tuples • Dangling tuples are padded with nulls • Only maximal tuples are retained in the full disjunction (as in the case of QwIA) 91 Movies Actors m-id title year 1 Zelig 2 Antz language a-id name date-of-birth 1983 English 1 Woody Allen 1/12/1935 1998 English 2 Bruce Willis 19/3/1955 3 Armageddon 1998 English 3 Julia Roberts 28/10/1967 4 Fantasia 1940 English Acted-in a-id m-id role Actors-that-Directed a-id m-id 1 1 The Full Disjunction of the Given Relations 1 1 Zelig 1 2 Z 2 3 Harry m-id title year language a-id name Date-of-birth role 1 Zelig 1983 English 1 Woody Allen 1/12/1935 Zelig 2 Antz 1998 English 1 Woody Allen 1/12/1935 Z 3 Armageddon 1998 English 2 Bruce Willis 19/3/1955 Harry 4 Fantasia 1940 English 3 Julia Roberts 28/10/1967 92 Movies m-id title year language 1 Zelig 1983 English 2 Antz 1998 English 3 Armageddon 1998 English 4 Fantasia This tuple will not be in the full disjunction 1940 English m-id title year language a-id name 1 Zelig 1983 English Date-of-birth role The Full Disjunction of the Given Relations m-id title year language a-id name Date-of-birth role 1 Zelig 1983 English 1 Woody Allen 1/12/1935 Zelig 2 Antz 1998 English 1 Woody Allen 1/12/1935 Z 3 Armageddon 1998 English 2 Bruce Willis 19/3/1955 Harry 4 Fantasia 1940 English The does not subsumed 93 full disjunction 3 include Julia Roberts 28/10/1967 tuples Movies Actors m-id title year 1 Zelig 2 Antz language a-id name date-of-birth 1983 English 1 Woody Allen 1/12/1935 1998 English 2 Bruce Willis 19/3/1955 3 Armageddon 1998 English 3 Julia Roberts 28/10/1967 4 Fantasia 1940 English Acted-in a-id m-id role a-id m-id 1 1 Zelig 1 1 2 Z The Full Disjunction of the Given Relations 2 3 Harry m-id m-id 1 4 2 title title Zelig Fantasia Antz year year 1983 1940 1998 language language English English English a-id a-id 1 3 1 name name Woody Allen Julia Roberts Woody Allen Date-of-birth Date-of-birth 1/12/1935 28/10/1967 1/12/1935 role role Zelig Z 3 Armageddon 1998 English 2 Bruce Willis 19/3/1955 Harry Actors-that-Directed 1 The full disjunction does not include tuples that are based 4 Fantasia 1940 English on Cartesian Product rather than join 94 3 Julia Roberts 28/10/1967 In the Full Disjunction of a Given Set of Relations: Every tuple of the input is a part of at least one tuple of the output Tuples are joined as in a natural join, padded with null values The result includes only “maximal connected portions” 95 Motivation for Full Disjunctions • Full disjunctions have been proposed by Galiando-Legaria as an alternative for outerjoins [SIGMOD’94] • Rajaraman and Ullman suggested to use full disjunctions for information integration [PODS’96] 96 Computing Full Disjunctions for γ-acyclic Relation Schemas • Rajaraman and Ullman have shown how to evaluate the full disjunction by a sequence of natural outerjoins when the relation schemas are γ-acyclic • Hence, the full disjunction can be computed in polynomial time, under input-output complexity, when the relation schemas are γ-acyclic 97 Weak Semantics Generalizes Full Disjunctions • Relations can be converted into a semistructured database • The full disjunction can be expressed as the union of several queries that are evaluated under weak semantics We have developed an algorithm that uses this generalization to compute full disjunctions in polynomial time under I/O complexity, even when the relation schemas are cyclic 98 Generalizing Full Disjunctions • In a full disjunction, tuples are joined according to equality constraints as in a natural join (or equi-join) • We can generalize full disjunctions to support constraints that are not merely equality among attributes 99 Example Movies (m-id, title, year, language, location) Actors (a-id, name, date-of-birth) Acted-in (a-id, m-id, role) Actors-that-Directed (a-id, m-id) Historical-Events (name, date, description) Historical-Sites (Country, State, City, Site) The date of the historical event is a date in the year when The filming location is near the historical site the movie was released 100 Another Way of Generalizing Full Disjunctions: Use OR-Semantics • OR-semantics is used rather than weak semantics when tuples are joined • This relaxes the requirement that every pair of tuples should be join consistent • Instead, a tuple of the full disjunction is only required to be generated by database tuples that form a connected subgraph, but need not be pairwise join consistent 101 Example Employees (e-id, ename, city, dept-no) Departments (dept-no, dname, building) Located-in (building, city, street) Employee: (007, James Bond, London, 6) Department: (6, MI-6, 10) Located-in: (10, Liverpool, King) e-id ename city 007 James Bond London 6 6 MI-6 10 6 MI-6 10 10 Liverpool King 102 dept dept dname -no -no building building city The Full Disjunction street Example Employees (e-id, ename, city, dept-no) Departments (dept-no, dname, building) Located-in (building, city, street) Employee: (007, James Bond, London, 6) Department: (6, MI-6, 10) Located-in: (10, Liverpool, King) e-id ename city 007 James Bond London 103 dept dept dname -no -no 6 6 MI-6 building building city 10 10 Liverpool The Full Disjunction under OR-Semantics street King Information Integration from Heterogeneous Sources Integrated Relation Relation Relation Relation Query Query Query Data Source Data Source Data Source 104 We use queries that combine flexible semantics and weak semantics: -The queries are insensitive to changes in the data - Easy to formulate the query Integrated Relation Relation Relation Relation Query Query Query Data Source Data Source Data Source 105 The integration of the relations is done with a full disjunction of the computed relations Integrated Relation Relation Relation Relation Query Query Query Data Source Data Source Data Source 106 Conclusion • Flexible and semiflexible queries facilitate easy and intuitive querying of semistructured databases – Querying the database even when the user is oblivious to the structure of the database – Queries are insensitive to variations in the structure of the database 107 Conclusion (continued) • Queries in AND semantics, OR semantics or weak semantics facilitate easy and intuitive querying of incomplete databases – Querying the database even when the user is oblivious to missing data – Queries return maximal answers rather than complete answers 108 Conclusion (continued) • The two paradigms of flexible queries and queries with maximal answers can be combined • The combination of the paradigms can facilitate integration of information from heterogeneous sources 109 Conclusion (continued) • Full disjunctions can be computed using queries in weak semantics • Full disjunctions can be generalized so that relations are joined using constraints that are not merely equality constraints 110 Thank You Questions? 111