Computing Provenance and Annotations of Derived Data
Wang-Chiew Tan
UC Santa Cruz

Provenance of data
• When you see some data on the Web, do you know
  – where it came from?
  – why it is there?
• This information (provenance) is typically lost in the process of copying/transcribing/transforming databases
• Loss of provenance is an acute problem in some scientific databases

Complex interdependencies (Example from scientific databases)
[Figure: flow of data among GERD, TRRD, EpoDB, BEAD, Swissprot, GAIA, EMBL, GenBank, DDBJ, and Transfac]
Various problems:
• Trace provenance of data
• Propagate annotations

Two kinds of provenance

NYHotels (Source table)
  Hotel             Zip     Rating
  Waldorf Astoria   10022   4.5
  Holiday Inn DT    10013   4.0

NYRestaurants (Source table)
  Restaurant           Cost   Type       Zip
  Peacock Alley        $$$    French     10022
  Bull & Bear          $$$    Seafood    10022
  Pacifica             $      Chinese    10013
  Soho Kitchen & Bar   $      American   10022

View (JOIN, PROJECT)
  Hotel             Restaurant           Rating   Cost
  Waldorf Astoria   Peacock Alley        4.5      $$$
  Waldorf Astoria   Bull & Bear          4.5      $$$
  Waldorf Astoria   Soho Kitchen & Bar   4.5      $
  Holiday Inn DT    Pacifica             4.0      $

• Why? (Why-provenance)
• Where? (Where-provenance)

SDSS - Sloan Digital Sky Server
  Select Specobj.z, photoobj.g, photoobj.r
  From Specobj, photoobj
  Where Specobj.objid = photoobj.objid
    and Specobj.specclass = 3
    and Specobj.zconf > .95

Compute provenance
• Question: Suppose a database is created by a query. Can we compute the why and where provenance of an element?
• Answer: Computing provenance (both why and where) is NP-hard in general.
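
For the small JOIN, PROJECT view on the "Two kinds of provenance" slide, both kinds of provenance can be read off by carrying source references along while the view is built. The following is a minimal sketch under stated assumptions, not the method from the talk: the function name `join_project_with_provenance`, the use of row indexes as tuple identifiers, and the dictionary layout are all hypothetical choices made for illustration. It handles only this one join on Zip, where every view tuple has exactly one witness.

```python
# Source tables copied from the "Two kinds of provenance" slide.
NY_HOTELS = [
    {"Hotel": "Waldorf Astoria", "Zip": "10022", "Rating": 4.5},
    {"Hotel": "Holiday Inn DT", "Zip": "10013", "Rating": 4.0},
]
NY_RESTAURANTS = [
    {"Restaurant": "Peacock Alley", "Cost": "$$$", "Type": "French", "Zip": "10022"},
    {"Restaurant": "Bull & Bear", "Cost": "$$$", "Type": "Seafood", "Zip": "10022"},
    {"Restaurant": "Pacifica", "Cost": "$", "Type": "Chinese", "Zip": "10013"},
    {"Restaurant": "Soho Kitchen & Bar", "Cost": "$", "Type": "American", "Zip": "10022"},
]


def join_project_with_provenance(hotels, restaurants):
    """Join on Zip, project (Hotel, Restaurant, Rating, Cost), and record provenance.

    Assumption: a source tuple is identified by (table name, row index).
    """
    view = []
    for hi, hotel in enumerate(hotels):
        for ri, rest in enumerate(restaurants):
            if hotel["Zip"] != rest["Zip"]:
                continue
            row = {
                "Hotel": hotel["Hotel"],
                "Restaurant": rest["Restaurant"],
                "Rating": hotel["Rating"],
                "Cost": rest["Cost"],
            }
            # Why-provenance: the source tuples that together witness this view tuple.
            why = [(("NYHotels", hi), ("NYRestaurants", ri))]
            # Where-provenance: the source cell each view value was copied from.
            where = {
                "Hotel": ("NYHotels", hi, "Hotel"),
                "Restaurant": ("NYRestaurants", ri, "Restaurant"),
                "Rating": ("NYHotels", hi, "Rating"),
                "Cost": ("NYRestaurants", ri, "Cost"),
            }
            view.append({"row": row, "why": why, "where": where})
    return view


VIEW = join_project_with_provenance(NY_HOTELS, NY_RESTAURANTS)
for entry in VIEW:
    print(entry["row"], entry["why"], entry["where"]["Rating"])
```

Because the projection keeps both Hotel and Restaurant, no two joined rows collapse into one view tuple here. A projection that merged rows would need a list of several witnesses per view tuple, which this sketch does not attempt.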

Annotations
• Adds value to data
  – knowledge sharing: annotations can be read & reviewed by independent parties
• Annotations are loosely structured
  – Annotations on data at various levels of granularity, annotations on annotations
• Source Data:
  – proprietary
  – fixed schema
• A system that overlays annotations on existing data
• Useful tool for scientific databases
• Annotations should spread back to the source and forward to other databases

Propagating annotations

Annotations shown in the figure:
  "Serves fine French Cuisine in elegant setting. Jackets required."
  "Extensive wine list!"
  "Yummy chicken curry!!"

NYRestaurants (Source Table)
  Restaurant           Cost   Type       Zip
  Peacock Alley        $$$    French     10022
  Bull & Bear          $$$    Seafood    10022
  Pacifica             $      Chinese    10013
  Soho Kitchen & Bar   $      American   10022

All Restaurants (View 1)
  Restaurant           Cost   Type
  Peacock Alley        $$$    French
  Bull & Bear          $$$    Seafood
  Pacifica             $      Chinese
  Soho Kitchen & Bar   $      American

Cheap Restaurants (View 2)
  Restaurant           Cost   Type
  Pacifica             $      Chinese
  Soho Kitchen & Bar   $      American

Location and Propagation Rules
• A location is a triple (R, t, A):
  – R is a relation name
  – t is a tuple in R
  – A is an attribute in the schema of R
• Propagation Rules:
  – Select
  – Project
  – Join
  – Union
[Diagrams: one per rule, showing annotations propagating between attributes A1, A2, A3 of the source relations (R, R1, R2) and the result]

Computing annotation propagation
Model:
  Source: Relational Database
  Query
  View: result of query applied on source
• Question: Suppose a database is created by a query over some source data. Can we compute how to propagate an annotation on a data element back to the source with minimum side-effects?
• Answer: Computing the minimum side-effect annotation is NP-hard in general.

Related Work on Annotations (not exhaustive!)
• Superimposed Information (D. Maier, L. Delcambre [WebDB’99])
  – data “placed over” existing information, e.g. bookmark files, schema of a database
• Annotation Systems
  – Annotea (W3C)
    • annotate web pages
  – Multivalent Browser (R. Wilensky, T. A. Phelps.
    UC Berkeley DL Project)
    • annotate on PDF files, HTML, etc.
  – BioDAS (Distributed Annotation Server) (L. Stein et al.)
    • annotate on genome sequences
• No one has formally studied the annotation placement problem

Provenance and Annotations
• Where-provenance & annotation placement
  – Where should the annotation be placed in the source in order to propagate the annotation to view data d?
    • Annotate the source data in one of the source locations in the where-provenance of d
• Provenance & Archiving
  – Trace a piece of data to its correct source version
• Why-provenance & view deletion
  – Which source data should be deleted in order to delete view data d?
    • A combination of source data that altogether “disables” every witness for d

How do we attach annotations to data?
• Relational tables: Identify a particular column of a particular tuple of a particular relation: (R, t, A)
• Tree-like data: Need a canonical path to the data element

Lots more to do!
• Further study on provenance for queries that involve negation, aggregates
    select sum(sal) from Employee where sal > 50K
• Handle “irregular” annotations and annotations on tree-like data.
• How about databases which are manually constructed and annotated?
  – Organize data with keys
• Use of constraints and special cases to derive efficient algorithms for propagating annotations back
• Language specific issues

Inconsistencies in “annotation-aware” language(s)
• The same query in different languages, but different annotation behavior

  Emp
    Name   Sal   Dept
    Joe    50K   Marketing

  Department
    Dept        Manager
    Marketing   Jane

  Relational Algebra: Emp JOIN Department
    [Name:"Joe", Sal:50K, Dept:"Marketing", Manager:"Jane"]

  SQL: SELECT e.Name, e.Sal, e.Dept, d.Manager
       FROM Emp e, Department d
       WHERE e.Dept = d.Dept
    [Name:"Joe", Sal:50K, Dept:"Marketing", Manager:"Jane"]

• Equivalent queries in the same language, but different annotation behavior

  Q1 = SELECT e.Name, e.Sal FROM Emp e WHERE e.Sal = "50K"
    [Name:"Joe", Sal:50K]

  Q2 = SELECT e.Name, "50K" AS Sal FROM Emp e WHERE e.Sal = "50K"
    [Name:"Joe", Sal:50K]

Do we need an “annotation-aware” QL?
• Relational algebra suggests a natural set of propagation rules
• SQL suggests another natural propagation rule
  – based on variable bindings
• Question: Can we extend/design the query language(s) so that
  – Equivalent queries have the same annotation behavior
  – Translation of a query from one language (e.g. SQL) into another (e.g. relational algebra) yields the same annotation behavior
• Perhaps a more fundamental question...
  – Should a query language be “annotation-aware”?
  – Perhaps we should have language constructs to allow the user to explicitly control annotation propagation?
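
The Q1/Q2 discrepancy from the "Inconsistencies" slide can be made concrete with a small sketch. It assumes one particular propagation scheme, chosen here only for illustration: a value copied from a source cell keeps that cell's annotations, while a constant written in the query carries none. The `Cell` class, the helper `values`, and the annotation text "verified by HR" are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """A value together with the annotations attached to it (assumed model)."""
    value: str
    annotations: frozenset = frozenset()


# Emp table from the slide; the annotation on Sal is a made-up example.
EMP = [
    {
        "Name": Cell("Joe"),
        "Sal": Cell("50K", frozenset({"verified by HR"})),
        "Dept": Cell("Marketing"),
    }
]


def q1(emp):
    # SELECT e.Name, e.Sal FROM Emp e WHERE e.Sal = "50K"
    # Sal is copied from the source cell, so its annotations come along.
    return [{"Name": e["Name"], "Sal": e["Sal"]} for e in emp if e["Sal"].value == "50K"]


def q2(emp):
    # SELECT e.Name, "50K" AS Sal FROM Emp e WHERE e.Sal = "50K"
    # Sal is a constant from the query text, so it starts with no annotations.
    return [{"Name": e["Name"], "Sal": Cell("50K")} for e in emp if e["Sal"].value == "50K"]


def values(rows):
    """Strip annotations, leaving only the plain query result."""
    return [{name: cell.value for name, cell in row.items()} for row in rows]


print(values(q1(EMP)) == values(q2(EMP)))   # same data
print(q1(EMP)[0]["Sal"].annotations)        # annotations under Q1
print(q2(EMP)[0]["Sal"].annotations)        # annotations under Q2
```

Under this assumed scheme the two queries return the same data but differ in the annotations on Sal, which is the kind of behavior the question about an "annotation-aware" query language is meant to address.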