Course 236378
Introduction
Faculty of Computer Science
Technion – Israel Institute of Technology
Winter 2016-2017

Assumed Background
• Databases
  – Relational model, database querying, SQL, relational algebra, schema, integrity constraints (e.g., functional dependencies)
• Algorithms and complexity
  – Asymptotic running time, polynomial time, NP, completeness, reduction
• Basic probability theory
  – Probability spaces, random variables, conditional probability

Requirements
1. Home assignments
   – 5 x dry (20% each)
2. Mandatory attendance
   – Contact me in advance if you have a problem attending a specific lecture

Principles of Managing Uncertain Data: Introduction
HISTORICAL PERSPECTIVE

Pre-Relational Databases
• Cross-application solutions for data storage and access were proposed already in the 1960s
• Examples:
  – The CODASYL committee standardized a network data model (CODASYL Data Model)
    • A network of entities linked to each other, very similar to object-oriented models
  – Integrated Data Store (IDS), by Charles W. Bachman
    • High-performance graph database from 1964 (!)
  – IBM’s Information Management System (IMS), driven by the Apollo program
    • Hierarchical data model, index and transaction support

Codd’s Vision (1)
• 1970: Edgar F. Codd (1923-2003) invents the relational database model
  – Idea: interface via First-Order Logic!
    • Data = collection of relations, interconnected via keys
    • Relations conform to a schema
    • Questions via a query language over the schema
    • System translates queries into actual execution plans
  – Principle: separate logical from physical layers
  – Work done at IBM San Jose, now IBM Almaden
  – [E. F. Codd: A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6): 377-387 (1970)]

Codd’s Vision (2)
• 1970-1972: Codd introduced the relational algebra and the relational calculus
  – Algebraic and logical QLs, respectively
  – Proved their equal expressive power
  – [E. F. Codd: Relational Completeness of Data Base Sublanguages. In: R. Rustin (ed.): Database Systems: 65-98]

Codd Catches On (1)
• 1973: Michael Stonebraker and Eugene Wong implement Codd’s vision in INGRES
  – Commercialized in 1983
  – Evolved to Postgres (now PostgreSQL) in 1989

Codd Catches On (2)
• 1974: A group from the IBM San Jose lab implements Codd’s vision in System R, which evolved to DB2 in 1983
  – SQL initially developed at IBM by Donald D. Chamberlin and Raymond F. Boyce
    • [Chamberlin, Boyce: SEQUEL: A Structured English Query Language. SIGMOD Workshop, Vol. 1 1974: 249-264]
• 1977: Influenced by Codd, Larry Ellison founds Software Development Labs
  – Becomes Relational Software in 1979
  – Becomes Oracle Systems Corp (1982), named after its Oracle database product
Pictured: R. F. Boyce (1947-1974), D. D. Chamberlin, J. Gray, P. G. Selinger, L. Ellison

Selected Database Research Topics*
[Timeline figure; topic labels and their sub-items are listed in extraction order, so the grouping is approximate]
System Design
Database Security
• Distributed, storage, in-memory, recovery
Views
• View-based access
• Incremental maintenance
Query Languages
• Codasyl, SQL, recursion, nesting
System Optimization
• Caching & replication
• Indexing
• Clustering
Schema Design
• ER models, normal forms, dependency
Benchmarking
Transaction & concurrency
DB Performance
• Query processing & optimization
• Evaluation methods
Data Models
• OO, geo, temporal
Logic
• Deductive (Datalog)
• Integrity/constraints
Incompleteness (null)
1980
Heterogeneity
• Data Integration
• Interoperability
Analytics (OLAP)
Data Models
• Multimedia, DNA
• Text, XML
Mining & Discovery
• Discovering association rules
1990
Further XML
• Query eval / optimize
• Compression
Database Privacy
Data Models
• Streaming data
• Graph data
DB Uncertainty
• Inconsistency & cleaning
• Probabilistic DB
DB & IR
• DB for search
• Search for DB
Entity Resolution
Information Extraction from Web/text
Crowdsourcing
• Utilizing crowd input in databases
Social Networks & Social Media
Data Models
• Semantic Web (RDF, ontologies)
• NoSQL (doc, graph, key-value)
DB & ML & AI
Schema Matching & Discovery
Provenance / lineage
• Model / compute
Data Exchange
Ranking & personalization
Cloud Databases
Column Stores
2000
* Based on SIGMOD session topics from DBLP

Publication Venues for DB Research
• Conferences:
  – General:
    • SIGMOD: ACM Special Interest Group on Management of Data (since 1975)
    • VLDB: Intl. Conf. on Very Large Databases (since 1975)
    • ICDE: IEEE Intl. Conf. on Data Engineering (since 1984)
    • EDBT: Intl. Conference on Extending Database Technology (since 1988)
  – Theory oriented:
    • PODS: ACM Symp. on Principles of Database Systems (since 1982)
    • ICDT: Intl. Conference on Database Theory (since 1986)
• Journals:
  – TODS: ACM Transactions on Database Systems (since 1976)
  – VLDBJ: The VLDB Journal (since 1992)
  – SIGMOD REC: ACM SIGMOD Record (since 1969)

Turing Awards for DB Technology
• 1973 (Charles Bachman), 1981 (Edgar F. Codd), 1998 (Jim Gray), 2014 (Michael Stonebraker)

Some Modern Database Content
[Figure labels: Web Pages, Social Media, Financial Reports, Gov Reports, Med Reports; Text Analytics / NLP, OCR / Image, Signal / Image Processing, Integration; Knowledge Bases, Business DBs, Sensing Data]

Knowledge Bases
[Figure labels: Concept, Instance, Attribute, Value, Relationship, Probability; example entries: country 0.4, Israel, location 0.35, Person 0.2]
• MPI YAGO
• Stanford DeepDive
• Microsoft Probase
• CMU NELL
• Google Knowledge Graph
• Freebase
• DBPedia
• ...

Relating to Big Data
• Missing information
• Conflicting information
• Probabilistic information

Uncertainty is Popular in DB Research
• VLDB 2014 Ten Year Best Paper
  – Nilesh Dalvi and Dan Suciu: Efficient Query Evaluation on Probabilistic Databases
• PODS 2014 Keynote
  – Leonid Libkin: Incomplete data: what went wrong, & how to fix it
• SIGMOD/PODS 2014 Workshop on Big Uncertain Data
  – Kimelfeld (DB) and Kersting (AI)
• ICDT 2013 Test-of-Time Award
  – Ronald Fagin, Phokion Kolaitis, Renee Miller, and Lucian Popa: Data Exchange: Semantics and Query Answering

2016 Dagstuhl Perspective Workshop
Gathering of top database theorists, evaluating the field and planning ahead

In the Course
• Foundations of database complexity
  – Data/combined complexity, join acyclicity, hypertree width
• Principled, application-independent paradigms for managing uncertainty in data
  – Incomplete / inconsistent / probabilistic databases
  – Two key aspects for every paradigm:
    • Representation & Semantics
      – How do we represent what we know? What is missing? What is conflicting? What is our confidence?
    • Query evaluation
      – What is the meaning of query answering in the presence of uncertainty? What is the computational complexity?
BASIC DATABASE CONCEPTS

Schema and Databases
• A database schema is a finite set of relation names, each mapped to a relation schema
  – Example: Student(sid,name,year), Course(cid,topic), Studies(sid,cid)
• A (database) instance over a schema consists of a relation for each relation schema

Student                 Course         Studies
sid  name   year        cid  topic     sid  cid
861  Alma   2           23   PL        861  23
753  Amir   1           45   DB        861  45
955  Ahuva  2           76   OS        753  45
                                       955  76

Integrity Constraints in Databases

Student                 Course                    Took
ID    name   addr       number  name  lecturer    sID   cNum  grade
1234  Avia   Haifa      363     DB    Anna        1234  363   95
2345  Boris  Nesher     319     PL    Barak       2345  319   73

• No two tuples have the same ID (key constraint)
• Courses with the same number have the same name (functional dependency)
• sID is a Student.ID; cNum is a Course.number (referential constraint)
• A student cannot get two grades for the same course
• Grade must be > 53 (check constraint)

What are Integrity Constraints?
• Schema-level (data-independent) specifications of how records should behave beyond the relational structure
  – (e.g., students with the same ID have the same name, take the same courses, etc.)
• The DBMS guarantees that constraints are always satisfied, by disabling actions that cause violations
• What if we get data that violates the constraints to begin with?
  – Wait for “inconsistent databases”

Why Integrity Constraints?
• Maintenance: consistency assured without custom code
• Development complexity: no reliance on consistency tests
  – But exceptions need to be handled
• Optimization: operations may be optimized if we know that some constraints hold
  – (e.g., once a sought student ID is found, you can stop; you won’t find it again)

Querying: Which Courses Did Avia Take?

S                       C                         T
ID    name   addr       number  name  lecturer    sID   cNum  grade
1234  Avia   Haifa      363     DB    Anna        1234  363   95
2345  Boris  Nesher     319     PL    Barak       1234  319   82
                                                  2345  319   73

Assembly
    ...
    mov $1, %rax
    mov $1, %rdi
    mov $message, %rsi
    mov $13, %rdx
    syscall
    mov $60, %rax
    xor %rdi, %rdi
    ...

Python
    for s in S:
        for c in C:
            for t in T:
                if s.name == 'Avia' and s.ID == t.sID and t.cNum == c.number:
                    print(c.name)

QL: SQL
    SELECT C.name
    FROM S, C, T
    WHERE S.name = 'Avia' AND S.ID = T.sID AND T.cNum = C.number

Algebra (RA)
    π_{C.name}(σ_{S.name='Avia' ∧ number=cNum ∧ ID=sID}(S × C × T))

Logic (RC)
    {⟨x⟩ | ∃y,a,z,l,g [S(y,'Avia',a) ∧ C(z,x,l) ∧ T(y,z,g)]}

Logic Programming (Datalog)
    Q(x) ← S(y,'Avia',a), C(z,x,l), T(y,z,g)

Relational Algebra
• Primitive operators:
  1. Projection (π)
  2. Selection (σ)
  3. Renaming (ρ)
  4. Union (∪)
  5. Difference (\)
  6. Cartesian Product (×)
• Natural join (⨝) can be defined using ×
• Conjunctive Queries (CQ): π, σ, ⨝
• Unions of CQs (UCQs, “positive RA”): π, σ, ⨝, ∪

RC Query
• Person(id, gender, country)
• Parent(parent, child)
• Spouse(person1, person2)

{ (x,u) | Person(u, 'female', 'Canada') ⋀ ∃y,z [Parent(y,x) ⋀ Parent(z,y) ⋀ ∃w [Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w))] ] }

Which relatives does this query find?

Domain Independence
• Person(id, gender, country)
• Parent(parent, child)
• Spouse(person1, person2)

What is the meaning of the following?
{ (x) | ¬Person(x, 'female', 'Canada') }
{ (x,y) | ∃z [Spouse(x,z) ⋀ y=z] }
{ (x,y) | ∃z [Spouse(x,z) ⋀ y≠z] }

Equivalence Between RA and D.I. RC
THEOREM: RA and domain-independent RC have the same expressive power. More formally, on every schema S:
  – For every RA expression E there is a domain-independent RC query Q such that Q≡E
  – For every domain-independent RC query Q there is an RA expression E such that Q≡E

Complexity of Joins
Which join is more complicated? In what sense?
• Clique: ⋀_{1≤i<j≤n} R(X_i, X_j)
• Acyclic / Path: ⋀_{1≤i<n} R(X_i, X_{i+1})
• Bounded treewidth (cycle): ⋀_{1≤i<n} R(X_i, X_{i+1}) ⋀ R(X_n, X_1)
[Figure: the three join graphs drawn over the variables X1, ..., X5]

INCOMPLETE DATABASES

Missing Information
• Problem: pieces of data are missing, but we need to keep whatever partial knowledge we have

Registrations          Courses
student  course        course  lecturer
Ahuva    PL            PL      Eran

• A source tells us that Alon is a student of Keren
  – How can we represent it in our DB?

Registrations          Courses             (⊥ = NULL)
student  course        course  lecturer
Ahuva    PL            PL      Eran
Alon     ⊥             ⊥       Keren

SQL’s NULL
• NULL is SQL’s special “missing value”
• Same queries as on complete tables, but SQL assigns a special behavior to logic over NULL
  – “Three-valued logic”: true, false, unknown
• Alas, there are some issues...

Try It Yourself (psql)
    CREATE TABLE Registrations(student varchar(40), course varchar(40));
    CREATE TABLE Courses(course varchar(40), lecturer varchar(40));
    INSERT INTO Registrations VALUES ('Ahuva','PL'), ('Alon',NULL);
    INSERT INTO Courses VALUES ('PL','Eran'), (NULL,'Keren');

    SELECT student, lecturer
    FROM Registrations R, Courses C
    WHERE R.course = C.course;

student  lecturer
Ahuva    Eran

Of course, we've lost our initial association (join)...

Try More Yourself (psql)
    SELECT student FROM Registrations;
        Result: Ahuva, Alon
    SELECT student FROM Registrations WHERE course='PL';
        Result: Ahuva
    SELECT student FROM Registrations WHERE course!='PL';
        Result: (empty)
    SELECT student FROM Registrations WHERE course='PL' OR course!='PL';
        Result: Ahuva (Alon??)

Inconsistent logic... real problem!
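The NULL experiments above can be replayed without a PostgreSQL server. The following is only a sketch under stated assumptions: it swaps psql for SQLite through Python's standard sqlite3 module (which follows the same three-valued treatment of comparisons with NULL), and the helper name students is made up for this illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the psql session.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Registrations(student varchar(40), course varchar(40));
    CREATE TABLE Courses(course varchar(40), lecturer varchar(40));
    INSERT INTO Registrations VALUES ('Ahuva','PL'), ('Alon',NULL);
    INSERT INTO Courses VALUES ('PL','Eran'), (NULL,'Keren');
""")


def students(where=""):
    """Hypothetical helper: sorted students selected by an optional WHERE clause."""
    rows = conn.execute(f"SELECT student FROM Registrations {where}").fetchall()
    return sorted(row[0] for row in rows)


everyone = students()
is_pl = students("WHERE course = 'PL'")
not_pl = students("WHERE course != 'PL'")
# Logically a tautology, yet Alon's NULL course makes both comparisons
# evaluate to "unknown", so his row is filtered out.
tautology = students("WHERE course = 'PL' OR course != 'PL'")

# NULL = NULL is also "unknown", so the Alon/Keren association is lost.
joined = conn.execute(
    "SELECT student, lecturer FROM Registrations R, Courses C "
    "WHERE R.course = C.course"
).fetchall()

print("all:", everyone)
print("course = 'PL':", is_pl)
print("course != 'PL':", not_pl)
print("course = 'PL' OR course != 'PL':", tautology)
print("join:", joined)
```

Alon appears in the unfiltered result but in neither filtered one, and the join returns only the Ahuva/Eran pair, which is the behavior the slides flag as a real problem.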
Labeled Nulls in “Naive” Tables
• Just like nulls, but each null has a name
  – We do not know what the value is, but we do know that two nulls with the same name are the same

Registrations          Courses
student  course        course  lecturer
Ahuva    PL            PL      Eran
Alon     ⊥1            ⊥1      Keren
Ahuva    ⊥2            ⊥2      Shaul

Registrations ⨝ Courses =
student  course  lecturer
Ahuva    PL      Eran
Alon     ⊥1      Keren
Ahuva    ⊥2      Shaul
?        ?       ?
?        ?       ?

Possible Worlds
Incomplete table Registrations(student, course): (Ahuva, PL), (Alon, ⊥1), (Ahuva, ⊥2)
• Closed-World Assumption, example worlds:
  – {(Ahuva, PL), (Alon, PL), (Ahuva, DB)}
  – {(Ahuva, PL), (Alon, DB), (Ahuva, DB)}
• Open-World Assumption, example worlds:
  – {(Ahuva, PL), (Alon, PL), (Ahuva, DB), (Anna, AI)}
  – {(Ahuva, PL), (Alon, DB), (Ahuva, DB), (Ahuva, AI), (Avi, ML), ...}

Semantics of Query Answering
[Figure: an incomplete DB denotes a set of possible worlds; the query result is given either as certain answers (“weak”) or as an incomplete relation (“strong”)]

Application: Data Exchange
[Figure: the FQL table schema (tables such as status, link, group, note, comment, friend_request, user, page, profile) is connected by a mapping to a global schema with Messages, Users, and Associations]

The Clio Project
• IBM + U. Toronto – tool for data exchange
• Commercialized in IBM DB2

Formalism [Fagin et al. 05]
A schema mapping is defined by a source schema S, a target schema T, and a set Σ of logical assertions stating how S relates to T

S: StudLecturer(student, lecturer)
T: Registrations(student, course), Courses(course, lecturer)
Σ: StudLecturer(x,y) → ∃z Registrations(x,z) ⋀ Courses(z,y)

Source instance
StudLecturer
student  lecturer
Ahuva    Shaul
Alon     Keren

What goes into the target? We don’t have z! So 2 options:
1) Abort
2) Do our best to maximize usability

Solution
Registrations          Courses
student  course        course  lecturer
Ahuva    ⊥1            ⊥1      Shaul
Alon     ⊥2            ⊥2      Keren

Problems Studied in Data Exchange
• Materialization
  – Many solutions exist; what makes one solution “better” than another? Is there a “best” solution? How to find it?
• Target query answering
  – Given a source instance and a query over the target, evaluate the query (semantics / complexity)
• Manipulating schema mappings
  – Composition and inversion of mappings

INCONSISTENT DATABASES

Inconsistency
• An inconsistent database contains inconsistent (or impossible) information
  – Two students have the same ID
  – A student gets credit for the same course twice
  – A student takes a course that is not listed in the course database
  – A student has a grade for a course, but a grade is missing for an assignment
• Modeling: (D,Σ) where D is a database and Σ is a set of required logical integrity constraints over DBs; alas, D violates Σ

Query Answering
Database D
Grades                     Courses
student  course  grade     course  lecturer
Ahuva    PL      90        PL      Eran
Alon     PL      86        DC      Keren
Alon     PL      81

Integrity Constraints Σ
Functional Dependency: student, course → grade

    SELECT student
    FROM Grades G, Courses C
    WHERE G.grade >= 85 AND G.course = C.course AND C.lecturer='Eran'

Candidate answers: Ahuva, Alon. The same query is considered with three thresholds:
• grade >= 85: Ahuva qualifies; Alon qualifies through the tuple with grade 86 but not through the tuple with grade 81
• grade >= 87: Ahuva qualifies; Alon qualifies through neither of his tuples
• grade >= 80: Ahuva qualifies; Alon qualifies through both of his tuples

Minimal Repairs [Arenas, Bertossi, Chomicki 99]
DEFINITION: Let (D,Σ) be an inconsistent DB. A repair is a DB D', such that:
1. DB D' is consistent (with respect to Σ)
2. DB D' differs from D in a “minimal way”

Inconsistent database D    Repair D'1                 Repair D'2
Grades                     Grades                     Grades
student  course  grade     student  course  grade     student  course  grade
Ahuva    PL      90        Ahuva    PL      90        Ahuva    PL      90
Alon     PL      86        Alon     PL      86        Alon     PL      81
Alon     PL      81

Semantics of Query Answering
[Figure: an inconsistent DB denotes a set of repairs (consistent DBs); the consistent answers are derived from the repairs]

Algorithms / Complexity
Koutris & Wijsen [2015]: For consistent query answering with key constraints, we know how Select-Project-Join (SPJ) queries without self joins are classified into 3 categories:
1. Rewriting: evaluate a rewritten query on the inconsistent DB (ignore inconsistency)
2. Graph algorithm on the inconsistent DB (polynomial time)
3. coNP-complete (exponential time under standard complexity assumptions)

Incorporating Preferences
Functional dependencies: course → lecturer, lecturer → course

Courses
course  lecturer
DB      Keren
DC      Keren
DC      Eran
DB      Eran

What if we trust some tuples more than others?
Staworko, Chomicki, Marcinkowski: Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64(2-3)

PROBABILISTIC DATABASES

How to accommodate the probabilistic nature of data at the database & query level?

Student  University        Employee  Employer  Role
Ahuva    Technion          Ahuva     Intel     Eng
Alon     Technion                              PM
         HaifaU                                VP
                           Alon      Yahoo!    Eng
                                     Google    Eng
                                     Intel     PM

• Find the students that are employed as engineers
• How many students work at Intel?
• Is any PM a Technion student?

The same tables, with a probability (Pr) attached to each alternative:

Student  University  Pr        Employee  Employer  Role  Pr
Ahuva    Technion    1.0       Ahuva     Intel     Eng   0.7
Alon     Technion    0.7                           PM    0.2
         HaifaU      0.3                           VP    0.1
                               Alon      Yahoo!    Eng   0.4
                                         Google    Eng   0.4
                                         Intel     PM    0.1

• Find the students that are employed as engineers
  – Ahuva (0.7), Alon (0.8)
• How many students work at Intel?
  – Expectation = 1 + 0.1 = 1.1
• Is any PM a Technion student?
  – Yes, with probability 1-((1-0.2)*(1-0.7*0.1)) = 0.256

Semantics of Query Answering
[Figure: a probabilistic database is a representation of a probability space over ordinary DBs, with probabilities p1, p2, p3, p4, ..., pn; query answering maps each answer tuple to its marginal probability]

Algorithms for Query Answering
• Dalvi & Suciu dichotomy: SPJ queries can be fully classified into:
  – Queries that can be solved in polynomial time
    • By repeated decomposition into simpler queries
  – Queries for which answering is #P-hard
    • Hence, cannot be computed in polynomial time under standard complexity assumptions
• Heuristic via BDDs [Olteanu+]
• Guaranteed approximation via sampling
  – Additive approximation p±𝜀 is simple
  – Multiplicative approximation (1±𝜀)p requires more work

Probabilistic XML
[Figure: a probabilistic XML tree for a university department; edges carry probabilities, and nodes include name, position (chair, f. prof, a. prof), ph.d. studs, and member, with the names Paul, Nicole, David, Amy, and Emily]
[Abiteboul, Kimelfeld, Sagiv, Senellart]: Representation systems and XPath evaluation

PLANNED SCHEDULE

#    Date    Topic
1    30/10   Intro
2    06/11   Database Query Languages
3    13/11   Querying Complexity
4    20/11   Acyclic Joins
     24/11   Assignment 1 Due
5    27/11   Complements
6    04/12   Incomplete Data
     08/12   Assignment 2 Due
7    11/12   Data Exchange
8    18/12   Complements
     22/12   Assignment 3 Due
             No Lecture (Hanukkah)
9    01/01   Inconsistent Data
10   08/01   Consistent Query Answering
     12/01   Assignment 4 Due
11   15/01   Probabilistic Databases
12   22/01   Inference on Probabilistic Databases
     25/01   Assignment 5 Due
     26/01   Semester Ends