Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Advanced Topics in Computer Science Course 236605 Lecture 1: Introduction Faculty of Computer Science Technion – Israel Institute of Technology Spring 2016 Assumed Background • Databases – Relational model, database querying, SQL, relational algebra, schema, integrity constraints (e.g., functional dependencies) • Algorithms and complexity – Asymptotic running time, polynomial time, NP, completeness, reduction • Basic probability theory – Probability spaces, random variables, conditional probability 2 Requirements 1. Home assignments – 3 x dry (15% each), 1 x wet (25%) 2. Paper presentation 3. Mandatory attendance – Contact me in advance if you are having a problem attending a specific lecture 3 Lecture 1: Introduction UNCERTAINTY IN DATABASES 4 Some Modern Database Content Knowledge Bases Business DBs Sensing Data 5 Integration Signal / Image Processing Text Analytics / NLP Web Pages Social Media Financial Reports OCR / Image Gov Reports Med Reports Knowledge Bases Attribute Concept Value Instance Instance Concept country 0.4 Probability Relationship Relationship Israel location 0.35 Person 0.2 • Microsoft Probase • MPI YAGO • Google Knowledge Graph • CMU NELL • Stanford DeepDive • Freebase • ... 6 Relating to Big Data • Missing information • Conflicting Information • Probabilistic information 7 Popular Topics in DB Research • VLDB 2014 Ten Year Best Paper – Nilesh Dalvi and Dan Suciu: Efficient Query Evaluation on Probabilistic Databases • PODS 2014 Keynote – Leonid Libkin: Incomplete data: what went wrong, & how to fix it • SIGMOD/PODS 2014 Workshop on Big Uncertain Data – Kimelfeld (DB) and Kersting (AI) • ICDT 2013 Test-of-Time Award – Ronald Fagin, Phokion Kolaitis, Renee Miller, and Lucian Popa: Data Exchange: Semantics and Query Answering 8 In the Course • Principled, application-independent paradigms to managing uncertainty in data – Incomplete / inconsistent / probabilistic databases • Two key aspects for every paradigm: – Representation & Semantics • How do we represent what we know? what is missing? what is conflicting? what is our confidence? – Query evaluation • What is the meaning of query answering in the presence of uncertainty? What is the computational complexity? 9 Course Logo Probabilistic Database Incomplete Inconsistent 10 Lecture 1: Introduction DATABASE RESEARCH 11 Publication Venues for DB Research • Conferences: – General: • • • • SIGMOD: ACM Special Interest Group on Management of Data (since 1975) VLDB: Intl. Conf. on Very Large Databases (since 1975) ICDE: IEEE Intl. Conf. on Data Engineering (since 1984) EDBT: Intl. Conference on Extending Database Technology (since 1988) – Theory oriented: • PODS: ACM Symp. on Principles of Database Systems (since 1982) • ICDT: Intl. Conference on Database Theory (since 1986) • Journals: – TODS: ACM Transactions on Database Systems (since 1976) – VLDBJ: The VLDB Journal (since 1992) – SIGMOD REC: ACM SIGMOD Record (since 1969) 12 Selected Database Research Topics* System Design Database Security • Distributed, storage, in-memory, recovery Views • View-based access • Incremental maintain Query Languages • Codasyl, SQL, recursion, nesting System Optimization • Caching & replication • Indexing • Clustering Schema Design • ER models, normal forms, dependency Benchmarking Transaction & concur. DB Performance • Query process & opt. • Evaluation methods Data Models • OO, geo, temporal Logic • Deductive (Datalog) • Integrity/constraints Incompleteness (null) 1980 Heterogeneity • Data Integration • Interoperability Analytics (OLAP) Data Models • Multimedia, DNA • Text, XML Mining & Discovery • Discovering association rules 1990 Further XML • Query eval / optimize • Compression Database Privacy Data Models • Streaming data • Graph data DB Uncertainty • Inconsistency & cleaning • Probabilistic DB DB & IR • DB for search • Search for DB Entity Resolution Information Extraction from Web/text Crowdsourcing • Utilizing crowd input in databases Social Networks & Social Media Data Models • Semantic Web (RDF, ontologies) • NoSQL (doc, graph, key-value) DB & ML & AI Schema Matching & Discovery Provenance/ lineage Data Exchange Ranking & personalization Cloud Databases 2000 * Based on SIGMOD session topics from DBLP • Model / compute Column Stores 13 Turing Awards for DB Technology 1973 1981 2014 1998 14 Lecture 1: Introduction INCOMPLETE DATABASES 15 Missing Information • Problem: pieces of data missing, but we need to keep whatever partial knowledge we have Registrations Courses student course course lecturer Ahuva PL PL Eran • A source tells us that Alon is a student of Keren – How can we represent it in our DB? Registrations ⊥=NULL Courses student course course lecturer Ahuva PL PL Eran Alon ⊥ ⊥ Keren 16 SQL’s NULL • NULL is SQL’s special “missing value” • Same queries as complete tables, but SQL assigns a special behavior to logic over NULL – “Three-valued logic”: true, false, unknown • Alas, there are some issues... 17 Try It Yourself (psql) CREATE TABLE Registrations( student varchar(40), course varchar(40)); CREATE TABLE Courses( course varchar(40), lecturer varchar(40)); INSERT INTO Registrations VALUES ('Ahuva','PL'), ('Alon',NULL); INSERT INTO Courses VALUES ('PL','Eran'), (NULL,'Keren'); Registrations Courses student course course lecturer Ahuva PL PL Eran Alon ⊥ ⊥ Keren SELECT student, lecturer FROM Registrations R, Courses C WHERE R.course = C.course; student lecturer Ahuva Eran Of course, we've lost our initial association (join)... 18 Try More Yourself (psql) Courses Registrations student course course lecturer Ahuva PL PL Eran Alon ⊥ ⊥ Keren SELECT student FROM Registrations; student SELECT student FROM Registrations WHERE course='PL'; Ahuva student Alon Ahuva Inconsistent logic... real problem! SELECT student FROM Registrations WHERE course!='PL'; student SELECT student FROM Registrations WHERE course='PL' OR course!='PL'; student Ahuva Alon?? 19 Labeled Nulls in “Naive” Tables • Just like nulls, but each null has a name – We do not know what the value is, but we do know that two nulls with the same name are the same Registrations Courses student course course lecturer Ahuva PL PL Eran Alon ⊥1 ⊥1 Keren Ahuva ⊥2 ⊥2 Shaul ⨝ = student course lecture r Ahuva PL Eran Alon ⊥1 Keren Ahuva ⊥2 Shaul ? ? ? ? ? ? 20 Possible Worlds Registrations Closed-World Assumption: Registrations student course Ahuva PL Alon ⊥1 Ahuva ⊥2 student course Registrations Open-World Assumption: student course Ahuva PL Ahuva PL Alon PL Alon PL Ahuva DB Ahuva DB Anna AI Registrations student course Ahuva PL Alon DB Ahuva DB Registrations student course Ahuva PL Alon Ahuva Registrations course ⊥1 Ahuva PL ⊥2 Alon DB Ahuva DB Ahuva AI Avi ML ... ... student 21 Semantics of Query Answering Incomplete DB Possible Worlds 22 Semantics of Query Answering Incomplete DB Possible Worlds 23 Semantics of Query Answering Incomplete DB Certain answers (“weak) Represent as an incomplete relation (“strong”) Possible Worlds 24 FQL Table Schema Application: Data Exchange status link PK uid status_id time source message group PK nid pic_small pic_big pic description group_type group_subtype recent_news creator update_time office website venue privacy uid name value expires path note PK note_id uid created_time updated_time content title comment xid post_id fromid time text id username reply_xid Messages Users Associations Global Schema friend_request uid_from uid_to friend source_id target_id target_type is_following updated_time is_deleted user PK Mappingname link_id owner created_time title summary url image_urls cookies gid connection page uid first_name last_name name pic_small pic_big pic_square pic affiliations profile_update_time timezone religion birthday birthday_date sex hometown_location meeting_sex meeting_for relationship_status significant_other_id political current_location activities interests is_app_user music tv movies books quotes about_me hs_info education_history work_history notes_count wall_count status has_added_app online_presence locale proxied_email profile_url email_hashes pic_small_with_logo PK standard_user_info uid first_name last_name name locale affiliations profile_url timezone birthday sex proxied_email profile id name url pic pic_square pic_small pic_big type page_admin uid page_id type 25 page_id name pic_small pic_big pic_square pic pic_large page_url type website has_added founded company_o mission products location parking public_tran hours attire payment_o culinary_te general_ma price_range restaurant_ restaurant_ release_da genre starring screenplay directed_by produced_b studio awards plot_outline network season schedule written_by band_mem hometown current_loc record_labe booking_ag The Clio Project IBM + U. Toronto – tool for data exchange Commercialized in IBM DB2 26 Formalism [Fagin et al. 05] A schema mapping is defined by a source schema S, a target schema T, and a set Σ of logical assertions stating how S relates to T S T StudLecturer student lecturer Courses Registrations student course course lecturer Σ StudLecturer(x,y) ∃z Registrations(x,z) ⋀ Courses(z,y) StudLecturer student course Ahuva Shaul Alon Keren source instance ?? We don’t have z! So 2 options: 1) Abort 2) Do our best to max usability Formalism [Fagin et al. 05] A schema mapping is defined by a source schema S, a target schema T, and a set Σ of logical assertions stating how S relates to T S T StudLecturer student lecturer Courses Registrations student course course lecturer Σ StudLecturer(x,y) ∃z Registrations(x,z) ⋀ Courses(z,y) StudLecturer Courses Registrations student course student course course lecturer Ahuva Shaul Ahuva ⊥1 ⊥1 Shaul Alon Keren Alon ⊥2 ⊥2 Keren source instance solution 28 Problems Studied in Data Exchange • Materialization – Many solutions exist; what makes one solution “better” than another? If there a “best” solution? How to find it? • Target query answering – Given a source instance and a query over the target, evaluate the query (semantics / complexity) • Manipulating schema mappings – Composition and inversion of mappings 29 Lecture 1: Introduction INCONSISTENT DATABASES 30 Inconsistency • An inconsistent database contains inconsistent (or impossible) information – Two students have the same ID – A student gets credit for the same course twice – A student takes a course that is not listed in the course database – A student has a grade for this course but a grade is missing for an assignment • Modeling: (D,Σ) where D is a database and Σ is a set of required logical integrity constraints over DBs; alas, D violates Σ 31 Query Answering Grades Courses student course grade course lecturer Ahuva PL 90 PL Eran Alon PL 86 DC Keren Alon PL 81 Database D Functional Dependency: student, course grade Integrity Constraints Σ SELECT student FROM Grades G, Courses C WHERE G.grade >= 85 AND G.course = C.course AND C.lecturer=‘Eran’ Ahuva Alon 32 Query Answering Grades Courses student course grade course lecturer Ahuva PL 90 PL Eran Alon PL 86 DC Keren Alon PL 81 Database D Functional Dependency: student, course grade Integrity Constraints Σ SELECT student FROM Grades G, Courses C WHERE G.grade >= 87 AND G.course = C.course AND C.lecturer=‘Eran’ Ahuva Alon 33 Query Answering Grades Courses student course grade course lecturer Ahuva PL 90 PL Eran Alon PL 86 DC Keren Alon PL 81 Database D Functional Dependency: student, course grade Integrity Constraints Σ SELECT student FROM Grades G, Courses C WHERE G.grade >= 80 AND G.course = C.course AND C.lecturer=‘Eran’ Ahuva Alon 34 Minimal Repairs [Arenas, Bertossi, Chomicki 99]: DEFINITION: Let (D,Σ) be an inconsistent DB. A repair is a DB D', such that: 1. DB D' is consistent (with respect to Σ) 2. DB D' differs from D in a “minimal way” Grades Grades student course grade Ahuva PL 90 Alon PL 86 Alon PL 81 Inconsistent database D student course grade Ahuva PL 90 Alon PL 86 Repair D'1 Grades student course grade Ahuva PL 90 Alon PL 81 Repair D'2 35 Semantics of Query Answering Inconsistent DB Repairs (consistent DBs) 36 Semantics of Query Answering Inconsistent DB Repairs (consistent DBs) 37 Semantics of Query Answering Inconsistent DB Consistent Answers Repairs (consistent DBs) 38 Algorithms / Complexity Koutris & Wijsen [2015]: For consistent query answering with key constraints, we know how Select-Project-Join (SPJ) w/o self joins are classified into 3 categories: 1. Inconsistent DB 2. 3. Inconsistent DB coNP-complete (exptime under standard complexity assumptions) ignore inconsistency Rewriting Graph algorithm 39 Incorporating Preferences Functional dependencies: course lecturer lecturer course Courses course lecturer DB Keren DC Keren DC Eran DB Eran What if we trust some tuples more than others? Staworko, Chomicki, Marcinkowski: Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64(2-3): 40 Lecture 1: Introduction PROBABILISTIC DATABASES 41 How to accommodate the probabilistic nature of data at the database & query level? Student University Ahuva Technion Alon Technion Employee Employer Role Eng Ahuva Intel PM VP HaifaU Alon Yahoo! Eng Google Eng Intel PM • Find the students that are employed as engineers • How many students work at Intel? • Is any PM a Technion student? 42 How to accommodate the probabilistic nature of data at the database & query level? Student University Pr Ahuva Technion 1.0 Technion 0.7 HaifaU 0.3 Alon Employee Ahuva Alon Employer Role Pr Eng 0.7 PM 0.2 VP 0.1 Yahoo! Eng 0.4 Google Eng 0.4 Intel PM 0.1 Intel • Find the students that are employed as engineers - Ahuva (0.7), Alon (0.8) • How many students work at Intel? - Expectation = 1 + 0.1 • Is any PM a Technion student? - Yes w/ prob 1-((1-0.2)*(1-0.7*0.1)) 43 Semantics Probabilistic DB p1 p2 p3 p4 pn Space of ordinary DBs 44 Semantics of Query Answering Probabilistic Database p1 p2 p3 p4 pn Space of ordinary DBs 45 Semantics of Query Answering Probabilistic Database p1 p1 p2 p2 p3 p3 p4 p4 pn pn Space of ordinary DBs 46 Semantics of Query Answering Probabilistic Database p1 p1 p2 p2 p3 p3 p4 Rep of the probability space Mapping tuple marginal probability p4 pn pn Space of ordinary DBs 47 Algorithms for Query Answering • Dalvi & Suciu dichotomy: SPJ queries can be fully classified into: – Queries that can be solved in polynomial time • By repeated decomposition into simpler queries – Queries for which answering is #P-hard • Hence, cannot be computed in polynomial time under standard complexity assumptions • Heuristic via BDDs [Olteanu+] • Guaranteed approximation via sampling – Additive approx. p±𝜀 is simple – Multiplicative approx. (1±𝜀)p requires more work 48 Probabilistic XML university department 0 .8 0.9 position name position Paul 3 0. chair f. prof a. prof 0.5 0. 6 name ph.d. studs Nicole 4 0. 0.8 0.5 0. 7 member 0.6 0.7 member chair f. prof a. prof name name name David Amy Emily [Abiteboul, Kimelfeld, Sagiv, Senellart]: Representation systems and XPath evaluation 49 Lecture 1: Introduction PLANNED SCHEDULE 50 1 2 3 4 5 6 7 8 9 10 11 12 13 14/3 No Lecture 20/3 Intro 21/3 DB Essentials 28/3 Incompleteness 4/4 Data Exchange 11/4 No Lecture 18/4 Inconsistent DBs Assignment 1 due 25/04 Passover 02/05 Consistent Q Answering 9/05 Consistent Q Answering 16/05 Pref. Repairs + Misc Assignment 2 due 23/05 Probabilistic DB 30/05 Query Inference Assignment 3 due 06/06 Query Inference 13/06 Probabilistic DB Extras 20/06 Perspectives Assignment 4 due 51