TDD: Topics in Distributed Databases (Querying and Cleaning Big Data)
Wenfei Fan, University of Edinburgh

What is big data?

Big data: What is it anyway?
Everyone talks about big data. But what is it?
• Volume: horrendously large — PB (10^15 B), EB (10^18 B)
• Variety: heterogeneous, semi-structured or unstructured
  – 9:1 ratio of unstructured vs. structured data
  – collecting 95% of restaurants requires at least 5000 sources
• Velocity: dynamic — think of the Web and Facebook, …
• Veracity: trust in its quality — real-life data is typically dirty!
cf. Online Ordering of Overlapping Data Sources, Mariam Salloum, Xin Luna Dong, Divesh Srivastava, Vassilis J. Tsotras, PVLDB 7(3), 2013
A departure from our familiar data management!

Why is the data so big?
• Worldwide information volume is growing annually at a minimum rate of 59% (Gartner, 2011)
• A single jet engine produces 20TB (10^12 B) of data per hour
• Facebook has 1.38 billion users, 140 billion links, and about 300 PB of data
• Genome of a human: sampling, biochemistry, immunology, imaging, genetic, phenotypic data
  – 1 person: 1PB (10^15 B); 1000 people: 1EB (10^18 B); 1 billion people: 1ZB (10^21 B)
Big data is a relative notion: 1TB is already too big for your laptop

Why do we care about big data?
Example: Medicare
• Google Flu Trends (Nature, 2009):
  – advance indication in the 2007-08 flu season
  – the 2009 H1N1 outbreak
• IBM: predict heart disease through big data analytics
  – traditional: EKGs, heart rate, blood pressure
  – big data analysis: connecting exercise and fitness tests, diet, fat and muscle composition, genetics and environment, social media and wellness (shared information), …
A new game: a large number of data sources of big volume

Big data is needed everywhere
• Social media marketing:
  – 78% of consumers trust peer (friend, colleague and family member) recommendations; only 14% trust ads
  – if three close friends of person X like items P and W, and X also likes P, then chances are that X likes W too
• Social event monitoring:
  – preventing terrorist attacks
  – The Net Project, Shenzhen, China (Audaque)
• Scientific research:
  – a new, more effective way to develop theory, by exploring and discovering correlations among seemingly disconnected factors
The world is becoming data-driven, like it or not!

The big data market is BIG
• US health care: increase industry value per year by $300B
• US retail: increase net margin by 60+%
• Manufacturing: decrease development and assembly costs by 50%
• Global personal location data: increase service provider revenue by $100B
• Europe public sector administration: increase industry value per year by 250B Euro
(McKinsey Global Institute, May 2011, Big Data: The next frontier for innovation, competition and productivity)

Why study big data? Want to find a job?
• Research and development of big data systems: ETL, distributed systems (e.g., Hadoop), visualization tools, data warehouses, OLAP, data integration, data quality control, …
• Big data applications: social marketing, healthcare, …
• Data analysis: getting value out of big data — discovering and applying patterns, predictive analysis, business intelligence, query answering, data quality, security, privacy, …
Prepare you for:
• graduate study: current research and practical issues (complexity theory, distributed databases, algorithms)
• the job market: skills/knowledge in need
Big data = Big $$$

What challenges are introduced by big data?

Big data: through the eyes of computation
• Computer science is about the computation of a function f(x)
• Big data: the data parameter x is horrendously large — PB or EB
• What challenge does this introduce to query answering?
Fallacies:
• Big data introduces no fundamental problems
• Big data = MapReduce (Hadoop)
• Big data = data quantity (scalability)
Are these true?

Flashback: relational queries
Questions:
• What is a relational schema? A relation? A relational database?
• What is a query? What is relational algebra? What does relational completeness mean?
• What is a conjunctive query?
[diagram: queries and updates go to the DBMS, which stores data in and returns answers from the database DB]
The bible for database researchers: Foundations of Databases

Traditional database management systems
• A database is a collection of data, typically containing the information about one or more related organizations.
• A database management system (DBMS) is a software package designed to store and manage databases.
• Database: local — single memory, disk
• DBMS: centralized; single processor (CPU); managing local databases
[diagram: queries and updates flow through a centralized DBMS that stores the data in DB]

Facebook: Graph Search
Find me restaurants in New York my friends have been to in 2013
• friend(pid1, pid2)
• person(pid, name, city)
• dine(pid, rid, dd, mm, yy)
SQL query (in fact a conjunctive query, or an SPC query):

select rid
from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid
  and city = NYC and yy = 2013

Facebook: more than 1.38 billion nodes, over 140 billion links. Is it feasible on big data?

Example queries: graph pattern matching
Input: a pattern graph Q and a graph G
Output: all the matches of Q in G, i.e., all subgraphs of G that are isomorphic to Q
— a bijective function f on nodes: (u, u') ∈ Q iff (f(u), f(u')) ∈ G
Applications:
• pattern recognition
• intelligence analysis
• transportation network analysis
• Web site classification
• social position detection
• user-targeted advertising
• knowledge base disambiguation
• …
What other graph queries do you know?

Graph pattern matching
Find all matches of a pattern in a graph, e.g., identify suspects in a drug ring
[figure: a pattern graph with nodes B, A1…Am, S and W (a drug ring) matched against a large graph; "Understanding the structure of drug trafficking organizations"]
Is this feasible? Facebook: more than 1.38 billion nodes, and over 140 billion links

Querying big data: new challenges
Given a query Q and a dataset D, compute Q(D)
• traditional database: Q(D)
• big data: Q(D), where D is PB or EB
What are the new challenges introduced by querying big data?
Does querying big data introduce new fundamental problems?
What new methodology do we need to cope with the sheer size of big data D? Why?
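The matching semantics above — a bijective f on nodes with (u, u') ∈ Q iff (f(u), f(u')) ∈ G — can be sketched with a brute-force matcher. The encoding (node lists, edge sets) and names are illustrative, not from the lecture; the point is that the enumeration is exponential, which is exactly why matching is hard on big graphs.

```python
from itertools import permutations, product

def pattern_matches(q_nodes, q_edges, g_nodes, g_edges):
    """Enumerate every injective mapping f of pattern nodes into graph nodes,
    keeping those where (u, u') is an edge of Q iff (f(u), f(u')) is an edge
    of G — the matching semantics on the slide. Deliberately brute force:
    the decision problem is NP-complete."""
    matches = []
    for image in permutations(g_nodes, len(q_nodes)):
        f = dict(zip(q_nodes, image))
        if all(((u, v) in q_edges) == ((f[u], f[v]) in g_edges)
               for u, v in product(q_nodes, repeat=2) if u != v):
            matches.append(f)
    return matches

# Toy example: a single pattern edge matched against a 3-node path 1 -> 2 -> 3.
found = pattern_matches(["a", "b"], {("a", "b")}, [1, 2, 3], {(1, 2), (2, 3)})
```

On the toy path there are two matches, a↦1, b↦2 and a↦2, b↦3; on a Facebook-sized graph the permutation space is astronomically larger, which motivates the feasibility question above.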
A departure from classical theory and traditional techniques

The good, the bad and the ugly
Traditional computational complexity theory of almost 50 years:
• The good: polynomial-time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable, …
What happens when it comes to big data? How long does it take?
Using an SSD of 6GB/s, a linear scan of a dataset D would take:
• 1.9 days when D is of 1PB (10^15 B)
• 5.28 years when D is of 1EB (10^18 B)
(What query is this?)
O(n) time is already beyond reach on big data in practice!
Polynomial-time queries become intractable on big data!

Tractability revisited for big data
[diagram: parallel polylog time inside P, inside NP and beyond; BD-tractable vs. not BD-tractable queries]
Yes, querying big data comes with new and hard fundamental problems
BD-tractable queries: properly contained in P unless P = NC

Challenges: query evaluation is costly
Graph pattern matching by subgraph isomorphism:
• NP-complete to decide whether there exists a match
• possibly exponentially many matches
Membership problem for relational queries — what is the complexity?
Input: a query Q, a database D, and a tuple t
Question: is t in Q(D)?
• NP-complete if Q is a conjunctive query (SPC)
• PSPACE-complete if Q is in relational algebra (SQL)
Intractable even in traditional complexity theory — already beyond reach in practice when the data is not very big

Is it still feasible to query big data?
Can we do better if we are given more resources? Parallel and distributed query processing — TDD
Using 10,000 SSDs of 6GB/s, a linear scan of D might take:
• 1.9 days / 10000 = 16 seconds when D is of 1PB (10^15 B)
• 5.28 years / 10000 = 4.63 days when D is of 1EB (10^18 B)
Only ideally!
[diagram: 10,000 processors P, each with memory M and a database fragment DB, connected by an interconnection network]
Yes, parallel query processing. But how?
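The scan-time figures above are back-of-envelope arithmetic; a minimal sketch reproducing them, using the bandwidth and sizes stated on the slides:

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

def scan_seconds(num_bytes, bandwidth=6e9, drives=1):
    # Time for `drives` SSDs at `bandwidth` bytes/s, scanning disjoint
    # fragments in parallel (the idealized case on the slide).
    return num_bytes / (bandwidth * drives)

pb_days = scan_seconds(1e15) / SECONDS_PER_DAY        # ~1.9 days for 1 PB
eb_years = scan_seconds(1e18) / SECONDS_PER_YEAR      # ~5.3 years for 1 EB
pb_parallel = scan_seconds(1e15, drives=10_000)       # ~16.7 s with 10,000 SSDs
```

The `drives` parameter captures the ideal linear speedup assumed on the slide; in practice skew, coordination, and interconnect costs keep real systems well below it.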
The two sides of a coin
Data = quantity + quality
When we talk about big data, we typically mean its quantity:
• What capacity does a system provide to cope with the sheer size of the data?
• Is a query feasible on big data within our available resources?
• How can we make our queries tractable on big data?
• …
But can we trust the answers to our queries?
Dirty data routinely leads to misleading financial reports and strategic business planning decisions; loss of revenue, credibility and customers; disastrous consequences
The study of data quality is as important as that of data quantity

Data consistency

FN      LN      address      AC    city
Mary    Smith   2 Small St   908   NYC
Mary    Dupont  10 Elm St    610   PHI
Mary    Dupont  6 Main St    212   NYC
Bob     Luth    8 Cowan St   215   PHI
Robert  Luth    6 Drum St    212   NYC

Q1: how many employees are in the NY office?
3 may not be the correct answer: the AC and city in the first tuple are inconsistent!
Error rates: 10%-75% (telecommunication)

Information completeness
(same table as above)
Q2: how many distinct employees have first name Mary?
3 may not be the correct answer:
• the first three tuples may refer to the same person
• the information may be incomplete
"Information perceived as being needed for clinical decisions was unavailable 13.6%-81% of the time" (2005)

Data currency

FN      LN      address      salary  status
Mary    Smith   2 Small St   50k     single
Mary    Dupont  10 Elm St    50k     married
Mary    Dupont  6 Main St    80k     married
Bob     Luth    8 Cowan St   80k     married
Robert  Luth    6 Drum St    55k     married

Entities: Mary and Robert — consistent, complete, and each tuple once correct
Q3: what is Mary's current salary?
80k: in the real world, salary is monotonically increasing
"In a customer file, within two years about 50% of records may become obsolete" (2002)

Data fusion
(same table as above)
Q4: what is Mary's current last name?
Dupont. In real life:
• marital status only changes from single → married → divorced
• tuples with the most current marital status also have the most current last name
Deduce the true values of an entity

Data in real life is often dirty
• The Pentagon asked 200+ dead officers to re-enlist
• 81 million National Insurance numbers, but only 60 million eligible citizens
• 98,000 deaths each year caused by errors in medical data
• 500,000 dead people retain active Medicare cards
• Data error rates in industry: 1%-30% (Redman, 1998)
Dirty data: inconsistent, inaccurate, incomplete, stale

Dirty data is costly
• Poor data costs US businesses $611 billion annually
• Erroneously priced data in retail databases costs US customers $2.5 billion each year (2000)
• 1/3 of system development projects were forced to delay or cancel due to poor data quality (2001)
• 30%-80% of the development time and budget for data warehousing goes to data cleaning (1998)
• CIA: dirty data about WMD in Iraq!
Can we trust answers to our queries on dirty data?
The scale of the data quality problem is far worse on big data!

What does this course cover?
Big data = quantity + quality: Volume (quantity) and Veracity (quality)

Basic topic 1: parallel database management systems
Recall traditional DBMS:
• Database: "single" memory, disk
• DBMS: centralized; single processor (CPU)
Can we do better provided with multiple processors?
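A toy sketch of what multiple processors buy for a scan — the shared-nothing idea developed in the coming topics: partition the data into fragments, let each "processor" scan only its own fragment, then union the partial answers. The function and fragment scheme below are invented for illustration.

```python
def parallel_select(records, predicate, n_fragments=4):
    # Shared-nothing sketch: round-robin partition the records, scan each
    # fragment independently (each touches ~1/n of the data), then union
    # the partial answers.
    fragments = [records[i::n_fragments] for i in range(n_fragments)]
    partial_answers = [[r for r in frag if predicate(r)] for frag in fragments]
    return [r for part in partial_answers for r in part]

# Selection over 100 records, split across 4 simulated processors.
answer = parallel_select(list(range(100)), lambda x: x % 2 == 0)
```

The result is the same as a sequential scan; only the work per processor shrinks — the "only ideally" caveat from the scan-time slide applies here too.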
Parallel DBMS: exploiting parallelism
• improved performance
• reliability and availability
[diagram: processors P, each with memory M and a database fragment DB, connected by an interconnection network]

Basic topic 2: distributed databases
Data is stored at several sites, each with an independent DBMS
• local ownership: data physically stored across different sites
• increased availability and reliability
• performance
[diagram: queries against a global schema are answered over local schemas; each site runs its own DBMS over its own DB, connected by a network]

Advanced topic 1: MapReduce
A programming model with two primitive functions:
• Map: <k1, v1> → list(k2, v2)
• Reduce: <k2, list(v2)> → list(k3, v3)
Connection between MapReduce and parallel query processing
Other parallel programming models:
• BSP (Bulk Synchronous Parallel)
• vertex-centric
• partial evaluation
Applications in cloud computing

Advanced topic 2: querying big data
Foundations for querying big data:
• tractability revised for querying big data
• parallel scalability
• bounded evaluability of queries
Techniques for querying big data:
• developing parallel algorithms for querying big data
• bounded evaluability and access constraints
• query-preserving compression
• query answering using views
• bounded incremental query processing
Querying big data: theory and practice

Advanced topic 3: data quality management
Big data = quantity + quality!
Central issues for data quality:
• Object identification (data fusion): do two objects refer to the same real-world entity? What is the true value of the entity?
• Data consistency: do our data values have conflicts?
• Data accuracy: is one value more accurate than another for a real-world entity?
• Data currency: is our data out of date?
• Information completeness: does D have enough information to answer our queries?
TDD: the Veracity of big data — make our data consistent, accurate, complete and up to date!
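As a taste of rule-based error detection, the AC/city conflict behind Q1 can be caught mechanically by a constant rule of the form "AC → city", in the spirit of the conditional dependencies treated next. The rule table below is hypothetical — in particular, the 908 → MH pairing is an assumption for illustration (908 is not a New York City area code).

```python
# Hypothetical constant rules "area code determines city"; not from the lecture.
ac_to_city = {"908": "MH", "212": "NYC", "610": "PHI", "215": "PHI"}

employees = [
    {"FN": "Mary",   "LN": "Smith",  "AC": "908", "city": "NYC"},
    {"FN": "Mary",   "LN": "Dupont", "AC": "610", "city": "PHI"},
    {"FN": "Mary",   "LN": "Dupont", "AC": "212", "city": "NYC"},
    {"FN": "Bob",    "LN": "Luth",   "AC": "215", "city": "PHI"},
    {"FN": "Robert", "LN": "Luth",   "AC": "212", "city": "NYC"},
]

def violations(tuples, rules):
    # Flag every tuple whose city disagrees with the rule for its area code.
    return [t for t in tuples if t["AC"] in rules and rules[t["AC"]] != t["city"]]

dirty = violations(employees, ac_to_city)  # flags only the first tuple
```

Given these rules, only the (908, NYC) tuple is flagged — the same tuple the consistency slide singles out.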
Advanced topic 4: dependencies as data quality rules
Data quality rules:
• conditional (functional and inclusion) dependencies, to capture data inconsistencies
• matching dependencies, for record matching
(Data consistency: do our data values have conflicts?)
There are also quality rules for data accuracy, data currency and information completeness — in the textbook
A revision of classical dependencies
Fundamental problems for data quality rules:
• consistency: are the data quality rules "dirty" themselves?
• implication: can we optimize the rules by removing redundant ones?
A uniform logic framework for improving data quality

Advanced topic 5: data cleaning
[diagram: a cycle — discover rules → detect errors → repair, supported by reasoning]
• discover data quality rules
• validate the rules discovered
• detect errors with rules
• repair data with rules
• certain fixes
• deduce the true values of entities
Semi-automated systems for improving data quality

Putting it together
Basic technology:
• Parallel DBMS: architectures, data partitioning, (intra/inter-)operator parallelism, parallel query processing and optimization
• Distributed DBMS: architectures, fragmentation, replication
Advanced topics:
• Big data: the Volume (quantity)
  – MapReduce and other parallel programming models
  – querying big data: theory and practice
• Big data: the Veracity (quality)
  – central issues for data quality
  – dependencies as data quality rules
  – cleaning data: rule discovery, rule validation, error detection, data repairing, certain fixes
• Variety (entity resolution, conflict resolution)
• Velocity (incremental computation)

Prerequisites
• relational query processing, relational algebra/SQL
• basic complexity and algorithmic background (e.g., NP, undecidability)

Course format

Basic information
Web site: http://homepages.inf.ed.ac.uk/wenfei/tdd/home.html
– syllabus
– announcements
– lecture notes
– deadlines
TA: Chao Tian – [email protected]
Office hours: Informatics Forum 5.23, 11:00-12:00, Thursday

Course format
Seminar course: there will be no exam!
– Lectures: background. http://homepages.inf.ed.ac.uk/wenfei/tdd/lecture/lecture-notes.html
– Textbooks:
  • R. Ramakrishnan and J. Gehrke. Database Management Systems. WCB/McGraw-Hill, 2003 (3rd edition), Chapter 22
  • A. Silberschatz, H. Korth and S. Sudarshan. Database System Concepts, 4th edition, Part 6 (Parallel and Distributed Database Systems)
  • W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool, 2012 (Chapters 1-4; e-copy available upon request)
– Research papers or chapters related to the topics (3-4 each, at the end of ln3-ln8)

Grading
• Reviews of research papers (8 in total, down from 12 in 2012): 40%
• Project (report): 45%
• Project presentation: 15%
Homework: four sets of homework (the reviews), starting from week 4; deadlines:
• 9am, Thursday, February 5, week 4
• 9am, Thursday, February 19, week 6
• 9am, Thursday, March 5, week 8
• 9am, Thursday, March 19, week 10
– Papers: choose two each time (two reviews) — not chapters
– 5% for each paper, i.e., 10% for each homework set

Review evaluation
Pick 2 research papers each time from the lecture notes to be covered in the next two weeks, starting from week 4.
Write a one-page review for each of the papers (10 marks):
• Summary (2 marks):
  – a clear problem statement: input, question/output
  – the need for this line of research: motivation
  – a summary of key ideas, techniques and contributions
• Evaluation (5 marks):
  – criteria for the line of research (e.g., expressive power, complexity, accuracy, scalability)
  – evaluation based on your criteria, with justification: 3 strong points, 3 weak points
• Suggested possible extensions (3 marks)

Project – research and development (recommended)
Research and development:
– Topic: pick one from the lecture notes (ln3-ln8). Example: a MapReduce algorithm for graph simulation
Development:
– pick a research paper from the reading list of ln3-ln8
– implement its main algorithms
– conduct its experimental study
You are encouraged to come up with your own project — talk to me first
Multiple people may work on the same project independently
Start early!

Grading – design and development
Distribution:
– algorithms: technical depth, performance guarantees: 20%
– proofs of correctness, complexity analysis and performance guarantees of your algorithms: 15%
– justification (experimental evaluation): 10%
Report: in the form of a technical report/research paper:
– introduction: problem statement, motivation
– related work: survey
– techniques; algorithms, illustrated via intuitive examples
– correctness/complexity/property proofs
– experimental evaluation
– possible extensions

Project – survey
Topic: pick one topic from a lecture note (ln3-ln8). Example: techniques for conflict resolution
Sample survey: A Brief Survey of Automatic Methods for Author Name Disambiguation — find and download it from Google
Distribution:
– select 5-6 representative papers, independently: 10%
– develop a set of criteria: the most important issues in that line of research, based on your own understanding; justify your criteria: 10%
– evaluate each of the papers based on your criteria: 15%
– a table to summarize the assessment;
based on your criteria, draw and justify your conclusion and recommendation for various applications: 10%
Your understanding of the topic

Project report and presentation – 15%
• a clear problem statement
• motivation and challenges
• key ideas, techniques/approaches
• key results — what you have got, with intuitive examples
• findings/recommendations for different applications
• demonstration: a must if you do a development project
Presentation: question handling (show that you have developed a good understanding of the line of work)
Learn how to present your work

Summary and review
• What is big data? What are the Volume, Variety, Velocity and Veracity of big data?
• Why do we care about big data?
• Are there fundamental challenges introduced by querying big data?
• Why study data quality? What are data consistency, information completeness, data currency, data accuracy, and object identification?

Reading list
For next week (parallel databases), before the next lecture:
– Database Management Systems, 2nd edition, R. Ramakrishnan and J. Gehrke, Chapter 22
– Database System Concepts, 4th edition, A. Silberschatz, H. Korth and S. Sudarshan, Part 6 (Parallel and Distributed Database Systems)
About relational databases:
– Foundations of Databases, S. Abiteboul, R. Hull and V. Vianu
About big data:
– W. Fan and J. Huai. Querying Big Data: Theory and Practice. JCST 2014. http://homepages.inf.ed.ac.uk/wenfei/papers/JCST14.pdf