CPT-S 580-06 Advanced Databases
Yinghui Wu
EME 49

Welcome!
Instructor: Yinghui Wu
Office: EME 49
Email: [email protected]
Office hour: Wed/Fri (1PM to 3PM) or by appointment
Course website: TBF

Motivation
This course lets you get familiar with current developments in database research.
We discuss problems/topics in advanced database research, and introduce solutions that are
– new and currently making their way into database management systems and applications
– not yet fully developed, including open problems
Outcome: possible starting points for your research project, master's or PhD thesis, or technical report
– Gets you prepared for positions in academia and industry

Database Management Systems
Database concepts
– Database
  • A database represents some aspect of the real world.
  • A database is a logically coherent collection of data with some inherent meaning.
  • A database is designed, built, and populated with data for a specific purpose.
  • It has an intended group of users and some preconceived applications in which these users are interested.
– Database Management System
  • A database management system (DBMS) is a collection of programs that enables users to Define, Construct, Manipulate and Share a database.
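
As a rough illustration of the Define, Construct, and Manipulate functions named above, the following minimal sketch uses Python's built-in sqlite3 module with an in-memory database. The `student` table, its columns, and the sample rows are hypothetical and not part of the course material. Share (simultaneous multi-user access) is not shown.

```python
import sqlite3

# In-memory database; the table and rows below are hypothetical examples.
conn = sqlite3.connect(":memory:")

# Define: declare the structure and data types.
conn.execute(
    "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL, grade REAL)"
)

# Construct: load the initial database contents.
conn.executemany(
    "INSERT INTO student (id, name, grade) VALUES (?, ?, ?)",
    [(1, "Ada", 3.9), (2, "Edgar", 3.4), (3, "Grace", 3.7)],
)

# Manipulate (retrieval): run a query.
top_students = conn.execute(
    "SELECT name FROM student WHERE grade >= ? ORDER BY name", (3.7,)
).fetchall()

# Manipulate (modification): update one row and delete another.
conn.execute("UPDATE student SET grade = ? WHERE id = ?", (3.6, 2))
conn.execute("DELETE FROM student WHERE id = ?", (3,))
conn.commit()

remaining = conn.execute(
    "SELECT id, name, grade FROM student ORDER BY id"
).fetchall()
```

Here `top_students` holds the names returned by the query, and `remaining` holds the rows left after the update and the delete.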

Typical DBMS Functionality
Define a particular database in terms of its data types, structures, and constraints
Construct or Load the initial database contents on a secondary storage medium
Manipulate the database:
– Retrieval: querying, generating reports
– Modification: insertions, deletions and updates to its content
– Accessing the database through Web applications
Share a database: allow multiple users and programs to access the database simultaneously

Database Management System (DBMS)
A DBMS contains information about a particular enterprise
– Collection of interrelated data
– Set of programs to access the data
– An environment that is both convenient and efficient to use
Database applications:
– Banking: transactions
– Airlines: reservations, schedules
– Universities: registration, grades
– Sales: customers, products, purchases
– Online retailers: order tracking, customized recommendations
– Manufacturing: production, inventory, orders, supply chain
– Internet of Things
– Human resources: employee records, salaries, tax deductions
– GIS; scientific computing
Databases can be large and of any complexity.
Databases touch all aspects of our lives.

File systems to manage data?
Data redundancy and inconsistency
– Multiple file formats, duplication of information in different files
Difficulty in accessing data
– Need to write a new program to carry out each new task
Data isolation
– Multiple files and formats
Integrity problems
– Integrity constraints (e.g., account balance > 0) become "buried" in program code rather than being stated explicitly
– Hard to add new constraints or change existing ones
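
To make the integrity-constraint point concrete: in a DBMS the rule can be stated once in the schema instead of being repeated in every program that touches the data. The sketch below is only an illustration, assuming Python's built-in sqlite3 module and a hypothetical `account` table and `withdraw` helper; the example rule from above (account balance > 0) is written as a CHECK constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The rule "balance > 0" is declared in the schema of a hypothetical table,
# so the DBMS enforces it for every program that modifies the data.
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance REAL NOT NULL CHECK (balance > 0))"
)
conn.execute("INSERT INTO account (id, balance) VALUES (1, 100.0)")
conn.commit()


def withdraw(conn, account_id, amount):
    """Hypothetical helper: True on success, False if the constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = withdraw(conn, 1, 50.0)        # leaves 50.0, which satisfies the rule
rejected = withdraw(conn, 1, 80.0)  # would leave -30.0, so the CHECK fails
balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
```

Changing the rule then means altering the schema in one place, rather than hunting for checks scattered through application code.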
Atomicity of updates
– Failures may leave the database in an inconsistent state with partial updates carried out
– Example: a transfer of funds from one account to another should either complete or not happen at all
Concurrent access by multiple users
– Concurrent access is needed for performance
– Uncontrolled concurrent accesses can lead to inconsistencies
  • Example: two people read a balance (say 100) and update it by withdrawing money (say 50 each) at the same time; both may write back 50, so one withdrawal is lost
Security problems
– Hard to provide user access to some, but not all, data
Database systems offer solutions to all the above problems.

Basic concepts in Database
Database concepts
– Data model
  • A collection of concepts that can be used to describe the structure of a database
– Schema
  • The description of a database is called the database schema, which is specified during database design and is expected not to change frequently
– The three-schema architecture
  • Internal schema
  • Conceptual schema
  • External schema
– Data independence: logical and physical
– Database languages
  • DBMS languages
– Database interfaces
Data modeling
– Conceptual
– Logical
– Physical
Database design
– Normalization

Databases: a classification
Logical organization of data
– Records-based database systems
– Object-oriented database systems
– Object-relational database systems
– Deductive/logic database systems
– Functional database systems
Physical organization of data
– Centralized database systems
– Distributed database systems
  • Homogeneous and heterogeneous
– Client-server database systems
– Mobile database systems
Contents
– Symbolic databases
– Textual databases
– Multi-media databases
– Image databases
– Spatial databases
– Temporal databases
Application domain
– Engineering databases
– Scientific databases
– Statistical databases
– Manufacturing databases
– Business
Data usage
– Operational databases
– Decision-support databases
– Data warehousing
– Data mining
– Tactical and planning databases
Nature of data
– Structured databases
– Semi-structured (like XML data)
– Unstructured (like Web)
Self modifiability
– Passive databases
– Active databases (Triggers)

Yesterday's DBMS Landscape
[Figure: Application ("Banking, SAP, …"), DBMS ("Server"), Database ("Disk")]

Yesterday's Data
– Structured data
– Centralized data
– Homogeneous data
– Small
– Cleaned
– Static

Yesterday's DBMS Hardware
– Small main memory
– Disk-based systems

Assumptions of yesterday's DBMSs
– Structured data with well-defined schema
– Capacity of main memory <1% of the stored data
– Central control to manage transactions
– Pipelining is always beneficial (no storage of intermediate results)
– Cleaned data
– A repository of data with simple query support

Today's DBMS Landscape
[Figure]

Today's DBMS Topics
[Figure]

Today's Data ("Big Data")
– Tables
– Temporal/streams
– Distributed data
– Networks
– Scientific data
– Web/text
– Natural language

Today's DBMS Hardware
– Large main memory
– Solid state disks
– Multi-core CPUs
– Co-processors
– GPUs

Today's DBMS information need
Traditional DB
– Users need to know the complete schema for querying
– Users must write SQL queries
– Database system does not help in searching
– Lacks semantic value
– Stores only facts
Intelligent DB
– Users need not know the complete database schema
– Users can simply use query expressions
– Provides help to make searching effective
– Semantic information is stored via knowledge graphs and other data structures in the database itself
– Stores facts and rules

Assumptions of yesterday's DBMSs
Yesterday
– Structured data with well-defined schema
– Capacity of main memory <1% of the stored data
– Preprocessed, cleaned data
– Pipelining is always beneficial (no storage of intermediate results)
– Central scheduler to schedule transactions
– A repository of data with simple query support
Today
  • Large-scale parallelism
  • Effective distributed processing, job scheduling and balancing
  • DB in main memory
  • Semi-structured/unstructured, schemaless data, ...
  • Manage dirty, noisy, incomplete data
  • Complex analytical queries; intelligent DBMS

Question of this course
What do we have to change in traditional DBMS to meet tomorrow's challenges?

Overview of topics
DBMS beyond relational databases (week 2-3)
– noSQL and newSQL
– In-situ processing
– Data stream management systems
Main-memory databases (week 4)
– Architecture and design principles
– Query and indexing strategy
Advanced query techniques (week 5-6)
– Indexing techniques
– Query optimization
– Approximate querying
Parallel and distributed databases (week 7-8)
– Partition techniques
– Parallel and distributed models
– Fault tolerance and concurrency control
– Distributed stream processing
Database and knowledge discovery (week 9-11)
– DBMS and IR
– DBMS and scalable DM/ML
– Intelligent DBMS
Data quality (week 12-13)
– Dirty data: issues and problems
– Data cleaning and repairing
Other new trends and applications (week 14-15)
– Crowdsourcing
– Data warehouse and datacenters
– Privacy and security

Course format
A seminar-style course: there will be no exam!
– Lectures: background
– Paper presentation
– 6 homework sets
– 1 final course project
– Textbook: no official textbooks
• References:
  • Database Systems: The Complete Book, Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom, 2008
  • Hadoop: The Definitive Guide, Tom White, O'Reilly
  • Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer (www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf)
  • Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han et al.
Online Tutorials and Papers
– Research papers or chapters related to the topics (3-4 each)
  • At the end of lecture notes from Ln2

Grading
Reviews and Presentation: 40%
Course Project: 45%
Final Project report and presentation: 15%
Homework: Six sets of homework, starting from week 2; deadlines:
• 5 pm, Thursday, 1/28, week 3
• 5 pm, Thursday, 2/16, week 5
• 5 pm, Thursday, 2/25, week 7
• 5 pm, Thursday, 3/22, week 10
• 5 pm, Thursday, 4/5, week 12
• 5 pm, Thursday, 4/14, week 14

Review Evaluation
Pick 2 research papers each time from the lecture notes, to be discussed in the next two weeks, starting from Week 2.
Write a one-page review for each of the papers (10 marks)
Summary: 2 marks
• A clear problem statement: input, question/output
• The need for this line of research: motivation
• A summary of key ideas, techniques and contributions
Evaluation: 5 marks
– Criteria for the line of research (e.g., expressive power, complexity, accuracy, scalability, etc.)
– Evaluation based on your criteria; justify your evaluation
  • 3 strong points
  • 3 weak points
Suggest possible extensions: 3 marks

Presentation Evaluation
Presentation (15 minutes + 2-3 minutes Q&A)
– Background and motivation
  • Challenges
  • Why the problem needs to be solved
– Problem formulation
– Algorithm description
– Experimental study
– Conclusion and future work

Project – Research and development (recommended)
Research and development:
– Topic: pick one from a suggested project list (will be published on the course website)
– Example: association rule mining over temporal networks
Development:
– Pick a research paper from the reading list of ln3 – ln11
– Implement its main algorithms
– Conduct its experimental study
You are encouraged to come up with your own project; talk to me first.
Start early!

Project – Research and development
Evaluation
Distribution:
– Algorithms: technical depth, performance guarantees 20%
– Prove the correctness, complexity analysis and performance guarantees of your algorithms 15%
– Justification (experimental evaluation) 10%
Report: in the form of a technical report/research paper
– Introduction: problem statement, motivation
– Related work: survey
– Techniques: algorithms, illustration via intuitive examples
– Correctness/complexity/property/proofs
– Experimental evaluation
– Possible extensions

Project – survey
Topic: pick one topic from a suggested list
Example: distributed graph query engines; temporal/streaming querying approaches.
Distribution:
– Select 5-6 representative papers, independently 10%
– Develop a set of criteria: the most important issues in that line of research, based on your own understanding; justify your criteria 10%
– Evaluate each of the papers based on your criteria 15%
– A table to summarize the assessment, based on your criteria; draw and justify your conclusion and recommendation for various applications 10%
Your understanding of the topic

Project report and presentation – 15%
– A clear problem statement
– Motivation and challenges
– Key ideas, techniques/approaches
– Key results: what you have got, intuitive examples
– Findings/recommendations for different applications
– Demonstration: a must if you do a development project
– Presentation: question handling (show that you have developed a good understanding of the line of work)
Learn how to present your work

Readings for this week and next week
Overview of traditional databases
– What goes around comes around, by Michael Stonebraker, https://mitpress.mit.edu/sites/default/files/titles/content/9780262693141_sch_0001.pdf
– Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke, http://pages.cs.wisc.edu/~dbbook/index.html
Preview of noSQL and newSQL
– "noSQL databases", by Christof Strauch, sections 1-3
– Scalable SQL and noSQL data stores, Rick Cattell, http://www.sigmod.org/publications/sigmodrecord/1012/pdfs/04.surveys.cattell.pdf