CS 440 Database Management Systems
Course overview

Welcome to CS440!
• Instructor: Arash Termehchy
• Assistant Professor at EECS
• Research on data management and analytics
• Information & Data Management and Analytics (IDEA) Lab

The Era of Big Data
• Technological shifts, e.g., mobile devices, have created a staggering number of enormous data sets.
• Both opportunities and challenges.

Opportunities: unreasonable effectiveness of data
• A. Halevy, et al. The unreasonable effectiveness of data, IEEE Intelligent Systems, 2009.
• Observation from working with large datasets at Google.
  – More data generally outperforms complex statistical models in data-centric prediction and discovery.
• Conclusion:
  – Usually, no need for overly complex statistical models.

Opportunities are priceless! The story of John Snow
“In the mid-1850s, Dr. John Snow plotted cholera deaths on a map, and in the corner of a particularly hard-hit buildings was a water pump. A 19th-century version of Big Data, which suggested an association between cholera and the water pump.”
Integrating data sets has saved millions of lives!

Paradigm shifting influence on scientific discovery
• “The Fourth Paradigm: Data-Intensive Scientific Discovery”, Jim Gray
  – Empirical
  – Theoretical
  – Computational
  – Data-centric
• Sloan Sky Server database is a top cited resource in the field of astronomy.
  – Astronomical observation => database query

Challenges: data volume
• Sloan Sky Server will soon store 30 terabytes per day.
• The Large Hadron Collider can generate 500 exabytes per day.
• 90% of the world's data was generated in the last two years (2013)
  – Every two years: ten times more data

Challenges: data variety/diversity
• Database systems used to deal with a single static database.
• Need to transform and/or integrate a large number of evolving data sets.
• Impossible to do manually.
“A data integration expert is never without a job”

Challenges: usability
“… (in the next few years) we project a need for 1.5 million additional analysts in the United States who can analyze data effectively …”
-- McKinsey Big Data Study, 2012
Current systems are not built for scientists and normal users.
“It may take a PhD in computer science to successfully deploy a data analytics algorithm!”

The notion of database management system (DBMS)
• Data processing used to be mostly ad-hoc programming.
• W. McGee, Generalization: Key to Successful Electronic Data Processing, Journal of the ACM, 1959.
• Generalization, aka abstraction/data modeling
  – File: a sequence of records.
  – Operation: sort, select part of the file, …
• Makes data management and processing usable.
  – People can learn and use the abstraction instead of developing new data processing programs.

Abstraction is the key
• How to develop usable abstractions for our data?
  – Data models, query languages
  – Relational data model, graph data model, …
• How to implement these abstractions efficiently?
  – Database system internals
  – Storage management, indexing, …

Topics
• How to develop usable abstractions for our data?
  – Relational data model
  – Graph data model
  – Database programming
• How to implement these abstractions efficiently?
  – Storage management and indexing
  – Query processing algorithms
  – Query optimization
  – Transaction management
  – Parallel and distributed data processing

Our plan
• Learn the fundamental concepts and ideas
  – Foundational models, algorithms, and systems.
  – Textbooks, resources, and lectures.
• Apply them to new problems
  – Apply the lessons learned to interesting database problems.
  – By doing assignments.

Learning the fundamentals: Lectures
• Review and discuss the material.
• Slides will be available on the course website after the class.
• Provide the road map for studying
  – The course material can seem overwhelming.
• Attendance is not required but encouraged.
• Read the course material before the class.
• Participate and ask questions!

Learning the fundamentals: Readings
• Textbooks:
  – Database Management Systems, 3rd edition, R. Ramakrishnan and J. Gehrke.
    • Cow book
  – Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman.
    • Free online
  – Papers for newer material: posted on the course website.

Learning the fundamentals: Readings
• Recommended
  – Database Systems: The Complete Book, 2nd edition, Hector Garcia-Molina, Jeffrey Ullman, and Jennifer Widom.
    • The complete book
  – Foundations of Databases, Serge Abiteboul, Richard Hull, Victor Vianu
    • Alice book

Learning the fundamentals: Exam
• Midterm exam in class.
  – Closed books and notes
  – Tests your knowledge of the subjects discussed in the class.
  – 40% of the overall grade
  – In class
• No final exam

Apply your understanding: assignments
• Seven assignments
• Announced on Piazza and the course website; posted on the course website.
• Both written and programming.
• Submit using TEACH.
• Write using a word processor and submit as PDF.
• Start early!
• 60% of the overall grade

How to get the most out of the course?
• Communicate with the course staff
  – TAs: Vahid Ghadakchi, Parisa Ataie
  – Piazza
    • Preferred method of communication
  – Office hours
    • Arash: Tuesday 4:30 – 5:30 pm
    • Vahid: Monday/Wednesday 4 – 5 pm
    • Parisa: Monday 9 – 10 am
  – Email the staff for other types of questions
    • Use the [cs440] tag in the subject line.
• Communicate with your peers on course materials and lectures.
• Check Piazza and the course website for announcements or possible changes in the schedule.

What is next?
• A review of the relational model, relational algebra, and SQL.
• You will refresh your memory by working on some advanced problems on the relational model and database design.
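
As a small preview of the review topics, the sketch below expresses one query twice: once as relational algebra (a selection followed by a projection) and once as SQL. It is only an illustration, not course material: the Student relation, its columns, its rows, and the helper names select, project, names_of_seniors_algebra, and names_of_seniors_sql are all hypothetical. It assumes Python with the standard sqlite3 module.

```python
import sqlite3

# Hypothetical relation Student(sid, name, year); every value here is made up.
STUDENTS = [
    (1, "Ada", 3),
    (2, "Grace", 4),
    (3, "Edgar", 4),
]


def select(rows, predicate):
    """Relational-algebra selection: keep the tuples that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, indexes):
    """Relational-algebra projection: keep the chosen columns, dropping duplicates.

    The result is sorted only to make the output deterministic; a relation
    itself is an unordered set of tuples.
    """
    return sorted({tuple(row[i] for i in indexes) for row in rows})


def names_of_seniors_algebra():
    # project_{name}( select_{year = 4}( Student ) )
    return project(select(STUDENTS, lambda row: row[2] == 4), [1])


def names_of_seniors_sql():
    # The same query in SQL, run against an in-memory SQLite database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO Student VALUES (?, ?, ?)", STUDENTS)
    rows = conn.execute(
        "SELECT DISTINCT name FROM Student WHERE year = 4 ORDER BY name"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(names_of_seniors_algebra())
    print(names_of_seniors_sql())
```

Both functions return the same list of one-column tuples. The algebra version spells out how the answer is computed step by step, while the SQL version only states what is wanted and leaves the evaluation strategy to the database system.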