CSCI5570 Large Scale Data Processing Systems
Course Overview
Instructor: Prof. James Cheng

Course Webpage
• Check the course webpage regularly: http://www.cse.cuhk.edu.hk/~jcheng/5570.html
• Remark: I prefer to put the course webpage under my own directory to make off-campus access easier.

Topics Overview
Topic                                           Tentative Schedule
Distributed Data Analytics Systems              Weeks 1-3
Distributed Database Systems                    Weeks 4-5
NoSQL                                           Weeks 6-7
NewSQL                                          Weeks 8-9
Distributed Graph Processing Systems            Weeks 9-10
Distributed Data Storage Systems                Weeks 11-12
Distributed Stream Processing Systems           Weeks 12-13
Other Large Scale Data Processing Systems       ???

Distributed Data Analytics Systems
• Focus on state-of-the-art big data platforms that are widely adopted by industry (e.g., Hadoop, Spark) or best in research (e.g., Naiad, Husky)
• Fundamental concepts of big data analytics systems
• Applications (too ad hoc to teach them all, but you can try them out in the course project):
    • Data collection, data extraction, data cleaning …
    • Machine learning (e.g., classification, clustering, recommendation, feature selection, dimensionality reduction …)
    • OLAP, data cube
    • Data mining
    • Graph analytics (including social network analysis)
    • Similarity search (e.g., scalable locality sensitive hashing)

Distributed Database Systems
• Fundamental concepts of distributed database systems, a prerequisite to NoSQL and NewSQL as well as other distributed data processing systems
• Parallel query processing
• Distributed query processing

NoSQL/NewSQL
• Relational databases are the foundation of Western civilization, but now is the era of NoSQL databases
• NoSQL databases, such as MongoDB, Cassandra, CouchDB, etc., are rapidly taking large shares of the market from traditional vendors such as Oracle
• A must-learn for big data analytics
• NewSQL databases try to combine the pros of both traditional DBMS and NoSQL

Distributed Graph Processing Systems
• Graph data: web graphs, online social networks, mobile communication networks, financial networks, biological networks, neural networks …
• Distributed systems that make the analysis of these large scale graphs/networks possible
• Key techniques and algorithms for large scale graph data processing

Distributed Data Storage Systems
• How to store massive volumes of different types of data, retrieve them, and update them efficiently?
• How to handle consistency issues?
• How to handle availability issues?

Distributed Stream Processing Systems
• Streaming data has become common today, e.g., tweets, news feeds, …
• How to analyze such massive high-speed data in real time?
• Key techniques and applications

Reading List
• A list of papers for each topic (except for the older topics such as Relational Database Systems and Distributed Database Systems) will be released weekly

Reference
Database Systems: The Complete Book
• Second edition (Prentice Hall)
• Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom

Database Management Systems
• Third edition
• Raghu Ramakrishnan, Johannes Gehrke

Assessment Criteria
Short (Bring-Home) Quizzes: 50%
• Select 5 topics and read papers from the reading list for those topics.
• Select one paper for each of the 5 topics and write a review of it. The review should include a summary of the paper, 3 strong points and 3 weak points, and more detailed comments and suggestions.
• Make an appointment with me to discuss these 5 papers. You should show me that you have a deep understanding of the works.

Course Project: 50%
• Either individual or a group of two students
• Students may choose to do one of the following:
    • develop an application, a library package, or a sub-system based on some existing system (e.g., Spark, Husky, Hadoop, Storm, etc.)
    • improve an existing system (by either improving its performance in some aspects, or adding new functionalities)
    • develop a new system (prototype) for large scale data processing
• High flexibility for students to explore different things, but students must first get our approval of their project proposal (must be finalized on Sept 29, 1 p.m., so talk to us earlier)
• Some suggested projects will be posted after the first lab