ISOM3370 Big Data Technologies – Spring 2017

Instructor: Jia Jia, ISOM
Contact: Email: [email protected] (Begin subject: [ISOM3370]... <--- Note!); Office: LSK 5045
Office Hours: Fri. 13:30 – 16:30 and by appt.
Course Schedule and Classroom: Lecture: Wed. & Fri. 13:30 – 14:50 (LSK1007); Lab: Thu. 15:00 – 15:50 (LSKG005)
Course Website: Accessible from Canvas

1. Course Overview

Over the past decades there has been an explosion of data. With diverse data sources, such as large Internet sites, sensor networks, scientific experiments, and government records, the volume of data that we create and capture keeps increasing at an exponential rate. The off-the-shelf techniques and technologies that we already use to store and analyze data cannot work efficiently for large-scale data processing. The challenges arise especially in the context of data-intensive computing. We need to develop new techniques and technologies to mine "Big Data" for our specific purposes.

The emergence of large distributed clusters enables data storage and computation to be distributed across thousands of commodity machines in datacenters. One key breakthrough that makes this possible is the development of abstractions and frameworks that allow us to reason about computations at a massive scale, while hiding low-level details such as data movement, synchronization, and fault tolerance. Such disruptive technologies have become important data processing platforms for a variety of applications, and have transformed business, science, and many aspects of our society.

This course aims to provide an introduction to big data technologies, starting with MapReduce, which is the first of these datacenter-scale computation abstractions and whose Hadoop implementation lies at the core of an application stack that has been gaining widespread adoption in both industry and academia.
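To make the MapReduce abstraction mentioned above more concrete, here is a minimal single-machine sketch in Python of the classic word-count computation. It is not Hadoop code and is not taken from the course materials; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions. It only mimics the three logical steps (map, group by key, reduce) that a real framework would distribute across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework performs this grouping across machines;
    here it is just an in-memory dictionary.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the values collected for one key."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a small in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

The programmer writes only the map and reduce functions; data movement and grouping (the `shuffle` function here) are what the framework is meant to hide.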
Because of the success of Hadoop, a large number of big data tools, with specializations ranging from cluster resource management to complex data analytics, were built on and around Hadoop, creating a complete big data application stack. We will then cover some of the tools in this stack, such as YARN, Pig, Hive, HBase, and Spark.

As many fields such as data mining and information retrieval have adapted their algorithms to the new computation model, another focus of this course is algorithm design and "thinking at scale" in these fields. The course will cover some distributed algorithms that are widely used in academia and industry. Some basics of programming languages, such as Java and Scala, will also be covered to help you understand the algorithms and run them on massive datasets.

2. Course Goals and Objectives

At the end of this course, you will be able to:
• Understand MapReduce as a computation model and an execution framework
• Work with the following tools in the big data application stack:
  o Hadoop
  o YARN
  o Hive
  o Pig
  o Spark
  o and perhaps others…
• Realize how different tools in the Hadoop stack fit in the big picture of big data analytics
• Design distributed machine learning algorithms
• Use cloud computing services (Amazon Web Services) to build your clusters and run large-scale data processing applications

3. Prerequisites

ISOM3230 Business Applications Programming. Knowledge of programming in Java and data mining is highly desirable.

4. Lecture Notes, Textbook, and Readings

For most classes I will hand out lecture notes, which will outline the primary material for the class. Other readings (posted to Canvas or distributed in class) are intended to supplement the material we learn in class. They give alternative perspectives and additional details about the topics we cover.

The principal textbooks for this course are:
Hadoop: The Definitive Guide

5. Course Website

The most recent version of all materials for the course will be posted on Canvas, including the syllabus, readings, slides used in class, and homework assignments. Please check the course website frequently for updates.

6. Grading

The grade breakdown is as follows:
• Lab participation: 10%
• Homework (4): 40%
• Midterm quiz: 20%
• Final project: 30% (1/3 of which is determined by peer evaluation)

7. Important Notes on Labs

This is primarily a lecture-based course, but follow-up labs and assignments ensure that you get hands-on experience with Hadoop, Hive, Pig, and Spark, and with the design of scalable algorithms. Student participation is therefore an essential part of the learning process. During the lab sessions, I expect you to devote your full attention to the class by following the instructions and completing the exercises. You should also actively link the empirical results you obtain in the labs to the concepts you learn in the lectures.

8. Homework Assignments, Midterm Exam, and the Final Project

The homework assignments are designed for you to explore specific topics in a structured way. There will be a total of 4 individual homework assignments (submitted through the Canvas website), each comprising conceptual questions and hands-on tasks. Assignments will be graded and returned promptly. The due date of each assignment will be announced upon its release on Canvas.

You may work together on the homework assignments, but all of the material that is turned in for grading must be produced individually. For example, you may form study groups, work out homework solutions together, and then share what you have learned, but it is not permissible for someone to prepare an answer set and for others to copy those answers and submit them as their own work. Turning in copied files is specifically prohibited; you must individually write (type) any material that is submitted for grading.
Late policy: turn in your assignment early if there is any uncertainty about your ability to turn it in on the due date. Assignments up to 24 hours late will have their grade reduced by 25%; assignments more than 24 hours and up to one week late will have their grade reduced by 50%. After one week, late assignments will receive no credit.

The midterm exam is tentatively scheduled for March 15. Let me know as early as possible if there is any unavoidable conflict. One A4-sized cheat sheet will be allowed per student, as well as calculators if necessary; no books or other notes. Makeup exams will be offered only in cases of documented health or family emergencies or for official, university-sanctioned activities. Advance notification of missing an examination is required. Any uncoordinated absence from an exam will result in a score of 0 for the exam.

No final exam will be administered. Instead, you will work collaboratively to tackle given data analytics problems. Each project team will have at most 5 students. More details about the project will be announced later.

9. Tentative Schedule of Lecture Topics

The following shows the topics that we plan to cover. Please note that this schedule is tentative and is subject to adjustment as the semester progresses.

Class dates (Wed. & Fri.):

Week   Dates
1      Feb. 1, Feb. 3
2      Feb. 8, Feb. 10
3      Feb. 15, Feb. 17
4      Feb. 22, Feb. 24
5      Mar. 1, Mar. 3
6      Mar. 8, Mar. 10
7      Mar. 15, Mar. 17
8      Mar. 22, Mar. 24
9      Mar. 29, Mar. 31
10     Apr. 5, Apr. 7
11     Apr. 12, Apr. 14
12     Apr. 19, Apr. 21
13     Apr. 26, Apr. 28
14     May 3, May 5

Lecture topics, in planned order:

L1: Introduction to Big Data
• What is "Big Data" and its challenges?
• The datacenter as a computer
• A glance at planned topics

L2: Cloud Computing Basics
• Economics of Cloud Computing

L3: MapReduce and Distributed File Systems
• Distributed File Systems
• MapReduce as a programming model
• MapReduce as an execution framework
• Cluster Resource Manager: YARN

L4: MapReduce Algorithm Design
• Local aggregation in MapReduce
  o Proper use of combiners
  o In-mapper combining
• Key-value designs
  o "Pairs" vs. "stripes"
  o Value-to-key conversion and secondary sort
  o Order inversion
• Group comparator and sort comparator

L5: MapReduce for Processing Relational Data
• Data characterizations: structured, semi-structured, and unstructured data
• Data warehouse basics
• How MapReduce kicks in
  o Selection
  o Projection
  o Group By
  o Relational join

L6: MapReduce for Text Retrieval
• Text retrieval basics
• MapReduce for inverted indexing

Midterm Review

Midterm Exam

L7: Pig: High-Level Procedural Language for Data Processing
• Limitations of MapReduce and the need for high-level languages
• Pig Latin
• Compilation into MapReduce

L8: Hive: A Data Warehousing Application Using Hadoop
• Data model
• HiveQL
• Architecture
• Pig vs. Hive

L9: Spark: Cluster Computing Framework with In-Memory Data Sharing
• Limitations of MapReduce and the need for in-memory data sharing
• Resilient Distributed Datasets (RDD)
• Spark runtime
• Scala Basics

L10: Scalable Data Mining
• K-means clustering
• PageRank with GraphX
• …

Course Wrap-up

Remarks: Midterm Break (two sessions), Holiday. Two sessions are marked No Class.

10. Tentative Lab Schedule

Week   Date      Lab Topic                                   Remarks
2      Feb. 2    Introduction to Amazon Web Services
3      Feb. 9    Basics of Linux file operations
4      Feb. 16   HDFS file operations
5      Feb. 23   WordCount: the first distributed application
6      Mar. 2    MapReduce algorithm: secondary sorting
7      Mar. 9    Run MapReduce jobs with Amazon EMR
8      Mar. 16   No Lab                                      Midterm Exam
9      Mar. 23   Pig Latin basics
10     Mar. 30   Run Pig Script in MapReduce Mode
11     Apr. 6    Hive
12     Apr. 13   No Lab                                      Midterm Break
13     Apr. 20   Scala basics
14     Apr. 27   Data mining with Scala
15     May 4     PageRank with GraphX
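
As a quick self-check on the arithmetic in the grade breakdown (Section 6) and the late policy (Section 8), the following Python sketch combines the two. It rests on assumptions that the syllabus does not state: all scores are on a 0–100 scale, the four homework assignments count equally, and the project score passed in already includes the peer-evaluation part. The function names are illustrative only, and the instructor's actual computation may differ.

```python
from typing import Sequence

# Weights taken from the grade breakdown in Section 6.
WEIGHTS = {"lab": 0.10, "homework": 0.40, "midterm": 0.20, "project": 0.30}


def late_multiplier(hours_late: float) -> float:
    """Fraction of the homework grade kept after the late penalty."""
    if hours_late <= 0:
        return 1.0          # on time: no reduction
    if hours_late <= 24:
        return 0.75         # up to 24 hours late: reduced by 25%
    if hours_late <= 24 * 7:
        return 0.50         # up to one week late: reduced by 50%
    return 0.0              # after one week: no credit


def course_score(lab: float,
                 homework: Sequence[float],
                 hours_late: Sequence[float],
                 midterm: float,
                 project: float) -> float:
    """Weighted course score on a 0-100 scale.

    Assumption: the homework assignments are weighted equally.
    """
    adjusted = [score * late_multiplier(h) for score, h in zip(homework, hours_late)]
    homework_avg = sum(adjusted) / len(adjusted)
    return (WEIGHTS["lab"] * lab
            + WEIGHTS["homework"] * homework_avg
            + WEIGHTS["midterm"] * midterm
            + WEIGHTS["project"] * project)


if __name__ == "__main__":
    # Four homework scores of 80, the last one submitted 30 hours late.
    print(course_score(90, [80, 80, 80, 80], [0, 0, 0, 30], 70, 85))
```

Under these assumptions, a single homework submitted 30 hours late keeps only half of its score, which lowers the homework average from 80 to 70 in the example.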