ISOM3370 Big Data Technologies – Spring 2017
Instructor: Jia Jia, ISOM
Contact: Email: [email protected] (begin the subject line with [ISOM3370]); Office: LSK 5045
Office Hours: Fri. 13:30 – 16:30 and by appt.
Course Schedule and Classroom: Lecture: Wed. & Fri. 13:30 – 14:50 (LSK1007); Lab: Thu. 15:00 – 15:50 (LSKG005)
Course Website: Accessible from Canvas
1. Course Overview
Over the decades there has been an explosion of data. With diversified data sources, such as large
Internet sites, sensor networks, scientific experiments, and government records, the volume of data that
we create and capture keeps increasing at an exponential rate. The off-the-shelf techniques and
technologies that we have long used to store and analyze data cannot work efficiently for large-scale data
processing. The challenges arise especially in the context of data-intensive computing. We need to
develop new techniques and technologies to mine “Big Data” for our specific purposes.
The emergence of large distributed clusters enables data storage and computation to be distributed across
thousands of commodity machines in datacenters. One key breakthrough that makes this possible is the
development of abstractions and frameworks that allow us to reason about computations at a massive
scale, while hiding low-level details such as data movement, synchronization, and fault tolerance. Such
disruptive technologies have become important data processing platforms for a variety of applications,
and have transformed business, science, and many aspects of our society.
This course aims to provide an introduction to big data technologies, starting with MapReduce, which is
the first of these datacenter-scale computation abstractions and whose Hadoop implementation lies at the
core of an application stack that has been gaining widespread adoption in both industry and academia.
Because of the success of Hadoop, a large number of big data tools, with specializations ranging from
cluster resource management to complex data analytics, have been built on and around it, creating a
complete big data application stack. We will then cover some of the tools in this stack, such as YARN,
Pig, Hive, HBase, and Spark.
As many fields such as data mining and information retrieval have adapted their algorithms to the new
computation model, another focus of this course is algorithm design and "thinking at scale" in these
fields. The course will cover some widely used distributed algorithms in academia and industry. Some
basics of programming languages, such as Java and Scala, will also be covered to help you understand
algorithms and run them on massive datasets.
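To give a first taste of the computation model before the course formalizes it, here is a minimal sketch of MapReduce's map, shuffle, and reduce phases on the classic word-count example. This is plain Python that simulates the phases in-process, not the Hadoop API; the function names are illustrative only:

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle/sort: group all intermediate values by key, as the
    # framework does between the map and reduce stages.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big clusters", "data moves to compute"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In a real Hadoop job, the mapper and reducer run in parallel on many machines, and the framework handles the shuffle, data movement, and fault tolerance that this single-process sketch glosses over.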
2. Course Goals and Objectives
At the end of this course, you will be able to:
• Understand MapReduce as a computation model and an execution framework
• Work with the following tools in the big data application stack:
  o Hadoop
  o YARN
  o Hive
  o Pig
  o Spark
  o and perhaps others…
• Realize how different tools in the Hadoop stack fit into the big picture of big data analytics
• Design distributed machine learning algorithms
• Use cloud computing services (Amazon Web Services) to build your own clusters and run large-scale data processing applications
3. Prerequisites
ISOM3230 Business Applications Programming. Knowledge of programming in Java and data mining is
highly desirable.
4. Lecture Notes, Textbook, and Readings
For most classes I will hand out lecture notes, which will outline the primary material for the class.
Other readings (posted to Canvas or distributed in class) are intended to supplement the material we
learn in class. They give alternative perspectives and additional details about the topics we cover.
The principal textbook for this course is Hadoop: The Definitive Guide.
5. Course Website
The most recent version of all materials for the course will be posted on Canvas, including the syllabus,
readings, slides used in class, and homework assignments. Please check the course website frequently for
updates.
6. Grading
The grade breakdown is as follows:
• Lab participation: 10%
• Homework (4): 40%
• Midterm quiz: 20%
• Final project: 30% (1/3 of which is determined by peer evaluation)
7. Important Notes on Labs
This is primarily a lecture-based course, but follow-up labs and assignments ensure that you will get
hands-on experience with Hadoop, Hive, Pig, and Spark, and with the design of scalable algorithms.
Student participation is therefore an essential part of the learning process. During the lab sessions, I
expect you to be fully engaged: follow the instructions, complete the exercises, and actively link the
empirical results you obtain from the labs to the concepts you learn in the lectures.
8. Homework Assignments, Midterm Exam, and the Final Project
The homework assignments are designed for you to explore specific topics in a structured way. There
will be a total of 4 individual homework assignments (submitted via Canvas), each comprising
conceptual questions and hands-on tasks. Assignments will be graded and returned promptly. The due
date of each assignment will be announced upon its release on Canvas.
You may work together on the homework assignments, but all of the material that is turned in for grading
must be produced individually. For example, you may form study groups, work out homework solutions
together, and share what you’ve learned, but it is not permissible for one person to prepare an answer set
and for others to copy those answers and submit them as their own work. Turning in copied files is
specifically prohibited; you must individually write (type) any material that is submitted for grading.
Late policy: turn in your assignment early if there is any uncertainty about your ability to turn it in on the
due date. Assignments up to 24 hours late will have their grade reduced by 25%; assignments up to one
week late will have their grade reduced by 50%. After one week, late assignments will receive no credit.
The midterm exam is tentatively scheduled for March 15. Let me know as early as possible
if there is any unavoidable conflict. One A4-sized cheat sheet will be allowed per student, as well as
calculators if necessary; no books or other notes. Makeup exams will be offered only in cases of
documented health or family emergencies or for official, university-sanctioned activities. Advance
notification of missing an examination is required. Any uncoordinated absence from an exam will result in
a score of 0 for the exam.
No final exam will be administered. Instead, you will work collaboratively to tackle given data analytics
problems. Each project team will comprise at most 5 students. More details about the project will be
announced later.
9. Tentative Schedule of Lecture Topics
The following table shows the topics we plan to cover. Please note that this schedule is tentative and
subject to adjustment as the semester progresses.
Week 1
Feb. 1 – L1: Introduction to Big Data
• What is “Big Data” and its challenges?
• The datacenter as a computer
• A glance of planned topics
Feb. 3 – L2: Cloud Computing Basics
• Economics of Cloud Computing
Week 2
Feb. 8 & Feb. 10 – L3: MapReduce and Distributed File Systems
• Distributed File Systems
• MapReduce as a programming model
• MapReduce as an execution framework
• Cluster Resource Manager: YARN
Week 3
Feb. 15 & Feb. 17 – L4: MapReduce Algorithm Design
• Local aggregation in MapReduce
  o Proper use of combiners
  o In-mapper combining
• Key-value designs
  o “Pairs” vs. “stripes”
  o Value-to-key conversion and secondary sort
  o Order inversion
• Group comparator and sort comparator
Week 4
Feb. 22 & Feb. 24 – L5: MapReduce for Processing Relational Data
• Data characterizations: structured, semi-structured, and unstructured data
• Data warehouse basics
• How MapReduce kicks in
  o Selection
  o Projection
  o Group By
  o Relational join
Week 5
Mar. 1 & Mar. 3 – L6: MapReduce for Text Retrieval
• Text retrieval basics
• MapReduce for inverted indexing
Week 6
Mar. 8 – L6 (cont.)
Mar. 10 – Midterm Review
Week 7
Mar. 15 – Midterm Exam
Mar. 17 – L7: Pig: High-Level Procedural Language for Data Processing
• Limitations of MapReduce and the need for high-level languages
• Pig Latin
• Compilation into MapReduce
Week 8
Mar. 22 – L7 (cont.)
Mar. 24 – L8: Hive: A Data Warehousing Application Using Hadoop
• Data model
• HiveQL
• Architecture
• Pig vs. Hive
Week 9
Mar. 29 – L8 (cont.)
Mar. 31 – L9: Spark: Cluster Computing Framework with In-Memory Data Sharing
• Limitations of MapReduce and the need for in-memory data sharing
• Resilient Distributed Datasets (RDD)
• Spark runtime
• Scala basics
Week 10
Apr. 5 & Apr. 7 – L9 (cont.)
Week 11
Apr. 12 – L9 (cont.)
Apr. 14 – No Class (Midterm Break)
Week 12
Apr. 19 – No Class (Midterm Break)
Apr. 21 – L10: Scalable Data Mining
• K-means clustering
• PageRank with GraphX
• …
Week 13
Apr. 26 & Apr. 28 – L10 (cont.)
Week 14
May 3 – Holiday
May 5 – Course Wrap-up
10. Tentative Lab Schedule
Week 2 – Feb. 2: Introduction to Amazon Web Services
Week 3 – Feb. 9: Basics of Linux file operations
Week 4 – Feb. 16: HDFS file operations
Week 5 – Feb. 23: WordCount: the first distributed application
Week 6 – Mar. 2: MapReduce algorithm: secondary sorting
Week 7 – Mar. 9: Run MapReduce jobs with Amazon EMR
Week 8 – Mar. 16: No Lab (Midterm Exam)
Week 9 – Mar. 23: Pig Latin basics
Week 10 – Mar. 30: Run Pig script in MapReduce mode
Week 11 – Apr. 6: Hive
Week 12 – Apr. 13: No Lab (Midterm Break)
Week 13 – Apr. 20: Scala basics
Week 14 – Apr. 27: Data mining with Scala
Week 15 – May 4: PageRank with GraphX