Course Overview:
Introduction to Big Data Analytics
J. H. Wang
Feb. 21, 2017
Instructor & TA
• Instructor
– J. H. Wang (王正豪)
– Associate Professor, CSIE, NTUT
– Office: R1534, Technology Building
– E-mail: [email protected]
– Tel: ext. 4238
– Office hours: 2:00-5:00 pm, every Wednesday and Thursday
• TA
– Mr. Lin (@ R1424, Technology Building)
Big Data Analytics,
Spring 2017
NTUT CSIE
Course Description
• Course Web Page:
– http://www.ntut.edu.tw/~jhwang/BDA/
– For the latest announcements and updates to the schedule, slides,
and homework
• Time: 2:10-4:00pm, Tue., 3:10-4:00pm, Fri.
• Classroom: R227, 6th Teaching Building
• Textbooks: (selected chapters)
– Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and
Techniques, 3rd ed., Morgan Kaufmann Publishers, July 2011.
– Tom White, Hadoop: The Definitive Guide, 4th ed., O’Reilly Media, 2015.
• Prerequisites:
– Basic knowledge of data structures (and algorithms), database
systems, and operating systems
– Programming experience is *required* for completing
homework & projects
Target Audience
• CSIE juniors and seniors
• Students who are interested in big data
analytics and willing to learn practical
skills
Additional Reading Materials
• Books
– Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with
MapReduce, Morgan & Claypool Publishers, 2010.
– Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, Mining
of Massive Datasets, 2nd Edition, Cambridge University Press,
November 2014.
– Holden Karau, Andy Konwinski, Patrick Wendell, Matei
Zaharia, Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly
Media, January 2015.
• Online documents: Hadoop, Spark, …
• Tutorials on big data analytics in major conferences
• Academic papers
Grading Policy
• Homework assignments and
programming exercises: ~40%
• Mid-term exam: ~25%
• Term project: ~35%
– Including proposal, presentation, and final
report
• All homework, projects, and reports must
be submitted *before* the end of the
semester (Jun. 26, 2017)
Homework and Projects
• About 2-3 written assignments
– Concepts
• About 2-3 team-based programming exercises
– Maximum number of students per team: 4
• The term project (to be detailed later)
– Either team-based system development
• e.g. extension to exercises
– Or academic paper presentation
• Only one person per team allowed
About the Term Project
• The score you’ll get depends on the functionality,
difficulty, and quality of your project
– For system development:
• System functions and correctness
• You can either write your own program or call existing open
source code or libraries (but NOT merely execute an existing binary)
– For academic paper presentation:
• The quality of the paper and of your presentation is critical
• Major methods/experimental results *must* be presented
• Papers from top conferences are strongly suggested
– Proposals, presentations, and reports are *required* for
each team, and will be counted in the score
Homework Submission
• Online Submission
– Systems, programs, project proposals, and
project reports in electronic files must be
submitted to the website.
– Submission instructions to be announced.
What This Course is about
• This course can help you
– Understand the general idea of big data
– Obtain a rough idea of data mining
– Have an idea of what Hadoop can do
• And MORE than these!
– Technical and practical skills for:
• Environment setup of distributed computing
• Data analysis
• Data mining methods
• Parallel programming
AlphaGo
Google Translate
Google Flu Trends
Big Data: Some Examples
• Topic detection and tracking
• Trend analysis
• Social network analysis
• PageRank
• Predictive analytics
• Many others: healthcare, natural resources,
education, public sector, insurance,
transportation, finance and crime detection,
…
Google News
Google Trends
Google Hot Trends
Daily View
AMiner
What is Big Data?
• Big data is a term for data sets that are so
large or complex that traditional data
processing application software is
inadequate to deal with them.
Challenges include capture, storage,
analysis, data curation, search, sharing,
transfer, visualization, querying, updating
and information privacy. [source:
Wikipedia]
Characteristics of Big Data
• META Group (now Gartner) defined data
growth challenges and opportunities as
being three-dimensional, i.e. increasing
volume, velocity, and variety [Doug
Laney, 2001]
– Volume: the quantity of generated and stored
data
– Velocity: the speed at which data is generated
and processed
– Variety: the type and nature of data
Characteristics of Big Data
• “Big data is high volume, high velocity,
and/or high variety information assets
that require new forms of processing to
enable enhanced decision making, insight
discovery and process optimization.”
[Gartner, 2012]
The Four V’s of Big Data
[source: IBM]
More Characteristics of Big Data
• Five V’s:
– Volume
• 2.5 EB generated per day (for scale: Library of Congress, 300 TB)
– Velocity
• In 60 seconds: 350,000 tweets, 300 hours of YouTube video, 171 million
emails, 350 GB sensor data from a jet engine
– Variety
• Structured, Semi-structured, Unstructured
– Veracity
• Quality or fidelity
– Value
• Higher veracity, lower processing time -> higher value
• Other V’s:
– Variability: inconsistency of the data set can hamper processes to
handle and manage it
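To get a feel for the volume and velocity figures above, the numbers can be turned into ratios and per-second rates. The following is only a back-of-the-envelope sketch: it assumes decimal (SI) units, and the variable names are made up for illustration.

```python
# Assumption: decimal SI units (1 TB = 10**12 bytes, 1 EB = 10**18 bytes).
GB = 10**9
TB = 10**12
EB = 10**18

# Volume: daily data production compared with the Library of Congress figure.
daily_volume = 2.5 * EB
library_of_congress = 300 * TB
libraries_per_day = daily_volume / library_of_congress

# Velocity: convert the "in 60 seconds" figures into per-second rates.
tweets_per_second = 350_000 / 60
sensor_bytes_per_second = 350 * GB / 60

print(f"Library-of-Congress equivalents per day: {libraries_per_day:,.0f}")
print(f"Tweets per second: {tweets_per_second:,.0f}")
print(f"Jet engine sensor data: {sensor_bytes_per_second / GB:.2f} GB/s")
```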
Data analysis vs. Data analytics
• “Analysis is the separation of a whole into its
component parts, and analytics is the method of
logical analysis.” [source: Merriam-Webster
dictionary]
• “Analysis is really a heuristic activity, where
scanning through all the data the analyst gains
some insight.” [source: Quora.com]
• “Analytics is about applying a mechanical or
algorithmic process to derive the insights for
example running through various data sets
looking for meaningful correlations between
them.” [source: Quora.com]
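As a rough illustration of the "algorithmic process" described in the last quote, the sketch below runs through every pair of data sets and reports the strongly correlated ones. The data sets, the function names, and the 0.9 threshold are hypothetical and chosen only for this example; Pearson correlation is one possible measure among many.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def strong_correlations(data, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    found = []
    for (a, xs), (b, ys) in combinations(data.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            found.append((a, b, r))
    return found


# Hypothetical toy data sets, invented for illustration only.
datasets = {
    "searches": [1, 2, 3, 4, 5],
    "visits": [2, 4, 6, 8, 10],
    "noise": [5, 1, 4, 2, 3],
}

for a, b, r in strong_correlations(datasets):
    print(f"{a} ~ {b}: r = {r:.2f}")
```

A correlation found this way is only a starting point; deciding whether it is meaningful is still the analyst's job.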
Related Terms
• Data science, predictive analytics
• Business intelligence, FinTech
• IoT, CPS, Industry 4.0
• Smart homes, smart cities
• Data mining, machine learning, artificial
intelligence
• Cloud computing, data-intensive computing,
parallel computing, distributed computing
• …
Google Trends Comparison:
Big data, IoT, Cloud computing
What is Big Data Analytics
• The management of complete data
lifecycle
– Collecting
– Cleansing
– Organizing
– Storing
– Analyzing
– Visualization
– Governing
The Goal of the Course
• Big data analytics means analyzing huge
amounts of diverse data with effective
algorithms in an efficient way
– Data preprocessing
– Data mining
– Parallel programming in distributed
platforms
A Big Picture
[Figure: system overview diagram. Labeled components: data source (open APIs / crawler); data analyst (topic, analytical need); data extraction; similarity estimation; topic-relevant search; distributed storage; data/task dispatch; distributed analysis (feature extraction; regression/classification/clustering; machine learning); feedback; visualization & feedback; presentation.]
The Topics to be Covered
• Introduction
• Data mining concepts
– Data preprocessing
– Frequent pattern mining
– Data classification
– Data clustering
• Distributed platform and parallel programming
– Introducing distributed platforms: Hadoop, Spark
– Parallel programming paradigm and concepts
– MapReduce programming
• Advanced topics
– Link analysis
– Mining social network graphs
– Dimensionality reduction
– Large-scale machine learning
Tentative Schedule
• Before midterm
– Introduction (1 wk)
– Data preprocessing (1 wk)
– Frequent pattern mining (2 wks)
– Data classification (2 wks)
– Data clustering (2 wks)
• After midterm
– Introducing distributed platforms: Hadoop, Spark (1
wk)
– Parallel programming paradigm and concepts (1 wk)
– MapReduce programming (2-3 wks)
– Term Project Presentation (3-4 wks)
Notes on Homework
• Rule 1: Plagiarism is prohibited.
– Near-duplicate code submissions will all receive the
same minimum base score
• Rule 2: Clear documentation is required in
your system projects.
– Instructions on downloading, installing,
configuring, and executing your code and any
open source libraries, APIs, or code it uses must be
submitted
– Package control is recommended
More on Term Projects
• Options for term projects
– Option 1: team-based system project
• e.g., extension to system exercises
– Option 2: academic paper presentation
• Only one person, NOT team-based
• Tentative schedule for all teams:
– Proposal: *required* one week after midterm (Apr. 28,
2017)
– Presentations (including demos): *required* in the last
three to four weeks (starting as early as May 30, 2017)
– Final report: *required* before the end of the semester (Jun.
24, 2017)
• Slides, source code, documentation
For System Development
• You can write your own code in any
programming language
• You can also call existing open source APIs or
libraries
• But simply running existing binary executables or
commercial tools is NOT acceptable
• Selected datasets might be announced in the
course
• Any topic relevant to big data analytics
– Analysis of various types of media (text, Web, social
media) with big data characteristics (3V’s) using data
mining methods (regression, classification, clustering)
Open Source Tools for Big Data
Analytics
• Apache Hadoop, Spark (in Java, Scala,
Python, R)
– For distributed computing and data analysis
• Apache Pig, Hive, Flume, HBase,
Cassandra, Alluxio, Mahout, …
– For data flow, SQL, streaming data,
distributed databases, distributed storage,
machine learning, …
• Spark SQL, Spark Streaming, MLlib, GraphX
Hadoop Architecture
MapReduce
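The figure for this slide did not survive in the transcript. As a stand-in, here is a minimal single-machine sketch of the MapReduce idea using the classic word-count example. It is plain Python, not the Hadoop API; the function names and the sample documents are assumptions made for illustration. In a real cluster the map and reduce calls would run in parallel on different nodes and the shuffle would move data over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical input records, invented for illustration only.
docs = ["big data analytics", "big data mining", "data"]
counts = word_count(docs)
print(counts)
```

Because each map call looks at only one record and each reduce call at only one key, both phases can be distributed without the workers needing to coordinate with each other.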
Spark and Hadoop
Thanks for Your Attention!