Download Big Data Mining

Document related concepts
no text concepts found
Transcript
Tamkang University
Big Data Mining
巨量資料探勘
Tamkang
University
巨量資料基礎:
MapReduce典範、Hadoop與Spark生態系統
(Fundamental Big Data:
MapReduce Paradigm, Hadoop and Spark Ecosystem)
1052DM02
MI4 (M2244) (3069)
Thu, 8, 9 (15:10-17:00) (B130)
Min-Yuh Day
戴敏育
Assistant Professor
專任助理教授
Dept. of Information Management, Tamkang University
淡江大學 資訊管理學系
http://mail. tku.edu.tw/myday/
2017-02-23
1
課程大綱 (Syllabus)
週次 (Week) 日期 (Date) 內容 (Subject/Topics)
1 2017/02/16 巨量資料探勘課程介紹
(Course Orientation for Big Data Mining)
2 2017/02/23 巨量資料基礎:MapReduce典範、Hadoop與Spark生態系統
(Fundamental Big Data: MapReduce Paradigm,
Hadoop and Spark Ecosystem)
3 2017/03/02 關連分析 (Association Analysis)
4 2017/03/09 分類與預測 (Classification and Prediction)
5 2017/03/16 分群分析 (Cluster Analysis)
6 2017/03/23 個案分析與實作一 (SAS EM 分群分析):
Case Study 1 (Cluster Analysis – K-Means using SAS EM)
7 2017/03/30 個案分析與實作二 (SAS EM 關連分析):
Case Study 2 (Association Analysis using SAS EM)
2
課程大綱 (Syllabus)
週次 (Week) 日期 (Date) 內容 (Subject/Topics)
8 2017/04/06 教學行政觀摩日 (Off-campus study)
9 2017/04/13 期中報告 (Midterm Project Presentation)
10 2017/04/20 期中考試週 (Midterm Exam)
11 2017/04/27 個案分析與實作三 (SAS EM 決策樹、模型評估):
Case Study 3 (Decision Tree, Model Evaluation using SAS EM)
12 2017/05/04 個案分析與實作四 (SAS EM 迴歸分析、類神經網路):
Case Study 4 (Regression Analysis,
Artificial Neural Network using SAS EM)
13 2017/05/11 Google TensorFlow 深度學習
(Deep Learning with Google TensorFlow)
14 2017/05/18 期末報告 (Final Project Presentation)
15 2017/05/25 畢業班考試 (Final Exam)
3
2017/02/23
巨量資料基礎:
MapReduce典範、
Hadoop與Spark生態系統
(Fundamental Big Data:
MapReduce Paradigm,
Hadoop and Spark Ecosystem)
4
Big Data
Analytics
and
Data Mining
5
Architectures of
Big Data Analytics
6
Architecture of Big Data Analytics
Big Data
Sources
Big Data
Transformation
Middleware
* Internal
* External
* Multiple
formats
* Multiple
locations
* Multiple
applications
Big Data
Platforms & Tools
Raw
Data
Hadoop
Transformed MapReduce
Pig
Data
Extract
Hive
Transform
Jaql
Load
Zookeeper
Hbase
Data
Cassandra
Warehouse
Oozie
Avro
Mahout
Traditional
Others
Format
CSV, Tables
Big Data
Analytics
Applications
Queries
Big Data
Analytics
Reports
OLAP
Data
Mining
Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications
7
Architecture of Big Data Analytics
Big Data
Sources
* Internal
* External
* Multiple
formats
* Multiple
locations
* Multiple
applications
Big Data
Transformation
Big Data
Platforms & Tools
Data Mining
Big Data
Analytics
Applications
Middleware
Raw
Data
Hadoop
Transformed MapReduce
Pig
Data
Extract
Hive
Transform
Jaql
Load
Zookeeper
Hbase
Data
Cassandra
Warehouse
Oozie
Avro
Mahout
Traditional
Others
Format
CSV, Tables
Big Data
Analytics
Applications
Queries
Big Data
Analytics
Reports
OLAP
Data
Mining
Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications
8
Architecture for
Social Big Data Mining
(Hiroshi Ishikawa, 2015)
Enabling Technologies
• Integrated analysis model
Analysts
Integrated analysis
• Model Construction
• Explanation by Model
Conceptual Layer
•
•
•
•
Natural Language Processing
Information Extraction
Anomaly Detection
Discovery of relationships
among heterogeneous data
• Large-scale visualization
• Parallel distrusted processing
Data
Mining
Multivariate
analysis
Application
specific task
Logical Layer
Software
• Construction and
confirmation
of individual
hypothesis
• Description and
execution of
application-specific
task
Social Data
Hardware
Physical Layer
Source: Hiroshi Ishikawa (2015), Social Big Data Mining, CRC Press
9
Business Intelligence (BI) Infrastructure
Source: Kenneth C. Laudon & Jane P. Laudon (2014), Management Information Systems: Managing the Digital Firm, Thirteenth Edition, Pearson.
10
Data Warehouse
Data Mining and Business Intelligence
Increasing potential
to support
business decisions
End User
Decision
Making
Data Presentation
Visualization Techniques
Business
Analyst
Data Mining
Information Discovery
Data
Analyst
Data Exploration
Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
Source: Jiawei Han and Micheline Kamber (2006), Data Mining: Concepts and Techniques, Second Edition, Elsevier
DBA
11
The Evolution of BI Capabilities
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
12
Data Science and
Business Intelligence
Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015
13
Data Science and
Business Intelligence
Predictive Analytics
and Data Mining
(Data Science)
Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015
14
Predictive
Analytics
Data Science
and
and Data
Mining
Business
Intelligence
(Data Science)
Structured/unstructured data, many types of sources,
very large datasets
Optimization, predictive modeling, forecasting statistical analysis
What if…?
What’s the optimal scenario for our business?
What will happen next?
What if these trends countinue?
Why is this happening?
Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015
15
Data Mining at the
Intersection of Many Disciplines
ial
e
Int
tis
tic
s
c
tifi
Ar
Pattern
Recognition
en
Sta
llig
Mathematical
Modeling
Machine
Learning
ce
DATA
MINING
Databases
Management Science &
Information Systems
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
16
A Taxonomy for Data Mining Tasks
Data Mining
Learning Method
Popular Algorithms
Supervised
Classification and Regression Trees,
ANN, SVM, Genetic Algorithms
Classification
Supervised
Decision trees, ANN/MLP, SVM, Rough
sets, Genetic Algorithms
Regression
Supervised
Linear/Nonlinear Regression, Regression
trees, ANN/MLP, SVM
Unsupervised
Apriory, OneR, ZeroR, Eclat
Link analysis
Unsupervised
Expectation Maximization, Apriory
Algorithm, Graph-based Matching
Sequence analysis
Unsupervised
Apriory Algorithm, FP-Growth technique
Unsupervised
K-means, ANN/SOM
Prediction
Association
Clustering
Outlier analysis
Unsupervised
K-means, Expectation Maximization (EM)
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
17
Traditional Analytics
BI and
Analytics
Data Mart
Operational Data
Sources
Data Mart
EDW
Analytic
Mart
Analytic
Mart
Unstructured, Semi-structured and Streaming
data (i.e. sensor data) handled often outside the
Warehouse flow
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics
18
Hadoop as a “new data” Store
BI and
Analytics
Data Mart
Operational Data
Sources
Data Mart
EDW
Analytic
Mart
Analytic
Mart
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics
19
Hadoop as an additional input to
the EDW
BI and
Analytics
Data Mart
Operational Data
Sources
Data Mart
EDW
Analytic
Mart
Analytic
Mart
Analytic
Mart
Data Mart
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics
20
Hadoop Data Platform As a
“staging Layer” as part of a “data Lake”
– Downstream stores could be Hadoop, data appliances or an RDBMS
BI and
Analytics
Operational Data
Sources
EDW
Data Mart
Data Mart
Analytic
Mart
Analytic
Mart
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics
21
SAS Big data Strategy
– SAS areas
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics
22
SAS Big data Strategy
– SAS areas
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics
23
SAS® Within the
HADOOP ECOSYSTEM
EG
User
Interface
®
SAS User
SAS®
Enterprise
Guide®
EM
SAS® Data
Integration
Data
Processing
SAS®
Enterprise
Miner™
Base SAS & SAS/ACCESS®
to Hadoop™
SAS® In-Memory
Statistics for
Haodop
Pig
Hive
Next-Gen
®
SAS User
In-Memory
Data Access
SAS Embedded
Process
Accelerators
Impala
Map Reduce
File
System
SAS® Visual
Analytics
SAS Metadata
Metadata
Data
Access
VA
SAS® LASR™
Analytic
Server
SAS® HighPerformance
Analytic Procedures
MPI Based
HDFS
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics
24
SAS enables the entire lifecycle
around HADOOP
SAS enableS the entire lifecycle around HADOOP
Done using either the Data Preparation,
Data Exploration or Build Model Tools
SAS Visual Analytics
Decision Manager
IDENTIFY /
FORMULATE
PROBLEM
EVALUATE /
MONITOR
RESULTS
SAS Scoring Accelerator for Hadoop
SAS Code Accelerator for Hadoop
SAS Visual Analytics
SAS Visual Statistics
SAS In-Memory Statistics for Hadoop
DATA
PREPARATION
DEPLOY
MODEL
DATA
EXPLORATION
VALIDATE
MODEL
TRANSFORM
& SELECT
Decision Manager
Done using either the Data
Preparation, Data Exploration
or Build Model Tools
BUILD
MODEL
SAS High Performance Analytics Offerings
supported by relevant clients like SAS
Enterprise Miner, SAS/STAT etc.
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics
25
Data Mining Process
26
Data Mining Process
•
•
•
•
A manifestation of best practices
A systematic way to conduct DM projects
Different groups has different versions
Most common standard processes:
– CRISP-DM
(Cross-Industry Standard Process for Data Mining)
– SEMMA
(Sample, Explore, Modify, Model, and Assess)
– KDD
(Knowledge Discovery in Databases)
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
27
Data Mining Process
(SOP of DM)
What main methodology
are you using for your
analytics,
data mining,
or data science projects ?
Source: http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html
28
Data Mining Process
Source: http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html
29
Data Mining:
Core Analytics Process
The KDD Process for
Extracting Useful Knowledge
from Volumes of Data
Source: Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD Process for Extracting Useful Knowledge from Volumes of Data.
Communications of the ACM, 39(11), 27-34.
30
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996).
The KDD Process for
Extracting Useful Knowledge
from Volumes of Data.
Communications of the ACM, 39(11), 27-34.
31
Data Mining
Knowledge Discovery in Databases (KDD) Process
(Fayyad et al., 1996)
Source: Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD Process for Extracting Useful Knowledge from Volumes of Data.
Communications of the ACM, 39(11), 27-34.
32
Knowledge Discovery in Databases (KDD)
Process
Data mining:
core of knowledge discovery process
Data Mining
Evaluation and
Presentation
Knowledge
Patterns
Selection and
Transformation
Cleaning and
Integration
Databases
Data
Warehouse
Task-relevant
Data
Flat files
Source: Jiawei Han and Micheline Kamber (2006), Data Mining: Concepts and Techniques, Second Edition, Elsevier
33
Data Mining Process:
CRISP-DM
1
Business
Understanding
2
Data
Understanding
3
Data
Preparation
Data Sources
6
4
Deployment
Model
Building
5
Testing and
Evaluation
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
34
Data Mining Process:
CRISP-DM
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation (!)
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment
Accounts for
~85% of total
project time
• The process is highly repetitive and experimental
(DM: art versus science?)
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
35
Data Preparation –
A Critical DM Task
Real-world
Data
Data Consolidation
·
·
·
Collect data
Select data
Integrate data
Data Cleaning
·
·
·
Impute missing values
Reduce noise in data
Eliminate inconsistencies
Data Transformation
·
·
·
Normalize data
Discretize/aggregate data
Construct new attributes
Data Reduction
·
·
·
Reduce number of variables
Reduce number of cases
Balance skewed data
Well-formed
Data
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
36
Data Mining Process:
SEMMA
Sample
(Generate a representative
sample of the data)
Assess
Explore
(Evaluate the accuracy and
usefulness of the models)
(Visualization and basic
description of the data)
SEMMA
Model
Modify
(Use variety of statistical and
machine learning models )
(Select variables, transform
variable representations)
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
37
Data Mining Processing Pipeline
(Charu Aggarwal, 2015)
Data Preprocessing
Data
Collection
Feature
Extraction
Cleaning
and
Integration
Analytical Processing
Building
Block 1
Building
Block 2
Output
for
Analyst
Feedback (Optional)
Feedback (Optional)
Source: Charu Aggarwal (2015), Data Mining: The Textbook Hardcover, Springer
38
Fundamental Big Data:
MapReduce Paradigm,
Hadoop and Spark
Ecosystem
39
Source: https://www.thalesgroup.com/en/worldwide/big-data/big-data-big-analytics-visual-analytics-what-does-it-all-mean
40
MapReduce
Paradigm
41
MapReduce Paradigm
Big Data
Map
Map0
Map1
Map2
Map3
Reduce
Reduce0
Reduce1
Reduce2
Reduce3
MapReduce Data
Output Data
42
MapReduce Word Count
Input
Dog Love Cat
Bird Love Bird
Dog Bird Cat
Source: https://www.edureka.co/blog/mapreduce-tutorial/
43
MapReduce Word Count
Input
Output
Dog Love Cat
Bird Love Bird
Dog Bird Cat
Bird, 3
Cat, 2
Dog, 2
Love, 2
Source: https://www.edureka.co/blog/mapreduce-tutorial/
44
MapReduce Word Count
Input
Dog Love Cat
Bird Love Bird
Dog Bird Cat
Split
Map
Shuffle
Reduce
Bird, (1, 1, 1)
Bird, 3
Dog Love Cat
Dog, 1
Love, 1
Cat, 1
Cat, (1, 1)
Cat, 2
Bird Love Bird
Bird, 1
Love, 1
Bird, 1
Dog, (1, 1)
Dog, 2
Love, (1, 1)
Love, 2
Dog Bird Cat
Output
Bird, 3
Cat, 2
Dog, 2
Love, 2
Dog, 1
Bird, 1
Cat, 1
Source: https://www.edureka.co/blog/mapreduce-tutorial/
45
Hadoop
Ecosystem
46
The Apache™ Hadoop® project
develops open-source software
for reliable, scalable,
distributed computing.
Source: http://hadoop.apache.org/
47
MapReduce
HDFS
Processing
Storage
Source: http://hadoop.apache.org/
48
Big Data with Hadoop Architecture
Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
49
Big Data with Hadoop Architecture
Logical Architecture
Processing: MapReduce
Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
50
Big Data with Hadoop Architecture
Logical Architecture
Storage: HDFS
Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
51
Big Data with Hadoop Architecture
Process Flow
Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
52
Big Data with Hadoop Architecture
Hadoop Cluster
Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
53
Hadoop Ecosystem
Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing
54
HDP (Hortonworks Data Platform)
A Complete Enterprise Hadoop Data Platform
Source: http://hortonworks.com/hdp/
55
Apache Hadoop
Hortonworks Data Platform
Source: http://hortonworks.com/hdp/
56
Hadoop and Data Analytics Tools
Source: http://hortonworks.com/hdp/
57
Hadoop 1  Hadoop 2
Source: http://hortonworks.com/hadoop/tez/
58
Big Data Solution
EG
EM
VA
Source: http://www.newera-technologies.com/big-data-solution.html
59
Traditional ETL Architecture
Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
60
Offload ETL with Hadoop
(Big Data Architecture)
Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
61
Spark
Ecosystem
62
Lightning-fast cluster computing
Apache Spark
is a fast and general engine
for
large-scale data processing.
Source: http://spark.apache.org/
63
Logistic regression in
Hadoop and Spark
Run programs up to 100x faster than
Hadoop MapReduce in memory,
or 10x faster on disk.
Source: http://spark.apache.org/
64
Ease of Use
• Write applications quickly in
Java, Scala, Python, R.
Source: http://spark.apache.org/
65
Word count in Spark's Python API
text_file = spark.textFile("hdfs://...")
text_file.flatMap(lambda line: line.split())
.map(lambda word: (word, 1))
.reduceByKey(lambda a, b: a+b)
Source: http://spark.apache.org/
66
Spark and Hadoop
Source: http://spark.apache.org/
67
Spark Ecosystem
Source: http://spark.apache.org/
68
Spark Ecosystem
Spark
Spark
Streaming
Kafka
MLlib
Flume
(machine
learning)
Spark
SQL
GraphX
H2O
Hive
Titan
(graph)
HBase
Cassandra
HDFS
Source: Mike Frampton (2015), Mastering Apache Spark, Packt Publishing
69
Hadoop vs. Spark
HDFS
read
HDFS
write
HDFS
read
HDFS
write
Iter. 1
Iter. 2
Iter. 1
Iter. 2
Input
Input
Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing
70
Steps to
Install Hadoop
on a
Personal Computer
(Windows/OS X)
Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5
71
Hodoop: Linux Based Software
LINUX
LINUX
LINUX
LINUX
Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5
72
Appliance
Personal Computer (Windows / OS X)
Virtual Machine (VirtualBox / VMWare)
Linux
Hadoop
Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5
73
Connection to Hadoop
Personal Computer (Windows / OS X)
Access
from
host
Browser
Virtual Machine (VirtualBox / VMWare)
Linux
Hadoop
Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5
74
Steps to Install Hadoop on a
Personal Computer (Windows/OS X)
Step 1. Download and Install VirtualBox
Step 2. Download Appliance
Step 3. Import Appliance
Step 4. Configure Virtual Machine (VM)
Step 5. Start Virtual Machine (VM)
Step 6. Test Connection From Host
Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5
75
Virtual Box
https://www.virtualbox.org/
76
Steps to Install Hadoop on a
Personal Computer (Windows/OS X)
Step 1. Download and Install VirtualBox
Step 2. Download Appliance
Hortonworks
Sandbox
Step 3. Import Appliance
Step 4. Configure Virtual Machine (VM)
Step 5. Start Virtual Machine (VM)
Step 6. Test Connection From Host
Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5
77
Hortonworks Sandbox
The easiest way to get started with Enterprise Hadoop
http://hortonworks.com/products/hortonworks-sandbox/#install
78
Get started on Hadoop with these tutorials
based on the Hortonworks Sandbox
http://hortonworks.com/tutorials/
79
Apache Hadoop
http://hadoop.apache.org/
80
Apache Hadoop
http://hadoop.apache.org/releases.html#Download
81
Apache Hadoop YARN
Source: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
82
Apache Spark
http://spark.apache.org/
83
References
• EMC Education Services (2015),
Data Science and Big Data Analytics: Discovering, Analyzing,
Visualizing and Presenting Data, Wiley
• Shiva Achari (2015),
Hadoop Essentials - Tackling the Challenges of Big Data with
Hadoop, Packt Publishing
• Mike Frampton (2015),
Mastering Apache Spark, Packt Publishing
• Deepak Ramanathan (2014),
SAS Modernization architectures - Big Data Analytics,
http://www.slideshare.net/deepakramanathan/sasmodernization-architectures-big-data-analytics
84
Related documents