Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Tamkang University Big Data Mining 巨量資料探勘 Tamkang University 巨量資料基礎: MapReduce典範、Hadoop與Spark生態系統 (Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem) 1052DM02 MI4 (M2244) (3069) Thu, 8, 9 (15:10-17:00) (B130) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management, Tamkang University 淡江大學 資訊管理學系 http://mail. tku.edu.tw/myday/ 2017-02-23 1 課程大綱 (Syllabus) 週次 (Week) 日期 (Date) 內容 (Subject/Topics) 1 2017/02/16 巨量資料探勘課程介紹 (Course Orientation for Big Data Mining) 2 2017/02/23 巨量資料基礎:MapReduce典範、Hadoop與Spark生態系統 (Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem) 3 2017/03/02 關連分析 (Association Analysis) 4 2017/03/09 分類與預測 (Classification and Prediction) 5 2017/03/16 分群分析 (Cluster Analysis) 6 2017/03/23 個案分析與實作一 (SAS EM 分群分析): Case Study 1 (Cluster Analysis – K-Means using SAS EM) 7 2017/03/30 個案分析與實作二 (SAS EM 關連分析): Case Study 2 (Association Analysis using SAS EM) 2 課程大綱 (Syllabus) 週次 (Week) 日期 (Date) 內容 (Subject/Topics) 8 2017/04/06 教學行政觀摩日 (Off-campus study) 9 2017/04/13 期中報告 (Midterm Project Presentation) 10 2017/04/20 期中考試週 (Midterm Exam) 11 2017/04/27 個案分析與實作三 (SAS EM 決策樹、模型評估): Case Study 3 (Decision Tree, Model Evaluation using SAS EM) 12 2017/05/04 個案分析與實作四 (SAS EM 迴歸分析、類神經網路): Case Study 4 (Regression Analysis, Artificial Neural Network using SAS EM) 13 2017/05/11 Google TensorFlow 深度學習 (Deep Learning with Google TensorFlow) 14 2017/05/18 期末報告 (Final Project Presentation) 15 2017/05/25 畢業班考試 (Final Exam) 3 2017/02/23 巨量資料基礎: MapReduce典範、 Hadoop與Spark生態系統 (Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem) 4 Big Data Analytics and Data Mining 5 Architectures of Big Data Analytics 6 Architecture of Big Data Analytics Big Data Sources Big Data Transformation Middleware * Internal * External * Multiple formats * Multiple locations * Multiple applications Big Data Platforms & Tools Raw Data Hadoop Transformed MapReduce Pig Data Extract Hive Transform Jaql Load Zookeeper Hbase Data Cassandra Warehouse Oozie Avro Mahout Traditional Others Format CSV, Tables Big Data Analytics Applications Queries Big Data Analytics Reports OLAP Data Mining Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications 7 Architecture of Big Data Analytics Big Data Sources * Internal * External * Multiple formats * Multiple locations * Multiple applications Big Data Transformation Big Data Platforms & Tools Data Mining Big Data Analytics Applications Middleware Raw Data Hadoop Transformed MapReduce Pig Data Extract Hive Transform Jaql Load Zookeeper Hbase Data Cassandra Warehouse Oozie Avro Mahout Traditional Others Format CSV, Tables Big Data Analytics Applications Queries Big Data Analytics Reports OLAP Data Mining Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications 8 Architecture for Social Big Data Mining (Hiroshi Ishikawa, 2015) Enabling Technologies • Integrated analysis model Analysts Integrated analysis • Model Construction • Explanation by Model Conceptual Layer • • • • Natural Language Processing Information Extraction Anomaly Detection Discovery of relationships among heterogeneous data • Large-scale visualization • Parallel distrusted processing Data Mining Multivariate analysis Application specific task Logical Layer Software • Construction and confirmation of individual hypothesis • Description and execution of application-specific task Social Data Hardware Physical Layer Source: Hiroshi Ishikawa (2015), Social Big Data Mining, CRC Press 9 Business Intelligence (BI) Infrastructure Source: Kenneth C. Laudon & Jane P. Laudon (2014), Management Information Systems: Managing the Digital Firm, Thirteenth Edition, Pearson. 10 Data Warehouse Data Mining and Business Intelligence Increasing potential to support business decisions End User Decision Making Data Presentation Visualization Techniques Business Analyst Data Mining Information Discovery Data Analyst Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems Source: Jiawei Han and Micheline Kamber (2006), Data Mining: Concepts and Techniques, Second Edition, Elsevier DBA 11 The Evolution of BI Capabilities Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 12 Data Science and Business Intelligence Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015 13 Data Science and Business Intelligence Predictive Analytics and Data Mining (Data Science) Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015 14 Predictive Analytics Data Science and and Data Mining Business Intelligence (Data Science) Structured/unstructured data, many types of sources, very large datasets Optimization, predictive modeling, forecasting statistical analysis What if…? What’s the optimal scenario for our business? What will happen next? What if these trends countinue? Why is this happening? Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015 15 Data Mining at the Intersection of Many Disciplines ial e Int tis tic s c tifi Ar Pattern Recognition en Sta llig Mathematical Modeling Machine Learning ce DATA MINING Databases Management Science & Information Systems Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 16 A Taxonomy for Data Mining Tasks Data Mining Learning Method Popular Algorithms Supervised Classification and Regression Trees, ANN, SVM, Genetic Algorithms Classification Supervised Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms Regression Supervised Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM Unsupervised Apriory, OneR, ZeroR, Eclat Link analysis Unsupervised Expectation Maximization, Apriory Algorithm, Graph-based Matching Sequence analysis Unsupervised Apriory Algorithm, FP-Growth technique Unsupervised K-means, ANN/SOM Prediction Association Clustering Outlier analysis Unsupervised K-means, Expectation Maximization (EM) Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 17 Traditional Analytics BI and Analytics Data Mart Operational Data Sources Data Mart EDW Analytic Mart Analytic Mart Unstructured, Semi-structured and Streaming data (i.e. sensor data) handled often outside the Warehouse flow Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics 18 Hadoop as a “new data” Store BI and Analytics Data Mart Operational Data Sources Data Mart EDW Analytic Mart Analytic Mart Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics 19 Hadoop as an additional input to the EDW BI and Analytics Data Mart Operational Data Sources Data Mart EDW Analytic Mart Analytic Mart Analytic Mart Data Mart Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics 20 Hadoop Data Platform As a “staging Layer” as part of a “data Lake” – Downstream stores could be Hadoop, data appliances or an RDBMS BI and Analytics Operational Data Sources EDW Data Mart Data Mart Analytic Mart Analytic Mart Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics 21 SAS Big data Strategy – SAS areas Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics 22 SAS Big data Strategy – SAS areas Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics 23 SAS® Within the HADOOP ECOSYSTEM EG User Interface ® SAS User SAS® Enterprise Guide® EM SAS® Data Integration Data Processing SAS® Enterprise Miner™ Base SAS & SAS/ACCESS® to Hadoop™ SAS® In-Memory Statistics for Haodop Pig Hive Next-Gen ® SAS User In-Memory Data Access SAS Embedded Process Accelerators Impala Map Reduce File System SAS® Visual Analytics SAS Metadata Metadata Data Access VA SAS® LASR™ Analytic Server SAS® HighPerformance Analytic Procedures MPI Based HDFS Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics 24 SAS enables the entire lifecycle around HADOOP SAS enableS the entire lifecycle around HADOOP Done using either the Data Preparation, Data Exploration or Build Model Tools SAS Visual Analytics Decision Manager IDENTIFY / FORMULATE PROBLEM EVALUATE / MONITOR RESULTS SAS Scoring Accelerator for Hadoop SAS Code Accelerator for Hadoop SAS Visual Analytics SAS Visual Statistics SAS In-Memory Statistics for Hadoop DATA PREPARATION DEPLOY MODEL DATA EXPLORATION VALIDATE MODEL TRANSFORM & SELECT Decision Manager Done using either the Data Preparation, Data Exploration or Build Model Tools BUILD MODEL SAS High Performance Analytics Offerings supported by relevant clients like SAS Enterprise Miner, SAS/STAT etc. Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics 25 Data Mining Process 26 Data Mining Process • • • • A manifestation of best practices A systematic way to conduct DM projects Different groups has different versions Most common standard processes: – CRISP-DM (Cross-Industry Standard Process for Data Mining) – SEMMA (Sample, Explore, Modify, Model, and Assess) – KDD (Knowledge Discovery in Databases) Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 27 Data Mining Process (SOP of DM) What main methodology are you using for your analytics, data mining, or data science projects ? Source: http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html 28 Data Mining Process Source: http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html 29 Data Mining: Core Analytics Process The KDD Process for Extracting Useful Knowledge from Volumes of Data Source: Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39(11), 27-34. 30 Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39(11), 27-34. 31 Data Mining Knowledge Discovery in Databases (KDD) Process (Fayyad et al., 1996) Source: Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39(11), 27-34. 32 Knowledge Discovery in Databases (KDD) Process Data mining: core of knowledge discovery process Data Mining Evaluation and Presentation Knowledge Patterns Selection and Transformation Cleaning and Integration Databases Data Warehouse Task-relevant Data Flat files Source: Jiawei Han and Micheline Kamber (2006), Data Mining: Concepts and Techniques, Second Edition, Elsevier 33 Data Mining Process: CRISP-DM 1 Business Understanding 2 Data Understanding 3 Data Preparation Data Sources 6 4 Deployment Model Building 5 Testing and Evaluation Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 34 Data Mining Process: CRISP-DM Step 1: Business Understanding Step 2: Data Understanding Step 3: Data Preparation (!) Step 4: Model Building Step 5: Testing and Evaluation Step 6: Deployment Accounts for ~85% of total project time • The process is highly repetitive and experimental (DM: art versus science?) Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 35 Data Preparation – A Critical DM Task Real-world Data Data Consolidation · · · Collect data Select data Integrate data Data Cleaning · · · Impute missing values Reduce noise in data Eliminate inconsistencies Data Transformation · · · Normalize data Discretize/aggregate data Construct new attributes Data Reduction · · · Reduce number of variables Reduce number of cases Balance skewed data Well-formed Data Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 36 Data Mining Process: SEMMA Sample (Generate a representative sample of the data) Assess Explore (Evaluate the accuracy and usefulness of the models) (Visualization and basic description of the data) SEMMA Model Modify (Use variety of statistical and machine learning models ) (Select variables, transform variable representations) Source: Turban et al. (2011), Decision Support and Business Intelligence Systems 37 Data Mining Processing Pipeline (Charu Aggarwal, 2015) Data Preprocessing Data Collection Feature Extraction Cleaning and Integration Analytical Processing Building Block 1 Building Block 2 Output for Analyst Feedback (Optional) Feedback (Optional) Source: Charu Aggarwal (2015), Data Mining: The Textbook Hardcover, Springer 38 Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem 39 Source: https://www.thalesgroup.com/en/worldwide/big-data/big-data-big-analytics-visual-analytics-what-does-it-all-mean 40 MapReduce Paradigm 41 MapReduce Paradigm Big Data Map Map0 Map1 Map2 Map3 Reduce Reduce0 Reduce1 Reduce2 Reduce3 MapReduce Data Output Data 42 MapReduce Word Count Input Dog Love Cat Bird Love Bird Dog Bird Cat Source: https://www.edureka.co/blog/mapreduce-tutorial/ 43 MapReduce Word Count Input Output Dog Love Cat Bird Love Bird Dog Bird Cat Bird, 3 Cat, 2 Dog, 2 Love, 2 Source: https://www.edureka.co/blog/mapreduce-tutorial/ 44 MapReduce Word Count Input Dog Love Cat Bird Love Bird Dog Bird Cat Split Map Shuffle Reduce Bird, (1, 1, 1) Bird, 3 Dog Love Cat Dog, 1 Love, 1 Cat, 1 Cat, (1, 1) Cat, 2 Bird Love Bird Bird, 1 Love, 1 Bird, 1 Dog, (1, 1) Dog, 2 Love, (1, 1) Love, 2 Dog Bird Cat Output Bird, 3 Cat, 2 Dog, 2 Love, 2 Dog, 1 Bird, 1 Cat, 1 Source: https://www.edureka.co/blog/mapreduce-tutorial/ 45 Hadoop Ecosystem 46 The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. Source: http://hadoop.apache.org/ 47 MapReduce HDFS Processing Storage Source: http://hadoop.apache.org/ 48 Big Data with Hadoop Architecture Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf 49 Big Data with Hadoop Architecture Logical Architecture Processing: MapReduce Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf 50 Big Data with Hadoop Architecture Logical Architecture Storage: HDFS Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf 51 Big Data with Hadoop Architecture Process Flow Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf 52 Big Data with Hadoop Architecture Hadoop Cluster Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf 53 Hadoop Ecosystem Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing 54 HDP (Hortonworks Data Platform) A Complete Enterprise Hadoop Data Platform Source: http://hortonworks.com/hdp/ 55 Apache Hadoop Hortonworks Data Platform Source: http://hortonworks.com/hdp/ 56 Hadoop and Data Analytics Tools Source: http://hortonworks.com/hdp/ 57 Hadoop 1 Hadoop 2 Source: http://hortonworks.com/hadoop/tez/ 58 Big Data Solution EG EM VA Source: http://www.newera-technologies.com/big-data-solution.html 59 Traditional ETL Architecture Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf 60 Offload ETL with Hadoop (Big Data Architecture) Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf 61 Spark Ecosystem 62 Lightning-fast cluster computing Apache Spark is a fast and general engine for large-scale data processing. Source: http://spark.apache.org/ 63 Logistic regression in Hadoop and Spark Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Source: http://spark.apache.org/ 64 Ease of Use • Write applications quickly in Java, Scala, Python, R. Source: http://spark.apache.org/ 65 Word count in Spark's Python API text_file = spark.textFile("hdfs://...") text_file.flatMap(lambda line: line.split()) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a+b) Source: http://spark.apache.org/ 66 Spark and Hadoop Source: http://spark.apache.org/ 67 Spark Ecosystem Source: http://spark.apache.org/ 68 Spark Ecosystem Spark Spark Streaming Kafka MLlib Flume (machine learning) Spark SQL GraphX H2O Hive Titan (graph) HBase Cassandra HDFS Source: Mike Frampton (2015), Mastering Apache Spark, Packt Publishing 69 Hadoop vs. Spark HDFS read HDFS write HDFS read HDFS write Iter. 1 Iter. 2 Iter. 1 Iter. 2 Input Input Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing 70 Steps to Install Hadoop on a Personal Computer (Windows/OS X) Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5 71 Hodoop: Linux Based Software LINUX LINUX LINUX LINUX Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5 72 Appliance Personal Computer (Windows / OS X) Virtual Machine (VirtualBox / VMWare) Linux Hadoop Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5 73 Connection to Hadoop Personal Computer (Windows / OS X) Access from host Browser Virtual Machine (VirtualBox / VMWare) Linux Hadoop Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5 74 Steps to Install Hadoop on a Personal Computer (Windows/OS X) Step 1. Download and Install VirtualBox Step 2. Download Appliance Step 3. Import Appliance Step 4. Configure Virtual Machine (VM) Step 5. Start Virtual Machine (VM) Step 6. Test Connection From Host Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5 75 Virtual Box https://www.virtualbox.org/ 76 Steps to Install Hadoop on a Personal Computer (Windows/OS X) Step 1. Download and Install VirtualBox Step 2. Download Appliance Hortonworks Sandbox Step 3. Import Appliance Step 4. Configure Virtual Machine (VM) Step 5. Start Virtual Machine (VM) Step 6. Test Connection From Host Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5 77 Hortonworks Sandbox The easiest way to get started with Enterprise Hadoop http://hortonworks.com/products/hortonworks-sandbox/#install 78 Get started on Hadoop with these tutorials based on the Hortonworks Sandbox http://hortonworks.com/tutorials/ 79 Apache Hadoop http://hadoop.apache.org/ 80 Apache Hadoop http://hadoop.apache.org/releases.html#Download 81 Apache Hadoop YARN Source: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html 82 Apache Spark http://spark.apache.org/ 83 References • EMC Education Services (2015), Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley • Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing • Mike Frampton (2015), Mastering Apache Spark, Packt Publishing • Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics, http://www.slideshare.net/deepakramanathan/sasmodernization-architectures-big-data-analytics 84