Course Overview: Introduction to Big Data Analytics
J. H. Wang
Feb. 21, 2017
Big Data Analytics, Spring 2017 NTUT CSIE

Instructor & TA
• Instructor
  – J. H. Wang (王正豪)
  – Associate Professor, CSIE, NTUT
  – Office: R1534, Technology Building
  – E-mail: [email protected]
  – Tel: ext. 4238
  – Office Hour: 2:00-5:00 pm, every Wednesday and Thursday
• TA
  – Mr. Lin (@ R1424, Technology Building)

Course Description
• Course Web Page:
  – http://www.ntut.edu.tw/~jhwang/BDA/
  – for the latest announcements and updates of schedule, slides, and homeworks
• Time: 2:10-4:00pm, Tue., 3:10-4:00pm, Fri.
• Classroom: R227, 6th Teaching Building
• Textbooks: (selected chapters)
  – Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann Publishers, July 2011.
  – Tom White, Hadoop: The Definitive Guide, 4th ed., O’Reilly Media, 2015.
• Prerequisites:
  – Basic knowledge of data structures (and algorithms), database systems, and operating systems
  – Programming experience is *required* for completing homeworks & projects

Target Audience
• CSIE juniors and seniors
• Students who are interested in big data analytics and willing to learn practical skills

Additional Reading Materials
• Books
  – Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.
  – Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, Mining of Massive Datasets, 2nd Edition, Cambridge University Press, November 2014.
  – Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia, Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly Media, January 2015.
• Online documents: Hadoop, Spark, …
• Tutorials on big data analytics in major conferences
• Academic papers

Grading Policy
• Homework assignments and programming exercises: ~40%
• Mid-term exam: ~25%
• Term project: ~35%
  – Including proposal, presentation, and final report
• All homeworks, projects, and reports must be submitted *before* the end of the semester (Jun. 26, 2017)

Homeworks and Projects
• About 2-3 written assignments
  – Concepts
• About 2-3 team-based programming exercises
  – Maximum number of students per team: 4
• The term project (to be detailed later)
  – Either team-based system development
    • e.g. extension to exercises
  – Or academic paper presentation
    • Only one person per team allowed

About the Term Project
• The score you’ll get depends on the functionality, difficulty, and quality of your project
  – For system development:
    • System functions and correctness
    • You can either write your own program, or call existing open source code or libraries (but NOT merely execute an existing binary)
  – For academic paper presentation:
    • The quality of the paper and of your presentation is critical
    • Major methods/experimental results *must* be presented
    • Papers from top conferences are strongly suggested
  – Proposals, presentations, and reports are *required* for each team, and will be counted in the score

Homework Submission
• Online Submission
  – Systems, programs, project proposals, and project reports in electronic files must be submitted to the website.
  – Submission instructions to be announced.

What This Course is about
• This course can help you
  – Understand the general idea of big data
  – Obtain a rough idea of data mining
  – Have an idea of what Hadoop can do
• And MORE than these!
  – Technical and practical skills for:
    • Environment setup of distributed computing
    • Data analysis
    • Data mining methods
    • Parallel programming

AlphaGo

Google Translate

Google Flu Trends

Big Data: Some Examples
• Topic detection and tracking
• Trend analysis
• Social network analysis
• PageRank
• Predictive analytics
• Many others: healthcare, natural resources, education, public sector, insurance, transportation, finance and crime detection, …

Google News

Google Trends

Google Hot Trends

Daily View

AMiner

What is Big Data?
• Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy. [source: Wikipedia]

Characteristics of Big Data
• META Group (now Gartner) defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume, velocity, and variety [Doug Laney, 2001]
  – Volume: the quantity of generated and stored data
  – Velocity: the speed at which data is generated and processed
  – Variety: the type and nature of data

Characteristics of Big Data
• “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” [Gartner, 2012]

The Four V’s of Big Data [source: IBM]

More Characteristics of Big Data
• Five V’s:
  – Volume
    • 2.5 EB per day (Library of Congress: 300 TB)
  – Velocity
    • In 60 seconds: 350,000 tweets, 300 hours of YouTube video, 171 million emails, 350 GB sensor data from a jet engine
  – Variety
    • Structured, Semi-structured, Unstructured
  – Veracity
    • Quality or fidelity
  – Value
    • Higher veracity, lower processing time -> higher value
• Other V’s:
  – Variability: inconsistency of the data set can hamper processes to handle and manage it

Data analysis vs. Data analytics
• “Analysis is the separation of a whole into its component parts, and analytics is the method of logical analysis.” [source: Merriam-Webster dictionary]
• “Analysis is really a heuristic activity, where scanning through all the data the analyst gains some insight.” [source: Quora.com]
• “Analytics is about applying a mechanical or algorithmic process to derive the insights for example running through various data sets looking for meaningful correlations between them.” [source: Quora.com]

Related Terms
• Data science, predictive analytics
• Business intelligence, FinTech
• IoT, CPS, Industry 4.0
• Smart homes, smart cities
• Data mining, machine learning, artificial intelligence
• Cloud computing, data-intensive computing, parallel computing, distributed computing
• …

Google Trends Comparison: Big data, IoT, Cloud computing

What is Big Data Analytics
• The management of the complete data lifecycle
  – Collecting
  – Cleansing
  – Organizing
  – Storing
  – Analyzing
  – Visualizing
  – Governing

The Goal of the Course
• Big data analytics aims to analyze huge amounts of diverse data with effective algorithms in an efficient way
  – Data preprocessing
  – Data mining
  – Parallel programming in distributed platforms

A Big Picture
[Diagram labels: data source, open APIs / crawler, data analyst, analytical need, topic data extraction, similarity estimation, topic-relevant search, distributed storage, data/task dispatch, distributed analysis, feature extraction, regression / classification / clustering, machine learning, feedback, visualization & presentation]

The Topics to be Covered
• Introduction
• Data mining concepts
  – Data preprocessing
  – Frequent pattern mining
  – Data classification
  – Data clustering
• Distributed platform and parallel programming
  – Introducing distributed platforms: Hadoop, Spark
  – Parallel programming paradigm and concepts
  – MapReduce programming
• Advanced topics
  – Link analysis
  – Mining social network graphs
  – Dimensionality reduction
  – Large-scale machine learning

Tentative Schedule
• Before midterm
  – Introduction (1 wk)
  – Data preprocessing (1 wk)
  – Frequent pattern mining (2 wks)
  – Data classification (2 wks)
  – Data clustering (2 wks)
• After midterm
  – Introducing distributed platforms: Hadoop, Spark (1 wk)
  – Parallel programming paradigm and concepts (1 wk)
  – MapReduce programming (2-3 wks)
  – Term Project Presentation (3-4 wks)

Notes on Homeworks
• Rule 1: Plagiarism is prohibited.
  – Near-duplicate code submissions will all receive the same minimum basic score
• Rule 2: Clear documentation is required in your system projects.
  – Instructions on downloading, installing, configuring, and executing your code and any open source libraries, APIs, or code it uses must be submitted
  – Package control is recommended

More on Term Projects
• Options for term projects
  – Option 1: team-based system project
    • e.g., extension to system exercises
  – Option 2: academic paper presentation
    • Only one person, NOT team-based
• Tentative schedule for all teams:
  – Proposal: *required* one week after midterm (Apr. 28, 2017)
  – Presentations (including demos): *required* in the last three to four weeks (starting as early as May 30, 2017)
  – Final report: *required* before the end of the semester (Jun. 24, 2017)
    • Slides, source code, documentation

For System Development
• You can write your own code in any programming language
• You can also call existing open source APIs or libraries
• But simply running existing binary executables or commercial tools is NOT acceptable
• Selected datasets might be announced in the course
• Any topic relevant to big data analytics
  – Analysis of various types of media (text, Web, social media) with big data characteristics (3V’s) using data mining methods (regression, classification, clustering)

Open Source Tools for Big Data Analytics
• Apache Hadoop, Spark (in Java, Scala, Python, R)
  – For distributed computing and data analysis
• Apache Pig, Hive, Flume, HBase, Cassandra, Alluxio, Mahout, …
  – For data flow, SQL, streaming data, distributed databases, distributed storage, machine learning, …
• Spark SQL, Streaming, MLlib, GraphX

Hadoop Architecture

MapReduce

Spark and Hadoop

Thanks for Your Attention!
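
Appendix: a MapReduce sketch
As a companion to the MapReduce programming topic in the outline, the following is a minimal single-process sketch in plain Python of the map, shuffle, and reduce phases, using word count as the example. It does not use Hadoop or Spark and is not taken from the course materials; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key (a real framework does this across nodes)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially in one process (illustration only)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, chosen only for illustration.
    docs = ["big data analytics", "big data mining", "data mining methods"]
    print(word_count(docs))
    # {'big': 2, 'data': 3, 'analytics': 1, 'mining': 2, 'methods': 1}
```

In a distributed platform, the map and reduce functions would run in parallel on different nodes and the shuffle would move data over the network; here all three steps run in sequence so the data flow is easy to follow.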