Big Data Processing & Analytics
Vassilis Christophides
[email protected]
https://who.rocq.inria.fr/Vassilis.Christophides/Big/
Ecole CentraleSupélec, Winter 2017

The Data Avalanche: From Science to Business

Shifting Paradigm in Sciences
– A thousand years ago, science was empirical: describing natural phenomena
– In the last few hundred years, a theoretical branch: using models and generalizations
– In the last few decades, a computational branch: simulating complex phenomena
– Today, the fourth paradigm (eScience): data exploration, unifying theory, experiment, and simulation. Data is captured by instruments or generated by simulators and processed by software; the resulting information/knowledge is stored in computers, and the scientist analyzes the data using data management services and statistics. © Jim Gray

Large Synoptic Survey Telescope (LSST)
– 100-200 petabyte image archive; 20-40 petabyte database catalog
– LSST will take more than 800 panoramic images each night, recording the entire visible sky twice each week
– A ten-year time series (~2020-2030) imaging of the night sky: mapping the Universe!
– 8.4-meter diameter primary mirror = 10 square degrees!
– http://www.lsst.org
– 3.2 billion-pixel camera

Large Hadron Collider (LHC)
– Protons collide some 1 billion times per second, and each collision produces about a megabyte of data
– Even after filtering out about 99% of it, scientists are left with around 30 petabytes each year to analyze for a wide range of physics experiments, including studies on the Higgs boson: reconstructing particle trajectories, particle types, and their speeds
– A 27-kilometre ring of superconducting magnets, 9 km in diameter, ≈100 m below ground
– http://home.web.cern.ch/topics/large-hadron-collider

Human Brain Project (HBP)
– Generate and interpret strategically selected data needed to build multilevel atlases and unifying models of the brain
– Use anatomical frameworks to organize and convey spatially and temporally distributed functional information about the brain at all organizational levels, from genes to cognition, and at all the relevant spatial and temporal scales
– http://blogs.scientificamerican.com/sa-visual/2014/04/02/how-do-you-visualize-the-brain/

Data-driven Discovery
– Innovation is no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable fashion. © John R. Johnson

Disruption Time!
– Until recently, you were a folder. Now, you are your data!

Blurring the Boundaries of Real & Virtual Worlds
– Ubiquitous sensing & reasoning in physical and cyber worlds

Digital Disruption Already Happening!
– The largest telco company owns no telco infrastructure (Skype)
– The world's largest movie house owns no cinemas (Netflix)
– The largest software companies don't write the apps (Apple, Google)
– The world's most valuable retailer has no inventory (Alibaba)
– The most popular media owner creates no content (Facebook)
– The world's largest taxi company owns no vehicles (Uber)
– The largest accommodation provider owns no real estate (Airbnb)
– The fastest-growing banks actually have no money (SocietyOne)
– http://www.independent.co.uk/news/business/comment/hamishmcrae/facebook-airbnb-uber-and-the-unstoppable-rise-of-the-content-non-generators-10227207.html

The Data Tsunami
– Mobile devices (tracking all objects all the time)
– Scientific instruments (collecting all sorts of data)
– Social media and networks (all of us are generating data)
– Sensor networks (measuring all kinds of data)
Three main reasons:
– Processes are increasingly automated
– Systems are increasingly interconnected
– People are social and increasingly generate data exhaust by interacting online

The New Moore's Law
– The Economist: digital information grows 10x every 5 years!
– Data volume is increasing exponentially: a 44x increase from 2009 to 2020, from 0.8 ZB to 35 ZB

Big Data = Transactions + Interactions + Observations
– IoT sensors are reporting even more personal data than humans are!
– Megabytes → Gigabytes → Terabytes → Petabytes, with increasing data variety, velocity, & veracity
– hortonworks.com/blog/7-key-drivers-for-the-big-data-market

What Makes Data, "Big" Data?

Definitions
– No single standard definition…
– "Big Data" is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… (McKinsey Global Inst.)
– "Big Data" is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making (Gartner)

The Four V's of Big Data

Characteristics of Big Data: 1 – Scale (Volume)
– Web data, mobile data, ERP/CRM data
– Too big: petabyte-scale collections, or lots of (not necessarily big) data sets

Characteristics of Big Data: 2 – Speed (Velocity)
– Financial data, IoT data, social data
– Too fast: needs to be processed quickly so we can react promptly

Characteristics of Big Data: 3 – Complexity (Variety)
– Medical imaging data, measurement data, video data, textual data
– Too diverse: does not fit neatly in any existing tool

Characteristics of Big Data: 4 – Quality (Veracity)
– Too crappy: its quality needs to be assessed

Summary of the 4 Big Data Characteristics
– Volume (data intensity): the amount of data generated that must be ingested, processed & analyzed to make data-driven decisions. Properties: batch processing. Drivers: high number of data sources, high-resolution sensors.
– Velocity: how fast data is produced and ingested, and the speed at which data is transformed into insight. Properties: streaming/online processing, (near) real-time analytics. Drivers: real-time, high-rate data acquisition; low cost of hardware.
– Variety: the degree of diversity (and structuring) of data from sources both inside and outside an organization. Properties: multi-modality, complex interrelations, sequences, implicit semantics. Drivers: social media, scientific data, video, M2M/IoT.
– Veracity: the quality and traceability of data. Properties: consistency, completeness, integrity, ambiguity. Drivers: crowd data production, human sensing.

We've Moved into a New Era of Data Analytics
– Volume: 12+ terabytes of Tweets created daily
– Velocity: 100's of trade events per second
– Variety: 5+ million different types of data
– Veracity: only 1 in 3 decision makers trust their information

Declining % of Data an Organization Can Analyse
– http://www.youtube.com/watch?v=B27SpLOOhWw

Big Data Processing

Big Data: Old Wine in a New Bottle?
– No, it is a different type of data wave: one needs to put together many sources of information coming through many different channels, throwing away what is not important, working under resource (time) constraints, and serving real users' needs
– Yes, most of these problems have been the focus of data management research for years
– The Big Data movement: an exponential growth of data enthusiasts! Proliferation of data producers and consumers, e.g., on the Web and in scientific, social, government, urban, home, and personal spaces
– The main issue is putting all this together to satisfy concrete data analysis needs via innovative technology

Beyond Big Data Size!
– Volume = Length × Width × Depth
– Length: collect & compare; Width: curate & integrate; Depth: analyze & understand
– Massive Data Analysis, J. Freire & J. Simeon, New York University course, 2013

The WRONG Picture!

Big Data vs Deep Insights
– data → knowledge
– Data exploration is hard regardless of whether data are big or small!

The TRUE Picture!
– The time for developing an analysis with small data vs. the time for developing an analysis with big data
– Big Data Infrastructures: Exploiting the Power of Big Data, T. Sellis, School of CS & IT, Athens, 2015

Exploring Big Data: What is Hard?
– Scalability for computations? NOT REALLY! There is lots of work on distributed systems, parallel databases, …; cloud elasticity: add more nodes!
– But there is no one-size-fits-all solution: often, you have to build your own…
– Rapidly-evolving technology, many different tools
– Different computation models: need new algorithms!

Advanced Analytics Requires a Robust, Comprehensive Information Platform
– The bottleneck is the human (data scientist)!
– ©2011 IBM Corporation

Big Data Research Agenda
– Acquisition, Storage, and Management of "Big Data":
  – Data representation, storage, and retrieval
  – New parallel data architectures, including clouds
  – Data management policies, including privacy and secure access
  – Communication and storage devices with extreme capacities
  – Sustainable economic models for access and preservation
– Data Analytics:
  – Computational, mathematical, statistical, and algorithmic techniques for modelling high-dimensional data
  – Learning, inference, prediction, and knowledge discovery for large volumes of heterogeneous data sets
  – Data mining to enable automated hypothesis generation, event correlation, and anomaly detection from data streams
  – Information fusion of multiple data sources
– Data Sharing and Collaboration:
  – Tools for distant data sharing, real-time visualization, and software reuse of complex data sets
  – Cross-disciplinary model, information, and knowledge sharing
  – Remote operation and real-time access to distant data sources and instruments
– Source: Big Data R&D Initiative, Howard Wactlar, NIST Big Data Meeting, June 2012

Big Data Mining

What to Do with Big Data?
– Data contains value and knowledge
– But to extract the knowledge, data needs to be stored, managed, and ANALYZED
– Data analysis includes: mining/summarizing large datasets; extracting knowledge from past data; predicting trends in future data
– Data Mining ≈ Big Data ≈ Data Analytics ≈ Data Science

A Bit of Terminology
– Data mining is the old big data: an overused term covering anything from collecting, storing, curating, and visualizing data, to machine learning / AI (which predates the term data mining), to non-ML data mining (as in "knowledge discovery", where the focus is on new knowledge, not on learning existing knowledge)
– "Business intelligence" and "business analytics" are marketing terms stressing that more data leads to better business decisions (periodic reporting as well as ad hoc queries; importance of tools and dashboards)
– Most "Big Data" today isn't ML: it's Extract, Transform, Load (ETL), so it is replacing data warehousing (except for computational advertising)
– Business Intelligence aims at descriptive statistics over data with high information density, to measure things, detect trends, etc.
– Big Data targets inductive statistics over data with low information density, whose huge volume allows one to infer laws (regressions, …), thus giving Big Data (within the limits of inductive reasoning) some predictive capabilities (called Deep Analytics)

Data Analysis: ERP & CRM Examples
– Who are our lowest/highest-margin customers?
– What is the most effective distribution channel?
– What product promotions have the biggest impact on revenue?
– Who are my customers and what products are they buying?
– Which customers are most likely to go to the competition?
– What impact will new products/services have on revenue and margins?
– Agrawal et al., VLDB 2010 Tutorial

The Data Analysis Spectrum (Source: Gartner; ordered by increasing difficulty and value)
– Descriptive Analytics: What happened? What is happening? Monitoring (dashboards, scorecards)
– Diagnostic Analytics: Why did it happen?
– Predictive Analytics: What might happen?
– Prescriptive Analytics: How can we make it happen?

Classic ML Algorithms Used for Decades
– K-means
– Logistic Regression
– KNN (k-Nearest Neighbours)
– Naïve Bayes
– Decision Trees
– SVD (Singular Value Decomposition)

Growing Need for Big ML Tasks
– Make sense of images and audio; find significant genes; make sense of documents; find similar users
– Big ML Software for Modern ML Algorithms, Q. Ho, E. P. Xing

ML Computation vs. Classical Computing Programs
– An ML program is optimization-centric and iteratively convergent
– A traditional program is operation-centric and deterministic
– Parallelization Strategies and Systems for Distributed Machine Learning, E. Xing

Data Mining: Different Cultures
– Data mining overlaps with: databases (DB): large-scale data, simple queries; machine learning (ML): small data, complex models; CS theory: (randomized) algorithms
– Different cultures: to a DB person, data mining is an extreme form of analytic processing, i.e., queries that examine large amounts of data; the result is the query answer. To an ML person, data mining is the inference of models, with ML algorithms as the "engine" that solves them; the result is the parameters of the model
– Big Data urges a cross-culture curriculum stressing: machine learning/AI, scalability, statistics, algorithms, data mining, pattern recognition, computing architectures, automation for handling large data, and database systems

Hadoop is not Suited to Iterative ML
– Typically we want to analyse a dataset by accessing the data several times, with many trial-and-error steps; it is easy to get lost…
– Most existing data mining/ML methods were designed without considering data access and communication of intermediate results: they use data iteratively, assuming it is readily available
– Hadoop is not efficient at iterative programs: they need many map-reduce phases, and HDFS disk I/O becomes the bottleneck!

Why Need new Big ML Systems?
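The iterative-ML bottleneck above can be made concrete with k-means, one of the classic algorithms listed earlier. A minimal single-machine sketch (illustrative only, not course code): each loop iteration corresponds to one MapReduce job on Hadoop, and the full dataset is touched in every pass, which is exactly the per-iteration HDFS I/O the slide complains about.

```python
# Sketch of k-means as an iterative, convergent computation.
# On Hadoop, each loop iteration would be a separate MapReduce job:
# assign() plays the map phase, update() the reduce phase, and the data
# would be re-read from HDFS on every single iteration.

def assign(points, centroids):
    """Map: label each point with the index of its nearest centroid."""
    def nearest(p):
        return min(range(len(centroids)),
                   key=lambda c: (p[0] - centroids[c][0]) ** 2 +
                                 (p[1] - centroids[c][1]) ** 2)
    return [nearest(p) for p in points]

def update(points, labels, k):
    """Reduce: recompute each centroid as the mean of its members."""
    centroids = []
    for c in range(k):
        members = [p for p, lbl in zip(points, labels) if lbl == c]
        centroids.append((sum(p[0] for p in members) / len(members),
                          sum(p[1] for p in members) / len(members)))
    return centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = [(0, 0), (10, 10)]           # initial guesses
for _ in range(10):                      # 10 full passes over the data
    labels = assign(points, centroids)
    centroids = update(points, labels, 2)
print(labels, centroids)
```

With these toy points the centroids converge to the two cluster means, (0, 0.5) and (10, 10.5); the point is that every pass touches every data point, which is what makes disk-based iteration so expensive at scale.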
ML practitioner’s view Winter 2017 Systems view Focus on Focus on High iteration throughput (more correctness, iterations per sec) fewer iterations to converge strong fault-tolerant atomic ops … but assume an ideal system … but assume ML algo is a black box for (t = 1 to T) { Fast-but-unstable Asynchronous Parallel Slow-but-correct doThings()z Bulk Sync. Parallel parallelUpdate(x,θ) doOtherThings() } Oversimplify ML issues Oversimplify systems issues e.g. ML algos “still work” without e.g. need machines to perform proof under different exec. models consistently e.g. “easy to rewrite” in chosen e.g. need to sync parameters any abstraction (MapR, vertex, etc.) time Big ML Software for Modern ML Algorithms Q. Ho, E. P. Xing 53 An Alg/Sys INTERFACE for Big ML Winter 2017 This is an on-going research topic methods not ready Alone, neither side has full picture ... New opportunities exist in the middle! tools are not convenient platforms rapidly change, … Big ML Software for Modern ML Algorithms Q. Ho, E. P. Xing 54 22 22 The Big ML Research Winter 2017 Roughly there are two types of approaches Parallelize existing (single-machine) algorithms (data, model, hybrid) Design new algorithms particularly for massively parallel settings of course there are things in between To have technical breakthroughs in big-data analytics, we should know both algorithms and systems well, and consider them together 55 Winter 2017 The Big ML “Stack” - More than Just Software! Big ML Software for Modern ML Algorithms Q. Ho, E. P. Xing 56 23 23 MapReducable? Winter 2017 57 Big Data Analytics Platforms Online Machine Learning Winter 2017 Big Machine Learning (Mahout, MillWheel, R/Hadoop) (SAMOA, Rapid Miner, OIIDM) IoT Data Analysis (Parstream, Vitria, Splunk, virdata) Big Time Series Analytics (Striim, Storm, Spark, Google RT, Apache S4, MS , Azure, AWS Kinesis) (InfluxDB, AT&T M2X, IBM Informix TS, OpenTSDB) Source: Vmware 59 24 24 Winter 2017 What this Course is About? 
The Need: Making Sense of the Big Data Universe
– Frameworks:
  – New computing paradigms: Cloud, Hadoop – Map/Reduce
  – New storage solutions: NoSQL, column stores, BigTable
  – New languages: JAQL, Pig Latin
  – We will survey these and how they relate to previous technologies
– Analysis:
  – New frameworks demand new approaches to exploring data
  – We will study algorithms for processing and mining data in Big-Data environments
– Tackling Big Data, M. Cooper & P. Mell, NIST Information Technology Laboratory, Computer Security Division

What You Will Learn
– Understand different models of computation: MapReduce; streams and online algorithms
– Mine different types of data: high-dimensional data; infinite/never-ending data
– Use different mathematical 'tools': hashing (LSH, Bloom filters); dynamic programming (frequent itemsets)
– Solve real-world problems: duplicate document detection; market basket analysis

Tentative Course Schedule
– Lecture 1 (13/01): Course Overview
– Lecture 2 (20/01): The Map-Reduce Software Ecosystem
– Lecture 3 (27/01): Finding Similar Items
– Lecture 4 (17/02): Massive Data Warehousing
– Lectures 5-6 (24/02-03/03): Mining Association Rules
– Lectures 7-8 (10-17/04): Analysing Data Streams
– © NY Times

Course Text Books
– Jure Leskovec, Anand Rajaraman, Jeff Ullman. "Mining of Massive Datasets", 2nd edition, Cambridge University Press, 2014. http://www.cambridge.org/gr/academic/subjects/computer-science/knowledge-management-databases-and-data-mining/mining-massive-datasets-2nd-edition (free download: http://www.mmds.org)
– Donald Miner, Adam Shook. "MapReduce Design Patterns", O'Reilly Media, 2013. http://shop.oreilly.com/product/0636920025122.do (free chapter samples: http://cdn.oreillystatic.com/oreilly/booksamplers/9781449327170_sampler.pdf)

Course Organization
– Three programming exercises (40%): Map/Reduce & Hadoop
– Final examination (60%)

Words of Caution
– We can only cover a small part of the big data universe: do not expect all possible architectures, programming models, theoretical results, or vendors to be covered
– This really is an algorithms course, not a basic programming course, but you will need to do a lot of non-trivial programming
– There are few certain answers, as people in research and in leading tech companies are still trying to understand how to deal with big data
– We are working with cutting-edge technology: bugs, lack of documentation, a new Hadoop API
– In short: you have to be able to deal with inevitable frustrations and plan your work accordingly… but if you can do that and are willing to invest the time, it will be a rewarding experience

References
– CS246: Mining Massive Datasets, Jure Leskovec, Stanford University, 2014
– CS9223: Massive Data Analysis, J. Freire & J. Simeon, New York University, 2013
– CS 6240: Parallel Data Processing in MapReduce, Mirek Riedewald, Northeastern University, 2014
– Big Data Infrastructures: Exploiting the Power of Big Data, T. Sellis, School of CS & IT, Athens, 2015
– CS525: Special Topics in DBs: Large-Scale Data Management, Advanced Analytics on Hadoop, Mohamed Eltabakh, Spring 2013
– Big-data Analytics: Challenges and Opportunities, Chih-Jen Lin, Department of Computer Science, National Taiwan University, August 30, 2014
– Knowledge Discovery and Data Mining, Evgueni Smirnov, Maastricht School on Data Mining, Department of Knowledge Engineering, Maastricht University, The Netherlands, August 27-30, 2013

The Big Data Analysis Pipeline
– Major processing steps; major challenges
– http://cacm.acm.org/magazines/2014/7/176204-big-data-and-its-technical-challenges/abstract

Big Data Processing & Analysis Framework
– http://www.pinterest.com/pin/272327108689099496/
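As a first taste of the MapReduce model of computation listed under "What You Will Learn", here is a toy single-machine word-count sketch (illustrative only; real Hadoop jobs are written against the Hadoop Java API or a streaming wrapper):

```python
# Word count in the MapReduce style: a map phase that emits (key, value)
# pairs, a shuffle that groups pairs by key, and a reduce phase that
# aggregates each group. Single-machine toy, for intuition only.
from itertools import groupby

def map_phase(docs):
    # map: emit (word, 1) for every word occurrence in every document
    return [(word, 1) for doc in docs for word in doc.split()]

def shuffle_and_reduce(pairs):
    # shuffle: bring identical keys together (here: sorting),
    # reduce: sum the counts of each group
    pairs = sorted(pairs, key=lambda kv: kv[0])
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

docs = ["big data big ideas", "data never sleeps"]
counts = shuffle_and_reduce(map_phase(docs))
print(counts)   # {'big': 2, 'data': 2, 'ideas': 1, 'never': 1, 'sleeps': 1}
```

On a real cluster the framework performs the shuffle between map and reduce tasks, and both phases run in parallel across machines; the programmer writes only the map and reduce functions.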