Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Fall 2016 Big Data Processing & Analytics V. CHRISTOPHIDES University of Crete & INRIA Paris 1 Fall 2016 The Data Avalanche: From Science to Business 2 1 1 Shifting Paradigm in Sciences Fall 2016 Thousand years ago: science was empirical describing natural phenomena 2 . Last few hundred years: theoretical branch a 4G c2 a 3 2 a using models, generalizations Last few decades: a computational branch simulating complex phenomena The fourth paradigm today (eScience): data exploration unify theory, experiment, and simulation Data captured by instruments or generated by simulator Processed by software Information/Knowledge stored in computer Scientist analyzes data using data management services and statistics © Jim Gray 3 Fall 2016 Large Synoptic Survey Telescope (LSST) –100-200 Petabyte image archive –20-40 Petabyte database catalog LSST will take more than 800 panoramic images each night recording the entire visible sky twice each week Ten-year time series imaging of the night sky – mapping the Universe ! http://www.lsst.org 8.4-meter diameter primary mirror = 10 square degrees! 3.2 billion-pixel camera 4 2 2 Large Hadron Collider (LHC) Fall 2016 Protons collide some 1 billion times per second where each collision produces about a megabyte of data Even after filtering out about 99% of it, scientists are left with around 30 petabytes each year to analyze for a wide range of physics experiments, including studies on the Higgs boson 9km diameter, ≈100m below ground 27-kilometre ring of superconducting magnets http://home.web.cern.ch/topics/large-hadron-collider Human Brain Project (HBP) 5 Fall 2016 Generate and interpret strategically selected data needed to build multilevel atlases and unifying models of the brain Use anatomical frameworks to organize and convey spatially and temporally distributed functional information about the brain at all organizational levels, from genes to cognition, and at all the relevant spatial and temporal scales http://blogs.scientificamerican.com/sa-visual/2014/04/02/how-do-you-visualize6 the-brain/ 3 3 Fall 2016 Data-driven Discovery Innovation is no longer hindered by the ability to collect data but, by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion © JOHN R. JOHNSON Disruption Time! 8 Fall 2016 Until recently, you were a folder Now, you are Your Data! 9 4 4 Fall 2016 Blurring the Boundaries of Real & Virtual Worlds Ubiquitous sensing & reasoning in physical and cyber worlds 10 Digital Disruption Already Happening ! Fall 2016 Largest telco companies owns no telco infrastructure (Skype) World’s largest movie houses owns no cinemas (Netflix) Largest software companies don’t write the apps (Apple, Google) World’s most valuable retailer has no inventory (Alibaba) Most popular media owner creates no content (Facebook) World’s largest taxi company owns no vehicles (Uber) Largest accommodation provider owns no real estate (Airbnb) Faster growing banks have actually no money (SocietyOne) http://www.independent.co.uk/news/business/comment/hamishmcrae/facebook-airbnb-uber-and-the-unstoppable-rise-of-thecontent-non-generators-10227207.html 11 5 5 The Data Tsounami Fall 2016 Mobile devices (tracking all objects all the time) Scientific instruments (collecting all sorts of data) Social media and networks (all of us are generating data) Sensor networks (measuring all kinds of data) Three main reasons: Processes are increasingly automated Systems are increasingly interconnected People are social and increasingly generate data exhausts by interacting online 12 The New Moore’s Law The Economist: digital information 10 times/5 years! Fall 2016 Data volume is increasing exponentially 44x increase from 2009 to 2020 From 0.8 ZB to 35ZB 13 6 6 Fall 2016 Big Data = Transactions+Interactions+Observations IoT sensors are reporting even more personal data than humans are! Petabytes Terabytes Gigabytes Megabytes Increasing Data Variety, Velocity, & Veracity hortonworks.com/blog/7-key-drivers-for-the-big-data-market The Emerging Data Economy 15 Fall 2016 Source: CISCO ISBG 2012 16 7 7 Where Analytics Generate Value? Fall 2016 http://fr.slideshare.net/wclquang/the-analytics-valuechain-key-to-delivering-business-value-in-iot 17 Fall 2016 What Makes Data, “Big” Data? 18 8 8 Definitions Fall 2016 No single standard definition… “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… (McKinsey Global Inst.) “Big Data” is high-volume, highvelocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making (Gartner) 19 The Four V’s of Big Data Fall 2016 20 9 9 Fall 2016 Characteristics of Big Data: 1-Scale (Volume) Too big: petabyte-scale collections or lots of (not necessarily big) data sets 21 Fall 2016 Characteristics of Big Data: 2-Speed (Velocity) Too fast: needs to be processed quickly 22 10 10 Fall 2016 Characteristics of Big Data: 3-Complexity (Variety) Too diverse: does not fit neatly in an existing tool 23 Fall 2016 Characteristics of Big Data: 4-Quality (Veracity) Too crappie: needs to assess their quality 24 11 11 Fall 2016 Other Big Data Qualities variability (data whose meaning can be constantly shifting in relation to the context in which they are generated) (McNulty, 2014) exhaustivity (an entire system is captured, n = all, rather than being sampled) (Mayer-Schonberger and Cukier, 2013) fine-grained (in resolution) and uniquely indexical (in identification) (Dodge and Kitchin, 2005) relationality (containing common fields that enable the conjoining of different datasets) (Boyd and Crawford, 2012) extensionality (can add/change new fields easily) and scaleability (can expand in size rapidly) (Marz and Warren, 2012) 25 Fall 2016 Summary of 4 Big Data Characteristics Characteristic Description Properties Drivers Volume The amount of data generated or data intensity that must be ingested processed& analyzed to make data-driven decisions Batch Processing High number of data sources High resolution sensors Velocity How fast data is being produced and ingested and the speed at which data is transformed into insight Streaming, online Processing (Near) Real-time Analytics real-time, high-rate data acquisition low hardware cost Variety The degree of diversity of data from sources both inside and outside an organization Structuring degree Multi-Modality Complex interrelations Sequences Implicit Semantics Mobile Social media Video Scientific data M2M / IoT Veracity The quality and traceability of data Consistency Completeness Integrity Ambiguity IID assumption Crowd data production, Human Sensing 26 12 12 We’ve Moved into a New Era of Data Analytics 12+ terabytes 5+ million of Tweets create daily 100’s Fall 2016 trade events per second Volume Velocity Variety Veracity of different types of data 1 in 3 Only decision makers trust their information 27 Declining % of Data an Organization Can Analyse http://www.youtube.com/watch?v=B27SpLOOhWw Fall 2016 28 13 13 Fall 2016 Big Data Processing 29 Fall 2016 Big Data: Old Wine in a New Bottle? No, it is a different type of data wave: one needs to put together many sources of information, coming through many different channels, throwing away what is not important, working under resource constraints (time), serving real users’ needs Yes, most of these problems have been in the focus of data management research for years Big Data movement: exponential growth of data enthusiasts! Proliferation of data producers and consumers, e.g., on the Web, scientific, social, government, urban, home and personal spaces The main issue is to put all this together to satisfy concrete data analysis needs via innovative technology 30 14 14 The WRONG Picture! Fall 2016 31 Beyond Big Data Size! Fall 2016 Volume = Length Width Depth Length: Collect & Compare Width: Curate & Integrate Depth: Analyze & Understand Massive Data Analysis J. Freire & J. Simeon New York University Course 2013 32 15 15 Big Data vs Deep Insights data Fall 2016 knowledge Data exploration is hard regardless of whether data are big or small ! 33 The TRUE Picture! The time for developing an analysis (with small data) Fall 2016 The time for developing an analysis (with big data) Big Data Infrastructures: Exploiting the Power of Big Data T. Sellis School of CS & IT, 2015 Athens 34 16 16 Exploring Big Data: What is Hard? Fall 2016 Scalability for computations? NOT REALLY! Lots of work on distributed systems, parallel databases, … Cloud elasticity: Add more nodes! But there are no one-size-fits-all solution: often, you have to build your own… Rapidly-evolving technology Many different tools Different computation model: need new algorithms! 35 Advanced Analytics Requires a Robust, Comprehensive Information Platform ©2011 IBM Corporation Fall 2016 36 17 17 Big Data Research Agenda Acquisition, Storage, and Management of “Big Data” Data Analytics Computational, mathematical, statistical, and algorithmic techniques for modelling high dimensional data Data representation, storage, and retrieval New parallel data architectures, including clouds Data management policies, including privacy and secure access Learning, inference, prediction, and knowledge discovery for large volumes of dynamic data sets Communication and storage devices with extreme capacities Data mining to enable automated hypothesis generation, event correlation, and anomaly detection Sustainable economic models for access and preservation Information infusion of multiple data sources Fall 2016 Data Sharing and Collaboration Tools for distant data sharing, real time visualization, and software reuse of complex data sets Cross disciplinary model, information and knowledge sharing Remote operation and real time access to distant data sources and instruments Source Big Data R&D Initiative Howard Wactlar NIST Big Data Meeting June, 2012 37 Fall 2016 Big Data Mining 39 18 18 What to Do with Big Data? Data contains value and knowledge But to extract the knowledge data needs to be Stored Managed And ANALYZED Data Analysis include: Mine/summarize large datasets Extract knowledge from past data Predict trends in future data Fall 2016 Data Mining ≈ Big Data ≈ Data Analytics ≈ Data Science 40 A Bit of Terminology Fall 2016 Data mining is the old big data: an overused term including anything such as collecting, storing, curating and visualizing data machine learning / AI (which predates the term data mining) non-ML data mining (as in "knowledge discovery", where the focus is on new knowledge, not on learning of existing knowledge) "Business intelligence", "business analytics“ are marketing terms stressing that more data leads to better business decisions (periodic reporting as well as ad hoc queries, importance of tools and dashboards); Most "Big Data" today isn't ML: It's Extract, Transform, Load (ETL), so it is replacing data warehousing (except computational advertisement) Business Intelligence aims at descriptive statistics with data with high information density to measure things, detect trends etc. Big Data targets inductive statistics with data with low information density whose huge volume allow to infer laws (regressions…) and thus giving (with the limits of inference reasoning) to Big Data some predictive capabilities (called Deep Analytics) 41 19 19 Data Analysis: ERP & CRM Examples Who are our lowest/highest margin customers ? What is the most effective distribution channel? What product prom-otions have the biggest impact on revenue? Fall 2016 Who are my customers and what products are they buying? Which customers are most likely to go to the competition ? What impact will new products/services have on revenue and margins? Agrawal et al., VLDB 2010 Tutorial 42 Fall 2016 Data Analysis in Urban Computing Where are our lowest/highest margin passengers? What is the distribution of trip lengths? What would the impacts be of a fare change? What is the quickest route from midtown to downtown at 4pm on Monday? Where should drivers go to get passengers? What impact will the introduction of additional medallions have? Agrawal et al., VLDB 2010 Tutorial 43 20 20 The Data Analysis Spectrum Fall 2016 How can we make it happen? Source: Gartner Prescriptive Analytics Value What might happen? Predictive Analytics Why did it happen? What happened? Descriptive Analytics Diagnostic Analytics What is happening? Monitoring (Dashboards, Scorecards) Difficulty Data Mining Methods 45 Fall 2016 Use some variables to predict unknown Find human-interpretable patterns or future values of other variables that describe the data 46 21 21 Large-Scale, Real-World Analytics Fall 2016 Question Method How can I identify fraudulent activity? Classification, Logistic Regression How do I segment my customers? K-means Clustering Does this product appeal to some segments more than others? Log-likelihood Which campaign is working better? Mann-Whitney U Test How is product ownership distributed across customer segments? SQL, Cumulative Distribution Functions How do I target my marketing efforts Logistic Regression towards customers most likely to churn? What new products should I offer my customers? Cosine similarity, k-Nearest Neighbors, Matrix multiplication What are my customers saying about the new product launch? NLP, sparse vectors Tools and Technologies for Big Data Steven Hillion V.P. Analytics EMC Data Computing Division 2011 Data Mining: Different Cultures 47 Fall 2016 Data mining overlaps with: Databases (DB): Large-scale data, simple queries Machine Learning (ML): Small data, Complex models Computer Science Theory: (Randomized) Algorithms Different cultures: To a DB person, data mining is an extreme form of analytic processing – queries that examine large amounts of data Result is the query answer To a ML person, data-mining is the inference of models Result is the parameters of the model Big Data urges for a cross-culture curriculum stressing on Scalability Machine Statistics Learning/ Algorithms /AI Computing architectures Data Pattern Automation for handling large data Recognition Mining Database systems 48 22 22 Big Data: Small vs. Big Analysis Fall 2016 If you want to analyze the whole set by accessing data several times, it can be much harder Many trial-and-error steps, easy to get lost… Most existing data mining/ML methods were designed without considering data access and communication of intermediate results They iteratively use data by assuming they are readily available How to integrate these two worlds together? Efficient in analyzing/mining data Do not scale Efficient in managing big data Does not analyze or mine the data 49 Big Data: Small vs. Big Analysis Fall 2016 Lots of intensive computations for complex math operations need new support in parallel/distributed settings Matrix multiplication QR decomposition (QR factorization) Singular Value Decomposition (SVD) Linear regression So we are facing many challenges methods not ready tools are not convenient platforms rapidly change, … 50 23 23 Big Data Mining Fall 2016 Focused Services This is an on-going research topic (Deep Insights) Roughly there are two types of approaches Parallelize existing (single-machine) algorithms Design new algorithms particularly for Deep Analytics distributed settings of course there are things in between To have technical breakthroughs for bigdata analytics, we should know both algorithms and systems well, and consider them together Big Data Platform 51 Big Data Mining Challenges Fall 2016 Analytics Architecture combine historical with fresh data at the same time Statistical Significance: Bigger Data are not always Better Data! achieve robust statistical results, and not be fooled by randomness Distributed Mining: Just because it is accessible doesn’t make it ethical! paralyze data mining techniques with practical & theoretical guarantees Stream Mining: “Big” will evolve/change! learn from evolving data streams Sampling and Compression: Not all Data are equivalent! subsampling is easy on one machine, but may not be in a distributed setting Hidden Big Data currently only 3% of the potentially useful data is tagged, and even less is analyzed 52 24 24 Big Data Analytics Platforms Online Machine Learning Fall 2016 Big Machine Learning (Mahout, MillWheel, R/Hadoop) (SAMOA, Rapid Miner, OIIDM) IoT Data Analysis (Parstream, Vitria, Splunk, virdata) (Striim, Storm, Spark, Google RT, Apache S4, MS , Azure, AWS Kinesis) Big Time Series Analytics (InfluxDB, AT&T M2X, IBM Informix TS, OpenTSDB) Source: Vmware 53 Fall 2016 What this Course is About? 54 25 25 The Need: Making Sense of the Big Data Universe Fall 2016 Frameworks: New computing paradigms: Cloud, Hadoop – Map/Reduce New storage solutions: NoSQL, column stores, Big Table New languages: JAQL, Pig Latin We will study these and how they relate to previous technologies Analysis: New frameworks demands new approaches to explore data We will study algorithms to process and mining data in BigData environments Tackling Big Data M. Cooper & P. Mell NIST Information Technology Laboratory Computer Security Division 55 Fall 2016 What You Will learn Understand different models of computation: MapReduce Streams and online algorithms Mine different types of data: Data is high dimensional Data is infinite/never-ending Use different mathematical ‘tools’: Hashing (LSH, Bloom filters) Dynamic programming (frequent itemsets) Solve real-world problems: Duplicate document detection Market Basket Analysis 56 26 26 Fall 2016 Tentative Course Schedule Lecture 1-2 (20-22/09): Course Overview Lecture 3-4 (27-29/09): The Map-Reduce Software Ecosystem Lecture 5-6 (04-06/10): Massive Data Processing Lecture 7-8 (11-13/10): Finding Similar Items Lecture 9-10 (18-20/10): Extracting Association Rules Lecture 11-14 (25-27/10&01-03/11): Analysing Data Streams Lecture 15-17 (08-10-15/11): Web Entity Resolution Lab 1 Introduction to Hadoop Map-Reduce (28-09) Lab 2 Advanced Map-Reduce Programming (05-10) Lab 3 MapReduce Design Patterns (12-10) Lab 4 (19-10) Programming Assignment 1 Lab 5 (26-10) Programming Assignment 2 Lab 6 (2 -11) Hive Lab 7 (9-11) Pig Lab 8 (16-11) SparK Students presentations (22-23-24-29-30) © NY Times 57 Course Text Books Jure Leskovec, Anand Rajaraman, Jeff Ullman. “Mining of Massive Datasets” Cambridge University Press, 2014 http://www.cambridge.org/gr/academic/subjects/compute r-science/knowledge-management-databases-and-datamining/mining-massive-datasets-2nd-edition Free download http://www.mmds.org Donald Miner, Adam Shook “MapReduce Design Patterns” O'Reilly Media 2013 http://shop.oreilly.com/product/0636920025122.do Free download of chapter samples http://cdn.oreillystatic.com/oreilly/booksamplers/9781 449327170_sampler.pdf Fall 2016 58 27 27 Fall 2016 Course Organization Three Programming Exercises (40%): Map/Reduce & Haddop A Research Project (30%): Presentation of 2 research papers and written report Final Examination (30%): 60 Fall 2016 Words of Caution We can only cover a small part of the big data universe Do not expect all possible architectures, programming models, theoretical results, or vendors to be covered This really is an algorithms course, not a basic programming course But you will need to do a lot of non-trivial programming There are few certain answers, as people in research and leading tech companies are trying to understand how to deal with big data We are working with cutting edge technology Bugs, lack of documentation, new Hadoop API In short: you have to be able to deal with inevitable frustrations and plan your work accordingly… …but if you can do that and are willing to invest the time, it will be a rewarding experience 61 28 28 Fall 2016 References CS246: Mining Massive Datasets Jure Leskovec, Stanford University, 1014 CS9223 – Massive Data Analysis J. Freire & J. Simeon New York University Course 2013 CS 6240: Parallel Data Processing in MapReduce Mirek Riedewald Northeastern University 2014 Big Data Infrastructures: Exploiting the Power of Big Data T. Sellis School of CS & IT, 2015 Athens CS525: Special Topics in DBs Large-Scale Data Management Advanced Analytics on Hadoop Mohamed Eltabakh Spring 2013 Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University August 30, 2014 Knowledge Discovery and Data Mining Evgueni Smirnov Maastricht School on Data Mining Department of Knowledge Engineering, Maastricht University, Maastricht, The Netherlands August 27 - August 30, 2013 62 Fall 2016 63 29 29 The Big Data Analysis Pipeline Fall 2016 Major Processing Steps Major Challenges http://cacm.acm.org/magazines/2014/7/176204-big-data-and-its64 technical-challenges/abstract Fall 2016 65 30 30 Fall 2016 66 Fall 2016 Big Data Processing and Analysis Framework 67 31 31 Fall 2016 http://www.pinterest.com/pin/272327108689099496/ 68 Fall 2016 69 32 32