Data Mining and Big Data
Ahmed K. Ezzat
Big Data Overview

Outline:
1. What is Big Data?
2. Why Big Data matters?
3. Big Data and Open Source
4. Big Data Strategies (Howie): shared-nothing parallel RDBMS, NoSQL (K/V stores), MapReduce
5. Big Data and Visualization
6. Big Data Integration Challenge (data virtualization / federation)
7. Summary

1. What is Big Data?

Definitions:
- EMC/IDC definition: Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large Volumes of a wide Variety of data, by enabling high-Velocity capture, discovery, and/or analysis.
- IBM/Microsoft definition: three characteristics define Big Data:
  - Volume: terabytes (1,000 GB, 2^40 bytes) to zettabytes (a billion TB, 2^70 bytes)
  - Variety: structured + semi-structured + unstructured
  - Velocity: batch + streaming data

Volume + Velocity => no consistency:
- CAP theorem (Eric Brewer, 2000): for a highly scalable distributed system, you can have only two of the following: Consistency, high Availability, and network Partition tolerance (tolerance of network failures).
  http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
- Implication: Big Data solutions must stop worrying about consistency if they want high availability.

Big Data terminology:
- CRUD: Create, Read, Update, Destroy
- CRAP: Create, Read, Analytical Processing
- RDBMSs provide ACID.
- NoSQL provides BASE (Basically Available, Soft state, Eventual consistency), sketched in code below:
  - Basically Available: parts of the system are allowed to fail
  - Soft state: an object may have multiple simultaneous values
  - Eventual consistency: consistency happens over time
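A minimal sketch of the BASE idea, assuming nothing about any particular store's API (the Replica class and last-write-wins merge below are purely illustrative): two replicas accept writes independently, may disagree for a while, and converge once they exchange updates.

    import time

    class Replica:
        def __init__(self):
            self.store = {}          # key -> (timestamp, value)

        def put(self, key, value):
            self.store[key] = (time.time(), value)

        def get(self, key):
            ts_val = self.store.get(key)
            return ts_val[1] if ts_val else None

        def merge(self, other):
            # Anti-entropy: keep the newer timestamped value for every key
            # (last-write-wins reconciliation).
            for key, (ts, val) in other.store.items():
                if key not in self.store or self.store[key][0] < ts:
                    self.store[key] = (ts, val)

    a, b = Replica(), Replica()
    a.put("user:42", "Alice")        # write lands on replica A only
    print(b.get("user:42"))          # None: replicas disagree (soft state)
    b.merge(a)                       # gossip / anti-entropy exchange
    print(b.get("user:42"))          # "Alice": eventually consistent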
[Figure: the CAP theorem with ACID and BASE visualized]

More terminology:
- Big Data: the 3 Vs characterizing Big Data limit the ability to perform analysis using a traditional RDBMS.
- Big Data Science: the study of techniques covering the acquisition, conditioning, and evaluation of Big Data, spanning information technology and mathematical science.
- Big Data Frameworks: software libraries and algorithms that enable distributed processing and analysis of Big Data across clusters of CPUs, GPUs, etc.
- Big Data Infrastructure: instances of one or more Big Data Frameworks, including management APIs and servers (physical or virtual), used to solve a specific Big Data problem or to serve as a general-purpose analysis and processing engine.

Technologies typically used with Big Data:
- Machine Learning (ML): the study of systems that improve their performance with experience (typically by learning from the data).
- Artificial Intelligence (AI): the science of automating complex behavior such as learning, problem solving, and decision making.
- Data Mining (DM): the process of extracting useful information from massive quantities of complex data. Many of the techniques used are statistical in nature, but are very different from classical statistics.

What is Data Science:
- Automatically extracting knowledge from data, combining mathematics and statistics, machine learning, and domain expertise.
- Applications in business: many.
- Scientific applications: astronomy, high-energy physics, biology and genomics, social science, medicine, government.

What is Data Mining:
- The overlap of machine learning, statistics, AI, and databases, but with more stress on:
  - Scalability
  - Algorithms and architectures
  - Automation for handling large data

Programming platforms:
- The SQL vs. NoSQL debate.
- Popular programming abstractions: MapReduce, Dryad.
- Pipelines: Hadoop is not an island; it requires orchestration between many diverse technologies.
- Graphs: Pregel.
- Iterations: many analysis tasks involve iterations (Vectorwise).
- DISC: Data-Intensive Scalable Computing.
- How do you get the Big Data into your platform? Bandwidth and application-topology bottlenecks.

What is Big Data processing:
- The data is large enough that it cannot be processed on a single node.
- What do we mean by "processed"? CRUD (Create, Read, Update, Destroy), plus a potentially huge amount of actual processing.
- Big Data examples come from machine-generated domains, e.g., web crawling or tracking, real-time sensor data, log files, etc.
- So for Big Data, CRUD is "crud"; think instead CRAP: Create, Read, Analytical Processing.

Where Big Data is processed:
- In the cloud
- On specific Big Data scientific platforms
- On smaller-scale virtualized platforms
- ...

How Big Data is stored:
- File systems: GFS, HDFS, etc.
- Tabular: BigTable, etc.
- Key/value stores: Dynamo, etc.
- In-memory: Spark (AMPLab, UC Berkeley)
- ...

Data is infinite/never-ending (data streams):
- In SQL, the input is under the control of the programmer.
- Stream management is important when the input rate is controlled externally.
- The system cannot store the entire stream.
- How do you make critical calculations using a limited amount of memory?

Applications of stream processing:
- Mining query streams: e.g., which queries are more frequent today than yesterday?
- Mining click streams: e.g., Yahoo! wants to know which of its pages are getting an unusual number of hits in the past hour.
- Sensors need monitoring, especially when there are many sensors of the same type feeding into a central controller.
- Telephone call records are summarized into customer bills.
- IP packets can be monitored at a switch and used for optimal routing and to detect denial-of-service attacks.
- Sliding windows: queries are about a window of length N, i.e., the N most recently received elements (see the sketch below).
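The sliding-window model is simple enough to sketch directly. The following illustrative Python class (our own names, not a library API) keeps exactly the N most recently received elements and answers frequency queries in memory bounded by N, no matter how long the stream runs:

    from collections import Counter, deque

    class SlidingWindow:
        def __init__(self, n):
            self.n = n
            self.window = deque()    # the N most recent elements
            self.counts = Counter()  # element -> count within the window

        def push(self, item):
            self.window.append(item)
            self.counts[item] += 1
            if len(self.window) > self.n:
                old = self.window.popleft()   # evict the oldest element
                self.counts[old] -= 1
                if self.counts[old] == 0:
                    del self.counts[old]

        def frequency(self, item):
            return self.counts[item]

    w = SlidingWindow(n=1000)
    for page in ["/home", "/news", "/home", "/sports"]:  # stand-in stream
        w.push(page)
    print(w.frequency("/home"))  # hits for /home among the last 1000 events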
Different types of data:
- Data is high-dimensional
- Data is a graph
- Data is infinite/never-ending
- Data is labeled

Different models of computation:
- MapReduce
- Streams and online algorithms
- In-memory processing

Processes typically used in data analysis:
- Prediction (explaining a specific attribute of the data in terms of other attributes):
  - Classification (predicting a discrete value) and regression (estimating a numeric value)
  - Rule-based, case-based, and model-based learning
- Modeling (describing the relationships between many attributes and many entities):
  - Representation and heuristic search
  - Clustering (modeling the group structure of the data)
  - Bayesian networks (modeling probabilistic relationships)
- Detection (identifying relevant patterns in massive/complex datasets):
  - Anomaly detection (detecting outliers, novelties, etc.)
  - Pattern detection (e.g., event surveillance, anomalous patterns)
  - Applications to biosurveillance, crime prevention, etc.

Models vs. analytic processing:
- To a DBMS person, data mining is an extreme form of analytic processing: queries that examine large amounts of data, where the result is the answer to the query. Given a billion numbers, a DBMS person would compute their average and standard deviation.
- To a statistician, data mining is the inference of models, where the result is the parameters of the model. Given a billion numbers, a statistician might fit them to the best Gaussian distribution and report the mean and standard deviation of that distribution. The sketch below makes the contrast concrete.
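In the sketch below (toy random data standing in for the billion numbers), both views compute the same two quantities; the difference is that the statistician treats them as parameters of a model that can then score new observations:

    import math
    import random

    data = [random.gauss(50.0, 10.0) for _ in range(100_000)]  # toy stand-in

    # DBMS view: the aggregates ARE the answer to the query.
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    print(f"avg = {mean:.2f}, stddev = {math.sqrt(var):.2f}")

    # Statistician's view: the same two numbers are the maximum-likelihood
    # parameters (mu, sigma) of a Gaussian N(mu, sigma^2) fitted to the
    # data; the model can then, e.g., score a new observation.
    mu, sigma = mean, math.sqrt(var)
    x = 75.0
    density = (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
               / (sigma * math.sqrt(2 * math.pi)))
    print(f"model N({mu:.2f}, {sigma:.2f}): p({x}) = {density:.6f}")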
Google's computing infrastructure (a single search query):
- 200+ processors
- 200+ terabyte database
- 10^10 total clock cycles
- 0.1 second response time
- 5¢ average advertising revenue

Google's computing infrastructure (the system):
- ~3 million processors, in clusters of ~2,000 processors each
- x86 processors, IDE disks, Ethernet communications
- Reliability through redundancy and software management
- Partitioned workload:
  - Data: web pages and indices distributed across processors
  - Function: crawling, index generation, index search, document retrieval, ad placement
- Data-Intensive Scalable Computer (DISC): a large-scale computer centered around data: collecting, maintaining, indexing, computing
- Similar systems at Yahoo! and Microsoft

DISC environment:
- Architecture / operating system: Hadoop; Apsara by Aliyun (http://www.aliyun.com)
- Programming model: cloud computing, MapReduce
- Data analysis: data mining

DISC challenge: computation accesses 1 TB in 5 minutes:
- Data distributed over 100+ disks
- Compute using 100+ processors
- Connected by 1 Gbps Ethernet
- System requirements: lots of disks, lots of processors, located in close proximity, interactive access, robust fault tolerance

Data mining's role in Big Data:
- Non-trivial discovery of implicit, previously unknown, and useful knowledge from massive data.

Data mining tasks in Big Data:
- Association rule discovery
- Classification
- Clustering
- Recommendation systems: collaborative filtering
- Link analysis and graph mining
- Managing web advertisements
- Descriptive methods: find human-interpretable patterns that describe the data
- Predictive methods: use some variables to predict unknown or future values of other variables

Real-world problems to be addressed:
- Recommendation systems
- Association rules
- Link analysis
- Duplicate detection (dedup)

Various tools:
- Linear algebra (Singular Value Decomposition (SVD), etc.)
- Optimization (stochastic gradient descent)
- Dynamic programming (frequent itemsets)
- Hashing (Locality-Sensitive Hashing (LSH), Bloom filters)

[Figure: how it all fits together]

2. Why Big Data Matters?

Big Data enables new things:
- Google: the first big success of Big Data
- Social networks (Facebook, Twitter, LinkedIn, etc.)
- Location analytics
- Healthcare; personalized medicine
- Semantics and AI
- ...

[Figure: the Big Data bubble on the Gartner Hype Cycle]

Why it matters:
- Provides groundbreaking opportunities for enterprise information management and decision making.
- The rate of information growth appears to be exceeding Moore's law.
- The amount of data is exploding; companies are capturing and digitizing more information.
- 35 zettabytes (a zettabyte is a billion TB) of data will be generated and consumed by the end of the decade.
- Hadoop (HDFS, HBase, MapReduce) and Memcached are gaining a lot of momentum.

Survey results:
- 69%: extremely important for competitive advantage
- 70%: helped manage costs or improve operations

Where data mining and analytics are applied:
[Chart: share of respondents per application area, on a 0-30% scale] The areas surveyed: CRM/consumer analytics, banking, health care/HR, fraud detection, direct marketing/fundraising, finance, telecom/cable, science, insurance, advertising, education, web usage mining, credit scoring, retail, medical/pharma, manufacturing, e-commerce, social networks, search/web content mining, government/military, biotech/genomics, investment/stocks, entertainment/music, security/anti-terrorism, travel/hospitality, social policy/survey analysis, junk email/anti-spam, and other.

[Figure: data types analyzed/mined]

Largest dataset analyzed/mined:
- The 2011 median dataset size was ~10-20 GB, vs. 8-10 GB in 2010.
- Growth was concentrated in the 10 GB to 1 PB range.

Algorithms used for data analysis/mining in 2011, by the percentage of analysts who used each (0-70% scale): decision trees, regression, clustering, statistics, visualization, time series/sequence analysis, support vector machines (SVM), association rules, ensemble methods, text mining, neural nets, boosting, Bayesian methods, bagging, factor analysis, anomaly/deviation detection, social network analysis, survival analysis, genetic algorithms, uplift modeling.

3. Big Data and Open Source

Storm and Kafka:
- Storm (http://storm-project.net/): a distributed real-time stream-processing ("computation") system.
- Kafka (http://kafka.apache.org/documentation.html#design): a high-throughput distributed messaging system.

Spark (http://spark.incubator.apache.org/): a very fast (in-memory) cluster processing system.

Dremel and Drill:
- Dremel (http://research.google.com/pubs/pub36632.html): Google's platform for fast and interactive Hadoop queries.
- Drill (http://wiki.apache.org/incubator/DrillProposal): the Apache open-source version of Dremel.

Gremlin, Giraph, Neo4j:
- Gremlin (https://github.com/tinkerpop/gremlin/wiki): a graph traversal language.
- Giraph (http://giraph.apache.org/): an iterative graph processing system; the open-source version of Google's Pregel.
- Neo4j (http://www.neo4j.org/): a fully ACID, robust property-graph database (single node).

R (http://www.r-project.org/): an open-source statistical programming language. R works well with Hadoop.

D3 (Data-Driven Documents, http://d3js.org/): a JavaScript library for manipulating documents based on data.

Software tools for data analysis:
- WEKA (http://www.cs.waikato.ac.nz/ml/weka): a popular open-source data mining toolkit.
- ASL (Accelerated Statistical Learning, http://www.autonlab.org): another statistical data mining toolkit, from CMU.
- To implement your own ML methods, you can use Matlab, R, C++, etc.

4. Big Data Strategies

Strategy: shared-nothing ("except the network") parallel RDBMS:
- The network needs to be fast, at minimum for throughput but preferably for latency as well.
- Partitioning/sharding: data is distributed among shards, ideally with little or no inter-shard (inter-node) traffic.
- Columnar layout: useful for aggregation and for compression.

Strategy: NoSQL (key/value stores):
- K/V or document stores (complex K/V stores): MongoDB, CouchDB, etc.
- Column-oriented (tabular K/V) stores: BigTable, HBase, Cassandra, etc.
- K/V store characteristics: relatively low latency, not ACID, and no joins.

Strategy: the MapReduce platform:
- A general-purpose, implicitly parallel programming paradigm.
- Hadoop addresses the CRAP pattern (Create, Read, Analytical Processing).
- Composed of HDFS and MapReduce itself.
- Not limited to Java: a streaming interface supports other languages (Python, Ruby, etc.), as in the word-count sketch below. Example invocation:

    % $HADOOP_HOME/bin/hadoop jar \
          $HADOOP_HOME/hadoop-streaming.jar \
          -input myInputDirs \
          -output myOutputDir \
          -mapper /bin/cat \
          -reducer /bin/wc
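For completeness, here is what the streaming interface looks like with user code: a hypothetical word-count mapper/reducer pair in Python. The file names mapper.py and reducer.py are our own for illustration; they would be passed via -mapper and -reducer (plus -file to ship them to the cluster).

    #!/usr/bin/env python3
    # mapper.py: Hadoop Streaming feeds input lines on stdin and expects
    # tab-separated key/value pairs on stdout.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py: Hadoop sorts mapper output by key, so all counts for a
    # given word arrive contiguously and can be summed in one pass.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")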
5. Big Data and Visualization

Introduction:
- Collecting information is no longer a problem, but extracting value from information collections has become progressively more difficult.
- Visualization links the human eye and the computer, helping to identify patterns and to extract insights from large amounts of information.
- Visualization technology shows considerable promise for increasing the value of large-scale collections of information.
- Visualization has been used to communicate ideas, to monitor trends implicit in data, and to explore large volumes of data for hypothesis generation.
- Visualization can be classified as scientific visualization, software visualization, and information visualization.

Scientific visualization:
- Helps in understanding physical phenomena in data (Nielson, 1991); a mathematical model plays an essential role.
- Isosurfaces, volume rendering, and glyphs are commonly used techniques:
  - Isosurfaces depict the distribution of certain attributes.
  - Volume rendering allows viewers to see the entire volume of 3-D data in a single image (Nielson, 1991).
  - Glyphs provide a way to display multiple attributes through combinations of various visual cues (Chernoff, 1973).

Software visualization and information visualization:
- Software visualization helps people understand and use computer software effectively (Stasko et al., 1998).
  - Program visualization helps programmers manage complex software (Baecker & Price, 1998): visualizing the source code (Baecker & Marcus, 1990), data structures, and the changes made to the software (Eick et al., 1992).
  - Algorithm animation is used to motivate and support the learning of computational algorithms.
- Information visualization helps users identify patterns, correlations, or clusters.
  - Structured information: graphical representation to reveal patterns, e.g., Spotfire, SAS/GRAPH, SPSS; integration with various data mining techniques (Thearling et al., 2002; Johnston, 2002).
  - Unstructured information: one needs to identify variables and construct visualizable structures, e.g., Vantage Point, SemioMap, and Knowledgist.

Framework for information visualization:
- Research on taxonomies of visualization:
  - Chuah and Roth (1996) listed the tasks of information visualization.
  - Bertin (1967) and Mackinlay (1986) described the characteristics of basic visual variables and their applications.
  - Card and Mackinlay (1997) constructed a data-type-based taxonomy.
  - Chi (2000) proposed a taxonomy based on technologies, with four stages: value, analytic abstraction, visual abstraction, and view.
  - Shneiderman (1996) identified two aspects of visualization: representation and the user interface.
  - C. Chen (1999) indicated that information analysis also helps support a visualization system.
- Three research dimensions support the development of an information visualization system: information representation, user-interface interaction, and information analysis.

Information representation: Shneiderman (1996) proposed seven types of representation methods:
- 1-D: e.g., TileBars (Hearst, 1995).
- 2-D: represent information as two-dimensional visual objects.
- 3-D: represent information as 3-D visual objects.
- Multidimensional: represent information as multidimensional objects and project them into a 3-D or 2-D space. A dimensionality reduction algorithm needs to be used: multidimensional scaling (MDS), hierarchical clustering, k-means algorithms, or principal components analysis (see the sketch after this list).
- Tree: represent hierarchical relationships. Challenge: the number of nodes grows exponentially.
- Network: represent complex relationships for which a simple tree is insufficient.
- Temporal: represent data based on temporal order; location and animation are commonly used.
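Of the dimensionality reduction techniques just listed, principal components analysis is the easiest to sketch; it also reuses the SVD mentioned among the linear-algebra tools in Section 1. A minimal, illustrative NumPy version (the synthetic 10-dimensional data is only a placeholder):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))   # 500 records, 10 attributes

    Xc = X - X.mean(axis=0)          # center each attribute
    # Principal directions = right singular vectors of the centered matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    X2 = Xc @ Vt[:2].T               # coordinates in the top-2 PC plane

    print(X2.shape)                  # (500, 2): ready for a scatter plot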
6. Big Data Integration Challenge

- Analytics push the limits of traditional data management.
- Data virtualization solves the Big Data integration problem.
- The answer to too many silos is not to integrate them into a data warehouse: because of scalability limitations, the replication, processing, and storage costs would be too high.
- The answer is data federation, or virtualization.

7. Summary

- Big Data is best described in terms of the 3 Vs (Volume, Variety, Velocity).
- Big Data matters because it can give the enterprise an advantage in a competitive market.
- Hadoop and many analytics tools are available in the open-source community.
- We covered different strategies to process Big Data.
- Visualization is critical in Big Data, enabling value to be extracted from information.
- We also covered how to integrate Big Data silos so they can be queried and analyzed together.

END