Big Data Analytics

Outline
• Introduction
• MapReduce & Hadoop architecture
• Higher-level query languages: Pig & Hive
• Pregel
• GraphLab
• PowerGraph
• Spark

The Age of Big Data
• 28 million Wikipedia pages
• 1 billion Facebook users
• 6 billion Flickr photos
• 72 hours of video uploaded to YouTube each minute
• "…growing at 50 percent a year…"
• "…data a new class of economic asset, like currency or gold."

Graphs are Everywhere
• Collaborative filtering: users and movies (Netflix)
• Social networks: users
• Probabilistic analysis: docs and words
• Text analysis: Wikipedia

Introduction
Graphs are analyzed in many important contexts:
• Ranking search results based on the hyperlink structure of the web
• Module detection in protein-protein interaction networks
• Social network analysis
Many graphs of interest are difficult to analyze, often spanning millions of vertices and billions of edges.

How can we solve the problem? We have to assign a graph of billions of vertices and edges to a distributed system of thousands or millions of computers.

MapReduce
A powerful tool for tackling large-data problems:
• Bioinformatics: DNA sequence assembly, protein-protein interaction networks
• Recommendation systems
• Search engines
• Text processing
• Machine translation

What is MapReduce?
• A programming model for data-intensive computing on commodity clusters
• Pioneered by Google, which processes 20 PB of data per day with it
• Popularized by the Apache Hadoop project; used by Yahoo!, Facebook, Amazon, …

What is MapReduce Used For?
• At Google: index building for Google Search, article clustering for Google News, statistical machine translation
• At Yahoo!: index building for Yahoo! Search, spam detection for Yahoo! Mail
• At Facebook: data mining, ad optimization, spam detection
• In research: analyzing Wikipedia conflicts (PARC), natural language processing (CMU), climate simulation (Washington), bioinformatics (Maryland), particle physics (Nebraska), <your application here>

MapReduce Goals
• Scalability to large data volumes: scanning 100 TB on one node at 50 MB/s takes 24 days; the same scan on a 1000-node cluster takes 35 minutes
• Cost-efficiency: commodity nodes (cheap, but unreliable), a commodity network (low bandwidth), and ease of use (fewer programmers needed)

Hadoop — How was it Born?
Hadoop is the Apache project that implements the MapReduce computational model. [Figure: Hadoop's place among Apache projects such as ZooKeeper and Tomcat]

Typical Hadoop Cluster
[Figure: a typical Hadoop cluster]

Hadoop Evolution Graph
[Figure: growth of Hadoop over time]

Hadoop Distributed File System (HDFS)
• Files are split into 64 MB blocks
• Blocks are replicated across several datanodes (often 3)
• The namenode stores metadata (file names, block locations, etc.)
• Optimized for large files and sequential reads
[Figure: the namenode maps File1 to blocks 1-4, which are replicated across the datanodes]
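To make the block mechanics concrete, here is a minimal sketch of namenode-style block placement. It is an illustration only, not HDFS code; the function `place_blocks` and its round-robin policy are assumptions (real HDFS placement is also rack-aware and load-aware).

    import itertools

    BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks, the classic HDFS default
    REPLICATION = 3                 # replicas per block (often 3)

    def place_blocks(file_size, datanodes):
        """Split a file into fixed-size blocks and assign each block
        to REPLICATION datanodes, round-robin. Hypothetical policy:
        real HDFS also accounts for rack locality."""
        num_blocks = -(-file_size // BLOCK_SIZE)      # ceiling division
        ring = itertools.cycle(datanodes)
        return {b: [next(ring) for _ in range(REPLICATION)]
                for b in range(num_blocks)}

    # A 200 MB file on a 4-node cluster: 4 blocks x 3 replicas each.
    print(place_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"]))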
HBase
• A distributed data store that can scale horizontally to thousands of commodity servers
• Designed to operate on top of the Hadoop Distributed File System (HDFS)
• Depends on ZooKeeper, and by default uses a ZooKeeper instance as the authority on cluster state

MapReduce Data Flow (WordCount)
Input splits: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
Map output: (the, 1) (quick, 1) (brown, 1) (fox, 1) | (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1) | (how, 1) (now, 1) (brown, 1) (cow, 1)
Shuffle & sort: group the pairs by key across the reducers
Reduce output: (ate, 1) (brown, 2) (cow, 1) (fox, 2) (how, 1) (mouse, 1) (now, 1) (quick, 1) (the, 3)

Example: Word Count

    def mapper(line):
        for word in line.split():
            output(word, 1)          # emit one (word, 1) pair per occurrence

    def reducer(key, values):
        output(key, sum(values))     # sum the counts gathered for each word

(Pseudocode: `output` stands for the framework's emit call.)

Word Count with Combiner
Same dataflow, but each map task runs a combiner that pre-aggregates its own output before the shuffle: the map task for "the fox ate the mouse" emits (the, 2) instead of two separate (the, 1) pairs, reducing shuffle traffic (simulated below).

Word Count in Hadoop
[Figure: the Word Count map and reduce functions, and running the program as a MapReduce job]
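As a single-process illustration of the full pipeline — map, combine, shuffle & sort, reduce — here is a hedged sketch. It simulates the framework in plain Python rather than using the Hadoop API; the driver loop at the bottom plays the role of the runtime.

    from collections import defaultdict

    def mapper(line):
        for word in line.split():
            yield word, 1

    def combiner(pairs):
        # Runs on the mapper's machine: pre-sums duplicate keys so that,
        # e.g., ("the", 1), ("the", 1) crosses the network as ("the", 2).
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return counts.items()

    def reducer(word, values):
        return word, sum(values)

    splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
    shuffle = defaultdict(list)              # shuffle & sort: group values by key
    for split in splits:                     # one "map task" per input split
        for word, n in combiner(mapper(split)):
            shuffle[word].append(n)
    print(dict(reducer(w, vs) for w, vs in shuffle.items()))
    # {'the': 3, 'quick': 1, 'brown': 2, 'fox': 2, 'ate': 1, 'mouse': 1, ...}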
Example: Search
• Input: (lineNumber, line) records
• Output: lines matching a given pattern
• Map: if the line matches the pattern, output the line
• Reduce: the identity function (alternative: no reducer at all, i.e. a map-only job)

Hadoop Workflow
1. Load data into HDFS
2. Develop code locally
3. Submit the MapReduce job to the Hadoop cluster
4. Retrieve the results from HDFS

Architecture Overview
The user talks to a master node running the Job Tracker; each of the N slave nodes runs a Task Tracker that manages that node's workers.

Task Scheduling in MapReduce
• MapReduce adopts a master-slave architecture: the master node runs the Job Tracker (JT) and each slave node runs a Task Tracker (TT)
• The JT maintains a queue of tasks (T0, T1, T2, …) and each TT offers a fixed number of task slots
• MapReduce adopts a pull scheduling strategy rather than a push one: TTs request tasks from the JT

Fault Tolerance in MapReduce
1. If a task crashes: retry it on another node
2. If a node crashes: relaunch its current tasks on other nodes
3. If a task is going slowly (a straggler): launch a second copy of the task on another node, take the output of whichever copy finishes first, and kill the other one

Motivation: Pig Latin
We need a high-level, general data-flow language on top of MapReduce.

Pig
• Started at Yahoo! Research; runs about 50% of Yahoo!'s jobs
• Features: expresses sequences of MapReduce jobs, provides relational (SQL-style) operators (JOIN, GROUP BY, etc.), and makes it easy to plug in Java functions

An Example Problem
Find the top 5 most visited pages by users aged 18-25:
load Users → filter by age → join on name with the loaded Pages → group on url → count clicks → order by clicks → take the top 5

In MapReduce
[Figure: the same query hand-coded as a chain of MapReduce jobs in Java — far longer than the Pig Latin version below]
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In Pig Latin

    Users    = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages    = load 'pages' as (user, url);
    Joined   = join Filtered by name, Pages by user;
    Grouped  = group Joined by url;
    Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted   = order Summed by clicks desc;
    Top5     = limit Sorted 5;
    store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Implementation
A user writes SQL (automatically rewritten and optimized) or Pig Latin; Pig compiles either into jobs for a Hadoop Map-Reduce cluster. Pig Latin occupies a sweet spot between map-reduce and SQL.

Compilation into Map-Reduce
Every group or join operation forms a map-reduce boundary; the other operations are pipelined into the surrounding map and reduce phases. For the example query: Map1 loads Visits and Reduce1 groups by url and generates a count per url; Map2 loads the Url Info table and joins on url; Reduce2 groups by category; Map3/Reduce3 generate top10(urls) for each category.

Hive
• Developed at Facebook; used for most Facebook jobs
• A relational database layer built on Hadoop: maintains table schemas, offers a SQL-like query language, and supports complex data types and some query optimization

PROPERTIES OF REAL-WORLD GRAPHS
[Figure: Twitter network visualization, by Akshay Java, 2009]

Natural Graphs
[Image from WikiCommons]

Properties of Natural Graphs
Unlike a regular mesh, natural graphs have a power-law degree distribution.

Computing PageRank
Properties of PageRank:
• It can be computed iteratively
• The effects at each iteration are local
Sketch of the algorithm:
1. Start with seed PRi values
2. Each page distributes its PRi "credit" to all pages it links to
3. Each target page adds up the "credit" from its multiple inbound links to compute PRi+1
4. Iterate until the values converge

Iterative Computation is Difficult
MapReduce is not optimized for iteration: every iteration pays a startup penalty and a disk penalty, with all data shuttled between the CPUs and disk again and again.

MODEL OF COMPUTATION (Pregel)
• A directed graph is given to Pregel
• Pregel runs the computation at each vertex until all vertices vote to halt
• It then returns the results (output)

Vertex State Machine
• Algorithm termination is based on every vertex voting to halt
• In superstep 0, every vertex is in the active state
• A vertex deactivates itself by voting to halt
• It can be reactivated by receiving an (external) message

Pregel: Bulk Synchronous Parallel
Each superstep consists of compute, then communicate, then a global barrier. (http://dl.acm.org/citation.cfm?id=1807184)
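As a concrete illustration of this model, here is a minimal single-machine sketch of a Pregel-style superstep loop running PageRank. It is an assumption-laden toy, not Google's API: the name `superstep`, the message `inbox`/`outbox`, and the fixed 30-iteration cutoff (standing in for vote-to-halt) are all illustrative.

    DAMPING = 0.85

    def superstep(graph, rank, inbox):
        """One synchronous superstep: every vertex sums its incoming
        messages, updates its rank, then sends rank/outdegree along its
        out-edges. A global barrier separates supersteps."""
        new_rank, outbox = {}, {v: [] for v in graph}
        for v, out_edges in graph.items():
            new_rank[v] = (1 - DAMPING) / len(graph) + DAMPING * sum(inbox[v])
            for w in out_edges:
                outbox[w].append(new_rank[v] / len(out_edges))
        return new_rank, outbox

    graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}   # directed out-edge lists
    rank = {v: 1 / len(graph) for v in graph}
    inbox = {v: [] for v in graph}
    for u, out_edges in graph.items():                  # messages from superstep 0
        for w in out_edges:
            inbox[w].append(rank[u] / len(out_edges))
    for _ in range(30):                                 # fixed cutoff in place of vote-to-halt
        rank, inbox = superstep(graph, rank, inbox)
    print(rank)   # rank concentrates on "a", which both other vertices link to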
Giraph Dataflow
[Figure: Giraph's three stages — 1) loading the graph: the master assigns input-format splits (Split 0…4) to Worker 0 and Worker 1, which load them and send graph parts (Part 0…3) to build the in-memory graph; 2) compute/iterate: each worker computes and sends messages for its parts, then sends stats to the master and iterates; 3) storing the graph through the output format.]

MapReduce and Partitioning
• Map-Reduce splits the keys randomly between mappers/reducers
• But in natural graphs, high-degree vertices (keys) may have a million times more edges than the average, giving an extremely uneven distribution
• The time of an iteration is therefore the time of the slowest job

Curse of the Slow Job
[Figure: with a barrier after every iteration, all CPUs sit idle waiting for the slowest one.]
(http://www.www2011india.com/proceeding/proceedings/p607.pdf)

MapReduce's (Hadoop's) poor performance on huge graphs has motivated the development of special graph-computation systems.

Parallel Computing and ML
Not all algorithms are efficiently data-parallel:
• Data-parallel: feature extraction, cross validation
• Complex parallel structure: kernel methods, belief propagation, SVMs, tensor factorization, sampling, deep belief networks, neural networks, lasso

GraphLab: A New Framework for Parallel Machine Learning
• Applications: graph analytics, graphical models, computer vision, clustering, topic modeling, collaborative filtering
• The GraphLab version 2.1 API (C++) runs over MPI/TCP-IP and PThreads, with Hadoop/HDFS, on Linux cluster services (Amazon AWS)
• GraphLab easily incorporates external toolkits, and automatically detects and builds them

The GraphLab Framework
Three pieces: a graph-based data representation, update functions (the user's computation), and a consistency model.

Data Graph
Data is associated with vertices and edges. For a social network: the graph is the social network itself; vertex data holds the user profile and current interest estimates; edge data holds the relationship (friend, classmate, relative).

Distributed Graph
• Partition the graph across multiple machines
• "Ghost" vertices maintain the adjacency structure and replicate remote data
• The graph is cut efficiently using HPC graph-partitioning tools (ParMetis, Scotch, …)

Update Functions
A user-defined program, applied to a vertex, that transforms the data in the scope of that vertex. Update functions are applied asynchronously in parallel until convergence, and many schedulers are available to prioritize the computation. For PageRank:

    Pagerank(scope) {
      // Update the current vertex data
      vertex.PageRank = a
      ForEach inPage:
        vertex.PageRank += (1 - a) × inPage.PageRank
      // Reschedule neighbors if needed
      if vertex.PageRank changes then
        reschedule_all_neighbors;
    }

Why dynamic? Rescheduling lets the computation concentrate on the parts of the graph where values are still changing.

Shared-Memory Dynamic Schedule
A scheduler holds the vertices to be updated; each CPU repeatedly pulls a vertex, runs its update function, and pushes any rescheduled neighbors. The process repeats until the scheduler is empty (a toy sketch of this loop appears below, after the consistency discussion).

Distributed Scheduling
Each machine maintains a schedule over the vertices it owns; distributed consensus is used to identify completion.

Ensuring Race-Free Code
How much can computation overlap?

Serializability
Overlapping regions are only read, so update functions one vertex apart can be run in parallel (edge consistency). Stronger and weaker consistency levels are available; this user-tunable consistency trades off parallelism against consistency.
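The sketch below illustrates the dynamic-scheduling idea in plain Python. It is sequential and shared-memory only; the function names, the tolerance threshold, and the deque-based scheduler are illustrative assumptions, not GraphLab's API.

    from collections import deque

    ALPHA = 0.15   # random-reset probability, as in the update function above

    def pagerank_update(v, graph, data):
        # Scope of v: its own data plus its neighbors' data.
        return ALPHA + (1 - ALPHA) * sum(data[n] / len(graph[n]) for n in graph[v])

    def run(graph, data, update, tol=1e-4):
        """Dynamic scheduling: pull a vertex, apply its update function,
        and reschedule its neighbors only if the value changed noticeably.
        Repeats until the scheduler is empty."""
        schedule, queued = deque(graph), set(graph)
        while schedule:
            v = schedule.popleft()
            queued.discard(v)
            old, data[v] = data[v], update(v, graph, data)
            if abs(data[v] - old) > tol:        # reschedule neighbors if needed
                for n in graph[v]:
                    if n not in queued:
                        schedule.append(n)
                        queued.add(n)
        return data

    triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
    print(run(triangle, {v: 0.0 for v in triangle}, pagerank_update))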
CoEM (Rosie Jones, 2005)
A named-entity recognition task: is "Cat" an animal? Is "Istanbul" a place? Vertices (2 million) are noun phrases and the contexts they occur in ("the cat <X> ran quickly", "travelled to <X>", "<X> is pleasant"); edges (200 million) connect them.
• Hadoop: 95 cores, 7.5 hrs
• GraphLab: 16 cores, 30 min
• Distributed GraphLab: 32 EC2 nodes, 80 secs (0.3% of the Hadoop time)

PageRank
40M webpages, 1.4 billion links, 100 iterations:
• Hadoop: 5.5 hrs (results from [Kang et al. '11])
• Twister (in-memory MapReduce): 1 hr [Ekanayake et al. '10]
• GraphLab: 8 min

Power-Law Graphs are Difficult to Partition
• Power-law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04]
• Traditional graph-partitioning algorithms perform poorly on power-law graphs [Abou-Rjeili et al. 06]

PowerGraph's approach: program for the power-law structure, run on the cluster. Split high-degree vertices across machines, with a new abstraction that guarantees equivalence on split vertices.

The Graph-Parallel Abstraction
• A user-defined vertex-program runs on each vertex
• The graph constrains interaction along edges: via messages (e.g., Pregel [PODC'09, SIGMOD'10]) or through shared state (e.g., GraphLab [UAI'10, VLDB'12])
• Parallelism: run multiple vertex-programs simultaneously

Challenges of High-Degree Vertices
• Edges are processed sequentially, and many messages are sent (Pregel)
• The program touches a large fraction of the graph, and asynchronous execution requires heavy locking (GraphLab)
• Edge metadata can be too large for a single machine
• Synchronous execution is prone to stragglers (Pregel)

Graph Partitioning
• Graph-parallel abstractions rely on partitioning to minimize communication and to balance computation and storage
• The data transmitted across the network is O(# cut edges)

Random Partitioning
Both GraphLab and Pregel resort to random (hashed) partitioning on natural graphs.

PowerGraph
• GAS decomposition: distribute vertex-programs, moving computation to the data and parallelizing high-degree vertices (a minimal sketch of the pattern follows at the end of this section)
• Vertex partitioning: effectively distribute large power-law graphs

Distributed Execution of a PowerGraph Vertex-Program
One machine holds the master copy of a vertex; the other machines it spans hold mirrors. The gather phase runs in parallel on all machines spanning the vertex, producing partial accumulators Σ1 + Σ2 + Σ3 + Σ4; the apply phase runs on the master to produce the new value Y′, which is then sent to the mirrors for the scatter phase.

Minimizing Communication in PowerGraph
• Communication is linear in the number of machines each vertex spans, so a vertex-cut minimizes the number of machines each vertex spans
• Percolation theory suggests that power-law graphs have good vertex cuts [Albert et al. 2000]

New Approach to Partitioning
• Rather than cut edges (which forces many edges to be synchronized), cut vertices: only a single vertex need be synchronized per cut
• New theorem: for any edge-cut we can directly construct a vertex-cut which requires strictly less communication and storage
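Here is the promised sketch of the Gather-Apply-Scatter pattern, a hedged single-machine toy in Python. The function name and the fixed iteration count are assumptions; PowerGraph itself distributes the gather across each vertex's mirrors.

    def gas_pagerank(graph, alpha=0.15, iters=30):
        """PageRank written as a Gather-Apply-Scatter vertex program."""
        in_nbrs = {v: [] for v in graph}
        for u, out_edges in graph.items():
            for v in out_edges:
                in_nbrs[v].append(u)
        rank = {v: 1.0 / len(graph) for v in graph}
        for _ in range(iters):
            # Gather: an associative sum over each vertex's in-edges --
            # exactly the part PowerGraph parallelizes across mirrors.
            acc = {v: sum(rank[u] / len(graph[u]) for u in in_nbrs[v])
                   for v in graph}
            # Apply: the master copy of each vertex updates its value.
            rank = {v: alpha / len(graph) + (1 - alpha) * acc[v] for v in graph}
            # Scatter: push the new value along out-edges (implicit here,
            # since this sketch simply recomputes the gather each round).
        return rank

    print(gas_pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))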
Topic Modeling
• English-language Wikipedia: 2.6M documents, 8.3M words, 500M tokens; a computationally intensive algorithm
[Figure: throughput in million tokens per second (axis 0-160) — Smola et al. on 100 Yahoo! machines, specifically engineered for this task, versus PowerGraph on 64 cc2.8xlarge EC2 nodes, written in 200 lines of code and 4 human hours]

Example Topics Discovered from Wikipedia
[Figure: sample topics]

Triangle Counting on the Twitter Graph
Identify individuals with strong communities. 34.8 billion triangles counted:
• Hadoop [WWW'11]: 1536 machines, 423 minutes
• PowerGraph: 64 machines, 1.5 minutes — 282× faster
(S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW'11)

Spark Motivation
• MapReduce simplified "big data" analysis on large, unreliable clusters
• But as soon as organizations started using it widely, users wanted more: more complex, multi-stage applications; more interactive queries; more low-latency online processing
• Complex jobs (multi-stage pipelines), interactive queries (data mining), and online processing (stream processing) all need one thing that MR lacks: efficient primitives for data sharing
• The problem: in MR, the only way to share data across jobs is stable storage (e.g., the file system), which is slow

What is Apache Spark?
In MapReduce, each iteration reads its input from HDFS and writes its output back to HDFS (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …), and every interactive query (query 1, query 2, query 3, …) re-reads the input from HDFS.

Goal: In-Memory Data Sharing
After one-time processing of the input, iterations and queries share data through distributed memory, which is 10-100× faster than network and disk.

Solution: Resilient Distributed Datasets (RDDs)
• Partitioned collections of records that can be stored in memory across the cluster
• Manipulated through a diverse set of transformations (map, filter, join, etc.)
• Fault recovery without costly replication: remember the series of transformations that built an RDD (its lineage) and use it to recompute lost data

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns (Scala):

    lines = spark.textFile("hdfs://...")            // base RDD
    errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
    messages = errors.map(_.split('\t')(2))
    messages.cache()

    messages.filter(_.contains("foo")).count
    messages.filter(_.contains("bar")).count

[Figure: the driver ships tasks to the workers; each worker reads its block (Block 1-3) once and keeps its partition of messages in cache (Cache 1-3) for later queries]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

Fault Recovery
RDDs track lineage information that can be used to efficiently reconstruct lost partitions. For example:

    messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))

yields the lineage chain: HDFS file → filtered RDD (func = _.startsWith(...)) → mapped RDD (func = _.split(...)).

Fault Recovery Results
[Figure: iteration times (s) over 10 iterations of a job; normal iterations take roughly 57-81 s, while the iteration in which the failure happens takes 119 s as lost partitions are rebuilt from lineage]
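To make lineage-based recovery concrete, here is a minimal pure-Python sketch — an illustration, not Spark's implementation; the class `SketchRDD` and its methods are invented for this example. Each derived dataset remembers only its parent and its transformation, so a partition lost from cache can be recomputed on demand.

    class SketchRDD:
        """Toy RDD: a partition lost from cache is recomputed from lineage."""
        def __init__(self, data=None, parent=None, transform=None):
            self.data = data                  # only the base RDD holds real data
            self.parent, self.transform = parent, transform
            self.cache = {}                   # partition index -> cached result

        def map(self, f):
            return SketchRDD(parent=self, transform=lambda p: [f(x) for x in p])

        def filter(self, pred):
            return SketchRDD(parent=self,
                             transform=lambda p: [x for x in p if pred(x)])

        def partition(self, i):
            if self.data is not None:         # base RDD: read "stable storage"
                return self.data[i]
            if i not in self.cache:           # cache miss (e.g. a lost node):
                self.cache[i] = self.transform(self.parent.partition(i))
            return self.cache[i]              # ...recompute from the parent

    log = SketchRDD(data=[["ERROR\tdisk\tfoo", "INFO\tok\tbar"],
                          ["ERROR\tnet\tbar"]])
    messages = (log.filter(lambda l: l.startswith("ERROR"))
                   .map(lambda l: l.split("\t")[2]))
    print(messages.partition(0))              # ['foo']
    messages.cache.pop(0, None)               # simulate losing the cached partition
    print(messages.partition(0))              # recomputed transparently: ['foo']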
Example: Logistic Regression
Find the best line separating two sets of points, starting from a random initial line and iterating toward the target.

Logistic Regression Code (Scala):

    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final w: " + w)

Logistic Regression Performance
Without in-memory caching: 127 s per iteration. With caching: 174 s for the first iteration (which loads the data), then 6 s per further iteration.

GraphX: Encoding Property Graphs as RDDs
[Figure]

GraphX: Graph System Optimizations
[Figure]

PageRank Benchmark
GraphX performs comparably to state-of-the-art graph-processing systems.

Thanks for your attention!

References
• Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51.1 (2008): 107-113.
• Miner, Donald, and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O'Reilly Media, Inc., 2012.
• White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
• Hadoop: http://hadoop.apache.org/ and http://hadoop.apache.org/common
• Pig: http://hadoop.apache.org/pig
• Hive: http://hadoop.apache.org/hive
• Spark: http://spark-project.org
• Cloudera: http://www.cloudera.com/; Hadoop video tutorials: www.cloudera.com/hadoop-training
• Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/