Big Data Analysis and Mining
Weixiong Rao 饶卫雄, Tongji University 同济大学软件学院, 2015 Fall, [email protected]
*Some of the slides are from Dr. Jure Leskovec and Prof. Zachary G. Ives.

DAM is here!
Product Recommendation
Web Search Ranking
Spam e-Mail Detection

Traditional DAM
Oracle DB, IBM DW products on very powerful servers, SAP ERP, Salesforce CRM, and flat files from legacy systems feed into DAM tools.

Big Data
A typical large enterprise: 5,000-50,000 servers, terabytes of data, millions of transactions per day.
In contrast, many Internet companies run millions of servers and hold petabytes of data.
Google: lots and lots of web pages; billions of Google queries per day.
Facebook: a billion Facebook users; a billion+ Facebook pages.
Twitter: hundreds of millions of Twitter accounts; hundreds of millions of Tweets per day.

Nowadays DAM solutions
Google, Facebook, LinkedIn, eBay, Amazon... did not use the traditional data warehouse products for DAM.
Why? The CAP theorem; different assumptions lead to different solutions.
What? Massive parallelism: the Hadoop MapReduce paradigm, UC Berkeley Shark/Spark.

What's DAM?
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making.
Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes.

What's big DAM?
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Our course: how to do DAM in the big data context.
Data Mining ≈ Predictive Analytics ≈ Data Science ≈ Business Intelligence; Big data mining ≈ Massive data analysis.

Let's focus on big DAM – what matters when dealing with data?

Let's focus on big DAM – cultures of data mining
Data mining overlaps with: Databases (large-scale data, simple queries); Machine learning (small data, complex models); CS theory ((randomized) algorithms).
Different cultures: to a DB person, data mining is an extreme form of analytic processing – queries that examine large amounts of data, and the result is the query answer. To an ML person, data mining is the inference of models, and the result is the parameters of the model.

Let's focus on big data mining
This class overlaps with machine learning, statistics, artificial intelligence, and databases, but puts more stress on: scalability (big data), algorithms, computing architectures, and automation for handling real big data.
The required background: data structures and algorithm design, probability and linear algebra, operating systems, Java program design.

What will we learn?
We will learn to mine different types of data: high-dimensional data, graph data, infinite/never-ending data, labeled data.
We will learn to use different models of computation: Matlab + Hadoop + Spark; streams and online algorithms; single machine in-memory.

What will we learn?
We will learn to solve real-world problems: recommender systems, market basket analysis, spam detection, duplicate document detection.
We will learn various "tools": optimization (stochastic gradient descent), dynamic programming (frequent itemsets), hashing (LSH, Bloom filters).
*From Dr. Jure Leskovec's slides.
The course landscape
Layers: Apps; ML algorithms; Matlab + Hadoop + Apache Spark; Data (high-dimensional data, graph data, infinite data).

About the course
Office hours: Weixiong: every Tuesday 13:00-15:00 (SSE building, room 422); Teaching Assistants (TAs): ?
Course website: ? (soon)
Textbook:

Workload for the course
4 homeworks: 20%; 3 quizzes: 30%; final exam: 25%; project: 25%. Not finalized!

Platforms for Big Data Mining
Parallel DBMS technologies: proposed in the late eighties, matured over the last two decades; a multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises.
Hadoop.
Spark (UC Berkeley).

Parallel DBMS (PDBMS) technologies
Popularly used for more than two decades.
Research projects: Gamma, Grace, …
Commercial: a multi-billion dollar industry, but access to only a privileged few.
Relational data model; indexing; familiar SQL interface; advanced query optimization; well understood and studied; very reliable!

MapReduce
Overview: a data-parallel programming model and an associated parallel and distributed implementation for commodity clusters.
Pioneered by Google: processed 20 PB of data per day (circa 2008).
Popularized by the open-source Hadoop project: used by Yahoo!, Facebook, Amazon, and the list is growing …

Open discussion between PDBMS and MR
PDBMS community: 1. MapReduce: A major step backwards; 2. A Comparison of Approaches to Large-Scale Data Analysis; 3. MapReduce and Parallel DBMSs: Friends or Foes?
MR community: 1. MapReduce: A Flexible Data Processing Tool.

PDBMS vs. MR (comparison dimensions)
Schema support: not out of the box in MR.
Indexing.
Programming model: declarative (SQL) for PDBMS; imperative (C/C++, Java, …) for MR, with extensions through Pig and Hive.
Query optimization.
Flexibility.
Fault tolerance: coarse-grained techniques.

Single Node Architecture

Motivation: Google example
20+ billion web pages x 20 KB = 400+ TB.
One computer reads 30-35 MB/sec from disk, so it would take ~4 months just to read the web; it takes even more to do something useful with the data!
Recently a standard architecture for such problems emerged: a cluster of commodity Linux nodes, connected by a commodity network (Ethernet).

Cluster Architecture

Google server room in Council Bluffs, Iowa
Data centers consume up to 1.5 percent of all the world's electricity.
The huge fans sound like jet engines jacked through Marshall amps.

A central cooling plant in Google's Douglas County, Georgia, data center
http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/

Large-scale Computing
Large-scale computing for data mining problems on commodity hardware.
Challenges: How do you distribute computation? How can we make it easy to write distributed programs?
Machines fail (fault tolerance): one server may stay up 3 years (1,000 days); if you have 1,000 servers, expect to lose one per day; with 1M machines, 1,000 machines fail every day!

Basic Idea
Issue: copying data over a network takes time.
Idea: bring computation to data; store files multiple times for reliability.
MapReduce addresses these problems:
Storage infrastructure – a file system (Google: GFS; Hadoop: HDFS) [next]
Programming model – MapReduce

Storage Infrastructure
Problem: if nodes fail, how do we store data persistently?
Answer: a Distributed File System
Provides a global file namespace.
Key assumption – the typical usage pattern: huge files (100s of GB to TB); data is rarely updated in place; reads and appends are common.

Distributed File System
Chunk servers: a file is split into contiguous chunks; typically each chunk is 16-64 MB; each chunk is replicated (usually 2x or 3x); try to keep replicas in different racks.
Master node (a.k.a. the Name Node in Hadoop's HDFS): stores metadata about where files are stored; might be replicated.
Client library for file access: talks to the master to find chunk servers, then connects directly to chunk servers to access data.

Distributed File System
A reliable distributed file system: data is kept in "chunks" spread across machines, and each chunk is replicated on different machines, giving seamless recovery from disk or machine failure.

Basic Idea
Issue: copying data over a network takes time.
Idea: bring computation to data; store files multiple times for reliability.
MapReduce addresses these problems:
Storage infrastructure – a file system (Google: GFS; Hadoop: HDFS) [next]
Programming model – MapReduce

What is HDFS (Hadoop Distributed File System)?
HDFS is a distributed file system that makes some unique tradeoffs that are good for MapReduce.
What HDFS does well: very large read-only or append-only files (individual files may contain gigabytes/terabytes of data); sequential access patterns.
What HDFS does not do well: storing lots of small files; low-latency access; multiple writers; writing to arbitrary offsets in a file.

HDFS versus NFS
Network File System (NFS): a single machine makes part of its file system available to other machines; sequential or random access. PRO: simplicity, generality, transparency. CON: storage capacity and throughput are limited by a single server.
Hadoop Distributed File System (HDFS): a single virtual file system spread over many machines; optimized for sequential reads and local accesses. PRO: high throughput, high capacity. CON: specialized for particular types of applications.

How data is stored in HDFS
(Figure: a client asks the name node which data node holds block #9 of foo.txt, then reads the block directly from that data node; the name node maps foo.txt to blocks 3, 9, 6 and bar.data to blocks 2, 4, and the blocks are replicated across the data nodes.)
Files are stored as sets of (large) blocks: the default block size is 64 MB (the ext4 default is 4 KB!); blocks are replicated for durability and availability. What are the advantages of this design?
The namespace is managed by a single name node; the actual data transfer is directly between client and data node. Pros and cons of this decision?

The Namenode
(Example name node state: foo.txt: 3,9,6; bar.data: 2,4; blah.txt: 17,18,19,20; xyz.img: 8,5,1,11.)
State is stored in two files: fsimage and edits.
fsimage: a snapshot of the file system metadata.
edits: the changes since the last snapshot (e.g., created abc.txt; appended block 21 to blah.txt; deleted foo.txt; appended block 22 to blah.txt; appended block 23 to xyz.img; ...).
Normal operation: when the namenode starts, it reads fsimage and then applies all the changes from edits sequentially. Pros and cons of this design?

The Secondary Namenode
What if the state of the namenode is lost? Data in the file system can no longer be read!
Solution #1: Metadata backups.
The namenode can write its metadata to a local disk and/or to a remote NFS mount.
Solution #2: the Secondary Namenode.
Purpose: periodically merge the edit log with the fsimage to prevent the log from growing too large.
It has a copy of the metadata, which can be used to reconstruct the state of the namenode.
But: its state lags behind somewhat, so some data loss is likely if the namenode fails.

Accessing data in HDFS

[ahae@carbon ~]$ ls -la /tmp/hadoop-ahae/dfs/data/current/
total 209588
drwxrwxr-x 2 ahae ahae     4096 2013-10-08 15:46 .
drwxrwxr-x 5 ahae ahae     4096 2013-10-08 15:39 ..
-rw-rw-r-- 1 ahae ahae 11568995 2013-10-08 15:44 blk_-3562426239750716067
-rw-rw-r-- 1 ahae ahae    90391 2013-10-08 15:44 blk_-3562426239750716067_1020.meta
-rw-rw-r-- 1 ahae ahae        4 2013-10-08 15:40 blk_5467088600876920840
-rw-rw-r-- 1 ahae ahae       11 2013-10-08 15:40 blk_5467088600876920840_1019.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_7080460240917416109
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_7080460240917416109_1020.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_-8388309644856805769
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_-8388309644856805769_1020.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_-9220415087134372383
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_-9220415087134372383_1020.meta
-rw-rw-r-- 1 ahae ahae      158 2013-10-08 15:40 VERSION
[ahae@carbon ~]$

HDFS implements a separate namespace: files in HDFS are not visible in the normal file system; only the blocks and the block metadata are visible.
HDFS cannot be (easily) mounted, although some FUSE drivers have been implemented for it.

Accessing data in HDFS

[ahae@carbon ~]$ /usr/local/hadoop/bin/hadoop fs -ls /user/ahae
Found 4 items
-rw-r--r--   1 ahae supergroup      1366 2013-10-08 15:46 /user/ahae/README.txt
-rw-r--r--   1 ahae supergroup         0 2013-10-08 15:35 /user/ahae/input
-rw-r--r--   1 ahae supergroup         0 2013-10-08 15:39 /user/ahae/input2
-rw-r--r--   1 ahae supergroup 212895587 2013-10-08 15:44 /user/ahae/input3
[ahae@carbon ~]$

File access is through the hadoop command. Examples:
hadoop fs -put [file] [hdfsPath]    Stores a file in HDFS
hadoop fs -ls [hdfsPath]            Lists a directory
hadoop fs -get [hdfsPath] [file]    Retrieves a file from HDFS
hadoop fs -rm [hdfsPath]            Deletes a file in HDFS
hadoop fs -mkdir [hdfsPath]         Makes a directory in HDFS

Alternatives to the command line
Getting data in and out of HDFS through the command-line interface is a bit cumbersome. Alternatives have been developed:
FUSE file system: allows HDFS to be mounted under Unix.
WebDAV share: can be mounted as a filesystem on many OSes.
HTTP: read access through the namenode's embedded web server.
FTP: a standard FTP interface.
...

Accessing HDFS directly from Java
Programs can read/write HDFS files directly (not needed in MapReduce, where I/O is handled by the framework).
Files are represented as URIs, for example: hdfs://localhost/user/ahae/example.txt
Access is via the FileSystem API: to get access to the file, use FileSystem.get(); for reading, call open() -- returns an InputStream; for writing, call create() -- returns an OutputStream.

What about permissions?
Since 0.16.1, Hadoop has had rudimentary support for POSIX-style permissions: rwx for users, groups, and 'other' -- just like in Unix; 'hadoop fs' has support for chmod, chgrp, chown.
But the POSIX model is not a very good fit:
Many combinations are meaningless: files cannot be executed, and existing files cannot really be written to.
Permissions were not really enforced: Hadoop does not verify whether a user's identity is genuine; they are useful more to prevent accidental data corruption or casual misuse of information.

Where are things today?
Since v0.20.20x, Hadoop has some security: Kerberos RPC (SASL/GSSAPI); HTTP SPNEGO authentication for web consoles; HDFS file permissions actually enforced; various kinds of delegation tokens; network encryption.
For more details, see: https://issues.apache.org/jira/secure/attachment/12428537/security-design.pdf
Big changes are coming: Project Rhino (e.g., encrypted data at rest).

Recap: HDFS
HDFS: a specialized distributed file system; good for large amounts of data and sequential reads; bad for lots of small files, random access, and non-append writes.
Architecture: blocks, namenode, datanodes. File data is broken into large blocks (64 MB default); blocks are stored & replicated by datanodes; a single namenode manages all the metadata; the secondary namenode provides housekeeping & (some) redundancy.
Usage: a special command-line interface, e.g., hadoop fs -ls /path/in/hdfs

Basic Idea
Issue: copying data over a network takes time.
Idea: bring computation to data; store files multiple times for reliability.
MapReduce addresses these problems:
Storage infrastructure – a file system (Google: GFS; Hadoop: HDFS)
Programming model – MapReduce [next]

Recall HashTable
A hash function maps input keys to buckets.

From HashTable to Distributed Hash Table (DHT)
Node-1, Node-2, ..., Node-n: a distributed hash function maps input keys to physical nodes.

From DHT to MapReduce
Node-1, Node-2, ..., Node-n: each node runs Map() and Reduce().

The MapReduce programming model
MapReduce is a distributed programming model; in many circles, it is considered the key building block for much of Google's data analysis.
A programming language built on it: Sawzall, http://labs.google.com/papers/sawzall.html
"… Sawzall has become one of the most widely used programming languages at Google. … [O]n one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2x10^15 bytes of data (2.8 PB) and wrote 9.9x10^12 bytes (9.3 TB)."
Other similar languages: Yahoo's Pig Latin and Pig; Microsoft's Dryad.
Cloned in open source: Hadoop, http://hadoop.apache.org/

The MapReduce programming model
Simple distributed functional programming primitives, modeled after the Lisp primitives map (apply a function to all items in a collection) and reduce (apply a function to a set of items with a common key).
We start with: a user-defined function to be applied to all data, map: (key, value) → (key, value); another user-specified operation, reduce: (key, {set of values}) → result; and a set of n nodes, each with data.
All nodes run map on all of their data, producing new data with keys; this data is collected by key, then shuffled, and finally reduced. Dataflow is through temp files on GFS.

Simple example: Word count
Goal: given a set of documents, count how often each word occurs.
Input: key-value pairs (document:lineNumber, text). Output: key-value pairs (word, #occurrences).
What should be the intermediate key-value pairs?

map(String key, String value) {
    // key: document name, line no
    // value: contents of line
    for each word w in value:
        emit(w, "1")
}

reduce(String key, Iterator values) {
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    emit(key, result)
}

Simple example: Word count
(Figure: four mappers are each responsible for a range of input lines -- (1-2), (3-4), (5-6), (7-8) -- such as (1, the apple), (2, is an apple), (3, not an orange), (4, because the), (5, orange), (6, unlike the apple), (7, is orange), (8, not green). Four reducers are responsible for the key ranges (A-G), (H-N), (O-U), (V-Z). Intermediate pairs such as (apple, 1) are grouped into (apple, {1, 1, 1}) and reduced to the final counts (apple, 3), (an, 2), (because, 1), (green, 1), (is, 2), (not, 2), (orange, 3), (the, 3), (unlike, 1).)
1. Each mapper receives some of the KV-pairs as input.
2. The mappers process the KV-pairs one by one.
3. Each KV-pair output by the mapper is sent to the reducer that is responsible for it.
4. The reducers sort their input by key and group it.
5. The reducers process their input one group at a time.

MapReduce dataflow
(Figure: input data → mappers → intermediate (key,value) pairs → "the shuffle" → reducers → output data.)
What is meant by a 'dataflow'? What makes this so scalable?
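The dataflow above can be imitated on a single machine. The following is a minimal sketch (plain Java, not Hadoop code) of the three phases -- map, shuffle/group, reduce -- for the word-count example; all class and method names are illustrative only.

    import java.util.*;

    // A minimal, single-machine sketch of the map -> shuffle -> reduce dataflow
    // for word count. It only illustrates the phases; it is not Hadoop code.
    public class WordCountSimulation {
        // Map phase: emit (word, 1) for every word in every input line
        static List<Map.Entry<String, Integer>> map(List<String> lines) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String line : lines)
                for (String word : line.split("\\s+"))
                    out.add(new AbstractMap.SimpleEntry<>(word, 1));
            return out;
        }

        // Shuffle/group phase: collect all values that share the same key
        static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
            Map<String, List<Integer>> groups = new TreeMap<>();  // sorted by key, like a reducer's input
            for (Map.Entry<String, Integer> p : pairs)
                groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
            return groups;
        }

        // Reduce phase: aggregate the values for each key
        static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
            Map<String, Integer> result = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
                int sum = 0;
                for (int v : g.getValue()) sum += v;
                result.put(g.getKey(), sum);
            }
            return result;
        }

        public static void main(String[] args) {
            List<String> lines = Arrays.asList("the apple", "is an apple", "not an orange",
                    "because the", "orange", "unlike the apple", "is orange", "not green");
            System.out.println(reduce(shuffle(map(lines))));
            // {an=2, apple=3, because=1, green=1, is=2, not=2, orange=3, the=3, unlike=1}
        }
    }

In Hadoop, the shuffle/group phase is performed by the framework between the map and reduce phases; the programmer supplies only the map and reduce functions.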
Steps of MapReduce
The three steps of MapReduce: sequentially read a lot of data; Map: extract something you care about; Group by key: sort and shuffle; Reduce: aggregate, summarize, filter, or transform; then output the result.

The Map Step

The Reduce Step

More Details
Input: a set of key-value pairs. The programmer specifies two methods:
Map(k, v) → <k', v'>*: takes a key-value pair and outputs a set of key-value pairs (e.g., the key is the filename and the value is a single line in the file). There is one Map call for every (k, v) pair.
Reduce(k', <v'>*) → <k', v''>*: all values v' with the same key k' are reduced together and processed in v' order. There is one Reduce function call per unique key k'.

MapReduce: A Diagram

MapReduce: In Parallel

More details on the MapReduce data flow
(Figure: a coordinator manages data partitions by key, map computation partitions, redistribution by the output's key ("shuffle"), and reduce computation partitions; by default, MapReduce uses the filesystem between stages.)

More examples
Distributed grep – all lines matching a pattern: Map filters by the pattern; Reduce outputs the set.
Count URL access frequency: Map outputs each URL as key with count 1; Reduce sums the counts.
Reverse web-link graph: Map outputs (target, source) pairs when a link to target is found in source; Reduce concatenates the values and emits (target, list(source)).
Inverted index: Map emits (word, documentID); Reduce combines these into (word, list(documentID)).

What do we need to write a MR program?
A mapper: accepts (key,value) pairs from the input and produces intermediate (key,value) pairs, which are then shuffled.
A reducer: accepts intermediate (key,value) pairs and produces final (key,value) pairs for the output.
A driver: specifies which inputs to use and where to put the outputs, and chooses the mapper and the reducer to use.
Hadoop takes care of the rest!! Default behaviors can be customized by the driver.

The Mapper
Input format: (file offset, line). The intermediate format can be freely chosen.

    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.io.*;

    public class FooMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(new Text("foo"), value);
        }
    }

Extends the abstract 'Mapper' class; input/output types are specified as type parameters.
Implements a 'map' function: it accepts a (key,value) pair of the specified type and writes output pairs by calling the 'write' method on the context. Mixing up the types will cause problems at runtime (!)

The Reducer
Intermediate format (same as the mapper output) and output format:

    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.io.*;

    public class FooReducer extends Reducer<Text, Text, IntWritable, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws java.io.IOException, InterruptedException {
            for (Text value: values)               // Note: we may get multiple values for the same key!
                context.write(new IntWritable(4711), value);
        }
    }

Extends the abstract 'Reducer' class: must specify the types again (they must be compatible with the mapper!).
Implements a 'reduce' function: values are passed in as an 'Iterable'.
Caution: these are NOT normal Java classes. Do not store them in collections - their content can change between iterations!
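As a slightly more realistic pair than FooMapper/FooReducer, here is a sketch of the word-count job from earlier written against the same Mapper/Reducer API; the class names WordCountMapper and WordCountReducer are illustrative and not part of the original slides.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: (file offset, line) -> (word, 1) for every word in the line
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);          // emit (word, 1)
            }
        }
    }

    // Reducer: (word, {1, 1, ...}) -> (word, total count)
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values)
                sum += v.get();                    // add up all the 1s for this word
            context.write(key, new IntWritable(sum));
        }
    }

Note how the mapper's output types (Text, IntWritable) match the reducer's input types, as the type-compatibility warning above requires; a driver like the one shown next would wire such classes together.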
The Driver

    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FooDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job();
            job.setJarByClass(FooDriver.class);                    // Mapper & Reducer are in the same Jar as FooDriver
            FileInputFormat.addInputPath(job, new Path("in"));     // input path
            FileOutputFormat.setOutputPath(job, new Path("out"));  // output path
            job.setMapperClass(FooMapper.class);
            job.setReducerClass(FooReducer.class);
            job.setOutputKeyClass(Text.class);                     // format of the (key,value) pairs
            job.setOutputValueClass(Text.class);                   //   output by the reducer
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Specifies how the job is to be executed: the input and output directories, and the mapper & reducer classes.

Manual compilation
Goal: produce a JAR file that contains the classes for mapper, reducer, and driver. This can be submitted to the Job Tracker, or run directly through Hadoop.
Step #1: Put hadoop-core-1.0.3.jar into the classpath:
    export CLASSPATH=$CLASSPATH:/path/to/hadoop/hadoop-core-1.0.3.jar
Step #2: Compile mapper, reducer, driver:
    javac FooMapper.java FooReducer.java FooDriver.java
Step #3: Package into a JAR file:
    jar cvf Foo.jar *.class
Alternative: "Export..."/"Java JAR file" in Eclipse.

Standalone mode installation
What is standalone mode? Installation on a single node; no daemons running (no Task Tracker, no Job Tracker); Hadoop runs as an 'ordinary' Java program; used for debugging.
How to install Hadoop in standalone mode? See Textbook Appendix A; already done in your VM image.

Running a job in standalone mode
Step #1: Create & populate the input directory: it is configured in the driver via addInputPath(); put input file(s) into this directory (it is ok to have more than one); the output directory must not exist yet.
Step #2: Run Hadoop. As simple as this: hadoop jar <jarName> <driverClassName>, for example: hadoop jar foo.jar upenn.nets212.FooDriver. In verbose mode, Hadoop will print statistics while running.
Step #3: Collect the output files.

Recap: Writing simple jobs for Hadoop
Write a mapper, reducer, driver: custom serialization: must use special data types (Writable); explicitly declare all three (key,value) types.
Package into a JAR file: must contain the class files for mapper, reducer, and driver; create it manually (javac/jar) or automatically (ant).
Running in standalone mode: hadoop jar foo.jar FooDriver; input and output directories are in the local file system.

Common mistakes to avoid
Mapper and reducer should be stateless: don't use static variables - after map and reduce return, they should remember nothing about the processed data! Reason: there are no guarantees about which key-value pairs will be processed by which workers!

    // Wrong!
    HashMap h = new HashMap();
    map(key, value) {
        if (h.contains(key)) { h.add(key, value); emit(key, "X"); }
    }

Don't try to do your own I/O! Don't try to read from, or write to, files in the file system. The MapReduce framework does all the I/O for you: all the incoming data will be fed as arguments to map and reduce, and any data your functions produce should be output via emit.

    // Wrong!
    map(key, value) {
        File foo = new File("xyz.txt");
        while (true) { s = foo.readLine(); ... }
    }

More common mistakes to avoid

    // Wrong!
    map(key, value) { emit("FOO", key + " " + value); }
    reduce(key, value[]) { /* do some computation on all the values */ }

The mapper must not map too much data to the same key; in particular, don't map everything to the same key!! Otherwise the reduce worker will be overwhelmed!
It's okay if some reduce workers have more work than others. Example: in WordCount, the reduce worker that works on the key 'and' has a lot more work than the reduce worker that works on 'syzygy'.

Designing MapReduce algorithms
Key decision: what should be done by map, and what by reduce?
map can do something to each individual key-value pair, but it can't look at other key-value pairs. Example: filtering out key-value pairs we don't need.
map can emit more than one intermediate key-value pair for each incoming key-value pair. Example: the incoming data is text, and map produces (word, 1) for each word.
reduce can aggregate data; it can look at multiple values, as long as map has mapped them to the same (intermediate) key. Example: count the number of words, add up the total cost, ...
Need to get the intermediate format right! If reduce needs to look at several values together, map must emit them using the same key!

Some additional details
To make this work, we need a few more parts…
The file system (distributed across all nodes): stores the inputs, outputs, and temporary results.
The driver program (executes on one node): specifies where to find the inputs and the outputs, specifies what mapper and reducer to use, and can customize the behavior of the execution.
The runtime system (controls the nodes): supervises the execution of tasks, esp. the JobTracker.

Some details
There are fewer computation partitions than data partitions; all data is accessible via a distributed filesystem with replication.
Worker nodes produce data in key order (which makes it easy to merge).
The master is responsible for scheduling and keeping all nodes busy; it knows how many data partitions there are and which have completed – atomic commits to disk.
Locality: the master tries to do work on nodes that have replicas of the data.
The master can deal with stragglers (slow machines) by re-executing their tasks somewhere else.

What if a worker crashes?
We rely on the file system being shared across all the nodes. Two types of (crash) faults:
The node wrote its output and then crashed: here, the file system is likely to have a copy of the complete output.
The node crashed before finishing its output: the JobTracker sees that the job isn't making progress and restarts the job elsewhere on the system (of course, we then have fewer nodes to do work…).
But what if the master crashes?

Other challenges
Locality: try to schedule map tasks on machines that already have the data.
Task granularity: how many map tasks? how many reduce tasks?
Dealing with stragglers: schedule some backup tasks.
Saving bandwidth: e.g., with combiners (see the sketch after this section).
Handling bad records: a "last gasp" packet with the current sequence number.

Scale and MapReduce
From a particular Google paper on a language built over MapReduce: the Sawzall usage statistics quoted earlier (32,580 jobs on a 1500-CPU Workqueue cluster, 18,636 failures, 3.2x10^15 bytes read, 9.9x10^12 bytes written).
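To make the combiner point concrete: in Hadoop, a combiner is a reducer-like class that the framework may run on each mapper's local output before the shuffle, so that far fewer intermediate pairs cross the network. Below is a minimal, hypothetical driver sketch for the word-count job, reusing the WordCountMapper/WordCountReducer classes sketched earlier; because summing counts is associative and commutative, the reducer class can double as the combiner.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountWithCombinerDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job();                                 // same style as FooDriver above
            job.setJarByClass(WordCountWithCombinerDriver.class);
            FileInputFormat.addInputPath(job, new Path("in"));
            FileOutputFormat.setOutputPath(job, new Path("out"));
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);        // local pre-aggregation before the shuffle
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A reducer can only serve as a combiner when its input and output types match (here both are (Text, IntWritable)) and the operation is associative and commutative; the same idea appears again below under "Other Refinements" as a way of saving network bandwidth.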
MapReduce: Simplified Data Processing on Large Clusters
Appeared in OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004. The slides are from Jeff Dean and Sanjay Ghemawat, Google, Inc.

Motivation: Large Scale Data Processing
Many tasks: process lots of data to produce other data; want to use hundreds or thousands of CPUs ... but this needs to be easy.
MapReduce provides: automatic parallelization and distribution; fault tolerance; I/O scheduling; status and monitoring.

Programming model
Input & output: each a set of key/value pairs. The programmer specifies two functions:
map (in_key, in_value) → list(out_key, intermediate_value): processes an input key/value pair and produces a set of intermediate pairs.
reduce (out_key, list(intermediate_value)) → list(out_value): combines all intermediate values for a particular key and produces a set of merged output values (usually just one).
Inspired by similar primitives in LISP and other languages.

Example: Count word occurrences

    map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
            EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
        // output_key: a word
        // output_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
            result += ParseInt(v);
        Emit(AsString(result));

Model is Widely Applicable
MapReduce programs in the Google source tree. Example uses: distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation, ...

Implementation Overview
Typical cluster: 100s/1000s of 2-CPU x86 machines with 2-4 GB of memory; limited bisection bandwidth; storage on local IDE disks.
GFS: a distributed file system manages the data (SOSP'03).
Job scheduling system: jobs are made up of tasks, and a scheduler assigns tasks to machines.
The implementation is a C++ library linked into user programs.

Execution

Parallel Execution

Task Granularity And Pipelining
Fine-granularity tasks: many more map tasks than machines. This minimizes the time for fault recovery, allows pipelining of shuffling with map execution, and gives better dynamic load balancing.
Often 200,000 map tasks and 5,000 reduce tasks are used with 2,000 machines.

Fault tolerance: Handled via re-execution
On worker failure: detect the failure via periodic heartbeats; re-execute completed and in-progress map tasks; re-execute in-progress reduce tasks; task completion is committed through the master.
Master failure: could be handled, but isn't yet (master failure is unlikely).
Robust: lost 1600 of 1800 machines once, but finished fine.
Semantics in the presence of failures: see the paper.

Refinement: Redundant Execution
Slow workers significantly lengthen completion time: other jobs consuming resources on the machine; bad disks with soft errors transfer data very slowly; weird things: processor caches disabled (!!).
Solution: near the end of a phase, spawn backup copies of tasks.
Whichever one finishes first "wins". Effect: this dramatically shortens job completion time.

Refinement: Locality Optimization
Master scheduling policy: ask GFS for the locations of the replicas of the input file blocks; map tasks are typically split into 64 MB pieces (== the GFS block size); map tasks are scheduled so that a GFS replica of the input block is on the same machine or the same rack.
Effect: thousands of machines read input at local disk speed. Without this, rack switches would limit the read rate.

Refinement: Skipping Bad Records
Map/Reduce functions sometimes fail for particular inputs. The best solution is to debug & fix, but that is not always possible.
On a seg fault: send a UDP packet to the master from the signal handler, including the sequence number of the record being processed.
If the master sees two failures for the same record: the next worker is told to skip the record.
Effect: can work around bugs in third-party libraries.

Other Refinements (see paper)
Sorting guarantees within each reduce partition; compression of intermediate data; combiner: useful for saving network bandwidth; local execution for debugging/testing; user-defined counters.

Performance
Tests were run on a cluster of 1800 machines: 4 GB of memory; dual-processor 2 GHz Xeons with Hyperthreading; dual 160 GB IDE disks; Gigabit Ethernet per machine; bisection bandwidth approximately 100 Gbps.
Two benchmarks:
MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records).
MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark).

MR_Grep
The locality optimization helps: 1800 machines read 1 TB of data at a peak of ~31 GB/s; without it, rack switches would limit this to 10 GB/s.
Startup overhead is significant for short jobs.

MR_Sort
Backup tasks reduce job completion time significantly; the system deals well with failures. (Results compared for three runs: normal, no backup tasks, and 200 processes killed.)

Experience: Rewrite of Production Indexing System
Rewrote Google's production indexing system using MapReduce: a set of 10, 14, 17, 21, 24 MapReduce operations.
The new code is simpler and easier to understand; MapReduce takes care of failures and slow machines; it is easy to make indexing faster by adding more machines.

Usage: MapReduce jobs run in August 2004
Number of jobs: 29,423
Average job completion time: 634 secs
Machine days used: 79,186 days
Input data read: 3,288 TB
Intermediate data produced: 758 TB
Output data written: 193 TB
Average worker machines per job: 157
Average worker deaths per job: 1.2
Average map tasks per job: 3,351
Average reduce tasks per job: 55
Unique map implementations: 395
Unique reduce implementations: 269
Unique map/reduce combinations: 426

Related Work
The programming model is inspired by functional language primitives.
Partitioning/shuffling is similar to many large-scale sorting systems: NOW-Sort ['97].
Re-execution for fault tolerance: BAD-FS ['04] and TACC ['97].
The locality optimization has parallels with the Active Disks/Diamond work: Active Disks ['01], Diamond ['04].
Backup tasks are similar to Eager Scheduling in the Charlotte system: Charlotte ['96].
Dynamic load balancing solves a similar problem as River's distributed queues: River ['99].

Conclusions
MapReduce has proven to be a useful abstraction.
It greatly simplifies large-scale computations at Google.
It is fun to use: focus on the problem, and let the library deal with the messy details.