Big Data Analysis and Mining
Weixiong Rao 饶卫雄
Tongji University, School of Software Engineering
2015 Fall
[email protected]
*Some of the slides are from Dr. Jure Leskovec and Prof. Zachary G. Ives.
DAM is here!
Product Recommendation
Web Search Ranking
Spam e-Mail Detection
Traditional DAM
[Diagram: data sources (Oracle DB, SAP ERP, Salesforce CRM, flat files from legacy systems) feed an IBM DW product running on very powerful servers, which is queried by DAM tools.]
Big Data
- Typical large enterprise:
  - 5,000-50,000 servers, terabytes of data, millions of transactions per day
- In contrast, many Internet companies:
  - Millions of servers, petabytes of data
  - Google:
    - Lots and lots of Web pages
    - Billions of Google queries per day
  - Facebook:
    - A billion Facebook users
    - Billion+ Facebook pages
  - Twitter:
    - Hundreds of millions of Twitter accounts
    - Hundreds of millions of Tweets per day
DAM solutions nowadays
- Google, Facebook, LinkedIn, eBay, Amazon... did not use the traditional data warehouse products for DAM.
- Why? The CAP theorem
  - Different assumptions lead to different solutions
- What? Massive parallelism
  - Hadoop MapReduce paradigm
  - UC Berkeley Shark/Spark
What's DAM?
- Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making.
- Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes.
What's big DAM?
- Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  - The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
- Our course: how to do DAM in the Big Data context
  - Data Mining ≈ Predictive Analytics ≈ Data Science ≈ Business Intelligence
  - Big data mining ≈ Massive data analysis
Let's focus on big DAM
- What matters when dealing with data?
Let's focus on big DAM
- What are the cultures of data mining?
- Data mining overlaps with:
  - Databases: large-scale data, simple queries
  - Machine learning: small data, complex models
  - CS theory: (randomized) algorithms
- Different cultures:
  - To a DB person, data mining is an extreme form of analytic processing – queries that examine large amounts of data
    - Result is the query answer
  - To an ML person, data mining is the inference of models
    - Result is the parameters of the model
Let's focus on big data mining
- This class overlaps with machine learning, statistics, artificial intelligence, and databases, but puts more stress on:
  - Scalability (big data)
  - Algorithms
  - Computing architectures
  - Automation for handling real big data
- The required background:
  - Data structures and algorithm design
  - Probability and linear algebra
  - Operating systems
  - Java program design
What will we learn?
- We will learn to mine different types of data:
  - Data is high dimensional
  - Data is a graph
  - Data is infinite/never-ending
  - Data is labeled
- We will learn to use different models of computation:
  - Matlab + Hadoop + Spark
  - Streams and online algorithms
  - Single machine in-memory
What will we learn?
- We will learn to solve real-world problems:
  - Recommender systems
  - Market basket analysis
  - Spam detection
  - Duplicate document detection
- We will learn various "tools":
  - Optimization (stochastic gradient descent)
  - Dynamic programming (frequent itemsets)
  - Hashing (LSH, Bloom filters)
*From Dr. Jure Leskovec's slides.
The course landscape
[Layered diagram: Apps on top, built on ML algorithms, implemented with Matlab + Hadoop + Apache Spark, running over the data: high-dimensional data, graph data, infinite data.]
About the course
- Teaching Assistants (TAs): ?
- Office hours:
  - Weixiong: every Tuesday, 13:00-15:00 (SSE building, room 422)
  - TAs: ?
- Course website: soon
- Textbook:
Workload for the course
- 4 homework assignments: 20%
- 3 quizzes: 30%
- Final exam: 25%
- Project: 25%
Not finalized!
Platforms for Big Data Mining
- Parallel DBMS technologies
  - Proposed in the late eighties
  - Matured over the last two decades
  - Multi-billion dollar industry: proprietary DBMS engines
  - Intended as data warehousing solutions for very large enterprises
- Hadoop
- Spark
  - UC Berkeley
Parallel DBMS (PDBMS) technologies
- Popularly used for more than two decades
  - Research projects: Gamma, Grace, ...
  - Commercial: multi-billion dollar industry, but access to only a privileged few
- Relational data model
  - Indexing
  - Familiar SQL interface
  - Advanced query optimization
  - Well understood and studied
  - Very reliable!
MapReduce
- Overview:
  - Data-parallel programming model
  - An associated parallel and distributed implementation for commodity clusters
- Pioneered by Google
  - Processes 20 PB of data per day (circa 2008)
- Popularized by the open-source Hadoop project
  - Used by Yahoo!, Facebook, Amazon, and the list is growing ...
Open discussion: PDBMS vs. MR
- PDBMS community:
  1. MapReduce: A Major Step Backwards
  2. A Comparison of Approaches to Large-Scale Data Analysis
  3. MapReduce and Parallel DBMSs: Friends or Foes?
- MR community:
  1. MapReduce: A Flexible Data Processing Tool
PDBMS vs. MR
- Dimensions compared: schema support, indexing, programming model, query optimization, flexibility, fault tolerance
- Schema support: not available out of the box in MR
- Programming model: declarative (SQL) in PDBMS; imperative (C/C++, Java, ...) in MR, with extensions through Pig and Hive
- Fault tolerance: coarse-grained techniques
Single Node Architecture
Motivation: Google Example
- 20+ billion web pages x 20 KB = 400+ TB
- One computer reads 30-35 MB/sec from disk
  - ~4 months to read the web
  - It takes even more to do something useful with the data!
- Recently, a standard architecture for such problems emerged:
  - Cluster of commodity Linux nodes
  - Commodity network (Ethernet) to connect them
Cluster Architecture
Google server room in Council Bluffs, Iowa
- Data centers consume up to 1.5 percent of all the world's electricity.
- The huge fans sound like jet engines jacked through Marshall amps.
A central cooling plant in Google's Douglas County, Georgia, data center
http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/
Large-scale Computing
- Large-scale computing for data mining problems on commodity hardware
- Challenges:
  - How do you distribute computation?
  - How can we make it easy to write distributed programs?
  - Machines fail (fault tolerance):
    - One server may stay up 3 years (1,000 days)
    - If you have 1,000 servers, expect to lose one per day
    - With 1M machines, 1,000 machines fail every day!
Basic Idea
- Issue: copying data over a network takes time
- Idea:
  - Bring computation to the data
  - Store files multiple times for reliability
- MapReduce addresses these problems
  - Storage infrastructure – a file system (Google: GFS, Hadoop: HDFS)   [NEXT]
  - Programming model: MapReduce
Storage Infrastructure
- Problem:
  - If nodes fail, how to store data persistently?
- Answer:
  - Distributed file system:
    - Provides a global file namespace
- Key assumption / typical usage pattern:
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common
Distributed File System
- Chunk servers
  - File is split into contiguous chunks
  - Typically each chunk is 16-64 MB
  - Each chunk is replicated (usually 2x or 3x)
  - Try to keep replicas in different racks
- Master node
  - a.k.a. Name Node in Hadoop's HDFS
  - Stores metadata about where files are stored
  - Might be replicated
- Client library for file access
  - Talks to the master to find chunk servers
  - Connects directly to chunk servers to access data
Distributed File System
- Reliable distributed file system
  - Data kept in "chunks" spread across machines
  - Each chunk replicated on different machines
    - Seamless recovery from disk or machine failure
Basic Idea
- Issue: copying data over a network takes time
- Idea:
  - Bring computation to the data
  - Store files multiple times for reliability
- MapReduce addresses these problems
  - Storage infrastructure – a file system (Google: GFS, Hadoop: HDFS)   [NEXT]
  - Programming model: MapReduce
What is HDFS (Hadoop Distributed File System)?
- HDFS is a distributed file system
  - Makes some unique tradeoffs that are good for MapReduce
- What HDFS does well:
  - Very large read-only or append-only files (individual files may contain gigabytes/terabytes of data)
  - Sequential access patterns
- What HDFS does not do well:
  - Storing lots of small files
  - Low-latency access
  - Multiple writers
  - Writing to arbitrary offsets in the file
HDFS versus NFS
- Network File System (NFS)
  - A single machine makes part of its file system available to other machines
  - Sequential or random access
  - PRO: simplicity, generality, transparency
  - CON: storage capacity and throughput limited by a single server
- Hadoop Distributed File System (HDFS)
  - A single virtual file system spread over many machines
  - Optimized for sequential reads and local accesses
  - PRO: high throughput, high capacity
  - CON: specialized for particular types of applications
How data is stored in HDFS
[Diagram: a client asks the name node (which maps foo.txt to blocks 3, 9, 6 and bar.data to blocks 2, 4) for block #9 of foo.txt, then reads block 9 directly from one of the data nodes holding a replica of it.]
- Files are stored as sets of (large) blocks
  - Default block size: 64 MB (the ext4 default is 4 KB!)
  - Blocks are replicated for durability and availability
  - What are the advantages of this design?
- The namespace is managed by a single name node
  - Actual data transfer is directly between the client & a data node
  - Pros and cons of this decision?
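This division of labor (metadata from the name node, bulk data directly from the data nodes) can be observed through Hadoop's Java FileSystem API, which is introduced later in these slides. The sketch below is a minimal, hedged illustration and not part of the original slides; the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/ahae/foo.txt");   // hypothetical file
        FileStatus status = fs.getFileStatus(p);

        // This metadata query is answered by the name node; reading the blocks
        // themselves would then go directly to the listed data nodes.
        for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
                    + ", hosts " + String.join(",", b.getHosts()));
        }
    }
}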
The Namenode
[Diagram: the name node maps each file to its block list (foo.txt: 3,9,6; bar.data: 2,4; blah.txt: 17,18,19,20; xyz.img: 8,5,1,11); its state lives in fsimage plus an edit log with entries such as "Created abc.txt", "Appended block 21 to blah.txt", "Deleted foo.txt", "Appended block 22 to blah.txt", "Appended block 23 to xyz.img", ...]
- State is stored in two files: fsimage and edits
  - fsimage: a snapshot of the file system metadata
  - edits: the changes since the last snapshot
- Normal operation:
  - When the namenode starts, it reads fsimage and then applies all the changes from edits sequentially
  - Pros and cons of this design?
The Secondary Namenode
- What if the state of the namenode is lost?
  - Data in the file system can no longer be read!
- Solution #1: Metadata backups
  - The namenode can write its metadata to a local disk, and/or to a remote NFS mount
- Solution #2: Secondary namenode
  - Purpose: periodically merge the edit log with the fsimage to prevent the log from growing too large
  - Has a copy of the metadata, which can be used to reconstruct the state of the namenode
  - But: its state lags behind somewhat, so data loss is likely if the namenode fails
Accessing data in HDFS
[ahae@carbon ~]$ ls -la /tmp/hadoop-ahae/dfs/data/current/
total 209588
drwxrwxr-x 2 ahae ahae     4096 2013-10-08 15:46 .
drwxrwxr-x 5 ahae ahae     4096 2013-10-08 15:39 ..
-rw-rw-r-- 1 ahae ahae 11568995 2013-10-08 15:44 blk_-3562426239750716067
-rw-rw-r-- 1 ahae ahae    90391 2013-10-08 15:44 blk_-3562426239750716067_1020.meta
-rw-rw-r-- 1 ahae ahae        4 2013-10-08 15:40 blk_5467088600876920840
-rw-rw-r-- 1 ahae ahae       11 2013-10-08 15:40 blk_5467088600876920840_1019.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_7080460240917416109
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_7080460240917416109_1020.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_-8388309644856805769
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_-8388309644856805769_1020.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_-9220415087134372383
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_-9220415087134372383_1020.meta
-rw-rw-r-- 1 ahae ahae      158 2013-10-08 15:40 VERSION
[ahae@carbon ~]$

- HDFS implements a separate namespace
  - Files in HDFS are not visible in the normal file system
  - Only the blocks and the block metadata are visible
  - HDFS cannot be (easily) mounted
    - Some FUSE drivers have been implemented for it
Accessing data in HDFS
[ahae@carbon ~]$ /usr/local/hadoop/bin/hadoop fs -ls /user/ahae
Found 4 items
-rw-r--r--   1 ahae supergroup      1366 2013-10-08 15:46 /user/ahae/README.txt
-rw-r--r--   1 ahae supergroup         0 2013-10-08 15:35 /user/ahae/input
-rw-r--r--   1 ahae supergroup         0 2013-10-08 15:39 /user/ahae/input2
-rw-r--r--   1 ahae supergroup 212895587 2013-10-08 15:44 /user/ahae/input3
[ahae@carbon ~]$

- File access is through the hadoop command
- Examples:
  - hadoop fs -put [file] [hdfsPath]    (stores a file in HDFS)
  - hadoop fs -ls [hdfsPath]            (lists a directory)
  - hadoop fs -get [hdfsPath] [file]    (retrieves a file from HDFS)
  - hadoop fs -rm [hdfsPath]            (deletes a file in HDFS)
  - hadoop fs -mkdir [hdfsPath]         (makes a directory in HDFS)
Alternatives to the command line
- Getting data in and out of HDFS through the command-line interface is a bit cumbersome
- Alternatives have been developed:
  - FUSE file system: allows HDFS to be mounted under Unix
  - WebDAV share: can be mounted as a filesystem on many OSes
  - HTTP: read access through the namenode's embedded web server
  - FTP: standard FTP interface
  - ...
Accessing HDFS directly from Java
- Programs can read/write HDFS files directly
  - Not needed in MapReduce; I/O is handled by the framework
- Files are represented as URIs
  - Example: hdfs://localhost/user/ahae/example.txt
- Access is via the FileSystem API
  - To get access to the file: FileSystem.get()
  - For reading, call open() -- returns InputStream
  - For writing, call create() -- returns OutputStream
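To make those calls concrete, here is a minimal sketch (not from the slides) that writes and then reads a small HDFS file via FileSystem.get(), create(), and open(); the URI and path are assumptions for illustration only.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
        Path path = new Path("/user/ahae/example.txt");   // hypothetical path

        // create() returns an OutputStream for a new file
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("hello hdfs\n");
        }

        // open() returns an InputStream for an existing file
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}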
What about permissions?
- Since 0.16.1, Hadoop has rudimentary support for POSIX-style permissions
  - rwx for users, groups, 'other' -- just like in Unix
  - 'hadoop fs' has support for chmod, chgrp, chown
- But: the POSIX model is not a very good fit
  - Many combinations are meaningless: files cannot be executed, and existing files cannot really be written to
- Permissions were not really enforced
  - Hadoop does not verify whether the user's identity is genuine
  - Useful more to prevent accidental data corruption or casual misuse of information
Where are things today?
- Since v0.20.20x, Hadoop has some security
  - Kerberos RPC (SASL/GSSAPI)
  - HTTP SPNEGO authentication for web consoles
  - HDFS file permissions actually enforced
  - Various kinds of delegation tokens
  - Network encryption
  - For more details, see: https://issues.apache.org/jira/secure/attachment/12428537/security-design.pdf
- Big changes are coming
  - Project Rhino (e.g., encrypted data at rest)
Recap: HDFS
- HDFS: a specialized distributed file system
  - Good for large amounts of data and sequential reads
  - Bad for lots of small files, random access, non-append writes
- Architecture: blocks, namenode, datanodes
  - File data is broken into large blocks (64 MB default)
  - Blocks are stored & replicated by datanodes
  - A single namenode manages all the metadata
  - Secondary namenode: housekeeping & (some) redundancy
- Usage: special command-line interface
  - Example: hadoop fs -ls /path/in/hdfs
Basic Idea
- Issue: copying data over a network takes time
- Idea:
  - Bring computation to the data
  - Store files multiple times for reliability
- MapReduce addresses these problems
  - Storage infrastructure – a file system (Google: GFS, Hadoop: HDFS)
  - Programming model: MapReduce   [NEXT]
Recall hash tables
- A hash function maps input keys to buckets.
From HashTable to Distributed Hash Table (DHT)
[Diagram: keys hashed onto physical nodes Node-1, Node-2, ..., Node-n]
- A distributed hash function maps input keys to physical nodes.
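A minimal sketch of this idea, assuming a simple modulo placement scheme: hash the key and map it onto one of n nodes. (Hadoop's built-in HashPartitioner assigns intermediate keys to reducers in essentially the same way.)

public class SimpleDht {
    // Mask off the sign bit so the modulo result is always a valid node index.
    static int nodeFor(String key, int numNodes) {
        return (key.hashCode() & Integer.MAX_VALUE) % numNodes;
    }

    public static void main(String[] args) {
        for (String key : new String[] { "apple", "orange", "banana" }) {
            System.out.println(key + " -> Node-" + (nodeFor(key, 4) + 1));
        }
    }
}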
From DHT to MapReduce
[Diagram: Map() and Reduce() computations run across Node-1, Node-2, ..., Node-n]
The MapReduce programming model
- MapReduce is a distributed programming model
- In many circles, considered the key building block for much of Google's data analysis
  - A programming language built on it: Sawzall, http://labs.google.com/papers/sawzall.html
    - "... Sawzall has become one of the most widely used programming languages at Google. ... [O]n one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2x10^15 bytes of data (2.8 PB) and wrote 9.9x10^12 bytes (9.3 TB)."
  - Other similar languages: Yahoo's Pig Latin and Pig; Microsoft's Dryad
- Cloned in open source: Hadoop, http://hadoop.apache.org/
The MapReduce programming model
- Simple distributed functional programming primitives
- Modeled after Lisp primitives:
  - map (apply a function to all items in a collection) and
  - reduce (apply a function to a set of items with a common key)
- We start with:
  - A user-defined function to be applied to all data:
      map: (key, value) → (key, value)
  - Another user-specified operation:
      reduce: (key, {set of values}) → result
  - A set of n nodes, each with data
- All nodes run map on all of their data, producing new data with keys
  - This data is collected by key, then shuffled, and finally reduced
  - Dataflow is through temp files on GFS
Simple example: Word count

map(String key, String value) {
    // key: document name, line no
    // value: contents of line
    for each word w in value:
        emit(w, "1")
}

reduce(String key, Iterator values) {
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    emit(key, result)
}

- Goal: given a set of documents, count how often each word occurs
  - Input: key-value pairs (document:lineNumber, text)
  - Output: key-value pairs (word, #occurrences)
  - What should be the intermediate key-value pairs?
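For reference, the pseudocode above might look roughly like the following as Hadoop Java classes; the class names are mine, and the style mirrors the FooMapper/FooReducer pattern shown later in these slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word on the input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the counts for each word (would live in its own file in a real project)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}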
Simple example: Word count
[Diagram: four mappers each receive two of the input pairs (1, the apple), (2, is an apple), (3, not an orange), (4, because the), (5, orange), (6, unlike the apple), (7, is orange), (8, not green); they emit (word, 1) pairs, which are routed to four reducers by key range (A-G), (H-N), (O-U), (V-Z); the reducers group and sum, producing (apple, 3), (an, 2), (because, 1), (green, 1), (is, 2), (not, 2), (orange, 3), (the, 3), (unlike, 1).]
1. Each mapper receives some of the KV-pairs as input
2. The mappers process the KV-pairs one by one
3. Each KV-pair output by the mapper is sent to the reducer that is responsible for it
4. The reducers sort their input by key and group it
5. The reducers process their input one group at a time
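Step 3 above (routing each intermediate pair to the reducer responsible for its key) is decided by a partitioner. Hadoop's default is hash-based, but a hypothetical range partitioner matching the diagram's (A-G), (H-N), (O-U), (V-Z) split could look like this sketch (it would be registered via job.setPartitionerClass):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each word to one of four reducers by its first letter.
public class RangePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (word.isEmpty()) return 0;
        char first = Character.toUpperCase(word.charAt(0));
        int bucket;
        if (first <= 'G')      bucket = 0;   // (A-G)
        else if (first <= 'N') bucket = 1;   // (H-N)
        else if (first <= 'U') bucket = 2;   // (O-U)
        else                   bucket = 3;   // (V-Z)
        return bucket % numPartitions;       // stay within the configured reducer count
    }
}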
MapReduce dataflow
[Diagram: input data flows into several mappers, which produce intermediate (key, value) pairs; "the shuffle" redistributes these pairs to reducers, which write the output data.]
- What is meant by a 'dataflow'?
- What makes this so scalable?
Steps of MapReduce
- The 3 steps of MapReduce:
  - Sequentially read a lot of data
  - Map: extract something you care about
  - Group by key: sort and shuffle
  - Reduce: aggregate, summarize, filter, or transform
  - Output the result
The Map Step
The Reduce Step
More Details
- Input: a set of key-value pairs
- Programmer specifies two methods:
  - Map(k, v) → <k', v'>*
    - Takes a key-value pair and outputs a set of key-value pairs
      - E.g., key is the filename, value is a single line in the file
    - There is one Map call for every (k, v) pair
  - Reduce(k', <v'>*) → <k', v''>*
    - All values v' with the same key k' are reduced together and processed in v' order
    - There is one Reduce function call per unique key k'
MapReduce: A Diagram
MapReduce: In Parallel
More details on the MapReduce data flow
[Diagram: a coordinator assigns data partitions (split by key) to map computation partitions; the map outputs are redistributed by the output's key ("shuffle") to the reduce computation partitions; by default MapReduce uses the file system for this dataflow.]
More examples
- Distributed grep – all lines matching a pattern
  - Map: filter by pattern
  - Reduce: output set
- Count URL access frequency
  - Map: output each URL as key, with count 1
  - Reduce: sum the counts
- Reverse web-link graph
  - Map: output (target, source) pairs when a link to target is found in source
  - Reduce: concatenate values and emit (target, list(source))
- Inverted index
  - Map: emit (word, documentID)
  - Reduce: combine these into (word, list(documentID))
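As one hedged sketch of the last example, an inverted index in Hadoop Java might look like this; the class names are mine, and it assumes the input key is a document ID and the value is the document text (e.g., via KeyValueTextInputFormat).

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, documentID) for every word in the document body
public class InvertedIndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text docId, Text body, Context context)
            throws IOException, InterruptedException {
        for (String word : body.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), docId);
            }
        }
    }
}

// Reduce: combine the document IDs into (word, list(documentID))
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text word, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        StringBuilder list = new StringBuilder();
        for (Text id : docIds) {
            if (list.length() > 0) list.append(",");
            list.append(id.toString());
        }
        context.write(word, new Text(list.toString()));
    }
}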
What do we need to write a MR program?
- A mapper
  - Accepts (key,value) pairs from the input
  - Produces intermediate (key,value) pairs, which are then shuffled
- A reducer
  - Accepts intermediate (key,value) pairs
  - Produces final (key,value) pairs for the output
- A driver
  - Specifies which inputs to use, where to put the outputs
  - Chooses the mapper and the reducer to use
- Hadoop takes care of the rest!!
  - Default behaviors can be customized by the driver
The Mapper

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;

// Input format: (file offset, line); the intermediate format can be freely chosen
public class FooMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        context.write(new Text("foo"), value);
    }
}

- Extends the abstract 'Mapper' class
  - Input/output types are specified as type parameters
- Implements a 'map' function
  - Accepts a (key,value) pair of the specified type
  - Writes output pairs by calling the 'write' method on the context
  - Mixing up the types will cause problems at runtime (!)
The Reducer

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;

// Input: the intermediate format (same as the mapper output); output format: (IntWritable, Text)
// Note: we may get multiple values for the same key!
public class FooReducer extends Reducer<Text, Text, IntWritable, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws java.io.IOException, InterruptedException {
        for (Text value : values)
            context.write(new IntWritable(4711), value);
    }
}

- Extends the abstract 'Reducer' class
  - Must specify the types again (must be compatible with the mapper!)
- Implements a 'reduce' function
  - Values are passed in as an 'Iterable'
  - Caution: these are NOT normal Java classes. Do not store them in collections - their content can change between iterations!
The Driver

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FooDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        // Mapper & Reducer are in the same JAR as FooDriver
        job.setJarByClass(FooDriver.class);
        // Input and output paths
        FileInputFormat.addInputPath(job, new Path("in"));
        FileOutputFormat.setOutputPath(job, new Path("out"));
        job.setMapperClass(FooMapper.class);
        job.setReducerClass(FooReducer.class);
        // Format of the (key,value) pairs output by the reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

- Specifies how the job is to be executed
  - Input and output directories; mapper & reducer classes
Manual compilation
- Goal: produce a JAR file that contains the classes for mapper, reducer, and driver
  - This can be submitted to the Job Tracker, or run directly through Hadoop
- Step #1: Put hadoop-core-1.0.3.jar into the classpath:
    export CLASSPATH=$CLASSPATH:/path/to/hadoop/hadoop-core-1.0.3.jar
- Step #2: Compile mapper, reducer, driver:
    javac FooMapper.java FooReducer.java FooDriver.java
- Step #3: Package into a JAR file:
    jar cvf Foo.jar *.class
- Alternative: "Export..."/"Java JAR file" in Eclipse
Standalone mode installation
- What is standalone mode?
  - Installation on a single node
  - No daemons running (no Task Tracker, no Job Tracker)
  - Hadoop runs as an 'ordinary' Java program
  - Used for debugging
- How to install Hadoop in standalone mode?
  - See textbook Appendix A
  - Already done in your VM image
Running a job in standalone mode
- Step #1: Create & populate the input directory
  - Configured in the driver via addInputPath()
  - Put input file(s) into this directory (ok to have more than one)
  - The output directory must not exist yet
- Step #2: Run Hadoop
  - As simple as this: hadoop jar <jarName> <driverClassName>
  - Example: hadoop jar foo.jar upenn.nets212.FooDriver
  - In verbose mode, Hadoop will print statistics while running
- Step #3: Collect the output files
Recap: Writing simple jobs for Hadoop
- Write a mapper, reducer, driver
  - Custom serialization → must use special data types (Writable)
  - Explicitly declare all three (key, value) types
- Package into a JAR file
  - Must contain class files for mapper, reducer, driver
  - Create manually (javac/jar) or automatically (ant)
- Running in standalone mode
  - hadoop jar foo.jar FooDriver
  - Input and output directories are in the local file system
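On the Writable data types mentioned above: a custom value type only has to implement Hadoop's Writable interface (keys additionally need WritableComparable). A minimal, hypothetical example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A point with x and y coordinates, serialized in Hadoop's compact binary form.
public class PointWritable implements Writable {
    private int x;
    private int y;

    public PointWritable() { }                      // Hadoop needs a no-arg constructor

    public PointWritable(int x, int y) { this.x = x; this.y = y; }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize
        x = in.readInt();
        y = in.readInt();
    }
}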
Common mistakes to avoid
- Mapper and reducer should be stateless
  - Don't use static variables - after map and reduce return, they should remember nothing about the processed data!
  - Reason: no guarantees about which key-value pairs will be processed by which workers!

// Wrong!
HashMap h = new HashMap();
map(key, value) {
    if (h.contains(key)) {
        h.add(key, value);
        emit(key, "X");
    }
}

- Don't try to do your own I/O!
  - Don't try to read from, or write to, files in the file system
  - The MapReduce framework does all the I/O for you:
    - All the incoming data will be fed as arguments to map and reduce
    - Any data your functions produce should be output via emit

// Wrong!
map(key, value) {
    File foo = new File("xyz.txt");
    while (true) {
        s = foo.readLine();
        ...
    }
}
More common mistakes to avoid

// Wrong!
map(key, value) {
    emit("FOO", key + " " + value);
}

reduce(key, value[]) {
    /* do some computation on all the values */
}

- The mapper must not map too much data to the same key
  - In particular, don't map everything to the same key!!
  - Otherwise the reduce worker will be overwhelmed!
  - It's okay if some reduce workers have more work than others
    - Example: in WordCount, the reduce worker that works on the key 'and' has a lot more work than the reduce worker that works on 'syzygy'.
Designing MapReduce algorithms
- Key decision: what should be done by map, and what by reduce?
  - map can do something to each individual key-value pair, but it can't look at other key-value pairs
    - Example: filtering out key-value pairs we don't need (see the sketch after this list)
  - map can emit more than one intermediate key-value pair for each incoming key-value pair
    - Example: incoming data is text, map produces (word, 1) for each word
  - reduce can aggregate data; it can look at multiple values, as long as map has mapped them to the same (intermediate) key
    - Example: count the number of words, add up the total cost, ...
- Need to get the intermediate format right!
  - If reduce needs to look at several values together, map must emit them using the same key!
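A minimal sketch of the filtering pattern referenced in the list above; the pattern string and class name are hypothetical. map inspects each pair in isolation and emits only the ones that match, and no real reduce logic is needed (an identity reducer, or zero reduce tasks, suffices).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Keep only the lines that contain the pattern; everything else is dropped.
public class FilterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private static final String PATTERN = "ERROR";   // hypothetical filter criterion

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (line.toString().contains(PATTERN)) {
            context.write(offset, line);
        }
    }
}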
Some additional details
- To make this work, we need a few more parts...
- The file system (distributed across all nodes):
  - Stores the inputs, outputs, and temporary results
- The driver program (executes on one node):
  - Specifies where to find the inputs and the outputs
  - Specifies what mapper and reducer to use
  - Can customize the behavior of the execution
- The runtime system (controls the nodes):
  - Supervises the execution of tasks
  - Esp. the JobTracker
Some details
- Fewer computation partitions than data partitions
  - All data is accessible via a distributed file system with replication
  - Worker nodes produce data in key order (makes it easy to merge)
  - The master is responsible for scheduling, keeping all nodes busy
  - The master knows how many data partitions there are and which have completed – atomic commits to disk
- Locality: the master tries to do work on nodes that have replicas of the data
- The master can deal with stragglers (slow machines) by re-executing their tasks somewhere else
What if a worker crashes?
- We rely on the file system being shared across all the nodes
- Two types of (crash) faults:
  - Node wrote its output and then crashed
    - Here, the file system is likely to have a copy of the complete output
  - Node crashed before finishing its output
    - The JobTracker sees that the job isn't making progress, and restarts the job elsewhere on the system
    - (Of course, we have fewer nodes to do work...)
- But what if the master crashes?
Other challenges
- Locality
  - Try to schedule map tasks on machines that already have the data
- Task granularity
  - How many map tasks? How many reduce tasks?
- Dealing with stragglers
  - Schedule some backup tasks
- Saving bandwidth
  - E.g., with combiners
- Handling bad records
  - "Last gasp" packet with the current sequence number
Scale and MapReduce
- From a particular Google paper on a language built over MapReduce:
  - "... Sawzall has become one of the most widely used programming languages at Google. ... [O]n one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2x10^15 bytes of data (2.8 PB) and wrote 9.9x10^12 bytes (9.3 TB)."
MapReduce: Simplified Data Processing on Large Clusters
Appeared in OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
The following slides are from Jeff Dean and Sanjay Ghemawat, Google, Inc.
Motivation: Large Scale Data Processing
- Many tasks: process lots of data to produce other data
- Want to use hundreds or thousands of CPUs
  - ... but this needs to be easy
- MapReduce provides:
  - Automatic parallelization and distribution
  - Fault tolerance
  - I/O scheduling
  - Status and monitoring
Programming model
- Input & output: each a set of key/value pairs
- Programmer specifies two functions:
  - map (in_key, in_value) → list(out_key, intermediate_value)
    - Processes an input key/value pair
    - Produces a set of intermediate pairs
  - reduce (out_key, list(intermediate_value)) → list(out_value)
    - Combines all intermediate values for a particular key
    - Produces a set of merged output values (usually just one)
- Inspired by similar primitives in LISP and other languages
Example: Count word occurrences

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // output_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
Model is Widely Applicable
- MapReduce programs in the Google source tree
- Example uses:
  - distributed grep
  - distributed sort
  - web link-graph reversal
  - term-vector per host
  - web access log stats
  - inverted index construction
  - document clustering
  - machine learning
  - statistical machine translation
  - ...
Implementation Overview
- Typical cluster:
  - 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
  - Limited bisection bandwidth
  - Storage is on local IDE disks
  - GFS: a distributed file system manages the data (SOSP'03)
  - Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
- Implementation is a C++ library linked into user programs
Execution
Parallel Execution
Task Granularity And Pipelining
- Fine-granularity tasks: many more map tasks than machines
  - Minimizes time for fault recovery
  - Can pipeline shuffling with map execution
  - Better dynamic load balancing
- Often use 200,000 map / 5,000 reduce tasks with 2,000 machines
Fault tolerance: Handled via re-execution
- On worker failure:
  - Detect failure via periodic heartbeats
  - Re-execute completed and in-progress map tasks
  - Re-execute in-progress reduce tasks
  - Task completion committed through the master
- Master failure:
  - Could handle, but don't yet (master failure unlikely)
- Robust: lost 1600 of 1800 machines once, but finished fine
- Semantics in the presence of failures: see the paper
Refinement: Redundant Execution
- Slow workers significantly lengthen completion time
  - Other jobs consuming resources on the machine
  - Bad disks with soft errors transfer data very slowly
  - Weird things: processor caches disabled (!!)
- Solution: near the end of a phase, spawn backup copies of tasks
  - Whichever one finishes first "wins"
- Effect: dramatically shortens job completion time
Refinement: Locality Optimization
- Master scheduling policy:
  - Asks GFS for the locations of replicas of input file blocks
  - Map tasks are typically split into 64 MB (== GFS block size)
  - Map tasks are scheduled so the GFS input block replicas are on the same machine or the same rack
- Effect: thousands of machines read input at local disk speed
  - Without this, rack switches limit the read rate
Refinement: Skipping Bad Records
- Map/Reduce functions sometimes fail for particular inputs
  - Best solution is to debug & fix, but not always possible
- On a seg fault:
  - Send a UDP packet to the master from the signal handler
  - Include the sequence number of the record being processed
- If the master sees two failures for the same record:
  - The next worker is told to skip the record
- Effect: can work around bugs in third-party libraries
Other Refinements (see paper)
- Sorting guarantees within each reduce partition
- Compression of intermediate data
- Combiner: useful for saving network bandwidth
- Local execution for debugging/testing
- User-defined counters
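On the combiner: in Hadoop it is registered on the job, and for an aggregation like word count the reducer itself can serve as the combiner because summing is associative and commutative. A hedged driver sketch, reusing the hypothetical WordCount classes from the earlier sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(WordCountMapper.class);
        // The combiner runs on each mapper's local output before the shuffle,
        // so only partial sums cross the network.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("in"));
        FileOutputFormat.setOutputPath(job, new Path("out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}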
Performance
- Tests run on a cluster of 1800 machines:
  - 4 GB of memory
  - Dual-processor 2 GHz Xeons with Hyperthreading
  - Dual 160 GB IDE disks
  - Gigabit Ethernet per machine
  - Bisection bandwidth approximately 100 Gbps
- Two benchmarks:
  - MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
  - MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
MR_Grep
- Locality optimization helps:
  - 1800 machines read 1 TB of data at a peak of ~31 GB/s
  - Without this, rack switches would limit reads to 10 GB/s
- Startup overhead is significant for short jobs
MR_Sort
- Backup tasks reduce job completion time significantly
- The system deals well with failures
[Plot comparing three runs: normal, no backup tasks, and 200 processes killed.]
Experience: Rewrite of the Production Indexing System
- Rewrote Google's production indexing system using MapReduce
- Set of 10, 14, 17, 21, 24 MapReduce operations
- New code is simpler and easier to understand
- MapReduce takes care of failures and slow machines
- Easy to make indexing faster by adding more machines
Usage: MapReduce jobs run in August 2004
- Number of jobs: 29,423
- Average job completion time: 634 secs
- Machine days used: 79,186 days
- Input data read: 3,288 TB
- Intermediate data produced: 758 TB
- Output data written: 193 TB
- Average worker machines per job: 157
- Average worker deaths per job: 1.2
- Average map tasks per job: 3,351
- Average reduce tasks per job: 55
- Unique map implementations: 395
- Unique reduce implementations: 269
- Unique map/reduce combinations: 426
Related Work
- Programming model inspired by functional language primitives
- Partitioning/shuffling similar to many large-scale sorting systems
  - NOW-Sort ['97]
- Re-execution for fault tolerance
  - BAD-FS ['04] and TACC ['97]
- Locality optimization has parallels with Active Disks/Diamond work
  - Active Disks ['01], Diamond ['04]
- Backup tasks similar to Eager Scheduling in the Charlotte system
  - Charlotte ['96]
- Dynamic load balancing solves a similar problem as River's distributed queues
  - River ['99]
Conclusions
- MapReduce has proven to be a useful abstraction
- Greatly simplifies large-scale computations at Google
- Fun to use: focus on the problem, let the library deal with the messy details