Cloud Computing
MapReduce (2)
Keke Chen
Outline
 Hadoop streaming example
 Hadoop Java API
 Framework
 Important APIs
 Mini-project
A nice book
 Hadoop: The Definitive Guide
 You can read it online from the campus network
- OhioLINK ebook center, Safari Books Online
Using hadoop mapreduce
 Programming with the Java library
 http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
 Helps you understand the controls at a detailed level
 More convenient: hadoop-streaming
 Allows you to plug in map/reduce functions
 Map/reduce functions can be scripts/executables/Unix commands
Hadoop streaming
 Simple and powerful interface for programming
 Application developers do not need to learn the Hadoop Java APIs
 Good for simple, ad hoc tasks
Note:
 Map/Reduce uses the local Linux file system for processing and hosting temporary data
 HDFS is used to host application data
[Diagram: each node keeps temporary data on its local file system, while application data lives in HDFS]
Hadoop streaming
 http://hadoop.apache.org/common/docs/current/streaming.html
 /usr/local/hadoop/bin/hadoop jar \
/usr/local/hadoop/hadoop-streaming-1.0.3.jar \
-input myInputDirs -output myOutputDir \
-mapper myMapper -reducer myReducer
 The reducer can be empty: -reducer None
 myMapper and myReducer can be any executables
 The mapper/reducer reads from stdin and writes to stdout
 Files in myInputDirs are fed into the mapper as stdin
 The mapper's output becomes the input of the reducer
Packaging files with job submission
 /usr/local/hadoop/bin/hadoop jar \
/usr/local/hadoop/hadoop-streaming-1.0.3.jar \
-input "/user/hadoop/inputdata" \
-output "/user/hadoop/outputdata" \
-mapper "python myPythonScript.py myDictionary.txt" \
-reducer "/bin/wc" \
-file myPythonScript.py \
-file myDictionary.txt
(myDictionary.txt is an input parameter for the script)
 -file is good for small files
Using hadoop library classes
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Large files and archives
 Upload large files to HDFS first
 Use the -files option in streaming, which downloads the files to the local working directory
 -files hdfs://host:fs_port/user/testfile.txt#testlink
 -archives hdfs://host:fs_port/user/testfile.jar#testlink
 If cache1.txt and cache2.txt are in testfile.jar,
they then appear locally as testlink/cache1.txt and testlink/cache2.txt
Some key files
 Let HADOOP_HOME be your Hadoop installation directory
 The Hadoop streaming jar is at $HADOOP_HOME/share/hadoop/tools/lib
 The Hadoop mapreduce examples are at $HADOOP_HOME/share/hadoop/mapreduce
Wordcount
 Problem: counting the frequencies of words in a large document collection
 Implement the mapper and the reducer separately, using Python
 Some good Python tutorials at http://wiki.python.org/
Mapper.py
import sys

# read lines from stdin and emit "word<TAB>1" for every word
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t1' % word)
Reducer.py
import sys

# sum up the counts for each word coming from the mappers
word2count = {}
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # ignore lines whose count is not a number
        pass

for word in word2count:
    print('%s\t%s' % (word, word2count[word]))
Running wordcount
hadoop jar hadoop-streaming.jar \
-mapper "python mapper.py" \
-reducer "python reducer.py" \
-input text -output output2 \
-file /localpath/mapper.py -file /localpath/reducer.py
Running wordcount
hadoop jar hadoop-streaming.jar \
-mapper "python mapper.py" \
-reducer "python reducer.py" \
-input text -output output2 \
-file mapper.py -file reducer.py \
-jobconf mapred.reduce.tasks=2 \
-jobconf mapred.map.tasks=4
 If the mapper/reducer takes files as parameters:
hadoop jar hadoop-streaming.jar \
-mapper "python mapper.py" \
-reducer "python reducer.py myfile" \
-input text -output output2 \
-file /localpath/mapper.py \
-file /localpath/reducer.py \
-file /localpath/myfile
Web interfaces
 Tracking MapReduce jobs: http://localhost:50030/ - web UI for the job tracker
 http://localhost:50060/ - web UI for task tracker(s)
 http://localhost:50070/ - web UI for HDFS name node(s)
 From a terminal: lynx http://localhost:50030/
Hadoop Java APIs
 hadoop.apache.org/common/docs/current/api/
 Benefits
 Java code is more efficient than streaming
 More parameters for control and tuning
 Better for iterative MR programs
Important base classes
 Mapper<keyIn, valueIn, keyOut, valueOut>
 Function map(Object, Writable, Context)
 Reducer<keyIn, valueIn, keyOut, valueOut>
 Function reduce(WritableComparable, Iterable, Context)
 Combiner
 Partitioner
The framework
public class Wordcount {
  public static class MapClass extends
      Mapper<Object, Text, Text, LongWritable> {
    public void setup(Mapper.Context context) {…}
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {…}
  }
  public static class ReduceClass extends
      Reducer<Text, LongWritable, Text, LongWritable> {
    public void setup(Reducer.Context context) {…}
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {…}
  }
  public static void main(String[] args) throws Exception {…}
}
The wordcount example in Java
 https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
 Old/new framework
 The old framework is for versions prior to 0.20
Compile the source code
 Copy WordCount.java from the source code directory to your own directory, say "mr_examples"
 Go to "mr_examples" and compile it:
javac -cp `yarn classpath` -d wc WordCount.java
jar -cvf wordcount.jar -C wc .
Mapper of wordcount
public static class WCMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);   // emit (word, 1)
    }
  }
}
WordCount Reducer
public static class WCReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context
                     ) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();           // add up all counts for this word
    }
    result.set(sum);
    context.write(key, result);   // emit (word, total count)
  }
}
Function parameters
 Define the map/reduce parameters according to your application
 You have to use the writable classes in org.apache.hadoop.io
 E.g. Text, LongWritable, IntWritable, etc.
 The template parameters and the function parameters should match
 Map's output and reduce's input parameters should match (see the sketch below).
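For example, when the map output types differ from the job's final output types, they have to be declared explicitly in the driver. A minimal sketch, using the WCMapper/WCReducer types from the earlier slides:

job.setMapOutputKeyClass(Text.class);          // key type emitted by the mapper
job.setMapOutputValueClass(IntWritable.class); // value type emitted by the mapper
job.setOutputKeyClass(Text.class);             // key type written by the reducer
job.setOutputValueClass(IntWritable.class);    // value type written by the reducer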
Configuring map/reduce
 Passing global parameter settings to each map/reduce process
 In the main function, set the parameters in a Configuration object
Configuration conf = new Configuration();
conf.set("configfile", "your_config_file_at_HDFS"); // set before creating the Job so the tasks can see it
Job job = new Job(conf, "cloudvista");
job.setJarByClass(Wordcount.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(WCMapper.class);
//job.setCombinerClass(WCReducer.class);
job.setReducerClass(WCReducer.class);
//job.setPartitionerClass(WCPartitioner.class);
job.setNumReduceTasks(num_reduce);
FileInputFormat.setInputPaths(job, input);
FileOutputFormat.setOutputPath(job, new Path(output_path));
System.exit(job.waitForCompletion(true) ? 0 : 1);
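Side note: in Hadoop 2.x the Job constructor used above is deprecated; the usual replacement is the factory method, e.g.

Job job = Job.getInstance(conf, "cloudvista");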
How to run your app
1. Compile your program into a jar file
2. Run it from the command line:
hadoop jar your_jar your_parameters
 Normally you need to pass in (see the sketch below):
 Number of reducers
 Input files
 Output directory
 Any other application-specific parameters
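A minimal sketch of how the driver's main() might read these parameters; the argument order is only an illustrative assumption, reusing the variable names from the configuring slide:

public static void main(String[] args) throws Exception {
  // assumed order: <number of reducers> <input> <output>
  int num_reduce = Integer.parseInt(args[0]);
  String input = args[1];
  String output_path = args[2];
  // ... then build the Configuration and Job as shown on the configuring slide ...
}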
Access Files in HDFS?
Example: in the mapper's setup function
public void setup(Mapper.Context context) {
  Configuration conf = context.getConfiguration();
  String filename = conf.get("configfile");  // the "configfile" parameter set in main()
  Path p = new Path(filename);               // Path is used for opening the file
  FileSystem fs = FileSystem.get(conf);      // determines local or HDFS
  FSDataInputStream file = fs.open(p);
  while (file.available() > 0) {
    …
  }
  file.close();
}
Combiner
 Apply the reduce function to the intermediate results locally, right after the map generates them
[Diagram: each map task's output pairs (key1, value1), (key2, value2), …, (keyN, valueN) are combined locally - a "local reduce" on the map side - before being sent to the reduces]
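Enabling a combiner is just a matter of registering a reducer-style class in the driver. A minimal sketch, reusing WCReducer from the earlier slides (this works here because its input and output types are identical):

job.setCombinerClass(WCReducer.class);  // run a local reduce over each map task's output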
Partitioner
 If the map output contains N distinct keys (N > R, where R is the number of reduces)
 By default, the N keys are distributed to the R reduces by hash partitioning
 You can use a partitioner to define how the keys are distributed to the reduces.
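A minimal sketch of a custom partitioner for the word count types used above; the class name WCPartitioner matches the commented-out line on the configuring slide, and the routing rule is purely illustrative:

public static class WCPartitioner
    extends org.apache.hadoop.mapreduce.Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // route by the first character of the word, so all words with the same
    // initial letter end up at the same reduce (illustrative choice only)
    String w = key.toString();
    int first = w.isEmpty() ? 0 : w.charAt(0);
    return (first & Integer.MAX_VALUE) % numReduceTasks;
  }
}

It is registered in the driver with job.setPartitionerClass(WCPartitioner.class).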
More basic examples
 Download the source from the Hadoop website
 Unzip it and find the example source code in the directory:
hadoop-mapreduce-project/hadoop-mapreduce-examples