MapReduce in Action
Data Mining Research Group
Data Mining Group @ Xiamen University
College of Information Science and Technology
Team 306, led by Chen Lin

Contents
1. Basic MapReduce Programs
2. Advanced MapReduce
3. Beyond the Horizon
4. Discussion

Basic MapReduce Programs
[Diagram labels: Job Configuration, Master, JobTracker, Job]

Basic MapReduce Programs
What goes into a job configuration?
• Java classes that implement the interfaces
• Environment configuration

Interface
• Mapper
• Reducer
• Partitioner
• Combiner
• InputFormat
• OutputFormat

• How many map/reduce tasks?
• Input path
• Output path
• Configure the JVM: mapred.child.java.opts
• {mapred.local.dir}

Basic MapReduce Program
[Data-flow diagram labels: Text, InputSplit, InputFormat, <K1,V2>, Map, List<K1,V1>, K1,List<V1>, Reduce, OutputFormat]

Basic MapReduce
[Figure]

PARTITIONERS AND COMBINERS
• Combiners: an optimization in MapReduce that allows local aggregation before the shuffle and sort phase.
• Partitioner: determines which reducer is responsible for processing a particular key. The execution framework uses this information to copy the data to the right location during the shuffle and sort phase.

Basic MapReduce Program
InputFormat
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• NLineInputFormat
• Creating a custom InputFormat

InputFormat
• TextInputFormat - Each line in the text files is a record. The key is the byte offset of the line, and the value is the content of the line.
• KeyValueTextInputFormat - Each line in the text files is a record. The first separator character divides each line. Everything before the separator is the key, and everything after it is the value. The separator is set by the key.value.separator.in.input.line property, and the default is the tab (\t) character.
• NLineInputFormat - Same as TextInputFormat, but each split is guaranteed to have exactly N lines. The mapred.line.input.format.linespermap property, which defaults to one, sets N.
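The record shapes described on the InputFormat slide can be sketched in a few lines of plain Python. This is only an illustration and does not use Hadoop's API: the function names `text_records`, `key_value_records`, and `n_line_splits` and the sample data are hypothetical, chosen to mirror the three formats above.

```python
from typing import Iterator, List, Tuple


def text_records(data: bytes) -> Iterator[Tuple[int, str]]:
    """TextInputFormat-style records: (byte offset of the line, line content)."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The value is the line without its trailing newline.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_records(data: bytes, sep: str = "\t") -> Iterator[Tuple[str, str]]:
    """KeyValueTextInputFormat-style records: split each line at the FIRST separator."""
    for _offset, line in text_records(data):
        key, _sep, value = line.partition(sep)
        # A line without a separator yields the whole line as key and an empty value.
        yield key, value


def n_line_splits(data: bytes, n: int = 1) -> List[List[Tuple[int, str]]]:
    """NLineInputFormat-style splits: the same records, grouped n lines per split."""
    records = list(text_records(data))
    return [records[i:i + n] for i in range(0, len(records), n)]


# Hypothetical sample input: three tab-separated lines.
sample = b"apple\t3\nbanana\t5\ncherry\t7\n"
print(list(text_records(sample)))
print(list(key_value_records(sample)))
print(n_line_splits(sample, n=2))
```

In this sketch the final split holds fewer than n lines when the line count is not a multiple of n, which is worth keeping in mind when sizing map tasks.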

Basic MapReduce Program
Types for the key/value pairs
[Figure]

Summary for Basic Programs
What is a complete MapReduce job?
• Code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters
• The execution framework handles everything else

Advanced MapReduce
• Chaining MapReduce jobs
• Local aggregation
• Secondary sorting
• Working with Hadoop files

Chaining MapReduce jobs
So far you have been doing data processing tasks that a single MapReduce job can accomplish. But as you get more comfortable writing MapReduce programs and take on more ambitious data processing tasks, you will find that many complex tasks need to be broken down into simpler subtasks, each accomplished by an individual MapReduce job.

LOCAL AGGREGATION
• In Hadoop, intermediate results are written to local disk before being sent over the network.
• Reductions in the amount of intermediate data translate into increases in algorithmic efficiency.
• Using a combiner can substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.

Pseudo-code for computing the mean of values associated with the same string.
[Figure]

LOCAL AGGREGATION, is it right?
[Figure]

LOCAL AGGREGATION
1. Combiners must have the same input and output key-value types.
2. Combiners are optimizations that cannot change the correctness of the algorithm. Hadoop makes no guarantees on how many times combiners are called; it could be zero, one, or multiple times.

LOCAL AGGREGATION, right usage!
[Figure]

SECONDARY SORTING
Sometimes we also need to sort by value:
(k1; m1; v8)
(k1; m2; v1)
(k1; m3; v7)
...
(k2; m1; v2)
(k2; m2; v6)
(k2; m3; v9)
Move part of the value into the key: k1 → (m1; v8) becomes (k1; m1) → (v8)

Beyond the horizon
It is a shame: the remaining topics I will talk about play an important role in MapReduce, but they are beyond my horizon.
So I need all your help, so that we can master them together.

Beyond the horizon
[Figure]

Beyond the horizon
[Figure]

Joining data from different sources

Source 1:
Joey Leung,555-555-55
Edward,123-456-7890
Jose Madriz,281-330-8004
David Stork,408-555-0000
...

Source 2:
A,12.95,02-Jun-2008
B,88.25,20-May-2008
C,32.00,30-Nov-2007
D,25.02,22-Jan-2009

Joined result:
Joey Leung,555-555-5555,B,88.25,20-May-2008
Edward,123-456-7890,C,32.00,30-Nov-2007
Jose Madriz,281-330-8004,A,12.95,02-Jun-2008
Jose Madriz,281-330-8004,D,25.02,22-Jan-2009

Thank you!