University of Pennsylvania CIS 505 Software Systems
Linh Thi Xuan Phan, Department of Computer and Information Science, University of Pennsylvania
Lecture 21: Spark
April 17, 2017

Plan for today
• Iterative processing (NEXT)
  – Example: PageRank
  – Challenges with MapReduce
• Spark
  – Resilient Distributed Datasets
  – Implementation
  – Broadcast variables and accumulators
2

Why PageRank?
[Figure: a search box with the query "Team sports"]
• Suppose someone googles for "Team sports"
• Which pages should we return?
  – Idea: Let's return pages that contain both words frequently (why?)
  – Problem: Millions of pages contain these words!
  – Which ones should we display ('rank') first?
3

Why PageRank?
[Figure: pages such as the "A-Team" page, Mr. T's page, a "Hollywood Series to Recycle" page, and cheesy-TV-show pages linking to a "Team Sports" page, plus links from the Yahoo directory and Wikipedia]
• Idea: Hyperlinks encode a lot of human judgment!
  – What does it mean when a web page links to another page?
  – Intra-domain links: Often created primarily for navigation
  – Inter-domain links: Confer some measure of authority
• Can we 'boost' pages with many inbound links?
  – Theoretically yes – but should all pages 'count' the same?
4

PageRank: Intuition
[Figure: a link graph of pages A–J; shouldn't E's vote be worth more than F's? How many levels should we consider?]
• Imagine a contest for The Web's Best Page
  – Initially, each page has one vote
  – Each page votes for all the pages it links to
  – To ensure fairness, pages voting for more than one page must split their vote equally between them
  – Voting proceeds in rounds; in each round, each page has the number of votes it received in the previous round
  – In practice, it's a little more complicated – but not much!
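The voting scheme above can be sketched in plain Python. The three-page graph and the value a = 0.15 are assumptions for illustration; the damping step uses the same a/N + (1-a)·sum form that appears in the Spark PageRank example later in the lecture.

```python
# Sketch of iterative PageRank: initial guess 1/n, propagate votes,
# apply damping, repeat. Graph and damping value are hypothetical.

def pagerank(outlinks, a=0.15, iterations=50):
    n = len(outlinks)
    ranks = {page: 1.0 / n for page in outlinks}   # initial guess: 1/n
    for _ in range(iterations):
        contribs = {page: 0.0 for page in outlinks}
        for page, targets in outlinks.items():
            for t in targets:
                # split this page's vote equally among its outlinks
                contribs[t] += ranks[page] / len(targets)
        # damping, in the a/N + (1-a)*sum form used later in the lecture
        ranks = {page: a / n + (1 - a) * contribs[page] for page in outlinks}
    return ranks

# Three pages with four links between them (as in the simple example)
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(outlinks)
# ranks still sum to 1 after every round; C ends up ranked highest
```

Note that the total rank stays 1 in every round (each page redistributes exactly what it has, and the damping term mixes in a uniform distribution), which is what makes the rounds a fair "vote".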
5

Page, Brin, Motwani, Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Stanford University technical report, 1998

PageRank
• Each page i is given a rank x_i
• Goal: Assign the x_i such that the rank of each page is governed by the ranks of the pages linking to it:

    x_i = Σ_{j → i} x_j / N_j

  where the sum ranges over every page j that links to i, x_j is the rank of page j, and N_j is the number of links out from page j
6

Iterative PageRank
• How do we compute the rank values?
  – Problem: The formula from the previous slide only describes a goal, but not how to get there!
• Idea: Use an iterative algorithm!
  – At first, set all the ranks to an initial guess, e.g., x_i^(0) = 1/n, where n is the number of pages
  – Then, in several rounds, compute a new value based on the values in the previous round
  – Repeat until convergence
• Is it really that simple?
  – Almost... for various practical reasons, the algorithm usually includes a damping factor
7

PageRank: Simple example
[Figure: three pages with four links between them; all ranks start at 0.33 and converge after a few steps]
• Let's look at a simple example
  – Three pages with four links between them
• Remember the algorithm:
  – Initialize all the ranks to be equal
  – Propagate the weights along the edges
  – Add up incoming weights, apply damping factor
  – Repeat until convergence
8

Plan for today
• Iterative processing
  – Example: PageRank
  – Challenges with MapReduce (NEXT)
• Spark
  – Resilient Distributed Datasets
  – Implementation
  – Broadcast variables and accumulators
9

Challenges with MapReduce
[Figure: an iterative computation as a chain of MapReduce jobs – initialize state and convert input to input + state, run map/reduce, test for convergence, repeat; finally discard state and output results]
• PageRank requires multiple stages of MapReduce
  – Why is this a problem?
• Challenge #1: Performance
  – Simple control flow: read → map → reduce → write (no loop support)
  – Data is written to disk even if it is reused immediately!
  – What does this mean for performance?
• Challenge #2: Usability
  – MapReduce operates at the level of tuples; lots of boilerplate code
  – What would the code for a PageRank job look like?

MapReduce usability

  #include "mapreduce/mapreduce.h"

  // User's map function
  class SplitWords: public Mapper {
   public:
    virtual void Map(const MapInput& input) {
      const string& text = input.value();
      const int n = text.size();
      for (int i = 0; i < n; ) {
        // Skip past leading whitespace
        while (i < n && isspace(text[i]))
          i++;
        // Find word end
        int start = i;
        while (i < n && !isspace(text[i]))
          i++;
        if (start < i)
          Emit(text.substr(start, i-start), "1");
      }
    }
  };

  REGISTER_MAPPER(SplitWords);

  class Sum: public Reducer {
   public:
    virtual void Reduce(ReduceInput* input) {
      // Iterate over all entries with the
      // same key and add the values
      int64 value = 0;
      while (!input->done()) {
        value += StringToInt(input->value());
        input->NextValue();
      }
      // Emit sum for input->key()
      Emit(IntToString(value));
    }
  };

  REGISTER_REDUCER(Sum);

  int main(int argc, char** argv) {
    ParseCommandLineFlags(argc, argv);
    MapReduceSpecification spec;
    for (int i = 1; i < argc; i++) {
      MapReduceInput* in = spec.add_input();
      in->set_format("text");
      in->set_filepattern(argv[i]);
      in->set_mapper_class("SplitWords");
    }

    // Specify the output files
    MapReduceOutput* out = spec.output();
    out->set_filebase("/gfs/test/freq");
    out->set_num_tasks(100);
    out->set_format("text");
    out->set_reducer_class("Sum");

    // Do partial sums within map
    out->set_combiner_class("Sum");

    // Tuning parameters
    spec.set_machines(2000);
    spec.set_map_megabytes(100);
    spec.set_reduce_megabytes(100);

    // Now run it
    MapReduceResult result;
    if (!MapReduce(spec, &result)) abort();
    return 0;
  }

• Lots of code for a (relatively) simple algorithm!
  – Can we do better?
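For contrast, the same word count fits in a few lines of single-machine Python. This is not a fair comparison (no distribution, no fault tolerance), but it shows how little of the MapReduce program above is the algorithm itself; the input lines are hypothetical.

```python
# Plain single-machine word count, for comparison with the MapReduce
# version above. Splitting on whitespace mirrors SplitWords; the
# Counter plays the role of the Sum reducer.
from collections import Counter

def word_count(lines):
    counts = Counter()
    for line in lines:
        for word in line.split():   # skip whitespace, extract words
            counts[word] += 1       # the "1" emitted per word, summed
    return counts

counts = word_count(["to be or", "not to be"])
print(counts["to"])  # -> 2
```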
Challenges with MapReduce
• Challenge #3: Interactivity
  – Suppose you wanted to do some interactive 'data mining' on a data set: Ask a query, look at the result, ask another query, ...
  – Even if you had a much simpler query language to formulate these queries – would you want to use MapReduce for this?
  – Probably not! Latency is much too high for interactive use
    • Need to load the entire data set from GFS, etc.
• Let's look at another, more recent system that addresses these three challenges!
12

Plan for today
• Iterative processing
  – Example: PageRank
  – Challenges with MapReduce
• Spark (NEXT)
  – Resilient Distributed Datasets
  – Implementation
  – Broadcast variables and accumulators
13

What is Spark?
• Another 'big data' computing framework
  – Much more recent than MapReduce! (2004 vs. 2010)
  – Incorporates some of the 'lessons learned' from MapReduce
• Initially developed at UC Berkeley
  – Zaharia, Chowdhury, Franklin, Shenker, and Stoica, "Spark: Cluster Computing with Working Sets", HotCloud 2010
  – NSDI paper in 2012
• Now a top-level Apache project
  – http://spark.apache.org/
14

Spark vs. MapReduce
• How does Spark address the three challenges we discussed earlier?
• Performance: In-memory computing
  – Not limited by disk I/O (particularly for iterative computations!)
  – 10x-100x faster than MapReduce
• Usability: Embedded DSL in Scala
  – Type-safe; automatically optimized
  – 5-10x less code than MapReduce
• Interactivity: Shell support
  – Queries can be submitted interactively from a Scala shell
15

Resilient distributed datasets
• Key innovation: Resilient distributed dataset (RDD)
  – Think of a huge block of RAM that is partitioned across machines
• RDDs are basically read-only collections
  – Initial RDDs can be created by loading data from stable storage
  – Computations on existing RDDs result in new RDDs
  – User can write RDDs back to disk if and when they want to
• How does this help?
  – RDDs enable reuse!
  – Example: Intermediate values in PageRank
    • What would be the RDDs?
    • Which ones would (have to) be written to disk?
16

Transformations
  map(f: T → U)                  : RDD[T] → RDD[U]
  filter(f: T → Bool)            : RDD[T] → RDD[T]
  flatMap(f: T → Seq[U])         : RDD[T] → RDD[U]
  sample(fraction: Float)        : RDD[T] → RDD[T] (deterministic sampling)
  groupByKey()                   : RDD[(K,V)] → RDD[(K, Seq[V])]
  reduceByKey(f: (V,V) → V)      : RDD[(K,V)] → RDD[(K,V)]
  union()                        : (RDD[T], RDD[T]) → RDD[T]
  join()                         : (RDD[(K,V)], RDD[(K,W)]) → RDD[(K, (V,W))]
  cogroup()                      : (RDD[(K,V)], RDD[(K,W)]) → RDD[(K, (Seq[V], Seq[W]))]
  crossProduct()                 : (RDD[T], RDD[U]) → RDD[(T,U)]
  mapValues(f: V → W)            : RDD[(K,V)] → RDD[(K,W)] (preserves partitioning)
  sort(c: Comparator[K])         : RDD[(K,V)] → RDD[(K,V)]
  partitionBy(p: Partitioner[K]) : RDD[(K,V)] → RDD[(K,V)]
• Transformations can perform computation on RDDs
• Coarse-grained transformations!
  – Actual 'join', 'filter', ... operators, not just plain Java code
  – This makes it much easier to optimize queries!
17

Actions
  count()               : RDD[T] → Long
  collect()             : RDD[T] → Seq[T]
  reduce(f: (T,T) → T)  : RDD[T] → T
  lookup(k: K)          : RDD[(K,V)] → Seq[V] (on hash/range partitioned RDDs)
  save(path: String)    : Outputs RDD to a storage system, e.g., HDFS
• Transformations do not cause Spark to do work
  – Example: errors = lines.filter(_.startsWith("ERROR"))
  – Spark would simply remember how the new RDD ('errors') can be computed from the old one ('lines')
• Work is done only once the user applies an action
  – Example: errors.count()
  – Why is this a good idea?
    • Consider: errors.filter(_.contains("HDFS")).map(_.split("\t")(3)).collect()
18

Lineage
• Spark keeps track of how RDDs have been constructed
  – Result is a lineage graph
  – Vertexes represent RDDs, edges represent transformations
• What could this be useful for?
  – Efficiency: Not all RDDs have to be 'materialized' (i.e., kept in RAM as a full copy); enables query optimization
  – Fault tolerance: When a machine fails, the corresponding piece of the RDD can be recomputed efficiently
    • How would a multi-stage MapReduce program achieve this?
  – Caching: Since RDDs can be recomputed if need be, Spark can 'evict' them from memory and only keep a 'working set' of RDDs
    • However, the user can give Spark 'hints' as to which RDDs should be kept
19

Example: WordCount

  val file = spark.textFile("hdfs://...")
  val counts = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
  counts.save("out.txt")

• Let's look at some example code: WordCount
  – Data is loaded from HDFS into an initial RDD called 'file'
  – Three transformations are applied
    • This results in three more RDDs, of which only the last one is named ('counts')
    • Notice how this particular program invokes 'map' and 'reduce' to produce a similar data flow as MapReduce itself! (But this isn't "hard-coded")
    • Notice how each transformation accepts a function argument
  – A final action writes the result back to a file
• Much more compact than MapReduce code!
20

Example: PageRank

  // Load the graph as an RDD of (URL, outlinks) pairs
  val links = spark.textFile(...).map(...).persist()
  var ranks = ... // RDD of (URL, initialRank) pairs
  for (i <- 1 to ITERATIONS) {
    // Build an RDD of (targetURL, float) pairs with the
    // contributions sent by each page
    val contribs = links.join(ranks).flatMap {
      case (url, (links, rank)) =>
        links.map(dest => (dest, rank / links.size))
    }
    // Sum contributions by URL and get new ranks
    ranks = contribs.reduceByKey((x, y) => x + y)
                    .mapValues(sum => a/N + (1-a)*sum)
  }

• A few things to note:
  – Code is also very nice and compact!
  – Notice how 'links' is reused across iterations!
    • The 'persist' call tells Spark not to evict it from the 'cache'
  – What would the provenance look like?
21

PageRank performance
[Figure: iteration time in seconds vs. number of machines (30 and 60); Hadoop: 171 s and 80 s, Spark: 23 s and 14 s]
• How does this affect performance?
  – Experiment from the paper: PageRank with a 54 GB Wikipedia dump (about 4 million articles)
  – Spark is several times faster than Hadoop (the open-source implementation of MapReduce)
22

Plan for today
• Iterative processing
  – Example: PageRank
  – Challenges with MapReduce
• Spark
  – Resilient Distributed Datasets
  – Implementation (NEXT)
  – Broadcast variables and accumulators
23

Spark: Implementation
• Single-master architecture, similar to MapReduce
  – Developer writes 'driver' program that connects to cluster of workers
  – Driver defines RDDs, invokes actions, tracks lineage
  – Workers are long-lived processes that store pieces of RDDs in memory and perform computations on them
  – Many of the details will sound familiar: Scheduling, fault detection and recovery, handling stragglers, etc.
24

Spark execution process
[Figure: RDD objects (e.g., rdd1.join(rdd2).groupBy(...).filter(...)) build an operator DAG; the DAG scheduler splits the graph into stages of tasks and submits each stage as ready; the task scheduler launches tasks via the cluster manager and retries failed or straggling tasks; worker threads execute tasks, and a block manager stores and serves data blocks]
26

Job scheduler in Spark
[Figure: an RDD dependency graph (RDDs A–G) split into stages – Stage 1 ends in a groupBy, Stage 2 pipelines map and union into a join, Stage 3 follows the join; cached partitions are marked]
• Captures RDD dependency graph
• Pipelines functions into "stages"
• Cache-aware for data reuse & locality
• Partitioning-aware to avoid shuffles
27

Interactivity
[Figure: each line typed by the user – e.g., Line 1: var query = "hello"; Line 2: rdd.filter(_.contains(query)).count() – is compiled into one class; the closure includes a direct pointer to the earlier line's object in the resulting object graph]
• Spark also offers an interactive (Scala) shell
  – Similar to Ruby, Python, ...
• How does this work under the hood?
  – The Scala interpreter compiles a new class for each new line of code as it is typed on the master
  – These classes are shipped to the workers over HTTP and loaded into the JVM
28

Example: Log Mining
• How do RDDs help with interactivity?
• Let's consider another example: Log mining
  – Suppose the user wants to load error messages from a log into memory, then interactively search for various patterns

  lines = spark.textFile("hdfs://...")            // base RDD
  errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
  messages = errors.map(_.split('\t')(2))
  messages.cache()

  messages.filter(_.contains("foo")).count        // action
  messages.filter(_.contains("bar")).count
  . . .

[Figure: the driver sends tasks to workers; each worker processes one block of the log (Blocks 1-3) and caches its results (Caches 1-3)]

  Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
  Result: search of 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Plan for today
• Iterative processing
  – Example: PageRank
  – Challenges with MapReduce
• Spark
  – Resilient Distributed Datasets
  – Implementation
  – Broadcast variables and accumulators (NEXT)
29

Why broadcast variables?

  // Load RDD of (URL, visit) pairs
  val visits = sc.textFile("visits.txt").map(...)
  // Load RDD of (URL, name) pairs
  val pageNames = sc.textFile("pages.txt").map(...)
  val joined = visits.join(pageNames)

[Figure: visits.txt and pages.txt are read by map tasks and shuffled to reduce tasks partitioned by key range (A-H, I-O, P-T, U-Z)]

• Suppose we wanted to run the above computation
  – Given a list of web page names and a list of URLs visited, find a list of page names that have been visited
• How would the join transformation be processed?
  – Spark would shuffle both pageNames and visits over the network!
  – But what if the list of pageNames is relatively small?
30

Alternative if one table is small

  val pageNames = sc.textFile("pages.txt").map(...)
  val pageMap = pageNames.collect().toMap()
  val visits = sc.textFile("visits.txt").map(...)
  val joined = visits.map(v => (v._1, (pageMap(v._1), v._2)))

[Figure: the master reads pages.txt and ships the table along with every task that processes a block of visits.txt]

• Idea: We can send the page names with each task!
  – That way, the join can run locally on each block of visits.txt!
• Problem: Many more map partitions than nodes!
  – Sending the table along with each task is wasteful!
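The map-side join idea above can be sketched in plain Python: the small table becomes an ordinary in-memory map, and each task joins its block of visits against it locally, with no shuffle. The data below is hypothetical.

```python
# Sketch of a map-side join: the small pages table is held as a local
# dict, so only the large visits table needs to be scanned.
# All data here is made up for illustration.

page_names = {"u1": "Home", "u2": "About"}                  # small table
visits = [("u1", "alice"), ("u2", "bob"), ("u1", "carol")]  # large table

# Each "task" would receive page_names and join its block of visits locally:
joined = [(url, (page_names[url], visitor)) for url, visitor in visits]
print(joined[0])  # -> ('u1', ('Home', 'alice'))
```

Broadcast variables make exactly this pattern efficient at scale: instead of shipping the dict with every task, it is sent once per node.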
31

Broadcast variables

  val pageNames = sc.textFile("pages.txt").map(...)
  val pageMap = pageNames.collect().toMap()
  val bc = sc.broadcast(pageMap)
  val visits = sc.textFile("visits.txt").map(...)
  val joined = visits.map(v => (v._1, (bc.value(v._1), v._2)))

[Figure: the master broadcasts the table from pages.txt once to each node, where it is shared by all the tasks processing visits.txt]

• Idea: The master can send a copy to each node
  – This requires far fewer transmissions!
  – Tables can be shared across partitions on each node
• Uses broadcast variables
  – Explicit in the code; tells Spark how to distribute the data
32

Accumulators

  val badRecords = sc.accumulator(0)
  val badBytes = sc.accumulator(0.0)

  records.filter(r => {
    if (isBad(r)) {
      badRecords += 1
      badBytes += r.size
      false
    } else {
      true
    }
  }).save(...)

  printf("Total bad records: %d, avg size: %f\n",
         badRecords.value, badBytes.value / badRecords.value)

• Another common sharing pattern is aggregation
  – Suppose we wanted to count the 'bad' records in a data set
  – We could use a reduce operation, but that seems a little heavyweight for this simple task
• Idea: Use accumulators
  – Support an associative 'add' operation and a 'zero' value
  – Workers can add, but only the master can read
  – Because of associativity, Spark can locally aggregate on each node
33

What is Spark (not) good for?
• Good for:
  – Batch applications that apply the same operation to all elements of a data set
    • Example: Many machine-learning algorithms
    • RDDs can efficiently 'remember' each transformation as one step in the lineage
  – Computations with loops
    • Example: PageRank (and other iterative algorithms)
    • Can reuse intermediate results; no need to materialize to disk
  – Global aggregations
    • Example: Count the number of lines in a set of logs that start with some string
• Not so good for:
  – Applications that make asynchronous, fine-grained updates to state
    • Examples: Web application, incremental web crawler
34
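The accumulator pattern from the slides can be sketched in plain Python: each worker adds to its own local accumulator, and only the "master" merges and reads the totals. Associativity of the add operation is what makes the per-worker partial sums safe to merge in any order. The data and helper names below are hypothetical.

```python
# Sketch of the accumulator pattern (illustration only, not Spark's
# implementation): workers may only add; the master merges and reads.

class Accumulator:
    def __init__(self, zero):
        self.value = zero            # the 'zero' value

    def add(self, amount):           # the associative 'add' operation
        self.value = self.value + amount

def worker(records, is_bad):
    local = Accumulator(0)           # per-worker partial count
    kept = []
    for r in records:
        if is_bad(r):
            local.add(1)             # worker-side add
        else:
            kept.append(r)
    return kept, local

partitions = [["ok", "BAD", "ok"], ["BAD", "BAD"]]   # hypothetical data
results = [worker(p, lambda r: r == "BAD") for p in partitions]
bad_records = sum(local.value for _, local in results)  # master reads
print(bad_records)  # -> 3
```

Because each worker aggregates locally and only the small partial counts travel to the master, the pattern avoids the full shuffle a reduce would cost.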