Spark: Resilient Distributed Datasets for In-Memory Cluster Computing
Brad Karp, UCL Computer Science (with slides contributed by Matei Zaharia)
CS M038 / GZ06, 9th March 2016

Motivation
MapReduce greatly simplified “big data” analysis on large, unreliable clusters
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
» More interactive ad-hoc queries
Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)

Motivation
Complex apps and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing
In MapReduce, the only way to share data across jobs is stable storage → slow!

Examples
Iterative job: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → …
Interactive queries: Input → HDFS read → query 1 → result 1; query 2 → result 2; query 3 → result 3; …
Slow because of replication and disk I/O, but necessary for fault tolerance

Goal: In-Memory Data Sharing
Iterative job: Input → iter. 1 → iter. 2 → … (intermediate data kept in RAM)
Interactive queries: Input → one-time processing → query 1, query 2, query 3, … (working set kept in RAM)
10-100× faster than network/disk, but how to get fault tolerance?

Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?
Challenge
Existing storage abstractions have interfaces based on fine-grained updates to mutable state
» RAMCloud, databases, distributed shared memory, Piccolo
Requires replicating data or logs across nodes for fault tolerance
» Costly for data-intensive apps
» 10-100× slower than memory write

Solution: Resilient Distributed Datasets (RDDs)
Restricted form of distributed shared memory
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, …)
Efficient fault recovery using lineage
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails

RDD Recovery
[Diagram: the in-memory sharing picture above; when a partition is lost, it is recomputed from the input via the logged transformations rather than restored from a replica.]
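The lineage idea can be sketched in plain Scala collections, with no Spark required. The names here (`Source`, `Mapped`, `Filtered`, `compute`) are illustrative, not Spark's API; the point is that an immutable dataset need only record the coarse-grained operation and the parent that built it, so a lost result can be recomputed rather than replicated:

```scala
// Minimal lineage sketch: an immutable dataset remembers how it was built,
// so a lost (e.g. evicted-from-cache) result is rebuilt by replaying the
// recorded transformations from its parent.
sealed trait Dataset[A] {
  def compute(): Seq[A]                               // recompute from lineage
  def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
}
case class Source[A](data: Seq[A]) extends Dataset[A] {
  def compute(): Seq[A] = data
}
case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def compute(): Seq[B] = parent.compute().map(f)
}
case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
  def compute(): Seq[A] = parent.compute().filter(p)
}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val lines  = Source(Seq("ERROR\tfs\tdisk full", "INFO\tok", "ERROR\tnet\ttimeout"))
    val errors = lines.filter(_.startsWith("ERROR"))
    val codes  = errors.map(_.split('\t')(1))
    // Even if a cached copy of `codes` is lost, compute() replays the lineage:
    println(codes.compute().mkString(","))            // prints fs,net
  }
}
```

Note that nothing is logged per record: only one operation per transformation is recorded, however many elements it applies to, which is what makes lineage so much cheaper than replicating the data itself.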
Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms
» These naturally apply the same operation to many items
Unify many current programming models
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, …
Support new apps that these models don’t

Tradeoff Space
[Diagram: granularity of updates (fine → coarse) vs. write throughput (low → high). Fine-grained, high-write-throughput systems (K-V stores, databases, RAMCloud) are limited by network bandwidth and are best for transactional workloads; coarse-grained systems (HDFS and RDDs) can run at memory bandwidth and are best for batch workloads.]

Outline
Spark programming interface
Implementation
Demo
How people are using Spark

Spark Programming Interface
DryadLINQ-like API in the Scala language
Usable interactively from the Scala interpreter
Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs), actions (compute and output results)
» Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile(“hdfs://...”)              // base RDD
errors = lines.filter(_.startsWith(“ERROR”))      // transformed RDD
messages = errors.map(_.split(‘\t’)(2))
messages.persist()
messages.filter(_.contains(“foo”)).count          // action
messages.filter(_.contains(“bar”)).count

[Diagram: the master ships tasks to workers holding HDFS blocks 1-3; each worker builds and caches its partition of messages (Msgs. 1-3) and returns results; subsequent queries reuse the cached partitions instead of rereading the blocks.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)

Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
E.g.:
messages = textFile(...).filter(_.contains(“error”))
                        .map(_.split(‘\t’)(2))
Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))

Fault Recovery Results
[Plot: k-means iteration time (s) across 10 iterations; the first iteration takes 119 s, later ones 56-59 s. A failure raises the affected iteration to 81 s, after which times return to 57-59 s.]

Fault Tolerance vs. Performance
With RDDs, the programmer controls the tradeoff between fault tolerance and performance:
• persist frequently (e.g., with REPLICATE): fast recovery, but slower execution
• persist infrequently: fast execution, but slow recovery

Why Wide vs. Narrow Dependencies?
Each RDD includes metadata:
• lineage (graph of RDDs and operations)
• wide dependency: requires a shuffle (and possibly the entire preceding RDD(s) in the lineage)
• narrow dependency: no shuffle (needs only isolated partition(s) of the preceding RDD(s) in the lineage)

Example: PageRank
1. Start each page with a rank of 1
2.
On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}

Optimizing Placement
[Diagram: Links (url, neighbors) and Ranks0 (url, rank) feed a join producing Contribs0, which is reduced to Ranks1; Ranks1 is joined with Links again to produce Contribs1, reduced to Ranks2; and so on.]
links and ranks are joined repeatedly
Can co-partition them (e.g. hash both on URL) to avoid shuffles
Can also use app knowledge, e.g., hash on DNS name

links = links.partitionBy(new URLPartitioner())

PageRank Performance
[Bar chart, time per iteration: Hadoop 171 s; basic Spark 72 s; Spark with controlled partitioning 23 s.]

Memory Exhaustion
Least Recently Used (LRU) eviction of partitions
» First evict non-persisted partitions; they vanish, as they are not stored on disk
» Then evict persisted ones

Behavior with Insufficient RAM
[Plot: logistic regression iteration time (s) vs. percent of working set in memory: 0% → 68.8, 25% → 58.1, 50% → 40.7, 75% → 29.7, 100% → 11.5.]

Scalability
[Plots: iteration time (s) vs. number of machines (25, 50, 100) for Hadoop, HadoopBinMem, and Spark, on logistic regression and k-means; Spark is fastest at every scale (e.g. logistic regression on 100 machines: Hadoop 76 s, HadoopBinMem 62 s, Spark 3 s).]

Contributors to Speedup
[Bar chart, iteration time (s) for text / binary input: in-memory HDFS 15.4 / 8.4; in-memory local file 13.1 / 6.9; Spark RDD 2.9 / 2.9.]

Implementation
Runs on Mesos [NSDI ’11] to share the cluster with Hadoop
Can read from any Hadoop input source (HDFS, S3, …)
[Diagram: Spark, Hadoop, and MPI frameworks running side by side on Mesos across the cluster’s nodes.]
No changes to Scala language or compiler
» Reflection + bytecode analysis to correctly ship code
www.spark-project.org

Programming Models Implemented on Spark
RDDs can express many existing parallel models
» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark)
All are based on coarse-grained operations
Enables apps to efficiently intermix these models

RDDs: Summary
RDDs offer a simple and efficient programming model for a broad range of applications
• Avoid disk I/O overhead for intermediate results of multi-phase computations
• Leverage lineage: cheaper than checkpointing when there are no failures, low cost when failures occur
• Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery
The scheduler is significantly more complicated than MapReduce’s; we did not discuss it!
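The wide-vs-narrow distinction from the dependencies slide can be illustrated in plain Scala collections, modelling partitioned data as a `Seq` of `Seq`s. Everything here (the `Partitioned` alias, both function names, the hash-partitioning rule) is an illustrative sketch, not Spark code; it shows why recovering one lost partition is cheap under a narrow dependency and expensive under a wide one:

```scala
// Illustrative (non-Spark) sketch of recovery cost under narrow vs. wide
// dependencies. "Partitioned" data is simply a Seq of partitions.
object DependencyDemo {
  type Partitioned[A] = Seq[Seq[A]]

  // Narrow dependency (e.g. map): output partition i depends only on input
  // partition i, so one lost partition is rebuilt from a single parent partition.
  def recomputeMapped[A, B](parent: Partitioned[A], f: A => B, lost: Int): Seq[B] =
    parent(lost).map(f)

  // Wide dependency (e.g. groupByKey): an output partition may draw records
  // from EVERY parent partition, so rebuilding it must scan all of them.
  def recomputeGrouped[K, V](parent: Partitioned[(K, V)],
                             numParts: Int, lost: Int): Map[K, Seq[V]] =
    parent.flatten                                    // touches all parent partitions
      .filter { case (k, _) => k.hashCode.abs % numParts == lost }
      .groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2) }

  def main(args: Array[String]): Unit = {
    val nums: Partitioned[Int] = Seq(Seq(1, 2), Seq(3, 4))
    println(recomputeMapped(nums, (x: Int) => x * 10, lost = 1)) // prints List(30, 40)
    val pairs: Partitioned[(String, Int)] = Seq(Seq("a" -> 1), Seq("a" -> 2, "b" -> 3))
    println(recomputeGrouped(pairs, numParts = 2, lost = 0))
  }
}
```

This is also why co-partitioning `links` and `ranks` in the PageRank example pays off: with both hashed the same way, the join becomes a narrow dependency and no shuffle is needed.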
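The PageRank update rule from the slides can also be checked on plain Scala `Map`s, with `flatMap` and a `groupBy`-plus-sum standing in for Spark's `flatMap` and `reduceByKey(_ + _)`. The three-page link graph and the iteration count are illustrative; the `step` function is the slide's loop body:

```scala
// Plain-Scala check of the PageRank rule from the slides:
// rank(p) = Σ over in-neighbors i of rank_i / |neighbors_i|.
object PageRankDemo {
  // One iteration: each page splits its rank evenly among its out-neighbors,
  // then contributions to the same page are summed (the reduceByKey step).
  def step(links: Map[String, Seq[String]],
           ranks: Map[String, Double]): Map[String, Double] =
    links.toSeq
      .flatMap { case (url, dests) => dests.map(d => (d, ranks(url) / dests.size)) }
      .groupBy(_._1)
      .map { case (url, cs) => url -> cs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Illustrative graph in which every page has at least one in-link.
    val links = Map("a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))
    var ranks = links.keys.map(_ -> 1.0).toMap      // 1. start each page at rank 1
    for (_ <- 1 to 10) ranks = step(links, ranks)   // 2. iterate the update rule
    ranks.toSeq.sortBy(_._1).foreach { case (u, r) => println(f"$u%s: $r%.3f") }
  }
}
```

Because every page redistributes its entire rank each round, the total rank is conserved across iterations, which is a quick sanity check on the rule.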