Spark:
Resilient Distributed Datasets for
In-Memory Cluster Computing
Brad Karp
UCL Computer Science
(with slides contributed by Matei Zaharia)
CS M038 / GZ06
9th March 2016
Motivation
MapReduce greatly simplified "big data" analysis on large, unreliable clusters
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
» More interactive ad-hoc queries
Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)
Motivation
Complex apps and interactive queries both need one thing that MapReduce lacks:
Efficient primitives for data sharing
In MapReduce, the only way to share data across jobs is stable storage → slow!
Examples
[Diagram: an iterative job writes its result to HDFS after each iteration and reads it back at the start of the next (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …); interactive use re-reads the input from HDFS for each of query 1, 2, 3, …]
Slow because of replication and disk I/O, but necessary for fault tolerance
Goal: In-Memory Data Sharing
[Diagram: the same workloads with in-memory sharing; iter. 1 feeds iter. 2 directly, and one-time processing of the input serves queries 1, 2, 3, … from memory]
RAM is 10-100× faster than network/disk, but how to get fault tolerance?
Challenge
How to design a distributed memory
abstraction that is both fault-tolerant and
efficient?
Challenge
Existing storage abstractions have interfaces
based on fine-grained updates to mutable
state
» RAMCloud, databases, distributed mem, Piccolo
Requires replicating data or logs across
nodes for fault tolerance
» Costly for data-intensive apps
» 10-100x slower than memory write
Solution: Resilient Distributed
Datasets (RDDs)
Restricted form of distributed shared
memory
» Immutable, partitioned collections of records
» Can only be built through coarse-grained
deterministic transformations (map, filter, join, …)
Efficient fault recovery using lineage
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails
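To make the restriction concrete, here is a minimal sketch of building an RDD purely through coarse-grained transformations in Spark's Scala API; the dataset, variable names, and app name are illustrative, not from the slides.

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative sketch: each transformation yields a new, immutable RDD;
  // Spark logs the operations (the lineage), not the data itself.
  val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

  val records  = sc.parallelize(Seq("a,1", "b,2", "a,3"))   // partitioned collection of records
  val pairs    = records.map { r => val f = r.split(','); (f(0), f(1).toInt) }
  val filtered = pairs.filter { case (_, v) => v > 1 }      // coarse-grained, deterministic ops
  val summed   = filtered.reduceByKey(_ + _)

  // If a partition of `summed` is lost, Spark reapplies map -> filter -> reduceByKey
  // to the corresponding input partitions; nothing extra is paid if no failure occurs.
  summed.collect().foreach(println)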
RDD Recovery
[Diagram (animated over several slides): the in-memory sharing pipeline from the previous slide; when a partition is lost, it is recomputed from the input via its lineage rather than restored from a replica]
Generality of RDDs
Despite their restrictions, RDDs can express
surprisingly many parallel algorithms
» These naturally apply the same operation to many
items
Unify many current programming models
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: BSP (Pregel),
iterative MapReduce (Haloop), bulk incremental, …
Support new apps that these models don’t
Tradeoff Space
[Chart: granularity of updates (fine ↔ coarse) vs. write throughput (low ↔ high). K-V stores, databases, and RAMCloud provide fine-grained updates and are best for transactional workloads; HDFS and RDDs provide coarse-grained updates and are best for batch workloads, with RDDs at the high-throughput end. Network bandwidth and memory bandwidth mark the attainable throughput limits.]
Outline
Spark programming interface
Implementation
Demo
How people are using Spark
Spark Programming
Interface
DryadLINQ-like API in the Scala language
Usable interactively from Scala interpreter
Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new
RDDs), actions (compute and output results)
» Control of each RDD’s partitioning (layout across
nodes) and persistence (storage in RAM, on disk,
etc)
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

  lines = spark.textFile("hdfs://...")              // base RDD
  errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
  messages = errors.map(_.split('\t')(2))
  messages.persist()

  messages.filter(_.contains("foo")).count          // action
  messages.filter(_.contains("bar")).count

[Diagram (animated over several slides): the master sends tasks to workers holding HDFS blocks 1-3; each worker builds and caches its partition of messages (Msgs. 1-3) in RAM and returns results to the master; later queries reuse the cached partitions]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
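For readers who want to run the snippet above outside the interactive shell, here is a self-contained sketch of the same program; the SparkContext setup and the concrete HDFS path are assumptions, while the transformations and actions follow the slide.

  import org.apache.spark.{SparkConf, SparkContext}

  object LogMining {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("log-mining"))

      val lines    = sc.textFile("hdfs://namenode:8020/logs/app.log") // assumed path
      val errors   = lines.filter(_.startsWith("ERROR"))
      val messages = errors.map(_.split('\t')(2))   // keep the third tab-separated field
      messages.persist()                            // cache the filtered messages in RAM

      // Each action scans only the cached messages partitions, not the raw log.
      println(messages.filter(_.contains("foo")).count())
      println(messages.filter(_.contains("bar")).count())
    }
  }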
Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
E.g.:
  messages = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))

  HadoopRDD (path = hdfs://…)  →  FilteredRDD (func = _.contains(...))  →  MappedRDD (func = _.split(…))
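The lineage pictured above can also be inspected at runtime. A small illustrative example, assuming a SparkContext `sc` as in the interactive shell (the exact RDD class names printed depend on the Spark version):

  // Hypothetical inspection of the lineage that fault recovery relies on.
  val messages = sc.textFile("hdfs://...")          // backed by a HadoopRDD
    .filter(_.contains("error"))                    // filtered RDD
    .map(_.split('\t')(2))                          // mapped RDD

  // toDebugString prints the chain of parent RDDs from which a lost
  // partition of `messages` would be recomputed.
  println(messages.toDebugString)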
Fault Recovery Results
[Chart: iteration time (s) across 10 iterations of a job with a failure partway through; the first iteration takes 119 s, normal iterations take roughly 56-59 s, and the iteration in which the failure happens takes 81 s while lost partitions are recomputed]
Fault Tolerance vs. Performance
With RDDs, the programmer controls the tradeoff between fault tolerance and performance:
• persist frequently (e.g., with REPLICATE): fast recovery, but slower execution
• persist infrequently: fast execution, but slow recovery
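As a sketch of this knob in the Scala API, again assuming a SparkContext `sc`: replicated persistence corresponds to the `_2` storage levels (an assumption about how the slide's REPLICATE flag maps onto today's API), while plain in-memory persistence leaves recovery to lineage.

  import org.apache.spark.storage.StorageLevel

  val messages = sc.textFile("hdfs://...")
    .filter(_.startsWith("ERROR"))
    .map(_.split('\t')(2))

  // Fast recovery, slower execution: two in-memory copies of every partition.
  messages.persist(StorageLevel.MEMORY_ONLY_2)

  // Fast execution, slower recovery: one copy; lost partitions are recomputed
  // from lineage.
  // messages.persist(StorageLevel.MEMORY_ONLY)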
Why Wide vs. Narrow
Dependencies?
RDD includes metadata:
• lineage (graph of RDDs and operations)
• wide dependency: requires shuffle (and
possibly entire preceding RDD(s) in lineage)
• narrow dependency: no shuffle (and only
needs isolated partition(s) of preceding RDD(s)
in lineage)
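A small illustrative example of the two kinds of dependency (the data and names are made up; `sc` is an existing SparkContext):

  val nums = sc.parallelize(1 to 1000000, 8)

  // Narrow dependencies: each output partition depends on exactly one input
  // partition, so no shuffle is needed and a lost partition is rebuilt from
  // a single parent partition.
  val evens   = nums.filter(_ % 2 == 0)
  val doubled = evens.map(_ * 2)

  // Wide dependency: reduceByKey groups records by key across all input
  // partitions, so it requires a shuffle, and recovering a lost partition may
  // require recomputing many (or all) parent partitions.
  val counts = doubled.map(x => (x % 10, 1)).reduceByKey(_ + _)

  // The dependency chain can be printed via the lineage:
  println(counts.toDebugString)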
Example: PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page's rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

  links = // RDD of (url, neighbors) pairs
  ranks = // RDD of (url, rank) pairs

  for (i <- 1 to ITERATIONS) {
    ranks = links.join(ranks).flatMap {
      case (url, (links, rank)) =>
        links.map(dest => (dest, rank / links.size))
    }.reduceByKey(_ + _)
  }
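A self-contained sketch of the loop above that would compile and run; the input file format, path, and object name are assumptions, and the update rule is the undamped one from the slide.

  import org.apache.spark.{SparkConf, SparkContext}

  object PageRankSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("pagerank-sketch"))
      val ITERATIONS = 10

      // Assumed input: one "srcUrl destUrl" edge per line.
      val edges = sc.textFile("hdfs://namenode:8020/pagerank/links.txt")
        .map { line => val p = line.split("\\s+"); (p(0), p(1)) }

      val links = edges.groupByKey().cache()         // (url, neighbors)
      var ranks = links.mapValues(_ => 1.0)          // (url, rank), all start at 1

      for (_ <- 1 to ITERATIONS) {
        val contribs = links.join(ranks).flatMap {
          case (_, (neighbors, rank)) =>
            neighbors.map(dest => (dest, rank / neighbors.size))
        }
        ranks = contribs.reduceByKey(_ + _)
      }

      ranks.collect().foreach { case (url, rank) => println(s"$url: $rank") }
    }
  }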
Optimizing Placement
[Diagram: PageRank dataflow; Links (url, neighbors) and Ranks0 (url, rank) are joined into Contribs0, reduced into Ranks1, joined again, and so on]
links & ranks are joined repeatedly
Can co-partition them (e.g. hash both on URL) to avoid shuffles
Can also use app knowledge, e.g., hash on DNS name

  links = links.partitionBy(new URLPartitioner())
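The URLPartitioner named on the slide is not spelled out; a hypothetical version that hashes on the URL's host (one way to co-partition links and ranks) might look like this:

  import java.net.URI
  import org.apache.spark.{HashPartitioner, Partitioner}

  // Hypothetical partitioner: put all pages from the same host in the same
  // partition, so joining links with ranks needs no shuffle.
  class URLPartitioner(partitions: Int) extends Partitioner {
    private val hashPart = new HashPartitioner(partitions)
    override def numPartitions: Int = partitions
    override def getPartition(key: Any): Int = {
      val host = URI.create(key.toString).getHost
      hashPart.getPartition(if (host != null) host else key)
    }
  }

  // Using the same partitioner instance for both RDDs keeps them co-partitioned:
  // val links = edges.groupByKey(new URLPartitioner(64)).cache()
  // var ranks = links.mapValues(_ => 1.0)   // mapValues preserves the partitioner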
PageRank Performance
[Chart: time per iteration (s): Hadoop 171, Basic Spark 72, Spark + Controlled Partitioning 23]
Memory Exhaustion
Least Recently Used (LRU) eviction of partitions
First evict used, non-persisted partitions; they vanish, as they are not stored on disk
Then evict persisted ones
Behavior with Insufficient RAM
[Chart: iteration time (s) vs. percent of working set in memory]
  % of working set in memory:   0%     25%    50%    75%    100%
  Iteration time (s):           68.8   58.1   40.7   29.7   11.5
Scalability
[Charts: iteration time (s) vs. number of machines (25, 50, 100) for Logistic Regression and K-Means, comparing Hadoop, HadoopBinMem, and Spark]
Contributors to Speedup
[Chart: iteration time (s) for text vs. binary input, reading from in-memory HDFS, an in-memory local file, or a Spark RDD]
                 In-mem HDFS   In-mem local file   Spark RDD
  Text input         15.4            13.1             2.9
  Binary input        8.4             6.9             2.9
Implementation
Runs on Mesos [NSDI '11] to share the cluster with Hadoop
Can read from any Hadoop input source (HDFS, S3, …)
[Diagram: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes]
No changes to Scala language or compiler
» Reflection + bytecode analysis to correctly ship code
www.spark-project.org
Programming Models Implemented on Spark
RDDs can express many existing parallel models
» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark)
All are based on coarse-grained operations
Enables apps to efficiently intermix these models
RDDs: Summary
RDDs offer a simple and efficient programming model for a broad range of applications
• Avoid disk I/O overhead for intermediate results of multi-phase computations
• Leverage lineage: cheaper than checkpointing when there are no failures, low-cost recovery when there are
• Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery
Scheduler is significantly more complicated than MapReduce's; we did not discuss it!