University of Pennsylvania
CIS 505
Software Systems
Linh Thi Xuan Phan
Department of Computer and Information Science
University of Pennsylvania
Lecture 21: Spark
April 17, 2017
Plan for today
• Iterative processing
NEXT
– Example: PageRank
– Challenges with MapReduce
• Spark
– Resilient Distributed Datasets
– Implementation
– Broadcast variables and accumulators
2
Why PageRank?
[Figure: web search for "Team sports"]
• Suppose someone googles for "Team sports"
• Which pages should we return?
– Idea: Let's return pages that contain both words frequently (why?)
– Problem: Millions of pages contain these words!
– Which ones should we display ('rank') first?
3
Why PageRank?
[Figure: a small web graph around a "Team Sports" page, with an "A-Team" page, a "Hollywood Series to Recycle" page, Mr. T's page, a cheesy TV shows page, the Yahoo directory, and Wikipedia, connected by hyperlinks]
• Idea: Hyperlinks encode a lot of human judgment!
– What does it mean when a web page links to another page?
– Intra-domain links: Often created primarily for navigation
– Inter-domain links: Confer some measure of authority
• Can we 'boost' pages with many inbound links?
– Theoretically yes – but should all pages 'count' the same?
4
PageRank: Intuition
[Figure: a link graph over pages A through J, annotated with "Shouldn't E's vote be worth more than F's?" and "How many levels should we consider?"]
• Imagine a contest for The Web's Best Page
– Initially, each page has one vote
– Each page votes for all the pages it has a link to
– To ensure fairness, pages voting for more than one page must split
their vote equally between them
– Voting proceeds in rounds; in each round, each page has the
number of votes it received in the previous round
– In practice, it's a little more complicated - but not much!
5
Page, Brin, Motwani, Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Stanford University technical report, 1998
PageRank
• Each page i is given a rank x_i
• Goal: Assign the x_i such that the rank of each page
is governed by the ranks of the pages linking to it:

x_i = \sum_{j \to i} x_j / d_j

(the sum runs over every page j that links to i; d_j is the number of links out from page j)
6
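• A tiny worked instance of this formula (the pages and numbers are made up for illustration): suppose exactly two pages j1 and j2 link to page i, their ranks are x_{j1} = 0.4 and x_{j2} = 0.2, and they have 2 and 1 outgoing links respectively. Then

x_i = 0.4/2 + 0.2/1 = 0.4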
Iterative PageRank
• How do we compute the rank values?
– Problem: The formula from the previous slide only
describes a goal, but not how to get there!
• Idea: Use an iterative algorithm!
– At first, set all the ranks to an initial guess
(e.g., 1/n, where n is the number of pages)
– Then, in several rounds, compute a new
value based on the values in the previous round
– Repeat until convergence
x_i^{(0)} = 1/n

• Is it really that simple?
– Almost... for various practical reasons,
the algorithm usually includes a damping factor
7
PageRank: Simple example
[Figure: three pages with four links between them; every page starts with rank 0.33, and the ranks change over Step #1, Step #2a, Step #2b, ... until convergence]
• Let's look at a simple example
– Three pages with four links between them
• Remember the algorithm (a code sketch follows below):
– Initialize all the ranks to be equal
– Propagate the weights along the edges
– Add up incoming weights, apply damping factor
– Repeat until convergence
8
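• A minimal sketch of these four steps in plain Scala (not Spark) — the three-page graph, the damping factor, and the iteration count are made-up illustrative values:

// Plain-Scala PageRank sketch; graph, damping, and iteration count are illustrative.
object SimplePageRank {
  def main(args: Array[String]): Unit = {
    // page -> pages it links to
    val links = Map(
      "A" -> Seq("B", "C"),
      "B" -> Seq("C"),
      "C" -> Seq("A"))
    val n = links.size
    val d = 0.85                                          // damping factor
    var ranks = links.keys.map(p => p -> 1.0 / n).toMap   // initial guess: 1/n

    for (_ <- 1 to 20) {
      // each page splits its rank evenly among its out-links
      val contribs = links.toSeq.flatMap { case (page, outs) =>
        outs.map(dest => (dest, ranks(page) / outs.size))
      }
      // sum incoming contributions and apply the damping factor
      val sums = contribs.groupBy(_._1).map { case (p, cs) => p -> cs.map(_._2).sum }
      ranks = links.keys.map(p => p -> ((1 - d) / n + d * sums.getOrElse(p, 0.0))).toMap
    }
    ranks.foreach { case (p, r) => println(f"$p: $r%.3f") }
  }
}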
Plan for today
• Iterative processing
– Example: PageRank
– Challenges with MapReduce
NEXT
• Spark
– Resilient Distributed Datasets
– Implementation
– Broadcast variables and accumulators
9
Challenges with MapReduce
[Figure: iterative computation as a chain of MapReduce jobs — initialize state and convert the input to input + state, run map and reduce, test for convergence, repeat, then discard the state and output the results]
• PageRank requires multiple stages of MapReduce
– Why is this a problem?
• Challenge #1: Performance
– Simple control flow: read → map → reduce → write (no loop support)
– Data is written to disk even if it is reused immediately!
– What does this mean for performance?
• Challenge #2: Usability
– MapReduce operates at the level of tuples; lots of boilerplate code
– What would the code for a PageRank job look like?
MapReduce usability
#include "mapreduce/mapreduce.h"

// User's map function
class SplitWords: public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while (i < n && isspace(text[i]))
        i++;
      // Find word end
      int start = i;
      while (i < n && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};

REGISTER_MAPPER(SplitWords);

class Sum: public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};

REGISTER_REDUCER(Sum);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);

  MapReduceSpecification spec;

  for (int i = 1; i < argc; i++) {
    MapReduceInput* in = spec.add_input();
    in->set_format("text");
    in->set_filepattern(argv[i]);
    in->set_mapper_class("SplitWords");
  }

  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Sum");

  // Do partial sums within map
  out->set_combiner_class("Sum");

  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  return 0;
}
• Lots of code for a (relatively) simple algorithm!
– Can we do better?
Challenges with MapReduce
• Challenge #3: Interactivity
– Suppose you wanted to do some interactive 'data mining' on a data
set: Ask a query, look at the result, ask another query, ...
– Even if you had a much simpler query language to formulate these
queries – would you want to use MapReduce for this?
– Probably not! Latency is much too high for interactive use
• Need to load the entire data set from GFS, etc.
• Let's look at another, more recent system that
addresses these three challenges!
12
Plan for today
• Iterative processing
– Example: PageRank
– Challenges with MapReduce
• Spark
NEXT
– Resilient Distributed Datasets
– Implementation
– Broadcast variables and accumulators
13
What is Spark?
• Another 'big data' computing framework
– Much more recent than MapReduce! (2004 vs. 2010)
– Incorporates some of the 'lessons learned' from MapReduce
• Initially developed at UC Berkeley
– Zaharia, Chowdhury, Franklin, Shenker, and Stoica, "Spark: Cluster
Computing with Working Sets", HotCloud 2010
– NSDI paper in 2012
• Now a top-level Apache project
– http://spark.apache.org/
14
Spark vs. MapReduce
• How does Spark address the three challenges we
discussed earlier?
• Performance: In-memory computing
– Not limited by disk I/O (particularly for iterative computations!)
– 10x-100x faster than MapReduce
• Usability: Embedded DSL in Scala
– Type-safe; automatically optimized
– 5-10x less code than MapReduce
• Interactivity: Shell support
– Queries can be submitted interactively from a Scala shell
15
Resilient distributed datasets
• Key innovation: Resilient distributed dataset (RDD)
– Think of a huge block of RAM that is partitioned across machines
• RDDs are basically read-only collections
– Initial RDDs can be created by loading data from stable storage
– Computations on existing RDDs result in new RDDs
– User can write RDDs back to disk if and when they want to
• How does this help?
– RDDs enable reuse!
– Example: Intermediate values in PageRank
• What would be the RDDs?
• Which ones would (have to) be written to disk?
16
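• A minimal sketch of this lifecycle, assuming a SparkContext named 'sc' and placeholder HDFS paths:

// Create an initial RDD by loading data from stable storage
val lines = sc.textFile("hdfs://...")
// Computations on existing RDDs yield new (read-only) RDDs
val errors = lines.filter(_.startsWith("ERROR"))
// Written back to disk only if and when the user asks
errors.saveAsTextFile("hdfs://.../errors-out")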
Transformations
map(f: T→U) : RDD[T] → RDD[U]
filter(f: T→Bool) : RDD[T] → RDD[T]
flatMap(f: T→Seq[U]) : RDD[T] → RDD[U]
sample(fraction: Float) : RDD[T] → RDD[T] (deterministic sampling)
groupByKey() : RDD[(K,V)] → RDD[(K, Seq[V])]
reduceByKey(f: (V,V)→V) : RDD[(K,V)] → RDD[(K,V)]
union() : (RDD[T], RDD[T]) → RDD[T]
join() : (RDD[(K,V)], RDD[(K,W)]) → RDD[(K, (V,W))]
cogroup() : (RDD[(K,V)], RDD[(K,W)]) → RDD[(K, (Seq[V], Seq[W]))]
crossProduct() : (RDD[T], RDD[U]) → RDD[(T,U)]
mapValues(f: V→W) : RDD[(K,V)] → RDD[(K,W)] (preserves partitioning)
sort(c: Comparator[K]) : RDD[(K,V)] → RDD[(K,V)]
partitionBy(p: Partitioner[K]) : RDD[(K,V)] → RDD[(K,V)]
• Transformations can perform computation on RDDs
• Coarse-grained transformations!
– Actual 'join', 'filter', ... operators, not just plain Java code
– This makes it much easier to optimize queries!
17
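• To make these signatures concrete, a small sketch — assuming pair RDDs 'visits: RDD[(String, Int)]' and 'names: RDD[(String, String)]' already exist (the variable names are illustrative):

val perUrl = visits.reduceByKey(_ + _)       // RDD[(K,V)] → RDD[(K,V)]
val joined = perUrl.join(names)              // RDD[(String, (Int, String))]
val labels = joined.mapValues { case (count, name) => s"$name: $count" }   // preserves partitioning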
Actions
count() : RDD[T] → Long
collect() : RDD[T] → Seq[T]
reduce(f: (T,T)→T) : RDD[T] → T
lookup(k: K) : RDD[(K,V)] → Seq[V] (on hash/range-partitioned RDDs)
save(path: String) : outputs the RDD to a storage system, e.g., HDFS
• Transformations do not cause Spark to do work
– Example: errors = lines.filter(_.startsWith("ERROR"))
– Spark would simply remember how the new RDD ('errors') can be
computed from the old one ('lines')
• Work is done only once the user applies an action
– Example: errors.count()
– Why is this a good idea?
• Consider: errors.filter(_.contains("HDFS")).map(_.split("\t")(3)).collect()
18
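• A small sketch of this laziness, assuming a SparkContext 'sc' (the path and field index are illustrative):

val lines  = sc.textFile("hdfs://...")              // nothing is read yet
val errors = lines.filter(_.startsWith("ERROR"))    // still nothing: Spark only records the lineage
val fields = errors.map(_.split("\t")(3))           // still nothing
val n = fields.count()                              // action: only now does Spark run the job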
Lineage
• Spark keeps track of how
RDDs have been constructed
– Result is a lineage graph
– Vertexes represent RDDs,
edges represent transformations
• What could this be useful for?
– Efficiency: Not all RDDs have to be 'materialized' (i.e., kept in RAM
as a full copy); enables query optimization
– Fault tolerance: When a machine fails, the corresponding piece of
the RDD can be recomputed efficiently
• How would a multi-stage MapReduce program achieve this?
– Caching: Since RDDs can be recomputed if need be, Spark can
'evict' them from memory and only keep a 'working set' of RDDs
• However, the user can give Spark 'hints' as to which RDDs should be kept
19
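• A small sketch of the caching hint, assuming a SparkContext 'sc' (the path and filters are illustrative):

val logs = sc.textFile("hdfs://.../logs").persist()   // hint: try to keep this RDD in memory
logs.filter(_.contains("US")).count()                 // first action materializes and caches it
logs.filter(_.contains("DE")).count()                 // second action reuses the cached partitions
// If a cached partition is lost (e.g., its worker fails), Spark recomputes
// just that partition from the lineage graph.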
Example: WordCount
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.save("out.txt")
• Let's look at some example code: WordCount
– Data is loaded from HDFS into an initial RDD called 'file'
– Three transformations are applied
• This results in three more RDDs, of which only the last one is named ('counts')
• Notice how this particular program invokes 'map' and 'reduce' to produce a
data flow similar to that of MapReduce itself! (But this isn't "hard-coded")
• Notice how each transformation accepts a function argument
– A final action writes the result back to a file
• Much more compact than MapReduce code!
20
Example: PageRank
val links = spark.textFile(...).map(...).persist()
var ranks = ... // RDD of (URL, initialRank) pairs
for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  ranks = contribs.reduceByKey((x,y) => x+y)
                  .mapValues(sum => a/N + (1-a)*sum)
}
(Slide annotations: 'links' loads the graph as an RDD of (URL, outlinks) pairs; 'contribs' builds an RDD of (targetURL, float) pairs with the contributions sent by each page; the final step sums contributions by URL to get the new ranks.)
• A few things to note:
– The code is also very compact!
– Notice how 'links' is reused across iterations!
• The 'persist' call tells Spark not to evict it from the 'cache'
– What would the lineage graph look like?
21
PageRank performance
[Figure: iteration time in seconds vs. number of machines — with 30 machines: Hadoop 171 s, Spark 23 s; with 60 machines: Hadoop 80 s, Spark 14 s]
• How does this affect performance?
– Experiment from the paper: PageRank with a 54GB Wikipedia dump
(about 4 million articles)
– Spark is several times faster than Hadoop (the open-source
implementation of MapReduce)
22
Plan for today
• Iterative processing
– Example: PageRank
– Challenges with MapReduce
• Spark
– Resilient Distributed Datasets
– Implementation NEXT
– Broadcast variables and accumulators
23
Spark: Implementation
• Single-master architecture, similar to MapReduce
– Developer writes 'driver' program that connects to cluster of workers
– Driver defines RDDs, invokes actions, tracks lineage
– Workers are long-lived processes that store pieces of RDDs in
memory and perform computations on them
– Many of the details will sound familiar: Scheduling, fault detection
and recovery, handling stragglers, etc.
24
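• A minimal driver-program sketch using Spark's Scala RDD API — the master URL, application name, and input path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object ExampleDriver {
  def main(args: Array[String]): Unit = {
    // The driver connects to the cluster of workers
    val conf = new SparkConf().setAppName("example").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)
    // RDDs and actions are defined here; the workers hold the partitions and do the work
    val data = sc.textFile("hdfs://.../input")
    println(data.count())
    sc.stop()
  }
}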
Spark execution process
[Figure: Spark execution pipeline —
RDD Objects: build the operator DAG (e.g., rdd1.join(rdd2).groupBy(…).filter(…));
DAG Scheduler: split the graph into stages of tasks and submit each stage as ready;
Task Scheduler: launch tasks via the cluster manager and retry failed or straggling tasks;
Worker: threads execute the tasks, and the block manager stores and serves data blocks]
26
Job scheduler in Spark
[Figure: an example RDD dependency graph over RDDs A through G, split at wide dependencies into Stage 1 (groupBy), Stage 2 (map, union), and Stage 3 (join); partitions that are already cached are marked and need not be recomputed]
• Captures RDD dependency graph
• Pipelines functions into “stages”
• Cache-aware for data reuse & locality
• Partitioning-aware to avoid shuffles
27
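• A hedged sketch of the partitioning-awareness in the last bullet — partitionBy, HashPartitioner, and persist are real Spark APIs, while parseLink, the path, and the partition count are made up:

import org.apache.spark.HashPartitioner

// Hash-partition the link table once and keep it in memory; later joins against
// RDDs partitioned the same way can avoid shuffling 'links' again.
val links = sc.textFile("hdfs://.../links")
  .map(parseLink)                          // hypothetical parser yielding (url, outlinks)
  .partitionBy(new HashPartitioner(100))
  .persist()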
Interactivity
[Figure: resulting object graph — each line typed by the user is compiled into its own class (Line 1: var query = “hello”; Line 2: rdd.filter(_.contains(query)).count()); the closure for Line 2 includes a direct pointer to the Line 1 object]
• Spark also offers an interactive (Scala) shell
– Similar to Ruby, Python, ...
• How does this work under the hood?
– Scala interpreter compiles a new class for each new line of code as
it is typed on the master
– These classes are shipped to the workers over HTTP and loaded
into the JVM
27
Example: Log Mining
• How do RDDs help with interactivity?
• Let's consider another example: Log mining
– Suppose the user wants to load error messages from a log into
memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")             // base RDD
errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.cache()
messages.filter(_.contains("foo")).count         // action
messages.filter(_.contains("bar")).count
. . .

[Figure: the driver ships tasks to the workers; each worker reads its block of the log (Block 1-3), caches its partition of 'messages' in memory (Cache 1-3), and sends its results back to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: search of 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
Plan for today
• Iterative processing
– Example: PageRank
– Challenges with MapReduce
• Spark
– Resilient Distributed Datasets
– Implementation
– Broadcast variables and accumulators
NEXT
29
Why broadcast variables?

// Load RDD of (URL, visit) pairs
val visits =
sc.textFile("visits.txt").map(...)

// Load RDD of (URL, name) pairs
val pageNames =
sc.textFile("pages.txt").map(...)

val joined = visits.join(pageNames)

[Figure: the join shuffles both visits.txt and pages.txt over the network from the map tasks to reduce tasks partitioned by key range (A-H, I-O, P-T, U-Z)]
• Suppose we wanted to run the above computation
– Given a list of web page names and a list of URLs visited, find a list
of page names that have been visited
• How would the join transformation be processed?
– Spark would shuffle both pageNames and visits over the network!
– But what if the list of pageNames is relatively small?
30
Alternative if one table is small
val pageNames = sc.textFile("pages.txt").map(...)
val pageMap = pageNames.collect().toMap()
val visits = sc.textFile("visits.txt").map(...)
val joined = visits.map(v=>(v._1,(pageMap(v._1),v._2)))
[Figure: the master reads pages.txt and ships the pageMap table along with every task; each task then joins its block of visits.txt locally to produce 'joined']
• Idea: We can send the page names with each task!
– That way, the join can run locally on each block of visits.txt!
• Problem: Many more map partitions than nodes!
– Sending the table along with each task is wasteful!
31
Broadcast variables
val pageNames = sc.textFile("pages.txt").map(...)
val pageMap = pageNames.collect().toMap()
val bc = sc.broadcast(pageMap)
val visits = sc.textFile("visits.txt").map(...)
val joined = visits.map(v => (v._1, (bc.value(v._1), v._2)))

[Figure: the master broadcasts the pageMap table once to each node; tasks then join their local blocks of visits.txt against it to produce 'joined']
• Idea: The Master can send a copy to each node
– This requires far fewer transmissions!
– Tables can be shared across partitions on each node
• Uses broadcast variables
– Explicit in the code; tells Spark how to distribute the data
32
Accumulators
val badRecords = sc.accumulator(0)
val badBytes = sc.accumulator(0.0)
records.filter(r => {
  if (isBad(r)) {
    badRecords += 1
    badBytes += r.size
    false
  } else {
    true
  }
}).save(...)
printf("Total bad records: %d, avg size: %f\n",
       badRecords.value, badBytes.value / badRecords.value)
• Another common sharing pattern is aggregation
– Suppose we wanted to count the 'bad' records in a data set
– We could use a reduce operation, but that seems a little
heavyweight for this simple task.
• Idea: Use accumulators
– Support an associative 'add' operation and a 'zero' value
– Workers can add, but only the master can read
– Because of associativity, Spark can locally aggregate on each node
33
What is Spark (not) good for?
• Good for:
– Batch applications that apply the same operation to all elements of
a data set
• Example: Many machine-learning algorithms
• RDDs can efficiently 'remember' each transformation as one step in the lineage
– Computations with loops
• Example: PageRank (and other iterative algorithms)
• Can reuse intermediate results; no need to materialize to disk
– Global aggregations
• Example: Count the number of lines in a set of logs that start with some string
• Not so good for:
– Applications that make asynchronous, fine-grained updates to state
• Examples: Web applications, incremental web crawlers
34