Upper and Lower Bounds on the Cost of a Map-Reduce Computation

Based on an article by Foto N. Afrati, Anish Das Sarma, Semih Salihoglu, and Jeffrey D. Ullman.
Images taken from slides by the same authors.
Agenda

 MapReduce – a brief overview
 Communication / Parallelism Tradeoff Model
◦ Motivational Example
◦ Problem Model & Assumptions
◦ Recipe for Lower Bounds
 Known problems
◦ Word Count
◦ Hamming-Distance-1 Problem
◦ Triangle Finding
◦ Finding Instances of Other Graphs
◦ Matrix Multiplication
 Summary
MapReduce - Overview

 A programming paradigm for processing large amounts of data with a parallel, distributed algorithm.
 Consists of a map() function:
◦ An operation applied to all elements of a sequence, ideally in parallel.
◦ Runs on the Map processors of a MapReduce implementation.
 And a reduce() function:
◦ An operation that performs a summary (counting, averaging) over all the elements.
◦ Runs on the Reduce processors of a MapReduce implementation.
 Implementations are available, such as Apache Hadoop.

Let's look at the canonical example: Word Count.
 The input is partitioned into lines/files and sent to the Mappers.
 The Mappers apply the map() function and output key-value pairs of the form <word, 1>.
 In the shuffling phase, data is sorted and sent to the Reducers based on the key.
 The Reducers apply the reduce() function (sum) over all the key-value pairs and output a new key-value pair.
 The new key-value pairs are aggregated into the final result.
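To make the pipeline concrete, here is a minimal single-process sketch of this flow in Python. The names map_fn, reduce_fn, and word_count are illustrative, not part of any MapReduce API; a real framework would run the map and reduce calls on separate processors.

    from itertools import groupby
    from operator import itemgetter

    # map(): emit one <word, 1> pair per word occurrence in a line.
    def map_fn(line):
        return [(word, 1) for word in line.split()]

    # reduce(): sum the counts associated with a single key.
    def reduce_fn(word, counts):
        return (word, sum(counts))

    def word_count(lines):
        # Map phase: every line can be processed independently, in parallel.
        pairs = [kv for line in lines for kv in map_fn(line)]
        # Shuffle phase: sort by key and group, as the framework would.
        pairs.sort(key=itemgetter(0))
        # Reduce phase: one reduce() call per distinct key.
        return [reduce_fn(word, [count for _, count in group])
                for word, group in groupby(pairs, key=itemgetter(0))]

    print(word_count(["the quick brown fox", "the lazy dog"]))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]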
The Drug Interaction Problem

 3000 sets of drug data (patients taking the drug, dates, diagnoses).
 About 1 MB of data per drug.
 We want to know whether there are 2 drugs that, when taken together, increase the risk of heart attack.
 Naturally, we cross-reference every pair of drugs across the whole set of drugs.
The Drug Interaction Problem

 We have a problem: for 3000 drugs, each set of drug data is replicated 2,999 times.
 If each set of data is 1 MB, that is ~9,000 GB of communication.
 The communication cost is too high!
 We propose a different approach: grouping drugs…
The Drug Interaction Problem

 We group the 3000 drugs into 30 groups: $G_1$ contains drugs 1 to 100, $G_2$ contains 101 to 200, …, $G_{30}$ contains 2901 to 3000.
 Each set of drug data is now replicated only 29 times, which means 87 GB of communication instead of ~9,000 GB.
 But: lower parallelism and higher processing cost per reducer!
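A back-of-the-envelope sketch of these numbers, assuming 1 MB per drug and k equally sized groups; communication_gb is an illustrative helper, not from the paper.

    DRUGS = 3000
    MB_PER_DRUG = 1

    # With k groups, each drug's data goes to the reducers pairing its group
    # with each of the other k - 1 groups, so the replication rate is k - 1
    # (within-group pairs ride along with one of those reducers).
    def communication_gb(k):
        replication = k - 1
        return DRUGS * replication * MB_PER_DRUG / 1000

    print(communication_gb(3000))  # one drug per "group": 8997.0 GB
    print(communication_gb(30))    # 100 drugs per group: 87.0 GB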
Problem Model & Assumptions

 Assume that the Map phase is perfectly parallel.
 We will discuss the tradeoff in the Reduce phase.
 We focus on single-round MapReduce applications.
 We want to maximize parallelism: more Reduce processors, each receiving a smaller amount of data.
 We also want to minimize communication cost: less traffic between Map and Reduce processors.
 But the more parallelism we have, the bigger the communication overhead, because we must transfer data to more nodes.
Problem Model & Assumptions

 Let's agree that input/output terminology is relative to the Reduce phase.
 For the purposes of the model, a problem consists of:
◦ Sets of inputs and outputs.
◦ A mapping from outputs to sets of inputs. Each output depends only on the set of inputs it is mapped to.
 We limit ourselves to finite sets of inputs and outputs.
Problem Model & Assumptions

 We define a reducer as a reduce key together with its list of associated values.
 Reducer size is the upper bound on how long the list of values can be – the maximum number of inputs that can be sent to any one reducer.
 We denote the reducer size by $q$.
Problem Model & Assumptions

 We define the replication rate as the average number of key-value pairs (reducers) to which each input is mapped by the mappers.
 We denote the replication rate by $r$.
 The replication rate is a function of the reducer size: $r = f(q)$.
 Multiplying the replication rate by the number of inputs actually present gives the expected communication.
Problem Model & Assumptions

 A mapping schema for a given problem is an assignment of a set of inputs to each reducer, subject to the constraints that:
◦ No reducer is assigned more than $q$ inputs.
◦ For every output, there is at least one reducer that is assigned all of the inputs for that output. Such a reducer covers the output. This reducer need not be unique.
 Inputs and outputs are hypothetical, in the sense that they are all the possible inputs and outputs that might be present in an instance of the problem.
 The mapping schema assigns inputs to reducers without reference to which inputs are actually present.
The Recipe for Lower Bounds

 What we want now is a lower bound on the replication rate, and for that we use a generic technique.
 Upper bounds will be derived separately for each problem.
 Fix $q$, the maximum number of inputs per reducer.
 Derive $g(q)$, an upper bound on the number of outputs a reducer can cover, given $q$.
 Count the total number of inputs $|I|$ and outputs $|O|$.
The Recipe for Lower Bounds

 Now assume there are $p$ reducers, each receiving $q_i \le q$ inputs. Together they cover all outputs. Then:
◦ $\sum_{i=1}^{p} g(q_i) \ge |O|$
◦ $\sum_{i=1}^{p} q_i \, \frac{g(q_i)}{q_i} \ge |O|$
◦ Since $q_i \le q$ for all $i$ and $\frac{g(q)}{q}$ is monotonically increasing, $\sum_{i=1}^{p} q_i \, \frac{g(q)}{q} \ge \sum_{i=1}^{p} q_i \, \frac{g(q_i)}{q_i} \ge |O|$,
◦ and therefore $r = \frac{1}{|I|} \sum_{i=1}^{p} q_i \ge \frac{q\,|O|}{g(q)\,|I|}$.
 Thus we get a lower bound on $r$.
The Drug Interaction Problem

 Consider a toy instance with 6 drugs, where each reducer takes $q = 4$ inputs.
 Using the inequality we get:
◦ $|I| = 6$, $q = 4$, $g(q) = \binom{4}{2} = 6$, $|O| = \binom{6}{2} = 15$
◦ $r \ge \frac{q\,|O|}{g(q)\,|I|} = \frac{4 \cdot 15}{6 \cdot 6} = \frac{5}{3}$
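A one-line check of this bound, using Python's math.comb for the binomials:

    from math import comb

    I = 6               # inputs: the 6 drugs
    q = 4               # reducer size
    g_q = comb(q, 2)    # 6: pairs coverable by one reducer
    O = comb(I, 2)      # 15: all drug pairs

    print(q * O / (g_q * I))  # 1.666..., i.e. r >= 5/3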
Word Count

 Think of the inputs as the word occurrences themselves in the files.
 Then each word occurrence results in exactly one key-value pair.
 The replication rate is 1, independent of the limit on reducer size.
 There is no tradeoff at all between $q$ and $r$ – the Word Count problem is embarrassingly parallel!
Hamming Distance 1 Problem

 A reminder: the Hamming distance between two binary strings is the number of positions in which they differ.
 Our problem is to find all pairs of bit strings of length $b$ that are at Hamming distance 1.
Hamming Distance 1 Problem

 Lemma: $g(q) \le \frac{q \log_2 q}{2}$.
 Proof by induction on $b$:
◦ Basis: $b = 1$. $q$ is either 1 or 2.
◦ If $q = 1$ the reducer can cover no outputs, and $\frac{1 \cdot \log_2 1}{2} = 0$.
◦ If $q = 2$ the reducer can cover at most one output, and $\frac{2 \cdot \log_2 2}{2} = 1$.
◦ Now assume the claim for $b$ and consider string length $b + 1$.
◦ Let $X$ be a set of $q$ bit strings of length $b + 1$.
◦ Let $Y$ be the subset of $X$ of strings of the form $0w$, with $|Y| = y$.
◦ Let $Z$ be the subset of $X$ of strings of the form $1w$, with $|Z| = z$.
◦ $q = y + z$.
Hamming Distance 1 Problem

 Proof by induction on $b$ (cont.):
◦ For any string in $Y$, there is at most one string in $Z$ at Hamming distance 1 (the one with the identical tail $w$).
◦ Thus, the number of outputs with one string in $Y$ and the other in $Z$ is at most $\min(y, z)$.
◦ By induction, there are at most $\frac{y \log_2 y}{2}$ outputs both of whose inputs are in $Y$.
◦ By induction, there are at most $\frac{z \log_2 z}{2}$ outputs both of whose inputs are in $Z$.
◦ So $g(q) \le \frac{y \log_2 y}{2} + \frac{z \log_2 z}{2} + \min(y, z) \le \frac{(y + z)\log_2(y + z)}{2} = \frac{q \log_2 q}{2}$.
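The lemma is easy to verify exhaustively for small string lengths. The sketch below (with b = 3, chosen arbitrarily) enumerates every possible set of q inputs and confirms that none covers more than q·log₂(q)/2 distance-1 pairs:

    from itertools import combinations
    from math import log2

    b = 3
    strings = [format(i, f'0{b}b') for i in range(2 ** b)]

    def at_distance_1(u, v):
        # True when the bit strings differ in exactly one position.
        return sum(x != y for x, y in zip(u, v)) == 1

    # For each reducer size q, the largest number of distance-1 pairs any
    # q-subset of inputs can cover never exceeds q * log2(q) / 2.
    for q in range(1, 2 ** b + 1):
        best = max(sum(at_distance_1(u, v) for u, v in combinations(S, 2))
                   for S in combinations(strings, q))
        print(q, best, q * log2(q) / 2)

Running it shows the bound is tight when q is a power of 2 (the optimal sets are subcubes).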
Hamming Distance 1 Problem

 Note that $|I| = 2^b$ and $|O| = \frac{b \cdot 2^b}{2}$ (each of the $2^b$ strings has $b$ neighbors at distance 1, and each pair is counted twice).
 $\frac{g(q)}{q} \le \frac{\log_2 q}{2}$ is monotonically increasing.
 Theorem: $r \ge \frac{b}{\log_2 q}$.
 Proof: using the recipe from the previous section we get:
◦ $r \ge \frac{q\,|O|}{g(q)\,|I|} \ge \frac{q \cdot b \cdot 2^b \cdot 2}{2 \cdot q \cdot \log_2 q \cdot 2^b} = \frac{b}{\log_2 q}$.
Hamming Distance 1 Problem

 Now we want matching upper bounds for the problem.
 Let's treat the extreme cases first…
 If $q = 2$, every reducer gets exactly 2 inputs, i.e., a single pair of strings at distance 1. Every input string belongs to $b$ such pairs, so it is sent to exactly $b$ reducers, and $r = b$.
 If $q = 2^b$, we need only one reducer, which gets the entire input. So $r = 1$.
Hamming Distance 1 Problem

 Let $b \ge c \ge 2$ be such that $c$ divides $b$. We can achieve reducer size $q = 2^{b/c}$ and replication rate $r = c$ using a splitting algorithm.
 We split each bit string into $c$ segments, each of length $b/c$, and send each string to $c$ reducers.
 There are $c$ groups of reducers; the reducers in each group correspond to the $2^{b - b/c}$ strings of length $b - b/c$ (each group ignores one segment).
 Any two strings at Hamming distance 1 disagree in exactly one of the $c$ segments of length $b/c$.
 The reducer ignoring that segment will cover the output pair.
 Note that this matches the lower bound: $r = c = b / \log_2 q$.
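A small sketch of the splitting algorithm, assuming b = 4 and c = 2 for concreteness; reducer_keys is an illustrative helper, not from the paper. The final check confirms that every one of the b·2^(b−1) distance-1 pairs is covered:

    from collections import defaultdict
    from itertools import combinations

    b, c = 4, 2                        # illustrative choice: c divides b
    seg = b // c                       # segment length b/c = 2

    def reducer_keys(s):
        # Group i's key is s with segment i deleted; s is sent to c reducers.
        return [(i, s[:i * seg] + s[(i + 1) * seg:]) for i in range(c)]

    strings = [format(v, f'0{b}b') for v in range(2 ** b)]
    buckets = defaultdict(list)        # the "shuffle": inputs per reducer
    for s in strings:
        for key in reducer_keys(s):
            buckets[key].append(s)

    # Each reducer holds 2^(b/c) = 4 inputs. Any distance-1 pair differs in
    # exactly one segment, so the reducer ignoring that segment sees both.
    covered = {tuple(sorted(p))
               for bucket in buckets.values()
               for p in combinations(bucket, 2)
               if sum(x != y for x, y in zip(*p)) == 1}
    print(len(covered), b * 2 ** b // 2)   # 32 32: every pair is covered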
Hamming Distance 1 Problem

 Consider now Hamming distance 2…
 While $g(q) = O(q \log q)$ for Hamming distance 1, in this case the bound is $\Omega(q^2)$, which prevents us from getting a good lower bound on the replication rate.
 There is an algorithm that creates one reducer for each string of length $b$.
 For string $s$, the algorithm assigns to its reducer all strings at distance 1 from $s$. Notice that all distinct strings at distance 1 from $s$ are at distance 2 from each other.
 Thus, each reducer gets $q = b$ inputs and covers $\binom{b}{2} = \binom{q}{2} = \Omega(q^2)$ outputs.
Triangle Finding

 Given a graph $G$, the problem now is to find all triples of edges that form a triangle.
 Per our model, assume that every possible edge of the graph may be present…
 Therefore, the inputs to the reducers are the possible edges of a graph, and the outputs are the triples of edges that form triangles.
Triangle Finding

 Lemma: $g(q) \le \frac{\sqrt{2}\, q^{3/2}}{3}$.
 Proof: suppose we assign a reducer all the edges among a set of $k$ nodes. Then there are approximately $\frac{k^2}{2}$ edges assigned. Let this quantity be $q$; then $k = \sqrt{2q}$, approximately.
 The number of triangles among $k$ nodes is approximately $\frac{k^3}{6}$, and in terms of $q$, the upper bound on the number of outputs is $\frac{\sqrt{2}\, q^{3/2}}{3}$.
Triangle Finding

 Let $n$ be the number of vertices in $G$; then $|I| = \binom{n}{2} \approx \frac{n^2}{2}$ and $|O| = \binom{n}{3} \approx \frac{n^3}{6}$.
 $\frac{g(q)}{q} \le \frac{\sqrt{2q}}{3}$ is monotonically increasing.
 Using the recipe we derive that $r \ge \frac{q\,|O|}{g(q)\,|I|} = \frac{n^3 \cdot 3 \cdot 2}{6 \cdot \sqrt{2q} \cdot n^2} = \frac{n}{\sqrt{2q}}$.
 There are known algorithms that match this lower bound on replication rate to within a constant factor, so $r = \Theta\left(\frac{n}{\sqrt{q}}\right)$.
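A quick numeric look at the bound, for an arbitrary n = 1000 and a few reducer sizes; the exact recipe value and the approximation n/√(2q) agree closely:

    from math import comb, sqrt

    n = 1000                                  # an arbitrary complete-graph size
    for q in (10_000, 100_000, 1_000_000):
        I = comb(n, 2)                        # possible edges
        O = comb(n, 3)                        # possible triangles
        g_q = sqrt(2) * q ** 1.5 / 3          # lemma: max outputs per reducer
        print(q, q * O / (g_q * I), n / sqrt(2 * q))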
Triangle Finding

 In practice, triangle finding (for example, in the analysis of communities in social networks) is generally applied to large but sparse graphs.
 Suppose the data graph has $m$ of the $\binom{n}{2}$ possible edges, chosen randomly.
 We can assign a "target" of $q_t = \frac{q}{m}\binom{n}{2}$ possible edges to each reducer. The expected number of edges that actually arrive is then $q$.
 $r = \Omega\left(\frac{n}{\sqrt{q_t}}\right) = \Omega\left(\sqrt{\frac{m}{q}}\right)$.
The Alon Class of Sample Graphs

 Let's generalize triangle finding further…
 Problem: find instances of graphs in the Alon class, named after Noga Alon.
 These graphs have the property that we can partition the nodes into disjoint sets such that the subgraph induced by each set either:
◦ is a single edge between two nodes, or
◦ contains an odd-length Hamiltonian cycle.
 Cycles, complete graphs, and paths of odd length are all in the Alon class.
The Alon Class of Sample Graphs

 Let $S$ be a sample graph in the Alon class, with $s$ nodes.
 Noga Alon proved that the number of instances of $S$ in a graph of $m$ edges is $O(m^{s/2})$.
 So if a reducer has $q$ inputs, the number of instances of $S$ it can find is $g(q) = O(q^{s/2})$.
 If all edges are present, the number of instances is $\Omega(n^s)$.
 Repeating the analysis we get:
◦ $r = \Omega\left(\frac{q \cdot n^s}{n^2 \cdot q^{s/2}}\right) = \Omega\left(\frac{n^{s-2}}{q^{(s-2)/2}}\right)$.
Paths of Length Two

 Let's look at the simplest non-Alon sample graph, the path of length 2 (a "2-path"), and perform the usual analysis.
 Any two distinct edges can be combined to form at most one 2-path, so the number of 2-paths covered by a reducer is at most $g(q) = \binom{q}{2} \approx \frac{q^2}{2}$.
 Assume that the input graph has $n$ vertices.
 $|I| = \binom{n}{2} \approx \frac{n^2}{2}$ and $|O| = 3\binom{n}{3} \approx \frac{n^3}{2}$.
 $r \ge \frac{q\,|O|}{g(q)\,|I|} = \frac{q \cdot 2 \cdot 2 \cdot n^3}{q^2 \cdot 2 \cdot n^2} = \frac{2n}{q}$.
Paths of Length Two

 We now want to show upper bounds on $r$.
 If $q = n$, we have one reducer for each node, and we send each edge $(a, b)$ to two reducers: those for $a$ and for $b$. The replication rate is thus 2.
 The reducer for node $a$ receives all edges consisting of $a$ and another node, and can produce all 2-paths that have $a$ as the middle node.
 If $q \ge n$, $r$ is either 1 or 2.
Paths of Length Two

 If $q < n$, let $k = \frac{2n}{q}$ (suppose for convenience that this is an integer).
 Suppose $h$ is a hash function that divides the $n$ nodes into $k$ equal-size buckets.
 The reducers correspond to pairs $[u, \{i, j\}]$, where $u$ is the middle node of the 2-path and $1 \le i, j \le k$, $i \ne j$.
 There are thus $n\binom{k}{2}$ reducers.
 We send edge $(a, b)$ to the $2(k - 1)$ reducers $[b, \{h(a), *\}]$ and $[a, \{*, h(b)\}]$.
Paths of Length Two

 Let's look at the reducer $[u, \{i, j\}]$. It covers all 2-paths $v - u - w$ such that $h(v)$ and $h(w)$ are each either $i$ or $j$.
 If $h(v) = h(w)$, then many reducers will cover the 2-path, and we want only one to produce it.
 We can fix this by letting the reducer produce $v - u - w$ only if:
◦ $h(v) = i$ and $h(w) = j$ (or vice versa), or
◦ $h(v) = h(w) = i$ and $j = i + 1 \pmod{k}$.
 In this case $r = 2(k - 1)$. Thus, to within a constant factor, the upper and lower bounds match.
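A sketch of this mapping schema, with a toy hash h(v) = v mod k standing in for any function that splits the nodes into k equal buckets:

    from collections import defaultdict
    from itertools import combinations

    n, k = 8, 4                  # toy sizes; k = 2n/q buckets
    h = lambda v: v % k          # stand-in hash function

    def reducer_keys(a, b):
        # Edge (a, b) goes to [b, {h(a), *}] and [a, {*, h(b)}]: 2(k-1) keys.
        keys = [(b, frozenset({h(a), j})) for j in range(k) if j != h(a)]
        keys += [(a, frozenset({i, h(b)})) for i in range(k) if i != h(b)]
        return keys

    buckets = defaultdict(set)
    for a, b in combinations(range(n), 2):    # every possible edge
        for key in reducer_keys(a, b):
            buckets[key].add((a, b))

    # n * C(k,2) = 48 reducers, each holding about q = 2n/k = 4 edges: every
    # 2-path v-u-w is covered by the reducer [u, {h(v), h(w)}].
    print(len(buckets), max(len(edges) for edges in buckets.values()))  # 48 4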
Matrix Multiplication

 Suppose we have $n \times n$ matrices $R$ and $S$, and we wish to form their product $T$.
 Unlike the previous examples, each output depends on $2n$ inputs, rather than just two or three.
 We will also explore methods that use two interrelated rounds of MapReduce, and discover that they can be considerably better.
Matrix Multiplication

 We now apply the familiar recipe…
 Suppose a reducer covers the outputs $t_{14}$ and $t_{23}$.
 Then rows 1 and 2 of $R$ and columns 3 and 4 of $S$ are inputs to that reducer.
 Thus, this reducer also covers $t_{13}$ and $t_{24}$.
 The set of outputs covered forms a "rectangle".
 If an input to a reducer is not part of a whole row of $R$ or column of $S$, it cannot be used in any output…
 Thus, the number of inputs to such a reducer is $n(w + h)$, where $w$ and $h$ are the "width" and "height" of the "rectangle", respectively.
Matrix Multiplication

 The total number of outputs covered is $g(q) = hw$.
 For a given $q$, the number of outputs is maximized when the rectangle is a square: $w = h = \frac{q}{2n}$.
 In this case, $g(q) = \frac{q^2}{4n^2}$.
 Obviously, $|I| = 2n^2$ and $|O| = n^2$.
 $r \ge \frac{q\,|O|}{g(q)\,|I|} = \frac{4n^2 \cdot q \cdot n^2}{q^2 \cdot 2n^2} = \frac{2n^2}{q}$.
Matrix Multiplication

 If $q \ge 2n^2$, the entire job can be done by one reducer.
 If $q < 2n$, no reducer can get enough input to compute even one output.
 Between these extremes, we can match the lower bound by giving each reducer a set of rows from $R$ and an equal number of columns from $S$.
 Partition the rows and the columns into $n/s$ groups of $s$ rows/columns each.
 Then $q = 2sn$ and $r = 2n^2/q$.
 The number of reducers is $\left(\frac{n}{s}\right)^2$.
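A minimal single-process sketch of this one-round scheme, using numpy blocks to stand in for reducers; each loop iteration plays the role of one reducer receiving s full rows of R and s full columns of S:

    import numpy as np

    n, s = 6, 2                       # illustrative sizes; s divides n
    R = np.random.rand(n, n)
    S = np.random.rand(n, n)
    T = np.zeros((n, n))

    # One reducer per (row group, column group): (n/s)^2 reducers, each with
    # q = 2sn inputs, producing an s-by-s block of T.
    for gi in range(n // s):
        for gj in range(n // s):
            rows = R[gi * s:(gi + 1) * s, :]   # s full rows of R
            cols = S[:, gj * s:(gj + 1) * s]   # s full columns of S
            T[gi * s:(gi + 1) * s, gj * s:(gj + 1) * s] = rows @ cols

    print(np.allclose(T, R @ S))      # True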
Matrix Multiplication

 Let's now discuss the two-phase MapReduce algorithm for matrix multiplication.
 In the first phase we compute $x_{ijk} = r_{ij} s_{jk}$ for each $i$, $j$, $k$ between 1 and $n$.
 We sum the $x_{ijk}$'s at a given reducer if they share common values of $i$ and $k$, producing a partial sum for the pair $(i, k)$.
 In the second phase, the partial sums for each pair are sent to a reducer whose responsibility is to sum all the partial sums and compute $t_{ik}$.
Matrix Multiplication

 Note that the mappers of the second phase can reside at the same node as the $x_{ijk}$'s to which they apply.
 Thus, no communication is needed between first-phase reducers and second-phase mappers.
 The set of outputs covered by a first-phase reducer again forms a "rectangle": if a reducer covers $x_{ijk}$ and $x_{yjz}$, then it also covers $x_{ijz}$ and $x_{yjk}$.
 We shall assume that each first-phase reducer is given a set of $s$ rows of $R$, $s$ columns of $S$, and $t$ values of $j$, where $1 \le t \le n$.
Matrix Multiplication

 There is a reducer covering each $x_{ijk}$, which means the number of reducers is $\left(\frac{n}{s}\right)^2 \frac{n}{t}$.
 Each element of matrices $R$ and $S$ is sent to $\frac{n}{s}$ reducers, so the "total communication" in the first phase is $\frac{2n^3}{s}$.
 Each reducer produces a partial sum for $s^2$ pairs, so the "total communication" in the second phase is $\left(\frac{n}{s}\right)^2 \frac{n}{t} \cdot s^2 = \frac{n^3}{t}$.
Matrix Multiplication

 The sum of the communication is $\frac{2n^3}{s} + \frac{n^3}{t}$.
 We want to minimize this function subject to the constraint $q = 2st$, and we get $t = \frac{\sqrt{q}}{2}$ and $s = \sqrt{q}$.
 The sum of the communication is then $\frac{2n^3}{\sqrt{q}} + \frac{2n^3}{\sqrt{q}} = \frac{4n^3}{\sqrt{q}}$.
 The total communication for the one-phase method is the replication rate times the number of inputs: $\frac{2n^2}{q} \cdot 2n^2 = \frac{4n^4}{q}$.
 The two-phase method uses less communication when $\frac{4n^4}{q} > \frac{4n^3}{\sqrt{q}}$, i.e., when $q < n^2$; that is, for any number of reducers except one.
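The comparison is easy to reproduce numerically; the sketch below evaluates both totals for an arbitrary n = 1000. Note the tie at q = n² = 10⁶, matching the analysis:

    from math import sqrt

    def one_phase(n, q):
        # replication rate 2n^2/q times 2n^2 inputs = 4n^4/q
        return (2 * n ** 2 / q) * 2 * n ** 2

    def two_phase(n, q):
        # optimal split: s = sqrt(q), t = sqrt(q)/2, giving 4n^3/sqrt(q)
        s, t = sqrt(q), sqrt(q) / 2
        return 2 * n ** 3 / s + n ** 3 / t

    n = 1000
    for q in (10 ** 4, 10 ** 5, 10 ** 6):
        print(q, one_phase(n, q), two_phase(n, q))
    # two_phase wins for q < n^2; at q = n^2 = 10^6 the two methods tie.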
Summary & Open Problems

 We introduced a simple model for MapReduce algorithms that enables us to study their performance.
 We identified the replication rate and the reducer size as two parameters representing the communication cost and the capabilities of the compute nodes.
 We demonstrated that these two parameters are related by a precise tradeoff formula.
 An open problem, for example: good lower bounds for Hamming distance greater than 1…