Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Upper and Lower Bounds on the cost of a Map-Reduce Computation Based on an article by Foto N. Afrati, Anish Das Sarma, Semih Salihoglu, Jeffrey D. Ullman Images taken from slides by same authors. Agenda MapReduce – a brief overview Communication / Parallelism Tradeoff Model ◦ Motivational Example ◦ Problem Model & Assumptions ◦ Recipe for Lower Bounds Known problems ◦ Word Count ◦ Hamming-Distance-1 Problem ◦ Triangle Finding ◦ Finding Instances of Other Graphs ◦ Matrix Multiplication Summary Agenda MapReduce – a brief overview Communication / Parallelism Tradeoff Model ◦ Motivational Example ◦ Problem Model & Assumptions ◦ Recipe for Lower Bounds Known problems ◦ Word Count ◦ Hamming-Distance-1 Problem ◦ Triangle Finding ◦ Finding Instances of Other Graphs ◦ Matrix Multiplication Summary MapReduce - Overview A programming paradigm for processing large amounts of data with a parallel and distributed algorithm. Consists of a map() function ◦ An operation applied to all elements of a sequence, desirably in parallel. ◦ Runs on Map processors of a MapReduce implementation. And a reduce() function ◦ An operation that performs a summary (counting, averaging) on all the elements. ◦ Runs on Reduce processors of a MapReduce implementation. Implementations are available such as Apache Hadoop. Let’s look at the canonical example… Input is partitioned into lines/files and sent to the Mappers Mappers perform the map() function and output key-value pairs in the form <word, 1>. In the shuffling phase data is sorted and sent to the Reducers based on the key. The Reducers perform the reduce() function (sum) over all the keyvalue pairs and output a new key-value pair. New key-value pairs are aggregated into the final result. Agenda MapReduce – a brief overview Communication / Parallelism Tradeoff Model ◦ Motivational Example ◦ Problem Model & Assumptions ◦ Recipe for Lower Bounds Known problems ◦ Word Count ◦ Hamming-Distance-1 Problem ◦ Triangle Finding ◦ Finding Instances of Other Graphs ◦ Matrix Multiplication Summary The Drug Interaction Problem 3000 sets of drug data (patients taking, dates, diagnoses). About 1M of data per drug. We are interested whether there are 2 drugs that when taken together increase the risk of heart attack. Naturally we cross reference every pair of drugs across whole set of drugs. The Drug Interaction Problem The Drug Interaction Problem We have a problem: for 3000 drugs, each set of drug data is replicated 2999 times. If each set of data is 1M large we have ~9000GB of communication. Communication cost is too high! We purpose a different approach: grouping drugs… The Drug Interaction Problem The Drug Interaction Problem We group 3000 drugs to 30 groups: 𝐺1 from 1 to 100, 𝐺2 from 101 to 200, …. , 𝐺30 from 2901 to 3000. Each set of drug data is only replicated 29 times, which means 87GB communication cost vs 9000GB. But lower parallelism, higher processing cost! Agenda MapReduce – a brief overview Communication / Parallelism Tradeoff Model ◦ Motivational Example ◦ Problem Model & Assumptions ◦ Recipe for Lower Bounds Known problems ◦ Word Count ◦ Hamming-Distance-1 Problem ◦ Triangle Finding ◦ Finding Instances of Other Graphs ◦ Matrix Multiplication Summary Problem Model & Assumptions Assume that the Map phase is parallel. We will discuss the tradeoff in the Reduce phase. We will focus on single-round MR applications. We want to maximize parallelism, which means more Reduce processors while each processor gets smaller amounts of data. We also want to minimize communication costs, which means less traffic between Map and Reduce processors. But the more parallelism we got, the bigger the communication overhead is because we need to transfer data to more nodes. Problem Model & Assumptions Let’s agree that input / output terminology is relative to the reduce phase. For the purposes of the model let us define a problem which consists of: ◦ Sets of inputs and outputs. ◦ A mapping from outputs to sets of inputs. Each output depends on only the set of inputs it is mapped to. We need to limit ourselves to finite sets of inputs and outputs. Problem Model & Assumptions We define reducer as a reduce key together with its list of associated values. Reducer size is the upper bound on how long the list of values can be – the maximum number of inputs that can be sent to any one reducer. Let us denote reducer size q. Problem Model & Assumptions We define replication rate as the average number of key value pairs (reducers) to which each input is mapped by the mappers. Let us denote replication rate r. Replication rate is a function of reducer size: 𝑟 = 𝑓(𝑞). Replication rate represents the expected communication if we multiply it by the number of inputs actually present. Problem Model & Assumptions A mapping schema for a given problem is an assignment of a set of inputs to each reducer, subject to the constraints that: ◦ No reducer is assigned more than q inputs ◦ For every output, there is at least one reducer that is assigned to all of the inputs for that output. Such reducer covers the output. This reducer need not be unique. Inputs and outputs are hypothetical, in the sense that they are all the possible inputs and outputs that might be present in an instance of the problem. The mapping assigns inputs to reducers without reference to what inputs are actually present. Problem Model & Assumptions Agenda MapReduce – a brief overview Communication / Parallelism Tradeoff Model ◦ Motivational Example ◦ Problem Model & Assumptions ◦ Recipe for Lower Bounds Known problems ◦ Word Count ◦ Hamming-Distance-1 Problem ◦ Triangle Finding ◦ Finding Instances of Other Graphs ◦ Matrix Multiplication Summary The Recipe for Lower Bounds What we want now is to compute a lower bound for the replication rate, so we use a generic technique. Upper bounds will be derived independently on each problem. Let us fix 𝑞, which is the maximum number of inputs. Derive 𝑔(𝑞) which is an upper bound on the number of outputs a reducer can cover, given 𝑞. Count the total number of inputs |𝐼| and outputs |𝑂|. The Recipe for Lower Bounds Now assume there are 𝑝 reducers, each receiving 𝑞𝑖 inputs. Together they cover all outputs. Then: ◦ ◦ ◦ 𝑝 𝑖=1 𝑔 𝑞𝑖 𝑝 𝑞𝑖 𝑔(𝑞𝑖 ) 𝑖=1 𝑞𝑖 𝑝 𝑞𝑖 𝑔(𝑞) 𝑖=1 𝑞 ≥ |𝑂| ≥ |𝑂| ≥ 𝑝 𝑞𝑖 𝑔 𝑞𝑖 𝑖=1 𝑞𝑖 𝑎𝑛𝑑 ◦ 𝑟= 1 |𝐼| 𝑝 𝑖=1 𝑞𝑖 ≥ ≥ |𝑂| ∀𝑞𝑖 : 𝑞𝑖 ≤ 𝑞 𝑔 𝑞𝑖 𝑚𝑜𝑛𝑜𝑡𝑜𝑛𝑖𝑐𝑎𝑙𝑙𝑦 𝑖𝑛𝑐𝑟𝑒𝑎𝑠𝑖𝑛𝑔 𝑞𝑖 𝑞|𝑂| 𝑔 𝑞 |𝐼| Thus we get a lower bound for r. The Drug Interaction Problem The Drug Interaction Problem The Drug Interaction Problem Using the inequality we get: 𝐼 = 6, q = 4, g q = 2= 𝑟≥ 𝑞𝑂 𝑔 𝑞 𝐼 = 4 ∙ 15 6∙6 4 2 = 6, O = = 5 3 6 2 = 15 Agenda MapReduce – a brief overview Communication / Parallelism Tradeoff Model ◦ Motivational Example ◦ Problem Model & Assumptions ◦ Recipe for Lower Bounds Known problems ◦ Word Count ◦ Hamming-Distance-1 Problem ◦ Triangle Finding ◦ Finding Instances of Other Graphs ◦ Matrix Multiplication Summary Word Count Think of the inputs as the word occurrences themselves in the files. Then each word occurrence results in exactly one key-value pair. Replication rate is 1, independent on the limit on reducer size. There is no tradeoff at all between q and r – Word Count problem is embarrassingly parallel ! Agenda MapReduce – a brief overview Communication / Parallelism Tradeoff Model ◦ Motivational Example ◦ Problem Model & Assumptions ◦ Recipe for Lower Bounds Known problems ◦ Word Count ◦ Hamming-Distance-1 Problem ◦ Triangle Finding ◦ Finding Instances of Other Graphs ◦ Matrix Multiplication Summary Hamming Distance 1 Problem A reminder: Hamming distance measures distance between two binary strings. Our problem is to find pairs of bit strings of length 𝑏 that are at hamming distance 1. Hamming Distance 1 Problem 𝑞∙𝑙𝑜𝑔𝑞 2 Lemma: 𝑔 𝑞 ≤ Proof by induction on 𝑏: ◦ Basis: 𝑏 = 1. q is either 1 or 2. ◦ If 𝑞 = 1 the reducer can cover no outputs and 1∙𝑙𝑜𝑔1 2 = 0. ◦ If 𝑞 = 2 the reducer can cover at most one output and 2∙𝑙𝑜𝑔2 2 ◦ Now assume for b and consider string length 𝑏 + 1. ◦ Let X be a set of q bit strings of length 𝑏 + 1. ◦ Let Y be the subset of X of strings in the form 0𝑤. |𝑌| = 𝑦. ◦ Let Z be the subset of X of strings in the form 1𝑤. |𝑍| = 𝑧. ◦ 𝑞 = 𝑦 + 𝑧. =1 Hamming Distance 1 Problem Proof by induction on 𝑏 (cont.): ◦ For any string in Y, there is at most one string in Z at Hamming distance 1. ◦ Thus, the number of outputs with one string in Y and the other in Z is at most min(𝑦, 𝑧). 𝑦∙𝑙𝑜𝑔𝑦 outputs both of whose inputs are in Y. 2 𝑧∙𝑙𝑜𝑔𝑧 induction, at most outputs both of whose inputs are in Z. 2 𝑧∙𝑙𝑜𝑔𝑧 𝑦∙𝑙𝑜𝑔𝑦 𝑦+𝑧 ∙log 𝑦+𝑧 𝑞∙𝑙𝑜𝑔𝑞 𝑔 𝑞 = + + min 𝑦, 𝑧 = … ≤ = 2 2 2 2 ◦ By induction, at most ◦ By ◦ So Hamming Distance 1 Problem 2𝑏 𝑏∙2𝑏 2 Note that 𝐼 = 𝑔(𝑞) 𝑞 Theorem: 𝑟 ≥ Proof: Using the recipe from previous section we get: ≤ ◦ 𝑟≥ 𝑙𝑜𝑔𝑞 2 𝑞𝑂 𝑔 𝑞 𝐼 𝑎𝑛𝑑 𝑂 = is monotonically increasing. ≥ 𝑏 𝑙𝑜𝑔𝑞 𝑞 ∙𝑏 ∙2𝑏 ∙2 2 ∙𝑞 ∙𝑙𝑜𝑔𝑞 ∙2𝑏 = 𝑏 𝑙𝑜𝑔𝑞 Hamming Distance 1 Problem Now we want to find an upper bound for the problem. Let’s treat extreme cases first… If 𝑞 = 2 every reducer gets exactly 2 inputs. In our case it means that every input string is sent to exactly 𝑏 reducers. So 𝑟 = 𝑏. If 𝑞 = 2𝑏 we need only one reducer which gets all the input. So 𝑟 = 1. Hamming Distance 1 Problem 𝑏 𝑐 Let b ≥ 𝑐 ≥ 2 such that 𝑐 divides 𝑏. We have reducer size 2 and replication rate 𝑐 if we use a splitting algorithm. We split each bit string into c segments each of length 𝑏/𝑐, and send them to c reducers. There are c groups of reducers corresponding to each of the 𝑏 𝑏 –𝑐 2 strings of length 𝑏 – 𝑏/𝑐 (Each group ignores one segment). Any two strings of Hamming distance 1 will disagree in only one of the c segments of length 𝑏/𝑐. The reducer ignoring this segment will cover the output pair. Hamming Distance 1 Problem Consider now Hamming distance 2… While 𝑔 𝑞 = 𝑂(𝑞𝑙𝑜𝑔𝑞) for Hamming distance 1, in this case the bound is Ω 𝑞2 , which prevents us from getting a good lower bound on replication rate. There is an algorithm that creates one reducer for each string of length 𝑏. For string 𝑠, this algorithm assigns for its reducer all strings at distance 1 from 𝑠. Notice that all distinct strings at distance 1 from 𝑠 are distance 2 from each other. Thus, each reducer covers 𝑏 2 = 𝑞 2 = Ω 𝑞2 outputs. Agenda MapReduce – a brief overview Communication / Parallelism Tradeoff Model ◦ Motivational Example ◦ Problem Model & Assumptions ◦ Recipe for Lower Bounds Known problems ◦ Word Count ◦ Hamming-Distance-1 Problem ◦ Triangle Finding ◦ Finding Instances of Other Graphs ◦ Matrix Multiplication Summary Triangle Finding Given a graph G, the problem now is to find edges that form a triangle. Assume that all possible edges in the graph can be present according to our model… Therefore, the inputs to the reducers are the possible edges of a graph and the outputs are the triples of edges that form a triangle. Triangle Finding 2∙𝑞 3/2 . 3 Lemma: 𝑔 𝑞 ≤ Proof: Suppose we assign a reducer all the edges among a set of 𝑘 nodes. Then there then 𝑘 = 𝑘2 approx. 2 edges assigned. Let this quantity be 𝑞, 2𝑞. The number of triangles among 𝑘 nodes of 𝑞, the upper bound on the number of 𝑘3 is approx. and in 6 2∙𝑞 3/2 outputs is . 3 terms Triangle Finding Let 𝑛 be the number of vertices in G, then 𝐼 = = 𝑔(𝑞) 𝑞 𝑛 3 ≤ ≈ ≈ 𝑛3 . 6 2𝑞 3 𝑛 2 𝑛2 2 and 𝑂 is monotonically increasing. 𝑞𝑂 𝑔 𝑞 𝐼 𝑛3 ∙3∙2 𝑛 Using the recipe we derive that 𝑟 ≥ There are known algorithms that match the lower bound on 𝑛 replication rate within a constant factor so 𝑟 = Ω( ). = 6∙ 2𝑞∙𝑛2 𝑞 = 2𝑞 . Triangle Finding In practice, applications of triangle finding in analysis of communities in social networks are generally applied to large but sparse graphs. Suppose data graph has 𝑚 of possible 𝑛 We can assign a “target” 𝑞𝑡 = of the possible edges to one 𝑚 2 reducer. The expected number of edges that will arrive will be q. 𝑟=Ω 𝑞 𝑛 𝑞𝑡 = Ω 𝑚 𝑞 . 𝑛 2 edges, chosen randomly. Agenda MapReduce – a brief overview Communication / Parallelism Tradeoff Model ◦ Motivational Example ◦ Problem Model & Assumptions ◦ Recipe for Lower Bounds Known problems ◦ Word Count ◦ Hamming-Distance-1 Problem ◦ Triangle Finding ◦ Finding Instances of Other Graphs ◦ Matrix Multiplication Summary The Alon Class of Sample Graphs Let’s generalize triangle finding further… Problem: find instances of Alon class graphs, named after Noga Alon. These graph have the property that we can partition the nodes into disjoint sets, such that the subgraph induced by each partition is either: ◦ A Single Edge between two nodes, or ◦ Contains an odd-length Hamiltonian cycle. Cycles, complete graphs, paths of odd length are in the Alon class. The Alon Class of Sample Graphs Let S be a sample graph in the Alon class, with s nodes. Noga Alon proved that the number of instances of S in a graph of m edges is 𝑂 𝑚 𝑠 2 . So if a reducer has 𝑞 inputs, the number of instances of S it can find 𝑠 2 is 𝑂 𝑞 . If all edges are present the number of instances is Ω(𝑛 𝑠 ). Repeating the analysis we get: ◦ 𝑟= Ω 𝑞 ∙𝑛𝑠 𝑠 𝑛2 ∙𝑞 2 = Ω 𝑛𝑠−2 𝑞 𝑠−2 . Paths of Length Two Let’s look at the simplest non-Alon graph: the path of length 2 (2path), and perform the known analysis: Any two distinct edges can be combined to form at most one 2path, so the number of 2-paths covered by the reducer is at most 𝑔 𝑞 = 𝑞 2 ≈ 𝑞2 . 2 Assume that input graph has n vertices. 𝐼 = 𝑟≥ 𝑛 2 𝑞𝑂 𝑔 𝑞 𝐼 ≈ = 𝑛2 2 and 𝑂 = 3 𝑞∙2∙2∙𝑛3 𝑞 2 ∙2∙𝑛2 = 2𝑛 𝑞 . 𝑛 3 ≈ 𝑛3 2 . Paths of Length Two We want to show now upper bounds for r. If 𝑞 = 𝑛 we have one reducer for each node. We send an edge (𝑎, 𝑏) to two reducers. The replication rate is thus 2. The reducer for node 𝑎 receives all edges consisting of 𝑎 and another node and can produce all 2-paths that have 𝑎 as the middle node. If 𝑞 ≥ 𝑛 r is either 1 or 2. Paths of Length Two If 𝑞 < 𝑛 we denote 𝑘 = 2𝑛 𝑞 (suppose for convenience that it is an integer). Suppose ℎ is a hash function that divides 𝑛 nodes into 𝑘 equal size buckets. The reducers correspond to pairs [𝑢, {𝑖, 𝑗}] where u is the middle node of the 2-path and 1 ≤ 𝑖, 𝑗 ≤ 𝑘, i ≠ 𝑗. 𝑘 2 There are thus 𝑛 reducers. We send (𝑎, 𝑏) to 2(𝑘 − 1) reducers [𝑏, {ℎ(𝑎),∗}] and [𝑎, { ∗, ℎ(𝑏)}]. Paths of Length Two Let’s look at the reducer [𝑢, {𝑖, 𝑗}]. This covers all 2-paths 𝑣 − 𝑢 − 𝑤 such that ℎ(𝑣) and ℎ(𝑤) are each either 𝑖 or 𝑗. If ℎ(𝑣) = ℎ(𝑤), then many reducers will cover the 2-path, and we want only one. We can fix this by letting the reducer produce the 𝑣 − 𝑢 − 𝑤 if: ◦ ℎ(𝑣) = 𝑖 𝑎𝑛𝑑 ℎ(𝑤) = 𝑗 (or vice versa). ◦ ℎ(𝑣) = ℎ(𝑤) = 𝑖 𝑎𝑛𝑑 𝑗 = 𝑖 + 1 (𝑚𝑜𝑑 𝑘). In this case 𝑟 = 2(𝑘 − 1). Thus, to within a constant factor, the upper and lower bounds match. Agenda MapReduce – a brief overview Communication / Parallelism Tradeoff Model ◦ Motivational Example ◦ Problem Model & Assumptions ◦ Recipe for Lower Bounds Known problems ◦ Word Count ◦ Hamming-Distance-1 Problem ◦ Triangle Finding ◦ Finding Instances of Other Graphs ◦ Matrix Multiplication Summary Matrix Multiplication Suppose we have 𝑛 × 𝑛 matrices R and S and we wish to form their product T. Unlike previous examples, each output depends on 2𝑛 inputs, rather than just two or three. We will also explore methods that use two interrelated rounds of MapReduce, and discover that they can be considerably better. Matrix Multiplication We now apply the familiar recipe… Suppose a reducer covers the outputs 𝑡14 and 𝑡23 . Then rows 1,2 of R are input to that reducer and columns 3,4 of S are input to that reducer. Thus, this reducer covers also 𝑡13 and 𝑡24 . The set of outputs covered forms a “rectangle”. If an input to a reducer is not part of a whole row or column it cannot be used in any output… Thus, the number of inputs to this reducer is 𝑛(𝑤 + ℎ), where 𝑤 and ℎ are “width” and “height” of the “rectangle” respectively. Matrix Multiplication Total number of outputs covered is g q = ℎ𝑤. For a given q, the number of outputs is maximized when the rectangle is a square: 𝑤 = ℎ = 𝑞/(2𝑛). In this case, 𝑔(𝑞) = 𝑞2 /(4𝑛2 ). Obviously, 𝐼 = 2𝑛2 𝑎𝑛𝑑 𝑂 = 𝑛2 . 𝑟≥ 𝑞𝑂 𝑔 𝑞 𝐼 = 4𝑛2 ∙𝑞∙𝑛2 𝑞 2 ∙2𝑛2 = 2𝑛2 𝑞 . Matrix Multiplication If 𝑞 ≥ 2𝑛2 then the entire job can be done by one reducer. If 𝑞 < 2𝑛 then no reducer can get enough input to compute even one output. Between these ranges, we can match the lower bound by giving each reducer a set of rows from R and an equal number of columns from S. Partition the rows and columns into 𝑛/𝑠 groups of 𝑠 rows/columns. 𝑞 = 2𝑠𝑛 and 𝑟 = 2𝑛2 /𝑞. Number of reducers is 𝑛 2 . 𝑠 Matrix Multiplication Let’s discuss now the two phase MapReduce algorithm for matrix multiplication. In the first phase we compute 𝑥𝑖𝑗𝑘 = 𝑟𝑖𝑗 𝑠𝑗𝑘 for each 𝑖, 𝑗, 𝑘 between 1 and n. We sum 𝑥𝑖𝑗𝑘 ’s at a given reducer if they share common values of 𝑖 and 𝑘, producing a partial sum for (𝑖, 𝑘). In the second phase, the partial sum for each pair is sent to a reducer whose responsibility is to sum all the partial sums and compute 𝑡𝑖𝑘 . Matrix Multiplication Matrix Multiplication Note that the mappers of the second phase can reside at the same node as the 𝑥𝑖𝑗𝑘 ’s to which they apply. Thus, no communication is needed between first phase reducers and second phase mappers. The set of outputs covered by a reducer (first phase) again forms a “rectangle”. If a reducer covers 𝑥𝑖𝑗𝑧 and 𝑥𝑦𝑗𝑧 then it also covers 𝑥𝑖𝑗𝑧 and 𝑥𝑦𝑗𝑘 . We shall assume that each reducer (first phase) is given a set of 𝑠 rows of R, 𝑠 columns of S and 𝑡 values of 𝑗, 1 ≤ 𝑡 ≤ 𝑛. Matrix Multiplication There is a reducer covering each 𝑥𝑖𝑗𝑘 which means that the number of reducers is 𝑛 2 𝑛 𝑠 𝑡 . Each element of matrices R and S is sent to 𝑛/𝑠 reducers so “total communication” in the first phase is 2𝑛3 /𝑠. Each reducer produces a partial sum for 𝑠 2 pairs, so “total communication” in the second phase is 𝑛 2 𝑛 2 𝑠 𝑠 𝑡 = 𝑛3 . 𝑡 Matrix Multiplication 2𝑛3 𝑠 𝑛3 . 𝑡 Sum of communication is We want to minimize the function subject to the constraint that 𝑞 = 2𝑠𝑡, and we get 𝑡 = 𝑞 2 + and 𝑠 = 2𝑛3 𝑞 𝑞. 2𝑛3 𝑞 4𝑛3 . 𝑞 Sum of communication is then Total communication for the one phase method is the replication rate times the number of + 2𝑛2 inputs: 𝑞 ∙ = 2𝑛2 = 4𝑛4 . 𝑞 The two phase method uses less communication when 4𝑛4 𝑞 or 𝑞 ≤ 𝑛2 . That is, for any number of reducers except 1. > 4𝑛3 𝑞 Agenda MapReduce – a brief overview Communication / Parallelism Tradeoff Model ◦ Motivational Example ◦ Problem Model & Assumptions ◦ Recipe for Lower Bounds Known problems ◦ Word Count ◦ Hamming-Distance-1 Problem ◦ Triangle Finding ◦ Finding Instances of Other Graphs ◦ Matrix Multiplication Summary Summary & Open Problems We introduced a simple model for MapReduce algorithms, enabling us to study their performance. We identified replication rate and reducer size as two parameters representing the communication cost and node capabilities. We demonstrated that these two parameters are related by a precise tradeoff formula. An open problem for example: Hamming distance greater than 1… Summary & Open Problems