Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Port able and architecture independent parallel performance tuning using a call-graph profiling tool Stephen A. Jarvis Constantinos J. Siniolakis Vasil P. Vasilev Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford, UK. Jonathan M.D. Hill Abstract profiling tools for parallel languages is to identify and expose the relationship (imbalance) of computational costs amongst processors, and subsequently express this relationship in terms of the three criteria outlined above. Unfortunately, within a parallel framework, there is a multiplicity of interacting issues that make these criteria significantly more obscure and complex: This paper describes a post-mortem call-graph profiling tool that analyses trace information generated during the execution of BSPlib programs. The purpose of the tool is t o expose imbalance in either computation or communication, and t o highlight portions of code that are amenable to improvement. Unlike other profiling tools, the profile information guides optimisation in an architecture independent way. From an ease of use perspective, the amount of information displayed when visualising a profile f o r a parallel program is no more complex than that of a sequential program. 1 What-to-cost In parallel programming there are at least two kinds of cost which can cause bottlenecks within programs, computation and communication. These costs should not be decoupled and profiled independently as it is of paramount importance that the interaction between the two is identified and exposed to the user. The motivation being that if programs are optimised with respect to one of these costs it is not at the detriment of the other. Introduction The role of a profiling tool is to associate computational bottlenecks that arise during program execution with easily identifiable segments of the source code. The usefulness of a profiling tool depends upon the ease in which users can employ this information to alleviate identified bottlenecks within their programs. The success of profiling tools in sequential languages has been predominantly based on the employment of three criteria as the platform on which profiling tools are built. The first of these criteria is ‘what’ is measured; typically this might be the percentage of execution time spent in each part of the program. The second criteria is ‘where’ in the code these costs should be attributed; costs may be associated with functions or libraries for example. The third criteria is ‘how-touse’ the profiling information to optimise programs in a quantifiable and portable way; for example, problematic portions of code may be rewritten using an algorithm with improved asymptotic complexity. The difference between profiling parallel programs as opposed to sequential programs is that parallel programs are executed on a number of processors. Consequently, each part of the code may be associated with up-to p costs, where p is the number of processors. The major challenge for the developers of Where-to-cost Costing communication can be problematic due to the fact that ‘related’ communication costs on different processors may be caused by up-to p different (and interacting) parts of a program. For example, in message-passing systems, there exist p distinct and independently interacting ‘costable’ parts of code. Profiling tools designed for such systems may therefore clutter the user with vast amounts of indigestible information unless careful attention is paid to the design. One such graphical system which suffers from this problem is upshot [2]. How-to-use Most parallel algorithms written today are built upon programming models that have no usable cost model. Therefore, when profile information is used to optimise bottlenecks within programs, care has to be taken that these optimisations are not specifically tailored to a particular machine or architecture. As in the sequential setting, portable optimisation can only be achieved by improving the overall 286 0-8186-8332398 $10.00 0 1998 IEEE Authorized licensed use limited to: WARWICK UNIVERSITY. Downloaded on April 20,2010 at 10:53:46 UTC from IEEE Xplore. Restrictions apply. structure of algorithms in a quantifiable, portable and universal way-without a pragmatic cost model this cannot be realised. operations, to the throughput of the router in terms of words of information delivered; alternatively, g is the single-word delivery cost under continuous message traffic conditions. From the text of a superstep and these two architectural parameters, it is possible to compute the cost of executing a program on a given architecture as follows. In particular, the cost C' of a superstep Sk is captured by the formulae, In this paper it is demonstrated that parallel programs written using the disciplined approach of the BSP model 19-11] are amenable to the three profiling criteria stated above. The development of a, BSP profiling tool is documented. The work motivates the notion of computation and communication balance as the metric by which programs are optimised. It is shown that by minimising imbalance, significant improvements in the algorithmic complexity of parallel algorithms usually follows. This approach provides the foundation upon which portable and architecture independent optimisation can be achieved. The paper is structured as follows: In section 2 the BSP model (and its implementation BSPlib) and its cost calculus are introduced. In section 3 some of the features of BSPlib that facilitate parallel profiling are described. In section 3 a call-graph profiling tool is introduced, and in section 5 two broadcast algorithms are analysed with the tool. 2 Ck=wkthk.g+l where w k= max{ wf I O < i < P ) (1) hk = max{ max(inf, out!) I 0 i < p }, < where k ranges over the supersteps; i ranges over processors; w! is an architecture independent cost that models the maximum number of basic computational operations executed by processor i in the local computation phase of superstep S'; in; (respectively, out!) is the largest accumulated size of all messages entering (respectively, leaving) processor i within superstep S" The total computation cost of a program, is simply the sum of all the superstep costs, Ck. XI, The BSP model 3 The Bulk Synchronous Parallel (BSP) model [lo, 111 views a parallel machine as a set of processormemory pairs, with a global communication network and a mechanism for synchronising all processors. A BSP calculation consists of a sequence of supersteps. Each superstep can be decomposed into three phases: (1)processor-memorypairs perform a number of computations on data held locally at the start of a superstep; (2) processors communicate data into other processor's memories; and (3) all processors barrier synchronise at the end of a superstep. The cost of a BSP program can be calculated simply by summing the costs of each separate superstep executed by the program; in turn, for each superstep, the cost can be decomposed into: (i) local computation; (ii) global exchange of data; and (iii) barrier synchronisation. The maximum number of messages (words) communicated to or from any processor during a superstep is denoted by h, and the complete set of messages is captured in the notion of an hrelation. To ensure that cost analysis can be performed in an architecture independent way, cost formulas are based on the following architecture dependent parameters: p , the number of processors; 1, the minimal time between successive synchronisation operations, measured in terms of basic computational operations; and g, the ratio of the total throughput of the whole system in terms of basic computational Profiling imbalance in parallel programs The BSP model encourages a disciplined use of computation and communication resources, in the sense that all processors perform lock-step phases of computation followed by communication. One way of writing BSP programs is to use existing communication libraries such as PVM or MPI that support non-blocking communications. These general purpose libraries, however, are rarely optimised for the relatively small, but by no means trivial, subset of operations that are required for representing the BSP programming paradigm [S, lo]. To address this problem, the BSP research community has proposed a standard library - BSPlib - for programming within the BSP framework [5]. BSPlib is a small communication library consisting of twenty operations for programming in a SPMD (Single Program Multiple Data) manner. The main features of BSPlib are two modes of communication, one capturing a BSP oriented message passing approach and the other reflecting a one-sided direct remote memory access (DRMA) paradigm. The onesided BSPlib function bsp-put can be used to transfer data from contiguous memory locations on the processor initiating the communication, into contiguous memory locations of a remote processor, without the active participation of the remote processor. The end 287 Authorized licensed use limited to: WARWICK UNIVERSITY. Downloaded on April 20,2010 at 10:53:46 UTC from IEEE Xplore. Restrictions apply. of a superstep is identified by function bsp-sync, at which point all processors barrier synchronise, and any message transmissions issued by processors during the superstep are guaranteed to arrive at their destination. Unlike programs written in a general message passing style, the disciplined nature of BSPlib facilitates profiling in a number of ways: 1. The cost model emphasises that both computation and communication costs should be used as cost metrics when profiling. Computation Time Idle Time cclmmunil:atinn Time Idle Time Barrier S ynchronizatic Figure 1: Superstep structure. 2. The cost of communication within a superstep can be considered en-masse. This greatly simplifies the presentation of profiled results, as communication within a superstep can be attributed to the barrier synchronisation that marks the end of a superstep and not to individual communication actions [3]. communication time; (iii) accumulated idle (waiting) time; and (iv) accumulated h-relation size. The total elapsed time spent at a cost centre is simply the sum of the accumulated maximum communication and computation times. The purpose of the profiling tool is to expose imbalances in either computation or communication, and to highlight those imbalances which are amenable to improvement. It is clear from the BSP cost formulae for program execution that balance is the key to good BSP design: 3. BSP cost analysis is modular and convex, i.e., improvement in the performance of algorithms (as a whole) cannot be achieved by making one part slower. This is important when profiling, as portions of code can be elided and hence make the data visualisation simpler, without the removed parts having any adverse effects on the cost of the remaining supersteps. 4. The disciplined usage of computation and communication, encouraged by the BSP model, and thus BSPlib, suggests that a programming style where processes are required to pass through the same textual bsp-sync for each superstep.’ Consequently, the line-number and filename of the piece of code that contains the bsp-sync statement provides a convenient part of the code to which costs can be attributed. In the following sections two broadcasting algorithms that highlight the salient features of the BSPlib profiling tool are analysed. 4 e balanced computation amongst processes within supersteps is encouraged on the premise that w (see equation 1) is a maximum over local execution times; e balanced communication amongst processes within supersteps as h is a maximum over fan-in and fan-out of messages; e the total number of supersteps should be minimised as each contributes an 1 term to the total execution time. Figure 1 shows a schematic diagram of a BSP superstep and its associated costs. As can be seen from the diagram, idle time can arise in either local computation or communication. For computation, idle time arises when processes have to wait at a barrier synchronisation for the process with the largest amount of work to arrive. Alternatively, idle time may occur during the communication phase of a superstep, as processes have to wait until all processes finish communicating before safely proceeding into the next superstepS2 Criteria for good BSP design A post-mortem call-graph profiling tool has been developed which analyses trace information generated during the execution of BSPlib programs. The parts of the program to which profiling information is assigned are referred to as ‘cost-centres’. For each costcentre in the program, that is the textual position of a bsp-sync call, the following information is recorded: (i) accumulated computation time; (ii) accumulated 21t is noted that idle time during communication depends upon the type of architecture BSPlib is implemented upon. For example, on the DRMA and shared memory architectures (e.g., lThis is more restrictive than the semantics required for BSPlib programs. 288 Authorized licensed use limited to: WARWICK UNIVERSITY. Downloaded on April 20,2010 at 10:53:46 UTC from IEEE Xplore. Restrictions apply. 5 For each cost-centre, p costs are recorded, one per process. This data is presented to the user in one of two ways: Example: broadcasting n values to p processes Consider the problem of communicating a datap ) held on one prostructure of size n (where n cess, into the memories of all p processors. A naive algorithm can be effected in a single superstep by having the broadcasting process perform p - 1 distinct bspquts into the memories of each other process. As the broadcasting process transmits p - 1messages each of size n, the superstep realises an n(p - 1)-relation with approximate cost (assuming p instead of p - 1 in the above formulae): > Summarised data: The accumulated cost within a single cost-centre can be summarised in terms of maximum (the standard BSP interpretation of cost), average, and minimum accumulated costs associated with each of the p processes. More formally, given that a program may pass through a particular cost centre x times generating a sequence of costs (Cl,.. . , C.), the accumulated computation cost for the given costcentre is given by the formulae: maximum cost = E m a x { w; 1 o Ii < p 1 one stage bcast = npg + I , (6) where 1 is the up-front cost of performing a single superstep. (2) k source min { w; I o L: i minimum cost = <p 1 (4) k Similar formulae exist for communication time, idle time, and h-relation size. Superstep All data: The cost associated with each of the p processes is presented to the user as a pie chart. Care has to be taken when interpreting the results shown in this manner as the cost is calculated using a formulae that differs from the standard BSP interpretation of cost. The motivation is that the effect of visualising a pie-chart is to identify the largest (maximal) segment in the chart. The size of this segment is: Superstep Two Figure 3: Two stage broadcast using total exchange. It is clearly seen from the cost-formula captured in equation 6, that this algorithm is not scalable as its cost linearly increases with p. An alternative scalable BSP broadcasting algorithm [I,81, with cost 2 g ( n - ( n / p ) ) 21 is shown in Figure 3. The algorithm consists of two supersteps, that initially evenly distribute the data amongst the processes and subsequently perform a balanced communication involving all the processes. The cost of the distribution phase is ( n / p ) ( p- l ) g I as a single message of size ( n / p ) is sent to each of p - 1 processes. In the second superstep, every process sends and receives p - 1messages of size ( n / p ) from every other process. Surprisingly, the cost of this superstep is also (n/p)(p-l)g+E; note that BSP cost analysis encourages balanced communication. The approximate cost (assuming p instead of p - 1 in the above formulae) of the entire algorithm is determined by summing the cost of the (5) + which is different from the maximum identified by the prior analysis. As can be clearly seen from equations 2 and 5 , the latter equation abstracts the maximum outside the summation, which produces a result that might be smaller than that obtained from the former equation. Although this interpretation is not strictly in line with BSP cost analysis, it is useful in identifying the process that might be causing the bottleneck. + Cray T3D/E and SGI Power Challenge), communication idle time arises as shown in Figure 1. However, BSPJib built on top of architectures that only support message passing (e.g., IBM SP2), results in communication idle time being coalesced with the computation idle time of the following superstep. Refer to [6] for details. 289 Authorized licensed use limited to: WARWICK UNIVERSITY. Downloaded on April 20,2010 at 10:53:46 UTC from IEEE Xplore. Restrictions apply. Figure 2: Sample call-graph profile on a 16 processor Cray T3E. two supersteps: two stage beast = 2 x c -pg +1 ) = 2ng + 21. The call-graph shown in Figure 2 reflects a program that performs 500 iterations of the one-stage broadcast and 500 iterations of the two-stage. To highlight the features of the call-graph profile, the procedures f oo and bar contain procedure calls to the two broadcasting algorithms. In procedure f oo, the one-stage broadcast algorithm is executed 250 times, and subsequently, in procedure bar, the one-stage algorithm is executed 250 times, along with 500 iterations of the two-stage. The graph clearly shows how the costs are inheritred from the leaves of the graph towards the root. That is, the top-level procedure main, records the accumulated computation, communication, and idle cost for all the supersteps within the program; whereas interior nodes in the call-graph record information pertaining to supersteps performed during the lifetime of the procedure identified by the interior node. Leaf nodes record: (i) the textual position of the bsp-sync call within the program; (ii) the number of times a particular superstep is executed; and (iii) summaries of the size of h-relation, computation, communication, and idle cost, in terms of the maximum, average, and minimum cost on p processors. The average and minimum cost is given as a percentage of the maximum cost. Interior nodes record similar information, except that the label of the node is a procedure name, and the accumulated cost is inherited from each of the supersteps executed during that procedure. Notice for the node labelled bcast-onestage that the maximum computation, communication, and idle time are each 19 seconds. That is, a total of 38 seconds is spent in the one-stage broadcast. However, some of the proc- (7) From equations 6 and 7 it is possible to determine that when n > l/(pg - 29) the two-stage algorithm is superior to the one-stage algorithm. For example, when 1 is large, and n and p are small, the cost of the extra superstep out-weighs the cost of communicating a few small messages. On the other hand, for large n or p , the communication cost out-weighs the overhead of the extra superstep. 6 Interpreting call-graph information Figure 2 shows an example call-graph profile for the two broadcast algorithms running on a 16 processor Cray T3E. The call-graph contains a series of interior and leaf nodes. The interior nodes represent procedures entered during program execution, whereas the leaf nodes represent the textual position of supersteps, that is, the line of code containing a bsp-sync. The path from a leaf to the root of the graph identifies the sequence of cost-centres passed through to reach the part of the code that is active when the bsp-sync associated with the given leaf is executed. This path is termed a call-stack and thus a collection of call-stacks comprise a call-graph profile. One of the main advantages of call-graph profiling is that a complete set of unambiguous program costs can be collected at run-time and post-processed. This greatly aids the identification of program bottlenecks. Furthermore, the costs of shared procedures can be accurately apportioned to their parents via a scheme known as inheritance. This allows the programmer to resolve any ambiguities with regard the cost of shared procedures [7]. 290 Authorized licensed use limited to: WARWICK UNIVERSITY. Downloaded on April 20,2010 at 10:53:46 UTC from IEEE Xplore. Restrictions apply. esses spend 19 seconds idling during this 38 secondsthis idle time is due to the processes waiting for the broadcasting process to transmit all the messages. In Figure 2 the absolute imbalance in h-relation size is used to identify critical paths. The tool clearly shows that the one-stage algorithm reflects a large amount of imbalance in communication. We quantify this imbalance in terms of the form (12% I 7%). This identifies that the average cost is 12% of the maximum, whereas the minimum is 7%. Clearly, such small percentages for the average and minimum cost identify large imbalances in the algorithm. The tool correctly identifies that there is a similar imbalance in the first superstep of the two-stage algorithm (initial distribution of data)-it is noted that this imbalance is unavoidable. However, it is worth noting that the tool does not identify this imbalance as severe as the imbalance underlying the one-stage algorithm, as it is caused by a smaller h-relation, i.e., an ( n / p ) O ,- 1)relation compared to an n(p - 1)-relation. Finally in the last superstep of the two-stage broadcast, there is virtually no imbalance (100% I 100%) within the node labelled bcast c l i n e 41. Identifying critical paths 7 The scope of the profiling tool is not limited to merely visualising the computation and communications patterns at each of the cost-centres, but aims to identify critical cost paths within programs. Critical paths are visualised by shading each of the nodes in the graph with a colour ranging from white to red. A red node corresponds to a bottleneck (or ‘hot-spot’!) in the program (in this paper, colours have been replaced by grey-scales from white to dark grey). There are seventeen different critical paths that can be identified by the profiling tool. The simplest is the synchronisation critical path that identifies nodes in the graph containing the most supersteps. For computation, communication, idle time and h-relation size, four different kinds of critical path can be identified: e Absolute - identifies the nodes with the largest maximum cost. e Absolute imbalance - identifies the nodes with the largest difference between the maximum and average cost. . 8 An architecture independent metric for critical paths In the introduction it was stated that care has to be taken when optimising programs based on profile information. A problem with most commercially available profiling tools is that profile data is often errorprone, especially when wall-clock time is used as a cost metric. One of the important features of BSP is that the size of h-relations directly influences the cost of communication. Therefore, instead of using actual communication time as a cost metric, which may include errors, the predicted cost of communication, hg + 1 , is used. This is error-free as the value of h, which is not affected by the choice of the underlying machine or architecture, is accurately recorded at runtime. Therefore, our hypothesis is encapsulated in the notion that the imbalance in maximum and average h-relation can be effectively used as the metric by which BSP programs are optimised and optimal architecture independent parallel algorithms can be developed. The hypothesis is reinforced by both the BSP cost analysis formulae and experimental results. The cost of the two broadcasting algorithms support this hypothesis. It can be clearly seen from the nodes in Figure 2 labelled one-stage-bcast and t w o s t age-broadcast, that the two-stage broadcast is superior to the one stage, as it is revealed by the accumulated values for the computation, communication and idle costs. For example, the performance on a 16 processor Cray T3E gives an improvement of: 19.48 19.55 = 4.15, 4.73 4.67 Relative imbalance - identifies the nodes with the largest percentage-wise deviation between maximum and average cost. Weighted - identifies the nodes with both the largest difference between the maximum and average cost and the largest percentage-wise deviation between maximum and average cost. Informally, this path combines the prior two critical paths. The role of the absolute critical path is to identify those nodes that constitute the major components (cost-wise) within the program. In contrast, the absolute imbalance path highlights those nodes that are amenable to improvement due to the underlying imbalance in the maximum and average cost. However, the problem with this metric is that nodes with large cost value and small deviation might be identified as ‘more critical’ than nodes with smaller cost value but larger deviation. As the later of these nodes are more amenable to improvement, the relative imbalance critical path is useful in determining the nodes that are imbalanced, irrespective of their size. Finally, the weighted critical path combines the advantages of the previous two approaches. + + 291 Authorized licensed use limited to: WARWICK UNIVERSITY. Downloaded on April 20,2010 at 10:53:46 UTC from IEEE Xplore. Restrictions apply. Figure 4: Screen shot from a profile of an SQL application 9 Conclusions [4] 3 . M. D. Hill, S. Jarvis, C. Siniolakis, and V. P. Vasilev. Portable and architecture independent parallel performance tuning using a call-graph profiling tool: a case study in optimising SQL. Technical Report 17-97, Programming Research Group, Oxford University Computing Laboratory, May 1997. [5] J. M. D. Hill, B. McColl, D. C. Stefanescu, M. W. Goudreau, K. Lang, S. B. Rao, T. Suel, T. Tsantilas, and R. Bisseling. BSPlib: The BSP Programming Library. Technical Report PRG-TR-29-9, Oxford University Computing Laboratory, May 1997. see www .bsp-worldwide .org for more details. [6] J. M. D. Hill and D. Skillicorn. Lessons learned from implementing BSP. Journal of Future Generation Computer Systems, April 1998. [7] S.A. Jarvis. Profiling large-scale lazy functional programs. PhD thesis, Computer Science Department, University of Durham, 1996. 181 B. H. H. Juurlink and H. A. G . Wiishoff. Communication primitives for BSP computers. Information Processing Letters, 58:303-310, 1996. [9] W. F. McColl. Scalable computing. In J. van Leeuwen, editor, Computer Science Today: Recent Trends and Developments, number 1000 in LNCS, pages 46-61. SpringerVerlag, 1995. [lo] D. Skillicorn, 3 . M. D. Hill, and W. F. McColl. Questions and answers about BSP. Scientific Programming, 1997. [11] L, G, Valiant. A bridging model for parallel computation. CACM, 33(8):103-111, August 1990. The performance improvements resulting from the analysis of the call-graph profiles demonstrate that the tool can be used to optimise programs in a portable and architecture independent manner. These conclusions are further reinforced in [4]where the steps involved in optimising an SQL application (also see Figure 4) are described. Unlike other profiling tools, an architecture independent metric - h-relation size - guides the optimisation process. The major benefit of this profiling tool is that the amount of information displayed when visualising a profile for a parallel program is no more complex than that of a sequential program. References & M. Barnett, D. Payne, R. van de Geijn, and J. Watts. Broadcasting on meshes with wormhole routing. Journal of Parallel and Distributed Computing, 35(2):111-122, 1996. V. Heearte and E. Lusk. Studying parallel program behaviour with upshot. Technical Report ANL91/15, Argonne National Lab, Argonne, 11. 60439, 1991. J. M. D. Hill, P. I. Crumpton, and D. A. Burgess. Theory, practice, and a for BSP performance prediction’ In EuroPar’96, number 1124 in LNCS, pages 697-705, Lyon, France, Aug. 1996. Springer-Verlag. , 292 Authorized licensed use limited to: WARWICK UNIVERSITY. Downloaded on April 20,2010 at 10:53:46 UTC from IEEE Xplore. Restrictions apply.