Portable and architecture independent parallel performance tuning
using a call-graph profiling tool
Stephen A. Jarvis
Constantinos J. Siniolakis
Vasil P. Vasilev
Oxford University Computing Laboratory,
Wolfson Building, Parks Road, Oxford, UK.
Jonathan M.D. Hill
Abstract
This paper describes a post-mortem call-graph profiling tool that analyses trace information generated during the execution of BSPlib programs. The purpose of the tool is to expose imbalance in either computation or communication, and to highlight portions of code that are amenable to improvement. Unlike other profiling tools, the profile information guides optimisation in an architecture independent way. From an ease of use perspective, the amount of information displayed when visualising a profile for a parallel program is no more complex than that of a sequential program.
Introduction
The role of a profiling tool is to associate computational bottlenecks that arise during program execution with easily identifiable segments of the source code. The usefulness of a profiling tool depends upon the ease with which users can employ this information to alleviate identified bottlenecks within their programs.
The success of profiling tools in sequential languages has been predominantly based on three criteria on which such tools are built. The first of these criteria is ‘what’ is measured; typically this might be the percentage of execution time spent in each part of the program. The second criterion is ‘where’ in the code these costs should be attributed; costs may be associated with functions or libraries, for example. The third criterion is ‘how-to-use’ the profiling information to optimise programs in a quantifiable and portable way; for example, problematic portions of code may be rewritten using an algorithm with improved asymptotic complexity.
The difference between profiling parallel programs and profiling sequential programs is that parallel programs are executed on a number of processors. Consequently, each part of the code may be associated with up to p costs, where p is the number of processors. The major challenge for the developers of profiling tools for parallel languages is to identify and expose the relationship (imbalance) of computational costs amongst processors, and subsequently express this relationship in terms of the three criteria outlined above. Unfortunately, within a parallel framework, there is a multiplicity of interacting issues that make these criteria significantly more obscure and complex:
What-to-cost: In parallel programming there are at least two kinds of cost which can cause bottlenecks within programs, computation and communication. These costs should not be decoupled and profiled independently, as it is of paramount importance that the interaction between the two is identified and exposed to the user. The motivation is that when programs are optimised with respect to one of these costs, it should not be to the detriment of the other.
Where-to-cost: Costing communication can be problematic because ‘related’ communication costs on different processors may be caused by up to p different (and interacting) parts of a program. For example, in message-passing systems, there exist p distinct and independently interacting ‘costable’ parts of code. Profiling tools designed for such systems may therefore overwhelm the user with vast amounts of indigestible information unless careful attention is paid to the design. One graphical system that suffers from this problem is upshot [2].
How-to-use: Most parallel algorithms written today are built upon programming models that have no usable cost model. Therefore, when profile information is used to optimise bottlenecks within programs, care has to be taken that these optimisations are not specifically tailored to a particular machine or architecture. As in the sequential setting, portable optimisation can only be achieved by improving the overall structure of algorithms in a quantifiable, portable and universal way; without a pragmatic cost model this cannot be realised.
0-8186-8332398 $10.00 © 1998 IEEE
Authorized licensed use limited to: WARWICK UNIVERSITY. Downloaded on April 20,2010 at 10:53:46 UTC from IEEE Xplore. Restrictions apply.
In this paper it is demonstrated that parallel programs written using the disciplined approach of the BSP model [9-11] are amenable to the three profiling criteria stated above. The development of a BSP profiling tool is documented. The work motivates the notion of computation and communication balance as the metric by which programs are optimised. It is shown that by minimising imbalance, significant improvements in the algorithmic complexity of parallel algorithms usually follow. This approach provides the foundation upon which portable and architecture independent optimisation can be achieved.
The paper is structured as follows: in section 2 the BSP model (and its implementation BSPlib) and its cost calculus are introduced. In section 3 some of the features of BSPlib that facilitate parallel profiling are described. In section 4 a call-graph profiling tool is introduced, and in section 5 two broadcast algorithms are analysed with the tool.
The BSP model
The Bulk Synchronous Parallel (BSP) model [10, 11] views a parallel machine as a set of processor-memory pairs, with a global communication network and a mechanism for synchronising all processors. A BSP calculation consists of a sequence of supersteps. Each superstep can be decomposed into three phases: (1) processor-memory pairs perform a number of computations on data held locally at the start of a superstep; (2) processors communicate data into other processors' memories; and (3) all processors barrier synchronise at the end of a superstep.
The cost of a BSP program can be calculated simply by summing the costs of each separate superstep executed by the program; in turn, for each superstep, the cost can be decomposed into: (i) local computation; (ii) global exchange of data; and (iii) barrier synchronisation. The maximum number of messages (words) communicated to or from any processor during a superstep is denoted by h, and the complete set of messages is captured in the notion of an h-relation. To ensure that cost analysis can be performed in an architecture independent way, cost formulae are based on the following architecture dependent parameters: p, the number of processors; l, the minimal time between successive synchronisation operations, measured in terms of basic computational operations; and g, the ratio of the total throughput of the whole system in terms of basic computational operations, to the throughput of the router in terms of words of information delivered; alternatively, g is the single-word delivery cost under continuous message traffic conditions.
From the text of a superstep and these two architectural parameters, it is possible to compute the cost of executing a program on a given architecture as follows. In particular, the cost $C^k$ of a superstep $S^k$ is captured by the formulae,
$$C^k = w^k + h^k g + l \quad\text{where}\quad w^k = \max\{\, w_i^k \mid 0 \le i < p \,\}, \qquad (1)$$
$$h^k = \max\{\, \max(in_i^k, out_i^k) \mid 0 \le i < p \,\},$$
where k ranges over the supersteps; i ranges over processors; $w_i^k$ is an architecture independent cost that models the maximum number of basic computational operations executed by processor i in the local computation phase of superstep $S^k$; $in_i^k$ (respectively, $out_i^k$) is the largest accumulated size of all messages entering (respectively, leaving) processor i within superstep $S^k$. The total computation cost of a program is simply the sum of all the superstep costs,
$$\sum_{k} C^k.$$
Profiling imbalance in parallel programs
The BSP model encourages a disciplined use of
computation and communication resources, in the
sense that all processors perform lock-step phases of
computation followed by communication. One way
of writing BSP programs is to use existing communication libraries such as PVM or MPI that support
non-blocking communications. These general purpose
libraries, however, are rarely optimised for the relatively small, but by no means trivial, subset of operations that are required for representing the BSP programming paradigm [S, 10]. To address this problem,
the BSP research community has proposed a standard
library - BSPlib - for programming within the BSP
framework [5].
BSPlib is a small communication library consisting of twenty operations for programming in an SPMD (Single Program Multiple Data) manner. The main features of BSPlib are two modes of communication, one capturing a BSP oriented message passing approach and the other reflecting a one-sided direct remote memory access (DRMA) paradigm. The one-sided BSPlib function bsp_put can be used to transfer data from contiguous memory locations on the processor initiating the communication, into contiguous memory locations of a remote processor, without the active participation of the remote processor. The end of a superstep is identified by the function bsp_sync, at which point all processors barrier synchronise, and any message transmissions issued by processors during the superstep are guaranteed to arrive at their destination.
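The delivery rule described here (one-sided puts become visible at the barrier) can be illustrated with a small single-threaded model. This is only a sketch written in Python for illustration: the class `ToyBSP` and its methods `put` and `sync` are hypothetical names, not the BSPlib API, and the model ignores real concurrency, registration of memory areas and byte-level addressing.

```python
class ToyBSP:
    """Single-threaded toy model of one-sided puts delivered at a barrier."""

    def __init__(self, nprocs, mem_size):
        self.p = nprocs
        self.memory = [[0] * mem_size for _ in range(nprocs)]
        self.pending = []   # puts issued during the current superstep
        self.h_sizes = []   # h-relation size of each completed superstep

    def put(self, src_pid, dst_pid, values, offset):
        # Buffer the transfer; the destination process takes no action.
        self.pending.append((src_pid, dst_pid, list(values), offset))

    def sync(self):
        # Barrier: deliver all buffered puts and record the h-relation size.
        sent = [0] * self.p
        received = [0] * self.p
        for src, dst, values, offset in self.pending:
            self.memory[dst][offset:offset + len(values)] = values
            sent[src] += len(values)
            received[dst] += len(values)
        self.pending = []
        h = max(max(i, o) for i, o in zip(received, sent))
        self.h_sizes.append(h)
        return h


bsp = ToyBSP(nprocs=4, mem_size=3)
bsp.memory[0] = [7, 8, 9]
for dst in range(1, 4):
    bsp.put(0, dst, bsp.memory[0], 0)
before_sync = [row[:] for row in bsp.memory]  # remote memories still unchanged
h = bsp.sync()                                # data now visible everywhere
```

In this model the whole superstep is summarised by a single number, the h-relation size returned by `sync`, which is the property the profiling discussion relies on.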
Unlike programs written in a general message
passing style, the disciplined nature of BSPlib facilitates profiling in a number of ways:
1. The cost model emphasises that both computation and communication costs should be used as cost metrics when profiling.
2. The cost of communication within a superstep can be considered en masse. This greatly simplifies the presentation of profiled results, as communication within a superstep can be attributed to the barrier synchronisation that marks the end of a superstep and not to individual communication actions [3].
3. BSP cost analysis is modular and convex, i.e., improvement in the performance of algorithms (as a whole) cannot be achieved by making one part slower. This is important when profiling, as portions of code can be elided to make the data visualisation simpler, without the removed parts having any adverse effects on the cost of the remaining supersteps.
4. The disciplined usage of computation and communication, encouraged by the BSP model, and thus BSPlib, suggests a programming style where processes are required to pass through the same textual bsp_sync for each superstep.¹ Consequently, the line-number and filename of the piece of code that contains the bsp_sync statement provide a convenient part of the code to which costs can be attributed.
In the following sections two broadcasting algorithms that highlight the salient features of the BSPlib profiling tool are analysed.
Criteria for good BSP design
A post-mortem call-graph profiling tool has been developed which analyses trace information generated during the execution of BSPlib programs. The parts of the program to which profiling information is assigned are referred to as ‘cost-centres’. For each cost-centre in the program, that is, the textual position of a bsp_sync call, the following information is recorded: (i) accumulated computation time; (ii) accumulated communication time; (iii) accumulated idle (waiting) time; and (iv) accumulated h-relation size. The total elapsed time spent at a cost-centre is simply the sum of the accumulated maximum communication and computation times.
The purpose of the profiling tool is to expose imbalances in either computation or communication, and to highlight those imbalances which are amenable to improvement. It is clear from the BSP cost formulae for program execution that balance is the key to good BSP design:
• balanced computation amongst processes within supersteps is encouraged on the premise that w (see equation 1) is a maximum over local execution times;
• balanced communication amongst processes within supersteps is encouraged as h is a maximum over fan-in and fan-out of messages;
• the total number of supersteps should be minimised as each contributes an l term to the total execution time.
Figure 1 shows a schematic diagram of a BSP superstep and its associated costs. As can be seen from the diagram, idle time can arise in either local computation or communication. For computation, idle time arises when processes have to wait at a barrier synchronisation for the process with the largest amount of work to arrive. Alternatively, idle time may occur during the communication phase of a superstep, as processes have to wait until all processes finish communicating before safely proceeding into the next superstep.²
Figure 1: Superstep structure.
¹ This is more restrictive than the semantics required for BSPlib programs.
² It is noted that idle time during communication depends upon the type of architecture BSPlib is implemented upon. For example, on DRMA and shared memory architectures (e.g., Cray T3D/E and SGI Power Challenge), communication idle time arises as shown in Figure 1. However, BSPlib built on top of architectures that only support message passing (e.g., IBM SP2) results in communication idle time being coalesced with the computation idle time of the following superstep. Refer to [6] for details.
For each cost-centre, p costs are recorded, one per process. This data is presented to the user in one of two ways:
Summarised data: The accumulated cost within a single cost-centre can be summarised in terms of the maximum (the standard BSP interpretation of cost), average, and minimum accumulated costs associated with each of the p processes. More formally, given that a program may pass through a particular cost-centre x times generating a sequence of costs $(C^1, \ldots, C^x)$, the accumulated computation cost for the given cost-centre is given by the formulae:
$$\text{maximum cost} = \sum_{k=1}^{x} \max\{\, w_i^k \mid 0 \le i < p \,\} \qquad (2)$$
$$\text{average cost} = \sum_{k=1}^{x} \frac{1}{p} \sum_{i=0}^{p-1} w_i^k \qquad (3)$$
$$\text{minimum cost} = \sum_{k=1}^{x} \min\{\, w_i^k \mid 0 \le i < p \,\} \qquad (4)$$
Similar formulae exist for communication time, idle time, and h-relation size.
All data: The cost associated with each of the p processes is presented to the user as a pie chart. Care has to be taken when interpreting the results shown in this manner, as the cost is calculated using a formula that differs from the standard BSP interpretation of cost. The motivation is that the effect of visualising a pie chart is to identify the largest (maximal) segment in the chart. The size of this segment is:
$$\max\Big\{\, \sum_{k=1}^{x} w_i^k \;\Big|\; 0 \le i < p \,\Big\} \qquad (5)$$
which is different from the maximum identified by the prior analysis. As can be clearly seen from equations 2 and 5, the latter equation abstracts the maximum outside the summation, which produces a result that might be smaller than that obtained from the former equation. Although this interpretation is not strictly in line with BSP cost analysis, it is useful in identifying the process that might be causing the bottleneck.
Figure 2: Sample call-graph profile on a 16 processor Cray T3E.
Example: broadcasting n values to p processes
Consider the problem of communicating a data-structure of size n (where n ≥ p) held on one process, into the memories of all p processors. A naive algorithm can be effected in a single superstep by having the broadcasting process perform p - 1 distinct bsp_puts into the memories of each other process. As the broadcasting process transmits p - 1 messages each of size n, the superstep realises an n(p - 1)-relation with approximate cost (assuming p instead of p - 1 in the above formulae):
$$\text{one stage bcast} = npg + l, \qquad (6)$$
where l is the up-front cost of performing a single superstep.
Figure 3: Two stage broadcast using total exchange.
It is clearly seen from the cost formula captured in equation 6 that this algorithm is not scalable, as its cost increases linearly with p. An alternative scalable BSP broadcasting algorithm [1, 8], with cost 2g(n - (n/p)) + 2l, is shown in Figure 3. The algorithm consists of two supersteps that initially evenly distribute the data amongst the processes and subsequently perform a balanced communication involving all the processes. The cost of the distribution phase is (n/p)(p - 1)g + l, as a single message of size (n/p) is sent to each of p - 1 processes. In the second superstep, every process exchanges a message of size (n/p) with every other process, thereby sending and receiving p - 1 messages. Surprisingly, the cost of this superstep is also (n/p)(p - 1)g + l; note that BSP cost analysis encourages balanced communication. The approximate cost (assuming p instead of p - 1 in the above formulae) of the entire algorithm is determined by summing the cost of the two supersteps:
$$\text{two stage bcast} = 2 \times \left( \frac{n}{p}\, p\, g + l \right) = 2ng + 2l. \qquad (7)$$
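The two approximate cost formulae for the broadcasts can be compared numerically. The sketch below simply evaluates npg + l and 2ng + 2l and computes the break-even message size obtained by setting them equal; the function names and the values chosen for p, g and l are illustrative assumptions, not measurements of any machine.

```python
def one_stage_cost(n, p, g, l):
    """Approximate cost of the one-stage broadcast: n*p*g + l."""
    return n * p * g + l


def two_stage_cost(n, g, l):
    """Approximate cost of the two-stage broadcast: 2*n*g + 2*l."""
    return 2 * n * g + 2 * l


def break_even_n(p, g, l):
    """Size n at which both costs are equal: n = l / (p*g - 2*g), for p > 2."""
    if p <= 2:
        raise ValueError("the two-stage formula is never cheaper when p <= 2")
    return l / (p * g - 2 * g)


# Hypothetical machine parameters, chosen only for illustration.
p, g, l = 16, 2.0, 1000.0
small = (one_stage_cost(10, p, g, l), two_stage_cost(10, g, l))
large = (one_stage_cost(100, p, g, l), two_stage_cost(100, g, l))
threshold = break_even_n(p, g, l)
```

With these made-up parameters the one-stage broadcast is cheaper for n = 10 and the two-stage broadcast is cheaper for n = 100, with the break-even point lying between the two.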
From equations 6 and 7 it is possible to determine that when n > l/(pg - 2g) the two-stage algorithm is superior to the one-stage algorithm. For example, when l is large, and n and p are small, the cost of the extra superstep outweighs the cost of communicating a few small messages. On the other hand, for large n or p, the communication cost outweighs the overhead of the extra superstep.
Interpreting call-graph information
Figure 2 shows an example call-graph profile for the two broadcast algorithms running on a 16 processor Cray T3E. The call-graph contains a series of interior and leaf nodes. The interior nodes represent procedures entered during program execution, whereas the leaf nodes represent the textual position of supersteps, that is, the line of code containing a bsp_sync. The path from a leaf to the root of the graph identifies the sequence of cost-centres passed through to reach the part of the code that is active when the bsp_sync associated with the given leaf is executed. This path is termed a call-stack, and thus a collection of call-stacks comprises a call-graph profile. One of the main advantages of call-graph profiling is that a complete set of unambiguous program costs can be collected at run-time and post-processed. This greatly aids the identification of program bottlenecks. Furthermore, the costs of shared procedures can be accurately apportioned to their parents via a scheme known as inheritance. This allows the programmer to resolve any ambiguities with regard to the cost of shared procedures [7].
The call-graph shown in Figure 2 reflects a program that performs 500 iterations of the one-stage broadcast and 500 iterations of the two-stage. To highlight the features of the call-graph profile, the procedures foo and bar contain procedure calls to the two broadcasting algorithms. In procedure foo, the one-stage broadcast algorithm is executed 250 times, and subsequently, in procedure bar, the one-stage algorithm is executed 250 times, along with 500 iterations of the two-stage algorithm.
The graph clearly shows how the costs are inherited from the leaves of the graph towards the root. That is, the top-level procedure main records the accumulated computation, communication, and idle cost for all the supersteps within the program, whereas interior nodes in the call-graph record information pertaining to supersteps performed during the lifetime of the procedure identified by the interior node.
Leaf nodes record: (i) the textual position of the bsp_sync call within the program; (ii) the number of times a particular superstep is executed; and (iii) summaries of the size of h-relation, computation, communication, and idle cost, in terms of the maximum, average, and minimum cost on p processors. The average and minimum cost are given as a percentage of the maximum cost.
Interior nodes record similar information, except that the label of the node is a procedure name, and the accumulated cost is inherited from each of the supersteps executed during that procedure. Notice for the node labelled bcast-onestage that the maximum computation, communication, and idle time are each 19 seconds. That is, a total of 38 seconds is spent in the one-stage broadcast. However, some of the processes spend 19 seconds idling during this 38 seconds; this idle time is due to the processes waiting for the broadcasting process to transmit all the messages.
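The inheritance of superstep costs from leaves towards the root can be sketched as follows. This is an illustrative reconstruction, not the tool's implementation: it assumes each trace sample carries a call-stack (root first, leaf cost-centre last) and one cost per process, and the names `build_call_graph`, `summarise` and the label `bcast.c:10` are hypothetical. Keying every node by its full stack prefix gives a shared procedure a separate node under each parent, which is one simple way to keep apportioned costs unambiguous.

```python
from collections import defaultdict


def build_call_graph(samples):
    """Accumulate per-superstep costs into every node on the call-stack.

    samples: iterable of (call_stack, per_process_costs) pairs, where
    call_stack is a tuple such as ("main", "foo", "bcast.c:10").
    """
    nodes = defaultdict(
        lambda: {"max": 0.0, "avg": 0.0, "min": 0.0, "supersteps": 0}
    )
    for stack, costs in samples:
        mx, mn = max(costs), min(costs)
        av = sum(costs) / len(costs)
        # Each prefix of the stack is a node that inherits this superstep.
        for depth in range(1, len(stack) + 1):
            node = nodes[stack[:depth]]
            node["max"] += mx
            node["avg"] += av
            node["min"] += mn
            node["supersteps"] += 1
    return dict(nodes)


def summarise(node):
    """Return (max, average as % of max, minimum as % of max)."""
    mx = node["max"]
    if mx == 0:
        return (0.0, 100.0, 100.0)
    return (mx, 100.0 * node["avg"] / mx, 100.0 * node["min"] / mx)


samples = [
    (("main", "foo", "bcast.c:10"), [4.0, 1.0, 1.0, 2.0]),
    (("main", "bar", "bcast.c:10"), [2.0, 2.0, 2.0, 2.0]),
]
graph = build_call_graph(samples)
root = summarise(graph[("main",)])
foo = summarise(graph[("main", "foo")])
bar_leaf = summarise(graph[("main", "bar", "bcast.c:10")])
```

Here the same textual superstep appears under both `foo` and `bar`, and the root node accumulates the costs of both visits, while each parent sees only the supersteps executed during its own lifetime.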
In Figure 2 the absolute imbalance in h-relation size is used to identify critical paths. The tool clearly shows that the one-stage algorithm exhibits a large amount of imbalance in communication. We quantify this imbalance in the form (12% | 7%), which indicates that the average cost is 12% of the maximum, whereas the minimum is 7%. Clearly, such small percentages for the average and minimum cost identify large imbalances in the algorithm. The tool correctly identifies that there is a similar imbalance in the first superstep of the two-stage algorithm (initial distribution of data); it is noted that this imbalance is unavoidable. However, it is worth noting that the tool does not identify this imbalance as being as severe as the imbalance underlying the one-stage algorithm, as it is caused by a smaller h-relation, i.e., an (n/p)(p - 1)-relation compared to an n(p - 1)-relation. Finally, in the last superstep of the two-stage broadcast, there is virtually no imbalance (100% | 100%) within the node labelled bcast.c line 41.
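The imbalance percentages can be reproduced from the communication pattern alone. The sketch below assumes process 0 is the broadcaster, takes h for each process as the larger of its incoming and outgoing volume, and uses n = 1600 and p = 16 purely as example values; the function names are hypothetical.

```python
def h_per_process(sent, received):
    """h_i = max(in_i, out_i) for every process i."""
    return [max(out, inc) for out, inc in zip(sent, received)]


def imbalance(values):
    """Return (maximum, average as % of maximum, minimum as % of maximum)."""
    mx = max(values)
    avg = sum(values) / len(values)
    return mx, 100.0 * avg / mx, 100.0 * min(values) / mx


def one_stage(n, p):
    # Process 0 sends n words to each of the other p - 1 processes.
    sent = [n * (p - 1)] + [0] * (p - 1)
    received = [0] + [n] * (p - 1)
    return h_per_process(sent, received)


def two_stage_distribute(n, p):
    # Process 0 sends a chunk of n/p words to each other process.
    chunk = n // p
    sent = [chunk * (p - 1)] + [0] * (p - 1)
    received = [0] + [chunk] * (p - 1)
    return h_per_process(sent, received)


def two_stage_exchange(n, p):
    # Every process sends its chunk to, and receives one from, all others.
    chunk = n // p
    volume = [chunk * (p - 1)] * p
    return h_per_process(volume, volume)


one = imbalance(one_stage(1600, 16))
dist = imbalance(two_stage_distribute(1600, 16))
exch = imbalance(two_stage_exchange(1600, 16))
```

For 16 processes this simple model gives an average of 12.5% and a minimum of about 6.7% of the maximum for the one-stage broadcast, which is consistent with the roughly 12% and 7% quoted for the profile, and the same ratios on a p-times smaller h-relation for the distribution superstep of the two-stage broadcast.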
Identifying critical paths
The scope of the profiling tool is not limited to merely visualising the computation and communication patterns at each of the cost-centres, but aims to identify critical cost paths within programs. Critical paths are visualised by shading each of the nodes in the graph with a colour ranging from white to red. A red node corresponds to a bottleneck (or ‘hot-spot’) in the program (in this paper, colours have been replaced by grey-scales from white to dark grey). There are seventeen different critical paths that can be identified by the profiling tool. The simplest is the synchronisation critical path, which identifies nodes in the graph containing the most supersteps. For each of computation, communication, idle time and h-relation size, four different kinds of critical path can be identified:
• Absolute - identifies the nodes with the largest maximum cost.
• Absolute imbalance - identifies the nodes with the largest difference between the maximum and average cost.
• Relative imbalance - identifies the nodes with the largest percentage-wise deviation between maximum and average cost.
• Weighted - identifies the nodes with both the largest difference between the maximum and average cost and the largest percentage-wise deviation between maximum and average cost. Informally, this path combines the prior two critical paths.
The role of the absolute critical path is to identify those nodes that constitute the major components (cost-wise) within the program. In contrast, the absolute imbalance path highlights those nodes that are amenable to improvement due to the underlying imbalance in the maximum and average cost. However, the problem with this metric is that nodes with large cost value and small deviation might be identified as ‘more critical’ than nodes with smaller cost value but larger deviation. As the latter of these nodes are more amenable to improvement, the relative imbalance critical path is useful in determining the nodes that are imbalanced, irrespective of their size. Finally, the weighted critical path combines the advantages of the previous two approaches.
An architecture independent metric for critical paths
In the introduction it was stated that care has to be taken when optimising programs based on profile information. A problem with most commercially available profiling tools is that profile data is often error-prone, especially when wall-clock time is used as a cost metric. One of the important features of BSP is that the size of h-relations directly influences the cost of communication. Therefore, instead of using actual communication time as a cost metric, which may include errors, the predicted cost of communication, hg + l, is used. This is error-free as the value of h, which is not affected by the choice of the underlying machine or architecture, is accurately recorded at runtime. Therefore, our hypothesis is encapsulated in the notion that the imbalance in maximum and average h-relation can be effectively used as the metric by which BSP programs are optimised and optimal architecture independent parallel algorithms can be developed. The hypothesis is reinforced by both the BSP cost analysis formulae and experimental results.
The cost of the two broadcasting algorithms supports this hypothesis. It can be clearly seen from the nodes in Figure 2 labelled one-stage-bcast and two-stage-broadcast that the two-stage broadcast is superior to the one-stage, as revealed by the accumulated values for the computation, communication and idle costs. For example, the performance on a 16 processor Cray T3E gives an improvement of:
$$\frac{19.48 + 19.55}{4.73 + 4.67} = 4.15.$$
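The four kinds of critical path described in this paper can be made concrete with a small ranking sketch, applied here to maximum and average h-relation sizes. The text does not give the exact formulae, so two choices below are assumptions made only for illustration: the relative imbalance is normalised by the maximum, and the weighted score is the product of the absolute and relative imbalances. The node labels and numbers are invented.

```python
def absolute(mx, avg):
    """Largest maximum cost."""
    return mx


def absolute_imbalance(mx, avg):
    """Difference between maximum and average cost."""
    return mx - avg


def relative_imbalance(mx, avg):
    """Deviation between maximum and average as a fraction of the maximum."""
    return 0.0 if mx == 0 else (mx - avg) / mx


def weighted(mx, avg):
    # Assumption: a product rewards nodes that score highly on both measures.
    return absolute_imbalance(mx, avg) * relative_imbalance(mx, avg)


def rank(nodes, metric):
    """nodes: {label: (max_cost, avg_cost)} -> labels, most critical first."""
    return sorted(nodes, key=lambda label: metric(*nodes[label]), reverse=True)


# Invented h-relation sizes (maximum, average) for three cost-centres.
nodes = {
    "big_balanced": (1000.0, 900.0),
    "small_skewed": (100.0, 20.0),
    "big_skewed": (800.0, 100.0),
}
by_absolute = rank(nodes, absolute)
by_abs_imbalance = rank(nodes, absolute_imbalance)
by_rel_imbalance = rank(nodes, relative_imbalance)
by_weighted = rank(nodes, weighted)
```

In this example the absolute imbalance ranks the large, nearly balanced node above the small, badly skewed one, which is the weakness noted in the text; the relative and weighted rankings place the skewed nodes first.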
Conclusions
The performance improvements resulting from the analysis of the call-graph profiles demonstrate that the tool can be used to optimise programs in a portable and architecture independent manner. These conclusions are further reinforced in [4], where the steps involved in optimising an SQL application (also see Figure 4) are described. Unlike other profiling tools, an architecture independent metric - h-relation size - guides the optimisation process. The major benefit of this profiling tool is that the amount of information displayed when visualising a profile for a parallel program is no more complex than that of a sequential program.
Figure 4: Screen shot from a profile of an SQL application
References
[1] M. Barnett, D. Payne, R. van de Geijn, and J. Watts. Broadcasting on meshes with wormhole routing. Journal of Parallel and Distributed Computing, 35(2):111-122, 1996.
[2] V. Herrarte and E. Lusk. Studying parallel program behaviour with upshot. Technical Report ANL91/15, Argonne National Lab, Argonne, IL 60439, 1991.
[3] J. M. D. Hill, P. I. Crumpton, and D. A. Burgess. Theory, practice, and a tool for BSP performance prediction. In EuroPar'96, number 1124 in LNCS, pages 697-705, Lyon, France, Aug. 1996. Springer-Verlag.
[4] J. M. D. Hill, S. Jarvis, C. Siniolakis, and V. P. Vasilev. Portable and architecture independent parallel performance tuning using a call-graph profiling tool: a case study in optimising SQL. Technical Report 17-97, Programming Research Group, Oxford University Computing Laboratory, May 1997.
[5] J. M. D. Hill, B. McColl, D. C. Stefanescu, M. W. Goudreau, K. Lang, S. B. Rao, T. Suel, T. Tsantilas, and R. Bisseling. BSPlib: The BSP Programming Library. Technical Report PRG-TR-29-9, Oxford University Computing Laboratory, May 1997. See www.bsp-worldwide.org for more details.
[6] J. M. D. Hill and D. Skillicorn. Lessons learned from implementing BSP. Journal of Future Generation Computer Systems, April 1998.
[7] S. A. Jarvis. Profiling large-scale lazy functional programs. PhD thesis, Computer Science Department, University of Durham, 1996.
[8] B. H. H. Juurlink and H. A. G. Wijshoff. Communication primitives for BSP computers. Information Processing Letters, 58:303-310, 1996.
[9] W. F. McColl. Scalable computing. In J. van Leeuwen, editor, Computer Science Today: Recent Trends and Developments, number 1000 in LNCS, pages 46-61. Springer-Verlag, 1995.
[10] D. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and answers about BSP. Scientific Programming, 1997.
[11] L. G. Valiant. A bridging model for parallel computation. CACM, 33(8):103-111, August 1990.