Software Pipelining
HM
(11/14/2011)
Software Pipelining (SW PL) is a performance enhancing technique that
exploits instruction-level parallelism on a wide instruction word to accomplish
execution speed-up. As the name implies, pipelining here is not primarily
achieved through hardware, but rather refers to the interleaved execution of
distinct iterations of a SW loop. Parts of multiple iterations of a source
program’s loop body are executed simultaneously in a pipelined fashion.
This requires that the target machine has multiple HW modules that can
execute simultaneously. For example, an LIW architecture has instructions
that perform, say, a floating point multiply at the same time they execute a
floating point add, and perhaps also an integer addition plus a load
operation. We have seen this on a smaller scale with superscalar machines.
Similarly, a VLIW architecture provides instructions that do all of the above
plus a store, several other integer operations, and perhaps loop instructions.
Very roughly, the boundary between VLIW and LIW is 128 opcode bits.
Synopsis
 Motivation
 Outline of Software Pipelining
 General Pattern of VLIW Instructions Used
 Definitions
 Comparison of Pipelining, SW PL, and Superscalar Execution
 Typical VLIW instruction
 Example of Simple SW PL Loop
 Peeling-Off Iterations
 Software Pipelined Matrix Multiply
 More Complex SW Pipelined Loop
 Skipping the Epilogue?
 Literature References
Motivation
Imagine you construct a super-machine instruction, one that can perform
numerous operations in parallel. The time for the novel instruction to
complete equals the time of its longest sub-operation. Once the faster
sub-operations complete, their HW modules sit idle, waiting for another
instruction to keep them busy, until the most complex sub-operation
terminates. Let us name this super-machine instruction a Very Long
Instruction Word (VLIW) operation.
This VLIW operation may compute a floating-point (FP) multiply (fmul) at the
same time it performs an FP add (fadd), and a load (ld), and a store (st),
and two integer operations, and perhaps more. Below is a sequence of 3
fictitious assembler operations, executed in sequence; later we shall package
them into a VLIW operation:
-- 2 loads, 1 add; assume good addresses in r1, r2
ld   r3, (r1)++     -- load r3 indirect thru r1, post-increment
ld   r4, (r2)++     -- load into r4 the memory word at (r2)
fadd r5, r3, r4     -- floating-point add r5 ← r3 + r4
The sequence of instructions above executes sequentially. Our novel VLIW
instruction can do all of the above, and more, in parallel. The new
super-assembler expresses this simultaneity via new assembly
pseudo-operations, the { and } braces. These braces say: all enclosed
operations are packed into a single VLIW instruction, and all the
sub-opcodes, regardless of the order listed, are executed in parallel, in a
single logical step, in a single instruction. See the new VLIW instruction
below:
-- assume good addresses in r1, r2
{
   ld   r3, (r1)++   -- load r3 indirect thru r1, post-incr.
   ld   r4, (r2)++   -- load into r4 memory word at (r2)
   fadd r5, r3, r4   -- FP add r5 ← r3 + r4
}
It is not reasonable to define the architecture such that the sum of r3 and r4,
ending up in r5, would be the addition of the newly fetched memory words,
indirectly accessed through r1 and r2. Instead, the old values of r3 and r4,
those already in the registers when the instruction started, are added into
r5; only then are the new values loaded into r3 and r4. So the meanings of
the sequential and VLIW instruction sequences above are not equivalent. It
then seems that a VLIW operation cannot truly be used to execute in
parallel, and fast, what had to be evaluated in sequence, and slowly. And
that is a correct observation; yet maybe we can invent something great with
VLIW operations after all?
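The register behavior inside one VLIW step can be modeled in a few lines of
C. This is only an illustrative sketch; the names old_r3 and old_r4, and the
pointers p1 and p2 standing in for r1 and r2, are invented here:

/* One VLIW step: all source registers are read (latched) before
   any destination register is written. */
float old_r3 = r3, old_r4 = r4;  /* latch values present at instruction start */
r5 = old_r3 + old_r4;            /* the fadd sees only the latched old values */
r3 = *p1++;                      /* the loads overwrite r3 and r4 afterwards  */
r4 = *p2++;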
This is where Software Pipelining comes into play. It is a technique that
takes advantage of the built-in parallelism, while overcoming the limitation
that the destination register of one sub-operation cannot be used in the
same logical step as a source for another. Software Pipelining packs parts of
multiple iterations of one source loop into the same VLIW instruction. For
this to work, the software-pipelined loop must be suitably prepared; this is
done in the loop Prologue. And before the iterations are all complete, the
VLIW instruction must be emptied, or drained; that is done in the loop
Epilogue.
Outline of Software Pipelining
 Software Pipelining requires a target processor with multiple executing
agents (HW modules) that can run simultaneously
 Target may be an LIW, a VLIW, or a superscalar processor
 These multiple HW modules operate in parallel on (parts of) different
iterations of a source program’s loop
 It is key that none of these multiple, simultaneous operations generates a
result that is expected as input to another operation packed into the same
VLIW step
 Instead, in step 1 an operation generates a result in destination register
r1, while in step 2 another operation can use r1 as an input to generate
the next result r2
 It is the compiler’s or programmer’s task to pack operations of different
iterations of the source loop into suitable fields of one long instruction,
such that they can be executed in one step; the C-level sketch after this
list shows the resulting shape
 In extreme cases, a complete loop body may consist of a single VLIW
instruction; clearly a desirable special case
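The following C-level sketch shows the overall shape that software pipelining
produces. The stage functions stage1() and stage2() are hypothetical names,
invented here for the early and late parts of one source-loop iteration:

/* A source loop whose body splits into two stages: */
for ( i = 0; i < N; i++ ) {
    stage1( i );               /* e.g. the loads of iteration i         */
    stage2( i );               /* e.g. the add and store of iteration i */
}

/* The same loop, software pipelined with two columns: */
stage1( 0 );                   /* Prologue: start iteration 0           */
for ( i = 0; i < N-1; i++ ) {  /* Steady State, N-1 times               */
    stage2( i );               /* finish iteration i ...                */
    stage1( i+1 );             /* ... while starting iteration i+1;     */
}                              /* one VLIW step can do both at once     */
stage2( N-1 );                 /* Epilogue: finish the last iteration   */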
General Pattern of VLIW Instructions Used:
In the following examples we use VLIW instructions. The assembly language
syntax is similar to that of conventional assemblers. There will be individual
load and store operations, floating-point adds and multiplies. In addition,
VLIW instructions can group several of these operations together; this is
expressed by the { and } braces, indicating that all operations enclosed are
executed together, jointly in a single step. The sequential order in which
the various VLIW sub-operations are written in assembler source by no
means implies sequential execution order! Instead, all sub-operations are
started at the same time. For example:
1  ld   r3, (r1)++      -- load r3 ind. thru r1, post-incr.
2  add  r0, r6, #2      -- add 2 to r6 and put sum into r0
3  {                    -- start of VLIW operation
4     fadd r5, r3, r4   -- add current r3 + r4 into r5
5     ld   r3, (r1)++   -- new value into r3, old value used
6     ld   r4, (r2)++   -- new value into r4, old value used
7  }                    -- end of VLIW instruction
Lines 1 and 2 above display typical sequential load and add instructions.
However, lines 3 through 7 display one single VLIW operation that lumps a
floating point add, two integer increments, and two loads into one single
executable step. Note that r3 and r4 are both used as operands, as inputs to
the floating point add operation, but they also receive new values as a result
of the load functions. This works perfectly well, as long as the processor
latches their old values for use in the fadd. After completion of the VLIW
instruction the loaded values in r3 and r4 can be used as operands again.
The execution time of the above VLIW instruction is dominated by the time
to execute two loads in sequence, unless the two addresses (in r1 and r2)
happen to activate different memory banks. In that case, the time for two
loads would be the time for one load plus a small number of clock cycles.
Loop Instructions (Not Germane to VLIW)
The loop instruction used in the examples here is akin to that of the Intel
8086 processor. The hypothetical operation:
loop foo      -- uses special-purpose loop register rx
has the following meaning: An implied loop register rx is assumed initially to
hold the number of iterations of a countable loop. This integer value is
decreased by 1. If the decremented value is not yet 0, execution continues
at label foo. Typically foo resides at an address physically before the loop
instruction. So this results in a back branch most of the time and, once, in
a fall-through.
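A minimal C model of these semantics, with rx standing for the implied loop
register and foo for the branch target named above:

/* loop foo: decrement rx, branch back while nonzero */
rx = rx - 1;
if ( rx != 0 ) goto foo;  /* back branch most of the time...    */
                          /* ...and a fall-through exactly once */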
Definitions
Column:
A software-pipelined loop operates on (parts of) more than one iteration
of a source program’s loop body. Each distinct loop iteration toward which
one compute step progresses is called a column. Hence, if one execution of
the loop makes some progress toward 4 different iterations, such a
SW-pipelined loop is said to have 4 columns.
Draining:
Executing further instructions on a VLIW architecture after VLIW loop
execution, such that a SW-pipelined loop completes correctly. That is, all
registers still holding good values after loop termination are finally used
up by such sequential instructions. Synonym: flushing. Antonym: priming.
Epilogue:
When the (steady state of the) loop terminates, there will generally be some
valid operands in the hardware resources used. For example, an operand
may have been loaded and is yet to be used. Or, a product has been
computed that is yet to be added to the final sum. Thus the last operands
must be consumed, the pipeline must be drained. This is accomplished in
the object code after the steady state and is called the Epilogue. Antonym:
Prologue.
LIW:
Long Instruction Word; an instruction requiring more opcode bits than a
conventional (e.g. RISC) instruction, because multiple simultaneous
operations are packed and encoded into a single LIW instruction. Typically,
LIW instructions are longer than 32 but shorter than 128 bits.
The opcode proper may be short, possibly even a single bit, but generally
there will be further bits to specify sub-opcodes. For example, the
floating-point add may instead perform a subtract, a negate, or an unsigned
add, etc., and that must be specified via bits for sub-opcodes.
Peeling Off:
Removal of an iteration from the original complete loop. Usually this is done
to perform an optimization. In software pipelining this is done to ensure the
pipeline is primed before and drained after execution of the loop. The object
code of peeled off iterations can be scheduled together with other
instructions. Hence the Prologue and Epilogue may end up in VLIW
instructions of code preceding or following a software pipelined loop.
Priming:
Executing sequential instructions on a VLIW architecture before loop
execution, such that a SW-pipelined loop can run correctly. That is, all
registers hold good values at the start of the loop, but some still hold
partly unused values at termination of the loop. Antonym: flushing.
Prologue:
Before the Software Pipelined loop body can be initiated, the various
hardware resources (e.g. registers) that partake in the SW PL must be
initialized. For example, the first operands may have to be loaded, or the
first sum of two loaded operands must be computed. Thus the first operands
must be generated, the pipeline must be primed. This is accomplished in the
object code before the steady state and is called the Prologue. Antonym:
Epilogue.
Steady State:
The object code executed repeatedly after the Prologue has completed and
before the Epilogue becomes active is called the Steady State. Each single
execution of the Steady State makes some progress toward multiple
iterations of the source loop. The source loop was mapped into Prologue,
Steady State, plus Epilogue by the software-pipelining compiler.
VLIW:
Very Long Instruction Word; like an LIW instruction, but VLIW
instructions typically consume 128 or more bits for the opcode,
sub-opcodes, plus all operands. Some of the sub-opcodes may actually be
NOPs.
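Purely as an illustration of where 128 bits might go, one could picture such
a word as a C bitfield struct. Every field name and width below is invented
for this sketch; real encodings are machine specific, and C bitfield packing
is not how hardware instruction formats are defined:

/* hypothetical 128-bit VLIW instruction word */
struct vliw_word {
    unsigned fmul : 20;  /* FP-multiply slot: sub-opcode + register fields */
    unsigned fadd : 20;  /* FP-add slot: sub-opcode + register fields      */
    unsigned load : 22;  /* load slot: register + address mode             */
    unsigned mem2 : 22;  /* second load or store slot                      */
    unsigned ialu : 20;  /* integer add/multiply/divide slot               */
    unsigned loop : 16;  /* loop-control slot                              */
    unsigned fmt  :  8;  /* format bits; unused slots encode NOPs          */
};                       /* 20+20+22+22+20+16+8 = 128 bits                 */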
Comparison of Pipelining, SW PL, and Superscalar Execution
Pipelining:
 Assumes: multiple independent HW units that collectively execute one
single instruction in sequence. Hardware modules are not replicated.
 Does: at any step (clock cycle) each HW module executes one part of a
different instruction.
 Allows: simultaneous execution of multiple parts of different instructions
at the same time. This does not accelerate the execution time of a single
instruction.
 Speeds up: the throughput of numerous consecutive instructions;
provided there are no stalls (AKA hazards)
Superscalar Execution:
 Assumes: multiple independent HW units that each can execute a
complete instruction. Also assumes that some instruction sequences are
arranged in the object code such that they can be fetched and executed
together; implies they have no data dependence on one another
 Does: at any step execute either a single, or sometimes multiple,
instructions simultaneously. Always fetches greedily.
 Allows: simultaneous execution of some sequences of instructions. Note
that not all instruction sequences (or instruction pairs, in a superscalar
architecture with a maximum of 2 identical HW modules) can be executed
concurrently.
 Speeds up: those select instruction sequences, for which the architecture
provides multiple HW modules, and for which the logic of the running
program provides data independence, i.e. the output of one is not
required as the input to the other.
Software Pipelining:
 Assumes: VLIW or LIW instructions.
 Does: at any VLIW-step executes multiple operations at the same time,
packaged into a single VLIW instruction.
 Allows: execution of parts of multiple iterations of the same source loop
at run time. But generally one VLIW instruction does not map into a full
source loop; only a portion of a whole loop body.
 At each iteration the SW-pipelined loop executes VLIW instructions, which
together hold operations for multiple source statements of a loop body.
Thus it speeds up the total throughput of a complete loop.
Typical VLIW Instruction Performs
 floating-point multiply
 floating-point add
 load from memory, with optional pre- or post- auto-increment or decrement
 second load or store, with pre- or post- auto-increment or decrement
 integer add, also with optional auto-increment etc.
 integer multiply or divide
 loop-related operation
Note: If both a load and a store are performed in one VLIW instruction on a
system with a single memory controller, the load should be initiated first.
The execution time is greater than the maximum of the times for a load and
a store. Having two store operations as part of the same VLIW instruction
would require special care to define the case of both memory destinations
being identical or overlapping.
Example of Simple SW PL Loop
The program to be software pipelined adds all elements of floating-point
vector a[] to the corresponding elements of vector b[] and moves the sums
into vector a[]. First we show the source program in pseudo-C, with a
possible mapping into some hypothetical assembly language. Then we show
a pictorial representation of the problem with the hardware resources used,
and a software-pipelined version in VLIW assembly language. Note that
registers r0, r1, and r2 are used as pointers. Register r0 points to the next
element in a[] to be fetched, r2 to the next element in b[], and r1 to the
next address in a[] to be stored into.
Source of Vector Add Program:
-- a[] and b[] are floats
for ( i = 0; i < N; i++ ) {
    a[ i ] = a[ i ] + b[ i ];   // same as: a[ i ] += b[ i ]
} //end for
Sequential Assembly Program of Vector Add:
mov  r0, Addr( a[] )  -- address of a[0] is in r0 for load
mov  r1, r0           -- copy pointer into r1 for store
mov  r2, Addr( b[] )  -- address of b[0] in r2 for load
mov  rx, N            -- number of iterations in loop register rx
L_1:
ld   r3, (r0)++       -- load indirect into r3 through r0
ld   r4, (r2)++       -- what r2 points to is loaded into r4
fadd r5, r3, r4       -- r5 holds sum of two elements
st   r5, (r1)++       -- store result and post-increment
loop L_1              -- if not completed, jump to L_1
                      -- successor of loop follows here
Pictorial Representation of Vector Add Program:
[Figure: vectors a[] and b[]; r0 points into a[] for loads, r2 into b[] for
loads, r1 into a[] for stores; r3 and r4 feed the FP adder, whose sum lands
in r5.]
Software Pipelined Assembly Version of Vector Add:
The iteration space spans 0 through N-1. Two of the N iterations are peeled
off into the Prologue plus Epilogue. Each iteration of the Steady State
consists of two loads, two integer additions with post-increment, a
floating-point add, a store, and the loop-overhead step.
The Prologue executes a pair of loads and one floating-point add. Then
comes another pair of loads, since the previously loaded values are used up
in the floating-point add. The Epilogue then completes the last-but-one
store, performs the last add, and concludes by storing that final value.
-- Initialization:
mov  r0, Addr( a[] )  -- r0 is pointer to float array a[0]
mov  r1, r0           -- copy address of a[0] into r1
mov  r2, Addr( b[] )  -- r2 is pointer to b[0]
mov  rx, N-2          -- rx holds iteration count N-2
-- Prologue primes the SW pipe:
ld   r3, (r0)++       -- r3 holds a[0], r0 now points to a[1]
ld   r4, (r2)++       -- r4 holds b[0], r2 now points to b[1]
fadd r5, r3, r4       -- old a[0] + b[0] in r5
ld   r3, (r0)++       -- a[1] in r3, r0 now points to a[2]
ld   r4, (r2)++       -- b[1] in r4, r2 now points to b[2]
-- Steady State executes repeatedly, (N-2) times:
L_2:
{
   st   r5, (r1)++    -- store into address at r1
   fadd r5, r3, r4    -- r5 holds sum of two elements
   ld   r3, (r0)++    -- load indirect into r3 through r0
   ld   r4, (r2)++    -- what r2 points to is loaded into r4
   loop L_2           -- decrement rx; if != 0, jump to L_2
}
-- r1 now points to last-but-two element of a[] for storing
-- Epilogue drains the SW pipe:
st   r5, (r1)++       -- store, increment r1
fadd r5, r3, r4       -- use last pair of FP operands
st   r5, (r1)         -- and store last sum in a[N-1]
Peeling Off Iterations
The picture shows that two iterations had to be peeled off. The overall net
effect of all peeled-off operations is 2 pairs of loads, 2 floating-point adds,
and 2 stores. All those loads and one add are done early, in the Prologue.
The other add and the 2 stores are executed in the Epilogue. The former
primes the software pipeline for the Steady State; the latter drains it. A
C-level rendering of the complete schedule follows below.
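The same schedule can also be written out in C. This sketch assumes N >= 2;
the scalars x, y, and sum are invented stand-ins for registers r3, r4, and r5:

float x = a[0], y = b[0];          /* Prologue: first pair of loads       */
float sum = x + y;                 /* Prologue: first fadd                */
x = a[1];  y = b[1];               /* Prologue: second pair of loads      */
for ( int i = 0; i < N-2; i++ ) {  /* Steady State, N-2 times             */
    a[i] = sum;                    /* st:   result of iteration i         */
    sum  = x + y;                  /* fadd: produces result of iter. i+1  */
    x = a[i + 2];                  /* ld:   operands of iteration i+2     */
    y = b[i + 2];
}
a[N-2] = sum;                      /* Epilogue: last-but-one store        */
sum = x + y;                       /* Epilogue: last fadd                 */
a[N-1] = sum;                      /* Epilogue: final store               */

Note how no statement inside the loop consumes a value produced earlier in
the same pass; that independence is exactly what allows all four operations
to share one VLIW instruction.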
Software Pipelined Matrix Multiply
To show in the assembly listings the association of loop iterations with lines
of assembly code, we assume a small number of total iterations, here N = 5.
First we show the source for the matrix multiply with the important inner
loop marked. This inner loop is generally converted into a reduction
operation by the compiler’s optimizer; we shall use that optimization here.
After the assembly code for the normal, i.e. sequential, matrix multiply, we
discuss the SW-pipelined version.
[Figure: matrices c[], a[], and b[]; r1 points into a[row][] and r2 into
b[][col]; r3 and r4 feed the FP multiplier, whose product lands in r5; r5
and r6 feed the FP adder, which accumulates into r6.]
Source of Matrix Multiply, Inner Loop Marked:
for ( row = 0; row < N; row++ ) {
    for ( col = 0; col < N; col++ ) {
        c[row][col] = 0.0;
        for ( i = 0; i < N; i++ ) {   // the important inner loop
            c[row][col] += a[row][i] * b[i][col];
        } //end for
    } //end for
} //end for
-- optimizer or smart programmer reduces the inner loop to the equivalent:
temp_reg = 0.0;
for ( i = 0; i < N; i++ ) {
    temp_reg += a[row][i] * b[i][col];
} //end for
c[row][col] = temp_reg;
-- now we code a) in assembler and then b) SW-pipeline the inner loop
Sequential Assembly Program of Matrix Multiply:
The store into memory at c[row][col], repeated N times, has been optimized
away into a single store outside the inner loop; this is named a reduction
operation in optimizer parlance.
-- row and col are loop invariant for the innermost loop; they are
-- known here and are not changed
-- plausible cycle count per instruction in leftmost column:
1   mov  r1, Addr( a[row][0] )   -- address into r1
1   mov  r2, Addr( b[0][col] )   -- address into r2
1   mov  r6, 0.0                 -- cumulative FP-sum so far
1   mov  rx, N                   -- rx holds iteration count N (N > 2)
L_3:
3   ld   r3, (r1)++              -- load into r3 thru r1
3   ld   r4, (r2)                -- load into r4 thru r2
1   iadd r2, #stride             -- stride too big for ++
3   fmul r5, r3, r4              -- r5 holds next product
1   fadd r6, r5, r6              -- r6 holds cumulative products
1   loop L_3                     -- if not done, jump to L_3
end:                             -- label here for documentation
3   st   r6, Addr( c[row][col] )
Software Pipelined Assembly Program of Matrix Multiply:
Two iterations will be peeled off from the inner loop of the source program.
One performs two loads and the floating-point multiply; the other quickly
reloads the registers r3 and r4. Thus, with N=5, there are only 3 iterations
left in the Steady State of the software-pipelined code. We explicitly show
the inner-loop iteration numbers associated with each piece of object code.
For the inner loop there will be N-2 iterations; see the leftmost 3 numeric
columns in the listing for the Steady State below.
-- source iteration indices in left columns:
-- Initialization; assume N=5
-- for loop-counting illustration purposes
0   mov  r1, Addr( a[row][0] )   -- pointer into r1
0   mov  r2, Addr( b[0][col] )   -- pointer into r2
0   mov  r6, 0.0                 -- cumulative sum so far is 0
0   mov  rx, N-2                 -- rx holds iteration count N-2
-- Prologue ‘primes’ the SW pipe:
1   ld   r3, (r1)++              -- a[row][0] in r3
1   ld   r4, (r2)                -- b[0][col] in r4
1   fmul r5, r3, r4              -- first product in r5
2   iadd r2, #stride             -- stride too big for ++
2   ld   r3, (r1)++              -- a[row][1] in r3
2   ld   r4, (r2)                -- b[1][col] in r4
3   iadd r2, #stride             -- r2 points to b[2][col]
-- Steady State executes N-2 times:
L_4:
{
1 2 3   fadd r6, r5, r6          -- r6 holds cumulative sum
2 3 4   fmul r5, r3, r4          -- r5 holds next product
3 4 5   ld   r3, (r1)++          -- load next a[][] into r3 via r1
3 4 5   ld   r4, (r2)            -- load next b[][] into r4 via r2
4 5 6   iadd r2, #stride         -- too big for an auto-increment;
                                 -- finally, r2 points to next b[][]
        loop L_4                 -- loop again, or fall through?
} -- end of loop body, end of Steady State
-- Epilogue drains the SW pipe:
4   fadd r6, r5, r6
5   fmul r5, r3, r4
5   fadd r6, r5, r6
-- the store is not part of the optimized loop:
-- all N stores have been optimized (reduced) into a single
-- store after the innermost loop is complete
    st   r6, c[row][col]
Caution: In a single iteration, the Steady State of the above
software-pipelined loop progresses on 4 different iterations of the source
program’s loop. We say the number of columns is 4. Since r2 is used as a
pointer to the next element of b[], technically speaking it is out of range
after loop completion. But since there is no attempt to dereference this
illegal pointer, the program is still safe.
If the last element of b[] just happens to reside at the end of memory, r2
would experience integer overflow, which ends up being harmless, due to
the lack of a pointer dereference. Most machines ignore integer overflow in
registers, but trap on a floating-point over- (or under-) flow.
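At C level, the pipelined inner product above can be sketched as follows,
assuming N >= 3. The scalars x, y, prod, and sum are invented stand-ins for
r3, r4, r5, and r6; the out-of-range advance of r2 is not modeled here,
since C array indexing replaces the pointer arithmetic:

double sum = 0.0, prod, x, y;
x = a[row][0];  y = b[0][col];      /* Prologue, iteration 1: loads     */
prod = x * y;                       /* Prologue, iteration 1: fmul      */
x = a[row][1];  y = b[1][col];      /* Prologue, iteration 2: loads     */
for ( int i = 0; i < N-2; i++ ) {   /* Steady State, N-2 times          */
    sum  = sum + prod;              /* fadd: accumulate iteration i+1   */
    prod = x * y;                   /* fmul: product of iteration i+2   */
    x = a[row][i + 2];              /* ld:   operands of iteration i+3  */
    y = b[i + 2][col];
}
sum = sum + prod;                   /* Epilogue: accumulate iter. N-1   */
prod = x * y;                       /* Epilogue: product of iteration N */
sum = sum + prod;                   /* Epilogue: accumulate iteration N */
c[row][col] = sum;                  /* the single, reduced store        */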
More Complex SW Pipelined Loop
The next example to be mapped into software-pipelined VLIW instructions
adds a vector b[] to the products of vectors c[] and d[], then stores the
result in yet a fourth vector a[]. See the C source sketched below:
for ( i = 0; i < N; i++ ) {
    a[ i ] = b[ i ] + c[ i ] * d[ i ];
} //end for
[Figure: vectors a[], b[], c[], and d[] with pointer registers r1, r2, r3,
and r4; r6 and r7 feed the FP multiplier, whose product lands in r8; r5 and
r8 feed the FP adder, whose sum lands in r9.]
A More Complex Vector Operation:
Three iterations will be peeled off from the loop. After proper initialization
of all pointer registers r1 through r4, one peeled-off iteration performs the
first loads of c[0] and d[0] into r6 and r7 for the first multiplication. It
then loads r5 with b[0] for the subsequent floating-point addition. Except
for the store into a[0], this collective work constitutes one full iteration.
The second, partly peeled-off iteration takes advantage of the fact that the
multiply just completed has consumed the values in r6 and r7, so they can be
reloaded. The FP add now uses the values in r5 and r8; r8 holds the first
product. Hence 3 new loads can proceed and a new multiply can be
performed. This completes the second partly peeled-off iteration.
Finally, the third peeled-off iteration loads the register pair r6 and r7
again. This is feasible because the multiply in peeled-off iteration 2 already
used those values (c[1] and d[1]), hence r6 and r7 can be reloaded. Then
the complete pipe is primed. By this time two fmul and one fadd instruction
have been executed, and 3 pairs of c[i] and d[i] have been loaded into the
register pair r6 and r7. This does not yet constitute 3 full iterations’ worth
of peeled-off work. The remaining portion is completed in the Epilogue.
-- iteration indices 1..N in left column
-- index 0 says: preparation, NOT directly part of source
0    mov  r1, Addr( a[0] )    -- pointer a[i] into r1
0    mov  r2, Addr( b[0] )    -- pointer b[i] into r2
0    mov  r3, Addr( c[0] )    -- pointer c[i] into r3
0    mov  r4, Addr( d[0] )    -- pointer d[i] into r4
0    mov  rx, N-3             -- rx holds iteration count N-3
-- Prologue ‘primes’ the SW pipe; left column: iteration 1..N
1    ld   r6, (r3)++          -- r6 = c[0]
1    ld   r7, (r4)++          -- r7 = d[0], ready to fmul
1    fmul r8, r6, r7          -- r8 = c[0] * d[0]
1    ld   r5, (r2)++          -- r5 = b[0], ready for fadd
1    fadd r9, r5, r8          -- r9 = b[0] + c[0] * d[0]
2    ld   r6, (r3)++          -- r6 = c[1]
2    ld   r7, (r4)++          -- r7 = d[1], ready to fmul
2    fmul r8, r6, r7          -- r8 = c[1] * d[1]
2    ld   r5, (r2)++          -- r5 = b[1], ready to fadd
3    ld   r6, (r3)++          -- r6 = c[2], 1st * operand
3    ld   r7, (r4)++          -- r7 = d[2], ready to fmul
-- Steady State executes N-3 times:
L_5:
{
1 2 ..   st   r9, (r1)++      -- r9 holds sum, r1 Addr( a[i] )
2 3 ..   fadd r9, r5, r8      -- next sum[i+1] into r9
3 4 ..   fmul r8, r6, r7      -- product c[i]*d[i] in r8
4 5 ..   ld   r5, (r2)++      -- r5 = b[i-1]
4 5 ..   ld   r6, (r3)++      -- r6 = c[i]
4 5 ..   ld   r7, (r4)++      -- r7 = d[i]
         loop L_5
} -- end Steady State, done N-3 times
-- Epilogue drains the SW pipe; sum indices range 1..N,
-- iteration indices 1..N, subscript range 0..N-1
N-2  st   r9, (r1)++          -- a[N-3] = r9, holds sum[N-2]
N-1  fadd r9, r5, r8          -- sum[N-1]
N-1  st   r9, (r1)++          -- a[N-2] = sum[N-1]
N    ld   r5, (r2)++          -- r5 = b[N-1]
N    fmul r8, r6, r7          -- r8 = c[N-1] * d[N-1]
N    fadd r9, r5, r8          -- r9 = sum[N]
N    st   r9, (r1)            -- a[N-1] = sum[N]
Skipping The Epilogue?
An interesting question is: “Could we skip the Epilogue and simply execute
the Steady State for the corresponding number of additional iterations?”
The presumed savings would be compact code and faster execution. The
answer clearly depends on the source of the inner loop and on the number
of iterations peeled off.
Generally, however, the saving in object code is not dramatic. The code of
the Epilogue can usually be scheduled into the Prologue of subsequent VLIW
instructions. Moreover, both Prologue and Epilogue are executed just once,
so the run-time saving cannot be dramatic either. More interestingly, the
pipeline would be filled with values through load operations that progress
too far; the sketch below makes this concrete. This alone would not pose
any danger, as the registers holding the loaded results must be considered
dead after the loop with Epilogue. However, if the addresses referenced a
point beyond legally addressable memory, an exception could arise that
would render the whole execution illegal. In that case, even fast execution
is no longer attractive. Our response therefore: the Epilogue must be
executed appropriately.
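To make the danger concrete with the vector-add example from earlier: had
we dropped the Epilogue and let the Steady State run N instead of N-2
times, its two extra passes (i = N-2 and i = N-1) would have executed, in
the C-level sketch from above:

a[i] = sum;      /* st and fadd: still produce correct results      */
sum  = x + y;
x = a[i + 2];    /* ld: reads a[N], a[N+1], b[N], and b[N+1],       */
y = b[i + 2];    /*     all past the ends of both arrays; harmless  */
                 /*     only if that memory is mapped and readable  */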
Bibliography
1. Monica Lam, A Systolic Array Optimizing Compiler, CMU dissertation, 1989
(ISBN 0-89838-300-5).
2. B. R. Rau and C. D. Glaeser, Some scheduling techniques and an easily
schedulable horizontal architecture for high performance scientific
computing, Proceedings of the Fourteenth Annual Workshop on
Microprogramming (MICRO-14), December 1981, pp. 183-198.
3. Monica Lam, Software pipelining: An effective scheduling technique for
VLIW machines, Proceedings of the ACM SIGPLAN ’88 Conference on
Programming Language Design and Implementation (PLDI ’88), July 1988,
pp. 318-328.
4. John Ruttenberg et al., Software pipelining showdown: optimal vs.
heuristic methods in a production compiler, Proceedings of the ACM
SIGPLAN 1996 Conference on Programming Language Design and
Implementation (PLDI ’96), June 1996, pp. 1-11.