Software Pipelining
HM (11/14/2011)

Software Pipelining (SW PL) is a performance-enhancing technique that exploits instruction-level parallelism in a wide instruction word to accomplish execution speed-up. As the name implies, pipelining here is not primarily achieved through hardware, but rather refers to the interleaved execution of distinct iterations of a SW loop. Parts of multiple iterations of a source program's loop body are executed simultaneously in a pipelined fashion. This requires that the target machine have multiple HW modules that can execute simultaneously. For example, an LIW architecture has instructions that perform, say, a floating point multiply at the same time they execute a floating point add, and perhaps also an integer addition plus a load operation. We have seen this on a smaller scale with superscalar machines. Similarly, a VLIW architecture provides instructions that do all of the above plus a store, several other integer operations, and perhaps loop instructions. Very roughly, the boundary between LIW and VLIW is 128 opcode bits.

Synopsis
  Motivation
  Outline of Software Pipelining
  General Pattern of VLIW Instructions Used
  Definitions
  Comparison of Pipelining, SW PL, and Superscalar Execution
  Typical VLIW Instruction
  Example of Simple SW PL Loop
  Peeling Off Iterations
  Software Pipelined Matrix Multiply
  More Complex SW Pipelined Loop
  Skipping the Epilogue?
  Literature References

Motivation

Imagine you construct a super-machine instruction, one that can perform numerous operations in parallel. The time for the novel instruction to complete equals the time of the longest sub-operation. Once the faster sub-operations complete, their HW modules sit idle, waiting for another instruction to keep them busy, until the most complex sub-operation terminates. Let us name this super-machine instruction a Very Long Instruction Word (VLIW) operation.
This VLIW operation may compute a floating point (FP) multiply (fmul) at the same time it performs an FP add (fadd), and a load (ld), and a store (st), and two integer operations, and perhaps more. Below is a sequence of three fictitious assembler operations, executed in sequence; later we shall package them into a VLIW operation:

    -- 2 loads, 1 add; assume good addresses in r1, r2
    ld   r3, (r1)++     -- load r3 ind. thru r1, post-incr.
    ld   r4, (r2)++     -- load into r4 memory word at (r2)
    fadd r5, r3, r4     -- floating-point add r5 ← r3 + r4

The sequence of instructions above executes sequentially. Our novel VLIW instruction can do all of the above, and more, in parallel. The new super-assembler expresses this simultaneity via new assembly pseudo-operations, the { and } braces. These braces say: all enclosed operations are packed into a single VLIW instruction, and all the sub-opcodes, regardless of the order listed, are executed in parallel, in a single logical step, in a single instruction. See the new VLIW instruction below:

    {                   -- assume good addresses in r1, r2
      ld   r3, (r1)++   -- load r3 indirect thru r1, post-incr.
      ld   r4, (r2)++   -- load into r4 memory word at (r2)
      fadd r5, r3, r4   -- FP add r5 ← r3 + r4
    }

It is not reasonable to define the architecture such that the sum of r3 and r4, ending up in r5, would be the addition of the newly fetched memory words, indirectly accessed through r1 and r2. Instead, the old values of r3 and r4, those already in the registers when the instruction started, are added into r5; the new values are then loaded into r3 and r4. So the meanings of the sequential and VLIW instruction sequences above are not equivalent. It then seems that a VLIW operation cannot truly execute in parallel, and fast, what had to be evaluated in sequence, and slowly. That is a correct observation; yet maybe we can invent something great with VLIW operations after all?
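The read-before-write semantics of such a VLIW step can be sketched in sequential C. This is a minimal simulation of our hypothetical instruction, not a real ISA; the names (vliw_step, mem, r1..r5) are illustrative only. The point is that all source operands are latched before any destination is written:

```c
/* One VLIW step: { ld r3,(r1)++ ; ld r4,(r2)++ ; fadd r5,r3,r4 }.
   All sources are read first, then all destinations are written,
   so the fadd sees the OLD r3 and r4, not the freshly loaded ones. */
static void vliw_step(double mem[], int *r1, int *r2,
                      double *r3, double *r4, double *r5)
{
    double old_r3 = *r3;        /* latch source operands first  */
    double old_r4 = *r4;
    *r5 = old_r3 + old_r4;      /* fadd uses the old r3 and r4  */
    *r3 = mem[(*r1)++];         /* ld r3, (r1)++                */
    *r4 = mem[(*r2)++];         /* ld r4, (r2)++                */
}
```

After the step, r5 holds the sum of the values that were in r3 and r4 before the step, while r3 and r4 hold the newly loaded memory words.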
This is where Software Pipelining comes into play. It is a technique that takes advantage of the built-in parallelism while overcoming the limitation that the destination register of one sub-operation cannot be used, in the same logical step, as a source for another. Software Pipelining packs parts of multiple iterations of one source loop into the same VLIW instruction. For this to work, the software-pipelined loop must be suitably prepared; this is done in the loop Prologue. And before the iterations are all complete, the VLIW instruction must be emptied, or drained; that is done in the loop Epilogue.

Outline of Software Pipelining

- Software Pipelining requires a target processor with multiple executing agents (HW modules) that can run simultaneously
- Target may be an LIW, a VLIW, or a superscalar processor
- These multiple HW modules operate in parallel on (parts of) different iterations of a source program's loop
- It is key that none of these multiple, simultaneous operations generate results that would be expected as input of another operation in the same iteration
- Instead, in step 1 an operation generates a result in destination register r1, while in step 2 another operation can use r1 as an input to generate the next result r2
- It is the compiler's or programmer's task to pack operations of different iterations of the source loop into suitable fields of one long instruction in a way that they can be executed in one step
- In extreme cases, a complete loop body may consist of a single VLIW instruction; clearly a desirable special case

General Pattern of VLIW Instructions Used

In the following examples we use VLIW instructions. The assembly language syntax is similar to conventional assemblers. There will be individual load and store operations, floating point adds and multiplies.
In addition, VLIW instructions can group numerous of these operations together; this is expressed by the { and } braces, indicating that all enclosed operations are executed together, jointly, in a single step. The sequential order in which the various VLIW sub-operations are written in assembler source by no means implies sequential execution order! Instead, all sub-operations are started at the same time. For example:

    1  ld   r3, (r1)++     -- load r3 ind. thru r1, post-incr.
    2  add  r0, r6, #2     -- add 2 to r6 and put sum into r0
    3  {                   -- start of VLIW operation
    4    fadd r5, r3, r4   -- add current r3 + r4 into r5
    5    ld   r3, (r1)++   -- new value in r3, old value used
    6    ld   r4, (r2)++   -- new value in r4, old value used
    7  }                   -- end of VLIW instruction

Lines 1 and 2 above display typical sequential load and add instructions. However, lines 3 through 7 display one single VLIW operation that lumps a floating point add, two integer increments, and two loads into one single executable step. Note that r3 and r4 are both used as operands, as inputs to the floating point add operation, but they also receive new values as a result of the load operations. This works perfectly well, as long as the processor latches their old values for use in the fadd. After completion of the VLIW instruction, the loaded values in r3 and r4 can be used as operands again.

The execution time of the above VLIW instruction is dominated by the time to execute two loads in sequence, unless the two addresses (in r1 and r2) happen to activate different memory banks. In that case, the time for the two loads would be the time for one load plus a small number of clock cycles.

Loop Instructions, Not Germane to VLIW

The loop instruction used in the examples here is akin to that of the Intel 8086 processor. The hypothetical operation:

    loop foo    -- use special-purpose loop register rx

has the following meaning: an implied loop register rx is assumed initially to hold the number of iterations of a countable loop.
This integer value is decreased by 1. If the decremented value is not yet 0, execution continues at label foo. Typically foo resides at an address physically before the loop instruction, so this results in a back branch most of the time, and once in a fall-through.

Definitions

Column: A software-pipelined loop operates on (parts of) more than one iteration of a source program's loop body. Each distinct loop iteration --towards which a compute step progresses-- is called a column. Hence, if one execution of the loop makes some progress toward 4 different iterations, such a SW-pipelined loop is said to have 4 columns.

Draining: Executing further instructions on a VLIW architecture after VLIW loop execution, such that a SW-pipelined loop completes correctly. That means all registers still holding good values after loop termination will finally be used up by such sequential instructions. Synonym: flushing. Antonym: priming.

Epilogue: When the (steady state of the) loop terminates, there will generally be some valid operands left in the hardware resources used. For example, an operand may have been loaded and is yet to be used. Or a product has been computed that is yet to be added to the final sum. Thus the last operands must be consumed; the pipeline must be drained. This is accomplished in the object code after the steady state and is called the Epilogue. Antonym: Prologue.

LIW: Long Instruction Word; an instruction requiring more opcode bits than a conventional (e.g. RISC) instruction, because multiple simultaneous operations are packed and encoded into a single LIW instruction. Typically, LIW instructions are longer than 32 but shorter than 128 bits. The opcode proper may be short, possibly even a single bit, but generally there will be further bits to specify sub-opcodes. For example, the floating point add may instead perform a subtract, negate, or unsigned add, etc., and that must be specified via bits for sub-opcodes.
Peeling Off: Removal of an iteration from the original complete loop. Usually this is done to perform an optimization. In software pipelining it is done to ensure the pipeline is primed before, and drained after, execution of the loop. The object code of peeled-off iterations can be scheduled together with other instructions. Hence the Prologue and Epilogue may end up in VLIW instructions of code preceding or following a software-pipelined loop.

Priming: Executing sequential instructions on a VLIW architecture before loop execution, such that a SW-pipelined loop can run correctly. That means all registers hold good values at the start of the loop, but also hold partly unused values at termination of the loop. Antonym: flushing.

Prologue: Before the software-pipelined loop body can be initiated, the various hardware resources (e.g. registers) that partake in the SW PL must be initialized. For example, the first operands may have to be loaded, or the first sum of two loaded operands must be computed. Thus the first operands must be generated; the pipeline must be primed. This is accomplished in the object code before the steady state and is called the Prologue. Antonym: Epilogue.

Steady State: The object code executed repeatedly, after the Prologue has been initiated and before the Epilogue becomes active, is called the Steady State. Each single execution of the Steady State makes some progress toward multiple iterations of the source loop. This loop was mapped into Prologue, Steady State, plus Epilogue by the software-pipelining compiler.

VLIW: Very Long Instruction Word; like an LIW instruction, but VLIW instructions typically consume 128 or more bits for the opcode, sub-opcodes, plus all operands. Some of the sub-opcodes may actually be NOPs.

Comparison of Pipelining, SW PL, and Superscalar Execution

Pipelining:
- Assumes: multiple independent HW units that collectively execute one single instruction in sequence.
  Hardware modules are not replicated.
- Does: at any step (clock cycle), each HW module executes one part of a different instruction.
- Allows: simultaneous execution of multiple parts of different instructions at the same time. This does not accelerate the execution time of a single instruction.
- Speeds up: the throughput of numerous consecutive instructions, provided there are no stalls (AKA hazards).

Superscalar Execution:
- Assumes: multiple independent HW units that each can execute a complete instruction. Also assumes that some instruction sequences are arranged in the object code such that they can be fetched and executed together; this implies they have no data dependence on one another.
- Does: at any step, execute either a single or sometimes multiple instructions simultaneously. Always fetches greedily.
- Allows: simultaneous execution of some sequences of instructions. Note that not all instruction sequences (or instruction pairs, in a superscalar architecture with a maximum of 2 identical HW modules) can be executed concurrently.
- Speeds up: those select instruction sequences for which the architecture provides multiple HW modules, and for which the logic of the running program provides data independence, i.e. the output of one is not required as the input of the other.

Software Pipelining:
- Assumes: VLIW or LIW instructions.
- Does: at any VLIW step, executes multiple operations at the same time, packaged into a single VLIW instruction.
- Allows: execution of parts of multiple iterations of the same source loop at run time. Generally one VLIW instruction does not map into a full source loop, only a portion of a whole loop body. At each iteration the SW-pipelined loop executes VLIW instructions, which together hold instructions for multiple source statements of a loop body.
- Speeds up: the total throughput of a complete loop.
Typical VLIW Instruction

Performs:
- floating point multiply
- floating point add
- load from memory, with optional auto-increment or -decrement, pre- or post-
- second load, or store, with auto-increment or -decrement, pre- or post-
- integer add, also with optional auto-increment, etc.
- integer multiply or divide
- loop-related operation

Note: If both a load and a store are performed in one VLIW instruction on a system with a single memory controller, the load should be initiated first. The execution time is then greater than the maximum of the times for a load and a store. Having two store operations as part of the same VLIW instruction would require special care to define the case of both memory destinations being identical or overlapping.

Example of Simple SW PL Loop

The program to be software pipelined adds all elements of floating-point vector a[] to the corresponding elements of vector b[] and moves the sums into vector a[]. First we show the source program in pseudo-C, with a possible mapping into some hypothetical assembly language. Then we show a pictorial representation of the problem with the hardware resources used, and a software-pipelined version in VLIW assembly language. Note that registers r0, r1, and r2 are used as pointers: r0 points to the next element of a[] to be fetched, r1 to the next element of a[] to be stored, and r2 to the next element of b[] to be fetched.
Source of Vector Add Program:

    for ( i = 0; i < N; i++ ) {
      a[ i ] = a[ i ] + b[ i ];   // same as: a[ i ] += b[ i ]
    } //end for -- a[] and b[] are floats

Sequential Assembly Program of Vector Add:

          mov  r0, Addr( a[] )   -- address of a[0] is in r0 for load
          mov  r1, r0            -- copy pointer into r1 for store
          mov  r2, Addr( b[] )   -- address of b[0] in r2 for load
          mov  rx, N             -- number of iterations in loop register rx
    L_1:  ld   r3, (r0)++        -- load indirect into r3 through r0
          ld   r4, (r2)++        -- what r2 points to is loaded into r4
          fadd r5, r3, r4        -- r5 holds sum of two elements
          st   r5, (r1)++        -- store result and post-increment
          loop L_1               -- if not completed, jump to L_1
                                 -- successor of loop follows here

Pictorial Representation of Vector Add Program:

    [Figure: r0 and r2 point into a[] and b[] for the loads into r3 and r4;
     r3 + r4 yields r5, which is stored into a[] through r1.]

Software Pipelined Assembly Version of Vector Add:

The iteration space spans 0 through N-1. Two of the N iterations are peeled off into the Prologue plus Epilogue. Each iteration of the Steady State consists of two loads, two integer additions with post-increment, a floating point add, a store, and the loop-overhead step. The Prologue executes a pair of loads and one floating point add, then another pair of loads, since the previously loaded values are used up in the floating point add. The Epilogue completes the last-but-one store, performs the last add, and concludes by storing that final value.
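Before turning to the VLIW assembly, the same restructuring can be sketched in plain sequential C. This sketch is ours, not part of the original example: the function name vec_add_swpl is hypothetical, the locals r3..r5 mimic the registers of the VLIW version, and N >= 2 is assumed.

```c
/* Vector add restructured the way software pipelining restructures it:
   a Prologue peels work off the front, each Steady State trip mixes
   two different source iterations, and an Epilogue drains the rest. */
static void vec_add_swpl(float a[], const float b[], int n)
{
    /* Prologue: load a[0], b[0]; first sum; reload for iteration 1. */
    float r3 = a[0], r4 = b[0];
    float r5 = r3 + r4;
    r3 = a[1];
    r4 = b[1];

    /* Steady State, n-2 times: store sum i, add for i+1, load i+2. */
    for (int i = 0; i < n - 2; i++) {
        a[i] = r5;          /* st   r5, (r1)++ */
        r5   = r3 + r4;     /* fadd r5, r3, r4 */
        r3   = a[i + 2];    /* ld   r3, (r0)++ */
        r4   = b[i + 2];    /* ld   r4, (r2)++ */
    }

    /* Epilogue: last-but-one store, final add, final store. */
    a[n - 2] = r5;
    r5 = r3 + r4;
    a[n - 1] = r5;
}
```

Note that each trip through the C loop, like each Steady State VLIW instruction, touches three different source iterations: it stores iteration i, adds for iteration i+1, and loads for iteration i+2.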
    -- Initialization:
          mov  r0, Addr( a[] )   -- r0 is pointer to float array a[0]
          mov  r1, r0            -- copy address of a[0] into r1
          mov  r2, Addr( b[] )   -- r2 is pointer to b[0]
          mov  rx, N-2           -- rx holds iteration count N-2
    -- Prologue primes the SW pipe:
          ld   r3, (r0)++        -- r3 holds a[0], r0 now points to a[1]
          ld   r4, (r2)++        -- r4 holds b[0], r2 now points to b[1]
          fadd r5, r3, r4        -- old a[0] + b[0] in r5
          ld   r3, (r0)++        -- a[1] in r3, r0 now points to a[2]
          ld   r4, (r2)++        -- b[1] in r4, r2 now points to b[2]
    -- Steady State executes repeatedly, (N-2) times:
    L_2: {
          st   r5, (r1)++        -- store into address at r1
          fadd r5, r3, r4        -- r5 holds sum of two elements
          ld   r3, (r0)++        -- load indirect into r3 through r0
          ld   r4, (r2)++        -- what r2 points to is loaded into r4
          loop L_2               -- decrement rx; if != 0, jump to L_2
         }
    -- r1 now points to the last-but-two element of a[] for storing
    -- Epilogue drains the SW pipe:
          st   r5, (r1)++        -- store a[N-2], increment r1
          fadd r5, r3, r4        -- use last pair of FP operands
          st   r5, (r1)          -- and store last sum in a[N-1]

Peeling Off Iterations

The picture shows that two iterations had to be peeled off. The overall net effect of all peeled-off operations is 2 pairs of loads, 2 floating-point adds, and 2 stores. All those loads and one add are done early, in the Prologue. The other add and the 2 stores are executed in the Epilogue. The former primes the software pipeline for the Steady State; the latter drains it.

Software Pipelined Matrix Multiply

To show in the assembly listings the association of loop iterations with lines of assembly code, we assume a small number of total iterations, here N = 5. First we show the source for the matrix multiply with the important inner loop emphasized. This inner loop is generally converted into a reduction operation by the compiler's optimizer; we shall use that optimization here. After the assembly code for the normal, i.e. sequential, matrix multiply, we discuss the software-pipelined version.
    [Figure: r1 points into a[row][*], r2 into b[*][col]; the loads feed
     r3 and r4, r3 * r4 yields the product r5, which is accumulated into
     the sum r6, finally stored into c[].]

Source of Matrix Multiply, Inner Loop Emphasized:

    for ( row = 0; row < N; row++ ) {
      for ( col = 0; col < N; col++ ) {
        c[row][col] = 0.0;
        for ( i = 0; i < N; i++ ) {
          c[row][col] += a[row][i] * b[i][col];
        } //end for
      } //end for
    } //end for

    -- optimizer or smart programmer reduces this to the equivalent:

    temp_reg = 0.0;
    for ( i = 0; i < N; i++ ) {
      temp_reg += a[row][i] * b[i][col];
    } //end for
    c[row][col] = temp_reg;

    -- now we code a) in assembler and then b) SW-pipeline the inner loop

Sequential Assembly Program of Matrix Multiply:

The repeated store into memory at c[row][col], N times, has been optimized away into a single store outside the inner loop; this is named a reduction operation in optimizer parlance. row and col are loop-invariant for the innermost loop:

    -- row and col are known here, and are not changed
    -- plausible cycle count per instruction in leftmost column:
    1     mov  r1, Addr( a[row][0] )  -- address into r1
    1     mov  r2, Addr( b[0][col] )  -- address into r2
    1     mov  r6, 0.0                -- cumulative FP-sum so far
    1     mov  rx, N                  -- rx holds iteration count N>2
    L_3:
    3     ld   r3, (r1)++             -- load into r3 thru r1
    3     ld   r4, (r2)               -- load into r4 thru r2
    1     iadd r2, #stride            -- stride too big for ++
    3     fmul r5, r3, r4             -- r5 holds next product
    1     fadd r6, r5, r6             -- r6 holds cumulative products
    1     loop L_3                    -- if not done, jump to L_3
    end:                              -- label here for documentation
    3     st   r6, Addr( c[row][col] )

Software Pipelined Assembly Program of Matrix Multiply:

Two iterations will be peeled off from the inner loop of the source program. One performs two loads and the floating point multiply; the other quickly reloads the registers r3 and r4. Thus, with N=5, there are only 3 iterations left in the Steady State of the software-pipelined code. We explicitly show the associated iteration numbers of the inner loop for each piece of object code.
For the inner loop there will be N-2 steady-state iterations; see the leftmost 3 numeric columns in the listing for the Steady State below.

    -- source iteration indices in left columns:
    -- Initialization; assume N=5 for loop-counting illustration purposes
    0       mov  r1, Addr( a[row][0] )  -- pointer into r1
    0       mov  r2, Addr( b[0][col] )  -- pointer into r2
    0       mov  r6, 0.0                -- cumulative sum so far is 0
    0       mov  rx, N-2                -- rx holds iteration count N-2
    -- Prologue 'primes' the SW pipe:
    1       ld   r3, (r1)++             -- a[row][0] in r3
    1       ld   r4, (r2)               -- b[0][col] in r4
    1       fmul r5, r3, r4             -- first product in r5
    2       iadd r2, #stride            -- stride too big for ++
    2       ld   r3, (r1)++             -- a[row][1] in r3
    2       ld   r4, (r2)               -- b[1][col] in r4
    3       iadd r2, #stride            -- r2 points to b[2][col]
    -- Steady State executes N-2 times:
    L_4: {
    1 2 3   fadd r6, r5, r6             -- r6 holds cumulative sum
    2 3 4   fmul r5, r3, r4             -- r5 holds next product
    3 4 5   ld   r3, (r1)++             -- load next a[][] into r3 via r1
    3 4 5   ld   r4, (r2)               -- load next b[][] into r4 via r2
    4 5 6   iadd r2, #stride            -- too big for an auto-increment;
                                        -- finally, r2 points to next b[][]
            loop L_4                    -- loop again, or fall through
         }  -- end of loop body, end of Steady State
    -- Epilogue drains the SW pipe:
    4       fadd r6, r5, r6
    5       fmul r5, r3, r4
    5       fadd r6, r5, r6
    -- The store is not part of the optimized loop: all N stores have
    -- been optimized (reduced) into a single store after the innermost
    -- loop is complete:
            st   r6, c[row][col]

Caution: In a single iteration, the Steady State of the above software-pipelined loop progresses on 4 different iterations of the source program's loop. We say the number of columns is 4. Since r2 is used as a pointer to the next element of b[], technically speaking it is out of range after loop completion. But since there is no attempt to dereference this illegal pointer, the program is still safe.
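As a cross-check, the Prologue / Steady State / Epilogue structure of this dot-product loop can be mirrored in sequential C. This sketch is ours: the function name dot_swpl is hypothetical, a_row and b_col stand for a[row][*] and b[*][col], and N >= 2 is assumed.

```c
/* Software-pipelined dot product in sequential C: two iterations are
   peeled into the prologue, the steady state runs n-2 times, and the
   epilogue drains the last add, multiply, and add. */
static double dot_swpl(const double a_row[], const double b_col[], int n)
{
    double r6 = 0.0;                 /* cumulative sum (register r6)  */

    /* Prologue: first product, then reload operands for iteration 2. */
    double r3 = a_row[0], r4 = b_col[0];
    double r5 = r3 * r4;             /* fmul for iteration 1          */
    r3 = a_row[1];
    r4 = b_col[1];                   /* loads for iteration 2         */

    /* Steady State, n-2 times: add product i, multiply for i+1,
       load operands for i+2. */
    for (int i = 0; i < n - 2; i++) {
        r6 += r5;                    /* fadd r6, r5, r6               */
        r5  = r3 * r4;               /* fmul r5, r3, r4               */
        r3  = a_row[i + 2];          /* ld next a[row][i+2]           */
        r4  = b_col[i + 2];          /* ld next b[i+2][col]           */
    }

    /* Epilogue: fadd, fmul, fadd -- as in the assembly listing. */
    r6 += r5;
    r5  = r3 * r4;
    r6 += r5;
    return r6;                       /* st r6, c[row][col]            */
}
```

Each steady-state trip makes progress on several source iterations at once, just as the leftmost numeric columns of the listing indicate.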
If the last element of b[] just happens to reside at the end of memory, r2 would experience integer overflow, which ends up being harmless due to the lack of a pointer dereference. Most machines ignore integer overflow in registers, but trap on a floating-point over- (or under-) flow.

More Complex SW Pipelined Loop

The next example to be mapped into software-pipelined VLIW instructions adds a vector b[] to the products of vectors c[] and d[], then stores the result in yet a fourth vector a[]. See the C source sketched below:

    for ( i = 0; i < N; i++ ) {
      a[ i ] = b[ i ] + c[ i ] * d[ i ];
    } //end for

    [Figure: r2, r3, and r4 point into b[], c[], and d[]; the loads feed
     r5, r6, and r7; r6 * r7 yields the product r8, r5 + r8 the sum r9,
     which is stored into a[] through r1.]

A More Complex Vector Operation:

Three iterations will be peeled off from the loop. After proper initialization of all register pointers using r1 through r4, one peeled-off iteration performs the first loads of c[0] and d[0] into r6 and r7 for the first multiplication. It then loads r5 with b[0] for the subsequent floating point addition. Except for the store into a[0], this collective work constitutes one full iteration.

The second, partly peeled-off iteration takes advantage of the fact that the multiply just completed consumed the values in r6 and r7, so they can be reloaded. The FP add now uses the values in r5 and r8; r8 holds the first product. Hence 3 new loads can proceed and a new multiply can be performed. This completes the second partly peeled-off iteration.

Finally, the third peeled-off iteration loads register pair r6 and r7 again. This is feasible because the multiply in peeled-off iteration 2 already used the values (these are c[1] and d[1]); hence r6 and r7 must be reloaded again. Now the complete pipe is primed. By this time two fmul and one fadd instructions have been executed, and 3 pairs of c[i] and d[i] have been loaded into register pair r6 and r7. This does not yet constitute 3 full iterations' worth of peeled-off work.
The remaining portion is yet to be completed in the Epilogue.

    -- iteration indices 1..N in left column;
    -- index 0 says: preparation NOT directly part of source
    0   mov  r1, Addr( a[0] )  -- pointer a[i] into r1
    0   mov  r2, Addr( b[0] )  -- pointer b[i] into r2
    0   mov  r3, Addr( c[0] )  -- pointer c[i] into r3
    0   mov  r4, Addr( d[0] )  -- pointer d[i] into r4
    0   mov  rx, N-3           -- rx holds iteration count N-3
    -- Prologue 'primes' the SW pipe; left col: iteration 1..N
    1   ld   r6, (r3)++        -- r6 = c[0]
    1   ld   r7, (r4)++        -- r7 = d[0], ready to fmul
    1   fmul r8, r6, r7        -- r8 = c[0] * d[0]
    1   ld   r5, (r2)++        -- r5 = b[0], ready for fadd
    1   fadd r9, r5, r8        -- r9 = b[0] + c[0] * d[0]
    2   ld   r6, (r3)++        -- r6 = c[1]
    2   ld   r7, (r4)++        -- r7 = d[1], ready to fmul
    2   fmul r8, r6, r7        -- r8 = c[1] * d[1]
    2   ld   r5, (r2)++        -- r5 = b[1], ready to fadd
    3   ld   r6, (r3)++        -- r6 = c[2], 1st * operand
    3   ld   r7, (r4)++        -- r7 = d[2], ready to fmul
    -- Steady State executes N-3 times:
    L_5: {
    1 2 ..  st   r9, (r1)++    -- r9 holds sum, r1 = Addr(a[i])
    2 3 ..  fadd r9, r5, r8    -- next sum into r9
    3 4 ..  fmul r8, r6, r7    -- product c[i]*d[i] in r8
    4 5 ..  ld   r5, (r2)++    -- next b[] element into r5
    4 5 ..  ld   r6, (r3)++    -- next c[] element into r6
    4 5 ..  ld   r7, (r4)++    -- next d[] element into r7
            loop L_5
         }  -- end Steady State, done N-3 times
    -- Epilogue drains the SW pipe; sum indices range 1..N,
    -- iteration indices 1..N, subscript range 0..N-1
    N-2 st   r9, (r1)++        -- a[N-3] = r9, holds sum[N-2]
    N-1 fadd r9, r5, r8        -- sum[N-1]
    N-1 st   r9, (r1)++        -- a[N-2] = sum[N-1]
    N   ld   r5, (r2)++        -- r5 = b[N-1]
    N   fmul r8, r6, r7        -- r8 = c[N-1] * d[N-1]
    N   fadd r9, r5, r8        -- r9 = sum[N]
    N   st   r9, (r1)          -- a[N-1] = sum[N]

Skipping the Epilogue?

An interesting question is: could we skip the Epilogue and simply execute the Steady State a corresponding number of iterations more? The presumed saving would be compact code and faster execution. The answer clearly depends on the source of the inner loop and on the depth of iterations peeled off.
Generally, however, the saving in object code is not dramatic. The code of the Epilogue can usually be scheduled into the Prologue of subsequent VLIW instructions. Moreover, both Prologue and Epilogue are executed just once, so the run-time saving cannot be dramatic either. More interestingly, the pipeline would be filled with values through load operations that progress too far. This alone would not pose any danger, as the registers holding the loaded results must be considered dead after the loop with its Epilogue. However, if the addresses referenced a point beyond the legally addressable memory, an exception could arise that would render the whole execution illegal. In that case, even fast execution is no longer attractive. Our conclusion therefore: the Epilogue must be executed appropriately.

Bibliography

1. Monica Lam, A Systolic Array Optimizing Compiler, 1989 (ISBN 0-89838-300-5); based on her CMU dissertation.
2. B. R. Rau and C. D. Glaeser, Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing, Proceedings of the Fourteenth Annual Workshop on Microprogramming (MICRO-14), December 1981, pp. 183-198.
3. Monica Lam, Software pipelining: An effective scheduling technique for VLIW machines, Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation (PLDI), July 1988, pp. 318-328.
4. John Ruttenberg et al., Software pipelining showdown: optimal vs. heuristic methods in a production compiler, Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation (PLDI), June 1996, pp. 1-11.