Transcript
Program design and
analysis
Design patterns
Representations of programs
Assembly and linking
1
Design patterns
Design pattern: generalized description of
the design of a certain type of program.
Designer fills in details to customize the
pattern to a particular programming
problem.
2
List design pattern
[Class diagram: a List is associated with zero or more (1 to *) List-elements. List operations: create(), add-element(), delete-element(), find-element(), destroy().]
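A minimal C sketch of this pattern, assuming int data and the usual singly linked representation (all names here are illustrative, not from the slides):

#include <stdlib.h>

typedef struct list_element {
    int data;
    struct list_element *next;
} list_element;

typedef struct { list_element *head; } list;

list *list_create(void) {
    list *l = malloc(sizeof(list));
    if (l) l->head = NULL;
    return l;
}

void list_add_element(list *l, int data) {
    list_element *e = malloc(sizeof(list_element));
    if (!e) return;
    e->data = data;
    e->next = l->head;   /* insert at head */
    l->head = e;
}

list_element *list_find_element(list *l, int data) {
    for (list_element *e = l->head; e != NULL; e = e->next)
        if (e->data == data) return e;
    return NULL;
}

delete-element() and destroy() walk the same links, freeing nodes as they go.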
3
Design pattern elements
Class diagram
State diagrams
Sequence diagrams
etc.
4
State machine
State machine is useful in many contexts:
parsing user input
responding to complex stimuli
controlling sequential outputs
5
State machine example
[State transition diagram for a seat-belt controller. States: idle, seated, belted, buzzer. Transitions read from the figure: idle stays idle on no seat/-; idle goes to seated on seat/timer on; seated stays seated on no belt and no timer/-; seated goes to belted on belt/-; seated goes to buzzer on no belt and timer/buzzer on; belted returns to seated on no belt/timer on; seated and belted return to idle on no seat/-; buzzer returns to belted on belt/buzzer off and to idle on no seat/buzzer off.]
6
State machine pattern
[Class diagram: State machine; state variable: state; operation: output step(input).]
7
C implementation
#define IDLE 0
#define SEATED 1
#define BELTED 2
#define BUZZER 3
switch (state) {
case IDLE: if (seat) { state = SEATED; timer_on = TRUE; }
break;
case SEATED: if (belt) state = BELTED;
else if (timer) state = BUZZER;
break;
…
}
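A sketch of the elided cases, reconstructed from the state diagram above (the buzzer_on flag is an assumption, analogous to timer_on):

case BELTED: if (!seat) state = IDLE;                     /* rider leaves */
             else if (!belt) { state = SEATED; timer_on = TRUE; }
             break;
case BUZZER: if (belt) { state = BELTED; buzzer_on = FALSE; }
             else if (!seat) { state = IDLE; buzzer_on = FALSE; }
             break;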
8
Circular buffer
[Figure: a data stream x1 ... x6 with sliding time windows t1, t2, t3, and a four-entry circular buffer holding the current window; x5, x6, x7 overwrite x1, x2, x3 in place while x4 remains.]
9
Circular buffer pattern
[Class diagram: Circular buffer; operations: init(), add(data), data head(), data element(index).]
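A possible C realization of this interface, assuming a fixed-size int buffer (names and size are illustrative):

#define CBUF_N 8

typedef struct {
    int buf[CBUF_N];
    int head;                         /* index of the oldest element */
} circ_buf;

void cbuf_init(circ_buf *b) { b->head = 0; }

void cbuf_add(circ_buf *b, int data) {
    b->buf[b->head] = data;           /* overwrite the oldest element */
    b->head = (b->head + 1) % CBUF_N; /* head now marks the new oldest */
}

int cbuf_head(circ_buf *b) { return b->buf[b->head]; }

int cbuf_element(circ_buf *b, int index) {
    return b->buf[(b->head + index) % CBUF_N];
}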
10
Circular buffer
implementation: FIR filter
int circ_buffer[N], circ_buffer_head = 0;
int c[N]; /* coefficients */
…
int f, ibuf, ic;
for (f=0, ibuf=circ_buffer_head, ic=0;
     ic<N; ibuf=(ibuf==N-1 ? 0 : ibuf+1), ic++)
   f = f + c[ic]*circ_buffer[ibuf];
11
Models of programs
Source code is not a good representation
for programs:
clumsy;
leaves much information implicit.
Compilers derive intermediate
representations to manipulate and
optimize the program.
12
Data flow graph
DFG: data flow graph.
Does not represent control.
Models basic block: code with one entry
and exit.
Describes the minimal ordering
requirements on operations.
13
Single assignment form
original basic block:
x = a + b;
y = c - d;
z = x * y;
y = b + d;

single assignment form:
x = a + b;
y = c - d;
z = x * y;
y1 = b + d;
14
Data flow graph
single assignment form:
x = a + b;
y = c - d;
z = x * y;
y1 = b + d;

[DFG: a + node computes x from a and b; a - node computes y from c and d; a * node computes z from x and y; a second + node computes y1 from b and d.]
15
DFGs and partial orders
[The same DFG as on the previous slide.]

Partial order: a+b, c-d; b+d, x*y.
Can do pairs of operations in any order.
16
Control-data flow graph
CDFG: represents control and data.
Uses data flow graphs as components.
Two types of nodes:
decision;
data flow.
17
Data flow node
Encapsulates a data flow graph:
x = a + b;
y = c + d;
Write operations in basic block form for
simplicity.
18
Control
[Figure: two equivalent forms of control node: a binary decision testing cond with T and F exits, and a multiway decision testing value with exits v1, v2, v3, v4.]
19
CDFG example
if (cond1) bb1();
else bb2();
bb3();
switch (test1) {
case c1: bb4(); break;
case c2: bb5(); break;
case c3: bb6(); break;
}
[CDFG: a decision node on cond1 with T edge to bb1() and F edge to bb2(), both flowing into bb3(); a multiway decision on test1 with edges c1, c2, c3 to bb4(), bb5(), bb6().]
20
for loop
for (i=0; i<N; i++)
  loop_body();

is equivalent to:

i=0;
while (i<N) {
  loop_body(); i++; }

[CDFG: i=0, then a decision node i<N; the T edge leads to loop_body() and back to the test, the F edge exits the loop.]
21
Assembly and linking
Last steps in compilation:
HLL files -> compile -> assembly files -> assemble -> link -> executable -> load
22
Multiple-module programs
Programs may be composed from several
files.
Addresses become more specific during
processing:
relative addresses are measured relative to
the start of a module;
absolute addresses are measured relative to
the start of the CPU address space.
23
Assemblers
Major tasks:
generate binary for symbolic instructions;
translate labels into addresses;
handle pseudo-ops (data, etc.).
Generally one-to-one translation.
Assembly labels:
        ORG 100
label1  ADR r4,c
24
Symbol table
assembly code:
     ADD r0,r1,r2
xx   ADD r3,r4,r5
     CMP r0,r3
yy   SUB r5,r6,r7

symbol table:
xx   0x8
yy   0x10
25
Symbol table generation
Use program location counter (PLC) to
determine address of each location.
Scan program, keeping count of PLC.
Addresses are generated at assembly
time, not execution time.
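A sketch of the pass-1 scan in C, assuming one word-addressed instruction per line and ignoring pseudo-ops (all names are illustrative):

#include <string.h>

typedef struct { int has_label; char label[16]; } asm_line;

void build_symbol_table(const asm_line *lines, int n,
                        char names[][16], int addrs[], int *nsyms) {
    int plc = 0;                      /* program location counter */
    *nsyms = 0;
    for (int i = 0; i < n; i++) {
        if (lines[i].has_label) {     /* label gets the current PLC */
            strcpy(names[*nsyms], lines[i].label);
            addrs[*nsyms] = plc;
            (*nsyms)++;
        }
        plc++;                        /* each instruction advances the PLC */
    }
}

An ORG pseudo-op would set plc directly instead of incrementing it.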
26
Symbol table example
PLC=0x7       ADD r0,r1,r2
PLC=0x8   xx  ADD r3,r4,r5
PLC=0x9       CMP r0,r3
PLC=0xa   yy  SUB r5,r6,r7

symbol table:
xx   0x8
yy   0xa
27
Two-pass assembly
Pass 1:
generate symbol table
Pass 2:
generate binary instructions
28
Relative address
generation
Some label values may not be known at
assembly time.
Labels within the module may be kept in
relative form.
Must keep track of external labels---can’t
generate full binary for instructions that
use external labels.
29
Pseudo-operations
Pseudo-ops do not generate instructions:
ORG sets program location.
EQU generates symbol table entry without
advancing PLC.
Data statements define data blocks.
30
Linking
Combines several object modules into a
single executable module.
Jobs:
put modules in order;
resolve labels across modules.
31
Externals and entry points
[Figure: two object modules. Labels xxx and yyy are entry points; the instructions B a and ADR r4,yyy contain external references that the linker resolves against entry points defined in other modules.]
32
Module ordering
Code modules must be placed in absolute
positions in the memory space.
Load map or linker flags control the order
of modules.
[Figure: module1, module2, module3 placed in order in the memory space.]
33
Dynamic linking
Some operating systems link modules
dynamically at run time:
shares one copy of library among all
executing programs;
allows programs to be updated with new
versions of libraries.
34
Program design and
analysis
Compilation flow.
Basic statement translation.
Basic optimizations.
Interpreters and just-in-time compilers.
35
Compilation
Compilation strategy (Wirth):
compilation = translation + optimization
Compiler determines quality of code:
use of CPU resources;
memory access scheduling;
code size.
36
Basic compilation phases
HLL -> parsing, symbol table -> machine-independent optimizations -> machine-dependent optimizations -> assembly
37
Statement translation and
optimization
Source code is translated into
intermediate form such as CDFG.
CDFG is transformed/optimized.
CDFG is translated into instructions with
optimization decisions.
Instructions are further optimized.
38
Arithmetic expressions
a*b + 5*(c-d)
[DFG for the expression: a * node (a, b), a - node (c, d), a * node (5 and the difference), and a + node combining the two products.]
39
Arithmetic expressions,
cont’d.
[The same DFG, with the operator nodes numbered 1-4 in evaluation order.]

code:
ADR r4,a
LDR r1,[r4]
ADR r4,b
LDR r2,[r4]
MUL r3,r1,r2
ADR r4,c
LDR r1,[r4]
ADR r4,d
LDR r5,[r4]
SUB r6,r1,r5
MUL r7,r6,#5
ADD r8,r7,r3
40
Control code generation
if (a+b > 0)
x = 5;
else
x = 7;
[CDFG: decision node a+b>0; T edge to x=5, F edge to x=7.]
41
Control code generation,
cont’d.
[CDFG blocks numbered 1 (the test), 2 (x=5), 3 (x=7).]

ADR r5,a
LDR r1,[r5]
ADR r5,b
LDR r2,[r5]
ADD r3,r1,r2
BLE label3
LDR r3,#5
ADR r5,x
STR r3,[r5]
B stmtent
label3  LDR r3,#7
ADR r5,x
STR r3,[r5]
stmtent ...
42
Procedure linkage
Need code to:
call and return;
pass parameters and results.
Parameters and returns are passed on
stack.
Procedures with few parameters may use
registers.
43
Procedure stacks
proc1(int a) {
  proc2(5);
}

[Figure: the stack grows downward; proc1's frame is marked by the frame pointer (FP), proc2's frame by the stack pointer (SP); the argument 5 sits in proc2's frame and is accessed relative to SP.]
44
ARM procedure linkage
APCS (ARM Procedure Call Standard):
r0-r3 pass parameters into procedure. Extra
parameters are put on the stack frame.
r0 holds return value.
r4-r7 hold register variables.
r11 is frame pointer, r13 is stack pointer.
r10 holds limiting address on stack size to
check for stack overflows.
45
Data structures
Different types of data structures use
different data layouts.
Some offsets into data structure can be
computed at compile time, others must be
computed at run time.
46
One-dimensional arrays
C array name points to 0th element:
a -> a[0]
     a[1]   = *(a + 1)
     a[2]
47
Two-dimensional arrays
Row-major layout (for an N x M array, M elements per row):

a[0,0] a[0,1] ... a[0,M-1] a[1,0] a[1,1] ...

a[i,j] = a[i*M + j]
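The same arithmetic written out in C (a sketch; a_base stands for the array's starting address):

int element(const int *a_base, int i, int j, int M) {
    return *(a_base + i*M + j);   /* a[i][j] is i*M + j elements past the base */
}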
48
Structures
Fields within structures are static offsets:
struct mystruct {
   int field1;    /* 4 bytes */
   char field2;   /* at byte offset 4 */
};
struct mystruct a, *aptr = &a;

[Layout: field1 occupies the first 4 bytes; field2 follows at offset 4, so it is accessed as *((char *)aptr + 4).]
49
Expression simplification
Constant folding:
8+1 = 9
Algebraic:
a*b + a*c = a*(b+c)
Strength reduction:
a*2 = a<<1
50
Dead code elimination
Dead code:
#define DEBUG 0
if (DEBUG) dbg(p1);
Can be eliminated by
analysis of control
flow, constant folding.
[CDFG: the condition is the constant 0, so the dbg(p1) branch can never execute.]
51
Procedure inlining
Eliminates procedure linkage overhead:
int foo(a,b,c) { return a + b - c; }
z = foo(w,x,y);

becomes

z = w + x - y;
52
Loop transformations
Goals:
reduce loop overhead;
increase opportunities for pipelining;
improve memory system performance.
53
Loop unrolling
Reduces loop overhead, enables some
other optimizations.
for (i=0; i<4; i++)
  a[i] = b[i] * c[i];

becomes

for (i=0; i<2; i++) {
  a[i*2] = b[i*2] * c[i*2];
  a[i*2+1] = b[i*2+1] * c[i*2+1];
}
54
Loop fusion and
distribution
Fusion combines two loops into one:

for (i=0; i<N; i++) a[i] = b[i] * 5;
for (j=0; j<N; j++) w[j] = c[j] * d[j];

becomes

for (i=0; i<N; i++) {
  a[i] = b[i] * 5; w[i] = c[i] * d[i];
}

Distribution breaks one loop into two, as sketched below.
Changes optimizations within loop body.
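A sketch of distribution, simply reversing the fusion example above:

for (i=0; i<N; i++) {
  a[i] = b[i] * 5; w[i] = c[i] * d[i];
}

becomes

for (i=0; i<N; i++) a[i] = b[i] * 5;
for (i=0; i<N; i++) w[i] = c[i] * d[i];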
55
Loop tiling
Breaks one loop into a nest of loops.
Changes order of accesses within array.
Changes cache behavior.
56
Loop tiling example
for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    c[i] = a[i,j]*b[i];

becomes

for (i=0; i<N; i+=2)
  for (j=0; j<N; j+=2)
    for (ii=i; ii<min(i+2,N); ii++)
      for (jj=j; jj<min(j+2,N); jj++)
        c[ii] = a[ii,jj]*b[ii];
57
Array padding
Add array elements to change mapping
into cache:
before:
a[0,0] a[0,1] a[0,2]
a[1,0] a[1,1] a[1,2]

after (one pad element per row):
a[0,0] a[0,1] a[0,2] pad
a[1,0] a[1,1] a[1,2] pad
58
Register allocation
Goals:
choose register to hold each variable;
determine lifespan of variable in the register.
Basic case: within basic block.
59
Register lifetime graph
w = a + b;  t=1
x = c + w;  t=2
y = c + d;  t=3

[Lifetime graph, variable vs. time: a and b are last used at t=1; w spans t=1 to t=2; c is used at t=2 and t=3; x is defined at t=2; d is used at t=3; y is defined at t=3.]
60
Instruction scheduling
Non-pipelined machines do not need
instruction scheduling: any order of
instructions that satisfies data
dependencies runs equally fast.
In pipelined machines, execution time of
one instruction depends on the nearby
instructions: opcode, operands.
61
Reservation table
A reservation table relates instructions/time to CPU resources.

[Reservation table: rows instr1-instr4, columns for resources A and B; an X marks the resource an instruction uses in its time slot.]
62
Software pipelining
Schedules instructions across loop
iterations.
Reduces instruction latency in iteration i
by inserting instructions from iteration i-1.
63
Software pipelining in
SHARC
Example:
for (i=0; i<N; i++)
sum += a[i]*b[i];
Combine three iterations:
Fetch array elements a, b for iteration i.
Multiply a,b for iteration i-1.
Compute sum for iteration i-2.
64
Software pipelining in
SHARC, cont’d
/* first iteration performed outside loop */
ai=a[0]; bi=b[0]; p=ai*bi;
/* initiate loads used in second iteration; remaining loads
   will be performed inside the loop */
ai=a[1]; bi=b[1];
for (i=2; i<N; i++) {
   ai=a[i]; bi=b[i]; /* fetch for next cycle's multiply */
   p = ai*bi; /* multiply for next iteration's sum */
   sum += p; /* make sum using p from last iteration */
}
sum += p; p=ai*bi; sum += p;
65
Software pipelining timing
[Pipelined schedule over time:
time 1: iteration i-2: ai=a[i]; bi=b[i];
time 2: iteration i-2: p = ai*bi;   iteration i-1: ai=a[i]; bi=b[i];
time 3: iteration i-2: sum += p;    iteration i-1: p = ai*bi;    iteration i: ai=a[i]; bi=b[i];  (pipe full)
time 4:                             iteration i-1: sum += p;     iteration i: p = ai*bi;
time 5:                                                          iteration i: sum += p;]
66
Instruction selection
May be several ways to implement an
operation or sequence of operations.
Represent operations as graphs, match
possible instruction sequences onto graph.
[Figure: an expression DFG with a * node feeding a + node; instruction templates MUL (covers *), ADD (covers +), and MADD (covers the * and + together) are matched onto the graph.]
67
Using your compiler
Understand the various optimization levels (-O1, -O2, etc.).
Look at mixed compiler/assembler output.
Modifying compiler output requires care:
correctness;
loss of hand-tweaked code.
68
Interpreters and JIT
compilers
Interpreter: translates and executes
program statements on-the-fly.
JIT compiler: compiles small sections of
code into instructions during program
execution.
Eliminates some translation overhead.
Often requires more memory.
69
Program design and
analysis
Optimizing for execution time.
Optimizing for energy/power.
Optimizing for program size.
70
Motivation
Embedded systems must often meet
deadlines.
Faster may not be fast enough.
Need to be able to analyze execution
time.
Worst-case, not typical.
Need techniques for reliably improving
execution time.
71
Run times will vary
Program execution times depend on
several factors:
Input data values.
State of the instruction, data caches.
Pipelining effects.
72
Measuring program speed
CPU simulator.
I/O may be hard.
May not be totally accurate.
Hardware timer.
Requires board, instrumented program.
Logic analyzer.
Limited logic analyzer memory depth.
73
Program performance
metrics
Average-case:
For typical data values, whatever they are.
Worst-case:
For any possible input set.
Best-case:
For any possible input set.
Too-fast programs may cause critical races
at system level.
74
Performance analysis
Elements of program performance
(Shaw):
execution time = program path + instruction
timing
Path depends on data values. Choose
which case you are interested in.
Instruction timing depends on pipelining,
cache behavior.
75
Programs and performance
analysis
Best results come from analyzing
optimized instructions, not high-level
language code:
non-obvious translations of HLL statements
into instructions;
code may move;
cache effects are hard to predict.
76
Program paths
Consider for loop:
for (i=0, f=0; i<N; i++)
f = f + c[i]*x[i];
Loop initiation block
executed once.
Loop test executed
N+1 times.
Loop body and
variable update
executed N times.
[CDFG: i=0; f=0; then the test i<N; the Y edge leads to f = f + c[i]*x[i]; i = i+1; and back to the test; the N edge exits.]
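Putting these counts together, a simple bound on the loop's execution time (using assumed per-block times t_init, t_test, t_body, t_update) is:

t_loop = t_init + (N+1)*t_test + N*(t_body + t_update)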
77
Instruction timing
Not all instructions take the same amount
of time.
Instruction execution times are not
independent.
Execution time may depend on operand
values.
78
Trace-driven performance
analysis
Trace: a record of the execution path of a
program.
Trace gives execution path for
performance analysis.
A useful trace:
requires proper input values;
is large (gigabytes).
79
Trace generation
Hardware capture:
logic analyzer;
hardware assist in CPU.
Software:
PC sampling.
Instrumentation instructions.
Simulation.
80
Loop optimizations
Loops are good targets for optimization.
Basic loop optimizations:
code motion;
induction-variable elimination;
strength reduction (x*2 -> x<<1).
81
Code motion
for (i=0; i<N*M; i++)
  z[i] = a[i] + b[i];

[CDFG before and after code motion. Before: i=0; the test i<N*M recomputes N*M every iteration. After: i=0; X = N*M; test i<X; Y edge to z[i] = a[i] + b[i]; i = i+1; N edge exits. The invariant computation N*M is hoisted out of the loop test.]
82
Induction variable
elimination
Induction variable: loop index.
Consider loop:
for (i=0; i<N; i++)
for (j=0; j<M; j++)
z[i][j] = b[i][j];
Rather than recompute i*M+j for each array in
each iteration, share induction variable between
arrays, increment at end of loop body.
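A sketch of that transformation on the loop above, replacing the two recomputed i*M+j index expressions with one shared, incremented induction variable (pointer names are illustrative):

zptr = &z[0][0]; bptr = &b[0][0];
for (i = 0; i < N; i++)
    for (j = 0; j < M; j++) {
        *zptr = *bptr;    /* was z[i][j] = b[i][j] */
        zptr++; bptr++;   /* shared induction variable update */
    }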
83
Cache analysis
Loop nest: set of loops, one inside other.
Perfect loop nest: no conditionals in nest.
Because loops use large quantities of
data, cache conflicts are common.
84
Array conflicts in cache
[Figure: in main memory, a[0,0] is at address 1024 and b[0,0] at 4099; both addresses map to the same cache location, so the two arrays evict each other.]
85
Array conflicts, cont’d.
Array elements conflict because they are
in the same line, even if not mapped to
same location.
Solutions:
move one array;
pad array.
86
Performance optimization
hints
Use registers efficiently.
Use page mode memory accesses.
Analyze cache behavior:
instruction conflicts can be handled by
rewriting code, rescheduling;
conflicting scalar data can easily be moved;
conflicting array data can be moved, padded.
87
Energy/power optimization
Energy: ability to do work.
Most important in battery-powered systems.
Power: energy per unit time.
Important even in wall-plug systems---power
becomes heat.
88
Measuring energy
consumption
Execute a small loop, measure the current I drawn by the CPU:

while (TRUE)
  a();
89
Sources of energy
consumption
Relative energy per operation (Catthoor et
al):
memory transfer: 33
external I/O: 10
SRAM write: 9
SRAM read: 4.4
multiply: 3.6
add: 1
90
Cache behavior is
important
Energy consumption has a sweet spot as
cache size changes:
cache too small: program thrashes, burning
energy on external memory accesses;
cache too large: cache itself burns too much
power.
91
Optimizing for energy
First-order optimization:
high performance = low energy.
Not many instructions trade speed for
energy.
92
Optimizing for energy,
cont’d.
Use registers efficiently.
Identify and eliminate cache conflicts.
Moderate loop unrolling eliminates some
loop overhead instructions.
Eliminate pipeline stalls.
Inlining procedures may help: reduces
linkage, but may increase cache
thrashing.
93
Optimizing for program
size
Goal:
reduce hardware cost of memory;
reduce power consumption of memory units.
Two opportunities:
data;
instructions.
94
Data size minimization
Reuse constants, variables, data buffers in
different parts of code.
Requires careful verification of correctness.
Generate data using instructions.
95
Reducing code size
Avoid function inlining.
Choose CPU with compact instructions.
Use specialized instructions where
possible.
96
Code compression
Use statistical compression to reduce code size, decompress on-the-fly:

[Figure: main memory holds compressed code (0101101); a decompressor uses a translation table to expand it into instructions (LDR r0,[r4]) on the way to the cache and CPU.]
97
Program design and
analysis
Program validation and testing.
98
Goals
Make sure software works as intended.
We will concentrate on functional testing---performance testing is harder.
What tests are required to adequately test
the program?
What is “adequate”?
99
Basic testing procedure
Provide the program with inputs.
Execute the program.
Compare the outputs to expected results.
100
Types of software testing
Black-box: tests are generated without
knowledge of program internals.
Clear-box (white-box): tests are
generated from the program structure.
101
Clear-box testing
Generate tests based on the structure of
the program.
Is a given block of code executed when we
think it should be executed?
Does a variable receive the value we think it
should get?
102
Controllability and
observability
Controllability: must be able to cause a
particular internal condition to occur.
Observability: must be able to see the
effects of a state from the outside.
103
Example: FIR filter
Code:
for (firout = 0.0, j = 0; j < N; j++)
  firout += buff[j] * c[j];
if (firout > 100.0) firout = 100.0;
if (firout < -100.0) firout = -100.0;
Controllability: to test range checks for
firout, must first load circular buffer.
Observability: how do we observe values
of buff, firout?
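A sketch of one such test, assuming the filter code above is wrapped in a callable routine and that buff, c, and firout are accessible to the test harness (run_fir and the direct access are hypothetical):

for (j = 0; j < N; j++) { buff[j] = 100.0; c[j] = 1.0; }  /* drives firout to 100*N */
run_fir();                       /* hypothetical entry point for the filter code */
assert(firout == 100.0);         /* observe that the positive clamp fired */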
104
Path-based testing
Clear-box testing generally tests selected
program paths:
control program to exercise a path;
observe program to determine if path was
properly executed.
May look at whether location on path was
reached (control), whether variable on
path was set (data).
105
Points of view
Several ways to look at control coverage:
statement coverage;
basis sets;
cyclomatic complexity;
branch testing;
domain testing.
106
Example: choosing paths
Two possible criteria for selecting a set of
paths:
Execute every statement at least once.
Execute every direction of a branch at least
once.
Equivalent for structured programs, but
not for programs with gotos.
107
Path example
[Figure: an example CDFG with two decisions; the path marked +/+ covers all statements, but covering all branches requires also taking the other branch directions.]
108
Basis paths
How many distinct paths are in a
program?
An undirected graph has a basis set of
edges:
a linear combination of basis edges (xor
together sets of edges) gives any possible
subset of edges in the graph.
CDFG is directed, but basis set is an
approximation.
109
Basis set example
[Graph: nodes a, b, c, d, e; edges a-c, b-c, c-d, b-e, d-e.]

incidence matrix (rows and columns ordered a b c d e):
a: 0 0 1 0 0
b: 0 0 1 0 1
c: 1 1 0 1 0
d: 0 0 1 0 1
e: 0 1 0 1 0

basis set: the five rows above.
110
Cyclomatic complexity
Provides an upper bound on the control
complexity of a program:
e = # edges in control graph;
n = # nodes in control graph;
p = # graph components.
Cyclomatic complexity:
M = e - n + 2p.
Structured program: # binary decisions + 1.
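A quick check on an if-then-else CDFG with one component: n = 4 nodes (test, true branch, false branch, join), e = 4 edges, p = 1, so M = 4 - 4 + 2 = 2, which matches one binary decision plus 1.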
111
Branch testing strategy
Exercise the elements of a conditional,
not just one true and one false case.
Devise a test for every simple condition in
a Boolean expression.
112
Example: branch testing
Meant to write:
if (a || (b >= c)) { printf("OK\n"); }
Actually wrote:
if (a && (b >= c)) { printf("OK\n"); }
Branch testing strategy:
One test is a=F, (b >= c) = T: a=0, b=3,
c=2.
Produces different answers.
113
Another branch testing
example
Meant to write:
if ((x == good_pointer) && (x->field1 == 3))...
Actually wrote:
if ((x = good_pointer) && (x->field1 == 3))...
Branch testing strategy:
If we use only field1 value to exercise
branch, we may miss pointer problem.
114
Domain testing
Concentrates on linear inequalities.
Example: j <= i + 1.
Test two cases on the boundary, one outside the
boundary. For j <= i + 1, points such as (i=0, j=1)
and (i=3, j=4) lie on the boundary j = i + 1, and a
point such as (i=0, j=3) lies outside it.

[Figure: boundary test points for correct and incorrect implementations of the inequality.]
115
Data flow testing
Def-use analysis: match variable
definitions (assignments) and uses.
Example:
x = 5;            /* def */
…
if (x > 0) ...    /* p-use */
Does assignment get to the use?
116
Loop testing
Common, specialized structure---specialized tests can help.
Useful test cases:
skip loop entirely;
one iteration;
two iterations;
mid-range of iterations;
n-1, n, n+1 iterations.
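A sketch of a driver for these cases, assuming the loop under test is wrapped as run_loop(n) with nominal bound N (both names hypothetical):

int counts[] = { 0, 1, 2, N/2, N-1, N, N+1 };
for (int k = 0; k < 7; k++)
    check_result(run_loop(counts[k]));   /* check_result compares against expected output */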
117
Black-box testing
Black-box tests are made from the
specifications, not the code.
Black-box testing complements clear-box.
May test unusual cases better.
118
Types of black-box tests
Specified inputs/outputs:
select inputs from spec, determine required
outputs.
Random:
generate random tests, determine
appropriate output.
Regression:
tests used in previous versions of system.
119
Evaluating tests
How good are your tests?
Keep track of bugs found, compare to
historical trends.
Error injection: add bugs to copy of code,
run tests on modified code.
120