Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Estimating the Worst-Case Energy Consumption of Embedded Software Ramkumar Jayaseelan Tulika Mitra Xianfeng Li School of Computing National University of Singapore 1 Motivation Conventional scheduling techniques give timing guarantees Processor cycles is the critical resource WCET of the tasks are required input Battery life is equally important for mobile devices Scheduling technique have to give energy guarantees Worst-Case Energy Consumption (WCEC) of the tasks are required input 2 Remotely Deployed Systems Local Station Sensor Network Available energy unevenly distributed among nodes Spatio-temporal scheduling benefits from WCEC 3 Energy-Based Guarantees Scheduling critical and non-critical tasks in a battery-operated system Non-critical tasks can be run only if energy constraints for critical tasks are satisfied Worst-case energy estimation is crucial 4 Reward-Based Scheduling Energy consumption Voltage Delay (1 / Voltage) Reward-based scheduling attempts to satisfy constraints on energy and timing Energy guarantee only if worst-case energy consumption of tasks are known 5 Outline Background Relation between WCET and Worst-case energy consumption Estimation technique: Simplified model Instruction cache and speculation Experimental results Conclusion 6 Background Power and energy are often used interchangeably Power is energy consumed per unit time Energy consumed during program execution E=P×t Approximation as P is also a function of time 7 E=P×T is an approximation In reality when a program executes Power Time Energy is the area under the curve E = ∫P(t)dt 8 WCEC versus WCET 21000 Total Energy Execution Time 5000 Energy(nano Joules) 19000 4900 18000 17000 4800 16000 4700 Execution Time(cycles) 20000 5100 15000 4600 14000 13000 4500 Program Inputs Full Input Space Expansion for a 5-element Insertion Sort program 9 Cannot Estimate WCEC from WCET Benchmark WCET×avg_power µJ Observed µJ isort 489.92 525.88 fft 12106.49 10260.86 fdct 138.20 105.57 ludcmp 131.76 119.33 matsum 972.03 1154.31 minver 93.61 80.80 bsearch 3.84 3.07 des 724.05 643.75 matmult 178.12 166.88 qsort 54.79 43.73 qurt 23.80 17.65 Possible underestimation using WCEC=WCET × power 10 WCEC versus WCET WCEC path need not be the same as the WCET path WCEC cannot be directly estimated from the WCET value 11 A closer look at Power Dynamic Power : Power Consumption due to switching of transistors Leakage Power: Power consumed independent of switching activity Dynamic power forms the bulk of power consumption in today’s processors 12 Dynamic Power Dynamic Power P=(1/2) × A × V2 × C × f V is supply voltage C is the capacitance of the circuit f is the frequency A is the activity factor V, C, f are independent of program execution Variation in P is due to the variation in A 13 Variation in Activity Factor (A) Not all parts of the processor are used in every cycle e.g., data-cache is used only for loads/stores Clock gating disables unused components Activity factor (A) varies during the execution of the program Model variation in A through static analysis 14 Switch-off Energy An inactive component cannot be fully switched off A certain portion of the peak energy is consumed even in idle cycles Switch-off energy is proportional to the number of idle cycles 15 Clock Energy and Leakage Energy Clock power: power consumed in clock distribution network Leakage power: power consumed due to leakage in transistors Clock energy and leakage energy are directly proportional to the execution time 16 Energy Components Summary Dynamic Energy Switch-off Energy Switching of transistors during execution Independent of execution time Energy consumed in unused components Depends on idle cycles Clock and Leakage energy Directly proportional to execution time 17 WCEC versus WCET 21000 Total Energy Execution Time 5000 Energy(nano Joules) 19000 4900 18000 17000 4800 16000 4700 Execution Time(cycles) 20000 5100 15000 4600 14000 13000 4500 Program Inputs Full Input Space Expansion for a 5-element Insertion Sort program 18 Our Analysis: Overview Operate on the control flow graph Estimate worst-case energy of basic blocks Formulate estimation for whole program as an integer linear programming (ILP) problem 19 ILP Formulation Input: Control flow graph of the program Objective function: Worst Case Energy = WCECB countB Need to estimate Worst-Case Energy Consumption( WCECB) for each basic block 20 Flow Constraints Inflow = Basic Block Execution Count = Outflow Bounds on maximum loop iterations B0 B1 B2 E0,1 = B0 = 1 E2,3 + E1,3 = B3 = 1 E0,1 + E2,1 = E1,2 + E1,3 = B1 E1,2 = E2,3 + E2,1= B2 Loop bound: E2,1 <= 100 B3 21 Worst-Case Energy of a Basic Block Processor Model Energy Components Instruction Specific Energy Pipeline Specific Energy 22 Processor Model I+1 I IF IBUF ID ISSUE I-1 I-4 ROB EX I-2 I-3 WB ALU CM MULT FPU 23 Pipelined Execution of Instructions ADD R1,R2,R3 MUL R4,R5,R6 SUB R7,R8,R9 CC ADD MUL SUB 1 IF 2 3 4 5 6 7 ID IS EX WB CM IF ID IS EX WB CM IF ID IS EX WB 8 CM Difficult to statically predict the energy consumption in each cycle 24 Pipelined Execution of Instructions ADD R1,R2,R3 MUL R4,R5,R6 SUB R7,R8,R9 CC ADD MUL SUB 1 IF 2 3 4 5 6 7 8 ID IS EX WB CM Stall Stall IF ID IS EX WB IF ID IS EX Difficult to statically predict the energy consumption in each cycle 25 Our Approach Determine the maximum energy consumed on a component by component basis Static analysis to determine the maximum energy consumed by a component in a specified interval 26 Execution of Instruction IF ID ISSUE EX WB CM 27 Instruction Specific Energy Energy consumed due to the sub-tasks associated with execution of an instruction e.g., register file access, ALU usage, etc. Depends on the type of executed instruction No correlation with execution time 28 Pipeline Specific Energy During program execution energy is consumed due to Switch-off power (idle cycles) Leakage power (every cycle) Clock network power (every cycle) Cannot be attributed to any instruction Energy consumed even in idle cycles 29 Energy Components Observation: Energy consumed can be separated out as Instruction Specific energy Energy associated with the execution of a particular instruction Independent of execution time Pipeline Specific energy Energy consumed in other components such as clock network, leakage etc. Related to execution time 30 Worst-case Energy of a Basic block energyBB dynamicBB switchoffBB leakageBB clock BB dynamicBB : Instruction-Specific Energy for BB switchoffBB , leakageBB and clockBB are energy consumed in unused components, leakage and clock network during WCETBB 31 Instruction Specific Energy Energy consumed due to switching activity generated by the instructions in BB Sum of energy consumed by individual instructions in BB dynamicBB instrBB dynamicinstr 32 Switch-off Energy Unused units consume 10% of peak energy Switch-off energy for a specific component (C) switchoff BB (C ) Idle _ cycles (C ) energy (C ) 0.1 switchoff BB (C ) (WCETBB uses BB (C )) energy (C ) 0.1 Switch-off energy for basic block BB switchoff BB switchoff BB (C ) Ccomponents 33 Clock Energy and Leakage Energy Clock Energy clockenergyBB clockenergycycle WCETBB Leakage Energy leakage _ energy BB leakage _ energycycle WCETBB 34 Overlap among basic blocks Time t1 B1 B2 B1 t2 t3 BB WCETBB t4 t5 B3 B3 35 Switch-off Energy Unused units consume 10% of peak energy Switch-off energy for a specific component (C) switchoffBB(C ) Idle _ cycles (C ) energy (C ) 0.1 switchoffBB(C ) (WCETBB usesBB(C )) energy (C ) 0.1 Switch-off energy for basic block BB switchoffBB switchoff BB (C ) Ccomponents 36 Instruction Cache Modeling Context based ILP formulation used in WCET analysis [Li et al RTSS 2004] Basic block divided into memory blocks A context comprises of mapping each of these memory blocks to hit/miss Estimate the worst-case energy of each context taking into account main memory access energy 37 Modeling Branch miss-prediction Time t1 BB’ t2 BB’ BX t3 BX BB BB 38 Objective function Energy i 1 j i C (i ) energyj i (c, ) countj i (c, ) N energyj i (m, ) countj i (m, ) count(c,ω) is the number of times the basic block Bi is executed with path from Bj and the branch is predicted correctly count(m,ω) is similarly defined where the branch is misspredicted In a similar manner energy(c,ω) and energy(m,ω) are defined The ILP problem is solved to generate values for count using constraints similar to WCET analysis 39 Results Platform: Simplescalar toolset Modified WCET analysis tool [Li et al RTSS 2004] to estimate worst-case energy Energy values for processor components derived from parameterized models in Wattch ILP problem is solved using CPLEX 40 Results Compare estimated WCEC against the observed values for eleven benchmarks Observed values are obtained using Wattch power simulator Actual inputs producing WCEC is unknown Manually select inputs that might produce WCEC 41 Styles of Clock Gating Simple: Peak power is consumed even if there is one access to a specific component Ideal : Power consumed is proportional to the number of ports accessed Realistic: Same as ideal but unused components consume switch-off power 42 Results Simple Clock Gating Ideal Clock Gating Benchmarks Est(µJ) Obs(µJ) Ratio Est(µJ) Obs(µJ) Ratio isort 524.95 455.94 1.15 468.85 422.76 1.11 fft 11057.50 9185.39 1.20 9600.99 8586.49 1.12 fdct 99.31 88.79 1.11 89.92 83.63 1.08 ludcmp 115.39 100.32 1.15 98.75 92.77 1.06 matsum 1227.37 994.11 1.23 1012.83 929.94 1.09 minver 74.91 64.15 1.17 63.66 59.61 1.07 bsearch 3.51 3.07 1.14 2.54 2.40 1.06 des 613.16 553.74 1.10 546.41 518.22 1.05 matmult 172.39 136.93 1.26 149.70 132.08 1.13 qsort 39.50 33.84 1.17 34.90 31.16 1.12 qurt 16.36 12.97 1.26 13.98 11.91 1.17 Results for ideal clock gating more accurate than simple because of distribution of accesses 43 Results Realistic Clock Gating Ideal Clock Gating Benchmarks Est(µJ) Obs(µJ) Ratio Est(µJ) Obs(µJ) Ratio isort 596.93 525.88 1.14 468.85 422.76 1.11 fft 13631.21 10260.86 1.33 9600.99 8586.49 1.12 fdct 121.65 105.57 1.15 89.92 83.63 1.08 ludcmp 139.75 119.33 1.17 98.75 92.77 1.06 matsum 1397.72 1154.31 1.21 1012.83 929.94 1.09 minver 90.95 80.80 1.13 63.66 59.61 1.07 bsearch 3.81 3.07 1.24 2.54 2.40 1.06 des 715.58 643.75 1.11 546.41 518.22 1.05 matmult 212.94 166.88 1.28 149.70 132.08 1.13 qsort 49.84 43.73 1.14 34.90 31.16 1.12 qurt 21.95 17.65 1.24 13.98 11.91 1.17 Results for ideal clock gating more accurate than realistic because of conservative WCET estimation 44 Conclusion Static worst-case energy estimation technique that takes into account pipelining, instruction cache and branch prediction Future work Validation using commercial processors Explore the possibility of providing thermal guarantees 45 Execution of an Add Instruction ADD IF I-Cache Access ADD ID Instruction Decode + Rename Logic ADD ADD ADD ADD ISSUE EX WB CM Wakeup + Selection logic Register File Read + Add unit access Result Bus ROB-retire + Register file Update 46 Instruction Specific Energy Each Component Accessed once Selection logic maybe accessed multiple times Instruction Specific Energy is dynamicBB selection _ powercycle wcetBB instrBB dynamicinstr 47