Low Power Compilation Techniques on VLIW Architectures
(Techniques de compilation pour la gestion et l'optimisation de la consommation d'énergie des architectures VLIW)
Ph.D. thesis, Gilles POKAM
July 15, 2004
Thesis funded through a CIFRE grant from STMicroelectronics

Motivation
- root causes of increased performance:
  - higher clock frequency: a rise of ~30% every two years makes programs run faster
  - higher integration density: process scaling following Moore's law increases architectural complexity
- power consumption is quickly becoming a limiting factor

[Figure: power density (W/cm2) of general-purpose processors vs. year, 1970-2010, from the 4004, 8008, 8080, 8085 and 8086 through the 286, 386, 486, Pentium and P6; density has passed "hot plate" levels and, extrapolated from today (2004), trends toward that of a nuclear reactor]

Power as a design cost constraint in embedded systems
- embedded system examples: PDAs, cell phones, set-top boxes, etc.
- key points affecting design cost include:
  - average energy (battery autonomy)
  - heat dissipation (packaging cost)
  - peak power (component reliability)
- in this thesis we are concerned with total power consumption

Agenda
- Motivation
- Thesis objectives
- Program analysis
- Power consumption
- ILP compilation analysis
- Adaptive cache strategy
- Adaptive processor data-path
- Conclusions

The goals of this thesis
- to understand the energy issues involved when compiling for performance on VLIW architectures
- to devise hardware/software solutions that improve energy efficiency

Why VLIW architectures?
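As context for the VLIW execution model this section introduces, here is a toy sketch (all numbers and names illustrative, not the Lx processor's) of how a VLIW machine's IPC follows directly from the compiler's static schedule: the compiler packs independent operations into fixed-width bundles, the machine issues one bundle per cycle, and empty issue slots become nops.

```python
# Toy VLIW schedule (illustrative only): IPC is simply
# (useful operations executed) / (bundles executed), so it is fixed
# by how well the compiler fills each bundle, not by the hardware.

ISSUE_WIDTH = 4  # assumed 4-issue machine

def ipc(bundles):
    """bundles: list of bundles, each a list of ops (possibly nops)."""
    ops = sum(len([op for op in b if op != "nop"]) for b in bundles)
    return ops / len(bundles)

# A well-filled schedule approaches the 4-issue peak...
dense = [["add", "mul", "ld", "br"], ["add", "add", "mul", "st"]]
# ...while a schedule the compiler could not fill wastes issue slots.
sparse = [["add", "nop", "nop", "nop"], ["ld", "nop", "nop", "nop"]]

assert all(len(b) <= ISSUE_WIDTH for b in dense + sparse)
print(ipc(dense))   # 4.0
print(ipc(sparse))  # 1.0
```

This is why, on a statically scheduled machine, the compiler alone determines the exploited ILP.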
- popular in embedded systems:
  - Philips TriMedia processor
  - Texas Instruments TMS320C62xx
  - Lx processor (HP/STMicroelectronics)
- provide a power/performance alternative to general-purpose systems
- statically scheduled processors: the compiler is responsible for extracting instruction-level parallelism (ILP)

Research methodology
- our analysis standpoint lies in the compiler; we therefore consider program analysis as a basis for exploring energy reduction techniques
- power also concerns the underlying micro-architecture; we therefore also consider matching the hardware to the software to reduce energy consumption

Thesis contributions
1. Program analysis: a methodology for characterizing the dynamic behavior of programs at static (compile) time
2. VLIW energy issues: a heuristic for understanding the energy issues involved when compiling for ILP
3. Hardware/software matching: adaptive compilation schemes targeting
   1. the cache subsystem
   2. the processor data-path

Thesis experimental environment
- Lx VLIW processor:
  - 4-issue width
  - 64 GPR, 8 CBR
  - 4 ALUs, 2 MULs, 1 LSU, 1 BU
  - 32KB 4-way data cache, 32B data cache block size
  - 32KB 1-way instruction cache, 64B instruction cache line size
- power model provided by STMicroelectronics
- benchmarks:
  - MiBench suite (e.g. fft, gsm, susan)
  - MediaBench suite (e.g. mpeg, epic)
  - PowerStone suite (e.g. summin, whetstone, v42bis)

Why do we need to analyze programs?
- knowledge of the dynamic behavior of a program is essential to determine which program region may benefit most from an optimization
- programs tend to execute as a series of phases, each with varying dynamic behavior [Sherwood and Calder, 1999]
- a phase can be likened to a program path that occurs repeatedly
- exposing the most frequently executed program paths (hot paths) to the compiler may help discriminate among power/performance optimizations

Our approach to program path analysis
- whole-program instrumentation ([Larus, PLDI 2000]) with main focus on basic block regions
- a signature to differentiate among dynamic instances of the same region
- program paths processed with a suffix array to detect all occurrences of repeated sub-paths
- heuristics to select hot paths among the sub-paths that appear repeatedly in the trace

Approach overview: detecting occurrences of repeated sub-paths
- a dynamic signature feeds a suffix array; a suffix sorting algorithm based on KMR [Karp, Miller and Rosenberg, 1972] detects all occurrences of repeated sub-paths

Hot path selection
- not all repeated sub-paths are of interest:
  - local coverage: captures the local behavior of a region
  - global coverage: captures the weight of a region in the program
  - reuse distance: average distance between consecutive accesses to a region

Results summary

  Bench     | % hot paths | Local cov. (% exec instr.) | Glob. cov. (% exec instr.) | Reuse dist. (# of BB)
  dijkstra  |  2.81       |   0.09                     | 47                         |  1.74
  adpcm     |  5.88       | < 0.005                    | 90                         |  0.00
  blowfish  | 27.01       |   0.06                     | 24                         | 85.00
  fft       | 11.7        | < 0.005                    |  7                         |  4.21
  sha       | 20.0        |   0.06                     | 72                         |  0.75
  bmath     | 15.22       |   0.05                     | 37                         | 19.21
  patricia  |  5.85       |   0.15                     | 65                         | 24.84

Back to basics
- Power = 1/2 * C_L * V_DD^2 * a * f + V_DD * I_leakage
  (first term: dynamic power; second term: static power)
- [Figure: technology trend [SIA, 1999]: dynamic power accounts for ~90% of the total in current technology vs. ~50% in future technology, with static power growing accordingly from ~10% to ~50%]

Software opportunities for power reduction
- dynamic power (1/2 * C_L * V_dd^2 * a * f), common techniques:
  - clock gating for activity reduction
  - power supply voltage scaling
  - frequency scaling
- static power (V_dd * I_leak), common techniques:
  - power supply voltage scaling

Problem summary
- we want to understand under which conditions compiling for ILP may degrade energy
- the main motivation comes from the relation between power growth and ILP: power grows with IPC and with architectural complexity
- for the rest of this study the micro-architecture is assumed fixed; only the VLIW compiler can be modified

Metric used
- energy and performance must be considered jointly [Horowitz] to balance program slowdown against energy reduction
- performance-to-energy ratio (PTE): PTE = performance / energy = IPC / E, with E aggregated from the per-basic-block energies Energy_BB over the executed cycles Cycle_BB
- goals:
  - compare two instances of the same program at the software level
  - emphasize the range of performance (IPC) values that may degrade energy for a given ILP transformation
- if the energy growth is larger than the obtained performance improvement, the resulting PTE is degraded

Energy model
- the execution of a bundle w_n dissipates an energy EPB_wn:
  EPB_wn = E_c + IPC_wn * E_op + m_p * E_s + l_q * E_miss
  (E_c: energy base cost; IPC_wn * E_op: energy due to executing the bundle; m_p * E_s: energy due to D-cache misses; l_q * E_miss: energy due to I-cache misses)
- for the loop-intensive kernels we consider, the cache-miss terms are negligible, leaving EPB_wn ~ E_c + IPC_wn * E_op

We consider the hyperblock transformation
- what is a hyperblock? starting from a hammock region R (a branch br over alternative basic blocks), construct predicated BBs out of the region of BBs into a hyperblock H, and correct the effect of eliminating branch instructions by adding compensation code
- why hyperblocks? most optimizations do not generate extra work, so optimizing for performance = optimizing for power; hyperblocks increase the instruction count, so how does this affect energy?

Tradeoff analysis
- transformation heuristic: PTE_H >= PTE_R translates into a lower bound on IPC_H, a function of IPC_R (through coefficients a and b) and of c, the term capturing the energy impact of the added instructions
- influence of c on IPC_H:
  - c < 0: extra work due to compensation code
  - c = 0: no degradation, no benefit
  - c > 0: optimal configuration
- c is computed from m (the number of BBs in R), N (the number of operations in R or H), n (the number of bundles in R or H), f (the execution frequency) and the per-operation energy E_op

Conclusions
- the heuristic shows a 17% improvement on a small subset of PowerStone benchmarks
- improvement across all benchmarks is restricted by:
  - available ILP: for a given IPC value, the ILP transformation must yield a much higher IPC (e.g. the case c < 0)
  - machine overhead: a small IPC improvement has no impact on energy whenever machine overhead dominates (e.g. c <= 0)
- suggested research directions:
  - better use of available ILP via knowledge of phase execution behavior (hot program paths)
  - better management of machine overhead via matching the architecture to the requirements of a program region

Why cache?
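Why cache configuration matters for both power components can be made concrete with a toy model (coefficients illustrative, not the STMicroelectronics power model): a set-associative lookup probes all ways in parallel, so per-access dynamic energy grows with associativity, while leakage grows with the amount of powered capacity.

```python
# Toy cache energy model (assumed coefficients only): dynamic energy
# scales with the number of ways probed per access; static power
# scales with the number of powered KB.

E_WAY_READ = 1.0    # dynamic energy units to probe one way (assumed)
LEAK_PER_KB = 0.5   # static power units per powered KB (assumed)

def access_energy(assoc):
    # all ways are read in parallel on a lookup
    return assoc * E_WAY_READ

def leakage_power(size_kb):
    return size_kb * LEAK_PER_KB

# A 32K 4-way lookup switches twice the cells of a 32K 2-way lookup,
# and powering down half of a 32K cache halves its leakage:
print(access_energy(4), access_energy(2))     # 4.0 2.0
print(leakage_power(32), leakage_power(16))   # 16.0 8.0
```

Under this model, lowering associativity attacks dynamic power and shrinking the powered size attacks static power, which is exactly the pair of knobs the adaptive scheme below exposes.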
- caches are highly power-consuming (dynamic and static) components:
  - typically 80% of the total transistor count
  - about 50% of the total chip area
- caches usually appear in a monolithic configuration in embedded systems (one configuration per application)
- varying program phase behavior suggests that no single best cache size exists for a given application
- matching the cache configuration to program behavior on a per-phase basis reduces the number of active and passive transistors, and hence dynamic and static power

Two major proposals
- Albonesi [MICRO'99]: selective cache ways; disable/enable cache ways (e.g. a 32K 4-way becomes a 16K 2-way)
  - problem: disabling cache ways loses data; the previous state of the cache cells cannot be recovered!
- Zhang et al. [ISCA'03]: way concatenation; reduce cache associativity while still maintaining full cache capacity (e.g. ways of a 32K 4-way are concatenated into a 32K 2-way)
  - problem: data coherency across different cache configurations!

Program region analysis
- example: summin (PowerStone); program regions are sensitive to cache size and associativity
- [Figure: per-region best configurations for summin span 32K 4-way, 32K 2-way, 32K 1-way, 16K 2-way, 16K 1-way and 8K 1-way]
- key idea: vary associativity and size according to the characteristics of program regions

Solution for varying cache size
- how do we keep the data? unaccessed cache ways are put in a low-power ("drowsy") mode
- drowsy mode [Flautner, ISCA'02] scales down V_dd while preserving the memory cell state
- advantage: static power is reduced as a side effect of scaling down V_dd
- disadvantage: a 1-cycle delay to wake up a drowsy cache way!
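The wake-up/leakage tradeoff above can be sketched with a toy simulation (policy and parameters assumed, not measured): ways not touched recently leak at a reduced rate but cost one extra cycle on their next access.

```python
# Toy model of drowsy cache ways (assumed parameters): a drowsy way
# keeps its contents at reduced Vdd, leaks less, and pays a one-cycle
# wake-up penalty on the next access.

NUM_WAYS = 4
WAKEUP_PENALTY = 1    # cycles to restore full Vdd (per the 1-cycle delay)
DROWSY_LEAK = 0.25    # leakage relative to an awake way (assumed)

def simulate(way_accesses):
    """way_accesses: sequence of way indices touched, one per cycle."""
    awake = [False] * NUM_WAYS
    extra_cycles = 0
    leak_units = 0.0
    for way in way_accesses:
        if not awake[way]:
            extra_cycles += WAKEUP_PENALTY  # drowsy hit: wake the way first
            awake[way] = True
        # each cycle, every way leaks at its current mode's rate
        leak_units += sum(1.0 if a else DROWSY_LEAK for a in awake)
        # toy policy: only the way just used stays awake
        awake = [i == way for i in range(NUM_WAYS)]
    return extra_cycles, leak_units

# A phase that keeps hitting one way pays a single wake-up and lets
# the other three ways leak at the reduced drowsy rate:
penalty, leakage = simulate([0, 0, 0, 0])
print(penalty, leakage)   # 1 7.0  (vs 16.0 leak units with all ways awake)
```

The simulation mirrors the result summary below: large static savings, with occasional wake-up penalties whose cost depends on the way allocation policy.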
Solution for varying the degree of associativity
- maintain data coherency via cache line invalidation:
  - the tag array is kept active to monitor write accesses
  - the cache controller invalidates cache lines holding an old copy on a write access
- dynamic energy is saved because lower-associativity caches access fewer memory cells than higher-associativity ones (reduction of the switching activity "a")

Results summary
- three cache designs are compared:
  1. no adaptive cache scheme
  2. adaptation on a per-application basis
  3. adaptation on a per-phase basis (our scheme)
- 6 out of 8 applications are sensitive to cache size and associativity, resulting in dynamic power reductions of up to 12%
- static energy is reduced drastically: on average 80% across all benchmarks
- performance can suffer from the one-cycle wake-up delay: two applications show ~30% degradation, of which 65% is due to the one-cycle delay needed to wake up a drowsy cache way; a better cache way allocation policy can improve this result

Motivation
- 32-bit embedded processors are becoming popular, with a confluence of integer scalar programs and multimedia applications on modern embedded processors
- multimedia applications tend to operate on 8-bit (e.g. video) or 16-bit (e.g. audio) data; such operands typically make up 50% of instructions in MediaBench [Brooks et al., HPCA'99]
- detecting the occurrence of these narrow-width operands on a per-region basis may allow matching the processor data-path width to the bit-width of a program region

Techniques for detecting narrow-width operands
- dynamic approach: detection on a cycle-by-cycle basis by means of hardware (e.g. zero-detection logic); clock-gate the insignificant bytes to save energy
  - problem: efficient for general-purpose systems, but the required hardware cost is often not affordable for embedded systems
  - related work: Brooks et al., HPCA'99; Canal et al., MICRO'00
- compiler approach: use static data-flow analysis to compute bit-width ranges for program variables; re-encode program variables with smaller bit-widths to save energy
  - problem: static analysis limits the opportunity for detecting narrow-width operands, and re-encoding must preserve program correctness; too conservative!
  - related work: Stephenson et al., PLDI 2000

Program region analysis
- in adpcm (at basic block granularity), the occurrence of dynamic narrow-width operands at the basic block level can be high
- key idea: adapt the underlying processor data-path width to the dynamic bit-width of the region

Our approach
- from the dynamic approach: avoid relying on hardware support to detect occurrences of narrow-width operands; instead take advantage of runtime information to expose dynamic narrow-width operands to the compiler
- from the compiler approach: avoid relying on static data-flow analysis to discover bit-width ranges (too conservative!); instead use the compiler to decide when to switch between normal and narrow-width mode and vice versa (reconfiguration instructions)
- the result: a speculative narrow-width execution mode

Speculative narrow-width execution: micro-architecture
- recovery scheme: simple comparison logic at the execute stage; upon a mis-speculation the pipeline is flushed and the instruction is replayed with the correct mode; the recovery scheme may impact both performance and energy
- static energy saving: unused register file slices are put in a low-power (drowsy) mode to reduce static energy
- dynamic energy saving: an adaptive register file that can be viewed as an 8/16/32-bit register file, plus data-path clock gating (pipeline latches, ALU) when a narrow execution mode is encountered
- [Figure: the register file is split into 8-bit, 8-bit and 16-bit slices controlled by a slice-enable signal; bypass and write-back paths operate in 8/16/32-bit mode]

Speculative narrow-width execution: compiler support
- regions are rarely composed of narrow-width operands only:
  - address instructions (AI) usually require a larger bit-width; split an AI into an address calculation and a memory access via an accumulator register
  - schedule instructions within a region such that those having one 32-bit operand are moved around
  - insert reconfiguration instructions at each region frontier

Results summary
- the impact of the recovery scheme varies with the mis-speculation penalty and the availability of narrow-width operands:
  - with a 5-cycle penalty and 80% narrow-width availability, programs show no performance degradation
  - with a 25-cycle penalty and 60% narrow-width availability, IPC degradation reaches 30%
- overall, on the 13 applications from PowerStone, data-path dynamic energy is reduced by 17% on average
- we achieve a 22% reduction in register file static energy

Conclusions
- power consumption is a matter of both software and hardware:
  - software, because program execution causes switching transitions (dynamic power)
  - hardware, because power consumption grows with architectural complexity
- hardware/software techniques must be used jointly to provide an effective basis for reducing power consumption
- this thesis has provided arguments in favor of profile-driven, compiler/architecture symbiosis approaches that reduce power consumption by:
  - detecting the occurrences of program phases/regions
  - discriminating the optimizations that best benefit a phase/region
  - adapting the micro-architecture to the behavior of a phase/region

Future work
- analogy between ILP and DLP: investigate the energy issues involved in SIMD compilation
  - need for a SIMD energy model
  - measure the impact of overhead instructions (pack/unpack)
- catching different program behaviors with a hot path signature will allow studying the interplay of different reconfiguration techniques to save energy:
  - energy impact of SIMD compilation with an adaptive I-cache
  - effectiveness of SIMD compilation in exploiting narrow-width operands (speculative vectorization techniques?)
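The narrow-width opportunity exploited above, and revisited in the SIMD future work, can be illustrated with a small profiling sketch (hypothetical code, not the thesis infrastructure): classify each runtime value by the smallest data-path width that can represent it, as zero-detection logic or trace instrumentation would.

```python
# Sketch of narrow-width operand profiling (illustrative only):
# classify each 32-bit value by the smallest 8/16/32-bit data-path
# width that can hold it, mimicking what zero-detection hardware
# observes cycle by cycle.

def required_width(value):
    """Smallest of 8/16/32 bits representing the unsigned 32-bit value."""
    v = value & 0xFFFFFFFF
    if v < (1 << 8):
        return 8
    if v < (1 << 16):
        return 16
    return 32

def narrow_fraction(values, width=16):
    """Fraction of operands fitting in `width` bits or fewer."""
    fits = sum(1 for v in values if required_width(v) <= width)
    return fits / len(values)

# 8-bit video samples and 16-bit audio samples are narrow; addresses
# typically are not, which is why address instructions are split off.
trace = [200, 41, 65_000, 12, 0x0804_8000, 7, 300, 0xFFFF_FFFF]
print(narrow_fraction(trace))  # 0.75
```

A per-region fraction like this is what decides whether inserting reconfiguration instructions and risking mis-speculation replays pays off.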