Closing the Power Gap between ASIC and Custom
David Chinnery, Kurt Keutzer

Outline
Motivation for focusing on reducing ASIC power
The power gap between ASIC and custom
Where does the power go?
What can we do about it?
Conclusions on automating low power techniques

Why power?
Battery life is limited by power (e.g. laptop, mobile phone)
Cost for packaging and cooling increases rapidly with power dissipation (e.g. plastic vs. ceramic package, heatsink, fan)
Higher temperatures degrade performance and reliability
– Circuits are slower, with more leakage, at higher temperature
– Less reliable due to increased rate of electromigration
Increasing integration increases power demand in portable applications (e.g. mp3 player/PDA/mobile phone combined)
Performance is now limited by power even for high end microprocessors

Power of high performance chips has increased
Further voltage scaling may be limited
As device dimensions (W, L, Tox) are scaled down by a factor k, for high performance:
– If supply Vdd and threshold voltage Vth are fixed, then power/unit area is proportional to k^3
– If Vdd and Vth are scaled down linearly, then power/unit area is proportional to k^0.7
[Figure: power/unit area (W/cm2), 1 to 1000, vs. scaling factor k (1/um), 1 to 10, for microprocessors and digital signal processors; data from ISSCC chips 1982-2002] [Kuroda OYO BUTURI 2004]

Impact of voltage scaling on power
Major components of power: Ptotal = Pdynamic + Pleakage
Dynamic power is due to switching of capacitances
– Reducing Vdd gives quadratic reduction in Pdynamic
But transistor drive current depends on Vdd [Chen in Trans. On Electron Devices 1997]
– Must reduce Vth to maintain drive current
But reducing Vth increases subthreshold leakage current, which is the major contributor to Pleakage
Must look for other ways to reduce power
[Figure: inverter driving Cload between Vdd and 0V, with thresholds Vth,p and Vth,n, showing dynamic power and subthreshold leakage]

Automate low power techniques
Custom designers can try to optimize the design at all levels
Electronic design automation (EDA) tools for ASICs
– Most of the design optimization is high level
– Fast time-to-market and lower design cost
– Increasingly important to reduce design cost for larger chips
What is the power gap between (automated) ASIC design and custom design?
– We need to characterize the contributing factors
– Can we close the power gap?
– Identify custom techniques that can be used in an EDA flow

What is our metric for power?
Power
– Fixed performance constraint (clock frequency or throughput, e.g. 30 frames/s for MPEG2)
– Reduce the power and meet the performance constraint
Energy efficiency
– No performance constraint
– Throughput/unit power, 1/(P × T × CPI), e.g. MIPS/mW
– Cycles per instruction (CPI) accounts for impact of architectural choices (e.g. stalled pipeline stages)
– Energy/operation is the inverse of throughput/unit power
– Maximize throughput/unit power or minimize energy/operation

What is the power gap? ARM cores
×2 to ×3 gap between custom and hard macro ARMs
×1.3 to ×1.4 gap between ARM7TDMI-S and ARM7TDMI
×3 to ×4 overall from synthesizable to custom ARMs
[Figure: Comparison of Custom and Hard Macro ARM Implementations; Dhrystone 2.1 MIPS/mW (0.0 to 3.0) vs. process technology (0.60 to 0.13 um); labeled points include XScale, StrongARM and Burd]

What is the power gap? DCT/IDCT blocks
×4 to ×7 gap between discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) block implementations, after scaling linearly for technology [Fanucci ICECS 2002]
– We assumed power reduces linearly with technology
To get 30 frame/s MPEG2 with a general purpose processor would require two ARM9 cores and would consume ×15 power [Fanucci ICECS 2002]
– Application-specific hardware substantially reduces power

Breakdown of power by functionality
Typical breakdown of on-chip power consumption for an embedded microprocessor:
Clock                       20% to 40%
Memory                      20% to 40%
Control + datapath          40% to 60%
Input/output to off-chip    ~5%
Most of the power is in datapath, control, clock tree and memory
– Techniques focus on reducing this power
– Several companies provide custom memory for ASIC processes, so we won’t discuss memory here

Summary of factors' effect on active power
Automated designs are higher power than custom because of …
                                                 ASIC design quality
Factor                                           typical   excellent
Microarchitecture (pipelining, parallelism)      ×2.6      ×1.3
Clock gating and power gating                    ×1.6      ×1.0
Logic design                                     ×1.2      ×1.0
High speed logic styles (DCVSL, PTL, domino)     ×1.3      ×1.3
Technology mapping                               ×1.4      ×1.0
Cell sizing and wire sizing                      ×1.6      ×1.1
Voltage scaling, multi-Vth, multi-Vdd            ×4.0      ×1.0
Floorplanning and placement                      ×1.5      ×1.1
Process variation and process technology         ×2.6      ×1.2
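As a small worked illustration of the energy-efficiency metric defined earlier in the talk (throughput per unit power, with energy/operation as its inverse), the sketch below converts MIPS/mW into energy per instruction. The function name and the 2.0 MIPS/mW input are illustrative assumptions, not values taken from the slides.

```python
def energy_per_instruction_nj(mips_per_mw: float) -> float:
    """Convert throughput per unit power (MIPS/mW) to energy per instruction (nJ)."""
    # 1 MIPS = 1e6 instructions/s and 1 mW = 1e-3 J/s,
    # so 1 MIPS/mW = 1e9 instructions per joule.
    instructions_per_joule = mips_per_mw * 1e9
    # Energy/operation is the inverse of throughput/unit power.
    joules_per_instruction = 1.0 / instructions_per_joule
    return joules_per_instruction * 1e9  # joules -> nanojoules


# Hypothetical input value, chosen only to show the conversion.
example_nj = energy_per_instruction_nj(2.0)
print(f"2.0 MIPS/mW corresponds to {example_nj:.2f} nJ per instruction")
```

Because the two metrics are reciprocal, a ×3 gap in MIPS/mW is the same ×3 gap in energy per instruction.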
Microarchitecture
Leverage for voltage scaling and sizing
Increase throughput/cycle to allow Vdd reduction
Pipelining inserts registers, increasing throughput. Limited by
– Reduction in instructions/cycle (1/CPI) due to branch misprediction, waiting to read or write memory, etc.
– Power and delay for registers, data forwarding logic, and branch prediction
Parallelism increases throughput in exchange for increased area. Limited by
– Routing, multiplexing, control overheads
[Figure: pipeline stages (instruction fetch, instruction decode, ALU, memory access, write back) before and after inserting registers]

Microarchitecture: pipelining model
Leverage for voltage scaling and sizing
Pipeline power model [Harstein 2003]:
– n stages, 1.1 for latch growth vs. n, 0.05 for register power
Minimum stage delay:
– ASIC t_pipelining overhead of 10 FO4 (register delay) + 10 FO4 (imbalance)
– Custom t_pipelining overhead of 2.6 FO4 total, same t_combinational of 175 FO4
CPI penalty 0.025/stage for custom, and 0.05/stage for ASICs
Add fits for dynamic and leakage power with voltage scaling and sizing
At 40 FO4 delay constraint (500MHz for Leff=0.1um), ASIC is ×2.6 worse
– 0.050 vs. 0.019 => ×2.6
[Figure: 1/(energy/operation) for ASIC and custom]

Microarchitecture
Leverage for voltage scaling and sizing
Custom IDCT: pipelining to reduce Vdd [Xanthapoulos JSSC’99]
– With pipeline: Vdd=1.32V, 20% power overhead
– Without pipeline: Vdd=2.2V to meet throughput
Parallel datapaths [Bhavnagarwala IEEE Trans. VLSI’00]
– ×2 to ×4 reduction in power by reducing Vdd, made possible by increasing throughput with parallel datapaths
Microarchitecture speed gap is ×1.8 (typical) to ×1.3 (excellent)
– At a tight delay constraint, this corresponds to about ×2.6 to ×1.3 worse power due to higher Vdd, lower Vth, and wider gates to compensate

Clock gating: ×1.6 to ×1.0
Clock signal has high activity, 2. Logic is lower activity, ~0.1.
Turn off clocks to inactive modules
– Some DCT/IDCT registers are active < 3% of time; clock gating and avoiding computation reduces power by ×10 [August SOC’01]
– Typical savings are up to ×1.6 power reduction
Power minimization tools automatically insert gated clocks
Designer can make microarchitectural/algorithm decisions
– E.g. reduce precision for DCT/IDCT coefficients
– Precomputation control signals reduces power by ×1.4 to ×3.3 [Hsu ISLPED’02]
ASICs can do this
[Figure: add and shift units driven by the clock, then with clock gating inserted using select_add and select_shift]

Power gating reduces leakage in standby
Turn off leakage path in inactive modules
– May need to preserve the state registers
Can reduce standby leakage by 3 orders of magnitude [Mutoh JSSC’95]
Other approaches
– reverse biasing the substrate
– setting input vectors to low leakage states, gives ×1.4 leakage reduction [Lee DAC’03]
Just now getting ASIC methodology support
– Need large sleep transistors to turn off power
– Sleep transistors reduce available supply voltage

High speed logic styles
Leverage for voltage scaling and sizing
Low power designs use mostly static CMOS logic
– Static CMOS logic is low leakage, robust
– PMOS pullup series transistors are slow
Faster custom logic styles speed up critical paths
Custom can use slack from higher speed (×1.4) to reduce power by lowering Vdd
– ASIC power ×1.3 worse than custom at a tight delay constraint due to logic style
32-bit adder comparison of power and delay [Tiwari DAC’98]: domino 22% higher, static 25% lower
[Figure: static CMOS, DCVSL, PTL and domino gates; in static CMOS the PMOS transistors in series are slow, with larger capacitance]
Technology mapping: ×1.4 to ×1.0
Technology mapping tools don’t target low power
– We found that targeting minimum area for multipliers can result in ×1.3 power; delay is a poor choice
Technology mapping techniques to reduce active power: ×1.0
– ASICs can do as well as custom, if tools improve
[Figure: equivalent logic with lower activity; node switching activities of 1/2, 3/8 and 7/32]

Cell sizing and wire sizing: ×1.6 to ×1.1
×1.35 power reduction on Xtensa processor at 325MHz by (mostly sizing) power minimization with Design Compiler and 0.13um library [internship at Tensilica]
Can do better than Design Compiler (DC) with cell sizing via linear program (LP) (global optimization vs. greedy “pin-hole” optimization), about ×1.1 to ×1.2 power reduction [Chinnery, Keutzer will be at ISLPED’05]

                                             Power (mW)
ISCAS'85   # logic            Minimum        1.1 Tmin         1.2 Tmin
Netlist    levels   # cells   Delay (ns)     DC      LP       DC      LP
c17          4        10      0.094          1.11    1.08     0.86    0.76
c432        24       259      0.733          2.78    2.25     2.22    1.76
c499        25       644      0.701          5.83    4.62     4.98    3.76
c880        23       484      0.700          3.37    3.49     2.83    2.61
c1355       27       764      0.778          6.88    5.53     5.97    4.12
c1908       33       635      0.999          3.26    3.11     2.67    2.44
c2670       23      1164      0.649          9.23    8.63     8.08    6.90
c3540       36      1283      1.054          6.69    5.79     5.60    4.70
c5315       34      1956      0.946         10.39    9.51     8.82    7.81
c6288      113      3544      3.305          6.91    6.07     6.08    4.78
c7552       31      2779      0.847         18.02   16.65    15.60   13.63
Average savings vs. Design Compiler:                 10%              16%

Cell sizing and wire sizing: ×1.6 to ×1.1
Cell libraries lack fine-grained sizes and skewed P:N drives
– [Hurat SNUG’01] Generate new cells: ×1.2 power reduction and ×1.15 faster for bus controller, ×1.4 MHz/mW
Simultaneous buffer and wire sizing reduced clock tree power by ×2.7 [Gong ISLPED’96]
– ×1.1 to ×1.2 reduction in total power
– Not available for ASIC interconnect yet
Up to ×1.6 gap due to cell sizing and wire sizing; can reduce to ×1.1 using a library with finely-grained sizes, a good sizing tool, and design-specific cells
[Figure: optimizing transistor sizes in a cell between Vdd and GND]
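The average savings quoted under the ISCAS'85 table above can be reproduced from the per-circuit power numbers. The sketch below assumes the average is the plain mean of each circuit's fractional saving (DC - LP)/DC; that averaging method is an assumption, but it does give the quoted 10% and 16% after rounding.

```python
# Power (mW) from the table: (DC at 1.1 Tmin, LP at 1.1 Tmin, DC at 1.2 Tmin, LP at 1.2 Tmin)
power_mw = {
    "c17":   (1.11, 1.08, 0.86, 0.76),
    "c432":  (2.78, 2.25, 2.22, 1.76),
    "c499":  (5.83, 4.62, 4.98, 3.76),
    "c880":  (3.37, 3.49, 2.83, 2.61),
    "c1355": (6.88, 5.53, 5.97, 4.12),
    "c1908": (3.26, 3.11, 2.67, 2.44),
    "c2670": (9.23, 8.63, 8.08, 6.90),
    "c3540": (6.69, 5.79, 5.60, 4.70),
    "c5315": (10.39, 9.51, 8.82, 7.81),
    "c6288": (6.91, 6.07, 6.08, 4.78),
    "c7552": (18.02, 16.65, 15.60, 13.63),
}


def average_saving(dc_index: int, lp_index: int) -> float:
    """Mean over circuits of the fractional power saving of LP relative to DC."""
    savings = [(row[dc_index] - row[lp_index]) / row[dc_index]
               for row in power_mw.values()]
    return sum(savings) / len(savings)


saving_1p1 = average_saving(0, 1)  # delay constraint of 1.1 Tmin
saving_1p2 = average_saving(2, 3)  # delay constraint of 1.2 Tmin
print(f"1.1 Tmin: {100 * saving_1p1:.1f}%  1.2 Tmin: {100 * saving_1p2:.1f}%")
```

Note that c880 at 1.1 Tmin is the one case where the LP result is higher power than Design Compiler, so its saving is negative and pulls the average down slightly.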
Dynamic supply and substrate biasing: ×4.0 to ×1.0
Change Vdd based on processor load [Burd ISSCC 2000]
– ×10 more energy efficient at low performance [Burd ISSCC’00]
– Adaptive voltage scaling with the ARM11 gives ×1.7 power reduction for voice, SMS, web applications [National Semiconductor, ARM ’02]
Reduce Vdd and bias substrate to lower Vth
– ×1.7 reduction in power, same speed [Hamada CICC’98]
– Increase Vth in standby to reduce leakage
These are complicated to automate for ASICs
– Dynamic voltage requires accurate knowledge of path delays
[Figure: energy (mW/MIPS) vs. MIPS]

Multiple supply and threshold voltages: ×4.0 to ×1.0
Basic idea: high speed where critical, low power elsewhere
Dual Vdd reduces power by ×1.7 after substrate biasing/lower Vdd [Usami JSSC’98]
– ×2 reduction in clock tree power by using low Vdd
Separate voltage islands: different speeds and Vdd [Lackey ICCAD’02]
– Turn off Vdd to modules not in use, reduces leakage by ×500
– ×1.25 to ×3 average power reduction, depending on activities
Dual Vth can give ×3 to ×6 reduction in leakage [Sirichotiyakul DAC’99]
ASICs are limited to Vdd and Vth offered by library and foundry
– Can’t change Vth to design-specific optimal point
– Standard cell libraries characterized at only two or three Vdd
– Dual Vdd requires level converters and dual Vdd layout
Floorplanning and placement: ×1.5 to ×1.1
Poor floorplanning and cell placement, inaccurate wire loads: ×1.5 worse power than custom
We compared partitioning a design into 50K vs. 200K gate modules from 0.25um to 0.13um
– 42% longer wires for 200K partitions
– Interconnect is 20% to 40% of total power [Sylvester ICCAD’98]
– ×1.1 to ×1.2 increase in total power due to wiring, and gates will be upsized to drive the longer wires
[Figure: automatic place and route vs. block partitioned] [Hauck Micro. Report ’01]

Floorplanning and placement: ×1.5 to ×1.1
Bit slices
– can reduce wire length by 70% or more vs. automated place-and-route
– up to ×1.4 energy reduction as faster and lower wiring capacitance [Chang SM Thesis MIT’98]
– ×1.5 energy reduction from bit slicing and some logic optimization [Stok, Puri, Bhattacharya, Cohn]
Manual place-and-route achieves 10% shorter wires and ×1.1 faster, about ×1.1 energy reduction [Chang SM Thesis MIT’98]
ASICs still ×1.1 higher power than custom due to layout
[Figure: automatic place-and-route, tiled bit-slices, custom]
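The ×1.1 to ×1.2 wiring figure in the partitioning comparison above follows from simple arithmetic on the two quoted inputs. The sketch below assumes that interconnect power grows in proportion to wire length and that all other power is unchanged; both are simplifying assumptions, and the function name is illustrative.

```python
def total_power_factor(interconnect_fraction: float, wire_length_increase: float) -> float:
    """Total power multiplier when only the interconnect share grows with wire length."""
    # Assumption: interconnect power scales linearly with wire length,
    # and the non-interconnect share of power stays constant.
    return 1.0 + interconnect_fraction * wire_length_increase


# Interconnect is 20% to 40% of total power; wires are 42% longer for 200K partitions.
low_factor = total_power_factor(0.20, 0.42)
high_factor = total_power_factor(0.40, 0.42)
print(f"total power increases by x{low_factor:.2f} to x{high_factor:.2f}")
```

This gives roughly ×1.08 to ×1.17, in line with the quoted ×1.1 to ×1.2 before accounting for gates being upsized to drive the longer wires.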
Process variation impact on power: ×2.6 to ×1.2
ASICs are designed to work at the worst case delay and worst case power corners for the process; typical delay and power are less
– Simulated power was ×1.7 actual power for custom DCT/IDCT
– Up to a factor of ×1.75 between worst and best (average power of 80 chip samples in 0.3um) [Takahashi JSSC’98]
[Figure: spread in measured chip power, annotated ×1.75 and ×1.5]

Process variation impact on power: ×2.6 to ×1.2
Binning would leave gap of ×1.4 between low and high bins
We found a gap of ×1.2 between low speed (high power) and high speed (low power, after derating for Vdd and frequency) bins of 0.18 and 0.13um Intel and AMD PC chips
– ASICs don’t speed bin (they scan test, no speed test)
[Figure: low power bin vs. ×1.4 higher power bin]

Process technology: ×2.6 to ×1.2
Low power libraries are more expensive
– 5% to 10% transistor width shrinks to reduce capacitances
– Copper is 40% lower resistivity than aluminum
– Low-k dielectric reduces wire capacitances; we estimate about a ×1.1 reduction in total power with a low-k dielectric
– Silicon-on-insulator is ×1.1 to ×1.3 faster, ×1.4 power reduction [Narendra Symp. VLSI 2001]
We compared cell libraries in UMC 0.13um vs. IBM 0.13um process
– IBM cells about ×1.05 faster, ×1.6 higher active power, UMC had ×17 leakage
Overall impact of process variation and technology
– ×2.6 ASIC power relative to custom for worst case conditions and a cheap process
– ×1.2 in a low power process, typical conditions, no speed binning

Low power design conclusions
Typical ASIC is ×3 to ×7 less energy efficient than custom
– We assumed ASIC and custom designs can use the same microarchitectural and logic design techniques. These are the biggest levers for reducing power.
– Can get ×10 or more going from general purpose hardware to application-specific hardware.
– E.g. Fast Fourier transform implementations as discussed in Andrew Chang’s paper.
The largest factor for the power gap is voltage scaling, responsible for up to ×4
Process and microarchitecture can be large factors, about ×2.6 each

Low power design conclusions
By incorporating custom techniques, ASICs can get within
×3 at a high performance target
– Can’t use custom logic styles
– ASIC speed penalty drags down efficiency, as higher Vdd, lower Vth, and upsized gates are needed to meet performance target
×1.5 at a lower performance target (~×2 slower)
– Make full use of scaling down Vdd and Vth

Low power ASIC design example
0.13um DSP example [Stok, Puri, Bhattacharya, Cohn]
240,000 gates implementing Hilbert transform, FIR filter, and fast Fourier transform, with 42KB register array
– Technology mapping, logic design (carry save adders), bitslicing, physical synthesis gave ×1.86 increase in efficiency
– A fine grained standard cell library gave another ×1.16
– Voltage scaling gave another factor of ×1.46
– ×3.1 increase in MHz/mW overall
The third speaker, Ruchir Puri, will discuss some of their recent low power work at IBM.
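The overall gain in the DSP example is the product of the three successive improvements. A quick check, assuming the factors compound multiplicatively (the variable names are illustrative):

```python
# Successive efficiency gains quoted for the 0.13um DSP example.
gains = {
    "technology mapping, logic design, bitslicing, physical synthesis": 1.86,
    "fine grained standard cell library": 1.16,
    "voltage scaling": 1.46,
}

# Assumption: each gain applies on top of the previous ones, so they multiply.
overall_gain = 1.0
for factor in gains.values():
    overall_gain *= factor

print(f"overall increase in MHz/mW: x{overall_gain:.2f}")
```

The product is about ×3.15, consistent with the ×3.1 overall increase in MHz/mW quoted for the example.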
Extra slides

Impact of voltage scaling on power
Ptotal = Pdynamic + Pshort circuit + Pstatic
Short circuit power when switching is 10% or less of Ptotal
Dynamic power is due to switching of capacitances
– Reducing Vdd gives quadratic reduction in Pdynamic
But transistor drive current depends on Vdd [Chen in Trans. On Electron Devices 1997]
– Must reduce Vth to maintain drive current
But reducing Vth increases subthreshold leakage current, which is the major contributor to Pstatic
(Clock frequency f; gate switching activity a; capacitance C; transistor length L; transistor gate oxide thickness Tox; temperature T; and further constants including t, Io, and m.)
[Figure: inverter driving Cload between Vdd and 0V, with thresholds Vth,p and Vth,n, showing short circuit current, dynamic power and subthreshold leakage]

ITRS leakage power trends
Further Vdd voltage scaling will be limited
– Can’t scale down Vth much further due to large subthreshold leakage currents
– Gate tunneling leakage through thin gate oxide Tox is also becoming a significant cause of leakage
Must also look to other low power techniques
[Figure: power/die area (W/cm2), 0.001 to 1000, vs. technology (0.13 to 0.022 um); total power and leakage for high speed (fast, low Vth) and low power (slow, high Vth) devices; leakage increasing]
From International Technology Roadmap for Semiconductors data for 2001-2016 (assuming activity of 0.1, ignoring interconnect).
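The quadratic dependence of dynamic power on Vdd noted in the voltage scaling slide can be sketched with the usual first-order switching-power model, P = a × C × Vdd^2 × f. The slide's own equations did not survive, so this formula is an assumption (some texts include a factor of 1/2 depending on how activity is defined), and the capacitance and frequency values are hypothetical placeholders that cancel in the ratio. The two supply voltages and the 20% overhead are the custom IDCT figures quoted earlier.

```python
def dynamic_power(activity: float, capacitance_f: float, vdd: float, frequency_hz: float) -> float:
    """First-order switching power in watts: a * C * Vdd^2 * f (assumed model)."""
    return activity * capacitance_f * vdd ** 2 * frequency_hz


# Hypothetical activity, capacitance and clock frequency; only the ratio matters here.
ACTIVITY, CAPACITANCE_F, FREQUENCY_HZ = 0.1, 1e-9, 500e6

p_unpipelined = dynamic_power(ACTIVITY, CAPACITANCE_F, 2.2, FREQUENCY_HZ)
p_pipelined = dynamic_power(ACTIVITY, CAPACITANCE_F, 1.32, FREQUENCY_HZ)

vdd_ratio = p_pipelined / p_unpipelined  # (1.32 / 2.2)^2
ratio_with_overhead = 1.2 * vdd_ratio    # add the 20% pipelining power overhead

print(f"power ratio from Vdd alone: {vdd_ratio:.3f}")
print(f"power ratio with 20% overhead: {ratio_with_overhead:.3f}")
```

Under this simple model, which ignores leakage and short circuit power, lowering Vdd from 2.2V to 1.32V cuts dynamic power to 36% of its original value, or about 43% once the pipelining overhead is included.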
Summary of factors affecting (active) power
Automated designs are higher power than custom because of …
                                                 ASIC design quality
Factor                                           typical   excellent
Microarchitecture (pipelining, parallelism)      ×2.6      ×1.3
Memory                                           ×1.4      ×1.0
Clock gating and power gating                    ×1.6      ×1.0
Logic design                                     ×1.2      ×1.0
High speed logic styles (DCVSL, PTL, domino)     ×1.3      ×1.3
Technology mapping                               ×1.4      ×1.0
Cell sizing and wire sizing                      ×1.6      ×1.1
Voltage scaling, multi-Vth, multi-Vdd            ×4.0      ×1.0
Floorplanning and placement                      ×1.5      ×1.1
Process variation and process technology         ×2.6      ×1.2

Memory: reduce cache misses, ×1.4 to ×1.0
Larger caches consume more power, but reduce cache misses
– On a miss the pipeline stalls and waits many cycles for read/write to off-chip memory
Caches with higher associativity (e.g. 8-way vs. direct mapped) consume more power; associativity also affects the likelihood of a cache miss
[Duarte ASIC/SOC 2001]
– Sub-banking: only precharge the needed section of the cache bank, ×1.32 energy savings
– Software optimizations to reduce cache misses gave on average a ×1.6 reduction in power
90% of the StrongARM area was caches; increasing the transistor length in the caches by 12% reduced leakage by ×20 [Montanaro JSSC’96]
ASICs can do this, custom memory is available for ASICs
[Figure: processor, on-chip cache, write buffer, slower off-chip memory]
Logic design: ×1.2 to ×1.0
Logic design refers to the topology and logic structure to implement functional units
– Logic switching activity of a carry select adder was ×1.8 worse than a 32-bit carry lookahead [Callaway VLSI Signal Proc.’92]
– 0.13um 64-bit radix-2 compound domino adder was slower and about ×1.3 energy compared to radix-4 [Zlatanovici ESSC’03]
– We implemented an algorithm to reduce switching activity in multipliers, reduced energy by ×1.1 for 64-bit [Ito ICCD’03]
Given similar design constraints, ASIC designers can choose the same logic design as custom, ×1.0
[Figure: carry save adder followed by a ripple carry adder; inputs x0..x3, y0..y3 and z0..z3 produce outputs (x+y+z)0 to (x+y+z)4]
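The carry save adder structure in the figure can be sketched at the bit level: a row of independent full adders reduces three operands to a sum word and a carry word with no carry propagation, and a single ripple carry adder then combines the two. This is a minimal behavioral sketch, not the authors' implementation; the function names and the default 4-bit width are illustrative assumptions.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Reduce three operands to (sum word, carry word) with no carry propagation."""
    sum_word = x ^ y ^ z                             # per-bit sum of each full adder
    carry_word = ((x & y) | (x & z) | (y & z)) << 1  # per-bit carry, weighted one bit higher
    return sum_word, carry_word


def ripple_carry_add(a: int, b: int, width: int) -> int:
    """Add two width-bit numbers one bit at a time, propagating the carry."""
    result, carry = 0, 0
    for i in range(width):
        a_bit, b_bit = (a >> i) & 1, (b >> i) & 1
        result |= (a_bit ^ b_bit ^ carry) << i
        carry = (a_bit & b_bit) | (a_bit & carry) | (b_bit & carry)
    return result | (carry << width)  # keep the final carry out as the top bit


def add_three(x: int, y: int, z: int, width: int = 4) -> int:
    """Sum three width-bit operands: carry save stage, then one ripple carry stage."""
    sum_word, carry_word = carry_save_add(x, y, z)
    # The carry word is shifted left by one, so the final adder needs width + 1 bits.
    return ripple_carry_add(sum_word, carry_word, width + 1)


print(add_three(5, 6, 7))
```

Only the final stage has a carry chain, which is why carry save adders are attractive when several operands must be summed, as in multipliers.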